Editor’s Picks
Top papers matching your research interests in multimodal LLMs, audio and vision understanding/generation.
[1] Speed by Simplicity: A Single-Stream Architecture for Fast Audio-Video Generative Foundation Model
SII-GAIR, Sand.ai: Ethan Chern, Hansi Teng, Hanwen Sun, Hao Wang, Hong Pan, Hongyu Jia, Jiadi Su, Jin Li, Junjie Yu, Lijie Liu, Lingzhi Li, Lyumanshan Ye, Min Hu, Qiangang Wang, Quanwei Qi, Steffi Chern, Tao Bu, Taoran Wang, Teren Xu, Tianning Zhang, Tiantian Mi, Weixian Xu, Wenqiang Zhang, Wentai Zhang, Xianping Yi, Xiaojie Cai, Xiaoyang Kang, Yan Ma, Yixiu Liu, Yunbo Zhang, Yunpeng Huang, Yutong Lin, Zewei Tao, Zhaoliang Liu, Zheng Zhang, Zhiyao Cen, Zhixuan Yu, Zhongshu Wang, Zhulin Hu, Zijin Zhou, Zinan Guo, Yue Cao, Pengfei Liu
Main category: cs.CV
TL;DR: daVinci-MagiHuman is an open-source audio-video generative foundation model that jointly generates synchronized video and audio using a single-stream Transformer architecture, excelling in human-centric scenarios with multilingual support and efficient inference.
Details
Motivation: To create a unified audio-video generative model that avoids complex multi-stream or cross-attention architectures while maintaining strong performance in human-centric generation tasks, with efficient inference capabilities.
Method: Uses a single-stream Transformer that processes text, video, and audio within a unified token sequence via self-attention only. Combines this with model distillation, latent-space super-resolution, and a Turbo VAE decoder for efficient inference.
Result: Achieves highest visual quality and text alignment among leading open models, with lowest word error rate (14.60%) for speech intelligibility. Wins 80.0% against Ovi 1.1 and 60.9% against LTX 2.3 in human evaluations. Generates 5-second 256p video in 2 seconds on single H100 GPU.
Conclusion: daVinci-MagiHuman demonstrates that a simple single-stream Transformer architecture can effectively generate synchronized audio-video content with strong human-centric performance, multilingual capabilities, and efficient inference.
Abstract: We present daVinci-MagiHuman, an open-source audio-video generative foundation model for human-centric generation. daVinci-MagiHuman jointly generates synchronized video and audio using a single-stream Transformer that processes text, video, and audio within a unified token sequence via self-attention only. This single-stream design avoids the complexity of multi-stream or cross-attention architectures while remaining easy to optimize with standard training and inference infrastructure. The model is particularly strong in human-centric scenarios, producing expressive facial performance, natural speech-expression coordination, realistic body motion, and precise audio-video synchronization. It supports multilingual spoken generation across Chinese (Mandarin and Cantonese), English, Japanese, Korean, German, and French. For efficient inference, we combine the single-stream backbone with model distillation, latent-space super-resolution, and a Turbo VAE decoder, enabling generation of a 5-second 256p video in 2 seconds on a single H100 GPU. In automatic evaluation, daVinci-MagiHuman achieves the highest visual quality and text alignment among leading open models, along with the lowest word error rate (14.60%) for speech intelligibility. In pairwise human evaluation, it achieves win rates of 80.0% against Ovi 1.1 and 60.9% against LTX 2.3 over 2000 comparisons. We open-source the complete model stack, including the base model, the distilled model, the super-resolution model, and the inference codebase.
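The core single-stream idea — text, video, and audio packed into one token sequence and processed by plain self-attention, with no per-modality branches — can be illustrated with a toy scalar model. Everything below (scalar "embeddings", the `self_attention` helper, the example values) is our own illustrative sketch, not the paper's implementation:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def self_attention(seq):
    # Toy single-head attention over scalar "embeddings": every token
    # attends to every other token, regardless of which modality it came from.
    out = []
    for q in seq:
        weights = softmax([q * k for k in seq])
        out.append(sum(w * v for w, v in zip(weights, seq)))
    return out

# Pack all three modalities into ONE sequence (the single-stream design):
text, video, audio = [0.1, 0.2], [0.5, 0.4, 0.3], [0.9]
packed = text + video + audio
mixed = self_attention(packed)
```

Because there is only one sequence, audio tokens attend to video tokens directly; a multi-stream design would need separate cross-attention modules to achieve the same mixing.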
Relevance: 10/10
[2] OmniCodec: Low Frame Rate Universal Audio Codec with Semantic-Acoustic Disentanglement
Jingbin Hu, Haoyu Zhang, Dake Guo, Qirui Zhan, Wenhao Li, Huakang Chen, Guobin Ma, Hanke Xie, Chengyou Wang, Pengyuan Xie, Chuan Xie, Qiang Zhang, Lei Xie
Main category: eess.AS
TL;DR: OmniCodec is a universal neural audio codec designed for low frame rate across diverse audio domains (speech, music, general sound) using hierarchical multi-codebook design with semantic-acoustic decoupling and self-guidance strategy.
Details
Motivation: Existing neural codecs focus primarily on speech and emphasize reconstruction fidelity, overlooking unified low frame rate modeling across diverse audio domains. High reconstruction quality doesn't necessarily yield semantically informative representations, limiting effectiveness in downstream generation tasks.
Method: Proposes OmniCodec with hierarchical multi-codebook design featuring semantic-acoustic decoupling by leveraging pre-trained audio encoder from understanding models, plus self-guidance strategy to improve codebook utilization and reconstruction.
Result: Outperforms Mimi codec at same bitrate, delivering superior reconstruction quality while providing more semantically informative representations that benefit downstream generation tasks.
Conclusion: OmniCodec offers a universal neural audio codec solution for low frame rate across diverse audio domains with improved semantic representation quality for generation tasks.
Abstract: Large Language Models (LLMs) have advanced audio generation through discrete representation learning. However, most existing neural codecs focus on speech and emphasize reconstruction fidelity, overlooking unified low frame rate modeling across diverse audio domains, including speech, music, and general sound. Moreover, high reconstruction quality does not necessarily yield semantically informative representations, limiting effectiveness in downstream generation tasks. We propose OmniCodec, a universal neural audio codec tailored for low frame rate. It adopts a hierarchical multi-codebook design with semantic-acoustic decoupling by leveraging the audio encoder of the pre-trained understanding model, along with a self-guidance strategy to improve codebook utilization and reconstruction. Compared with the Mimi codec, experiments show that OmniCodec achieves outstanding performance at the same bitrate, delivering superior reconstruction quality while also providing more semantically informative representations that benefit downstream generation tasks. Our model and code will be open-sourced. Our demo page is available.
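The hierarchical semantic-acoustic split can be sketched as a toy 1-D residual quantizer: a first codebook captures the semantic feature, and later codebooks refine the acoustic residual. All codebooks and values here are made up for illustration; the real OmniCodec quantizes learned vector representations:

```python
def nearest(codebook, x):
    # Pick the code whose value is closest to x (toy scalar VQ).
    return min(codebook, key=lambda c: abs(c - x))

def hierarchical_quantize(semantic, acoustic, sem_book, ac_books):
    """Toy 1-D residual quantization: the first (semantic) codebook encodes
    the semantic feature; later (acoustic) codebooks refine the residual."""
    codes = [nearest(sem_book, semantic)]
    residual = acoustic - codes[0]
    for book in ac_books:
        c = nearest(book, residual)
        codes.append(c)
        residual -= c
    return codes, sum(codes)  # discrete codes + reconstruction

codes, recon = hierarchical_quantize(
    semantic=0.7, acoustic=0.9,
    sem_book=[0.0, 0.5, 1.0],
    ac_books=[[-0.4, 0.0, 0.4], [-0.1, 0.0, 0.1]],
)
```

The first code alone carries the coarse, semantically meaningful content, which is what makes the representation useful to downstream generators even before the acoustic detail is added back.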
Relevance: 9/10
[3] AcoustEmo: Open-Vocabulary Emotion Reasoning via Utterance-Aware Acoustic Q-Former
Liyun Zhang, Xuanmeng Sha, Shuqiong Wu, Fengkai Liu
Main category: cs.MM
TL;DR: AcoustEmo: A time-sensitive multimodal LLM with Utterance-Aware Acoustic Q-Former for fine-grained emotion recognition by capturing local temporal acoustic dynamics instead of global audio representations.
Details
Motivation: Current MLLMs for emotion recognition use global audio encoders that fail to capture subtle local temporal dynamics like micro-prosody and intonation shifts within utterances, limiting fine-grained acoustic modeling.
Method: Proposes AcoustEmo with a novel Utterance-Aware Acoustic Q-Former that uses timestamp-synchronized sliding windows to dynamically extract segment-level audio tokens, enabling explicit tracing of temporal evolution of acoustic clues and capturing deep contextual dependencies in dialogues.
Result: Experiments on Explainable Multimodal Emotion Recognition (EMER) show AcoustEmo significantly enhances complex emotion reasoning, outperforming baselines while maintaining robust contextual accuracy.
Conclusion: The proposed time-sensitive approach with fine-grained acoustic modeling improves emotion recognition in multimodal LLMs by better capturing local temporal dynamics.
Abstract: Multimodal Large Language Models (MLLMs) excel in Open-Vocabulary (OV) emotion recognition but often neglect fine-grained acoustic modeling. Existing methods typically use global audio encoders, failing to capture subtle, local temporal dynamics like micro-prosody and intonation shifts within individual utterances. To address this, we propose AcoustEmo, a time-sensitive MLLM featuring a novel Utterance-Aware Acoustic Q-Former. Our approach utilizes a timestamp-synchronized sliding window to dynamically extract segment-level audio tokens instead of coarse global representations. This enables the model to explicitly trace the temporal evolution of subtle acoustic clues and capture deep contextual dependencies in dialogues. Experiments on the Explainable Multimodal Emotion Recognition (EMER) task show that AcoustEmo significantly enhances complex emotion reasoning, outperforming baselines while maintaining robust contextual accuracy.
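The timestamp-synchronized windowing can be sketched as follows. This is a hypothetical helper with made-up window sizes; the paper's Q-Former operates on encoder features, not raw timestamps:

```python
def segment_windows(start, end, win=2.0, hop=1.0):
    """Timestamp-synchronized sliding windows over one utterance span
    (in seconds). Each window would yield its own segment-level audio
    tokens, instead of one global representation for the whole clip."""
    t, windows = start, []
    while t < end:
        windows.append((t, min(t + win, end)))
        t += hop
    return windows

# A 4-second utterance spanning t=3s to t=7s:
wins = segment_windows(3.0, 7.0)
```

Overlapping windows are what let the model trace how prosody evolves *within* the utterance, rather than averaging it away.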
Relevance: 9/10
Today’s Research Highlights
AI-enhanced summaries of the latest research papers from arXiv.
Table of Contents
- cs.CL [Total: 177]
- cs.CV [Total: 404]
- cs.AI [Total: 198]
- cs.SD [Total: 14]
- cs.LG [Total: 294]
- cs.MA [Total: 11]
- cs.MM [Total: 4]
- eess.AS [Total: 10]
- eess.IV [Total: 12]
cs.CL
[1] Enhancing Safety of Large Language Models via Embedding Space Separation
Xu Zhao, Xiting Wang, Weiran Shen
Main category: cs.CL
TL;DR: ES2 improves LLM safety by fine-tuning to separate harmful and safe embeddings in representation space while preserving general capabilities via KL regularization.
Details
Motivation: LLMs have safety vulnerabilities where harmful and safe queries show linear separability in embedding space, which attackers exploit by perturbing harmful embeddings toward safe subspace. Need to improve safety without degrading general capabilities.
Method: Embedding Space Separation (ES2) - representation-level fine-tuning that explicitly enlarges distance between harmful and safe representations in embedding space. Uses KL divergence regularization to constrain fine-tuned model’s logits to align with original base model on harmless inputs.
Result: Extensive experiments on open-source LLMs using standard safety benchmarks show substantial safety improvement while maintaining comparable general capabilities.
Conclusion: ES2 effectively improves LLM safety by separating harmful/safe representations in embedding space with minimal impact on general capabilities through KL regularization.
Abstract: Large language models (LLMs) have achieved impressive capabilities, yet ensuring their safety against harmful prompts remains a critical challenge. Recent work has revealed that the latent representations (embeddings) of harmful and safe queries in LLMs typically exhibit linear separability, a property that has been exploited to construct attacks by perturbing the embeddings of harmful queries towards the safe subspace. Motivated by this observation, we propose a representation-level fine-tuning approach, named Embedding Space Separation (ES2), which improves LLM safety by explicitly enlarging the distance between harmful and safe representations in the embedding space. To prevent degradation of model’s general capabilities, we introduce a Kullback-Leibler (KL) divergence regularization term into the loss function, which constrains the logits of the fine-tuned model to align with those of the original base model on harmless inputs. We evaluate our method on several open-source LLMs using standard safety benchmarks. Extensive experimental results demonstrate that our approach substantially improves model safety while maintaining comparable general capabilities.
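The two-term objective can be sketched numerically. The exact form of the separation term and the weighting are our own simplification of what the abstract describes, not the paper's loss:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def es2_loss(harm_emb, safe_emb, tuned_logits, base_logits, lam=0.1):
    # Separation term: minimizing it ENLARGES the harmful/safe distance.
    sep = -sum((h - s) ** 2 for h, s in zip(harm_emb, safe_emb))
    # KL term on a harmless input: keeps tuned logits near the base model's.
    p, q = softmax(tuned_logits), softmax(base_logits)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return sep + lam * kl

# Identical logits on a harmless input -> KL = 0; only separation remains.
loss = es2_loss([1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [2.0, 1.0])
```

The KL term is what prevents the separation pressure from degrading behavior on harmless inputs: any drift of the tuned logits away from the base model is penalized.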
[2] RedacBench: Can AI Erase Your Secrets?
Hyunjun Jeon, Kyuyoung Kim, Jinwoo Shin
Main category: cs.CL
TL;DR: RedacBench: A comprehensive benchmark for evaluating policy-conditioned redaction across domains, measuring models’ ability to selectively remove sensitive information while preserving semantics.
Details
Motivation: Existing redaction benchmarks focus on predefined categories like PII or evaluate specific techniques, lacking comprehensive evaluation of policy-conditioned redaction across domains and strategies.
Method: Created RedacBench from 514 human-authored texts spanning individual, corporate, and government sources, paired with 187 security policies. Used 8,053 annotated propositions to measure both security (removal of sensitive information) and utility (preservation of non-sensitive information).
Result: Experiments show that while more advanced models can improve security, preserving utility remains challenging. The benchmark enables systematic evaluation of redaction strategies across different model capabilities.
Conclusion: RedacBench provides a comprehensive framework for evaluating policy-conditioned redaction, addressing limitations of existing benchmarks and facilitating future research in data security through selective information removal.
Abstract: Modern language models can readily extract sensitive information from unstructured text, making redaction – the selective removal of such information – critical for data security. However, existing benchmarks for redaction typically focus on predefined categories of data such as personally identifiable information (PII) or evaluate specific techniques like masking. To address this limitation, we introduce RedacBench, a comprehensive benchmark for evaluating policy-conditioned redaction across domains and strategies. Constructed from 514 human-authored texts spanning individual, corporate, and government sources, paired with 187 security policies, RedacBench measures a model’s ability to selectively remove policy-violating information while preserving the original semantics. We quantify performance using 8,053 annotated propositions that capture all inferable information in each text. This enables assessment of both security – the removal of sensitive propositions – and utility – the preservation of non-sensitive propositions. Experiments across multiple redaction strategies and state-of-the-art language models show that while more advanced models can improve security, preserving utility remains a challenge. To facilitate future research, we release RedacBench along with a web-based playground for dataset customization and evaluation. Available at https://hyunjunian.github.io/redaction-playground/.
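The proposition-based security/utility scoring can be sketched directly from the definitions in the abstract; the exact normalization is our assumption:

```python
def score_redaction(propositions, surviving):
    """Security: fraction of sensitive propositions removed.
    Utility: fraction of non-sensitive propositions preserved.
    `propositions` maps proposition id -> is_sensitive; `surviving` is the
    set of proposition ids still inferable from the redacted text."""
    sensitive = {p for p, is_sens in propositions.items() if is_sens}
    benign = set(propositions) - sensitive
    security = 1 - len(sensitive & surviving) / len(sensitive) if sensitive else 1.0
    utility = len(benign & surviving) / len(benign) if benign else 1.0
    return security, utility

# One sensitive proposition ("ssn") leaked; all benign ones preserved.
sec, uti = score_redaction(
    {"salary": True, "ssn": True, "city": False, "hobby": False},
    surviving={"ssn", "city", "hobby"},
)
```

The tension the paper reports falls out of this framing: aggressive redaction drives security toward 1 but tends to delete benign propositions too, dragging utility down.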
[3] Children’s Intelligence Tests Pose Challenges for MLLMs? KidGym: A 2D Grid-Based Reasoning Benchmark for MLLMs
Hengwei Ye, Yuanting Guan, Yuxuan Ge, Tianying Zhu, Zhenhan Guan, Yijia Zhong, Yijing Zhang, Han Zhang, Yingna Wu, Zheng Tian
Main category: cs.CL
TL;DR: KidGym is a comprehensive 2D grid-based benchmark inspired by children’s cognitive development tests to evaluate five essential capabilities of Multimodal Large Language Models: Execution, Perception Reasoning, Learning, Memory, and Planning.
Details
Motivation: Current MLLM evaluation lacks comprehensive assessment of core cognitive capabilities. The authors take inspiration from established children's intelligence tests (Wechsler Intelligence Scales) to create a benchmark that evaluates MLLMs' adaptability and developmental potential, mirroring human cognitive growth stages.
Method: Developed KidGym - a 2D grid-based benchmark with 12 unique tasks targeting five core capabilities. Features diverse scenarios with randomly generated layouts for robust evaluation. The benchmark is fully customizable and extensible, allowing researchers to create new scenarios and adjust difficulty levels.
Result: Evaluation of state-of-the-art MLLMs using KidGym revealed significant insights into model capabilities and identified several limitations of current models. The benchmark provides a more accurate assessment of MLLMs’ cognitive abilities compared to existing evaluations.
Conclusion: KidGym offers a comprehensive, customizable benchmark for evaluating MLLM capabilities, filling a gap in current evaluation methodologies. It enables systematic assessment of MLLMs’ cognitive development potential and reveals important limitations in current models.
Abstract: Multimodal Large Language Models (MLLMs) combine the linguistic strengths of LLMs with the ability to process multimodal data, enabling them to address a broader range of visual tasks. Because MLLMs aim at more general, human-like competence than language-only models, we take inspiration from the Wechsler Intelligence Scales - an established battery for evaluating children by decomposing intelligence into interpretable, testable abilities. We introduce KidGym, a comprehensive 2D grid-based benchmark for assessing five essential capabilities of MLLMs: Execution, Perception Reasoning, Learning, Memory and Planning. The benchmark comprises 12 unique tasks, each targeting at least one core capability, specifically designed to gauge MLLMs’ adaptability and developmental potential, mirroring the stages of children’s cognitive growth. Additionally, our tasks encompass diverse scenarios and objects with randomly generated layouts, ensuring a more accurate and robust evaluation of MLLM capabilities. KidGym is designed to be fully user-customizable and extensible, allowing researchers to create new evaluation scenarios and adjust difficulty levels to accommodate the rapidly growing MLLM community. Through the evaluation of state-of-the-art MLLMs using KidGym, we identified significant insights into model capabilities and revealed several limitations of current models. We release our benchmark at: https://kidgym.github.io/KidGym-Website/.
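The randomly generated layouts mentioned above might look like the following toy generator. The function name and object set are hypothetical; KidGym's actual task generators are not public in this summary:

```python
import random

def random_layout(width, height, objects, seed=None):
    """Toy sketch of a randomized 2D grid layout: each object lands on a
    distinct cell, so every episode presents a fresh configuration."""
    rng = random.Random(seed)
    cells = rng.sample([(x, y) for x in range(width) for y in range(height)],
                       len(objects))
    return dict(zip(objects, cells))

layout = random_layout(5, 5, ["agent", "key", "door"], seed=0)
```

Randomizing layouts per episode is what keeps the evaluation robust: a model cannot memorize fixed scenes and must actually perceive and plan on each new grid.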
[4] CRoCoDiL: Continuous and Robust Conditioned Diffusion for Language
Roy Uziel, Omer Belhasin, Itay Levi, Akhiad Bercovich, Ran El-Yaniv, Ran Zilberstein, Michael Elad
Main category: cs.CL
TL;DR: CRoCoDiL introduces a continuous sentence-level semantic space for masked diffusion models to improve token dependencies and semantic coherence in text generation, with two novel unconditional synthesis algorithms achieving 10x faster sampling.
Details
Motivation: Masked Diffusion Models (MDMs) offer efficient non-causal text generation but suffer from token dependency issues and semantic incoherence due to their reliance on discrete marginal distributions, limiting their effectiveness compared to autoregressive methods.
Method: Proposes CRoCoDiL, a unified fine-tuning approach that shifts diffusion to continuous sentence-level semantic space using an encoder-demasker architecture. Introduces two unconditional synthesis algorithms: Continuous-Then-Discrete (ConThenDisc) for hybrid diffusion, and Continuous-Within-Discrete (ConWithinDisc) for multi-diffusion refinement.
Result: Experiments with LLaDA show superior generation quality and more than 10x faster sampling speeds in unconditional settings compared to baseline methods.
Conclusion: Moving diffusion to continuous semantic space significantly improves MDM performance for text generation, offering both quality gains and substantial speed improvements through novel continuous-discrete hybrid approaches.
Abstract: Masked Diffusion Models (MDMs) provide an efficient non-causal alternative to autoregressive generation but often struggle with token dependencies and semantic incoherence due to their reliance on discrete marginal distributions. We address these limitations by shifting the diffusion process into a continuous sentence-level semantic space. We propose CRoCoDiL (Continuous and Robust Conditioned Diffusion for Language), a unified fine-tuning approach that jointly trains an encoder-demasker architecture, grounding the MDM demasking in continuous latent representations. This leads to the formation of a novel autoencoder in which decoding is obtained by an MDM algorithm. Relying on the same framework, we introduce two unconditional text synthesis algorithms: Continuous-Then-Discrete (ConThenDisc), a hybrid-diffusion approach that first generates latent representations in continuous space and then decodes these to tokens via an MDM, and Continuous-Within-Discrete (ConWithinDisc), a multi-diffusion strategy that refines latent representations throughout the discrete sampling process. Experiments using LLaDA show that our methods achieve superior generation quality and more than 10x faster sampling speeds in an unconditional setting.
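The ConThenDisc pipeline reduces to a two-stage composition: continuous denoising of a sentence-level latent, then discrete demasking conditioned on the result. The stand-in `denoise_step` and `mdm_decode` below are purely illustrative placeholders for the paper's trained components:

```python
def con_then_disc(denoise_step, mdm_decode, z_noise, steps=4):
    """Toy ConThenDisc sketch: denoise a continuous sentence-level latent
    first, then hand the clean latent to an MDM that demasks the tokens."""
    z = z_noise
    for _ in range(steps):
        z = denoise_step(z)          # continuous-space diffusion
    return mdm_decode(z)             # discrete demasking, conditioned on z

# Stand-in components (illustrative only):
tokens = con_then_disc(
    denoise_step=lambda z: z / 2.0,
    mdm_decode=lambda z: ["tok"] * max(1, int(z)),
    z_noise=16.0,
)
```

Doing the expensive exploratory work in the cheap continuous stage, and only a short discrete decode at the end, is one plausible reading of where the reported 10x sampling speedup comes from.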
[5] Fast-Slow Thinking RM: Efficient Integration of Scalar and Generative Reward Models
Jiayun Wu, Peixu Hou, Shan Qu, Peng Zhang, Ning Gu, Tun Lu
Main category: cs.CL
TL;DR: F/S-RM: A hybrid reward model architecture combining fast scalar scoring with slow chain-of-thought reasoning, achieving better performance with lower computational cost.
Details
Motivation: Current reward models for RLHF face a trade-off: Generative Reward Models (GRMs) with chain-of-thought reasoning are accurate but computationally expensive, while Scalar Reward Models (SRMs) are efficient but less performant and adaptable in complex scenarios.
Method: Introduces Fast-Slow Thinking Reward Models (F/S-RM) inspired by Dual Process Theory. Trains a single model to integrate two reward paradigms: fast thinking (first-token scalar prediction) and slow thinking (CoT-based judgment), regulated by a dual-confidence activation mechanism that determines when to activate slow thinking.
Result: Achieves 1.2% relative performance improvement over state-of-the-art models while reducing token consumption by 20.8%.
Conclusion: F/S-RM provides an effective hybrid approach that balances accuracy and efficiency in reward modeling for RLHF, offering better performance with reduced computational cost.
Abstract: Reward models (RMs) are critical for aligning Large Language Models via Reinforcement Learning from Human Feedback (RLHF). While Generative Reward Models (GRMs) achieve superior accuracy through chain-of-thought (CoT) reasoning, they incur substantial computational costs. Conversely, Scalar Reward Models (SRMs) offer efficiency but suffer from limited performance and adaptability in complex scenarios. We introduce Fast-Slow Thinking Reward Models (F/S-RM), a hybrid RM architecture inspired by Dual Process Theory. It trains a single model to integrate two distinct reward paradigms: first-token prediction as a scalar score (fast thinking) and CoT-based judgment (slow thinking), regulated by a dual-confidence activation mechanism that determines when to activate slow thinking. F/S-RM achieves a 1.2% relative performance improvement over state-of-the-art models while reducing token consumption by 20.8%. Code and data will be publicly available.
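The activation logic can be sketched as a simple gate. Note the simplification: the paper's dual-confidence mechanism is collapsed here to a single hypothetical threshold:

```python
def fs_reward(fast_score, fast_conf, slow_judge, threshold=0.8):
    """Toy fast/slow gate: trust the cheap first-token scalar when its
    confidence clears the threshold; otherwise pay for CoT judgment."""
    if fast_conf >= threshold:
        return fast_score, "fast"
    return slow_judge(), "slow"

confident = fs_reward(0.9, 0.95, slow_judge=lambda: 0.7)
uncertain = fs_reward(0.9, 0.40, slow_judge=lambda: 0.7)
```

The token savings come from the gate: CoT tokens are only spent on the minority of prompts where the fast path is unsure.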
[6] Multi-Agent Debate with Memory Masking
Hongduan Tian, Xiao Feng, Ziyuan Zhao, Xiangyu Zhu, Rolan Yan, Bo Han
Main category: cs.CL
TL;DR: MAD-M² improves multi-agent debate reasoning by masking erroneous memories from previous debate rounds to enhance robustness and performance.
Details
Motivation: Multi-agent debate (MAD) improves LLM reasoning but is vulnerable to erroneous memories from previous debate rounds, which can degrade performance. The authors observed that MAD's effectiveness depends heavily on memory quality.
Method: Proposes MAD-M² (multi-agent debate with memory masking), where LLM agents can mask erroneous memories from previous debate rounds at the beginning of each round, preserving informative memories while discarding erroneous ones.
Result: Extensive experiments on mathematical and logical reasoning benchmarks show MAD-M² can identify erroneous memories and achieve better reasoning performance than standard MAD.
Conclusion: Memory masking effectively improves the robustness of multi-agent debate frameworks by filtering out erroneous memories, leading to enhanced reasoning capabilities in LLMs.
Abstract: Large language models (LLMs) have recently demonstrated impressive capabilities in reasoning tasks. Currently, mainstream LLM reasoning frameworks predominantly focus on scaling up inference-time sampling to enhance performance. In particular, among all LLM reasoning frameworks, multi-agent debate (MAD), which employs multiple LLMs as agents to perform reasoning in the way of multi-round debate, has emerged as a powerful reasoning paradigm since it allows agents to access previous memories to alleviate fallacious content and refine their reasoning iteratively in each debate round. However, although MAD significantly improves the reasoning capabilities of LLMs, in this paper, we observe that there remain erroneous memories, and LLM agents are vulnerable to these erroneous memories. To explore this phenomenon, we provide a theoretical insight that the performance of MAD is highly dependent on the quality of memories derived from the previous debate, indicating that the existence of erroneous memories poses a threat to the performance of MAD. To address this problem, we introduce a simple yet effective multi-agent debate framework, multi-agent debate with memory masking (MAD-M$^2$), to improve the robustness of MAD by allowing LLM agents to mask erroneous memories from the previous debate round at the beginning of each debate round. In this way, MAD-M$^2$ can polish the contextual information before each debate round by preserving informative and meaningful memories while discarding the erroneous memories. Extensive experiments and analyses on mainstream mathematical and logical reasoning benchmarks demonstrate that MAD-M$^2$ can identify the erroneous memories and achieve better performance in reasoning than MAD.
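The debate-with-masking loop can be sketched as below. The stand-in agents and the external `judge_erroneous` predicate are our simplification; in the paper the agents themselves decide which memories to mask:

```python
def mad_m2(agents, judge_erroneous, rounds=2):
    """Toy MAD-M^2 loop: each round begins by masking memories (previous
    answers) flagged as erroneous, then every agent answers again given
    the cleaned context."""
    memories = []
    for _ in range(rounds):
        # Mask erroneous memories before the round begins.
        context = [m for m in memories if not judge_erroneous(m)]
        memories = [agent(context) for agent in agents]
    return [m for m in memories if not judge_erroneous(m)]

answers = mad_m2(
    agents=[lambda ctx: "good-1", lambda ctx: "bad", lambda ctx: "good-2"],
    judge_erroneous=lambda m: m.startswith("bad"),
)
```

Compared with vanilla MAD, the only change is the filter at the top of each round: informative memories still propagate, but an erroneous answer can no longer poison the next round's context.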
[7] Locally Coherent Parallel Decoding in Diffusion Language Models
Michael Hersche, Nicolas Menet, Ronan Tanios, Abbas Rahimi
Main category: cs.CL
TL;DR: CoDiLA combines diffusion language models with local autoregression for coherent parallel token generation in code tasks.
Details
Motivation: Diffusion language models offer sub-linear generation latency but struggle with joint dependencies when predicting multiple tokens in parallel, leading to syntactic inconsistencies in code generation.
Method: CoDiLA uses a small auxiliary autoregressive model (0.6B parameters) operating on diffusion latents to handle local decoding within blocks, while maintaining parallel block generation and bidirectional modeling across blocks.
Result: Eliminates coherence artifacts and establishes a new Pareto frontier for accuracy and speed on code generation benchmarks.
Conclusion: A hybrid approach combining diffusion models with local autoregression effectively reconciles parallel sampling with dependency modeling for improved code generation.
Abstract: Diffusion language models (DLMs) have emerged as a promising alternative to autoregressive (AR) models, offering sub-linear generation latency and bidirectional capabilities that are particularly appealing for code generation and editing. Achieving sub-linear latency in discrete DLMs requires predicting multiple tokens in parallel. However, standard DLMs sample tokens independently from conditional marginal distributions, failing to capture the joint dependencies among concurrently generated tokens. As a result, they often lead to syntactic inconsistencies and break multi-token structures. In this work, we introduce CoDiLA (Coherent Diffusion with Local Autoregression), a method that reconciles parallel sampling with local dependency modeling. Rather than forcing the DLM to resolve fine-grained syntax, CoDiLA delegates local decoding to a small, auxiliary AR model operating on the diffusion latents. This design allows for parallel block generation while ensuring sequential validity within each block and maintaining core DLM capabilities, including bidirectional modeling across blocks. We demonstrate that using a highly compact auxiliary AR model (e.g., 0.6B parameters) effectively eliminates coherence artifacts, establishing a new Pareto frontier for accuracy and speed in code generation benchmarks.
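The division of labor — blocks in parallel, tokens sequential within a block — can be sketched as follows. The stand-in `local_ar` head is purely illustrative:

```python
def codila_decode(latent_blocks, local_ar):
    """Toy sketch: block latents come from the parallel diffusion pass; a
    small AR head then decodes tokens within each block left-to-right, so
    every token conditions on its block-local prefix."""
    decoded = []
    for latents in latent_blocks:        # blocks are independent (parallel)
        block, prev = [], None
        for z in latents:                # sequential *within* the block
            prev = local_ar(z, prev)
            block.append(prev)
        decoded.append(block)
    return decoded

# Stand-in AR head: next token = latent + previous token (illustrative).
out = codila_decode([[1, 2], [3]],
                    local_ar=lambda z, prev: z + (0 if prev is None else prev))
```

The inner loop is what restores sequential validity: unlike independent marginal sampling, each token sees the tokens already emitted in its block, which is where the coherence artifacts were arising.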
[8] Expected Reward Prediction, with Applications to Model Routing
Kenan Hasanaliyev, Silas Alberti, Jenny Hamer, Dheeraj Rajagopal, Kevin Robinson, Jasper Snoek, Victor Veitch, Alexander Nicholas D’Amour
Main category: cs.CL
TL;DR: Expected Reward Prediction (ERP) enables routing prompts to LLMs based on predicted reward scores before response generation, optimizing computational cost while maximizing reward.
Details
Motivation: Current reward models only score responses after generation, but predicting expected rewards before generation could enable smarter model routing to optimize computational resources while maintaining quality.
Method: Develop Expected Reward Prediction (ERP) that estimates the reward an LLM would earn under repeated sampling, then use this for model routing protocols that select which model should handle each prompt at inference time.
Result: ERP routing outperforms baselines that route based on average category performance, demonstrates precision and discriminative power, and works effectively with model pools including Llama3.1-Instruct and Gemma models.
Conclusion: Expected reward prediction enables effective model routing to maximize reward while controlling computational cost, with trivial extensibility to new models and explaining the success of more complex routing protocols.
Abstract: Reward models are a standard tool to score responses from LLMs. Reward models are built to rank responses to a fixed prompt sampled from a single model, for example to choose the best of n sampled responses. In this paper, we study whether scores from response-level reward models can be lifted to score a model’s suitability for a prompt, prior to seeing responses from that model. Specifically, we show that it is straightforward to predict the expected reward that an LLM would earn from the reward model under repeated sampling. Further, we show that these expected reward predictions are precise and discriminative enough to support an application to a model routing protocol that routes prompts to models at inference time to maximize reward while controlling computational cost. We demonstrate the performance of this routing procedure on the open-perfectblend dataset, using a model pool composed of Llama3.1-Instruct 8B/70B, Gemma2-IT 9B/27B, and Gemma1-IT 7B models. Our simple expected reward prediction–based routing (ERP) outperforms baselines that route prompts to models with the best average performance within each prompt’s category, and explains the success of more complex routing protocols that implicitly estimate an expected reward. Our approach has the added advantage of being trivially extensible as new models are added to the pool.
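The routing idea can be sketched as expected reward minus a cost penalty. The linear cost model, the `lam` weight, and the model names below are our assumptions, not the paper's protocol:

```python
def expected_reward(sample_rewards):
    """Monte-Carlo estimate of the reward a model would earn for a prompt
    under repeated sampling."""
    return sum(sample_rewards) / len(sample_rewards)

def route(predicted_reward, cost, lam=0.01):
    """Toy router: send the prompt to the model maximizing predicted
    expected reward minus a cost penalty."""
    return max(predicted_reward,
               key=lambda m: predicted_reward[m] - lam * cost[m])

preds = {"small-8b": 0.60, "large-70b": 0.70}  # hypothetical ERP outputs
costs = {"small-8b": 1.0, "large-70b": 10.0}   # hypothetical inference cost
cheap_pick = route(preds, costs, lam=0.02)     # penalty flips the choice
best_pick = route(preds, costs, lam=0.001)
```

Varying `lam` traces out the reward/compute trade-off: a high penalty keeps prompts on the small model unless the predicted reward gap justifies the larger one.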
[9] An experimental study of KV cache reuse strategies in chunk-level caching systems
Samuel Cestola, Tianxiang Xia, Zheng Weiyan, Zheng Pengfei, Diego Didona
Main category: cs.CL
TL;DR: The paper analyzes limitations of chunk-level caching (CLC) for retrieval-augmented generation and proposes a new design combining complementary techniques to improve accuracy.
Details
Motivation: Retrieval-augmented generation improves LLM accuracy but chunk-level caching misses cross-attention dependencies between chunks, reducing output quality. Existing CLC approaches have fundamental limitations in accuracy or applicability.
Method: Conducts extensive experimental evaluation of CLC systems, identifies that existing CLC techniques are complementary, and proposes a new CLC design that carefully combines these techniques.
Result: Shows existing CLC approaches have fundamental limitations, and the proposed combined CLC design achieves better accuracy than individual techniques.
Conclusion: By recognizing the complementary nature of existing CLC techniques and combining them thoughtfully, it’s possible to overcome fundamental limitations and achieve improved accuracy in retrieval-augmented generation systems.
Abstract: Retrieval-augmented generation improves large language models’ accuracy by adding relevant retrieved text to the prompt. Chunk level caching (CLC) accelerates inference by precomputing KV caches for these retrieved chunks and reusing them. However, these caches miss cross-attention dependencies between chunks, which can reduce output quality. Several methods try to improve CLC accuracy using different techniques. We make two main contributions. First, we show that existing CLC approaches have fundamental limitations that limit their accuracy or their applicability. We back this conclusion with an extensive CLC system experimental evaluation. Second, we observe that existing CLC techniques are complementary. We leverage this insight to propose a new CLC design that carefully combines them and achieves better accuracy.
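The basic CLC mechanism — precompute a chunk's KV state once, reuse it on every prompt that retrieves that chunk — can be sketched with a content-hash cache. The class and the stand-in `compute_kv` are illustrative, not any evaluated system's code:

```python
import hashlib

class ChunkKVCache:
    """Toy chunk-level cache: the (stand-in) KV state for each retrieved
    chunk is computed once and reused by content hash. The caveat the paper
    studies applies here too: a cached chunk's state ignores cross-attention
    to the other chunks it is later combined with."""
    def __init__(self, compute_kv):
        self.compute_kv = compute_kv
        self.store, self.hits = {}, 0

    def get(self, chunk_text):
        key = hashlib.sha256(chunk_text.encode()).hexdigest()
        if key in self.store:
            self.hits += 1            # reuse: prefill work is skipped
        else:
            self.store[key] = self.compute_kv(chunk_text)
        return self.store[key]

cache = ChunkKVCache(compute_kv=len)  # `len` stands in for a prefill pass
first = cache.get("retrieved chunk")
second = cache.get("retrieved chunk")  # served from cache
```

The speedup and the accuracy risk come from the same line: the cached state is computed for the chunk in isolation, so techniques that repair cross-chunk dependencies (the ones the paper combines) all hook in around this reuse step.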
[10] Thinking into the Future: Latent Lookahead Training for Transformers
Lorenzo Noci, Gregor Bachmann, Seyed-Mohsen Moosavi-Dezfooli, Moin Nabi
Main category: cs.CL
TL;DR: Latent lookahead training enables language models to perform multi-step lookahead in latent space before generating tokens, improving performance on planning tasks requiring foresight.
Details
Motivation: Standard autoregressive models generate tokens one at a time without exploring multiple continuations, and allocate uniform compute per token, limiting expressiveness for difficult tokens that may require more computation.
Method: Introduces latent lookahead training where at selected positions, before committing to the next token, the model performs τ-step lookahead in latent space by recursively feeding hidden states back into context, supervised against the next τ ground-truth tokens.
Result: Latent lookahead substantially outperforms both autoregressive and non-autoregressive baselines on planning tasks like maze solving, Sudoku, and ProsQA where foresight is essential.
Conclusion: The approach enables models to “think” before generating, addressing limitations of standard next-token prediction by allowing exploration of multiple continuations and adaptive compute allocation.
Abstract: Autoregressive language models trained with next-token prediction generate text by sampling one discrete token at a time. Although very scalable, this objective forces the model to commit at every step, preventing it from exploring or reflecting upon multiple plausible continuations. Furthermore, the compute allocation across tokens is uniform; every token is formed based on a single forward-pass, potentially limiting the model’s expressiveness in cases where difficult tokens require inherently more compute. Towards addressing these limitations, we introduce latent lookahead, a training strategy that enables models to “think” before generating: at selected positions in the sequence, before committing to the next token, the model performs a multi-step lookahead in latent space. More precisely, instead of sampling future tokens, we leverage the network’s latent space by recursively feeding its hidden states back into the context for $τ$ steps, investing more compute on predicting that token. This produces $τ$ latent predictions that are supervised against the next $τ$ ground-truth tokens, encouraging the model to “lookahead” and refine its prediction. We show that latent lookahead substantially outperforms both autoregressive and non-autoregressive baselines on planning tasks such as maze solving, Sudoku, and ProsQA, where foresight is essential.
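The τ-step recursion and its supervision pattern can be illustrated with a minimal numpy sketch; the linear "model step", hidden size, and scalar readout below are stand-ins for the paper's Transformer, chosen only to show the shape of the computation.

```python
# Minimal sketch of tau-step latent lookahead (illustrative only; the
# paper feeds Transformer hidden states back into the context).
import numpy as np

rng = np.random.default_rng(0)
d = 8          # hidden size (assumption)
tau = 3        # lookahead depth
W = rng.normal(size=(d, d)) * 0.1   # stand-in for one "model step"
readout = rng.normal(size=(d,))     # latent -> scalar prediction

def lookahead(h, tau):
    """Recursively feed the hidden state back for tau steps and
    return one latent prediction per step."""
    preds = []
    for _ in range(tau):
        h = np.tanh(W @ h)          # next latent state
        preds.append(readout @ h)   # to be supervised against a token
    return np.array(preds)

h0 = rng.normal(size=(d,))
targets = rng.normal(size=(tau,))
preds = lookahead(h0, tau)
loss = float(np.mean((preds - targets) ** 2))  # tau-step supervision
```

Each of the τ latent predictions is matched against one of the next τ ground-truth targets, investing extra compute on the position before committing.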
[11] Linguistic Signatures for Enhanced Emotion Detection
Florian Lecourt, Madalina Croitoru, Konstantin Todorov
Main category: cs.CL
TL;DR: Linguistic features extracted from emotion datasets improve transformer-based emotion recognition performance when combined with neural representations.
Details
Motivation: While transformer models have advanced emotion detection in NLP, little is known about the linguistic regularities that characterize emotion expression across different corpora and labels. The study aims to determine if linguistic features can serve as reliable interpretable signals for emotion recognition in text.
Method: Extracted emotion-specific linguistic signatures from 13 English datasets and evaluated how incorporating these features into transformer models impacts performance. Used RoBERTa-based models enriched with high-level linguistic features.
Result: Models achieved consistent performance gains of up to +2.4 macro F1 on the GoEmotions benchmark, showing that explicit lexical cues can complement neural representations and improve robustness in predicting emotion categories.
Conclusion: Linguistic features provide reliable interpretable signals for emotion recognition and can effectively complement transformer-based neural representations to improve emotion detection performance.
Abstract: Emotion detection is a central problem in NLP, with recent progress driven by transformer-based models trained on established datasets. However, little is known about the linguistic regularities that characterize how emotions are expressed across different corpora and labels. This study examines whether linguistic features can serve as reliable interpretable signals for emotion recognition in text. We extract emotion-specific linguistic signatures from 13 English datasets and evaluate how incorporating these features into transformer models impacts performance. Our RoBERTa-based models enriched with high level linguistic features achieve consistent performance gains of up to +2.4 macro F1 on the GoEmotions benchmark, showing that explicit lexical cues can complement neural representations and improve robustness in predicting emotion categories.
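The enrichment step can be sketched as concatenating a few hand-crafted lexical cues onto a neural sentence embedding, which is the general pattern the paper applies to RoBERTa representations. The specific features and word lists below are invented for illustration, and the zero vector stands in for a real encoder output.

```python
# Sketch: hand-crafted lexical cues concatenated with a (stand-in)
# neural sentence embedding before classification.
import numpy as np

EMOTION_WORDS = {"love", "hate", "happy", "sad", "angry"}  # toy lexicon

def linguistic_features(text: str) -> np.ndarray:
    tokens = text.lower().split()
    n = max(len(tokens), 1)
    return np.array([
        text.count("!") / n,                             # exclamation density
        sum(t in EMOTION_WORDS for t in tokens) / n,     # emotion-lexicon hits
        sum(t in {"i", "me", "my"} for t in tokens) / n, # first-person use
    ])

def enriched_representation(text: str, neural_emb: np.ndarray) -> np.ndarray:
    # A classifier head would consume this concatenated vector.
    return np.concatenate([neural_emb, linguistic_features(text)])

emb = np.zeros(4)  # stand-in for a RoBERTa [CLS] embedding
vec = enriched_representation("i love this! so happy", emb)
```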
[12] Beyond Test-Time Compute Strategies: Advocating Energy-per-Token in LLM Inference
Patrick Wilhelm, Thorsten Wittkopp, Odej Kao
Main category: cs.CL
TL;DR: Analysis of energy-accuracy trade-offs in LLMs vs SLMs with reasoning strategies, proposing energy efficiency metrics and controlled reasoning for sustainable AI deployment.
Details
Motivation: LLMs have high energy and computational costs, while SLMs with reasoning strategies can approach LLM performance but create energy-accuracy trade-offs. Need to balance accuracy with sustainability in AI deployment.
Method: Analyzes trade-offs in test-time compute strategies comparing SLMs with reasoning (CoT, Majority Voting) vs LLMs using MMLU benchmark. Examines transformer input-output token dynamics and proposes energy efficiency metrics (Energy-per-Token). Introduces controlled reasoning with operating curves to dynamically regulate reasoning depth.
Result: Shows SLMs with reasoning strategies can approach LLM performance but with energy trade-offs. Transformer architectures have nonlinear hardware energy operation curves. Energy efficiency metrics complement traditional accuracy benchmarks.
Conclusion: Proposes energy-aware routing mechanism integrating model selection and inference strategies to balance accuracy with sustainable AI deployment through controlled reasoning and energy efficiency metrics.
Abstract: Large Language Models (LLMs) demonstrate exceptional performance across diverse tasks but come with substantial energy and computational costs, particularly in request-heavy scenarios. In many real-world applications, the full scale and capabilities of LLMs are often unnecessary, as Small Language Models (SLMs) can provide accurate responses for simpler text generation tasks. When enhanced with advanced reasoning strategies, such as Chain-of-Thought (CoT) prompting or Majority Voting, SLMs can approach the performance of larger models while reducing overall computational requirements. However, these strategies can also introduce additional energy costs, creating an energy-accuracy trade-off. Our analysis examines these trade-offs in test-time compute strategies for smaller models compared to larger ones, using the MMLU benchmark. Additionally, we explore the input-output token dynamics of transformer architectures, which result in nonlinear hardware energy operation curves for LLMs. To bridge AI research with its physical impact, we propose \textit{energy efficiency metrics}, including Energy-per-Token, as complements to traditional accuracy benchmarks. Beyond model selection, we propose controlled reasoning in CoT token generation, using operating curves to regulate reasoning depth dynamically. This vision integrates an energy-aware routing mechanism, ensuring that model selection and inference strategies balance accuracy with sustainable AI deployment.
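An Energy-per-Token style metric is, at its simplest, a ratio; the sketch below shows that ratio plus one way to fold accuracy into an energy-aware comparison. The paper's exact definitions may differ, and the numbers are purely illustrative.

```python
# Sketch of Energy-per-Token style metrics (formulas are the obvious
# ratios; the paper's precise definitions may differ).
def energy_per_token(energy_joules: float, generated_tokens: int) -> float:
    if generated_tokens <= 0:
        raise ValueError("need at least one generated token")
    return energy_joules / generated_tokens

def accuracy_per_joule(accuracy: float, energy_joules: float) -> float:
    # One way to compare models under an energy budget.
    return accuracy / energy_joules

# Illustrative numbers: an SLM with CoT emits more tokens but each one
# is cheaper than a large model's.
slm_ept = energy_per_token(120.0, 400)   # J per token
llm_ept = energy_per_token(500.0, 250)
```

Under these made-up numbers the SLM is cheaper per token and per unit accuracy, which is the kind of trade-off an energy-aware router would exploit.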
[13] Abjad-Kids: An Arabic Speech Classification Dataset for Primary Education
Abdul Aziz Snoubara, Baraa Al_Maradni, Haya Al_Naal, Malek Al_Madrmani, Roaa Jdini, Seedra Zarzour, Khloud Al Jallad
Main category: cs.CL
TL;DR: Arabic children’s speech dataset (Abjad-Kids) with 46K audio samples for educational content classification using hierarchical CNN-LSTM models.
Details
Motivation: Addressing the lack of publicly available children's speech datasets for low-resource languages like Arabic, particularly for educational applications involving alphabets, numbers, and colors.
Method: Created Abjad-Kids dataset with 46,397 audio samples from children aged 3-12, then proposed hierarchical CNN-LSTM classification with two grouping strategies: static linguistic-based and dynamic clustering-based, using data augmentation and regularization.
Result: Static linguistic-based grouping achieved superior performance, CNN-LSTM with data augmentation outperformed traditional ML, but experiments showed overfitting challenges due to limited samples despite augmentation.
Conclusion: Abjad-Kids addresses the gap in Arabic children’s speech datasets and shows promise for educational applications, though more data collection is needed to overcome overfitting issues.
Abstract: Speech-based AI educational applications have gained significant interest in recent years, particularly for children. However, children's speech research remains limited due to the lack of publicly available datasets, especially for low-resource languages such as Arabic. This paper presents Abjad-Kids, an Arabic speech dataset designed for kindergarten and primary education, focusing on fundamental learning of alphabets, numbers, and colors. The dataset consists of 46,397 audio samples collected from children aged 3-12 years, covering 141 classes. All samples were recorded under controlled specifications to ensure consistency in duration, sampling rate, and format. To address high intra-class similarity among Arabic phonemes and the limited samples per class, we propose a hierarchical audio classification based on CNN-LSTM architectures. Our proposed methodology decomposes alphabet recognition into a two-stage process: an initial grouping classification model followed by specialized classifiers for each group. Both strategies, static linguistic-based grouping and dynamic clustering-based grouping, were evaluated. Experimental results demonstrate that static linguistic-based grouping achieves superior performance. Comparisons between traditional machine learning and deep learning approaches highlight the effectiveness of CNN-LSTM models combined with data augmentation. Despite achieving promising results, most of our experiments indicate a challenge with overfitting, which is likely due to the limited number of samples, even after data augmentation and model regularization. Thus, future work may focus on collecting additional data to address this issue. Abjad-Kids will be publicly available. We hope that Abjad-Kids enriches children's representation in speech datasets and serves as a good resource for future research in Arabic speech classification for kids.
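The two-stage decomposition described above (a coarse group classifier routing each sample to a specialized per-group classifier) can be sketched with plain callables; the toy "classifiers" and the group names below are invented placeholders, where the paper uses CNN-LSTM models and linguistic groupings.

```python
# Sketch of the two-stage hierarchical routing: stage 1 predicts a
# coarse group, stage 2 dispatches to that group's specialist.
def make_hierarchical(group_clf, per_group_clfs):
    def predict(x):
        g = group_clf(x)                # stage 1: coarse group
        return g, per_group_clfs[g](x)  # stage 2: class within group
    return predict

# Hypothetical static grouping of a few class labels (illustrative).
group_clf = lambda x: "guttural" if x in {"ain", "ghain"} else "plain"
per_group = {
    "guttural": lambda x: x.upper(),  # stand-in for a specialist model
    "plain":    lambda x: x,
}
predict = make_hierarchical(group_clf, per_group)
```

Each specialist only discriminates within its group, which is how the design mitigates high intra-class similarity.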
[14] Decoding the decoder: Contextual sequence-to-sequence modeling for intracortical speech decoding
Michal Olak, Tommaso Boccato, Matteo Ferrante
Main category: cs.CL
TL;DR: Transformer-based sequence-to-sequence model for speech BCI decodes phonemes and words from intracortical recordings, achieving state-of-the-art performance with Neural Hammer Scalpel calibration for day-to-day variability.
Details
Motivation: Speech brain-computer interfaces need robust decoders that handle limited data and day-to-day variability in neural recordings. While prior systems used framewise phoneme decoding with language models, it's unclear what contextual sequence-to-sequence decoding contributes to neural readout, robustness, and interpretability.
Method: Multitask Transformer-based sequence-to-sequence model that jointly predicts phoneme sequences, word sequences, and auxiliary acoustic features from area 6v intracortical recordings. Introduced Neural Hammer Scalpel (NHS) calibration module combining global alignment with feature-wise modulation to address day-to-day nonstationarity.
Result: Achieved state-of-the-art phoneme error rate of 14.3% on Willett et al. dataset. Word decoding reached 25.6% WER with direct decoding and 19.4% WER with candidate generation and rescoring. NHS substantially improved both phoneme and word decoding compared to linear or no day-specific transform.
Conclusion: Contextual sequence-to-sequence modeling improves neural-to-phoneme readout fidelity from intracortical speech signals. Attention-based analyses provide insights into how neural speech evidence is segmented and accumulated over time, offering interpretability benefits.
Abstract: Speech brain–computer interfaces require decoders that translate intracortical activity into linguistic output while remaining robust to limited data and day-to-day variability. While prior high-performing systems have largely relied on framewise phoneme decoding combined with downstream language models, it remains unclear what contextual sequence-to-sequence decoding contributes to sublexical neural readout, robustness, and interpretability. We evaluated a multitask Transformer-based sequence-to-sequence model for attempted speech decoding from area 6v intracortical recordings. The model jointly predicts phoneme sequences, word sequences, and auxiliary acoustic features. To address day-to-day nonstationarity, we introduced the Neural Hammer Scalpel (NHS) calibration module, which combines global alignment with feature-wise modulation. We further analyzed held-out-day generalization and attention patterns in the encoder and decoders. On the Willett et al. dataset, the proposed model achieved a state-of-the-art phoneme error rate of 14.3%. Word decoding reached 25.6% WER with direct decoding and 19.4% WER with candidate generation and rescoring. NHS substantially improved both phoneme and word decoding relative to linear or no day-specific transform, while held-out-day experiments showed increasing degradation on unseen days with temporal distance. Attention visualizations revealed recurring temporal chunking in encoder representations and distinct use of these segments by phoneme and word decoders. These results indicate that contextual sequence-to-sequence modeling can improve the fidelity of neural-to-phoneme readout from intracortical speech signals and suggest that attention-based analyses can generate useful hypotheses about how neural speech evidence is segmented and accumulated over time.
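The abstract describes NHS as global alignment combined with feature-wise modulation; a calibration in that spirit is a linear map followed by a FiLM-style per-feature scale and shift. The sketch below is an assumption about the functional form, not the paper's exact parameterization.

```python
# Sketch of a day-specific calibration in the spirit of NHS:
# global linear alignment, then feature-wise (FiLM-style) modulation.
import numpy as np

def nhs_calibrate(x, A, b, gamma, beta):
    """x: (features,) neural feature vector from a new recording day."""
    aligned = A @ x + b            # global alignment across features
    return gamma * aligned + beta  # per-feature scale and shift

d = 4
x = np.ones(d)
A = np.eye(d)              # identity alignment for the demo
b = np.zeros(d)
gamma = np.full(d, 2.0)    # per-feature gain
beta = np.full(d, -1.0)    # per-feature offset
y = nhs_calibrate(x, A, b, gamma, beta)
```

In practice A, b, gamma, and beta would be fit on a small calibration block from each session before decoding.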
[15] TaigiSpeech: A Low-Resource Real-World Speech Intent Dataset and Preliminary Results with Scalable Data Mining In-the-Wild
Kai-Wei Chang, Yi-Cheng Lin, Huang-Cheng Chou, Wenze Ren, Yu-Han Huang, Yun-Shao Tsai, Chien-Cheng Chen, Yu Tsao, Yuan-Fu Liao, Shrikanth Narayanan, James Glass, Hung-yi Lee
Main category: cs.CL
TL;DR: TaigiSpeech: A real-world speech intent dataset for low-resource Taiwanese Taigi language collected from older adults, with 3k utterances for healthcare/home assistant applications, using data mining strategies including LLM pseudo-labeling and audio-visual frameworks.
Details
Motivation: Many languages remain underrepresented in speech technologies due to limited resources. Taiwanese Taigi is a low-resource, primarily spoken language that needs practical intent detection datasets for applications like healthcare and home assistants.
Method: Created TaigiSpeech dataset with 21 speakers and 3k utterances. Explored two data mining strategies: 1) keyword match with LLM pseudo-labeling via intermediate language, and 2) audio-visual framework leveraging multimodal cues with minimal textual supervision.
Result: Developed a scalable dataset construction approach for low-resource and unwritten spoken languages. The dataset will be released under CC BY 4.0 license to facilitate research on underrepresented languages.
Conclusion: TaigiSpeech addresses the scarcity of labeled data for low-resource languages and enables practical intent detection applications, with potential for broader adoption in speech technology research for underrepresented languages.
Abstract: Speech technologies have advanced rapidly and serve diverse populations worldwide. However, many languages remain underrepresented due to limited resources. In this paper, we introduce \textbf{TaigiSpeech}, a real-world speech intent dataset in Taiwanese Taigi (aka Taiwanese Hokkien/Southern Min), which is a low-resource and primarily spoken language. The dataset is collected from older adults, comprising 21 speakers with a total of 3k utterances. It is designed for practical intent detection scenarios, including healthcare and home assistant applications. To address the scarcity of labeled data, we explore two data mining strategies with two levels of supervision: keyword match data mining with LLM pseudo labeling via an intermediate language and an audio-visual framework that leverages multimodal cues with minimal textual supervision. This design enables scalable dataset construction for low-resource and unwritten spoken languages. TaigiSpeech will be released under the CC BY 4.0 license to facilitate broad adoption and research on low-resource and unwritten languages. The project website and the dataset can be found on https://kwchang.org/taigispeech.
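The keyword-match stage of the mining pipeline can be sketched as a scan over (intermediate-language) transcripts that emits weakly labeled candidates for downstream LLM pseudo-labeling. The intent names and keyword lists below are invented for illustration.

```python
# Sketch of keyword-match data mining: surface weakly labeled
# candidates from transcripts for later LLM pseudo-labeling.
INTENT_KEYWORDS = {  # hypothetical intents for healthcare / home use
    "call_family": {"call", "phone"},
    "medication_reminder": {"medicine", "pill"},
}

def mine_candidates(transcripts):
    """transcripts: iterable of (utterance_id, text) pairs."""
    mined = []
    for utt_id, text in transcripts:
        tokens = set(text.lower().split())
        for intent, kws in INTENT_KEYWORDS.items():
            if tokens & kws:  # any keyword hit -> weak candidate label
                mined.append((utt_id, intent))
    return mined

candidates = mine_candidates([
    ("u1", "please call my daughter"),
    ("u2", "time for my pill"),
    ("u3", "what a nice day"),
])
```

The weak labels would then be verified or refined by an LLM, which is the second supervision level the paper describes.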
[16] FinReflectKG – HalluBench: GraphRAG Hallucination Benchmark for Financial Question Answering Systems
Mahesh Kumar, Bhaskarjit Sarmah, Stefano Pasquali
Main category: cs.CL
TL;DR: FinBench-QA-Hallucination benchmark evaluates hallucination detection methods in KG-augmented financial QA systems using SEC 10-K filings, showing LLM judges and embedding methods perform best but degrade with noisy triplets.
Details
Motivation: Ensuring factual accuracy in AI-powered financial QA systems is critical for compliance, risk assessment, and decision support, as hallucinations can lead to regulatory violations and flawed decisions. Current KG-augmented QA systems lack systematic hallucination detection mechanisms.
Method: Created FinBench-QA-Hallucination benchmark with 755 annotated examples from 300 SEC 10-K filing pages, using conservative evidence-linkage protocol requiring support from both textual chunks and extracted relational triplets. Evaluated six detection approaches: LLM judges, fine-tuned classifiers, NLI models, span detectors, and embedding-based methods under clean and noisy triplet conditions.
Result: LLM-based judges and embedding approaches achieved highest performance (F1: 0.82-0.86) under clean conditions. Most methods degraded significantly with noisy triplets (MCC dropping 44-84%), while embedding methods remained relatively robust (only 9% degradation). Statistical tests confirmed significant performance differences (p < 0.001).
Conclusion: The benchmark reveals vulnerabilities in current KG-augmented systems and provides insights for building reliable financial information systems. It offers a framework for integrating AI reliability evaluation into high-stakes domains beyond finance, including healthcare, legal, and government applications.
Abstract: As organizations increasingly integrate AI-powered question-answering systems into financial information systems for compliance, risk assessment, and decision support, ensuring the factual accuracy of AI-generated outputs becomes a critical engineering challenge. Current Knowledge Graph (KG)-augmented QA systems lack systematic mechanisms to detect hallucinations - factually incorrect outputs that undermine reliability and user trust. We introduce FinBench-QA-Hallucination, a benchmark for evaluating hallucination detection methods in KG-augmented financial QA over SEC 10-K filings. The dataset contains 755 annotated examples from 300 pages, each labeled for groundedness using a conservative evidence-linkage protocol requiring support from both textual chunks and extracted relational triplets. We evaluate six detection approaches - LLM judges, fine-tuned classifiers, Natural Language Inference (NLI) models, span detectors, and embedding-based methods under two conditions: with and without KG triplets. Results show that LLM-based judges and embedding approaches achieve the highest performance (F1: 0.82-0.86) under clean conditions. However, most methods degrade significantly when noisy triplets are introduced, with Matthews Correlation Coefficient (MCC) dropping 44-84 percent, while embedding methods remain relatively robust with only 9 percent degradation. Statistical tests (Cochran’s Q and McNemar) confirm significant performance differences (p < 0.001). Our findings highlight vulnerabilities in current KG-augmented systems and provide insights for building reliable financial information systems, where hallucinations can lead to regulatory violations and flawed decisions. The benchmark also offers a framework for integrating AI reliability evaluation into information system design across other high-stakes domains such as healthcare, legal, and government.
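The embedding-based detector family (the one the benchmark finds most robust to noisy triplets) reduces to a similarity threshold against the evidence. The sketch below uses toy 2-D vectors and an assumed threshold; real systems would embed answer and evidence with a sentence encoder.

```python
# Sketch of embedding-based groundedness checking: an answer counts as
# grounded if it is close enough to some evidence embedding.
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def is_grounded(answer_emb, evidence_embs, threshold=0.8):
    """threshold is an assumption, tuned on validation data in practice."""
    return max(cosine(answer_emb, e) for e in evidence_embs) >= threshold

evidence = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
grounded = is_grounded(np.array([0.9, 0.1]), evidence)
hallucinated = not is_grounded(np.array([0.7, -0.7]), evidence)
```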
[17] SciNav: A General Agent Framework for Scientific Coding Tasks
Tianshu Zhang, Huan Sun
Main category: cs.CL
TL;DR: SciNav is an LLM-based science agent framework that uses pairwise relative judgments in tree search to efficiently explore and select high-quality solutions for scientific coding tasks under constrained search budgets.
Details
Motivation: Current LLM-based science agents focus on open-ended problems with subjective outputs, while scientific coding tasks offer executable outputs for objective assessment. Existing approaches are engineering-driven pipelines lacking structured, end-to-end frameworks for scientific coding tasks.
Method: SciNav framework uses pairwise relative judgments within tree search to select top-K promising solution branches, prune low-potential ones, and progressively narrow down solution candidates guided by relative comparisons, operating under constrained search budgets.
Result: SciNav significantly outperforms direct prompting and prior agents (OpenHands, Self-Debug) across different base models, task types, and difficulty levels, and exceeds frontier comparators like random selection and LLM absolute scoring.
Conclusion: The framework demonstrates the effectiveness of relative judgment-guided top-K search for high-quality scientific coding, marking progress toward more practical science agents for objective, executable tasks.
Abstract: Autonomous science agents built on large language models (LLMs) are increasingly used to generate hypotheses, design experiments, and produce reports. However, prior work mainly targets open-ended scientific problems with subjective outputs that are difficult to evaluate. Scientific coding benchmarks, by contrast, provide executable outputs for objective assessment. Existing approaches remain engineering-driven pipelines, revealing the need for structured, end-to-end science agent frameworks for scientific coding tasks. We address this gap by focusing on scientific coding tasks, where evaluation can be conducted rigorously, and introducing an agent framework SciNav (Scientific Navigator) that enables more effective solution exploration. Our framework is designed to operate under constrained search budgets, moving beyond reliance on pre-defined success metrics and prolonged search cycles. Inspired by findings that comparative judgments often reveal finer-grained quality differences and therefore provide greater discriminative power than absolute scoring, our framework leverages pairwise relative judgments within a tree search process to select top-K promising solution branches, prune low-potential ones, and progressively narrow down the solution candidates on the selected branches guided by relative comparisons. We demonstrate our agent’s effectiveness across different types of tasks on two benchmarks. Experiments show that SciNav significantly outperforms direct prompting and prior agents like OpenHands and Self-Debug across different base models, task types, and difficulty levels, and exceeds different frontier comparators such as random selection and LLM absolute scoring. These results confirm the strength of our agent design and highlight the effectiveness of relative judgment-guided top-K search for high-quality scientific coding, marking a step toward more practical science agents.
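One round of relative-judgment-guided top-K selection can be sketched as a pairwise tournament: every pair of candidate branches is compared, and the K with the most wins survive. The toy comparator below (higher hidden score wins) stands in for an LLM pairwise judge.

```python
# Sketch of top-K selection via pairwise relative judgments, as in a
# single pruning step of a tree search.
from itertools import combinations

def top_k_by_pairwise(candidates, prefer, k):
    """prefer(a, b) -> True if a is judged better than b."""
    wins = {c: 0 for c in candidates}
    for a, b in combinations(candidates, 2):
        if prefer(a, b):
            wins[a] += 1
        else:
            wins[b] += 1
    return sorted(candidates, key=lambda c: wins[c], reverse=True)[:k]

# Toy comparator: hidden scores stand in for an LLM judge's preference.
scores = {"sol_a": 0.9, "sol_b": 0.4, "sol_c": 0.7}
kept = top_k_by_pairwise(list(scores), lambda a, b: scores[a] > scores[b], k=2)
```

Repeating this step down the tree progressively narrows the candidate set under a fixed judgment budget.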
[18] TiCo: Time-Controllable Training for Spoken Dialogue Models
Kai-Wei Chang, Wei-Chih Chen, En-Pei Hu, Hung-yi Lee, James Glass
Main category: cs.CL
TL;DR: TiCo is a post-training method that enables spoken dialogue models to follow time-constrained instructions and generate responses with controllable duration using Spoken Time Markers.
Details
Motivation: Real-world spoken language systems like voice assistants need to control response duration for better interaction quality, but existing spoken dialogue models lack time awareness and struggle with duration-related instructions.
Method: TiCo uses Spoken Time Markers (e.g., <10.6 seconds>) to help models estimate elapsed speaking time during generation. It requires minimal data and no additional QA pairs, using self-generation and reinforcement learning for efficient training.
Result: TiCo significantly improves adherence to duration constraints while preserving response quality, addressing the time-control limitations of existing spoken dialogue models.
Conclusion: TiCo provides a simple and effective solution for enabling time-aware spoken dialogue generation, enhancing the practical utility of spoken language systems.
Abstract: We propose TiCo, a simple post-training method for enabling spoken dialogue models (SDMs) to follow time-constrained instructions and generate responses with controllable duration. This capability is valuable for real-world spoken language systems such as voice assistants and interactive agents, where controlling response duration can improve interaction quality. However, despite their strong ability to generate natural spoken responses, existing models lack time awareness and struggle to follow duration-related instructions (e.g., “Please generate a response lasting about 15 seconds”). Through an empirical evaluation of both open-source and commercial SDMs, we show that they frequently fail to satisfy such time-control requirements. TiCo addresses this limitation by enabling models to estimate elapsed speaking time during generation through Spoken Time Markers (STM) (e.g., <10.6 seconds>). These markers help the model maintain awareness of time and adjust the remaining content to meet the target duration. TiCo is simple and efficient: it requires only a small amount of data and no additional question-answer pairs, relying instead on self-generation and reinforcement learning. Experimental results show that TiCo significantly improves adherence to duration constraints while preserving response quality.
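The marker mechanism can be illustrated by annotating a response stream with elapsed-time tokens derived from per-segment durations; the marker spacing and exact format below are assumptions, not the paper's precise scheme.

```python
# Sketch of inserting STM-style elapsed-time markers into a response
# stream (format and spacing are illustrative).
def insert_time_markers(segments, every_seconds=5.0):
    """segments: list of (text, duration_seconds) pairs."""
    out, elapsed, next_mark = [], 0.0, every_seconds
    for text, dur in segments:
        out.append(text)
        elapsed += dur
        if elapsed >= next_mark:
            out.append(f"<{elapsed:.1f} seconds>")  # elapsed-time marker
            next_mark += every_seconds
    return out

stream = insert_time_markers([("hello there,", 2.0),
                              ("today I want to", 1.8),
                              ("talk about time.", 1.5)])
```

During training the model learns to emit and condition on these markers, giving it a running estimate of speaking time against the target duration.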
[19] Assessing the Ability of Neural TTS Systems to Model Consonant-Induced F0 Perturbation
Tianle Yang, Chengzhe Sun, Phil Rose, Cassandra L. Jacobs, Siwei Lyu
Main category: cs.CL
TL;DR: A segmental-level prosodic probing framework evaluates neural TTS models’ ability to reproduce consonant-induced f0 perturbation, revealing limitations in generalization beyond high-frequency words.
Details
Motivation: To evaluate neural TTS models' ability to reproduce fine-grained segmental-prosodic effects (consonant-induced f0 perturbation) that reflect local articulatory mechanisms, and to understand whether TTS systems rely on lexical memorization or abstract prosodic encoding.
Method: Proposes a segmental-level prosodic probing framework comparing synthetic and natural speech realizations for thousands of words stratified by lexical frequency. Uses Tacotron 2 and FastSpeech 2 trained on LJ Speech corpus, complemented by large-scale evaluation across multiple advanced TTS systems.
Result: TTS models accurately reproduce f0 perturbation for high-frequency words but show poor generalization to low-frequency items, suggesting reliance on lexical-level memorization rather than abstract segmental-prosodic encoding.
Conclusion: The study reveals a limitation in TTS systems’ ability to generalize prosodic detail beyond seen data, and proposes a linguistically informed diagnostic framework for future TTS evaluation with implications for interpretability and authenticity assessment.
Abstract: This study proposes a segmental-level prosodic probing framework to evaluate neural TTS models’ ability to reproduce consonant-induced f0 perturbation, a fine-grained segmental-prosodic effect that reflects local articulatory mechanisms. We compare synthetic and natural speech realizations for thousands of words, stratified by lexical frequency, using Tacotron 2 and FastSpeech 2 trained on the same speech corpus (LJ Speech). These controlled analyses are then complemented by a large-scale evaluation spanning multiple advanced TTS systems. Results show accurate reproduction for high-frequency words but poor generalization to low-frequency items, suggesting that the examined TTS architectures rely more on lexical-level memorization than on abstract segmental-prosodic encoding. This finding highlights a limitation in such TTS systems’ ability to generalize prosodic detail beyond seen data. The proposed probe offers a linguistically informed diagnostic framework that may inform future TTS evaluation methods, and has implications for interpretability and authenticity assessment in synthetic speech.
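The probe's core comparison reduces to a contrast of vowel-onset f0 after voiceless versus voiced consonants, computed in both natural and synthetic speech. The sketch below shows that contrast; the f0 values are fabricated for illustration (the direction, voiceless raising onset f0, is the well-documented effect).

```python
# Sketch of the segmental probe's core statistic: onset-f0 difference
# after voiceless vs voiced consonants (Hz values are made up).
import numpy as np

def onset_f0_difference(f0_after_voiceless, f0_after_voiced):
    """Positive values match the expected perturbation direction."""
    return float(np.mean(f0_after_voiceless) - np.mean(f0_after_voiced))

natural = onset_f0_difference([212.0, 215.0], [203.0, 205.0])
synthetic = onset_f0_difference([208.0, 209.0], [207.5, 208.5])
# A TTS system that memorizes rather than generalizes would show a
# natural-like difference on frequent words but a flattened one (as
# here) on rare words.
```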
[20] The production of meaning in the processing of natural language
Christopher J. Agostino, Quan Le Thien, Nayan D’Souza, Louis van der Elst
Main category: cs.CL
TL;DR: The paper explores quantum-like contextuality in large language models, measuring CHSH |S| parameter violations across model scales and finding these metrics orthogonal to traditional benchmarks like MMLU and hallucination rates.
Details
Motivation: To understand how meaning production in language processing exhibits quantum-like contextuality rather than classical Boolean mechanisms, and to investigate this phenomenon in large language models as it relates to human-agent interaction safety and manipulation.
Method: Measure CHSH |S| parameter (Bell inequality violations) across inference parameter space of models spanning four orders of magnitude in scale, cross-reference with MMLU, hallucination rate, and nonsense detection benchmarks, and analyze variations with sampling parameters and word order.
Result: The interquartile range of |S| distribution is completely orthogonal to all external benchmarks, while violation rate shows weak anticorrelation with benchmarks that doesn’t reach significance. Contextuality metrics provide different insights than traditional performance measures.
Conclusion: Quantum-like contextuality in LLMs represents a fundamental aspect of semantic processing that is distinct from traditional performance metrics, with implications for prompt injection defenses and social manipulation through context shaping.
Abstract: Understanding the fundamental mechanisms governing the production of meaning in the processing of natural language is critical for designing safe, thoughtful, engaging, and empowering human-agent interactions. Experiments in cognitive science and social psychology have demonstrated that human semantic processing exhibits contextuality more consistent with quantum logical mechanisms than classical Boolean theories, and recent works have found similar results in large language models – in particular, clear violations of the Bell inequality in experiments of contextuality during interpretation of ambiguous expressions. We explore the CHSH $|S|$ parameter – the metric associated with the inequality – across the inference parameter space of models spanning four orders of magnitude in scale, cross-referencing it with MMLU, hallucination rate, and nonsense detection benchmarks. We find that the interquartile range of the $|S|$ distribution – the statistic that most sharply differentiates models from one another – is completely orthogonal to all external benchmarks, while violation rate shows weak anticorrelation with all three benchmarks that does not reach significance. We investigate how $|S|$ varies with sampling parameters and word order, and discuss the information-theoretic constraints that genuine contextuality imposes on prompt injection defenses and its human analogue, whereby careful construction and maintenance of social contextuality can be carried out at scale – manufacturing not consent but contextuality itself, a subtler and more fundamental form of manipulation that shapes the space of possible interpretations before any particular one is reached.
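For reference, the CHSH quantity is a fixed combination of four correlations between pairs of measurement settings, with |S| ≤ 2 for any classical (non-contextual) model and |S| up to 2√2 quantum-mechanically. The sketch below just evaluates that combination; how the correlations are estimated from model behavior is the paper's contribution.

```python
# The CHSH combination: S = E(a,b) + E(a,b') + E(a',b) - E(a',b').
# Classical (Boolean, non-contextual) models obey |S| <= 2.
def chsh_s(e_ab, e_ab2, e_a2b, e_a2b2):
    return e_ab + e_ab2 + e_a2b - e_a2b2

classical = abs(chsh_s(1, 1, 1, 1))                          # at the bound
quantum_max = abs(chsh_s(0.7071, 0.7071, 0.7071, -0.7071))   # ~2*sqrt(2)
```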
[21] Coding Agents are Effective Long-Context Processors
Weili Cao, Xunjian Yin, Bhuwan Dhingra, Shuyan Zhou
Main category: cs.CL
TL;DR: Coding agents using file systems and native tools outperform traditional LLMs on long-context tasks by externalizing context processing from latent attention to executable interactions.
Details
Motivation: LLMs struggle with long contexts due to uninterpretable attention mechanisms and performance degradation with increasing context length. The paper explores whether long-context processing can be externalized from latent attention into explicit, executable interactions.
Method: Use coding agents to organize text in file systems and manipulate it using native tools (executable code and terminal commands). Evaluate off-the-shelf frontier coding agents on long-context reasoning, retrieval-augmented generation, and open-domain QA with up to three trillion tokens.
Result: Coding agents outperform published state-of-the-art by 17.3% on average across multiple benchmarks. Key factors: native tool proficiency (using executable code rather than passive semantic queries) and file system familiarity (navigating massive corpora as directory structures).
Conclusion: Delegating long-context processing to coding agents offers an effective alternative to semantic search or context window scaling, opening new directions for long-context processing in LLMs.
Abstract: Large Language Models (LLMs) have demonstrated remarkable progress in scaling to access massive contexts. However, this access is mediated by latent, uninterpretable attention mechanisms, and LLMs fail to effectively process long contexts, exhibiting significant performance degradation as context length increases. In this work, we study whether long-context processing can be externalized from latent attention into explicit, executable interactions, by allowing coding agents to organize text in file systems and manipulate it using their native tools. We evaluate off-the-shelf frontier coding agents as the general interface for tasks that require processing long contexts, including long-context reasoning, retrieval-augmented generation, and open-domain question answering with a large-scale corpus containing up to three trillion tokens. Across multiple benchmarks, these agents outperform the published state-of-the-art by 17.3% on average. We attribute this efficacy to two key factors: native tool proficiency, which enables agents to leverage executable code and terminal commands rather than passive semantic queries, and file system familiarity, which allows them to navigate massive text corpora as directory structures. These findings suggest that delegating long-context processing to coding agents offers an effective alternative to semantic search or context window scaling, opening new directions for long-context processing in LLMs.
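The mechanism the paper credits — navigating a corpus as a directory tree and retrieving with executable search rather than attention — can be illustrated with a toy sketch (hypothetical code, not the evaluated agents; `shard_corpus` and `grep` are illustrative names):

```python
import os
import re
import tempfile

def shard_corpus(docs, root):
    """Write each document to its own file so an agent can navigate
    the corpus as a directory tree instead of one context window."""
    for name, text in docs.items():
        with open(os.path.join(root, f"{name}.txt"), "w") as f:
            f.write(text)

def grep(root, pattern):
    """Explicit, executable retrieval: scan files and return matching
    lines, the way a coding agent would run `grep` in a terminal."""
    hits = []
    for fname in sorted(os.listdir(root)):
        with open(os.path.join(root, fname)) as f:
            for lineno, line in enumerate(f, 1):
                if re.search(pattern, line):
                    hits.append((fname, lineno, line.strip()))
    return hits

docs = {
    "doc_a": "The treaty was signed in 1848.\nIt ended the war.",
    "doc_b": "Unrelated text.\nThe treaty of 1848 ceded territory.",
}
with tempfile.TemporaryDirectory() as root:
    shard_corpus(docs, root)
    print(len(grep(root, r"1848")))  # 2 matching lines across both files
```

The point of the sketch is the interface: retrieval is a transparent program over files, not an opaque attention pass, so it scales to corpora far beyond any context window.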
[22] Adapting Self-Supervised Speech Representations for Cross-lingual Dysarthria Detection in Parkinson’s Disease
Abner Hernandez, Eunjung Yeo, Kwanghee Choi, Chin-Jou Li, Zhengjun Yue, Rohan Kumar Das, Jan Rusz, Mathew Magimai Doss, Juan Rafael Orozco-Arroyave, Tomás Arias-Vergara, Andreas Maier, Elmar Nöth, David R. Mortensen, David Harwath, Paula Andrea Perez-Toro
Main category: cs.CL
TL;DR: Proposes representation-level language shift (LS) method to align self-supervised speech representations across languages for cross-lingual dysarthria detection in Parkinson’s disease speech
Details
Motivation: Cross-lingual dysarthria detection is challenging due to limited dysarthric speech data and language-dependent structure in speech representations that can confound detection.
Method: Uses centroid-based vector adaptation estimated from healthy-control speech to align source-language self-supervised speech representations with target-language distribution
Result: LS substantially improves sensitivity and F1 in cross-lingual settings, with smaller but consistent gains in multilingual settings; reduces language identity in embedding space
Conclusion: The proposed language shift method effectively removes language-dependent structure from speech representations, enabling better cross-lingual dysarthria detection
Abstract: The limited availability of dysarthric speech data makes cross-lingual detection an important but challenging problem. A key difficulty is that speech representations often encode language-dependent structure that can confound dysarthria detection. We propose a representation-level language shift (LS) that aligns source-language self-supervised speech representations with the target-language distribution using centroid-based vector adaptation estimated from healthy-control speech. We evaluate the approach on oral DDK recordings from Parkinson’s disease speech datasets in Czech, German, and Spanish under both cross-lingual and multilingual settings. LS substantially improves sensitivity and F1 in cross-lingual settings, while yielding smaller but consistent gains in multilingual settings. Representation analysis further shows that LS reduces language identity in the embedding space, supporting the interpretation that LS removes language-dependent structure.
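The centroid-based adaptation described above reduces to simple vector arithmetic. A minimal sketch, under the assumption that LS is a mean shift between healthy-control centroids (`language_shift` is an illustrative name, not the authors' code):

```python
import numpy as np

def language_shift(src_embeddings, src_hc_centroid, tgt_hc_centroid):
    """Align source-language representations with the target-language
    distribution via a centroid shift estimated (in the paper) from
    healthy-control speech. Hypothetical re-implementation of the idea."""
    return src_embeddings - src_hc_centroid + tgt_hc_centroid

rng = np.random.default_rng(0)
# toy embeddings: source-language utterances offset from the target language
src = rng.normal(loc=2.0, size=(100, 8))
tgt = rng.normal(loc=-1.0, size=(100, 8))

shifted = language_shift(src, src.mean(axis=0), tgt.mean(axis=0))
# after the shift, the source centroid coincides with the target centroid
print(np.allclose(shifted.mean(axis=0), tgt.mean(axis=0)))  # True
```

Because the shift is estimated only from healthy-control speech, any residual deviation from the target distribution can be attributed to dysarthria rather than to language identity.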
[23] A Training-Free Regeneration Paradigm: Contrastive Reflection Memory Guided Self-Verification and Self-Improvement
Yuran Li, Di Wu, Benoit Boulet
Main category: cs.CL
TL;DR: Training-free self-improvement method using offline-curated Reflection Memory for verification-guided regeneration, avoiding iterative correction while improving accuracy
Details
Motivation: Existing verification-guided self-improvement methods face trade-offs between inference efficiency and accuracy - iterative verification-rectification is computationally expensive and prone to faulty reasoning traps, while best-of-N selection requires extensive sampling without addressing internal model flaws.
Method: Proposes a training-free regeneration paradigm using offline-curated contrastive Reflection Memory (RM) to provide corrective guidance. At inference: RM-guided self-verification followed by single RM-guided regeneration, avoiding both iterative correction and multi-sample selection
Result: Method outperforms prior methods on nine benchmarks spanning algorithmic, reasoning, symbolic, and domain-specific tasks in both small- and large-scale LLMs while maintaining low computational cost
Conclusion: The proposed RM-guided regeneration approach provides an effective balance between accuracy and efficiency for LLM self-improvement, addressing limitations of existing verification methods
Abstract: Verification-guided self-improvement has recently emerged as a promising approach to improving the accuracy of large language model (LLM) outputs. However, existing approaches face a trade-off between inference efficiency and accuracy: iterative verification-rectification is computationally expensive and prone to being trapped in faulty reasoning, while best-of-N selection requires extensive sampling without addressing internal model flaws. We propose a training-free regeneration paradigm that leverages an offline-curated contrastive Reflection Memory (RM) to provide corrective guidance, while regenerating from scratch helps break out of faulty reasoning. At inference time, the method performs RM-guided self-verification followed by a single RM-guided regeneration, avoiding both iterative correction and multi-sample selection. We evaluated our method on nine benchmarks that span algorithmic, reasoning, symbolic, and domain-specific tasks in both small- and large-scale LLMs. Experiment results show that our method outperforms prior methods while maintaining low computational cost.
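A hedged sketch of the inference-time control flow as described — one verification pass, then at most one regeneration — using placeholder model and verifier stand-ins (not the authors' implementation):

```python
def rm_guided_answer(prompt, model, verify, reflection_memory):
    """Control flow sketched from the paper's description (hypothetical):
    one RM-guided verification pass, then at most one RM-guided
    regeneration from scratch -- no iterative correction loop and no
    best-of-N sampling."""
    draft = model(prompt, guidance=None)
    if verify(prompt, draft, reflection_memory):
        return draft
    # single regeneration from scratch, conditioned on corrective guidance
    return model(prompt, guidance=reflection_memory.get(prompt, ""))

# toy stand-ins: a "model" that fixes its answer when given guidance,
# and a verifier that flags the known failure mode from the memory
rm = {"2+2?": "Common error: answering 5. Recompute the sum."}
def toy_model(prompt, guidance=None):
    return "4" if guidance else "5"
def toy_verify(prompt, draft, memory):
    return not (prompt in memory and draft == "5")

print(rm_guided_answer("2+2?", toy_model, toy_verify, rm))  # 4
```

The bounded cost is visible in the structure: the model is called at most twice per query, regardless of how the verification turns out.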
[24] Policies Permitting LLM Use for Polishing Peer Reviews Are Currently Not Enforceable
Rounak Saha, Gurusha Juneja, Dayita Chaudhuri, Naveeja Sajeevan, Nihar B Shah, Danish Pruthi
Main category: cs.CL
TL;DR: Current AI-text detectors struggle to accurately identify LLM-polished peer reviews, risking false accusations of academic misconduct and potentially overstating AI policy violations in scientific publishing.
Details
Motivation: Many scientific conferences and journals have banned LLM usage in peer reviews except for polishing/grammar correction, but these policies may be unenforceable due to limitations in current AI-text detection technology.
Method: Created a dataset of peer reviews simulating various levels of human-AI collaboration, then evaluated five state-of-the-art AI-text detectors (including two commercial systems). Investigated whether peer-review-specific signals (access to manuscript, scientific writing domain) could improve detection accuracy.
Result: All detectors misclassified a significant fraction of LLM-polished reviews as AI-generated, risking false misconduct accusations. While incorporating peer-review-specific signals yielded some improvements, no approach met accuracy standards needed for reliable AI-use detection in peer reviews.
Conclusion: Current AI-text detectors are insufficient for enforcing LLM usage policies in peer review, and recent public estimates of AI use based on these detectors should be interpreted cautiously as they likely overstate policy violations.
Abstract: A number of scientific conferences and journals have recently enacted policies that prohibit LLM usage by peer reviewers, except for polishing, paraphrasing, and grammar correction of otherwise human-written reviews. But, are these policies enforceable? To answer this question, we assemble a dataset of peer reviews simulating multiple levels of human-AI collaboration, and evaluate five state-of-the-art detectors, including two commercial systems. Our analysis shows that all detectors misclassify a non-trivial fraction of LLM-polished reviews as AI-generated, thereby risking false accusations of academic misconduct. We further investigate whether peer-review-specific signals, including access to the paper manuscript and the constrained domain of scientific writing, can be leveraged to improve detection. While incorporating such signals yields measurable gains in some settings, we identify limitations in each approach and find that none meets the accuracy standards required for identifying AI use in peer reviews. Importantly, our results suggest that recent public estimates of AI use in peer reviews through the use of AI-text detectors should be interpreted with caution, as current detectors misclassify mixed reviews (collaborative human-AI outputs) as fully AI generated, potentially overstating the extent of policy violations.
[25] Diffutron: A Masked Diffusion Language Model for Turkish Language
Şuayp Talha Kocabay, Talha Rüzgar Akkuş
Main category: cs.CL
TL;DR: Diffutron: A masked diffusion language model for Turkish that uses resource-efficient training and progressive instruction-tuning to achieve competitive performance despite compact size.
Details
Motivation: Masked diffusion language models show promise as non-autoregressive alternatives to standard LLMs, but their application to morphologically rich languages like Turkish remains limited. The authors aim to address this gap by creating a specialized model for Turkish.
Method: 1) Resource-efficient training pipeline starting with LoRA-based continual pre-training of a multilingual encoder on large-scale corpus. 2) Progressive instruction-tuning strategy: sequential adaptation on general and task-specific instruction sets to enable generative capabilities.
Result: Despite its compact size, the model achieves competitive performance compared to existing multi-billion-parameter baselines across comprehensive benchmarks for Turkish language tasks.
Conclusion: Masked diffusion modeling combined with multi-stage tuning is effective for non-autoregressive text generation in morphologically rich languages like Turkish, validating this approach for specialized language applications.
Abstract: Masked Diffusion Language Models (MDLMs) have emerged as a compelling non-autoregressive alternative to standard large language models; however, their application to morphologically rich languages remains limited. In this paper, we introduce $\textit{Diffutron}$, a masked diffusion language model specifically designed for Turkish. Our approach leverages a resource-efficient training pipeline, starting with LoRA-based continual pre-training of a multilingual encoder on a large-scale corpus. To enable generative capabilities, we employ a progressive instruction-tuning strategy, sequentially adapting the model on general and task-specific instruction sets. Experimental results across comprehensive benchmarks demonstrate that, despite its compact size, our model achieves competitive performance compared to existing multi-billion-parameter baselines. These findings validate the effectiveness of masked diffusion modeling combined with multi-stage tuning for non-autoregressive text generation in Turkish.
[26] PARHAF, a human-authored corpus of clinical reports for fictitious patients in French
Xavier Tannier, Salam Abbara, Rémi Flicoteaux, Youness Khalil, Aurélie Névéol, Pierre Zweigenbaum, Emmanuel Bacry
Main category: cs.CL
TL;DR: PARHAF is a large open-source French clinical corpus of synthetic patient reports created by medical experts, designed to be privacy-preserving and freely shareable for clinical NLP development.
Details
Motivation: Clinical NLP development is hindered by privacy regulations that restrict sharing of real medical records, creating a need for synthetic but realistic clinical data that can be freely shared while preserving patient privacy.
Method: Created a corpus of 7,394 clinical reports covering 5,009 patient cases using a structured protocol: medical residents across 18 specialties authored realistic but entirely fictitious patient reports following predefined clinical scenarios and templates, with epidemiological guidance from French National Health Data System.
Result: Produced PARHAF corpus with general-purpose component approximating real-world hospitalization distributions and four specialized subsets for oncology, infectious diseases, and diagnostic coding use cases, released under CC-BY open license.
Conclusion: PARHAF provides a valuable privacy-preserving resource for French clinical NLP development and establishes a replicable methodology for creating shareable synthetic clinical corpora in other languages and health systems.
Abstract: The development of clinical natural language processing (NLP) systems is severely hampered by the sensitive nature of medical records, which restricts data sharing under stringent privacy regulations, particularly in France and the broader European Union. To address this gap, we introduce PARHAF, a large open-source corpus of clinical documents in French. PARHAF comprises expert-authored clinical reports describing realistic yet entirely fictitious patient cases, making it anonymous and freely shareable by design. The corpus was developed using a structured protocol that combined clinician expertise with epidemiological guidance from the French National Health Data System (SNDS), ensuring broad clinical coverage. A total of 104 medical residents across 18 specialties authored and peer-reviewed the reports following predefined clinical scenarios and document templates. The corpus contains 7394 clinical reports covering 5009 patient cases across a wide range of medical and surgical specialties. It includes a general-purpose component designed to approximate real-world hospitalization distributions, and four specialized subsets that support information-extraction use cases in oncology, infectious diseases, and diagnostic coding. Documents are released under a CC-BY open license, with a portion temporarily embargoed to enable future benchmarking under controlled conditions. PARHAF provides a valuable resource for training and evaluating French clinical language models in a fully privacy-preserving setting, and establishes a replicable methodology for building shareable synthetic clinical corpora in other languages and health systems.
[27] Evaluating Large Language Models on Historical Health Crisis Knowledge in Resource-Limited Settings: A Hybrid Multi-Metric Study
Mohammed Rakibul Hasan
Main category: cs.CL
TL;DR: Evaluation of LLMs (GPT-4, Gemini Pro, Llama 3, Mistral-7B) on health crisis queries in Bangladesh context shows mixed reliability for low-resource settings.
Details
Motivation: While LLMs show potential for health information delivery, their reliability in low-resource contexts remains uncertain, particularly for health crisis information in developing countries like Bangladesh.
Method: Constructed question-answer dataset from authoritative sources on COVID-19, dengue, Nipah virus, and Chikungunya. Evaluated four LLMs using semantic similarity, expert-model cross-evaluation, and Natural Language Inference (NLI).
Result: Findings reveal both strengths and limitations of LLMs in representing epidemiological history and health crisis knowledge, with mixed performance across different evaluation metrics.
Conclusion: LLMs show promise but also risks for informing health policy in resource-constrained environments, highlighting need for careful implementation and validation.
Abstract: Large Language Models (LLMs) offer significant potential for delivering health information. However, their reliability in low-resource contexts remains uncertain. This study evaluates GPT-4, Gemini Pro, Llama 3, and Mistral-7B on health crisis-related enquiries concerning COVID-19, dengue, the Nipah virus, and Chikungunya in the low-resource context of Bangladesh. We constructed a question–answer dataset from authoritative sources and assessed model outputs through semantic similarity, expert-model cross-evaluation, and Natural Language Inference (NLI). Findings highlight both the strengths and limitations of LLMs in representing epidemiological history and health crisis knowledge, underscoring their promise and risks for informing policy in resource-constrained environments.
[28] Permutation-Consensus Listwise Judging for Robust Factuality Evaluation
Tianyi Huang, Nathan Huang, Justin Tang, Wenqian Chen, Elsa Fan
Main category: cs.CL
TL;DR: PCFJudge addresses LLM instability in factuality evaluation by aggregating judgments across multiple candidate orderings to reduce order sensitivity.
Details
Motivation: LLMs used as judges exhibit instability where their decisions change under presentation choices that should be irrelevant, specifically candidate-order sensitivity in listwise factuality evaluation where multiple answers may appear similar but differ in hallucination risk.
Method: PCFJudge is an inference-time method that reruns the same factuality-first listwise prompt over multiple orderings of the same candidate set and aggregates the resulting scores, ranks, and uncertainty signals into a single consensus decision.
Result: On RewardBench 2 Factuality, PCFJudge improves over direct judging by up to 7 absolute points, with development ablations showing dominant gains from permutation consensus rather than heavier arbitration layers.
Conclusion: A meaningful share of factuality-judging error arises from order instability, and averaging over this nuisance variation is a simple and effective way to make LLM evaluation more reliable.
Abstract: Large language models (LLMs) are now widely used as judges, yet their decisions can change under presentation choices that should be irrelevant. We study one such source of instability: candidate-order sensitivity in listwise factuality evaluation, where several answers can look similarly polished while differing sharply in hallucination risk. We introduce PCFJudge, an inference-time method that reruns the same factuality-first listwise prompt over multiple orderings of the same candidate set and aggregates the resulting scores, ranks, and uncertainty signals into a single consensus decision. On RewardBench 2 Factuality, PCFJudge improves over direct judging by up to 7 absolute points. Development ablations show that the dominant gain comes from permutation consensus itself rather than from heavier arbitration layers. These results suggest that a meaningful share of factuality-judging error arises from order instability, and that averaging over this nuisance variation is a simple and effective way to make LLM evaluation more reliable.
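The consensus step can be sketched as score averaging over candidate orderings (a simplification: the paper also aggregates ranks and uncertainty signals; the judge here is a toy with an artificial first-position bias):

```python
from itertools import permutations

def permutation_consensus(candidates, judge, n_orders=None):
    """Rerun the same listwise judge over multiple orderings of the
    candidate set and average each candidate's score across orderings.
    Minimal sketch of the consensus idea, not the authors' code."""
    orders = list(permutations(range(len(candidates))))
    if n_orders is not None:
        orders = orders[:n_orders]
    totals = [0.0] * len(candidates)
    for order in orders:
        scores = judge([candidates[i] for i in order])  # one score per slot
        for slot, idx in enumerate(order):
            totals[idx] += scores[slot]
    return [t / len(orders) for t in totals]

# toy order-biased judge: inflates whichever answer is shown first
def biased_judge(shown):
    base = {"a": 0.6, "b": 0.5, "c": 0.4}
    return [base[c] + (0.3 if slot == 0 else 0.0)
            for slot, c in enumerate(shown)]

consensus = permutation_consensus(["a", "b", "c"], biased_judge)
# each candidate is first in the same number of orderings, so the
# position bonus averages out and the true ordering a > b > c survives
print(max(range(3), key=lambda i: consensus[i]))  # 0
```

Averaging over a symmetric set of orderings turns position bias into a constant offset shared by all candidates, which is exactly why order instability stops affecting the final ranking.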
[29] JUBAKU: An Adversarial Benchmark for Exposing Culturally Grounded Stereotypes in Japanese LLMs
Taihei Shiotani, Masahiro Kaneko, Ayana Niwa, Yuki Maruyama, Daisuke Oba, Masanari Ohi, Naoaki Okazaki
Main category: cs.CL
TL;DR: JUBAKU is a Japanese culture-specific bias benchmark using adversarial dialogue scenarios to evaluate social biases in Japanese LLMs, revealing significant biases that English-translated benchmarks miss.
Details
Motivation: Existing bias evaluations for non-English LLMs rely on translated English benchmarks that fail to capture local cultural norms and stereotypes, such as Japan-specific hierarchical relationships, regional dialects, and traditional gender roles.
Method: Created JUBAKU benchmark with adversarial construction across ten cultural categories, featuring dialogue scenarios hand-crafted by native Japanese annotators specifically designed to trigger latent social biases in Japanese LLMs.
Result: All nine Japanese LLMs performed poorly on JUBAKU with average accuracy of 23% (range 13-33%), well below the 50% random baseline, despite higher accuracy on translated English benchmarks. Human annotators achieved 91% accuracy confirming benchmark reliability.
Conclusion: Culture-specific bias benchmarks like JUBAKU are essential for properly evaluating social biases in LLMs, as translated English benchmarks fail to capture local cultural nuances and stereotypes.
Abstract: Social biases reflected in language are inherently shaped by cultural norms, which vary significantly across regions and lead to diverse manifestations of stereotypes. Existing evaluations of social bias in large language models (LLMs) for non-English contexts, however, often rely on translations of English benchmarks. Such benchmarks fail to reflect local cultural norms, including those found in Japanese. For instance, Western benchmarks may overlook Japan-specific stereotypes related to hierarchical relationships, regional dialects, or traditional gender roles. To address this limitation, we introduce Japanese cUlture adversarial BiAs benchmarK Under handcrafted creation (JUBAKU), a benchmark tailored to Japanese cultural contexts. JUBAKU uses adversarial construction to expose latent biases across ten distinct cultural categories. Unlike existing benchmarks, JUBAKU features dialogue scenarios hand-crafted by native Japanese annotators, specifically designed to trigger and reveal latent social biases in Japanese LLMs. We evaluated nine Japanese LLMs on JUBAKU and three others adapted from English benchmarks. All models clearly exhibited biases on JUBAKU, performing below the random baseline of 50% with an average accuracy of 23% (ranging from 13% to 33%), despite higher accuracy on the other benchmarks. Human annotators achieved 91% accuracy in identifying unbiased responses, confirming JUBAKU’s reliability and its adversarial nature to LLMs.
[30] A Modular LLM Framework for Explainable Price Outlier Detection
Shadi Sartipi, John Wu, Sina Ghotbi, Nikhita Vedula, Shervin Malmasi
Main category: cs.CL
TL;DR: An agentic LLM framework for detecting product price outliers through semantic reasoning about product attributes and comparisons with similar products.
Details
Motivation: Traditional price outlier detection methods use simple thresholds but ignore semantic relationships between product attributes, which can lead to poor detection of erroneous or unexpectedly high prices that harm competitiveness, revenue, and consumer trust.
Method: Three-stage agentic LLM framework: (1) relevance classification selects price-relevant similar products using descriptions/attributes, (2) relative utility assessment compares target vs. similar products along price-influencing dimensions (brand, size, features), (3) reasoning-based decision aggregates justifications into explainable outlier judgments.
Result: Achieves over 75% agreement with human auditors, outperforms zero-shot and retrieval-based LLM techniques. Ablation studies show sensitivity to key hyperparameters and flexibility for different accuracy requirements and auditor agreements.
Conclusion: The proposed agentic LLM framework effectively detects price outliers through semantic reasoning, providing explainable judgments that outperform traditional methods and offer flexibility for various application scenarios.
Abstract: Detecting product price outliers is important for retail and e-commerce stores as erroneous or unexpectedly high prices adversely affect competitiveness, revenue, and consumer trust. Classical techniques offer simple thresholds while ignoring the rich semantic relationships among product attributes. We propose an agentic Large Language Model (LLM) framework that treats outlier price flagging as a reasoning task grounded in related product detection and comparison. The system processes the prices of target products in three stages: (i) relevance classification selects price-relevant similar products using product descriptions and attributes; (ii) relative utility assessment evaluates the target product against each similar product along price-influencing dimensions (e.g., brand, size, features); (iii) reasoning-based decision aggregates these justifications into an explainable price outlier judgment. The framework attains over 75% agreement with human auditors on a test dataset, and outperforms zero-shot and retrieval-based LLM techniques. Ablation studies show the sensitivity of the method to key hyper-parameters and attest to its flexibility across cases with different accuracy requirements and levels of auditor agreement.
[31] Hear Both Sides: Efficient Multi-Agent Debate via Diversity-Aware Message Retention
Manh Nguyen, Anh Nguyen, Dung Nguyen, Svetha Venkatesh, Hung Le
Main category: cs.CL
TL;DR: DAR is a lightweight multi-agent debate framework that selectively broadcasts only the most diverse agent responses to reduce noise and improve reasoning quality.
Details
Motivation: Current multi-agent debate frameworks broadcast all agent messages at every round, introducing noise and redundancy that degrades debate quality and wastes computational resources. Uncertainty-based filtering approaches are unreliable due to miscalibrated confidence scores.
Method: Proposes Diversity-Aware Retention (DAR), which at each debate round selects the subset of agent responses that maximally disagree with each other and with the majority vote before broadcasting. Uses explicit index-based retention to preserve original messages without modification.
Result: Experiments on diverse reasoning and question answering benchmarks show that selective message propagation consistently improves debate performance, particularly as the number of agents scales where noise accumulation is most severe.
Conclusion: DAR demonstrates that what agents hear is as important as what agents say in multi-agent reasoning systems, and selective retention of diverse perspectives improves debate outcomes.
Abstract: Multi-Agent Debate has emerged as a promising framework for improving the reasoning quality of large language models through iterative inter-agent communication. However, broadcasting all agent messages at every round introduces noise and redundancy that can degrade debate quality and waste computational resources. Current approaches rely on uncertainty estimation to filter low-confidence responses before broadcasting, but this approach is unreliable due to miscalibrated confidence scores and sensitivity to threshold selection. To address this, we propose Diversity-Aware Retention (DAR), a lightweight debate framework that, at each debate round, selects the subset of agent responses that maximally disagree with each other and with the majority vote before broadcasting. Through an explicit index-based retention mechanism, DAR preserves the original messages without modification, ensuring that retained disagreements remain authentic. Experiments on diverse reasoning and question answering benchmarks demonstrate that our selective message propagation consistently improves debate performance, particularly as the number of agents scales, where noise accumulation is most severe. Our results highlight that what agents hear is as important as what agents say in multi-agent reasoning systems.
[32] Agentic Automation of BT-RADS Scoring: End-to-End Multi-Agent System for Standardized Brain Tumor Follow-up Assessment
Mohamed Sobhi Jabal, Jikai Zhang, Dominic LaBella, Jessica L. Houk, Dylan Zhang, Jeffrey D. Rudie, Kirti Magudia, Maciej A. Mazurowski, Evan Calabrese
Main category: cs.CL
TL;DR: Multi-agent LLM system with CNN segmentation automates BT-RADS classification for glioma MRI, outperforming initial clinical assessments with 76% accuracy.
Details
Motivation: BT-RADS standardizes glioma MRI response assessment but requires complex integration of imaging trends, medication effects, and radiation timing, making manual application challenging and potentially inconsistent.
Method: End-to-end system combining multi-agent LLM with CNN-based tumor segmentation. Extractor agent identifies clinical variables from unstructured notes, scorer agent applies BT-RADS decision logic integrating extracted variables with volumetric measurements. Evaluated on 509 post-treatment glioma MRI exams.
Result: System achieved 76.0% accuracy vs 57.5% for initial clinical assessments (+18.5 percentage points; P<.001). High sensitivity for context-dependent categories (BT-1b 100%, BT-1a 92.7%, BT-3a 87.5%), moderate sensitivity for threshold-dependent categories. High positive predictive value for BT-4 detection (92.9%).
Conclusion: Multi-agent LLM system achieved higher BT-RADS classification agreement with expert reference standard than initial clinical scoring, with high accuracy for context-dependent scores and high positive predictive value for BT-4 detection.
Abstract: The Brain Tumor Reporting and Data System (BT-RADS) standardizes post-treatment MRI response assessment in patients with diffuse gliomas but requires complex integration of imaging trends, medication effects, and radiation timing. This study evaluates an end-to-end multi-agent large language model (LLM) and convolutional neural network (CNN) system for automated BT-RADS classification. A multi-agent LLM system combined with automated CNN-based tumor segmentation was retrospectively evaluated on 509 consecutive post-treatment glioma MRI examinations from a single high-volume center. An extractor agent identified clinical variables (steroid status, bevacizumab status, radiation date) from unstructured clinical notes, while a scorer agent applied BT-RADS decision logic integrating extracted variables with volumetric measurements. Expert reference standard classifications were established by an independent board-certified neuroradiologist. Of 509 examinations, 492 met inclusion criteria. The system achieved 374/492 (76.0%; 95% CI, 72.1%-79.6%) accuracy versus 283/492 (57.5%; 95% CI, 53.1%-61.8%) for initial clinical assessments (+18.5 percentage points; P<.001). Context-dependent categories showed high sensitivity (BT-1b 100%, BT-1a 92.7%, BT-3a 87.5%), while threshold-dependent categories showed moderate sensitivity (BT-3c 74.8%, BT-2 69.2%, BT-4 69.3%, BT-3b 57.1%). For BT-4, positive predictive value was 92.9%. The multi-agent LLM system achieved higher BT-RADS classification agreement with expert reference standard compared to initial clinical scoring, with high accuracy for context-dependent scores and high positive predictive value for BT-4 detection.
[33] Weber’s Law in Transformer Magnitude Representations: Efficient Coding, Representational Geometry, and Psychophysical Laws in Language Models
Jon-Paul Cacioli
Main category: cs.CL
TL;DR: Transformer language models develop log-compressive magnitude representations similar to human Weber-Fechner law, but this geometry doesn’t guarantee behavioral competence in magnitude discrimination tasks.
Details
Motivation: To resolve conflicting findings about how transformer language models represent magnitude - whether they use logarithmic spacing, linear encoding, or per-digit circular representations - by applying psychophysical methods to understand magnitude representation in LLMs.
Method: Used four converging psychophysical paradigms: representational similarity analysis, behavioral discrimination tests, precision gradients, and causal interventions across three magnitude domains (numerical, temporal, spatial) in three 7-9B instruction-tuned models (Llama, Mistral, Qwen) spanning different architecture families.
Result: 1) Representational geometry is consistently log-compressive across all models (RSA correlations 0.68-0.96 with Weber-law matrix). 2) Geometry dissociates from behavior - one model shows human-range Weber fraction while others don’t, and all perform poorly on temporal/spatial discrimination despite having logarithmic geometry. 3) Causal intervention shows early layers (4.1x specificity) are functionally important for magnitude processing, while later layers with strongest geometry are not causally engaged (1.2x).
Conclusion: Training data statistics alone produce log-compressive magnitude geometry in LLMs (similar to human Weber-Fechner law), but this geometry doesn’t guarantee behavioral competence in magnitude tasks, revealing a dissociation between representation and function.
Abstract: How do transformer language models represent magnitude? Recent work disagrees: some find logarithmic spacing, others linear encoding, others per-digit circular representations. We apply the formal tools of psychophysics to resolve this. Using four converging paradigms (representational similarity analysis, behavioural discrimination, precision gradients, causal intervention) across three magnitude domains in three 7-9B instruction-tuned models spanning three architecture families (Llama, Mistral, Qwen), we report three findings. First, representational geometry is consistently log-compressive: RSA correlations with a Weber-law dissimilarity matrix ranged from .68 to .96 across all 96 model-domain-layer cells, with linear geometry never preferred. Second, this geometry is dissociated from behaviour: one model produces a human-range Weber fraction (WF = 0.20) while the others do not, and all models perform at chance on temporal and spatial discrimination despite possessing logarithmic geometry. Third, causal intervention reveals a layer dissociation: early layers are functionally implicated in magnitude processing (4.1x specificity) while later layers where geometry is strongest are not causally engaged (1.2x). Corpus analysis confirms the efficient coding precondition (alpha = 0.77). These results suggest that training data statistics alone are sufficient to produce log-compressive magnitude geometry, but geometry alone does not guarantee behavioural competence.
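The RSA comparison at the heart of this paper can be sketched in a few lines: a Weber-law model RDM predicts that the dissimilarity between two magnitudes scales with the difference of their logarithms, and the RSA score is the rank correlation between that model matrix and the representational distances. A minimal sketch on toy data (not the paper's stimuli, models, or exact procedure):

```python
import numpy as np

def weber_dissimilarity(magnitudes):
    """Model RDM under the Weber-Fechner law: dissimilarity between
    two magnitudes scales with the difference of their logarithms."""
    logs = np.log(np.asarray(magnitudes, dtype=float))
    return np.abs(logs[:, None] - logs[None, :])

def linear_dissimilarity(magnitudes):
    """Competing model RDM with plain linear spacing."""
    m = np.asarray(magnitudes, dtype=float)
    return np.abs(m[:, None] - m[None, :])

def spearman(x, y):
    """Spearman rho via Pearson correlation of ranks (no tie
    averaging; adequate for this sketch)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    return float(np.corrcoef(rx, ry)[0, 1])

def rsa_correlation(model_rdm, observed_rdm):
    """RSA score: rank correlation over the RDMs' upper triangles."""
    iu = np.triu_indices_from(model_rdm, k=1)
    return spearman(model_rdm[iu], observed_rdm[iu])

# Toy "hidden states": a scalar code for log-magnitude plus tiny
# noise, standing in for a layer's representations of the numbers 1..20.
rng = np.random.default_rng(0)
mags = np.arange(1, 21)
reps = 8.0 * np.log(mags) + 0.001 * rng.standard_normal(20)
observed_rdm = np.abs(reps[:, None] - reps[None, :])

print(rsa_correlation(weber_dissimilarity(mags), observed_rdm))  # close to 1
print(rsa_correlation(linear_dissimilarity(mags), observed_rdm))
```

On such log-coded toy representations the Weber-law RDM wins the model comparison, which is the shape of the paper's first finding.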
[34] PAVE: Premise-Aware Validation and Editing for Retrieval-Augmented LLMs
Tianyi Huang, Caden Yang, Emily Yin, Eric Wang, Michael Zhang
Main category: cs.CL
TL;DR: PAVE adds an inference-time validation layer to retrieval-augmented language models that extracts atomic facts from retrieved context, drafts answers, scores support from premises, and revises low-support outputs before finalization.
Details
Motivation: Retrieval-augmented language models often commit to answers without properly checking if retrieved evidence actually supports their conclusions, leading to potential inconsistencies in evidence-grounded question answering.
Method: PAVE decomposes retrieved context into question-conditioned atomic facts, drafts an answer, scores how well the draft is supported by extracted premises, and revises low-support outputs before finalization, creating an auditable trace of reasoning.
Result: PAVE outperforms simpler post-retrieval baselines in evidence-grounded QA settings, with gains up to 32.7 accuracy points on a span-grounded benchmark, showing improved evidence-grounded consistency.
Conclusion: Explicit premise extraction plus support-gated revision can strengthen evidence-grounded consistency in retrieval-augmented LLM systems, providing auditable reasoning traces.
Abstract: Retrieval-augmented language models can retrieve relevant evidence yet still commit to answers before explicitly checking whether the retrieved context supports the conclusion. We present PAVE (Premise-Grounded Answer Validation and Editing), an inference-time validation layer for evidence-grounded question answering. PAVE decomposes retrieved context into question-conditioned atomic facts, drafts an answer, scores how well that draft is supported by the extracted premises, and revises low-support outputs before finalization. The resulting trace makes answer commitment auditable at the level of explicit premises, support scores, and revision decisions. In controlled ablations with a fixed retriever and backbone, PAVE outperforms simpler post-retrieval baselines in two evidence-grounded QA settings, with the largest gain reaching 32.7 accuracy points on a span-grounded benchmark. We view these findings as proof-of-concept evidence that explicit premise extraction plus support-gated revision can strengthen evidence-grounded consistency in retrieval-augmented LLM systems.
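The validate-and-edit loop described in the abstract might be skeletonized as follows; `extract_premises`, `draft`, `score_support`, and `revise` are hypothetical stand-ins for the underlying LLM calls, not PAVE's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class Trace:
    """Auditable record of one pass: premises, scores, revisions."""
    premises: list = field(default_factory=list)
    support_scores: list = field(default_factory=list)
    revisions: list = field(default_factory=list)

def pave_answer(question, context, llm, threshold=0.7, max_rounds=2):
    """Premise-aware validate-and-edit loop (sketch). `llm` is any
    object exposing extract_premises / draft / score_support / revise
    methods that wrap model calls; all names are illustrative."""
    trace = Trace()
    trace.premises = llm.extract_premises(question, context)
    answer = llm.draft(question, trace.premises)
    for _ in range(max_rounds):
        score = llm.score_support(answer, trace.premises)
        trace.support_scores.append(score)
        if score >= threshold:  # draft is well supported: commit
            break
        answer = llm.revise(answer, trace.premises)  # low support: edit
        trace.revisions.append(answer)
    return answer, trace
```

The returned `Trace` is what makes the commitment auditable: every premise, support score, and revision decision is recorded alongside the final answer.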
[35] Can I guess where you are from? Modeling dialectal morphosyntactic similarities in Brazilian Portuguese
Manoel Siqueira, Raquel Freitag
Main category: cs.CL
TL;DR: Study examines morphosyntactic covariation in Brazilian Portuguese to infer dialectal origin from linguistic variables using correlation and clustering methods.
Details
Motivation: To assess whether dialectal origin can be inferred from the combined behavior of linguistic variables in Brazilian Portuguese, bridging sociolinguistics and computational approaches.
Method: Focuses on four grammatical phenomena related to pronouns, and applies correlation and clustering methods to model covariation and dialectal distribution.
Result: Correlation captures only limited pairwise associations, while clustering reveals speaker groupings reflecting regional dialectal patterns.
Conclusion: Despite methodological constraints from sample size differences between fields, interdisciplinary research is important for developing fair, inclusive language technologies that respect dialectal diversity.
Abstract: This paper investigates morphosyntactic covariation in Brazilian Portuguese (BP) to assess whether dialectal origin can be inferred from the combined behavior of linguistic variables. Focusing on four grammatical phenomena related to pronouns, correlation and clustering methods are applied to model covariation and dialectal distribution. The results indicate that correlation captures only limited pairwise associations, whereas clustering reveals speaker groupings that reflect regional dialectal patterns. Despite the methodological constraints imposed by differences in sample size requirements between sociolinguistics and computational approaches, the study highlights the importance of interdisciplinary research. The benefits of developing fair and inclusive language technologies that respect dialectal diversity outweigh the challenges of integrating these fields.
[36] Reasoning Topology Matters: Network-of-Thought for Complex Reasoning Tasks
Fan Huang
Main category: cs.CL
TL;DR: Network-of-Thought (NoT) proposes modeling LLM reasoning as directed graphs with typed nodes/edges instead of linear chains or trees, showing advantages for complex multi-hop reasoning tasks.
Details
Motivation: Current prompting paradigms like Chain-of-Thought (linear) and Tree-of-Thought (branching) are limited for complex reasoning that requires merging intermediate results, revisiting hypotheses, and integrating evidence from multiple sources.Method: NoT models reasoning as a directed graph with typed nodes and edges, guided by a heuristic-based controller policy. Evaluated across four benchmarks (GSM8K, Game of 24, HotpotQA, ProofWriter) and three models (GPT-4o-mini, Llama-3.3-70B-Instruct, Qwen2.5-72B-Instruct).
Result: NoT surpasses ToT on multi-hop reasoning (91.0% vs. 88.0% on HotpotQA with LLM-as-Judge). With 72B open-source models, NoT achieves highest accuracy on GSM8K (91.5%), and Qwen2.5-72B achieves best multi-hop QA overall (91.7% on HotpotQA). Self-generated controller heuristics outperform fixed/random strategies on logical reasoning.
Conclusion: Network topology reasoning outperforms chain/tree structures for complex multi-hop tasks, LLM-generated heuristics can effectively guide graph-based reasoning, and evaluation methodology significantly impacts method rankings (string-match underestimates open-ended QA performance).
Abstract: Existing prompting paradigms structure LLM reasoning in limited topologies: Chain-of-Thought (CoT) produces linear traces, while Tree-of-Thought (ToT) performs branching search. Yet complex reasoning often requires merging intermediate results, revisiting hypotheses, and integrating evidence from multiple sources. We propose Network-of-Thought (NoT), a framework that models reasoning as a directed graph with typed nodes and edges, guided by a heuristic-based controller policy. Across four benchmarks (GSM8K, Game of 24, HotpotQA, ProofWriter) and three models (GPT-4o-mini, Llama-3.3-70B-Instruct, Qwen2.5-72B-Instruct), we investigate when network topology outperforms chain or tree structures, whether LLM-generated heuristics can guide graph-based reasoning search, and the computation-accuracy tradeoff across topologies, evaluating each method on accuracy, topology simplicity, and token efficiency. Our results show that CoT remains effective for sequential tasks with GPT-4o-mini (89.5% on GSM8K), while NoT surpasses ToT on multi-hop reasoning (91.0% vs. 88.0% on HotpotQA with LLM-as-Judge). With 72B open-source models, NoT achieves the highest accuracy on GSM8K (91.5%), and Qwen2.5-72B achieves the best multi-hop QA result overall (91.7% on HotpotQA). Self-generated controller heuristics outperform fixed and random strategies on logical reasoning, with uncertainty-only weighting achieving 57.0% on ProofWriter. We also find that evaluation methodology significantly impacts method rankings: string-match underestimates all methods on open-ended QA, with the largest gap for NoT, a pattern consistent across all three models (14–18 percentage point gap on HotpotQA).
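The structural difference NoT claims over chains and trees is that a node may have several parents, so branches can merge. That can be made concrete with a small typed-graph sketch; the node kinds and relation labels here are illustrative, not the paper's vocabulary:

```python
from dataclasses import dataclass, field

@dataclass
class ThoughtNode:
    """One reasoning step; `kind` might be 'hypothesis', 'evidence',
    'merge', or 'conclusion' (type vocabulary is illustrative)."""
    node_id: int
    kind: str
    text: str

@dataclass
class ThoughtGraph:
    """Directed reasoning graph with typed edges. Unlike a chain (one
    parent per node) or a tree (no merging), a node here can combine
    several incoming branches."""
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)  # (src, dst, relation)

    def add(self, kind, text):
        nid = len(self.nodes)
        self.nodes[nid] = ThoughtNode(nid, kind, text)
        return nid

    def link(self, src, dst, relation="supports"):
        self.edges.append((src, dst, relation))

    def parents(self, nid):
        return [s for s, d, _ in self.edges if d == nid]

# Two evidence branches merge into one node -- impossible in a tree.
g = ThoughtGraph()
h = g.add("hypothesis", "X wrote both books")
e1 = g.add("evidence", "passage A: X wrote book 1")
e2 = g.add("evidence", "passage B: X wrote book 2")
m = g.add("merge", "combine passages A and B")
g.link(e1, m)
g.link(e2, m)
g.link(m, h, relation="supports")
print(g.parents(m))  # [1, 2]
```

In the paper's framing, the controller policy decides which node to expand or merge next; here the graph structure alone shows why multi-hop evidence integration favors a network over a tree.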
[37] MzansiText and MzansiLM: An Open Corpus and Decoder-Only Language Model for South African Languages
Anri Lombard, Simbarashe Mawere, Temi Aina, Ethan Wolff, Sbonelo Gumede, Elan Novick, Francois Meyer, Jan Buys
Main category: cs.CL
TL;DR: MzansiLM: A 125M-parameter decoder-only language model trained from scratch on South African languages, with evaluation showing strong performance on supervised NLU/NLG tasks but limited few-shot reasoning capabilities.
Details
Motivation: To address the lack of publicly available decoder-only models for South Africa's 11 official languages (9 of which are low-resource), and to understand how instruction finetuning generalizes at small scale for low-resource languages.
Method: Created MzansiText (curated multilingual pretraining corpus with reproducible filtering pipeline) and trained MzansiLM (125M-parameter decoder-only model). Evaluated using three adaptation regimes: monolingual task-specific finetuning, multilingual task-specific finetuning, and general multi-task instruction finetuning.
Result: Monolingual finetuning achieved strong performance on data-to-text generation (20.65 BLEU on isiXhosa). Multilingual finetuning benefited closely related languages on topic classification (78.5% macro-F1 on isiXhosa news classification). However, few-shot reasoning remained challenging even for larger models.
Conclusion: Decoder-only models can effectively adapt to supervised NLU/NLG tasks for low-resource languages at small scale, but few-shot reasoning requires larger models. The released resources provide reproducible baselines and adaptation guidance for South African languages.
Abstract: Decoder-only language models can be adapted to diverse tasks through instruction finetuning, but the extent to which this generalizes at small scale for low-resource languages remains unclear. We focus on the languages of South Africa, where we are not aware of a publicly available decoder-only model that explicitly targets all eleven official written languages, nine of which are low-resource. We introduce MzansiText, a curated multilingual pretraining corpus with a reproducible filtering pipeline, and MzansiLM, a 125M-parameter language model trained from scratch. We evaluate MzansiLM on natural language understanding and generation using three adaptation regimes: monolingual task-specific finetuning, multilingual task-specific finetuning, and general multi-task instruction finetuning. Monolingual task-specific finetuning achieves strong performance on data-to-text generation, reaching 20.65 BLEU on isiXhosa and competing with encoder-decoder baselines over ten times larger. Multilingual task-specific finetuning benefits closely related languages on topic classification, achieving 78.5% macro-F1 on isiXhosa news classification. While MzansiLM adapts effectively to supervised NLU and NLG tasks, few-shot reasoning remains challenging at this model size, with performance near chance even for much larger decoder-only models. We release MzansiText and MzansiLM to provide a reproducible decoder-only baseline and clear guidance on adaptation strategies for South African languages at small scale.
[38] Code-MIE: A Code-style Model for Multimodal Information Extraction with Scene Graph and Entity Attribute Knowledge Enhancement
Jiang Liu, Ge Qiu, Hao Fei, Dongdong Xie, Jinbo Li, Fei Li, Chong Teng, Donghong Ji
Main category: cs.CL
TL;DR: Code-MIE: A framework that formalizes multimodal information extraction as code understanding and generation using Python-style templates, achieving state-of-the-art results on multiple datasets.
Details
Motivation: Existing multimodal information extraction methods use natural language templates that mismatch with structured IE tasks, and while some use code-style templates, they focus on text-only IE and require complex separate templates for each task.
Method: Proposes Code-MIE framework that: (1) extracts entity attributes from text to guide context understanding, (2) converts images to scene graphs and visual features, (3) uses Python function as input template with entity attributes, scene graphs, and text as parameters, and (4) outputs Python dictionaries containing extraction results.
Result: Achieves state-of-the-art performance: 61.03% and 60.49% on English/Chinese M³D datasets, and 76.04%, 88.07%, 73.94% on Twitter-15, Twitter-17, and MNRE datasets respectively.
Conclusion: Code-MIE effectively formalizes multimodal information extraction as code understanding and generation, outperforming existing methods by better aligning with structured IE tasks through unified code-style templates.
Abstract: With the rapid development of large language models (LLMs), more and more researchers have paid attention to information extraction based on LLMs. However, there is still room for improvement in existing methods. First, existing multimodal information extraction (MIE) methods usually employ natural language templates as the input and output of LLMs, which mismatch the characteristics of IE tasks, whose outputs are mostly structured information such as entities and relations. Second, although a few methods have adopted structured and more IE-friendly code-style templates, they explored these only on text-only IE rather than multimodal IE. Moreover, their methods are complex in design, requiring a separate template for each task. In this paper, we propose a Code-style Multimodal Information Extraction framework (Code-MIE) which formalizes MIE as unified code understanding and generation. Code-MIE has the following novel designs: (1) Entity attributes such as gender and affiliation are extracted from the text to guide the model in understanding the context and role of entities. (2) Images are converted into scene graphs and visual features to incorporate rich visual information into the model. (3) The input template is constructed as a Python function, where entity attributes, scene graphs, and raw text serve as the function parameters; correspondingly, the output template is formalized as Python dictionaries containing all extraction results such as entities and relations. To evaluate Code-MIE, we conducted extensive experiments on the M$^3$D, Twitter-15, Twitter-17, and MNRE datasets. The results show that our method achieves state-of-the-art performance compared to six competing baseline models, with 61.03% and 60.49% on the English and Chinese datasets of M$^3$D, and 76.04%, 88.07%, and 73.94% on the other three datasets.
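To make the code-style template idea concrete, here is an illustrative sketch (not the paper's exact template): the input is rendered as a Python function whose parameters carry the text, entity attributes, and scene graph, and the model is expected to complete a return dictionary.

```python
# Illustrative Code-MIE-style prompt construction. The function name,
# parameter names, and example data are all hypothetical.

def build_prompt(text, entity_attributes, scene_graph):
    """Render a code-style MIE input template as a prompt string."""
    return (
        "def extract_information(\n"
        f"    text={text!r},\n"
        f"    entity_attributes={entity_attributes!r},\n"
        f"    scene_graph={scene_graph!r},\n"
        "):\n"
        '    """Return entities and relations as a Python dict."""\n'
        "    return "
    )

prompt = build_prompt(
    text="Steve Jobs introduced the iPhone in San Francisco.",
    entity_attributes={"Steve Jobs": {"affiliation": "Apple"}},
    scene_graph=[("man", "holds", "phone")],
)

# A well-formed completion would be a structured dictionary such as:
expected_completion = {
    "entities": [("Steve Jobs", "PER"), ("iPhone", "PRODUCT"),
                 ("San Francisco", "LOC")],
    "relations": [("Steve Jobs", "introduced", "iPhone")],
}
print(prompt)
```

The appeal of this framing is that both sides of the task are native code objects: the output can be parsed with a Python literal parser instead of brittle free-text post-processing.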
[39] The Anatomy of an Edit: Mechanism-Guided Activation Steering for Knowledge Editing
Yuan Cao, Mingyang Wang, Hinrich Schütze
Main category: cs.CL
TL;DR: The paper introduces MEGA, a mechanism-guided activation steering method for knowledge editing in LLMs that uses post-edit attribution analysis to identify where edits take hold and performs targeted interventions without weight modification.
Details
Motivation: Large language models need knowledge editing to stay current, but it's unclear how edits are implemented internally. The authors want to understand the mechanistic changes that occur when knowledge editing succeeds versus fails, moving beyond pre-edit analysis to post-edit attribution.Method: The paper uses neuron-level knowledge attribution (NLKA) to contrast successful and failed edits, identifying consistent patterns in attention and FFN modules. Based on these findings, they propose MEGA (MEchanism-Guided Activation steering), which performs attention-residual interventions in attribution-aligned regions without modifying model weights.
Result: Across representative KE methods, they find consistent patterns: mid-to-late attention promotes new targets while attention and FFN modules cooperate to suppress original facts. MEGA achieves strong editing performance on CounterFact and Popular datasets for GPT2-XL and LLaMA2-7B across KE metrics.
Conclusion: Post-edit attribution can be elevated from analysis to engineering signal, pinpointing where and how edits take hold to enable reliable, architecture-agnostic knowledge editing through targeted activation steering without weight modification.
Abstract: Large language models (LLMs) are increasingly used as knowledge bases, but keeping them up to date requires targeted knowledge editing (KE). However, it remains unclear how edits are implemented inside the model once applied. In this work, we take a mechanistic view of KE using neuron-level knowledge attribution (NLKA). Unlike prior work that focuses on pre-edit causal tracing and localization, we use post-edit attribution – contrasting successful and failed edits – to isolate the computations that shift when an edit succeeds. Across representative KE methods, we find a consistent pattern: mid-to-late attention predominantly promotes the new target, while attention and FFN modules cooperate to suppress the original fact. Motivated by these findings, we propose MEGA, a MEchanism-Guided Activation steering method that performs attention-residual interventions in attribution-aligned regions without modifying model weights. On CounterFact and Popular, MEGA achieves strong editing performance across KE metrics on GPT2-XL and LLaMA2-7B. Overall, our results elevate post-edit attribution from analysis to engineering signal: by pinpointing where and how edits take hold, it powers MEGA to deliver reliable, architecture-agnostic knowledge edits.
[40] RLVR Training of LLMs Does Not Improve Thinking Ability for General QA: Evaluation Method and a Simple Solution
Kaiyuan Li, Jing-Cheng Pang, Yang Yu
Main category: cs.CL
TL;DR: RLVR improves LLM reasoning on verifiable tasks but doesn’t automatically transfer to general QA; START method trains thinking separately to avoid shortcuts and improves both thinking quality and final answers.
Details
Motivation: While RLVR enhances LLM reasoning on verifiable tasks, it's unclear if similar gains transfer to general question answering (GQA). The paper investigates whether RLVR automatically improves GQA performance and addresses potential shortcuts in GQA tasks.
Method: Proposes Cross-Generation evaluation to measure intermediate reasoning quality, then introduces Separated Thinking And Response Training (START) which first trains only the thinking process using rewards defined on final answers to avoid shortcuts.
Result: RLVR’s efficacy on GQA is much lower than on verifiable tasks; direct RL training on GQA is less effective than RLVR; START improves both thinking quality and final answers across multiple GQA benchmarks and RL algorithms.
Conclusion: Explicit training on GQA remains necessary beyond verifiable tasks; START effectively avoids shortcuts in GQA by separating thinking training from response generation, leading to improved reasoning and answer quality.
Abstract: Reinforcement learning from verifiable rewards (RLVR) stimulates the thinking processes of large language models (LLMs), substantially enhancing their reasoning abilities on verifiable tasks. It is often assumed that similar gains should transfer to general question answering (GQA), but this assumption has not been thoroughly validated. To assess whether RLVR automatically improves LLM performance on GQA, we propose a Cross-Generation evaluation framework that measures the quality of intermediate reasoning by feeding the generated thinking context into LLMs of varying capabilities. Our evaluation leads to a discouraging finding: the efficacy of the thinking process on GQA tasks is markedly lower than on verifiable tasks, suggesting that explicit training on GQA remains necessary in addition to training on verifiable tasks. We further observe that direct RL training on GQA is less effective than RLVR. Our hypothesis is that, whereas verifiable tasks demand robust logical chains to obtain high rewards, GQA tasks often admit shortcuts to high rewards without cultivating high-quality thinking. To avoid possible shortcuts, we introduce a simple method, Separated Thinking And Response Training (START), which first trains only the thinking process, using rewards defined on the final answer. We show that START improves both the quality of thinking and the final answer across several GQA benchmarks and RL algorithms.
[41] BenchBench: Benchmarking Automated Benchmark Generation
Yandan Zheng, Haoran Luo, Zhenghong Lin, Wenjin Liu, Luu Anh Tuan
Main category: cs.CL
TL;DR: BenchBench: A framework for evaluating LLMs’ ability to design benchmarks, not just answer them, using a three-stage pipeline to generate and validate test items across multiple domains.
Details
Motivation: Static benchmarks for LLMs saturate quickly, are vulnerable to contamination, and are costly to refresh. Current evaluation focuses on how well models answer benchmarks rather than how well they can design them, which is an important but overlooked capability.
Method: Three-stage pipeline: (1) extract structured domain cards from seed benchmarks, (2) prompt multiple designer LLMs to generate quota-controlled test suites, (3) validate items using multi-model answerer panels with exact/numeric/symbolic verifiers or rubric-guided judging, producing designer-answerer matrices with quality flags.
Result: Generated 16.7K items across 9 variants (computer science, mathematics, medicine, theory-of-mind reasoning, including multilingual and multimodal settings), retained ~15K core items after filtering, produced ~152K graded model-item responses. Found benchmark-design ability only moderately correlated with answer-time performance (Spearman rho = 0.37), and invalidity negatively associated with discrimination (Pearson r = -0.62).
Conclusion: BenchBench enables scalable evaluation of LLMs’ benchmark-design capabilities, revealing that design ability is distinct from answer-time performance. The framework supports audits of format/modality/language fidelity and suite-dependent interactions, providing a new dimension for assessing LLM capabilities.
Abstract: Benchmarks are the de facto standard for tracking progress in large language models (LLMs), yet static test sets can rapidly saturate, become vulnerable to contamination, and are costly to refresh. Scalable evaluation of open-ended items often relies on LLM judges, introducing additional sources of bias and prompt sensitivity. We argue that evaluation must extend beyond how well models answer benchmarks to how well models design them. We introduce BenchBench, a three-stage pipeline and dataset for benchmarking automated benchmark generation: (i) extract structured domain cards from seed benchmarks, (ii) prompt multiple designer LLMs to generate quota-controlled suites, and (iii) validate items with a multi-model answerer panel using exact/numeric/symbolic verifiers when possible and rubric-guided judging otherwise, yielding designer–answerer matrices with item-level quality flags and psychometric diagnostics. Across nine variants spanning computer science, mathematics, medicine, and theory-of-mind reasoning (including multilingual and multimodal settings), we generate 16.7K items, retain ~15K core items post-filtering, and produce ~152K graded model–item responses. BenchBench shows that benchmark-design ability is only moderately correlated with answer-time strength (Spearman rho = 0.37), invalidity is negatively associated with discrimination (Pearson r = -0.62), and the resulting designer–answerer matrices enable scalable audits of format/modality/language fidelity and suite-dependent self/family interactions. The project is available at: https://github.com/koanatakiyo/BenchBench.
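Stage (iii)'s exact and numeric verifiers are the kind of check that can be sketched directly; the symbolic verifier and rubric-guided judging are omitted, and the function below is illustrative rather than BenchBench's implementation:

```python
def verify(answer, reference, kind="exact", tol=1e-6):
    """Item-level validation sketch: normalized exact match for
    closed-form answers, numeric match with tolerance for quantities.
    (Symbolic checking and rubric-guided judging are not shown.)"""
    if kind == "exact":
        return answer.strip().lower() == reference.strip().lower()
    if kind == "numeric":
        try:
            return abs(float(answer) - float(reference)) <= tol
        except ValueError:
            return False  # non-numeric answer cannot match numerically
    raise ValueError(f"unknown verifier kind: {kind}")

print(verify("Paris ", "paris"))                                # True
print(verify("3.14159", "3.141592", kind="numeric", tol=1e-3))  # True
print(verify("pi", "3.14", kind="numeric"))                     # False
```

Running every generated item through such verifiers across an answerer panel is what yields the designer-answerer matrices and item-level quality flags the abstract describes.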
[42] HiCI: Hierarchical Construction-Integration for Long-Context Attention
Xiangyu Zeng, Qi Xu, Yunke Wang, Chang Xu
Main category: cs.CL
TL;DR: HiCI introduces a hierarchical attention module for long-context language modeling that constructs segment-level representations and integrates them into global context, enabling efficient extension of context length with minimal parameter overhead.
Details
Motivation: Current long-context language modeling focuses primarily on scalability challenges of token-level attention, but lacks explicit mechanisms for local-to-global information structuring. The paper draws on cognitive theories of discourse comprehension to address this gap.
Method: Proposes HiCI (Hierarchical Construction-Integration), a hierarchical attention module that: 1) constructs segment-level representations, 2) integrates them into a shared global context, and 3) broadcasts both to condition segment-level attention. Implemented through parameter-efficient adaptation of LLaMA-2 with only <5.5% additional parameters.
Result: Successfully extends context from 4K to 100K tokens (7B model) and 64K tokens (13B model). Shows consistent improvements across language modeling, retrieval, and instruction-following benchmarks. Matches proprietary models on topic retrieval and surpasses GPT-3.5-Turbo-16K on code comprehension.
Conclusion: Explicit hierarchical structuring serves as an effective inductive bias for long-context modeling, demonstrating that cognitive-inspired approaches can enhance language model performance on extended contexts with minimal parameter overhead.
Abstract: Long-context language modeling is commonly framed as a scalability challenge of token-level attention, yet local-to-global information structuring remains largely implicit in existing approaches. Drawing on cognitive theories of discourse comprehension, we propose HiCI (Hierarchical Construction–Integration), a hierarchical attention module that constructs segment-level representations, integrates them into a shared global context, and broadcasts both to condition segment-level attention. We validate HiCI through parameter-efficient adaptation of LLaMA-2 with only <5.5% additional parameters, extending context from 4K to 100K tokens (7B) and 64K tokens (13B). Across language modeling, retrieval, and instruction-following benchmarks, HiCI yields consistent improvements over strong baselines, including matching proprietary models on topic retrieval and surpassing GPT-3.5-Turbo-16K on code comprehension. These results demonstrate the effectiveness of explicit hierarchical structuring as an inductive bias for long-context modeling.
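The construct-integrate-broadcast flow can be illustrated with plain arrays; there are no learned attention weights here, so this is a structural sketch only, not HiCI's actual module:

```python
import numpy as np

def hici_sketch(token_states, segment_len):
    """Toy construction-integration pass over token states of shape
    (T, d): pool each segment (construction), average segment vectors
    into one shared global vector (integration), then broadcast both
    back to condition every token in each segment."""
    T, d = token_states.shape
    n_seg = T // segment_len  # drop any trailing partial segment
    segs = token_states[: n_seg * segment_len].reshape(n_seg, segment_len, d)
    seg_repr = segs.mean(axis=1)        # construction: one vector per segment
    global_ctx = seg_repr.mean(axis=0)  # integration: shared global context
    # broadcast: each token sees its segment summary and the global context
    conditioned = segs + seg_repr[:, None, :] + global_ctx
    return conditioned.reshape(n_seg * segment_len, d)

out = hici_sketch(np.ones((8, 4)), segment_len=4)
print(out.shape)  # (8, 4)
```

In the real module the pooling and conditioning would be learned attention, but the shape of the information flow (local summaries feeding a global state that is broadcast back down) is the inductive bias the paper argues for.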
[43] Can ChatGPT Really Understand Modern Chinese Poetry?
Shanshan Wang, Derek F. Wong, Jingming Yao, Lidia S. Chao
Main category: cs.CL
TL;DR: ChatGPT shows strong alignment (73%+) with poet intent in modern Chinese poetry interpretation but struggles with capturing poeticity, revealing both capabilities and limitations in LLM poetry understanding.
Details
Motivation: While ChatGPT demonstrates capabilities in poetry generation and translation, its true understanding of poetry remains unexplored. Previous work only analyzed experimental outcomes without addressing fundamental comprehension issues, creating a gap in evaluating LLMs' poetic understanding.
Method: Developed a comprehensive evaluation framework in collaboration with professional poets to assess ChatGPT’s interpretation of modern Chinese poems across multiple dimensions, comparing its interpretations with original poets’ intents.
Result: ChatGPT’s interpretations aligned with original poets’ intents in over 73% of cases, but showed less satisfactory performance in capturing poeticity and certain other dimensions of poetic understanding.
Conclusion: The study establishes an effective framework for evaluating LLM poetry understanding, revealing ChatGPT’s strengths in interpretation alignment but limitations in capturing poetic essence, providing foundation for future poetry-related LLM research.
Abstract: ChatGPT has demonstrated remarkable capabilities on both poetry generation and translation, yet its ability to truly understand poetry remains unexplored. Previous poetry-related work merely analyzed experimental outcomes without addressing fundamental issues of comprehension. This paper introduces a comprehensive framework for evaluating ChatGPT’s understanding of modern poetry. We collaborated with professional poets to evaluate ChatGPT’s interpretation of modern Chinese poems by different poets along multiple dimensions. Evaluation results show that ChatGPT’s interpretations align with the original poets’ intents in over 73% of the cases. However, its understanding in certain dimensions, particularly in capturing poeticity, proved to be less satisfactory. These findings highlight the effectiveness and necessity of our proposed framework. This study not only evaluates ChatGPT’s ability to understand modern poetry but also establishes a solid foundation for future research on LLMs and their application to poetry-related tasks.
[44] SozKZ: Training Efficient Small Language Models for Kazakh from Scratch
Saken Tukenov
Main category: cs.CL
TL;DR: SozKZ: A family of dedicated Kazakh language models (50M-600M params) trained from scratch with language-appropriate tokenizer, outperforming larger multilingual models on Kazakh benchmarks.
Details
Motivation: Kazakh is underserved by existing multilingual models due to minimal capacity allocation for low-resource languages and tokenizers ill-suited to its agglutinative morphology.
Method: Train Llama-architecture models from scratch on 9B Kazakh tokens with dedicated 50K BPE tokenizer, evaluate on Kazakh cultural QA, reading comprehension, and topic classification benchmarks.
Result: 600M model achieves 30.3% on Kazakh cultural QA (approaching Llama-3.2-1B’s 32.0%) and 25.5% on topic classification, surpassing multilingual models up to 2B parameters. Shows consistent scaling from 50M to 600M.
Conclusion: Small dedicated models with language-appropriate tokenizers offer viable path for low-resource language technology, achieving competitive performance at fraction of computational cost.
Abstract: Kazakh, a Turkic language spoken by over 22 million people, remains underserved by existing multilingual language models, which allocate minimal capacity to low-resource languages and employ tokenizers ill-suited to agglutinative morphology. We present SozKZ, a family of Llama-architecture language models (50M-600M parameters) trained entirely from scratch on 9 billion tokens of Kazakh text with a dedicated 50K BPE tokenizer. We evaluate all models on three Kazakh benchmarks – multiple-choice cultural QA, reading comprehension (Belebele), and topic classification (SIB-200) – alongside five multilingual baselines ranging from 500M to 3B parameters. Our 600M model achieves 30.3% accuracy on Kazakh cultural QA, approaching the 32.0% of Llama-3.2-1B (2x larger), and 25.5% on SIB-200 topic classification, surpassing all evaluated multilingual models up to 2B parameters. We observe consistent scaling from 50M to 600M, with MC QA accuracy rising from 22.8% to 30.3%, suggesting that further scaling remains beneficial. These results demonstrate that small, dedicated models trained from scratch with a language-appropriate tokenizer offer a viable path for low-resource language technology, achieving competitive performance at a fraction of the computational cost. All models and the tokenizer are released under open licenses.
[45] NoveltyAgent: Autonomous Novelty Reporting Agent with Point-wise Novelty Analysis and Self-Validation
Jiajun Hou, Hexuan Deng, Wenxiang Jiao, Xuebo Liu, Xiaopeng Ke, Min Zhang
Main category: cs.CL
TL;DR: NoveltyAgent is a multi-agent system that generates comprehensive novelty reports for academic papers by decomposing manuscripts into discrete novelty points for fine-grained retrieval and comparison, outperforming existing methods by 10.15%.
Details
Motivation: The exponential growth of academic publications has increased paper screening costs, and current approaches lack domain-specific mechanisms for quality novelty assessment, leading to lower-quality results.
Method: Introduces a multi-agent system that decomposes manuscripts into discrete novelty points for fine-grained retrieval and comparison, builds a comprehensive related-paper database, and cross-references claims to ensure faithfulness. Also proposes a checklist-based evaluation framework for open-ended generation tasks.
Result: Extensive experiments show NoveltyAgent achieves state-of-the-art performance, outperforming GPT-5 DeepResearch by 10.15%.
Conclusion: NoveltyAgent provides reliable, high-quality novelty analysis to help researchers quickly identify novel papers, with code and demo available.
Abstract: The exponential growth of academic publications has led to a surge in papers of varying quality, increasing the cost of paper screening. Current approaches either use novelty assessment within general AI Reviewers or repurpose DeepResearch, which lacks domain-specific mechanisms and thus delivers lower-quality results. To bridge this gap, we introduce NoveltyAgent, a multi-agent system designed to generate comprehensive and faithful novelty reports, enabling thorough evaluation of a paper’s originality. It decomposes manuscripts into discrete novelty points for fine-grained retrieval and comparison, and builds a comprehensive related-paper database while cross-referencing claims to ensure faithfulness. Furthermore, to address the challenge of evaluating such open-ended generation tasks, we propose a checklist-based evaluation framework, providing an unbiased paradigm for building reliable evaluations. Extensive experiments show that NoveltyAgent achieves state-of-the-art performance, outperforming GPT-5 DeepResearch by 10.15%. We hope this system will provide reliable, high-quality novelty analysis and help researchers quickly identify novel papers. Code and demo are available at https://github.com/SStan1/NoveltyAgent.
[46] LLM Router: Prefill is All You Need
Tanay Varshney, Annie Surla, Michelle Xu, Gomathy Venkata Krishnan, Maximilian Jeblick, David Austin, Neal Vaidya, Davide Onofrio
Main category: cs.CL
TL;DR: Proposes Encoder-Target Decoupling for LLM routing, using internal prefill activations and mathematical probes (Fisher Separability, Effective Dimensionality) to build the SharedTrunkNet architecture, which captures 45.58% of the Oracle accuracy gap with 74.31% cost savings.
Details
Motivation: Current LLM routers rely on fragile semantic signals, while an Oracle router with perfect foresight could significantly surpass standalone model accuracy by leveraging complementary performance across task subsets. Better routing mechanisms are needed to exploit these model-specific strengths.
Method: Encoder-Target Decoupling separates the model providing predictive signals (Encoder) from the model being estimated (Target), enabling optimized heterogeneous pairing. Uses Fisher Separability (J) and Effective Dimensionality (d_eff) as mathematical probes to isolate optimal layer-wise signals from internal prefill activations. Implements the SharedTrunkNet architecture based on these insights.
Result: SharedTrunkNet captures up to 45.58% of the accuracy gap between the strongest standalone model and the Oracle router while achieving 74.31% cost savings relative to the highest-cost model.
Conclusion: Internal prefill activations provide robust signals for LLM routing, and mathematical probes can effectively isolate optimal layer-wise information. Encoder-Target Decoupling enables efficient heterogeneous model pairing for improved routing performance.
Abstract: LLMs often share comparable benchmark accuracies, but their complementary performance across task subsets suggests that an Oracle router–a theoretical selector with perfect foresight–can significantly surpass standalone model accuracy by navigating model-specific strengths. While current routers rely on fragile semantic signals, we propose using internal prefill activations via Encoder-Target Decoupling–a functional separation between the model providing the predictive signal (the Encoder) and the model whose performance is being estimated (the Target). This allows optimized heterogeneous pairing between unique encoders and target models. We utilize Fisher Separability (J) and Effective Dimensionality (d_eff) as mathematical probes to isolate optimal layer-wise signals, providing the predictive foundation for our SharedTrunkNet architecture. SharedTrunkNet captures up to 45.58% of the accuracy gap between the strongest standalone model and the Oracle while achieving 74.31% cost savings relative to the highest-cost model.
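For readers unfamiliar with the two probes, both have standard textbook forms; a minimal sketch under the assumption that those forms are meant (the paper's exact definitions are not given in the summary above):

```python
import numpy as np

def fisher_separability(correct, wrong):
    """Two-class Fisher criterion J: squared mean gap over summed
    within-class variance, on 1-D projections of layer activations
    (the textbook form; the paper's exact definition may differ)."""
    return (correct.mean() - wrong.mean()) ** 2 / (correct.var() + wrong.var())

def effective_dimensionality(activations):
    """Participation ratio of the activation covariance spectrum:
    d_eff = (sum of eigenvalues)^2 / (sum of squared eigenvalues)."""
    eigs = np.clip(np.linalg.eigvalsh(np.cov(activations, rowvar=False)), 0.0, None)
    return eigs.sum() ** 2 / (eigs ** 2).sum()
```

Intuitively, layers with high J separate "target will answer correctly" from "target will fail", and d_eff indicates how many directions of the activation space carry that signal.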
[47] Mitigating Shortcut Reasoning in Language Models: A Gradient-Aware Training Approach
Hongyu Cao, Kunpeng Liu, Dongjie Wang, Yanjie Fu
Main category: cs.CL
TL;DR: SART is a gradient-aware framework that detects and mitigates shortcut-promoting samples in LLM reasoning training to improve genuine logical inference.
Details
Motivation: Large language models often rely on shortcuts like surface pattern matching and answer memorization rather than genuine logical inference, limiting their reasoning capabilities and generalization.
Method: Proposes Shortcut-Aware Reasoning Training (SART) with ShortcutScore to detect shortcut-promoting samples via gradient misalignment with validation objectives and answer-token concentration, then applies gradient surgery to modify training dynamics.
Result: Achieves +16.5% accuracy and +40.2% robustness over the strongest baseline on controlled reasoning benchmarks, significantly improving generalization under distribution shifts.
Conclusion: SART effectively mitigates shortcut learning in LLMs, enhancing genuine reasoning capabilities and generalization performance.
Abstract: Large language models exhibit strong reasoning capabilities, yet often rely on shortcuts such as surface pattern matching and answer memorization rather than genuine logical inference. We propose Shortcut-Aware Reasoning Training (SART), a gradient-aware framework that detects and mitigates shortcut-promoting samples via ShortcutScore and gradient surgery. Our method identifies shortcut signals through gradient misalignment with validation objectives and answer-token concentration, and modifies training dynamics accordingly. Experiments on controlled reasoning benchmarks show that SART achieves +16.5% accuracy and +40.2% robustness over the strongest baseline, significantly improving generalization under distribution shifts. Code is available at: https://github.com/fuyanjie/short-cut-aware-data-centric-reasoning.
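The summary names gradient surgery without specifying the rule; the most common form is a PCGrad-style projection, sketched here as an illustration (SART's actual operation may differ):

```python
import numpy as np

def gradient_surgery(g_sample, g_val):
    """PCGrad-style projection: if a training sample's gradient points
    against the validation gradient (negative dot product), strip the
    conflicting component along the validation direction; otherwise
    return the gradient unchanged."""
    dot = g_sample @ g_val
    if dot < 0:
        return g_sample - (dot / (g_val @ g_val)) * g_val
    return g_sample
```

A sample whose gradient opposes the validation objective (a candidate shortcut-promoter) thus contributes only its non-conflicting component to the update.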
[48] The Hidden Puppet Master: A Theoretical and Real-World Account of Emotional Manipulation in LLMs
Jocelyn Shen, Amina Luvsanchultem, Jessica Kim, Kynnedy Smith, Valdemar Danry, Kantwon Rogers, Sharifa Alghowinem, Hae Won Park, Maarten Sap, Cynthia Breazeal
Main category: cs.CL
TL;DR: PUPPET introduces a taxonomy for personalized emotional manipulation in LLM-human dialogues focusing on incentive morality, showing harmful hidden incentives cause larger belief shifts than prosocial ones, with LLMs moderately predicting but underestimating these shifts.
Details
Motivation: As users increasingly rely on LLMs for personal advice, they become vulnerable to manipulation driven by hidden incentives misaligned with their interests. Prior work lacks real-world correlation with human belief shifts and overlooks the morality of hidden incentives.
Method: Introduces PUPPET taxonomy for personalized emotional manipulation centered on incentive morality. Conducts human study with 1,035 participants across realistic everyday queries, varying personalization and incentive direction (harmful vs prosocial). Benchmarks LLMs on belief prediction task.
Result: Harmful hidden incentives produce significantly larger belief shifts than prosocial ones. LLMs show moderate predictive ability for belief change (r=0.3-0.5) but systematically underestimate the magnitude of belief shift.
Conclusion: Establishes a theoretically grounded and behaviorally validated foundation for studying and combating incentive-driven manipulation in LLMs during everyday user queries, highlighting the importance of considering incentive morality in manipulation research.
Abstract: As users increasingly turn to LLMs for practical and personal advice, they become vulnerable to being subtly steered toward hidden incentives misaligned with their own interests. Prior works have benchmarked persuasion and manipulation detection, but these efforts rely on simulated or debate-style settings, remain uncorrelated with real human belief shifts, and overlook a critical dimension: the morality of hidden incentives driving the manipulation. We introduce PUPPET, a theoretical taxonomy of personalized emotional manipulation in LLM-human dialogues that centers around incentive morality, and conduct a human study with N=1,035 participants across realistic everyday queries, varying personalization and incentive direction (harmful versus prosocial). We find that harmful hidden incentives produce significantly larger belief shifts than prosocial ones. Finally, we benchmark LLMs on the task of belief prediction, finding that models exhibit moderate predictive ability of belief change based on conversational contexts (r=0.3 - 0.5), but they also systematically underestimate the magnitude of belief shift. Together, this work establishes a theoretically grounded and behaviorally validated foundation for studying, and ultimately combating, incentive-driven manipulation in LLMs during everyday, practical user queries.
[49] User Preference Modeling for Conversational LLM Agents: Weak Rewards from Retrieval-Augmented Interaction
Yuren Hao, Shuhaib Mehri, ChengXiang Zhai, Dilek Hakkani-Tür
Main category: cs.CL
TL;DR: VARS is a framework for personalizing LLM assistants using long-term and short-term preference vectors that bias retrieval scoring, updated from user feedback without fine-tuning.
Details
Motivation: Current LLM assistants lack persistent user models, forcing users to repeatedly restate preferences across sessions. There's a need for personalization that adapts to user preferences without per-user fine-tuning.
Method: Vector-Adapted Retrieval Scoring (VARS) represents users with long-term and short-term vectors in a shared preference space, using these to bias retrieval scoring over structured preference memory. Vectors are updated online from weak scalar rewards from user feedback, enabling personalization with frozen backbones.
Result: On MultiSessionCollab benchmark (math and code tasks), VARS achieves strongest overall performance, matches Reflection baseline in task success, reduces timeout rate and user effort. Learned vectors show interpretability: long-term vectors align with cross-user preference overlap, short-term vectors capture session-specific adaptation.
Conclusion: VARS enables effective personalization of LLM assistants through preference-aware retrieval scoring without fine-tuning, improving interaction efficiency rather than raw task accuracy. The dual-vector design is interpretable and effective for multi-session collaboration.
Abstract: Large language models are increasingly used as personal assistants, yet most lack a persistent user model, forcing users to repeatedly restate preferences across sessions. We propose Vector-Adapted Retrieval Scoring (VARS), a pipeline-agnostic, frozen-backbone framework that represents each user with long-term and short-term vectors in a shared preference space and uses these vectors to bias retrieval scoring over structured preference memory. The vectors are updated online from weak scalar rewards from users’ feedback, enabling personalization without per-user fine-tuning. We evaluate on MultiSessionCollab, an online multi-session collaboration benchmark with rich user preference profiles, across math and code tasks. Under frozen backbones, the main benefit of user-aware retrieval is improved interaction efficiency rather than large gains in raw task accuracy: our full VARS agent achieves the strongest overall performance, matches a strong Reflection baseline in task success, and reduces timeout rate and user effort. The learned long-term vectors also align with cross-user preference overlap, while short-term vectors capture session-specific adaptation, supporting the interpretability of the dual-vector design. Code, model, and data are available at https://github.com/YurenHao0426/VARS.
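A minimal sketch of what vector-biased retrieval scoring with online weak-reward updates could look like; the additive scoring form, the alpha/beta weights, and the update rule are illustrative assumptions, not VARS's published equations:

```python
import numpy as np

def _cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

def biased_score(query, item, v_long, v_short, alpha=0.5, beta=0.5):
    """Base query-item relevance plus bias terms from the user's
    long-term and short-term preference vectors (illustrative form)."""
    return _cos(query, item) + alpha * _cos(v_long, item) + beta * _cos(v_short, item)

def online_update(v, item, reward, lr=0.1):
    """Weak-reward update: move the preference vector toward an item
    under positive reward and away from it under negative reward."""
    return v + lr * reward * (item - v)
```

Because only the small preference vectors change, the backbone model stays frozen and no per-user fine-tuning is needed, matching the summary above.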
[50] Alignment Whack-a-Mole : Finetuning Activates Verbatim Recall of Copyrighted Books in Large Language Models
Xinyue Liu, Niloofar Mireshghallah, Jane C. Ginsburg, Tuhin Chakrabarty
Main category: cs.CL
TL;DR: Finetuning LLMs on plot summaries causes them to verbatim reproduce copyrighted books, bypassing safety measures and revealing latent memorization in model weights.
Details
Motivation: To investigate whether frontier LLMs actually store copies of training data despite companies' claims, and whether safety alignment measures (RLHF, system prompts, filters) can be bypassed through finetuning techniques.
Method: Finetuned models (GPT-4o, Gemini-2.5-Pro, DeepSeek-V3.1) to expand plot summaries into full text, a task suited for commercial writing assistants, then prompted them with semantic descriptions to test verbatim reproduction of copyrighted books.
Result: Models reproduced up to 85-90% of held-out copyrighted books with single verbatim spans exceeding 460 words. Finetuning on one author’s works unlocked recall of books from over 30 unrelated authors. Three different models memorized the same books in the same regions (r ≥ 0.90).
Conclusion: Model weights store copies of copyrighted works, and finetuning bypasses safety protections, undermining legal defenses based on adequate prevention measures. This represents an industry-wide vulnerability.
Abstract: Frontier LLM companies have repeatedly assured courts and regulators that their models do not store copies of training data. They further rely on safety alignment strategies via RLHF, system prompts, and output filters to block verbatim regurgitation of copyrighted works, and have cited the efficacy of these measures in their legal defenses against copyright infringement claims. We show that finetuning bypasses these protections: by training models to expand plot summaries into full text, a task naturally suited for commercial writing assistants, we cause GPT-4o, Gemini-2.5-Pro, and DeepSeek-V3.1 to reproduce up to 85-90% of held-out copyrighted books, with single verbatim spans exceeding 460 words, using only semantic descriptions as prompts and no actual book text. This extraction generalizes across authors: finetuning exclusively on Haruki Murakami’s novels unlocks verbatim recall of copyrighted books from over 30 unrelated authors. The effect is not specific to any training author or corpus: random author pairs and public-domain finetuning data produce comparable extraction, while finetuning on synthetic text yields near-zero extraction, indicating that finetuning on individual authors’ works reactivates latent memorization from pretraining. Three models from different providers memorize the same books in the same regions (r ≥ 0.90), pointing to an industry-wide vulnerability. Our findings offer compelling evidence that model weights store copies of copyrighted works and that the security failures that manifest after finetuning on individual authors’ works undermine a key premise of recent fair use rulings, where courts have conditioned favorable outcomes on the adequacy of measures preventing reproduction of protected expression.
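One simple way to quantify the "single verbatim spans" reported above is the longest contiguous word sequence shared between a model's output and the book; a sketch of that measurement (not necessarily the paper's exact metric):

```python
from difflib import SequenceMatcher

def longest_verbatim_span(generated, book):
    """Length, in words, of the longest contiguous word sequence shared
    by the generated text and the source book."""
    g, b = generated.split(), book.split()
    m = SequenceMatcher(None, g, b, autojunk=False).find_longest_match(
        0, len(g), 0, len(b))
    return m.size
```

Under this measure, a 460-word span means 460 consecutive words of model output match the book exactly.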
[51] DiscoUQ: Structured Disagreement Analysis for Uncertainty Quantification in LLM Agent Ensembles
Bo Jiang
Main category: cs.CL
TL;DR: DiscoUQ: A framework for uncertainty quantification in multi-agent LLM systems that leverages semantic disagreement structure rather than simple voting statistics.
Details
Motivation: Existing methods for quantifying uncertainty in multi-agent LLM systems rely on shallow voting statistics that discard rich semantic information in agents' reasoning, leading to poorly calibrated confidence estimates.
Method: Three methods: DiscoUQ-LLM (logistic regression on LLM-extracted linguistic features like evidence overlap and argument strength), DiscoUQ-Embed (logistic regression on embedding geometry features like cluster distances), and DiscoUQ-Learn (neural network combining all features).
Result: DiscoUQ-LLM achieves average AUROC of 0.802, outperforming best baseline (0.791) with substantially better calibration (ECE 0.036 vs. 0.098). Features generalize across benchmarks with near-zero performance degradation, especially helping in ambiguous “weak disagreement” cases.
Conclusion: Leveraging semantic disagreement structure in multi-agent LLM systems provides better uncertainty quantification than simple voting statistics, with particularly strong benefits in ambiguous cases where traditional methods fail.
Abstract: Multi-agent LLM systems, where multiple prompted instances of a language model independently answer questions, are increasingly used for complex reasoning tasks. However, existing methods for quantifying the uncertainty of their collective outputs rely on shallow voting statistics that discard the rich semantic information in agents’ reasoning. We introduce DiscoUQ, a framework that extracts and leverages the structure of inter-agent disagreement – both linguistic properties (evidence overlap, argument strength, divergence depth) and embedding geometry (cluster distances, dispersion, cohesion) – to produce well-calibrated confidence estimates. We propose three methods of increasing complexity: DiscoUQ-LLM (logistic regression on LLM-extracted structure features), DiscoUQ-Embed (logistic regression on embedding geometry), and DiscoUQ-Learn (a neural network combining all features). Evaluated on four diverse benchmarks (StrategyQA, MMLU, TruthfulQA, ARC-Challenge) with a 5-agent system using Qwen3.5-27B, DiscoUQ-LLM achieves an average AUROC of 0.802, outperforming the best baseline (LLM Aggregator, 0.791) while being substantially better calibrated (ECE 0.036 vs. 0.098). The learned features generalize across benchmarks with near-zero performance degradation and provide the largest improvements where they are most needed: in the ambiguous “weak disagreement” tier where simple vote counting fails.
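The calibration comparison above is stated in ECE; for reference, a minimal implementation of the standard binned definition (the paper's binning details may differ):

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """Standard binned ECE: partition predictions by confidence and sum
    |bin accuracy - bin mean confidence| weighted by bin mass."""
    conf, correct = np.asarray(conf, float), np.asarray(correct, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for i in range(n_bins):
        in_bin = (conf > edges[i]) & (conf <= edges[i + 1])
        if i == 0:
            in_bin |= conf == 0.0  # include exact zeros in the first bin
        if in_bin.any():
            total += in_bin.mean() * abs(correct[in_bin].mean() - conf[in_bin].mean())
    return total
```

An ECE of 0.036 (DiscoUQ-LLM) versus 0.098 (the baseline) means the framework's confidence scores track empirical accuracy roughly three times more closely.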
[52] Mitigating Selection Bias in Large Language Models via Permutation-Aware GRPO
Jinquan Zheng, Jia Yuan, Jiacheng Yao, Chenyang Gu, Pujun Zheng, Guoxiu He
Main category: cs.CL
TL;DR: PA-GRPO is a training method that reduces selection bias in LLMs for evaluation tasks by enforcing permutation-consistent reasoning through group optimization.
Details
Motivation: LLMs exhibit selection bias in multiple-choice and pairwise evaluation tasks due to non-semantic factors like option positions and label symbols. Existing debiasing methods are costly at inference time or ignore that the same question should yield consistent answers across different permutations.
Method: Proposes Permutation-Aware Group Relative Policy Optimization (PA-GRPO) which constructs permutation groups for each instance and uses two mechanisms: (1) cross-permutation advantage (relative to mean reward over all permutations), and (2) consistency-aware reward (encouraging consistent decisions across permutations).
Result: Outperforms strong baselines across seven benchmarks, substantially reducing selection bias while maintaining high overall performance.
Conclusion: PA-GRPO effectively mitigates selection bias in LLMs for evaluation tasks through permutation-consistent semantic reasoning optimization.
Abstract: Large language models (LLMs) used for multiple-choice and pairwise evaluation tasks often exhibit selection bias due to non-semantic factors like option positions and label symbols. Existing inference-time debiasing is costly and may harm reasoning, while pointwise training ignores that the same question should yield consistent answers across permutations. To address this issue, we propose Permutation-Aware Group Relative Policy Optimization (PA-GRPO), which mitigates selection bias by enforcing permutation-consistent semantic reasoning. PA-GRPO constructs a permutation group for each instance by generating multiple candidate permutations, and optimizes the model using two complementary mechanisms: (1) cross-permutation advantage, which computes advantages relative to the mean reward over all permutations of the same instance, and (2) consistency-aware reward, which encourages the model to produce consistent decisions across different permutations. Experimental results demonstrate that PA-GRPO outperforms strong baselines across seven benchmarks, substantially reducing selection bias while maintaining high overall performance. The code will be made available on GitHub (https://github.com/ECNU-Text-Computing/PA-GRPO).
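The two mechanisms described above can be sketched in a few lines; the consistency reward shown here (fraction agreeing with the majority decision) is an illustrative stand-in for the paper's exact formulation:

```python
import numpy as np

def cross_permutation_advantage(rewards):
    """Advantage of each rollout relative to the mean reward over all
    permutations of the same instance (the GRPO group baseline, taken
    across the permutation group)."""
    rewards = np.asarray(rewards, float)
    return rewards - rewards.mean()

def consistency_reward(decisions):
    """Fraction of permutations agreeing with the majority decision, a
    simple stand-in for the paper's consistency-aware reward."""
    _, counts = np.unique(np.asarray(decisions), return_counts=True)
    return counts.max() / len(decisions)
```

Because the baseline is shared across all permutations of one question, a permutation that is answered correctly only when the gold option sits in a favored position receives no systematic advantage.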
[53] Left Behind: Cross-Lingual Transfer as a Bridge for Low-Resource Languages in Large Language Models
Abdul-Salem Beibitkhan
Main category: cs.CL
TL;DR: LLMs show significant performance gaps (13.8-16.7pp) for low-resource languages like Kazakh and Mongolian compared to English, with cross-lingual transfer strategies providing architecture-dependent benefits.
Details
Motivation: To evaluate how current large language models perform on low-resource languages and identify effective mitigation strategies for performance gaps between English and under-resourced languages.
Method: Benchmarked eight LLMs across five experimental conditions in English, Kazakh, and Mongolian using 50 hand-crafted questions spanning factual, reasoning, technical, and cultural categories. Evaluated 2,000 responses on accuracy, fluency, and completeness. Tested cross-lingual transfer prompting (reason in English then translate back).
Result: Found consistent 13.8-16.7 percentage point performance gap between English and low-resource language conditions. Models maintained surface-level fluency but produced significantly less accurate content. Cross-lingual transfer yielded selective gains (+2.2pp to +4.3pp) for bilingual architectures but no benefit for English-dominant models.
Conclusion: Current LLMs systematically underserve low-resource language communities, and effective mitigation strategies are architecture-dependent rather than universal, requiring tailored approaches for different model architectures.
Abstract: We investigate how large language models perform on low-resource languages by benchmarking eight LLMs across five experimental conditions in English, Kazakh, and Mongolian. Using 50 hand-crafted questions spanning factual, reasoning, technical, and culturally grounded categories, we evaluate 2,000 responses on accuracy, fluency, and completeness. We find a consistent performance gap of 13.8-16.7 percentage points between English and low-resource language conditions, with models maintaining surface-level fluency while producing significantly less accurate content. Cross-lingual transfer (prompting models to reason in English before translating back) yields selective gains for bilingual architectures (+2.2pp to +4.3pp) but provides no benefit to English-dominant models. Our results demonstrate that current LLMs systematically underserve low-resource language communities, and that effective mitigation strategies are architecture-dependent rather than universal.
[54] Reading Between the Lines: How Electronic Nonverbal Cues shape Emotion Decoding
Taara Kumar, Kokil Jaidka
Main category: cs.CL
TL;DR: Systematic study of electronic nonverbal cues (eNVCs) in text-based CMC, developing taxonomy, detection tools, and examining their effects on emotional decoding accuracy and user interpretation strategies.
Details
Motivation: As text-based computer-mediated communication becomes ubiquitous, there's a need to understand how users reconstruct nonverbal expression in environments lacking embodied cues, addressing how textual analogues of kinesics, vocalics, and paralinguistics function in digital communication.
Method: Three complementary studies: 1) Develops unified taxonomy of eNVCs grounded in nonverbal communication theory and creates Python toolkit for automated detection; 2) Within-subject survey experiment testing causal effects of eNVCs on emotional decoding; 3) Focus group discussions exploring user interpretive strategies for digital prosody.
Result: eNVCs substantially improve emotional decoding accuracy and lower perceived ambiguity, though benefits weaken with sarcasm. Users employ specific interpretive strategies including drawing meaning from cue absence and defaulting to negative interpretations in ambiguous contexts. Provides practical toolkit for detection.
Conclusion: Establishes eNVCs as coherent, measurable digital behaviors; refines theories of cue richness and interpretive effort; provides tools for affective computing, user modeling, and emotion-aware interface design.
Abstract: As text-based computer-mediated communication (CMC) increasingly structures everyday interaction, a central question re-emerges with new urgency: How do users reconstruct nonverbal expression in environments where embodied cues are absent? This paper provides a systematic, theory-driven account of electronic nonverbal cues (eNVCs) - textual analogues of kinesics, vocalics, and paralinguistics - in public microblog communication. Across three complementary studies, we advance conceptual, empirical, and methodological contributions. Study 1 develops a unified taxonomy of eNVCs grounded in foundational nonverbal communication theory and introduces a scalable Python toolkit for their automated detection. Study 2, a within-subject survey experiment, offers controlled causal evidence that eNVCs substantially improve emotional decoding accuracy and lower perceived ambiguity, while also identifying boundary conditions, such as sarcasm, under which these benefits weaken or disappear. Study 3, through focus group discussions, reveals the interpretive strategies users employ when reasoning about digital prosody, including drawing meaning from the absence of expected cues and defaulting toward negative interpretations in ambiguous contexts. Together, these studies establish eNVCs as a coherent and measurable class of digital behaviors, refine theoretical accounts of cue richness and interpretive effort, and provide practical tools for affective computing, user modeling, and emotion-aware interface design. The eNVC detection toolkit is available as a Python and R package at https://github.com/kokiljaidka/envc.
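A toy illustration of pattern-based eNVC detection in the spirit of the released toolkit; the three cue patterns below are a hypothetical subset chosen for illustration, not the toolkit's actual taxonomy:

```python
import re

# Minimal pattern-based detectors for a few textual nonverbal cues
# (a hypothetical subset; the released toolkit covers a full taxonomy).
CUE_PATTERNS = {
    "letter_stretch": re.compile(r"\b\w*(\w)\1{2,}\w*\b"),  # "soooo", "nooo"
    "allcaps_emphasis": re.compile(r"\b[A-Z]{3,}\b"),       # "NEVER"
    "repeated_punct": re.compile(r"[!?]{2,}"),              # "!!", "?!?"
}

def detect_envcs(text):
    """Count occurrences of each electronic nonverbal cue type."""
    return {name: len(p.findall(text)) for name, p in CUE_PATTERNS.items()}
```

Each pattern stands in for a vocalic or paralinguistic analogue: letter stretching for drawn-out speech, all-caps for raised volume, repeated punctuation for intensified tone.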
[55] ViCLSR: A Supervised Contrastive Learning Framework with Natural Language Inference for Natural Language Understanding Tasks
Tin Van Huynh, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen
Main category: cs.CL
TL;DR: ViCLSR: A supervised contrastive learning framework for Vietnamese sentence representations that outperforms PhoBERT on multiple NLU benchmarks by leveraging NLI datasets and adapting existing Vietnamese datasets for contrastive learning.
Details
Motivation: Vietnamese faces challenges in natural language understanding due to limited annotated data, and while pre-trained models like PhoBERT exist, their effectiveness is constrained by data scarcity. Contrastive learning shows promise for improving sentence representations but needs adaptation for Vietnamese.
Method: Proposes ViCLSR, a supervised contrastive learning framework specifically designed for Vietnamese sentence embeddings. Leverages existing NLI datasets and develops a process to adapt existing Vietnamese datasets for supervised contrastive learning compatibility.
Result: ViCLSR significantly outperforms PhoBERT on five benchmark NLU datasets: ViNLI (+6.97% F1), ViWikiFC (+4.97% F1), ViFactCheck (+9.02% F1), UIT-ViCTSD (+5.36% F1), and ViMMRC2.0 (+4.33% Accuracy).
Conclusion: Supervised contrastive learning can effectively address resource limitations in Vietnamese NLU tasks and improve sentence representation learning for low-resource languages. ViCLSR is released for research purposes.
Abstract: High-quality text representations are crucial for natural language understanding (NLU), but low-resource languages like Vietnamese face challenges due to limited annotated data. While pre-trained models like PhoBERT and CafeBERT perform well, their effectiveness is constrained by data scarcity. Contrastive learning (CL) has recently emerged as a promising approach for improving sentence representations, enabling models to effectively distinguish between semantically similar and dissimilar sentences. We propose ViCLSR (Vietnamese Contrastive Learning for Sentence Representations), a novel supervised contrastive learning framework specifically designed to optimize sentence embeddings for Vietnamese, leveraging existing natural language inference (NLI) datasets. Additionally, we propose a process to adapt existing Vietnamese datasets for supervised learning, ensuring compatibility with CL methods. Our experiments demonstrate that ViCLSR significantly outperforms the powerful monolingual pre-trained model PhoBERT on five benchmark NLU datasets such as ViNLI (+6.97% F1), ViWikiFC (+4.97% F1), ViFactCheck (+9.02% F1), UIT-ViCTSD (+5.36% F1), and ViMMRC2.0 (+4.33% Accuracy). ViCLSR shows that supervised contrastive learning can effectively address resource limitations in Vietnamese NLU tasks and improve sentence representation learning for low-resource languages. Furthermore, we conduct an in-depth analysis of the experimental results to uncover the factors contributing to the superior performance of contrastive learning models. ViCLSR is released for research purposes in advancing natural language processing tasks.
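ViCLSR builds on NLI-supervised contrastive learning; a minimal sketch of the standard objective in this family (entailment hypothesis as positive, contradiction hypotheses as hard negatives), which ViCLSR's actual loss may refine:

```python
import numpy as np

def nli_contrastive_loss(anchor, entailment, contradictions, temperature=0.05):
    """InfoNCE with the entailment hypothesis as the positive and the
    contradiction hypotheses as hard negatives (the supervised-SimCSE
    recipe; ViCLSR's exact objective is not spelled out here).
    Inputs are embedding vectors."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    logits = np.array([cos(anchor, entailment)] +
                      [cos(anchor, c) for c in contradictions]) / temperature
    logits -= logits.max()                        # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                      # cross-entropy toward the positive
```

Minimizing this loss pulls premise and entailment embeddings together while pushing contradictions apart, which is what makes NLI data a natural supervision signal for sentence representations.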
[56] Evaluating Reasoning-Based Scaffolds for Human-AI Co-Annotation: The ReasonAlign Annotation Protocol
Smitha Muthya Sudheendra, Jaideep Srivastava
Main category: cs.CL
TL;DR: ReasonAlign: A reasoning-based annotation scaffold that exposes LLM-generated explanations without labels to study how reasoning affects human annotation behavior, showing increased agreement with minimal revision.
Details
Motivation: Human annotation in NLP has substantial variability across annotators, and while LLMs can provide structured reasoning to support annotation, their influence on human annotation behavior remains unclear. The paper aims to study how reasoning affects human annotation behavior rather than evaluating annotation accuracy.
Method: Introduces ReasonAlign, a reasoning-based annotation scaffold that exposes LLM-generated explanations while withholding predicted labels. Uses a two-pass protocol inspired by Delphi-style revision: annotators first label instances independently, then revise decisions after viewing model-generated reasoning. Evaluates on sentiment classification and opinion detection tasks, analyzing changes in inter-annotator agreement and revision behavior. Introduces Annotator Effort Proxy (AEP) metric to capture proportion of labels revised after exposure to reasoning.
Result: Exposure to reasoning is associated with increased agreement alongside minimal revision, suggesting that reasoning primarily helps resolve ambiguous cases without inducing widespread changes. The findings provide insight into how reasoning explanations shape annotation consistency.
Conclusion: Reasoning-based scaffolds serve as a practical mechanism for supporting human-AI annotation workflows by improving annotation consistency through exposure to LLM-generated reasoning, particularly for ambiguous cases.
Abstract: Human annotation is central to NLP evaluation, yet subjective tasks often exhibit substantial variability across annotators. While large language models (LLMs) can provide structured reasoning to support annotation, their influence on human annotation behavior remains unclear. We introduce ReasonAlign, a reasoning-based annotation scaffold that exposes LLM-generated explanations while withholding predicted labels. We frame this as a controlled study of how reasoning affects human annotation behavior, rather than a full evaluation of annotation accuracy. Using a two-pass protocol inspired by Delphi-style revision, annotators first label instances independently and then revise their decisions after viewing model-generated reasoning. We evaluate the approach on sentiment classification and opinion detection tasks, analyzing changes in inter-annotator agreement and revision behavior. To quantify these effects, we introduce the Annotator Effort Proxy (AEP), a metric capturing the proportion of labels revised after exposure to reasoning. Our results show that exposure to reasoning is associated with increased agreement alongside minimal revision, suggesting that reasoning primarily helps resolve ambiguous cases without inducing widespread changes. These findings provide insight into how reasoning explanations shape annotation consistency and highlight reasoning-based scaffolds as a practical mechanism for supporting human-AI annotation workflows.
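The Annotator Effort Proxy (AEP) described above is defined as the proportion of labels revised after exposure to reasoning. A minimal sketch of that computation (function name and signature are illustrative, not from the paper):

```python
def annotator_effort_proxy(first_pass, second_pass):
    """Proportion of items whose label changed between the two annotation passes."""
    if len(first_pass) != len(second_pass):
        raise ValueError("both passes must label the same items")
    revised = sum(a != b for a, b in zip(first_pass, second_pass))
    return revised / len(first_pass)

# Example: 2 of 5 labels revised after seeing model reasoning -> AEP = 0.4
print(annotator_effort_proxy(["pos", "neg", "neu", "pos", "neg"],
                             ["pos", "pos", "neu", "neg", "neg"]))
```

A low AEP alongside higher inter-annotator agreement is exactly the paper's reported pattern: few revisions, concentrated on ambiguous cases.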
[57] Many Dialects, Many Languages, One Cultural Lens: Evaluating Multilingual VLMs for Bengali Culture Understanding Across Historically Linked Languages and Regional Dialects
Nurul Labib Sayeedi, Md. Faiyaz Abdullah Sayeedi, Shubhashis Roy Dipta, Rubaya Tabassum, Ariful Ekraj Hridoy, Mehraj Mahmood, Mahbub E Sobhani, Md. Tarek Hasan, Swakkhar Shatabda
Main category: cs.CL
TL;DR: BanglaVerse is a culturally grounded benchmark for evaluating multilingual vision-language models on Bengali culture across languages and dialects, revealing performance gaps in cultural understanding.
Details
Motivation: Bangla culture is richly expressed through various aspects but remains underrepresented in multimodal evaluation. Current benchmarks don't adequately test cultural understanding across linguistic variations.
Method: Created BanglaVerse with 1,152 manually curated images across nine cultural domains, supporting visual question answering and captioning. Expanded into four languages and five Bangla dialects, yielding ~32.3K artifacts for comprehensive evaluation.
Result: Evaluating only standard Bangla overestimates model capability - performance drops under dialectal variation, especially for caption generation. Hindi and Urdu retain some cultural meaning but are weaker for structured reasoning. Main bottleneck is missing cultural knowledge rather than visual grounding alone.
Conclusion: BanglaVerse provides a more realistic test bed for measuring culturally grounded multimodal understanding under linguistic variation, highlighting the importance of cultural knowledge in vision-language models.
Abstract: Bangla culture is richly expressed through region, dialect, history, food, politics, media, and everyday visual life, yet it remains underrepresented in multimodal evaluation. To address this gap, we introduce BanglaVerse, a culturally grounded benchmark for evaluating multilingual vision-language models (VLMs) on Bengali culture across historically linked languages and regional dialects. Built from 1,152 manually curated images across nine domains, the benchmark supports visual question answering and captioning, and is expanded into four languages and five Bangla dialects, yielding ~32.3K artifacts. Our experiments show that evaluating only standard Bangla overestimates true model capability: performance drops under dialectal variation, especially for caption generation, while historically linked languages such as Hindi and Urdu retain some cultural meaning but remain weaker for structured reasoning. Across domains, the main bottleneck is missing cultural knowledge rather than visual grounding alone, particularly in knowledge-intensive categories. These findings position BanglaVerse as a more realistic test bed for measuring culturally grounded multimodal understanding under linguistic variation.
[58] Entropy Alone is Insufficient for Safe Selective Prediction in LLMs
Edward Phillips, Fredrik K. Gustafsson, Sean Wu, Anshul Thakur, David A. Clifton
Main category: cs.CL
TL;DR: Combining entropy-based uncertainty with correctness probes improves selective prediction for QA tasks, addressing failure modes of entropy-only methods.
Details
Motivation: Selective prediction systems can reduce harms from language model hallucinations by abstaining in high-risk cases, but current uncertainty quantification methods aren't properly evaluated for deployment scenarios at low target error rates.
Method: Identifies failure mode of entropy-based uncertainty methods, then combines entropy scores with correctness probe signals to create more reliable abstention behavior.
Result: Across three QA benchmarks (TriviaQA, BioASQ, MedicalQA) and four model families, the combined score improves both risk-coverage trade-off and calibration performance relative to entropy-only baselines.
Conclusion: Deployment-facing evaluation of uncertainty methods is crucial, using metrics that reflect whether systems can be trusted to operate at stated risk levels.
Abstract: Selective prediction systems can mitigate harms resulting from language model hallucinations by abstaining from answering in high-risk cases. Uncertainty quantification techniques are often employed to identify such cases, but are rarely evaluated in the context of the wider selective prediction policy and its ability to operate at low target error rates. We identify a model-dependent failure mode of entropy-based uncertainty methods that leads to unreliable abstention behaviour, and address it by combining entropy scores with a correctness probe signal. We find that across three QA benchmarks (TriviaQA, BioASQ, MedicalQA) and four model families, the combined score generally improves both the risk–coverage trade-off and calibration performance relative to entropy-only baselines. Our results highlight the importance of deployment-facing evaluation of uncertainty methods, using metrics that directly reflect whether a system can be trusted to operate at a stated risk level.
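The combined score can be sketched as a selective-prediction policy that abstains when a blend of two risk signals, normalized answer entropy and a correctness-probe score, exceeds a threshold. The equal-weight average and the threshold are assumptions for illustration; the paper's exact combination rule may differ:

```python
import math

def entropy(probs):
    """Shannon entropy (nats) of an answer distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def combined_risk(answer_probs, probe_correct_prob):
    # Normalize entropy by its maximum for this support size, so both
    # signals live in [0, 1]; higher means riskier.
    max_ent = math.log(len(answer_probs))
    ent_risk = min(entropy(answer_probs) / max_ent, 1.0)
    probe_risk = 1.0 - probe_correct_prob
    return 0.5 * ent_risk + 0.5 * probe_risk  # illustrative equal weighting

def predict_or_abstain(answer, answer_probs, probe_correct_prob, threshold=0.5):
    risk = combined_risk(answer_probs, probe_correct_prob)
    return answer if risk < threshold else None  # None = abstain

# Confident distribution + confident probe -> answer; otherwise abstain.
print(predict_or_abstain("Paris", [0.95, 0.05], probe_correct_prob=0.9))
print(predict_or_abstain("Paris", [0.55, 0.45], probe_correct_prob=0.4))
```

Sweeping `threshold` traces out the risk-coverage curve the paper evaluates: lower thresholds mean more abstention (lower coverage) at lower risk.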
[59] Explainable Semantic Textual Similarity via Dissimilar Span Detection
Diego Miguel Lozano, Daryna Dementieva, Alexander Fraser
Main category: cs.CL
TL;DR: Introduces Dissimilar Span Detection (DSD) task to identify semantically differing spans between text pairs, with a new dataset and baseline methods, showing potential for improving downstream NLP tasks like paraphrase detection.
Details
Motivation: Existing Semantic Textual Similarity (STS) approaches reduce semantic nuances to single scores, limiting interpretability. Need methods to identify specific spans causing semantic differences to help users understand negative similarity factors and improve downstream tasks.
Method: Introduces DSD task and releases Span Similarity Dataset (SSD) created via semi-automated pipeline combining LLMs with human verification. Proposes unsupervised baselines (LIME, SHAP, LLMs, custom method) and supervised approach. Evaluates on paraphrase detection task.
Result: LLMs and supervised models achieve highest performance but overall results remain low, highlighting task complexity. DSD shows potential to improve performance in paraphrase detection through additional experiment.
Conclusion: DSD addresses interpretability limitations in STS by identifying specific dissimilar spans. While challenging, it offers benefits for understanding semantic differences and improving downstream NLP applications.
Abstract: Semantic Textual Similarity (STS) is a crucial component of many Natural Language Processing (NLP) applications. However, existing approaches typically reduce semantic nuances to a single score, limiting interpretability. To address this, we introduce the task of Dissimilar Span Detection (DSD), which aims to identify semantically differing spans between pairs of texts. This can help users understand which particular words or tokens negatively affect the similarity score, or be used to improve performance in STS-dependent downstream tasks. Furthermore, we release a new dataset suitable for the task, the Span Similarity Dataset (SSD), developed through a semi-automated pipeline combining large language models (LLMs) with human verification. We propose and evaluate different baseline methods for DSD, both unsupervised, based on LIME, SHAP, LLMs, and our own method, as well as an additional supervised approach. While LLMs and supervised models achieve the highest performance, overall results remain low, highlighting the complexity of the task. Finally, we set up an additional experiment that shows how DSD can lead to increased performance in the specific task of paraphrase detection.
[60] Context Selection for Hypothesis and Statistical Evidence Extraction from Full-Text Scientific Articles
Sai Koneru, Jian Wu, Sarah Rajtmajer
Main category: cs.CL
TL;DR: A two-stage retrieve-and-extract framework for extracting hypotheses and supporting statistical evidence from full-text scientific articles, with controlled study of retrieval design choices and LLM extractors.
Details
Motivation: Extracting hypotheses and statistical evidence from scientific articles is crucial for empirical synthesis but remains difficult due to document length and distribution of arguments across sections. Current methods struggle with linking abstract findings to corresponding hypothesis statements and supporting evidence in paper bodies.
Method: Two-stage retrieve-and-extract framework with controlled study of retrieval design choices: varying context quantity, quality (standard RAG, reranking, fine-tuned retriever with reranking), and oracle paragraph setting. Four LLM extractors evaluated across these configurations to separate retrieval failures from extraction limits.
Result: Targeted context selection consistently improves hypothesis extraction relative to full-text prompting, with gains concentrated in configurations optimizing retrieval quality and context cleanliness. Statistical evidence extraction remains substantially harder - even with oracle paragraphs, performance remains moderate, indicating persistent extractor limitations in handling hybrid numeric-textual statements.
Conclusion: While targeted retrieval improves hypothesis extraction, statistical evidence extraction presents distinct challenges requiring specialized extractor capabilities for hybrid numeric-textual content, beyond just better retrieval.
Abstract: Extracting hypotheses and their supporting statistical evidence from full-text scientific articles is central to the synthesis of empirical findings, but remains difficult due to document length and the distribution of scientific arguments across sections of the paper. The work studies a sequential full-text extraction setting, where the statement of a primary finding in an article’s abstract is linked to (i) a corresponding hypothesis statement in the paper body and (ii) the statistical evidence that supports or refutes that hypothesis. This formulation induces a challenging within-document retrieval setting in which many candidate paragraphs are topically related to the finding but differ in rhetorical role, creating hard negatives for retrieval and extraction. Using a two-stage retrieve-and-extract framework, we conduct a controlled study of retrieval design choices, varying context quantity, context quality (standard Retrieval Augmented Generation, reranking, and a fine-tuned retriever paired with reranking), as well as an oracle paragraph setting to separate retrieval failures from extraction limits across four Large Language Model extractors. We find that targeted context selection consistently improves hypothesis extraction relative to full-text prompting, with gains concentrated in configurations that optimize retrieval quality and context cleanliness. In contrast, statistical evidence extraction remains substantially harder. Even with oracle paragraphs, performance remains moderate, indicating persistent extractor limitations in handling hybrid numeric-textual statements rather than retrieval failures alone.
[61] Graph Fusion Across Languages using Large Language Models
Kaung Myat Kyaw, Khush Agarwal, Jonathan Chan
Main category: cs.CL
TL;DR: LLM-based framework for cross-lingual knowledge graph fusion using natural language linearization to resolve semantic heterogeneity across languages
Details
Motivation: Addressing the challenge of combining multiple knowledge graphs across linguistic boundaries due to semantic heterogeneity and graph complexity, requiring a scalable solution for continuous knowledge synthesis in multilingual environments.
Method: Proposes a cross-lingual graph fusion framework leveraging LLMs’ in-context reasoning and multilingual semantic priors. Uses structural linearization by mapping triplets to natural language sequences ([head] [relation] [tail]), enabling LLMs to map relations and reconcile entities between evolving fused graph and new candidate graphs.
Result: Evaluated on DBP15K dataset, demonstrates LLMs can serve as universal semantic bridge to resolve cross-lingual discrepancies. Shows successful sequential agglomeration of multiple heterogeneous graphs
Conclusion: LLMs offer scalable, modular solution for continuous knowledge synthesis in multi-source, multilingual environments, effectively bridging cross-lingual semantic gaps in knowledge graph fusion
Abstract: Combining multiple knowledge graphs (KGs) across linguistic boundaries is a persistent challenge due to semantic heterogeneity and the complexity of graph environments. We propose a framework for cross-lingual graph fusion, leveraging the in-context reasoning and multilingual semantic priors of Large Language Models (LLMs). The framework implements structural linearization by mapping triplets directly into natural language sequences (e.g., [head] [relation] [tail]), enabling the LLM to map relations and reconcile entities between an evolving fused graph ($G_{c}^{(t-1)}$) and a new candidate graph ($G_{t}$). Evaluated on the DBP15K dataset, this exploratory study demonstrates that LLMs can serve as a universal semantic bridge to resolve cross-lingual discrepancies. Results show the successful sequential agglomeration of multiple heterogeneous graphs, offering a scalable, modular solution for continuous knowledge synthesis in multi-source, multilingual environments.
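The structural linearization step is simple enough to sketch: each (head, relation, tail) triplet is rendered as a flat "[head] [relation] [tail]" string so both graphs can be placed in a single LLM prompt. The prompt wording and graph contents below are hypothetical placeholders:

```python
def linearize(triplets):
    """Render KG triplets as the paper's [head] [relation] [tail] sequences."""
    return "\n".join(f"[{h}] [{r}] [{t}]" for h, r, t in triplets)

# Toy cross-lingual pair: the same fact in an English and a Spanish KG.
fused_graph = [("Berlin", "capital_of", "Germany")]
candidate_graph = [("Berlín", "capital_de", "Alemania")]

prompt = (
    "Reconcile entities and map relations between the two graphs below.\n"
    "Fused graph:\n" + linearize(fused_graph) +
    "\nCandidate graph:\n" + linearize(candidate_graph)
)
print(prompt)
```

The LLM's multilingual priors are then relied on to recognize that `Berlín`/`Berlin` and `capital_de`/`capital_of` denote the same entity and relation.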
[62] Conversation Tree Architecture: A Structured Framework for Context-Aware Multi-Branch LLM Conversations
Pranav Hemanth, Sampriti Saha
Main category: cs.CL
TL;DR: CTA organizes LLM conversations as hierarchical trees with isolated context nodes to prevent logical context poisoning in multi-topic conversations.
Details
Motivation: Current LLM conversation interfaces use flat, append-only structures where all context accumulates in a single unbounded window, causing topically distinct threads to bleed into each other and degrade response quality (logical context poisoning).
Method: Introduces Conversation Tree Architecture (CTA) - a hierarchical framework organizing conversations as trees of discrete, context-isolated nodes. Each node maintains its own local context window with structured mechanisms for context flow between parent and child nodes (downstream on creation, upstream on deletion). Includes volatile nodes for transient branches.
Result: Formalizes architecture primitives, characterizes design problems in context flow, relates framework to prior LLM memory management work, and describes a working prototype implementation.
Conclusion: CTA provides principled foundation for structured conversational context management and extends naturally to multi-agent settings, addressing the fundamental limitation of flat conversation interfaces.
Abstract: Large language models (LLMs) are increasingly deployed for extended, multi-topic conversations, yet the flat, append-only structure of current conversation interfaces introduces a fundamental limitation: all context accumulates in a single unbounded window, causing topically distinct threads to bleed into one another and progressively degrade response quality. We term this failure mode logical context poisoning. In this paper, we introduce the Conversation Tree Architecture (CTA), a hierarchical framework that organizes LLM conversations as trees of discrete, context-isolated nodes. Each node maintains its own local context window; structured mechanisms govern how context flows between parent and child nodes, downstream on branch creation and upstream on branch deletion. We additionally introduce volatile nodes, transient branches whose local context must be selectively merged upward or permanently discarded before purging. We formalize the architecture’s primitives, characterize the open design problems in context flow, relate our framework to prior work in LLM memory management, and describe a working prototype implementation. The CTA provides a principled foundation for structured conversational context management and extends naturally to multi-agent settings.
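A minimal data-structure sketch of a CTA node, assuming (our reading, not the paper's API) that downstream flow means a child branch starts from a copy of the parent's context, and upstream flow means a volatile branch may merge its novel suffix into the parent before deletion:

```python
class ConversationNode:
    """Hypothetical CTA node: one context-isolated unit of the conversation tree."""

    def __init__(self, parent=None, volatile=False, seed_context=None):
        self.parent = parent
        self.volatile = volatile
        self.context = list(seed_context or [])  # local context window
        self.children = []

    def branch(self, volatile=False):
        # Downstream flow: the child inherits a copy of this node's context.
        child = ConversationNode(parent=self, volatile=volatile,
                                 seed_context=self.context)
        self.children.append(child)
        return child

    def delete(self, merge_upstream=False):
        # Upstream flow: optionally push this branch's novel turns to the
        # parent before purging; otherwise the branch context is discarded.
        if merge_upstream and self.parent is not None:
            self.parent.context.extend(self.context[len(self.parent.context):])
        if self.parent is not None:
            self.parent.children.remove(self)

root = ConversationNode(seed_context=["user: plan a trip"])
side = root.branch(volatile=True)          # isolated tangent
side.context.append("user: unrelated math question")
side.delete(merge_upstream=False)          # discarded; no poisoning of root
print(root.context)
```

The point of the isolation is visible at the end: the tangent's turns never enter the root node's window, which is the failure mode the paper calls logical context poisoning.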
[63] More Than Sum of Its Parts: Deciphering Intent Shifts in Multimodal Hate Speech Detection
Runze Sun, Yu Zheng, Zexuan Xiong, Zhongjin Qu, Lei Chen, Jiwen Lu, Jie Zhou
Main category: cs.CL
TL;DR: Proposes H-VLI benchmark and ARCADE framework for detecting implicit multimodal hate speech by analyzing semantic intent shifts through modality interplay, outperforming SOTA on challenging cases.
Details
Motivation: Hate speech is evolving from plain text to complex multimodal expressions where emergent meaning transcends individual modalities, making implicit attacks harder to detect. Current systems struggle with these subtle cases where modalities interact to construct implicit hate from benign cues or neutralize toxicity.
Method: 1) Curated Hate via Vision-Language Interplay (H-VLI) benchmark focusing on cases where true intent hinges on intricate modality interplay rather than overt slurs. 2) Proposed Asymmetric Reasoning via Courtroom Agent DEbate (ARCADE) framework simulating judicial process where agents argue for accusation and defense, forcing deep semantic scrutiny before verdict.
Result: ARCADE significantly outperforms state-of-the-art baselines on H-VLI benchmark, particularly for challenging implicit cases, while maintaining competitive performance on established benchmarks.
Conclusion: The work addresses critical gap in multimodal hate speech detection by focusing on semantic intent shifts through modality interplay, providing both benchmark and framework for better understanding complex multimodal expressions in online content moderation.
Abstract: Combating hate speech on social media is critical for securing cyberspace, yet relies heavily on the efficacy of automated detection systems. As content formats evolve, hate speech is transitioning from solely plain text to complex multimodal expressions, making implicit attacks harder to spot. Current systems, however, often falter on these subtle cases, as they struggle with multimodal content where the emergent meaning transcends the aggregation of individual modalities. To bridge this gap, we move beyond binary classification to characterize semantic intent shifts where modalities interact to construct implicit hate from benign cues or neutralize toxicity through semantic inversion. Guided by this fine-grained formulation, we curate the Hate via Vision-Language Interplay (H-VLI) benchmark where the true intent hinges on the intricate interplay of modalities rather than overt visual or textual slurs. To effectively decipher these complex cues, we further propose the Asymmetric Reasoning via Courtroom Agent DEbate (ARCADE) framework. By simulating a judicial process where agents actively argue for accusation and defense, ARCADE forces the model to scrutinize deep semantic cues before reaching a verdict. Extensive experiments demonstrate that ARCADE significantly outperforms state-of-the-art baselines on H-VLI, particularly for challenging implicit cases, while maintaining competitive performance on established benchmarks. Our code and data are available at: https://github.com/Sayur1n/H-VLI
[64] Enhancing reasoning accuracy in large language models during inference time
Vinay Sharma, Manish Jain
Main category: cs.CL
TL;DR: Systematic evaluation of three inference-time strategies to improve LLM reasoning accuracy: self-consistency via stochastic decoding, dual-model reasoning agreement, and self-reflection.
Details
Motivation: LLMs show strong linguistic abilities but remain unreliable on multi-step reasoning tasks without additional training/fine-tuning. Need inference-time techniques to improve reasoning accuracy.
Method: Three inference-time strategies evaluated: (1) self-consistency via stochastic decoding with temperature/nucleus sampling, (2) dual-model reasoning agreement comparing outputs from two independent models, (3) self-reflection where model critiques/revises its own reasoning. All use Chain-of-Thought prompting.
Result: Self-consistency with nucleus sampling and controlled temperature yields substantial gains (9-15% absolute improvement over greedy single-pass decoding). Dual-model approach provides additional confirmation for moderate-risk domains. Self-reflection offers only marginal improvements.
Conclusion: Self-consistency is well-suited for low-risk domains with minimal overhead, dual-model approach for moderate-risk domains requiring higher reliability, and self-reflection has limited effectiveness for smaller non-reasoning models at inference time.
Abstract: Large Language Models (LLMs) often exhibit strong linguistic abilities while remaining unreliable on multi-step reasoning tasks, particularly when deployed without additional training or fine-tuning. In this work, we study inference-time techniques to improve the reasoning accuracy of LLMs. We systematically evaluate three classes of inference-time strategies: (i) self-consistency via stochastic decoding, where the model is sampled multiple times using controlled temperature and nucleus sampling and the most frequent final answer is selected; (ii) dual-model reasoning agreement, where outputs from two independent models are compared and only consistent reasoning traces are trusted; and (iii) self-reflection, where the model critiques and revises its own reasoning. Across all evaluated methods, we employ Chain-of-Thought (CoT) [1] prompting to elicit explicit intermediate reasoning steps before generating final answers. We provide a controlled comparative evaluation across the three inference-time strategies under identical prompting and verification settings. Our experiments on LLM [2] show that self-consistency with nucleus sampling and a controlled temperature value yields substantial gains, achieving a 9% to 15% absolute improvement in accuracy over greedy single-pass decoding, well-suited for low-risk domains, offering meaningful gains with minimal overhead. The dual-model approach provides additional confirmation for model reasoning steps and is thus more appropriate for moderate-risk domains, where higher reliability justifies additional compute. Self-reflection offers only marginal improvements, suggesting limited effectiveness for smaller non-reasoning models at inference time.
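The self-consistency strategy described above reduces to a majority vote over sampled completions. A sketch under the assumption that `sample_completion` stands in for a model call with non-zero temperature and nucleus sampling (both names are ours):

```python
from collections import Counter

def self_consistency(sample_completion, extract_answer, n_samples=10):
    """Sample n chain-of-thought completions and return the majority answer."""
    votes = Counter()
    for _ in range(n_samples):
        votes[extract_answer(sample_completion())] += 1
    return votes.most_common(1)[0][0]

# Deterministic toy demo: a "model" whose sampled final answers are
# usually, but not always, correct.
samples = iter(["42", "41", "42", "42", "41", "42", "42", "42", "42", "41"])
print(self_consistency(lambda: next(samples), lambda x: x))  # majority: "42"
```

The vote only helps when errors are uncorrelated across samples, which is why the paper's gains depend on the temperature and nucleus-sampling settings.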
[65] TimeTox: An LLM-Based Pipeline for Automated Extraction of Time Toxicity from Clinical Trial Protocols
Saketh Vinjamuri, Marielle Fis Loperena, Marie C. Spezia, Ramez Kouzy
Main category: cs.CL
TL;DR: TimeTox: LLM-based pipeline using Gemini models to automatically extract time toxicity metrics from clinical trial protocol documents, with two-stage architecture showing best performance on real-world data.
Details
Motivation: Time toxicity (cumulative healthcare contact days) is important for clinical trial evaluation but labor-intensive to extract manually from protocol documents. Need automated extraction methods.
Method: Three-stage pipeline using Google’s Gemini models: 1) summary extraction from PDFs, 2) time toxicity quantification at six timepoints per treatment arm, 3) multi-run consensus via position-based arm matching. Compared single-pass vs two-stage (structure-then-count) architectures.
Result: Two-stage pipeline achieved 100% clinically acceptable accuracy (±3 days) on synthetic data (MAE 0.81 days) vs 41.5% for vanilla (MAE 9.0 days). On real-world protocols, vanilla pipeline showed superior reproducibility: 95.3% clinically acceptable accuracy across 3 runs on 644 protocols, with 82.0% perfect stability.
Conclusion: Extraction stability on real-world data, rather than accuracy on synthetic benchmarks, is decisive for production LLM deployment. Successfully extracted time toxicity for 1,288 treatment arms across multiple disease sites.
Abstract: Time toxicity, the cumulative healthcare contact days from clinical trial participation, is an important but labor-intensive metric to extract from protocol documents. We developed TimeTox, an LLM-based pipeline for automated extraction of time toxicity from Schedule of Assessments tables. TimeTox uses Google’s Gemini models in three stages: summary extraction from full-length protocol PDFs, time toxicity quantification at six cumulative timepoints for each treatment arm, and multi-run consensus via position-based arm matching. We validated against 20 synthetic schedules (240 comparisons) and assessed reproducibility on 644 real-world oncology protocols. Two architectures were compared: single-pass (vanilla) and two-stage (structure-then-count). The two-stage pipeline achieved 100% clinically acceptable accuracy ($\pm$3 days) on synthetic data (MAE 0.81 days) versus 41.5% for vanilla (MAE 9.0 days). However, on real-world protocols, the vanilla pipeline showed superior reproducibility: 95.3% clinically acceptable accuracy (IQR $\leq$ 3 days) across 3 runs on 644 protocols, with 82.0% perfect stability (IQR = 0). The production pipeline extracted time toxicity for 1,288 treatment arms across multiple disease sites. Extraction stability on real-world data, rather than accuracy on synthetic benchmarks, is the decisive factor for production LLM deployment.
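The two evaluation metrics quoted above are easy to make concrete: mean absolute error in days, and "clinically acceptable accuracy" as the share of extracted values within ±3 days of the reference. The 3-day tolerance is from the paper; function names and data are ours:

```python
def mae_days(predicted, reference):
    """Mean absolute error between extracted and reference contact-day counts."""
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(reference)

def clinically_acceptable(predicted, reference, tolerance=3):
    """Fraction of extractions within +/- tolerance days of the reference."""
    ok = sum(abs(p - r) <= tolerance for p, r in zip(predicted, reference))
    return ok / len(reference)

pred = [10, 12, 20, 33]   # hypothetical extracted contact days per timepoint
ref  = [10, 14, 25, 32]   # hypothetical manual reference values
print(mae_days(pred, ref))               # (0 + 2 + 5 + 1) / 4 = 2.0
print(clinically_acceptable(pred, ref))  # 3 of 4 within 3 days = 0.75
```

The paper's reproducibility analysis applies the same tolerance to the spread across runs (IQR ≤ 3 days) rather than to error against a reference.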
[66] Beyond Memorization: Distinguishing between Reductive and Epistemic Reasoning in LLMs using Classic Logic Puzzles
Adi Gabay, Gabriel Stanovsky, Liat Peterfreund
Main category: cs.CL
TL;DR: LLMs’ performance on epistemic puzzles is better understood through a “reduction ladder” framework rather than a simple memorization vs. reasoning dichotomy, showing models struggle with true epistemic reasoning.
Details
Motivation: To better understand how LLMs handle epistemic reasoning tasks by moving beyond the oversimplified dichotomy of memorization vs. reasoning, and instead examining how models reduce new instances to known problems.
Method: Introduces a “reduction ladder” - a sequence of modifications that progressively move instances away from canonical epistemic puzzles while preserving the underlying logic, making reduction increasingly difficult.
Result: While some large models succeed via reduction, others fail early, and all models struggle once true epistemic reasoning is required rather than just pattern matching.
Conclusion: Current LLMs’ performance on epistemic puzzles is better explained by their ability to reduce problems to known patterns rather than true reasoning, highlighting limitations in their epistemic reasoning capabilities.
Abstract: Epistemic reasoning requires agents to infer the state of the world from partial observations and information about other agents’ knowledge. Prior work evaluating LLMs on canonical epistemic puzzles interpreted their behavior through a dichotomy between epistemic reasoning and brittle memorization. We argue that this framing is incomplete: in recent models, memorization is better understood as a special case of reduction, where a new instance is mapped onto a known problem. Instead, we introduce a reduction ladder, a sequence of modifications that progressively move instances away from a canonical epistemic puzzle, making reduction increasingly difficult while preserving the underlying logic. We find that while some large models succeed via reduction, other models fail early, and all models struggle once epistemic reasoning is required.
[67] Benchmarking Bengali Dialectal Bias: A Multi-Stage Framework Integrating RAG-Based Translation and Human-Augmented RLAIF
K. M. Jubair Sami, Dipto Sumit, Ariyan Hossain, Farig Sadeque
Main category: cs.CL
TL;DR: Proposes a two-phase framework to evaluate dialectal bias in LLMs for Bengali dialects, including translation quality assessment using LLM-as-judge and benchmarking 19 LLMs across 9 dialects with 68,395 evaluations.
Details
Motivation: LLMs frequently exhibit performance biases against regional dialects of low-resource languages, but frameworks to quantify these disparities remain scarce, particularly for unstandardized dialects where traditional translation metrics fail.
Method: Two-phase framework: 1) Translate and gold-label standard Bengali questions into dialectal variants using RAG pipeline (4,000 question sets), evaluate translation fidelity using LLM-as-judge validated by human correlation; 2) Benchmark 19 LLMs across gold-labeled sets with 68,395 RLAIF evaluations validated through multi-judge agreement and human fallback.
Result: Reveals severe performance drops linked to linguistic divergence (e.g., Chittagong dialect scores 5.44/10 vs. Tangail’s 7.68/10). Increased model scale does not consistently mitigate bias. Provides validated translation quality evaluation method, benchmark dataset, and Critical Bias Sensitivity metric.
Conclusion: Establishes comprehensive framework for quantifying dialectal bias in LLMs, demonstrates significant performance disparities across Bengali dialects, and provides tools for safety-critical applications while showing that model scaling alone doesn’t solve dialect bias.
Abstract: Large language models (LLMs) frequently exhibit performance biases against regional dialects of low-resource languages. However, frameworks to quantify these disparities remain scarce. We propose a two-phase framework to evaluate dialectal bias in LLM question-answering across nine Bengali dialects. First, we translate and gold-label standard Bengali questions into dialectal variants adopting a retrieval-augmented generation (RAG) pipeline to prepare 4,000 question sets. Since traditional translation quality evaluation metrics fail on unstandardized dialects, we evaluate fidelity using an LLM-as-a-judge, which human correlation confirms outperforms legacy metrics. Second, we benchmark 19 LLMs across these gold-labeled sets, running 68,395 RLAIF evaluations validated through multi-judge agreement and human fallback. Our findings reveal severe performance drops linked to linguistic divergence. For instance, responses to the highly divergent Chittagong dialect score 5.44/10, compared to 7.68/10 for Tangail. Furthermore, increased model scale does not consistently mitigate this bias. We contribute a validated translation quality evaluation method, a rigorous benchmark dataset, and a Critical Bias Sensitivity (CBS) metric for safety-critical applications.
[68] Conspiracy Frame: a Semiotically-Driven Approach for Conspiracy Theories Detection
Heidi Campana Piva, Shaina Ashraf, Maziar Kianimoghadam Jouneghani, Arianna Longo, Rossana Damiano, Lucie Flek, Marco Antonio Stranisci
Main category: cs.CL
TL;DR: Introduces Conspiracy Frame semantic representation and Con.Fra. dataset for analyzing conspiracy theories in Telegram messages, with experiments on LLMs’ ability to detect conspiratorial narratives using frame-semantic approaches.
Details
Motivation: Conspiracy theories create social conflict and affect political information perception; need better tools for understanding and detecting conspiratorial narratives through fine-grained semantic analysis.
Method: Developed Conspiracy Frame based on frame-semantics and semiotics; created Con.Fra. dataset of Telegram messages with span-level annotations; tested LLMs' ability to recognize conspiracy theories with and without frame injection; mapped annotated spans to FrameNet.
Result: Frame injection in in-context learning doesn’t significantly boost performance but shows potential; FrameNet mapping reveals abstract semantic patterns (e.g., ‘Kinship’, ‘Ingest_substance’) that could enable more semantically-aware conspiracy detection.
Conclusion: Conspiracy Frame and Con.Fra. dataset advance understanding of conspiratorial narratives; frame-semantic approaches show promise for developing more generalizable conspiracy theory detection systems.
Abstract: Conspiracy theories are anti-authoritarian narratives that lead to social conflict, impacting how people perceive political information. To help in understanding this issue, we introduce the Conspiracy Frame: a fine-grained semantic representation of conspiratorial narratives derived from frame-semantics and semiotics, which spawned the Conspiracy Frames (Con.Fra.) dataset: a corpus of Telegram messages annotated at span level. The Conspiracy Frame and Con.Fra. dataset contribute to the implementation of a more generalizable understanding and recognition of conspiracy theories. We observe the ability of LLMs to recognize this phenomenon in-domain and out-of-domain, investigating the role that frames may have in supporting this task. Results show that, while the injection of frames in an in-context approach does not lead to a clear increase in performance, it has potential; the mapping of annotated spans to FrameNet shows abstract semantic patterns (e.g., 'Kinship', 'Ingest_substance') that potentially pave the way for a more semantically- and semiotically-aware detection of conspiratorial narratives.
[69] Task-Specific Efficiency Analysis: When Small Language Models Outperform Large Language Models
Jinghan Cao, Yu Ma, Xinjin Li, Qingyang Ren, Xiangyun Chen
Main category: cs.CL
TL;DR: Small language models (0.5-3B parameters) achieve better efficiency-performance balance than larger models across diverse NLP tasks, as measured by a novel Performance-Efficiency Ratio metric.
Details
Motivation: Large Language Models have impressive performance but high computational costs that make them unsuitable for resource-constrained deployments. There's a need to understand the efficiency trade-offs of different model sizes for practical applications.
Method: Comprehensive task-specific efficiency analysis comparing 16 language models across five diverse NLP tasks. Introduced the Performance-Efficiency Ratio (PER), a novel metric integrating accuracy, throughput, memory, and latency through geometric mean normalization.
Result: Small models (0.5-3B parameters) achieve superior PER scores across all tasks. These models provide the best efficiency-performance balance for production environments.
Conclusion: Small language models are quantitatively better suited for production deployments prioritizing inference efficiency over marginal accuracy gains. The PER metric provides a systematic way to evaluate model efficiency trade-offs.
Abstract: Large Language Models achieve remarkable performance but incur substantial computational costs unsuitable for resource-constrained deployments. This paper presents the first comprehensive task-specific efficiency analysis comparing 16 language models across five diverse NLP tasks. We introduce the Performance-Efficiency Ratio (PER), a novel metric integrating accuracy, throughput, memory, and latency through geometric mean normalization. Our systematic evaluation reveals that small models (0.5–3B parameters) achieve superior PER scores across all given tasks. These findings establish quantitative foundations for deploying small models in production environments prioritizing inference efficiency over marginal accuracy gains.
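The abstract names the PER's ingredients (accuracy, throughput, memory, latency, combined via geometric mean normalization) but not its exact formula. A minimal sketch of one plausible reading, with hypothetical model numbers; this is an illustration, not the paper's definition:

```python
# Hypothetical PER sketch: min-max normalize each metric so higher is better
# (inverting memory and latency, where lower is better), then take the
# geometric mean across the four dimensions.
from math import prod

def normalize(values, higher_is_better=True):
    """Min-max normalize; invert metrics where lower is better."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0
    scaled = [(v - lo) / span for v in values]
    if not higher_is_better:
        scaled = [1.0 - s for s in scaled]
    eps = 1e-6  # keep values positive for the geometric mean
    return [s + eps for s in scaled]

def per_scores(models):
    """models: dict name -> (accuracy, throughput, memory_gb, latency_s)."""
    names = list(models)
    cols = list(zip(*models.values()))
    acc = normalize(cols[0], True)
    thr = normalize(cols[1], True)
    mem = normalize(cols[2], False)
    lat = normalize(cols[3], False)
    return {n: prod(m) ** 0.25  # geometric mean of 4 normalized metrics
            for n, m in zip(names, zip(acc, thr, mem, lat))}

# Illustrative numbers only: a small model with modest accuracy but high
# throughput and a small footprint, versus a large, slow, accurate one.
models = {
    "small-1B": (0.78, 120.0, 4.0, 0.05),
    "large-70B": (0.85, 8.0, 140.0, 0.90),
}
scores = per_scores(models)
```

Under this reading the small model wins on PER despite lower accuracy, which is the trade-off the paper quantifies.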
[70] Multi-Perspective LLM Annotations for Valid Analyses in Subjective Tasks
Navya Mehrotra, Adam Visokay, Kristina Gligorić
Main category: cs.CL
TL;DR: PDI method improves LLM annotation by focusing on demographic group perspectives using adaptive human sampling
Details
Motivation: LLMs reflect some human perspectives better than others, and existing correction methods assume a single ground truth, failing for subjective tasks where demographic disagreement matters.
Method: Perspective-Driven Inference treats the distribution of annotations across groups as the quantity of interest, and uses an adaptive sampling strategy to concentrate human annotation on groups where LLM proxies are least accurate.
Result: Shows targeted improvements for harder-to-model demographic groups relative to uniform sampling baselines on politeness and offensiveness rating tasks while maintaining coverage
Conclusion: PDI provides effective approach for capturing diverse human perspectives in subjective annotation tasks using limited human annotation budget
Abstract: Large language models are increasingly used to annotate texts, but their outputs reflect some human perspectives better than others. Existing methods for correcting LLM annotation error assume a single ground truth. However, this assumption fails in subjective tasks where disagreement across demographic groups is meaningful. Here we introduce Perspective-Driven Inference, a method that treats the distribution of annotations across groups as the quantity of interest, and estimates it using a small human annotation budget. We contribute an adaptive sampling strategy that concentrates human annotation effort on groups where LLM proxies are least accurate. We evaluate on politeness and offensiveness rating tasks, showing targeted improvements for harder-to-model demographic groups relative to uniform sampling baselines, while maintaining coverage.
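The adaptive-allocation idea can be sketched as a simple budget allocator; the proportional rule, function names, and pilot numbers below are illustrative, not the paper's estimator:

```python
# Hedged sketch of PDI's adaptive sampling: spend a fixed human-annotation
# budget preferentially on demographic groups where the LLM proxy disagrees
# most with a small pilot sample of human labels.

def allocate_budget(proxy_error, budget, min_per_group=1):
    """proxy_error: dict group -> estimated LLM error rate on pilot labels.

    Returns a dict group -> number of human annotations, proportional to
    estimated proxy error, with a floor to maintain coverage of every group.
    """
    total = sum(proxy_error.values()) or 1.0
    return {g: max(min_per_group, round(budget * e / total))
            for g, e in proxy_error.items()}

# Toy pilot: the proxy is weakest on group_b, so it gets most of the budget.
pilot_error = {"group_a": 0.05, "group_b": 0.30, "group_c": 0.15}
plan = allocate_budget(pilot_error, budget=100)
```

The `min_per_group` floor mirrors the paper's stated goal of targeting hard-to-model groups "while maintaining coverage".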
[71] Efficient Fine-Tuning Methods for Portuguese Question Answering: A Comparative Study of PEFT on BERTimbau and Exploratory Evaluation of Generative LLMs
Mariela M. Nina, Caio Veloso Costa, Lilian Berton, Didier A. Vega-Oliveros
Main category: cs.CL
TL;DR: Systematic evaluation of PEFT and quantization techniques for Brazilian Portuguese QA using BERTimbau, showing efficient fine-tuning with LoRA achieving 95.8% of baseline performance while reducing training time by 73.5%.
Details
Motivation: Address computational cost barriers for low-resource languages like Brazilian Portuguese by evaluating parameter-efficient fine-tuning and quantization techniques to make NLP more accessible and sustainable.
Method: Evaluated 40 configurations combining four PEFT methods (LoRA, DoRA, QLoRA, QDoRA) across two BERTimbau model sizes (Base: 110M, Large: 335M parameters) on SQuAD-BR dataset for extractive question answering.
Result: LoRA achieved 95.8% of baseline performance on BERTimbau-Large while reducing training time by 73.5%; higher learning rates (2e-4) improved PEFT performance by up to +19.71 F1 points; larger models showed twice the quantization resilience.
Conclusion: Encoder-based models can be efficiently fine-tuned for Brazilian Portuguese QA with substantially lower computational cost than large generative LLMs, promoting sustainable Green AI approaches while maintaining competitive performance.
Abstract: Although large language models have transformed natural language processing, their computational costs create accessibility barriers for low-resource languages such as Brazilian Portuguese. This work presents a systematic evaluation of Parameter-Efficient Fine-Tuning (PEFT) and quantization techniques applied to BERTimbau for Question Answering on SQuAD-BR, the Brazilian Portuguese translation of SQuAD v1. We evaluate 40 configurations combining four PEFT methods (LoRA, DoRA, QLoRA, QDoRA) across two model sizes (Base: 110M, Large: 335M parameters). Our findings reveal three critical insights: (1) LoRA achieves 95.8% of baseline performance on BERTimbau-Large while reducing training time by 73.5% (F1=81.32 vs 84.86); (2) higher learning rates (2e-4) substantially improve PEFT performance, with F1 gains of up to +19.71 points over standard rates; and (3) larger models show twice the quantization resilience (loss of 4.83 vs 9.56 F1 points). These results demonstrate that encoder-based models can be efficiently fine-tuned for extractive Brazilian Portuguese QA with substantially lower computational cost than large generative LLMs, promoting more sustainable approaches aligned with Green AI principles. An exploratory evaluation of Tucano and Sabiá on the same extractive QA benchmark shows that while generative models can reach competitive F1 scores with LoRA fine-tuning, they require up to 4.2× more GPU memory and 3× more training time than BERTimbau-Base, reinforcing the efficiency advantage of smaller encoder-based architectures for this task.
[72] Semantic Shift: the Fundamental Challenge in Text Embedding and Retrieval
Hang Gao, Dimitris N. Metaxas
Main category: cs.CL
TL;DR: The paper analyzes semantic shift as the root cause of embedding collapse in Transformer models, showing that semantic diversity within text (not just length) drives retrieval degradation.
Details
Motivation: Transformer embedding models suffer from geometric pathologies like anisotropy and length-induced embedding collapse, but existing work only describes what these look like without explaining when and why they harm downstream retrieval performance.
Method: Theoretical analysis of semantic smoothing in Transformer embeddings, formalizing semantic shift as a computable measure integrating local semantic evolution and global semantic dispersion. Controlled experiments across corpora and multiple embedding models.
Result: Semantic shift aligns closely with embedding concentration severity and predicts retrieval degradation, while text length alone does not. Semantic shift provides a unified lens for understanding embedding collapse.
Conclusion: Semantic shift (not just text length) is the key causal factor explaining when and why embedding pathologies harm retrieval, offering an actionable diagnostic tool.
Abstract: Transformer-based embedding models rely on pooling to map variable-length text into a single vector, enabling efficient similarity search but also inducing well-known geometric pathologies such as anisotropy and length-induced embedding collapse. Existing accounts largely describe \emph{what} these pathologies look like, yet provide limited insight into \emph{when} and \emph{why} they harm downstream retrieval. In this work, we argue that the missing causal factor is \emph{semantic shift}: the intrinsic, structured evolution and dispersion of semantics within a text. We first present a theoretical analysis of \emph{semantic smoothing} in Transformer embeddings: as the semantic diversity among constituent sentences increases, the pooled representation necessarily shifts away from every individual sentence embedding, yielding a smoothed and less discriminative vector. Building on this foundation, we formalize semantic shift as a computable measure integrating local semantic evolution and global semantic dispersion. Through controlled experiments across corpora and multiple embedding models, we show that semantic shift aligns closely with the severity of embedding concentration and predicts retrieval degradation, whereas text length alone does not. Overall, semantic shift offers a unified and actionable lens for understanding embedding collapse and for diagnosing when anisotropy becomes harmful.
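The semantic-smoothing argument can be illustrated with a toy pooled-embedding experiment; random vectors stand in for sentence embeddings, and this is an illustration of the intuition, not the paper's formal semantic-shift measure:

```python
# Toy demonstration: as sentence embeddings within a text become more
# diverse, the mean-pooled document vector drifts away from every
# individual sentence vector, yielding a smoothed representation.
import numpy as np

def mean_pool_shift(sentence_embs):
    """Average cosine distance from the pooled vector to each sentence."""
    X = np.asarray(sentence_embs, dtype=float)
    pooled = X.mean(axis=0)
    cos = X @ pooled / (np.linalg.norm(X, axis=1) * np.linalg.norm(pooled))
    return float(np.mean(1.0 - cos))

rng = np.random.default_rng(0)
base = rng.normal(size=64)
# Homogeneous text: sentences are small perturbations of one topic vector.
homogeneous = [base + 0.1 * rng.normal(size=64) for _ in range(8)]
# Diverse text: sentences drawn from unrelated directions.
diverse = [rng.normal(size=64) for _ in range(8)]
```

Running this, the shift for the diverse text is far larger than for the homogeneous one, even though both "documents" have identical length, matching the paper's claim that semantic diversity, not length, drives the effect.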
[73] PROMPT2BOX: Uncovering Entailment Structure among LLM Prompts
Neeladri Bhuiya, Shib Sankar Dasgupta, Andrew McCallum, Haw-Shiuan Chang
Main category: cs.CL
TL;DR: PROMPT2BOX embeds prompts into box embeddings to capture both semantic similarity and specificity relations, enabling better LLM weakness analysis than vector embeddings.
Details
Motivation: Current vector embeddings for prompt analysis primarily capture topical similarity but fail to distinguish prompts that share topics but differ in specificity/difficulty, making fine-grained weakness analysis of LLMs difficult.
Method: Proposes PROMPT2BOX which embeds prompts into a box embedding space using a trained encoder. The encoder is trained on existing and synthesized datasets to output box embeddings that capture semantic similarity and specificity relations. Also develops a novel dimension reduction technique for box embeddings for visualization.
Result: Box embeddings consistently capture prompt specificity better than vector baselines. On hierarchical clustering tasks for 17 LLMs from UltraFeedback dataset, PROMPT2BOX identifies 8.9% more LLM weaknesses than vector baselines and achieves ~33% stronger correlation between hierarchical depth and instruction specificity.
Conclusion: PROMPT2BOX provides a more nuanced representation of prompts that captures specificity relations, enabling better analysis of LLM weaknesses compared to traditional vector embeddings.
Abstract: To discover the weaknesses of LLMs, researchers often embed prompts into a vector space and cluster them to extract insightful patterns. However, vector embeddings primarily capture topical similarity. As a result, prompts that share a topic but differ in specificity, and consequently in difficulty, are often represented similarly, making fine-grained weakness analysis difficult. To address this limitation, we propose PROMPT2BOX, which embeds prompts into a box embedding space using a trained encoder. The encoder, trained on existing and synthesized datasets, outputs box embeddings that capture not only semantic similarity but also specificity relations between prompts (e.g., “writing an adventure story” is more specific than “writing a story”). We further develop a novel dimension reduction technique for box embeddings to facilitate dataset visualization and comparison. Our experiments demonstrate that box embeddings consistently capture prompt specificity better than vector baselines. On the downstream task of creating hierarchical clustering trees for 17 LLMs from the UltraFeedback dataset, PROMPT2BOX can identify 8.9% more LLM weaknesses than vector baselines and achieves an approximately 33% stronger correlation between hierarchical depth and instruction specificity.
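The specificity-as-containment intuition behind box embeddings can be sketched with hand-picked toy boxes; these 2-D values are not outputs of the paper's trained encoder:

```python
# Sketch of why boxes can encode specificity where point vectors cannot:
# each prompt is an axis-aligned box (per-dimension [lower, upper] interval),
# and "A is more specific than B" reads as A's box lying inside B's.
from dataclasses import dataclass

@dataclass
class Box:
    lower: tuple  # per-dimension lower bounds
    upper: tuple  # per-dimension upper bounds

def contains(outer, inner):
    """True if `inner` lies inside `outer` in every dimension."""
    return all(ol <= il and iu <= ou
               for ol, ou, il, iu in zip(outer.lower, outer.upper,
                                         inner.lower, inner.upper))

# The paper's own example pair: "writing an adventure story" is more
# specific than "writing a story", so its box nests inside the broader one.
write_story = Box(lower=(0.0, 0.0), upper=(1.0, 1.0))
write_adventure_story = Box(lower=(0.2, 0.3), upper=(0.6, 0.8))
```

The asymmetry of containment is the point: two point embeddings can only be near or far, while nested boxes additionally record which prompt is the more specific one.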
[74] KG-Hopper: Empowering Compact Open LLMs with Knowledge Graph Reasoning via Reinforcement Learning
Shuai Wang, Yinan Yu
Main category: cs.CL
TL;DR: KG-Hopper: RL framework enabling compact LLMs to perform integrated multi-hop KG reasoning in single inference round, outperforming larger multi-step systems
Details
Motivation: LLMs struggle with knowledge-intensive reasoning tasks like KBQA that require accurate multi-hop reasoning over KGs. Existing approaches use sequential reasoning with predefined pipelines, causing error cascades and lacking flexibility.
Method: Propose KG-Hopper, a Reinforcement Learning framework that trains a Reasoning LLM to embed entire KG traversal and decision process into unified “thinking” stage, enabling global reasoning over cross-step dependencies with dynamic path exploration and backtracking.
Result: On eight KG reasoning benchmarks, KG-Hopper based on 7B-parameter LLM consistently outperforms larger multi-step systems (up to 70B) and achieves competitive performance with proprietary models like GPT-3.5-Turbo and GPT-4o-mini.
Conclusion: KG-Hopper enables compact open LLMs to perform integrated multi-hop KG reasoning efficiently in single inference round, addressing limitations of sequential approaches while remaining data-efficient.
Abstract: Large Language Models (LLMs) demonstrate impressive natural language capabilities but often struggle with knowledge-intensive reasoning tasks. Knowledge Base Question Answering (KBQA), which leverages structured Knowledge Graphs (KGs), exemplifies this challenge due to the need for accurate multi-hop reasoning. Existing approaches typically perform sequential reasoning steps guided by predefined pipelines, restricting flexibility and causing error cascades due to isolated reasoning at each step. To address these limitations, we propose KG-Hopper, a novel Reinforcement Learning (RL) framework that empowers compact open LLMs with the ability to perform integrated multi-hop KG reasoning within a single inference round. Rather than reasoning step-by-step, we train a Reasoning LLM that embeds the entire KG traversal and decision process into a unified “thinking” stage, enabling global reasoning over cross-step dependencies and dynamic path exploration with backtracking. Experimental results on eight KG reasoning benchmarks show that KG-Hopper, based on a 7B-parameter LLM, consistently outperforms larger multi-step systems (up to 70B) and achieves competitive performance with proprietary models such as GPT-3.5-Turbo and GPT-4o-mini, while remaining compact, open, and data-efficient. The code is publicly available at: https://github.com/Wangshuaiia/KG-Hopper.
[75] Cross-Context Verification: Hierarchical Detection of Benchmark Contamination through Session-Isolated Analysis
Tae-Eun Song
Main category: cs.CL
TL;DR: CCV method detects LLM benchmark contamination by comparing solution diversity across independent sessions, showing contamination is binary and 33% of prior labels are false positives.
Details
Motivation: LLM coding benchmarks suffer from solution leakage and poor test quality, with existing detection methods failing to directly observe whether models reason or recall memorized solutions.
Method: Cross-Context Verification (CCV) solves same benchmark problems in N independent sessions and measures solution diversity, combined with Hierarchical Cross-Context Architecture (HCCA) multi-agent framework that restricts information across specialized roles to prevent confirmation bias.
Result: CCV achieves perfect separation between contaminated and genuine reasoning on SWE-bench Verified problems. Key findings: contamination is binary, reasoning absence perfectly discriminates, 33% of prior contamination labels are false positives, and HCCA discovers contamination-flaw composite cases.
Conclusion: Information restriction, not structural complexity, is key to detecting benchmark contamination. The method provides reliable detection of whether LLMs reason or recall memorized solutions.
Abstract: LLM coding benchmarks face a credibility crisis: widespread solution leakage and test quality issues undermine SWE-bench Verified, while existing detection methods (paraphrase consistency, n-gram overlap, perplexity analysis) never directly observe whether a model reasons or recalls. Meanwhile, simply repeating verification degrades accuracy: multi-turn review generates false positives faster than it discovers true errors, suggesting that structural approaches are needed. We introduce Cross-Context Verification (CCV), a black-box method that solves the same benchmark problem in N independent sessions and measures solution diversity, combined with the Hierarchical Cross-Context Architecture (HCCA), a multi-agent analysis framework that prevents confirmation bias through intentional information restriction across specialized analytical roles. On 9 SWE-bench Verified problems (45 trials, Claude Opus 4.6, temperature 0), CCV achieves perfect separation between contaminated and genuine reasoning (Mann-Whitney U = 0, p ≈ 0.012, r = 1.0). Key findings: (1) contamination is binary: models either recall perfectly or not at all; (2) reasoning absence is a perfect discriminator; (3) 33% of prior contamination labels are false positives; (4) HCCA’s independent analysis structure discovers contamination-flaw composite cases that single-analyst approaches miss. A pilot experiment extending HCCA to multi-stage verification (Worker to Verifier to Director) yields a negative result (100% sycophantic confirmation), providing further evidence that information restriction, not structural complexity, is the key mechanism. We release all code and data.
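The core CCV signal, solution diversity across N independent sessions, can be sketched as follows; the whitespace-normalized distinct count is an illustrative stand-in for the paper's actual diversity measure, and the toy solutions are invented:

```python
# Hedged sketch of the Cross-Context Verification signal: collect the model's
# solution to the same problem from N independent sessions and measure how
# diverse they are. Near-zero diversity (identical output every session) is
# the contamination signature the paper reports.

def solution_diversity(solutions):
    """Fraction of distinct solutions after trivial whitespace normalization.

    0.0 = every session produced the same solution (recall signature);
    1.0 = every session produced a different solution.
    """
    normalized = {" ".join(s.split()) for s in solutions}
    return (len(normalized) - 1) / max(len(solutions) - 1, 1)

memorized = ["def f(x):\n    return x + 1"] * 5  # recalled verbatim each time
reasoned = [f"def f(x):\n    return x + 1  # attempt {i}" for i in range(5)]
```

The paper's finding that "contamination is binary" corresponds to this score clustering at the extremes rather than spreading across the middle.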
[76] DRTriton: Large-Scale Synthetic Data Reinforcement Learning for Triton Kernel Generation
Siqi Guo, Ming Lin, Tianbao Yang
Main category: cs.CL
TL;DR: DRTriton is a framework for training LLMs to convert PyTorch code into optimized Triton kernels that compile to CUDA, using synthetic data generation, curriculum RL, and test-time search to achieve better performance than state-of-the-art LLMs.
Details
Motivation: Developing efficient CUDA kernels is challenging but essential for generative AI. Current LLMs struggle to convert PyTorch to optimized CUDA kernels effectively, requiring a specialized training approach.
Method: Three components: 1) CSP-DAG synthetic data algorithm for full operator coverage with controlled difficulty, 2) curriculum reinforcement learning with decoupled rewards for conversion success and speed optimization, 3) test-time search to further improve inference speed of generated Triton kernels.
Result: DRTriton-7B achieves speedup on 92% of KernelBench Level 2, significantly outperforming GPT-5.2 (23%) and Claude-Sonnet-4.5 (19%). It generalizes well to real-world CUDA kernels despite being trained only on synthetic data.
Conclusion: DRTriton demonstrates that specialized training frameworks can enable LLMs to effectively generate optimized CUDA kernels from PyTorch code, outperforming general-purpose LLMs and showing strong generalization to real-world scenarios.
Abstract: Developing efficient CUDA kernels is a fundamental yet challenging task in the generative AI industry. Recent research leverages Large Language Models (LLMs) to automatically convert PyTorch reference implementations to CUDA kernels, significantly reducing engineering effort. State-of-the-art LLMs, such as GPT-5.2 and Claude-Sonnet-4.5, still struggle with this specific task. To address this challenge, we propose DRTriton, a scalable learning framework for training LLMs to convert PyTorch code into highly optimized Triton kernels, which are then compiled to CUDA kernels at runtime. DRTriton consists of three key components: (i) a synthetic-data algorithm, CSP-DAG, that guarantees full coverage and unbiased uniform sampling over the operator space with controlled difficulty; (ii) curriculum reinforcement learning with a decoupled reward that efficiently optimizes conversion success rate and inference speed simultaneously; and (iii) a test-time search algorithm that further improves the inference speed of the generated Triton kernels. Notably, despite being trained exclusively on synthetic data, DRTriton generalizes effectively to real-world CUDA kernels that are challenging even for human experts. Experimental results show that DRTriton-7B achieves speedup on 92% of KernelBench Level 2, compared to 23% for GPT-5.2 and 19% for Claude-Sonnet-4.5.
[77] Effective Strategies for Asynchronous Software Engineering Agents
Jiayi Geng, Graham Neubig
Main category: cs.CL
TL;DR: CAID introduces a multi-agent coordination paradigm using software engineering primitives (centralized delegation, asynchronous execution, isolated workspaces) to improve long-horizon software engineering tasks through structured collaboration.
Details
Motivation: While AI agents excel at isolated software engineering tasks, they struggle with long-horizon tasks involving multiple interdependent subtasks due to interference, dependency synchronization challenges, and difficulty combining partial progress into coherent solutions.
Method: CAID uses three core SWE primitives: 1) centralized task delegation through a central manager that creates dependency-aware plans, 2) asynchronous execution of subtasks concurrently, and 3) isolated workspaces for each agent. It employs git-based coordination mechanisms (worktree, commit, merge) with executable test-based verification for progress consolidation.
Result: CAID improves accuracy by 26.7% absolute on paper reproduction tasks (PaperBench) and 14.3% on Python library development tasks (Commit0) compared to single-agent baselines. Branch-and-merge is identified as a central coordination mechanism for effective multi-agent collaboration.
Conclusion: Software engineering collaboration primitives like git workflows provide reliable and executable mechanisms for multi-agent coordination, enabling effective asynchronous collaboration on complex, interdependent tasks through structured isolation and integration.
Abstract: AI agents have become increasingly capable at isolated software engineering (SWE) tasks such as resolving issues on Github. Yet long-horizon tasks involving multiple interdependent subtasks still pose challenges both with respect to accuracy, and with respect to timely completion. A natural approach to solving these long-horizon tasks in a timely manner is asynchronous multi-agent collaboration, where multiple agents work on different parts of the task at the same time. But effective application of multi-agent systems has proven surprisingly difficult: concurrent edits by multiple agents interfere with each other, dependencies are difficult to synchronize, and combining partial progress into a coherent whole is challenging. On the other hand, human developers have long relied on mature collaboration infrastructure to manage these challenges in large software projects. Inspired by these collaboration primitives, we introduce Centralized Asynchronous Isolated Delegation (CAID), a structured multi-agent coordination paradigm grounded in three core SWE primitives: centralized task delegation, asynchronous execution, and isolated workspaces. CAID constructs dependency-aware task plans through a central manager, executes subtasks concurrently in isolated workspaces, and consolidates progress via structured integration with executable test-based verification. In empirical evaluation, we find that CAID improves accuracy over single-agent baselines by 26.7% absolute on paper reproduction tasks (PaperBench) and 14.3% on Python library development tasks (Commit0). Through systematic analysis, we find that branch-and-merge is a central coordination mechanism for multi-agent collaboration, and that SWE primitives such as git worktree, git commit, and git merge enable it to be realized in a reliable and executable manner.
[78] Triangulating Temporal Dynamics in Multilingual Swiss Online News
Victor Bros, Evan Dufraisse, Adrian Popescu, Daniel Gatica-Perez
Main category: cs.CL
TL;DR: Multilingual analysis of Swiss digital media across French, German, and Italian regions using quantitative metrics and qualitative insights to study temporal trends and cultural influences on news reporting.
Details
Motivation: To address the lack of comprehensive studies accounting for linguistic and cultural diversity in national media ecosystems, particularly in complex multilingual contexts like Switzerland, and to understand how news coverage varies across different linguistic regions.
Method: Triangulated methodology combining quantitative analyses (lexical metrics, named entity recognition with Wikidata linking, targeted sentiment analysis, consensus-based change-point detection) with qualitative insights on 1.7M+ news articles across three linguistic regions.
Result: Revealed distinct temporal patterns in Swiss digital media and demonstrated how linguistic and cultural contexts influence reporting, with findings spanning thematic, recurrent, and singular events.
Conclusion: The approach provides a framework applicable to other multilingual/culturally diverse media environments, contributing to understanding how news is shaped by linguistic and cultural factors, and demonstrates the usefulness of triangulation in media studies.
Abstract: Analyzing news coverage in multilingual societies can offer valuable insights into the dynamics of public discourse and the development of collective narratives, yet comprehensive studies that account for linguistic and cultural diversity within national media ecosystems remain limited, particularly in complex contexts such as Switzerland. This paper studies temporal trends in Swiss digital media across the country’s three main linguistic regions, French, German, and Italian, using a triangulated methodology that combines quantitative analyses with qualitative insights. We collected and processed over 1.7 million news articles, applying lexical metrics, named entity recognition and Wikidata-based linking, targeted sentiment analysis, and consensus-based change-point detection. To enable principled cross-language comparisons and to connect to theories of domestication and cultural proximity, we derive domestication profiles together with a proximity salience ratio. Our analysis spans thematic, recurrent, and singular events. By integrating quantitative data with qualitative interpretation, we provide new insights into the dynamics of Swiss digital media and demonstrate the usefulness of triangulation in media studies. The findings reveal distinct temporal patterns and highlight how linguistic and cultural contexts influence reporting. Our approach offers a framework applicable to other multilingual or culturally diverse media environments, contributing to a deeper understanding of how news is shaped by linguistic and cultural factors.
[79] Generalizable Self-Evolving Memory for Automatic Prompt Optimization
Guanbao Liang, Yuanchen Bei, Sheng Zhou, Yuheng Qin, Huan Zhou, Bingxin Jia, Bin Li, Jiajun Bu
Main category: cs.CL
TL;DR: MemAPO is a memory-driven framework for prompt optimization that accumulates reusable prompting knowledge over time through a dual-memory mechanism, enabling generalization across tasks rather than task-specific optimization.
Details
Motivation: Current prompt optimization methods are limited to task-specific prompts, preventing generalization across heterogeneous queries and accumulation of reusable prompting knowledge over time. The authors aim to create a framework that learns from experience and improves continuously.
Method: MemAPO uses a dual-memory mechanism: (1) distills successful reasoning trajectories into reusable strategy templates, and (2) organizes incorrect generations into structured error patterns capturing recurrent failure modes. For new prompts, it retrieves relevant strategies and failure patterns to compose prompts that promote effective reasoning while avoiding known mistakes.
Result: Experiments on diverse benchmarks show MemAPO consistently outperforms representative prompt optimization baselines while substantially reducing optimization cost.
Conclusion: MemAPO successfully reconceptualizes prompt optimization as generalizable and self-evolving experience accumulation, enabling continuous improvement over time rather than restarting from scratch for each task.
Abstract: Automatic prompt optimization is a promising approach for adapting large language models (LLMs) to downstream tasks, yet existing methods typically search for a specific prompt specialized to a fixed task. This paradigm limits generalization across heterogeneous queries and prevents models from accumulating reusable prompting knowledge over time. In this paper, we propose MemAPO, a memory-driven framework that reconceptualizes prompt optimization as generalizable and self-evolving experience accumulation. MemAPO maintains a dual-memory mechanism that distills successful reasoning trajectories into reusable strategy templates while organizing incorrect generations into structured error patterns that capture recurrent failure modes. Given a new prompt, the framework retrieves both relevant strategies and failure patterns to compose prompts that promote effective reasoning while discouraging known mistakes. Through iterative self-reflection and memory editing, MemAPO continuously updates its memory, enabling prompt optimization to improve over time rather than restarting from scratch for each task. Experiments on diverse benchmarks show that MemAPO consistently outperforms representative prompt optimization baselines while substantially reducing optimization cost.
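The retrieve-and-compose step of the dual-memory mechanism can be sketched roughly as follows. This is a minimal illustration under assumptions of our own: the memory entries, the keyword-overlap retriever, and the prompt layout are invented placeholders, not MemAPO's actual retriever or memory format.

```python
strategy_memory = [  # distilled strategy templates (invented examples)
    "For arithmetic word problems, extract quantities before reasoning.",
    "For multi-hop questions, answer each sub-question in order.",
]
error_memory = [  # structured error patterns (invented example)
    "Avoid rounding intermediate results in arithmetic problems.",
]

def overlap(a: str, b: str) -> int:
    """Count shared lowercase word tokens between two strings."""
    return len(set(a.lower().split()) & set(b.lower().split()))

def compose_prompt(query: str, k: int = 1) -> str:
    """Retrieve the top-k strategies and error patterns, then compose a prompt."""
    strategies = sorted(strategy_memory, key=lambda s: overlap(s, query), reverse=True)[:k]
    errors = sorted(error_memory, key=lambda e: overlap(e, query), reverse=True)[:k]
    parts = ["Strategies:", *strategies, "Known failure modes:", *errors, "Question:", query]
    return "\n".join(parts)

prompt = compose_prompt("Solve this arithmetic word problem about apples")
```

In the paper the memory itself is updated by iterative self-reflection; the sketch only shows how retrieved strategies and failure patterns would be spliced into a new prompt.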
[80] CatRAG: Functor-Guided Structural Debiasing with Retrieval Augmentation for Fair LLMs
Ravi Ranjan, Utkarsh Grover, Mayur Akewar, Xiaomin Lin, Agoritsa Polyzou
Main category: cs.CL
TL;DR: CatRAG Debiasing integrates category theory with RAG for structural debiasing of LLMs, achieving state-of-the-art bias reduction while preserving utility.
Details
Motivation: LLMs deployed in high-stakes settings show demographic, gender, and geographic biases that undermine fairness and trust. Existing debiasing methods are incomplete and create brittle utility trade-offs under distribution shifts.
Method: Proposes CatRAG Debiasing, a dual-pronged framework that integrates a functor (a category-theoretic structure that yields a principled, structure-preserving projection) with Retrieval-Augmented Generation (RAG)-guided structural debiasing to suppress bias-associated directions while retaining task-relevant semantics.
Result: On Bias Benchmark for Question Answering (BBQ) across three open-source LLMs (Meta Llama-3, OpenAI GPT-OSS, Google Gemma-3), CatRAG achieves SOTA results: improves accuracy by up to 40% over base models and >10% over prior debiasing methods, reducing bias scores to near zero (from 60% for base models) across gender, nationality, race, and intersectional subgroups.
Conclusion: CatRAG Debiasing effectively mitigates LLM biases through structural, category-theoretic approach combined with RAG, achieving superior bias reduction while maintaining utility.
Abstract: Large Language Models (LLMs) are deployed in high-stakes settings but can show demographic, gender, and geographic biases that undermine fairness and trust. Prior debiasing methods, including embedding-space projections, prompt-based steering, and causal interventions, often act at a single stage of the pipeline, resulting in incomplete mitigation and brittle utility trade-offs under distribution shifts. We propose CatRAG Debiasing, a dual-pronged framework that integrates functor with Retrieval-Augmented Generation (RAG) guided structural debiasing. The functor component leverages category-theoretic structure to induce a principled, structure-preserving projection that suppresses bias-associated directions in the embedding space while retaining task-relevant semantics. On the Bias Benchmark for Question Answering (BBQ) across three open-source LLMs (Meta Llama-3, OpenAI GPT-OSS, and Google Gemma-3), CatRAG achieves state-of-the-art results, improving accuracy by up to 40% over the corresponding base models and by more than 10% over prior debiasing methods, while reducing bias scores to near zero (from 60% for the base models) across gender, nationality, race, and intersectional subgroups.
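As a rough intuition for the projection component (not the paper's functor construction), removing a bias-associated direction from embeddings can be done with a linear, idempotent projection onto its orthogonal complement, a standard sketch:

```python
import numpy as np

def debias(embeddings: np.ndarray, bias_dir: np.ndarray) -> np.ndarray:
    """Project each row of `embeddings` onto the orthogonal complement of `bias_dir`."""
    b = bias_dir / np.linalg.norm(bias_dir)
    return embeddings - np.outer(embeddings @ b, b)

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))   # toy sentence embeddings
b = rng.normal(size=8)        # estimated bias-associated direction
Xd = debias(X, b)
```

After the projection every embedding is orthogonal to the bias direction, and applying the projection twice changes nothing, which is the "structure-preserving" behavior one would want from such a map.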
[81] SynSym: A Synthetic Data Generation Framework for Psychiatric Symptom Identification
Migyeong Kang, Jihyun Kim, Hyolim Jeon, Sunwoo Hwang, Jihyun An, Yonghoon Kim, Haewoon Kwak, Jisun An, Jinyoung Han
Main category: cs.CL
TL;DR: SynSym: A synthetic data generation framework using LLMs to create training data for psychiatric symptom identification from social media text, addressing data scarcity through symptom expansion, diverse linguistic style generation, and realistic multi-symptom composition.
Details
Motivation: Building large-scale symptom-level datasets for mental health analysis from social media is challenging due to expensive expert labeling and lack of standardized annotation guidelines, limiting model generalizability for identifying diverse symptom expressions.
Method: SynSym uses LLMs to generate synthetic training data by: 1) expanding symptoms into sub-concepts for diversity, 2) producing symptom expressions in diverse linguistic styles, and 3) composing realistic multi-symptom expressions based on clinical co-occurrence patterns.
Result: Models trained solely on SynSym’s synthetic data perform comparably to those trained on real data, with additional fine-tuning on real data providing further benefits. Validated on three benchmark datasets for depressive symptom expression.
Conclusion: Synthetic data generated by SynSym serves as a viable alternative to real-world annotations for psychiatric symptom modeling, offering a practical framework for creating clinically relevant and realistic symptom expressions.
Abstract: Psychiatric symptom identification on social media aims to infer fine-grained mental health symptoms from user-generated posts, allowing a detailed understanding of users’ mental states. However, the construction of large-scale symptom-level datasets remains challenging due to the resource-intensive nature of expert labeling and the lack of standardized annotation guidelines, which in turn limits the generalizability of models to identify diverse symptom expressions from user-generated text. To address these issues, we propose SynSym, a synthetic data generation framework for constructing generalizable datasets for symptom identification. Leveraging large language models (LLMs), SynSym constructs high-quality training samples by (1) expanding each symptom into sub-concepts to enhance the diversity of generated expressions, (2) producing synthetic expressions that reflect psychiatric symptoms in diverse linguistic styles, and (3) composing realistic multi-symptom expressions, informed by clinical co-occurrence patterns. We validate SynSym on three benchmark datasets covering different styles of depressive symptom expression. Experimental results demonstrate that models trained solely on the synthetic data generated by SynSym perform comparably to those trained on real data, and benefit further from additional fine-tuning with real data. These findings underscore the potential of synthetic data as an alternative resource to real-world annotations in psychiatric symptom modeling, and SynSym serves as a practical framework for generating clinically relevant and realistic symptom expressions.
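The third step, composing multi-symptom expressions guided by co-occurrence, can be sketched as below. The symptom names, weights, and single-symptom expressions are invented stand-ins; in SynSym the expressions come from an LLM and the weights from clinical co-occurrence data.

```python
import random

cooccurrence = {  # invented clinical co-occurrence weights
    ("low mood", "fatigue"): 0.8,
    ("low mood", "insomnia"): 0.6,
    ("fatigue", "insomnia"): 0.4,
}
expressions = {  # invented single-symptom expressions (step 2's output)
    "low mood": "I just feel down all the time",
    "fatigue": "I'm exhausted no matter how much I rest",
    "insomnia": "I can't fall asleep at night",
}

def compose(rng: random.Random) -> tuple[str, tuple[str, str]]:
    """Sample a symptom pair by co-occurrence weight and join its expressions."""
    pairs = list(cooccurrence)
    pair = rng.choices(pairs, weights=[cooccurrence[p] for p in pairs], k=1)[0]
    text = expressions[pair[0]] + ", and " + expressions[pair[1]] + "."
    return text, pair

text, labels = compose(random.Random(0))
```

Each composed sample carries its symptom-pair label, so the output can be used directly as multi-label training data.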
[82] DATASHI: A Parallel English-Tashlhiyt Corpus for Orthography Normalization and Low-Resource Language Processing
Nasser-Eddine Monir, Zakaria Baou
Main category: cs.CL
TL;DR: DATASHI is a new parallel English-Tashlhiyt corpus with 5,000 sentence pairs, including standardized and user-generated versions, enabling study of orthographic diversity and supporting NLP tasks and multimodal alignment.
Details
Motivation: Addresses the critical gap in computational resources for Amazigh languages, particularly Tashlhiyt, by creating a parallel corpus that supports both text-based NLP tasks and multimodal applications like read-speech data collection.
Method: Created a 5,000 sentence pair English-Tashlhiyt corpus with a 1,500-sentence subset containing both expert-standardized and non-standard user-generated versions. Evaluated with state-of-the-art LLMs (GPT-5, Claude-Sonnet-4.5, Gemini-2.5-Pro, Mistral, Qwen3-Max) using zero-shot to few-shot prompting, and conducted fine-grained analysis of edit operations across phonological classes.
Result: Gemini-2.5-Pro achieved the lowest word and character-level error rates and exhibited robust cross-lingual generalization. Fine-grained analysis revealed model-specific sensitivities to marked Tashlhiyt phonological features (geminates, emphatics, uvulars, pharyngeals), providing diagnostic insights for orthography normalization.
Conclusion: DATASHI fills a critical resource gap for Amazigh languages, enables systematic study of orthographic diversity, supports text-based NLP tasks and multimodal applications, and provides valuable diagnostic insights through LLM evaluations for low-resource language processing.
Abstract: DATASHI is a new parallel English-Tashlhiyt corpus that fills a critical gap in computational resources for Amazigh languages. It contains 5,000 sentence pairs, including a 1,500-sentence subset with expert-standardized and non-standard user-generated versions, enabling systematic study of orthographic diversity and normalization. This dual design supports text-based NLP tasks - such as tokenization, translation, and normalization - and also serves as a foundation for read-speech data collection and multimodal alignment. Comprehensive evaluations with state-of-the-art Large Language Models (GPT-5, Claude-Sonnet-4.5, Gemini-2.5-Pro, Mistral, Qwen3-Max) show clear improvements from zero-shot to few-shot prompting, with Gemini-2.5-Pro achieving the lowest word and character-level error rates and exhibiting robust cross-lingual generalization. A fine-grained analysis of edit operations - deletions, substitutions, and insertions - across phonological classes (geminates, emphatics, uvulars, and pharyngeals) further highlights model-specific sensitivities to marked Tashlhiyt features and provides new diagnostic insights for low-resource Amazigh orthography normalization.
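The character-level error rate used to score normalization outputs is edit distance normalized by reference length. A standard sketch (not DATASHI's evaluation code), shown on a generic string pair:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b, classic dynamic-programming version."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def cer(hyp: str, ref: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    return levenshtein(hyp, ref) / max(len(ref), 1)

score = cer("kitten", "sitting")  # 3 edits over a 7-character reference
```

Because `levenshtein` works on the decomposed operations (insertions, deletions, substitutions), the same machinery supports the paper's per-phonological-class edit analysis.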
[83] A Comparative Analysis of LLM Memorization at Statistical and Internal Levels: Cross-Model Commonalities and Model-Specific Signatures
Bowen Chen, Namgi Han, Yusuke Miyao
Main category: cs.CL
TL;DR: Analysis of memorization patterns across multiple LLM families reveals both universal scaling laws and family-specific behaviors at statistical and internal levels.
Details
Motivation: Previous studies on LLM memorization have been limited to single model series, making it unclear which findings are generalizable versus specific. The authors aim to understand memorization patterns across multiple LLM families to identify universal principles versus family-specific behaviors.
Method: Collected multiple model series (Pythia, OpenLLaMa, StarCoder, OLMo1/2/3) and analyzed memorization behavior at both statistical and internal levels. Statistical analysis examined scaling laws, compression, and distribution patterns. Internal analysis studied perturbation removal, sensitivity, layer decoding, and attention head ablation.
Result: Found that memorization rate scales log-linearly with model size, memorized sequences can be further compressed, and there are shared frequency/domain distribution patterns. However, models also show individual features. Internally, LLMs can remove certain perturbations while memorized sequences are more sensitive. Identified general decoding processes and shared important heads for memorization, but distribution of important heads differs between families.
Conclusion: The study bridges various experiments to reveal both universal principles and family-specific features of memorization in LLMs, paving the way for a more fundamental understanding of memorization across different model architectures.
Abstract: Memorization is a fundamental component of intelligence for both humans and LLMs. However, while LLM performance scales rapidly, our understanding of memorization lags. Due to limited access to the pre-training data of LLMs, most previous studies focus on a single model series, leading to isolated observations among series, making it unclear which findings are general or specific. In this study, we collect multiple model series (Pythia, OpenLLaMa, StarCoder, OLMo1/2/3) and analyze their shared or unique memorization behavior at both the statistical and internal levels, connecting individual observations while showing new findings. At the statistical level, we reveal that the memorization rate scales log-linearly with model size, and memorized sequences can be further compressed. Further analysis demonstrates a shared frequency and domain distribution pattern for memorized sequences. However, different models also show individual features under the above observations. At the internal level, we find that LLMs can remove certain injected perturbations, while memorized sequences are more sensitive. By decoding middle layers and ablating attention heads, we reveal the general decoding process and shared important heads for memorization. However, the distribution of those important heads differs between families, showing a unique family-level feature. Through bridging various experiments and revealing new findings, this study paves the way for a universal and fundamental understanding of memorization in LLMs.
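A log-linear scaling law of the kind reported means the memorization rate is linear in log(model size). The toy fit below illustrates the relationship with invented (size, rate) points, not the paper's measurements:

```python
import math

# invented (model size, memorization rate) points for illustration only
sizes = [70e6, 160e6, 410e6, 1.0e9, 2.8e9]
rates = [0.010, 0.014, 0.019, 0.023, 0.028]

xs = [math.log(s) for s in sizes]
n = len(xs)
mx, my = sum(xs) / n, sum(rates) / n
# ordinary least squares for rate = a * log(size) + b
a = sum((x - mx) * (y - my) for x, y in zip(xs, rates)) / sum((x - mx) ** 2 for x in xs)
b = my - a * mx

def predict(size: float) -> float:
    """Memorization rate predicted by the fitted log-linear law."""
    return a * math.log(size) + b
```

Under such a law, each doubling of parameters adds a roughly constant increment (a * log 2) to the memorization rate.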
[84] TAMTRL: Teacher-Aligned Reward Reshaping for Multi-Turn Reinforcement Learning in Long-Context Compression
Li Wang, Yandong Wang, Xin Yu, Kui Zhang, Tianhao Peng, Wenjun Wu
Main category: cs.CL
TL;DR: TAMTRL addresses credit assignment in multi-turn LLM training for long documents by using teacher-aligned reward reshaping with relevant documents as supervision signals.
Details
Motivation: When LLMs process long documents exceeding context windows, they need chunk-wise processing with memory updates across multiple turns. However, supervision is only available for final outcomes, creating a temporal credit assignment problem for evaluating intermediate memory updates.
Method: TAMTRL (Teacher-Aligned Reward Reshaping for Multi-Turn RL) uses relevant documents as teacher signals, aligns them with each turn’s model input, and assigns rewards through normalized probabilities in a self-supervised manner to provide fine-grained learning signals for each memory update.
Result: Experiments with multiple models across seven long-context benchmarks show TAMTRL consistently outperforms strong baselines, demonstrating effectiveness in improving long-context processing.
Conclusion: TAMTRL effectively addresses the credit assignment problem in multi-turn memory training for LLMs, providing a practical solution for long-document processing without substantial computational overhead.
Abstract: The rapid progress of large language models (LLMs) has led to remarkable performance gains across a wide range of tasks. However, when handling long documents that exceed the model’s context window limit, the entire context cannot be processed in a single pass, making chunk-wise processing necessary. This requires multiple turns to read different chunks and update memory. However, supervision is typically provided only by the final outcome, which makes it difficult to evaluate the quality of memory updates at each turn in the multi-turn training setting. This introduces a temporal credit assignment challenge. Existing approaches, such as LLM-as-a-judge or process reward models, incur substantial computational overhead and suffer from estimation noise. To better address the credit assignment problem in multi-turn memory training, we propose Teacher-Aligned Reward Reshaping for Multi-Turn Reinforcement Learning (TAMTRL). TAMTRL leverages relevant documents as teacher signals by aligning them with each turn of model input and assigns rewards through normalized probabilities in a self-supervised manner. This provides fine-grained learning signals for each memory update and improves long-context processing. Experiments with multiple models of varying scales across seven long-context benchmarks show that TAMTRL consistently outperforms strong baselines, demonstrating its effectiveness. Our code is available at https://anonymous.4open.science/r/TAMTRL-F1F8.
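The reshaping idea, scoring each turn's memory against a teacher signal and normalizing the scores into per-turn rewards, can be sketched as below. The token-overlap scorer and the softmax normalization are our own illustrative assumptions; the paper uses normalized probabilities from the model itself.

```python
import math

def score(memory: str, teacher: str) -> float:
    """Crude proxy alignment score: token overlap with the teacher document."""
    m, t = set(memory.lower().split()), set(teacher.lower().split())
    return len(m & t) / max(len(t), 1)

def turn_rewards(memories: list[str], teachers: list[str]) -> list[float]:
    """Softmax-normalize per-turn alignment scores into rewards."""
    raw = [score(m, t) for m, t in zip(memories, teachers)]
    exps = [math.exp(r) for r in raw]
    z = sum(exps)
    return [e / z for e in exps]

rewards = turn_rewards(
    ["the report covers q3 revenue", "unrelated notes"],
    ["q3 revenue grew", "the merger closed in q4"],
)
```

The point of the normalization is that every turn receives a comparable, dense reward instead of a single sparse outcome signal at the end.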
[85] Optimizing Multi-Agent Weather Captioning via Text Gradient Descent: A Training-Free Approach with Consensus-Aware Gradient Fusion
Shixu Liu
Main category: cs.CL
TL;DR: WeatherTGD: A training-free multi-agent framework using Text Gradient Descent to generate interpretable natural language captions from weather time series data through collaborative refinement by specialized LLM agents.
Details
Motivation: Existing approaches for weather time series analysis either produce numerical predictions without human-accessible explanations or generate generic descriptions lacking domain-specific depth. There's a need for interpretable natural language captions that combine meteorological expertise with natural language processing.
Method: WeatherTGD uses a training-free multi-agent framework with three specialized LLM agents: Statistical Analyst, Physics Interpreter, and Meteorology Expert. These agents generate domain-specific textual gradients from weather time series observations. A Consensus-Aware Gradient Fusion mechanism aggregates these gradients, preserving unique domain perspectives while extracting common signals. The fused gradients guide an iterative refinement process analogous to gradient descent.
Result: Experiments on real-world meteorological datasets show WeatherTGD achieves significant improvements in both LLM-based evaluation and human expert evaluation, substantially outperforming existing multi-agent baselines while maintaining computational efficiency through parallel agent execution.
Conclusion: WeatherTGD successfully addresses the challenge of generating interpretable natural language captions from weather time series data by combining domain expertise with collaborative refinement through Text Gradient Descent, offering a promising approach for meteorological NLP applications.
Abstract: Generating interpretable natural language captions from weather time series data remains a significant challenge at the intersection of meteorological science and natural language processing. While recent advances in Large Language Models (LLMs) have demonstrated remarkable capabilities in time series forecasting and analysis, existing approaches either produce numerical predictions without human-accessible explanations or generate generic descriptions lacking domain-specific depth. We introduce WeatherTGD, a training-free multi-agent framework that reinterprets collaborative caption refinement through the lens of Text Gradient Descent (TGD). Our system deploys three specialized LLM agents including a Statistical Analyst, a Physics Interpreter, and a Meteorology Expert that generate domain-specific textual gradients from weather time series observations. These gradients are aggregated through a novel Consensus-Aware Gradient Fusion mechanism that extracts common signals while preserving unique domain perspectives. The fused gradients then guide an iterative refinement process analogous to gradient descent, where each LLM-generated feedback signal updates the caption toward an optimal solution. Experiments on real-world meteorological datasets demonstrate that WeatherTGD achieves significant improvements in both LLM-based evaluation and human expert evaluation, substantially outperforming existing multi-agent baselines while maintaining computational efficiency through parallel agent execution.
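The Text Gradient Descent loop can be caricatured with ordinary functions standing in for the LLM agents: each agent returns a textual critique (a "gradient"), the critiques are fused with duplicates merged, and the caption is updated until no agent objects. Everything below, agents, critiques, and fixes, is a hard-coded toy, not WeatherTGD's prompts.

```python
def statistical_analyst(caption):
    return [] if "above average" in caption else ["state the anomaly vs. average"]

def physics_interpreter(caption):
    return [] if "above average" in caption else ["state the anomaly vs. average"]

def meteorology_expert(caption):
    return [] if "heatwave" in caption else ["name the heatwave event"]

AGENTS = [statistical_analyst, physics_interpreter, meteorology_expert]
FIXES = {  # how each fused critique rewrites the caption (stands in for the LLM update)
    "state the anomaly vs. average": lambda c: c + " Temperatures were 5C above average.",
    "name the heatwave event": lambda c: c + " This marks a heatwave.",
}

def refine(caption: str, steps: int = 3) -> str:
    """Iteratively apply fused textual 'gradients' until no agent objects."""
    for _ in range(steps):
        grads = [g for agent in AGENTS for g in agent(caption)]
        if not grads:
            break
        for g in dict.fromkeys(grads):  # fusion: merge shared signals, keep unique ones
            caption = FIXES[g](caption)
    return caption

out = refine("Daily mean temperature rose steadily this week.")
```

Because the agents are independent, the real system can run them in parallel at each step, which is where the reported efficiency comes from.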
[86] Probing How Scalable Table Data Enhances General Long-Context Reasoning
Huaibing Xie, Guoliang Zhao, Yang Liu, Shihan Dou, Siming Huang, Yanling Xiao, Shaolei Wang, Yiting Liu, Cheng Zhang, Shaofan Liu, Pluto Zhou
Main category: cs.CL
TL;DR: TableLong: Using structured table data with periodic dependencies to enhance LLM long-context reasoning via RL-based data synthesis
Details
Motivation: Real-world tasks require long-context reasoning, but few studies explore effective data types for this capability. The paper finds structured table data with periodic structures shows strong potential for enhancing long-context reasoning in LLMs.
Method: Mathematically analyze tabular dependency structures using mutual information to reveal periodic non-vanishing dependencies. Propose TableLong - a scalable pipeline for synthesizing high-quality, diverse, and verifiable structured table data using RL to boost long-context reasoning.
Result: Table data significantly enhances LLM long-context reasoning across multiple benchmarks (+8.24% average improvement), and even improves performance on out-of-domain benchmarks (+8.06% average).
Conclusion: Structured table data is effective for enhancing long-context reasoning in LLMs, providing practical guidance for post-training data selection. The proposed TableLong pipeline demonstrates scalable synthesis of beneficial table data.
Abstract: As real-world tasks grow increasingly complex, long-context reasoning has become a core capability for Large Language Models (LLMs). However, few studies explore which data types are effective for long-context reasoning and why. We find that structured table data with periodic structures shows strong potential for long-context reasoning. Motivated by this observation, we mathematically analyze tabular dependency structures using mutual information, revealing periodic non-vanishing dependencies in table data. Furthermore, we systematically analyze the capabilities of structured table data, conduct relevant scaling experiments, and validate its underlying mechanisms for enhancing long-context reasoning, yielding several meaningful insights. Leveraging these insights, we propose a simple yet scalable pipeline (TableLong) for synthesizing high-quality, diverse, and verifiable structured table data to boost long-context reasoning via RL. Extensive experimental results demonstrate that table data significantly enhances the long-context reasoning capability of LLMs across multiple long-context benchmarks (+8.24% on average), and even improves performance on out-of-domain benchmarks (+8.06% on average). We hope that our insights provide practical guidance for effective post-training data to enhance long-context reasoning in LLMs.
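The "periodic non-vanishing dependency" claim can be checked empirically on a toy table: when a table is serialized row by row, a cell and the cell exactly one row-width later share a column and stay statistically dependent. The synthetic table below is our own construction, not the paper's analysis.

```python
import math
from collections import Counter

def mutual_info(xs, ys):
    """Empirical I(X;Y) in nats from paired samples."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

width = 3  # cells per row
rows = [("even" if i % 2 == 0 else "odd", "A", i % 3) for i in range(200)]
stream = [cell for row in rows for cell in row]  # row-major serialization

lag_row = mutual_info(stream[:-width], stream[width:])  # same column, next row
lag_one = mutual_info(stream[:-1], stream[1:])          # adjacent cells
```

On this toy table the same-column lag carries more mutual information than the adjacent-cell lag, and the effect recurs at every multiple of the row width, which is the periodicity the paper exploits.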
[87] SemEval-2026 Task 12: Abductive Event Reasoning: Towards Real-World Event Causal Inference for Large Language Models
Pengfei Cao, Mingxuan Yang, Yubo Chen, Chenlong Zhang, Mingxuan Liu, Kang Liu, Jun Zhao
Main category: cs.CL
TL;DR: SemEval-2026 Task 12 introduces Abductive Event Reasoning (AER), a benchmark for identifying direct causes of events from supporting evidence, with 122 participants and 518 submissions.
Details
Motivation: Direct-cause inference in evidence-rich settings is important for NLP and decision-making but remains underexplored. The paper aims to address this gap by creating a focused benchmark for abductive reasoning over real-world events.
Method: Organized SemEval-2026 Task 12 as an evidence-grounded multiple-choice benchmark capturing real-world causal reasoning challenges including distributed evidence, indirect background factors, and semantically related non-causal distractors.
Result: The shared task attracted 122 participants and received 518 submissions, demonstrating significant community interest in abductive event reasoning.
Conclusion: AER provides a focused benchmark for abductive reasoning over real-world events and highlights challenges for future work on causal reasoning and multi-document understanding.
Abstract: Understanding why real-world events occur is important for both natural language processing and practical decision-making, yet direct-cause inference remains underexplored in evidence-rich settings. To address this gap, we organized SemEval-2026 Task 12: Abductive Event Reasoning (AER); the task data is available at https://github.com/sooo66/semeval2026-task12-dataset.git. The task asks systems to identify the most plausible direct cause of a target event from supporting evidence. We formulate AER as an evidence-grounded multiple-choice benchmark that captures key challenges of real-world causal reasoning, including distributed evidence, indirect background factors, and semantically related but non-causal distractors. The shared task attracted 122 participants and received 518 submissions. This paper presents the task formulation, dataset construction pipeline, evaluation setup, and system results. AER provides a focused benchmark for abductive reasoning over real-world events and highlights challenges for future work on causal reasoning and multi-document understanding.
[88] Politics of Questions in News: A Mixed-Methods Study of Interrogative Stances as Markers of Voice and Power
Bros Victor, Barbini Matilde, Gerard Patrick, Gatica-Perez Daniel
Main category: cs.CL
TL;DR: Computational study of interrogatives in French digital news using mixed methods to analyze question functions, patterns, and discourse structure at corpus scale.
Details
Motivation: Previous research on interrogatives in news has been limited to small English corpora and broadcast interviews, with computational studies rarely distinguishing interrogatives from declaratives or analyzing their functions systematically. This paper aims to bridge linguistics, conversation analysis, and large-scale computational analysis to understand how questioning practices structure contemporary news discourse.
Method: Mixed-methods approach using over one million French-language digital news articles (Jan 2023-Jun 2024). Automated detection of interrogative stances, functional type approximation, and location of textual answers. Combined with qualitative annotation of subcorpus grounded in semantic and pragmatic theories of questions.
Result: Interrogatives are sparse but systematically patterned: mainly introduce/organize issues, with most being information-seeking or echo-like. Questions overwhelmingly taken up within same article with answer-like spans, usually in journalist’s narrative voice. Interrogative contexts densely populated with named individuals/organizations/places, showing strong personalization and foregrounding of prominent actors.
Conclusion: Demonstrates how interrogative stance, textual uptake, and voice can be operationalized at corpus scale. Combining computational methods with pragmatic and sociological perspectives helps account for how questioning practices structure contemporary news discourse.
Abstract: Interrogatives in news discourse have been examined in linguistics and conversation analysis, but mostly in broadcast interviews and relatively small, often English-language corpora, while large-scale computational studies of news rarely distinguish interrogatives from declaratives or differentiate their functions. This paper brings these strands together through a mixed-methods study of the “Politics of Questions” in contemporary French-language digital news. Using over one million articles published between January 2023 and June 2024, we automatically detect interrogative stances, approximate their functional types, and locate textual answers when present, linking these quantitative measures to a qualitatively annotated subcorpus grounded in semantic and pragmatic theories of questions. Interrogatives are sparse but systematically patterned: they mainly introduce or organize issues, with most remaining cases being information-seeking or echo-like, while explicitly leading or tag questions are rare. Although their density and mix vary across outlets and topics, our heuristic suggests that questions are overwhelmingly taken up within the same article and usually linked to a subsequent answer-like span, most often in the journalist’s narrative voice and less often through quoted speech. Interrogative contexts are densely populated with named individuals, organizations, and places, whereas publics and broad social groups are mentioned much less frequently, suggesting that interrogative discourse tends to foreground already prominent actors and places and thus exhibits strong personalization. We show how interrogative stance, textual uptake, and voice can be operationalized at corpus scale, and argue that combining computational methods with pragmatic and sociological perspectives can help account for how questioning practices structure contemporary news discourse.
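The detection-and-typing step that the pipeline automates over a million articles can be caricatured with a surface heuristic: flag sentences ending in "?" and check for French information-seeking cue words. The cue list and labels are illustrative placeholders, far cruder than the paper's functional typology.

```python
import re

# crude cue list for French information-seeking questions (illustrative)
INFO_SEEKING = re.compile(r"^(qui|que|quand|pourquoi|comment|où)\b", re.IGNORECASE)

def classify(sentence: str) -> str:
    """Very rough interrogative-stance label for one sentence."""
    s = sentence.strip()
    if not s.endswith("?"):
        return "declarative"
    return "information-seeking" if INFO_SEEKING.match(s) else "other-interrogative"

examples = [
    "Pourquoi cette crise dure-t-elle ?",
    "Le budget est-il suffisant ?",
    "Le gouvernement a annoncé une réforme.",
]
stances = [classify(s) for s in examples]
```

At corpus scale, counting these labels per outlet and topic gives exactly the kind of density-and-mix statistics the study reports.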
[89] Instruction Set and Language for Symbolic Regression
Ezequiel Lopez-Rubio, Mario Pascual-Gonzalez
Main category: cs.CL
TL;DR: IsalSR introduces a canonical representation for symbolic regression that eliminates structural redundancy in expression DAGs by computing pruned canonical strings that collapse equivalent representations.
Details
Motivation: The paper addresses the fundamental problem of structural redundancy in symbolic regression, where multiple equivalent representations of the same expression occupy separate points in the search space, wasting computational resources without adding diversity.
Method: IsalSR uses a representation framework that encodes expression DAGs as strings over a compact two-tier alphabet and computes a pruned canonical string that serves as a complete labeled-DAG isomorphism invariant, collapsing all equivalent representations into a single canonical form.
Result: The method eliminates structural redundancy by providing a canonical representation that collapses equivalent expression DAG representations into a single form, improving search efficiency in symbolic regression.
Conclusion: IsalSR provides an effective solution to structural redundancy in symbolic regression through canonical representation, potentially improving the efficiency and effectiveness of symbolic regression algorithms.
Abstract: A fundamental but largely unaddressed obstacle in symbolic regression (SR) is structural redundancy: every expression DAG admits many distinct node-numbering schemes that all encode the same expression, each occupying a separate point in the search space and consuming fitness evaluations without adding diversity. We present IsalSR (Instruction Set and Language for Symbolic Regression), a representation framework that encodes expression DAGs as strings over a compact two-tier alphabet and computes a pruned canonical string – a complete labeled-DAG isomorphism invariant – that collapses all the equivalent representations into a single canonical form.
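The canonicalization idea, mapping every equivalent build of an expression to one string, can be shown in miniature. This tree-level sketch (sorting operands of commutative operators bottom-up) only illustrates the spirit of a canonical invariant; IsalSR's actual instruction set, two-tier alphabet, and DAG handling are not reproduced here.

```python
def canon(node) -> str:
    """Bottom-up canonical string; sorts operands of commutative operators."""
    if isinstance(node, str):  # variable or constant leaf
        return node
    op, args = node[0], [canon(a) for a in node[1:]]
    if op in {"+", "*"}:       # commutative: make operand order irrelevant
        args.sort()
    return "(" + op + " " + " ".join(args) + ")"

# two differently ordered builds of x*y + y*x collapse to one form
c1 = canon(("+", ("*", "x", "y"), ("*", "y", "x")))
c2 = canon(("+", ("*", "y", "x"), ("*", "x", "y")))
```

Deduplicating candidates by their canonical string is what stops equivalent representations from each consuming a fitness evaluation.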
[90] Select, Label, Evaluate: Active Testing in NLP
Antonio Purificato, Maria Sofia Bucarelli, Andrea Bacciu, Amin Mantrach, Fabrizio Silvestri
Main category: cs.CL
TL;DR: Active Testing framework for NLP reduces annotation costs by selecting most informative test samples, achieving up to 95% reduction with minimal performance estimation error.
Details
Motivation: Human annotation cost and time are major bottlenecks in NLP, especially for test data which requires high-quality labels for reliable model evaluation. Traditional approaches require annotating entire test sets, leading to substantial resource requirements.
Method: Formalizes Active Testing framework for NLP and benchmarks existing approaches across 18 datasets and 4 embedding strategies spanning 4 different NLP tasks. Introduces adaptive stopping criterion to automatically determine optimal number of samples instead of requiring predefined annotation budget.
Result: Achieves annotation reductions of up to 95% with performance estimation accuracy difference from full test set within 1%. Analysis shows variations in method effectiveness across different data characteristics and task types, with no single approach emerging as universally superior.
Conclusion: Active Testing provides effective framework for reducing annotation costs in NLP evaluation while maintaining reliable performance estimation, with adaptive stopping criterion addressing practical limitations of existing approaches.
Abstract: Human annotation cost and time remain significant bottlenecks in Natural Language Processing (NLP), with test data annotation being particularly expensive due to the stringent requirement for low-error and high-quality labels necessary for reliable model evaluation. Traditional approaches require annotating entire test sets, leading to substantial resource requirements. Active Testing is a framework that selects the most informative test samples for annotation. Given a labeling budget, it aims to choose the subset that best estimates model performance while minimizing cost and human effort. In this work, we formalize Active Testing in NLP and we conduct an extensive benchmarking of existing approaches across 18 datasets and 4 embedding strategies spanning 4 different NLP tasks. The experiments show annotation reductions of up to 95%, with performance estimation accuracy difference from the full test set within 1%. Our analysis reveals variations in method effectiveness across different data characteristics and task types, with no single approach emerging as universally superior. Lastly, to address the limitation of requiring a predefined annotation budget in existing sample selection strategies, we introduce an adaptive stopping criterion that automatically determines the optimal number of samples.
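As a toy illustration of the sample-selection step: one simple acquisition strategy is to label the test samples where the model is least confident (highest predictive entropy) and estimate performance from that subset. This is a naive sketch under invented data; the benchmarked methods and the paper's adaptive stopping criterion are considerably richer.

```python
# Hedged sketch of an entropy-based acquisition rule for active testing.
import math

def entropy(probs):
    """Shannon entropy of a predictive distribution (nats)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_labeling(pred_probs, budget):
    """Return indices of the `budget` most uncertain predictions."""
    scored = sorted(range(len(pred_probs)),
                    key=lambda i: entropy(pred_probs[i]), reverse=True)
    return scored[:budget]

preds = [[0.9, 0.1], [0.55, 0.45], [0.5, 0.5], [0.8, 0.2]]
print(select_for_labeling(preds, 2))  # the two near-tied samples: [2, 1]
```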
[91] Riding Brainwaves in LLM Space: Understanding Activation Patterns Using Individual Neural Signatures
Ajan Subramanian, Sumukh Bettadapura, Rohan Sathish
Main category: cs.CL
TL;DR: Frozen LLMs contain person-specific neural directions in their deep layers that can predict individual EEG responses, enabling EEG-driven personalization of language models.
Details
Motivation: With consumer-grade EEG devices becoming more common, researchers investigate whether language models can be adapted to individual neural responses by examining if frozen LLM representations encode person-specific EEG signals.
Method: Used word-level EEG data from 30 participants reading sentences (ZuCo corpus), trained separate linear probes for each person mapping hidden states from frozen Qwen 2.5 7B to individual EEG power, compared person-specific vs population probes, and analyzed temporal stability and layer-wise patterns.
Result: Person-specific probes significantly outperformed population probes (9x improvement for high-gamma power), showed temporal stability, non-transferability across individuals, and concentrated predictive power in deep layers (peaking at Layer 24 of 28). Results consistent across architectures and survived confound controls.
Conclusion: Frozen language models contain stable, person-specific neural directions in their deep layers, providing a geometric foundation for EEG-driven personalization of language models for individual users.
Abstract: Consumer-grade EEG is entering everyday devices, from earbuds to headbands, raising the question of whether language models can be adapted to individual neural responses. We test this by asking whether frozen LLM representations encode person-specific EEG signals, directions in activation space that predict one person’s brain activity but not another’s. Using word-level EEG from 30 participants reading naturalistic sentences (ZuCo corpus), we train a separate linear probe for each person, mapping hidden states from a frozen Qwen 2.5 7B to that individual’s EEG power. Person-specific probes outperform a single population probe on every EEG feature tested; for high-gamma power, the person-specific probe achieves rho = 0.183, a ninefold improvement over the population probe (rho = 0.020, p < 10^-4). A negative control, fixation count, shows no person-specific advantage (p = 0.360); fixation count reflects word length and frequency rather than individual cognition. The individual directions are temporally stable (split-half cosine = 0.824), non-transferable across people (self rho = 0.369 vs. other rho = 0.143, p < 10^-19), and distinct from the shared population signal: person-specific probes retain predictive power after the population component is removed. The person-specific signal concentrates in the model’s deep layers, rising consistently with depth and peaking at Layer 24 of 28. The results are consistent across architectures (LLaMA 3.1 8B) and survive word-level confound controls. Frozen language models contain stable, person-specific neural directions in their deep layers, providing a geometric foundation for EEG-driven personalization.
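The probing setup can be sketched compactly: a ridge-regression "linear probe" from frozen hidden states to one subject's EEG power, scored by correlation on held-out words. All dimensions and data below are synthetic stand-ins, not the paper's Qwen 2.5 7B states or ZuCo recordings.

```python
# Minimal ridge-probe sketch on synthetic data.
import numpy as np

def fit_ridge(X, y, lam=1.0):
    """Closed-form ridge regression weights."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def pearson(a, b):
    a, b = a - a.mean(), b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))               # "hidden states" for 200 words
direction = rng.normal(size=16)              # this subject's neural direction
y = X @ direction + rng.normal(scale=0.5, size=200)  # EEG power + noise

w = fit_ridge(X[:150], y[:150])              # train on first 150 words
r = pearson(X[150:] @ w, y[150:])            # evaluate on held-out 50
print(round(r, 2))                           # high, since y is linear in X
```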
[92] Ara-BEST-RQ: Multi-Dialectal Arabic SSL
Haroun Elleuch, Ryan Whetten, Salima Mdhaffar, Yannick Estève, Fethi Bougares
Main category: cs.CL
TL;DR: Ara-BEST-RQ: Self-supervised learning models for multi-dialectal Arabic speech processing, achieving state-of-the-art in dialect identification with 600M parameter conformer-based models trained on 5,640 hours of Arabic speech data.
Details
Motivation: Address the need for specialized speech processing models for Arabic dialects, which have been underserved compared to major languages. Current multilingual models don't adequately handle Arabic dialect variations, and there's a lack of large-scale Arabic-specific SSL models.
Method: Pre-trained conformer-based BEST-RQ models on 5,640 hours of Creative Commons Arabic speech combined with public datasets. Used self-supervised learning approach with models up to 600M parameters, specifically targeting Arabic dialect families.
Result: Achieved state-of-the-art performance on dialect identification (DID) tasks while using fewer parameters than competing models. Demonstrated that Arabic-targeted pre-training significantly outperforms multilingual or monolingual models trained on non-Arabic data.
Conclusion: Family-targeted pre-training on Arabic dialects is crucial for optimal performance in Arabic speech tasks. The released models, code, and datasets will support further research in Arabic speech technologies.
Abstract: We present Ara-BEST-RQ, a family of self-supervised learning (SSL) models specifically designed for multi-dialectal Arabic speech processing. Leveraging 5,640 hours of crawled Creative Commons speech and combining it with publicly available datasets, we pre-train conformer-based BEST-RQ models up to 600M parameters. Our models are evaluated on dialect identification (DID) and automatic speech recognition (ASR) tasks, achieving state-of-the-art performance on the former while using fewer parameters than competing models. We demonstrate that family-targeted pre-training on Arabic dialects significantly improves downstream performance compared to multilingual or monolingual models trained on non-Arabic data. All models, code, and pre-processed datasets will be publicly released to support reproducibility and further research in Arabic speech technologies.
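For context, the BEST-RQ pre-training target the models build on can be sketched in a few lines: each speech frame is projected with a frozen random matrix and labeled with the index of the nearest entry in a frozen random codebook, and the network learns to predict these labels for masked frames. Sizes below are toy values, not Ara-BEST-RQ's configuration.

```python
# Toy sketch of the BEST-RQ random-projection quantizer target.
import numpy as np

rng = np.random.default_rng(0)
proj = rng.normal(size=(80, 16))        # frozen random projection (80-d frames)
codebook = rng.normal(size=(512, 16))   # frozen random codebook of 512 codes

def quantize(frames):
    """Map (T, 80) frames to (T,) discrete codebook labels."""
    z = frames @ proj
    # nearest codebook vector per frame (squared Euclidean distance)
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

labels = quantize(rng.normal(size=(100, 80)))
print(labels.shape, int(labels.min()) >= 0, int(labels.max()) < 512)
```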
[93] SLURP-TN : Resource for Tunisian Dialect Spoken Language Understanding
Haroun Elleuch, Salima Mdhaffar, Yannick Estève, Fethi Bougares
Main category: cs.CL
TL;DR: SLURP-TN: A new Tunisian dialect dataset for Spoken Language Understanding with 4165 sentences (~5 hours) recorded from native speakers, addressing resource scarcity for low-resource languages.
Details
Motivation: Most SLU progress benefits only high-resource languages due to lack of datasets; need to extend SLU capabilities to low-resource languages like Tunisian dialect.
Method: Created SLURP-TN by recording 55 native speakers uttering sentences manually translated from six SLURP domains into Tunisian dialect, then developed ASR and SLU baseline models.
Result: Produced dataset of 4165 sentences (~5 hours audio) with baseline models; dataset and models publicly available on Hugging Face.
Conclusion: SLURP-TN helps mitigate resource scarcity for Tunisian dialect SLU, enabling research and development for low-resource languages in spoken language understanding.
Abstract: Spoken Language Understanding (SLU) aims to extract the semantic information from the speech utterance of user queries. It is a core component in a task-oriented dialogue system. With the spectacular progress of deep neural network models and the evolution of pre-trained language models, SLU has obtained significant breakthroughs. However, only a few high-resource languages have taken advantage of this progress due to the absence of SLU resources. In this paper, we seek to mitigate this obstacle by introducing SLURP-TN. This dataset was created by recording 55 native speakers uttering sentences in Tunisian dialect, manually translated from six SLURP domains. The result is an SLU Tunisian dialect dataset that comprises 4165 sentences, totaling around 5 hours of acoustic material. We also develop a number of Automatic Speech Recognition and SLU models exploiting SLURP-TN. The dataset and baseline models are available at: https://huggingface.co/datasets/Elyadata/SLURP-TN.
[94] Parameter-Efficient Fine-Tuning for Medical Text Summarization: A Comparative Study of Lora, Prompt Tuning, and Full Fine-Tuning
Ulugbek Shernazarov, Rostislav Svitsov, Bin Shi
Main category: cs.CL
TL;DR: LoRA outperforms full fine-tuning for medical text summarization with Flan-T5 models, achieving better ROUGE scores with only 0.6% trainable parameters.
Details
Motivation: Fine-tuning large language models for domain-specific tasks like medical text summarization requires substantial computational resources, creating a need for parameter-efficient fine-tuning methods.
Method: Compares three adaptation approaches (LoRA, Prompt Tuning, and Full Fine-Tuning) across the Flan-T5 model family on the PubMed medical summarization dataset, with sensitivity analyses on LoRA rank and prompt token count.
Result: LoRA consistently outperforms full fine-tuning, achieving 43.52 +/- 0.18 ROUGE-1 on Flan-T5-Large vs 40.67 +/- 0.21 for full fine-tuning, using only 0.6% trainable parameters.
Conclusion: The low-rank constraint in LoRA provides beneficial regularization, challenging assumptions about the necessity of full parameter updates for domain adaptation.
Abstract: Fine-tuning large language models for domain-specific tasks such as medical text summarization demands substantial computational resources. Parameter-efficient fine-tuning (PEFT) methods offer promising alternatives by updating only a small fraction of parameters. This paper compares three adaptation approaches, Low-Rank Adaptation (LoRA), Prompt Tuning, and Full Fine-Tuning, across the Flan-T5 model family on the PubMed medical summarization dataset. Through experiments with multiple random seeds, we demonstrate that LoRA consistently outperforms full fine-tuning, achieving 43.52 +/- 0.18 ROUGE-1 on Flan-T5-Large with only 0.6% trainable parameters compared to 40.67 +/- 0.21 for full fine-tuning. Sensitivity analyses examine the impact of LoRA rank and prompt token count. Our findings suggest the low-rank constraint provides beneficial regularization, challenging assumptions about the necessity of full parameter updates. Code is available at https://github.com/eracoding/llm-medical-summarization
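The 0.6% trainable-parameter figure follows from LoRA's arithmetic: a rank-r update W + B @ A on a (d_out, d_in) weight trains r*(d_in + d_out) values instead of d_in*d_out. The layer size below is illustrative, not Flan-T5-Large's actual configuration.

```python
# Back-of-the-envelope sketch of LoRA's trainable-parameter fraction.
def lora_fraction(d_in, d_out, r):
    """Fraction of one layer's weights that a rank-r LoRA update trains."""
    full = d_in * d_out          # frozen base weight
    lora = r * (d_in + d_out)    # trainable A (r x d_in) and B (d_out x r)
    return lora / full

# e.g. a hypothetical 1024x1024 projection with rank 8:
print(f"{lora_fraction(1024, 1024, 8):.2%}")  # 1.56% of that layer's weights
```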
[95] Retrieving Climate Change Disinformation by Narrative
Max Upravitelev, Veronika Solopova, Charlott Jakob, Premtim Sahitaj, Vera Schmitt
Main category: cs.CL
TL;DR: Reformulates climate disinformation narrative detection as a retrieval task using narrative core messages as queries, introduces SpecFi framework with hypothetical document generation, and shows robustness to high-variance narratives.
Details
Motivation: Traditional climate disinformation detection relies on fixed taxonomies that cannot accommodate emerging narratives, requiring a more flexible approach that can handle narrative evolution without predefined labels.
Method: Reformulates narrative detection as retrieval task using narrative core messages as queries; introduces SpecFi framework that generates hypothetical documents using community summaries from graph-based community detection as few-shot examples; proposes narrative variance metric based on embeddings.
Result: SpecFi achieves MAP of 0.505 on CARDS dataset without narrative labels; shows robustness to high-variance narratives (32.7% MAP loss vs 63.4% for BM25); unsupervised community summaries converge on descriptions close to expert-crafted taxonomies.
Conclusion: Retrieval-based approach enables flexible detection of emerging climate disinformation narratives without predefined taxonomies; graph-based methods can effectively surface narrative structure from unlabeled text; SpecFi framework provides robust performance.
Abstract: Detecting climate disinformation narratives typically relies on fixed taxonomies, which do not accommodate emerging narratives. Thus, we re-frame narrative detection as a retrieval task: given a narrative’s core message as a query, rank texts from a corpus by alignment with that narrative. This formulation requires no predefined label set and can accommodate emerging narratives. We repurpose three climate disinformation datasets (CARDS, Climate Obstruction, climate change subset of PolyNarrative) for retrieval evaluation and propose SpecFi, a framework that generates hypothetical documents to bridge the gap between abstract narrative descriptions and their concrete textual instantiations. SpecFi uses community summaries from graph-based community detection as few-shot examples for generation, achieving a MAP of 0.505 on CARDS without access to narrative labels. We further introduce narrative variance, an embedding-based difficulty metric, and show via partial correlation analysis that standard retrieval degrades on high-variance narratives (BM25 loses 63.4% of MAP), while SpecFi-CS remains robust (32.7% loss). Our analysis also reveals that unsupervised community summaries converge on descriptions close to expert-crafted taxonomies, suggesting that graph-based methods can surface narrative structure from unlabeled text.
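The headline metric, mean average precision (MAP), scores each query's ranked list by the precision at every rank holding a relevant document, then averages over queries. A minimal sketch with invented rankings:

```python
# Sketch of MAP as used to evaluate retrieval runs.
def average_precision(ranked, relevant):
    """AP of one ranked list against a set of relevant doc ids."""
    hits, score = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            score += hits / rank
    return score / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

runs = [(["d1", "d3", "d2"], {"d1", "d2"}),   # AP = (1/1 + 2/3) / 2 = 0.833
        (["d5", "d4"], {"d4"})]               # AP = 1/2 = 0.5
print(round(mean_average_precision(runs), 3))  # 0.667
```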
[96] Dual-Space Knowledge Distillation with Key-Query Matching for Large Language Models with Vocabulary Mismatch
Stella Eva Tsiapali, Cong-Thanh Do, Kate Knill
Main category: cs.CL
TL;DR: DSKD-CMA-GA improves cross-tokenizer knowledge distillation for LLMs using generative adversarial learning to address distribution mismatches between teacher and student models.
Details
Motivation: Large language models are computationally expensive to deploy, and knowledge distillation helps create smaller efficient models. While DSKD-CMA is state-of-the-art for cross-tokenizer distillation, its internal mechanisms are not well understood, and it suffers from distribution mismatches between teacher and student models.
Method: First analyzes DSKD-CMA’s attention mechanism through token alignment probing and heatmap visualizations. Then introduces DSKD-CMA-GA, which uses generative adversarial learning to address distribution mismatches between keys and queries from distinct models with different tokenizers.
Result: Shows modest but consistent ROUGE-L gains in text generation quality, particularly on out-of-distribution data (+0.37 average improvement), narrowing the performance gap between cross-tokenizer and same-tokenizer knowledge distillation.
Conclusion: The proposed DSKD-CMA-GA method effectively addresses distribution mismatch issues in cross-tokenizer knowledge distillation, improving text generation quality especially for out-of-distribution data.
Abstract: Large language models (LLMs) achieve state-of-the-art (SOTA) performance across language tasks, but are costly to deploy due to their size and resource demands. Knowledge Distillation (KD) addresses this by training smaller Student models to mimic larger Teacher models, improving efficiency without significant performance loss. Dual-Space Knowledge Distillation with Cross-Model Attention (DSKD-CMA) has emerged as a SOTA method for KD between LLMs with distinct tokenizers, yet its internal workings remain largely opaque. In this work, we systematically analyse the attention mechanism of DSKD-CMA through manual token alignment probing and heatmap visualisations, revealing both strengths and limitations. Building on this, we introduce a novel method, DSKD-CMA-GA, based on Generative Adversarial (GA) learning, to address the mismatched distributions between the keys and queries computed from distinct models. Experiments show modest but consistent ROUGE-L gains in text generation quality, particularly on out-of-distribution data (+0.37 on average), narrowing the gap between cross- and same-tokenizer KD.
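The cross-model attention idea can be sketched loosely: student hidden states act as queries over teacher hidden states (keys), and the resulting soft alignment re-maps teacher representations onto student token positions despite differing tokenizations. The projections, shapes, and data below are toy placeholders; DSKD-CMA's actual losses and projections differ.

```python
# Loose sketch of cross-model attention between mismatched token sequences.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
student_h = rng.normal(size=(12, 32))   # 12 student tokens, 32-d states
teacher_h = rng.normal(size=(9, 48))    # 9 teacher tokens, 48-d states
Wq = rng.normal(size=(32, 24))          # query projection (toy, "learned")
Wk = rng.normal(size=(48, 24))          # key projection (toy, "learned")

attn = softmax((student_h @ Wq) @ (teacher_h @ Wk).T / np.sqrt(24))
aligned_teacher = attn @ teacher_h      # teacher info per student position
print(attn.shape, aligned_teacher.shape)  # (12, 9) (12, 48)
```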
[97] Autoregressive vs. Masked Diffusion Language Models: A Controlled Comparison
Caio Vicentino
Main category: cs.CL
TL;DR: Empirical comparison of autoregressive vs masked diffusion language models on identical data/compute shows comparable throughput but different convergence patterns and diversity-fluency trade-offs.
Details
Motivation: To conduct a controlled empirical comparison between autoregressive (AR) and masked diffusion (MDLM) language models, isolating the generation paradigm as the sole variable to understand their fundamental differences in training dynamics and output characteristics.
Method: Train both AR and MDLM models on identical data (50M tokens from TinyStories), identical compute budget (20,000 steps, batch size 32, sequence length 512), and identical hardware (NVIDIA H100 80GB). Compare training throughput, convergence patterns, and analyze quantitative diversity of 1,000 generated samples.
Result: 1) Comparable training throughput (~50K tokens/second, MDLM requires 4.7% more time); 2) AR converges faster and overfits by step 14,000, while MDLM converges slower and still improves at step 20,000; 3) AR produces fluent but repetitive outputs (99.8% begin with same word), while MDLM generates more diverse narratives (93.4% unique 5-word openings, higher Distinct-n, lower Self-BLEU) with occasional grammatical issues.
Conclusion: The generation paradigm significantly impacts training dynamics and output characteristics: AR models converge faster but produce less diverse outputs, while MDLM models converge slower but generate more diverse content, revealing a structural diversity-fluency trade-off between the two approaches.
Abstract: We present a controlled empirical comparison between autoregressive (AR) and masked diffusion (MDLM) language models. Both models are trained on identical data (50M tokens from TinyStories), identical compute budget (20,000 steps, batch size 32, sequence length 512), and identical hardware (NVIDIA H100 80GB), isolating the generation paradigm as the sole variable. We report three findings. First, both paradigms achieve comparable training throughput (~50K tokens/second), with MDLM requiring only 4.7% more wall-clock time. Second, AR converges faster and begins overfitting by step 14,000, while MDLM converges more slowly and is still improving at step 20,000, suggesting different compute-optimal training regimes. Third, quantitative diversity analysis over 1,000 generated samples reveals a structural diversity-fluency trade-off: AR produces fluent but repetitive outputs (99.8% begin with the same word), while MDLM generates more diverse narratives (93.4% unique 5-word openings, higher Distinct-n, lower Self-BLEU), at the cost of occasional grammatical inconsistencies. All code, trained checkpoints, and data pipelines are released for reproducibility.
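Two of the diversity measures cited above are easy to sketch: Distinct-n (unique n-grams over total n-grams across samples) and the share of samples opening with the most common first word. The sample texts are invented.

```python
# Sketch of Distinct-n and first-word repetition, two diversity measures.
def distinct_n(texts, n):
    """Ratio of unique n-grams to total n-grams across all texts."""
    grams, total = set(), 0
    for t in texts:
        toks = t.split()
        for i in range(len(toks) - n + 1):
            grams.add(tuple(toks[i:i + n]))
            total += 1
    return len(grams) / total if total else 0.0

def top_first_word_share(texts):
    """Fraction of samples beginning with the most frequent first word."""
    firsts = [t.split()[0] for t in texts if t.split()]
    return max(firsts.count(w) for w in set(firsts)) / len(firsts)

samples = ["once upon a time", "once there was a fox", "a fox ran home"]
print(round(distinct_n(samples, 2), 2),           # 0.9
      round(top_first_word_share(samples), 2))    # 0.67
```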
[98] Multiperspectivity as a Resource for Narrative Similarity Prediction
Max Upravitelev, Veronika Solopova, Jing Yang, Charlott Jakob, Premtim Sahitaj, Ariana Sahitaj, Vera Schmitt
Main category: cs.CL
TL;DR: Paper explores narrative similarity prediction as interpretive task, uses ensemble of 31 LLM personas with different interpretive frameworks to handle multiperspectivity, achieving 0.705 accuracy on SemEval-2026 Task 4.
Details
Motivation: Narrative similarity prediction involves interpretive judgments where different valid readings produce divergent similarity judgments, challenging semantic evaluation benchmarks with single ground truths. The paper aims to incorporate multiperspectivity rather than overcome it.
Method: Created ensemble of 31 LLM personas ranging from practitioners following specific interpretive frameworks to intuitive lay-style characters. Used majority voting with ensemble on SemEval-2026 Task 4 dataset, analyzing performance dynamics and error correlations.
Result: Achieved 0.705 accuracy on SemEval-2026 Task 4. Accuracy improves with ensemble size following Condorcet Jury Theorem-like dynamics. Practitioner personas perform worse individually but produce less correlated errors, yielding larger ensemble gains. Gender-focused interpretive vocabulary consistently associated with lower accuracy across personas.
Conclusion: Interpretive plurality should be incorporated in predictive systems rather than treated as noise. Ensemble methods with diverse personas can handle multiperspectivity. Need for evaluation frameworks that account for interpretive diversity beyond single ground truths.
Abstract: Predicting narrative similarity can be understood as an inherently interpretive task: different, equally valid readings of the same text can produce divergent interpretations and thus different similarity judgments, posing a fundamental challenge for semantic evaluation benchmarks that encode a single ground truth. Rather than treating this multiperspectivity as a challenge to overcome, we propose to incorporate it in the decision making process of predictive systems. To explore this strategy, we created an ensemble of 31 LLM personas. These range from practitioners following interpretive frameworks to more intuitive, lay-style characters. Our experiments were conducted on the SemEval-2026 Task 4 dataset, where the system achieved an accuracy score of 0.705. Accuracy improves with ensemble size, consistent with Condorcet Jury Theorem-like dynamics under weakened independence. Practitioner personas perform worse individually but produce less correlated errors, yielding larger ensemble gains under majority voting. Our error analysis reveals a consistent negative association between gender-focused interpretive vocabulary and accuracy across all persona categories, suggesting either attention to dimensions not relevant for the benchmark or valid interpretations absent from the ground truth. This finding underscores the need for evaluation frameworks that account for interpretive plurality.
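The Condorcet Jury Theorem dynamic the paper invokes is simple to verify numerically: if each of n independent voters is correct with probability p > 0.5, the probability that the majority is correct rises with n. Real persona errors are correlated, so gains in practice are smaller, as the paper's analysis of practitioner personas illustrates.

```python
# Exact majority-vote accuracy for n independent voters (n odd).
from math import comb

def majority_correct(n, p):
    """P(majority of n i.i.d. voters is right), each right with prob p."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range((n // 2) + 1, n + 1))

for n in (1, 5, 31):
    print(n, round(majority_correct(n, 0.6), 3))
```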
[99] The Semantic Ladder: A Framework for Progressive Formalization of Natural Language Content for Knowledge Graphs and AI Systems
Lars Vogt
Main category: cs.CL
TL;DR: The Semantic Ladder framework enables progressive formalization of knowledge from natural language to formal semantic models, bridging the gap between human communication and machine-actionable representations.
Details
Motivation: The paper addresses the fundamental challenge of reconciling natural language (where most knowledge is created) with formal semantic models (needed for machine integration and reasoning), particularly when full formalization is required at data entry.
Method: Introduces the Semantic Ladder architectural framework that organizes representations across increasing levels of semantic explicitness, from natural language text snippets to ontology-based and higher-order logical models, using modular semantic units as carriers of meaning.
Result: The framework enables incremental construction of semantic knowledge spaces, reduces semantic parsing burden, and supports integration of heterogeneous representations including natural language, structured semantic models, and vector-based embeddings.
Conclusion: The Semantic Ladder provides a foundation for scalable, interoperable, and AI-ready data and knowledge infrastructures by enabling progressive formalization while preserving semantic continuity and traceability.
Abstract: Semantic data and knowledge infrastructures must reconcile two fundamentally different forms of representation: natural language, in which most knowledge is created and communicated, and formal semantic models, which enable machine-actionable integration, interoperability, and reasoning. Bridging this gap remains a central challenge, particularly when full semantic formalization is required at the point of data entry. Here, we introduce the Semantic Ladder, an architectural framework that enables the progressive formalization of data and knowledge. Building on the concept of modular semantic units as identifiable carriers of meaning, the framework organizes representations across levels of increasing semantic explicitness, ranging from natural language text snippets to ontology-based and higher-order logical models. Transformations between levels support semantic enrichment, statement structuring, and logical modelling while preserving semantic continuity and traceability. This approach enables the incremental construction of semantic knowledge spaces, reduces the semantic parsing burden, and supports the integration of heterogeneous representations, including natural language, structured semantic models, and vector-based embeddings. The Semantic Ladder thereby provides a foundation for scalable, interoperable, and AI-ready data and knowledge infrastructures.
[100] Enhancing Document-Level Machine Translation via Filtered Synthetic Corpora and Two-Stage LLM Adaptation
Ireh Kim, Tesia Sker, Chanwoo Kim
Main category: cs.CL
TL;DR: Two-stage fine-tuning strategy for LLMs in document-level machine translation using LLM-augmented document-level data to address data scarcity and hallucination issues.
Details
Motivation: LLMs underperform in machine translation compared to encoder-decoder systems but have strong contextual modeling capabilities that make them suitable for document-level translation, which faces challenges of scarce document-level parallel data and LLM hallucinations/omissions.
Method: Two-stage approach: 1) Data augmentation by converting summarization data into document-level parallel data using LLMs, then filtering with multiple metrics (sacreBLEU, COMET, LaBSE-based cosine similarity); 2) Two-stage fine-tuning: first on abundant sentence-level MT resources, then on filtered document-level corpus.
Result: Not specified in the abstract, but the method addresses key challenges in document-level MT with LLMs through data augmentation and staged fine-tuning.
Conclusion: Proposed approach tackles data scarcity and generation quality issues in document-level machine translation using LLMs through innovative data augmentation and staged fine-tuning strategies.
Abstract: In Machine Translation, Large Language Models (LLMs) have generally underperformed compared to conventional encoder-decoder systems and thus see limited adoption. However, LLMs excel at modeling contextual information, making them a natural fit for document-level translation tasks where coherence across sentences is crucial. Despite this potential, document-level MT with LLMs faces two key challenges: (1) the scarcity of large-scale, high-quality document-level parallel data; and (2) the propensity of LLMs to introduce hallucinations and omissions during generation. To address these challenges, we propose a two-stage fine-tuning strategy leveraging LLM-augmented document-level data. First, we augment data by converting summarization data into document-level parallel data using an LLM, and then filter it using multiple metrics, leveraging sacreBLEU, COMET, and LaBSE-based cosine similarity, to improve data quality. Finally, we employ a two-stage fine-tuning strategy: first fine-tuning on the abundant sentence-level MT resources, and then on the filtered document-level corpus.
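The filtering step reduces to keeping a synthetic document pair only when every quality score clears its threshold. The metric names below match the abstract, but the threshold values and scores are invented for illustration.

```python
# Illustrative multi-metric filter for synthetic parallel-data pairs.
THRESHOLDS = {"sacrebleu": 20.0, "comet": 0.75, "labse_cosine": 0.80}

def keep(scores):
    """Keep a pair only if every metric meets its threshold."""
    return all(scores[m] >= t for m, t in THRESHOLDS.items())

pairs = [
    {"sacrebleu": 31.2, "comet": 0.81, "labse_cosine": 0.88},  # kept
    {"sacrebleu": 25.0, "comet": 0.70, "labse_cosine": 0.85},  # low COMET
]
print([keep(p) for p in pairs])  # [True, False]
```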
[101] Gumbel Distillation for Parallel Text Generation
Chi Zhang, Xixi Hu, Bo Liu, Qiang Liu
Main category: cs.CL
TL;DR: Gumbel Distillation is a novel technique that improves parallel decoding language models by distilling knowledge from autoregressive teachers using Gumbel-Max trick, achieving significant quality improvements over existing parallel decoding methods.
Details
Motivation: Autoregressive language models are slow due to sequential decoding, while parallel decoding methods sacrifice generation quality because they struggle to model complex joint distributions of token sequences. There's a need to bridge this performance gap between AR and non-AR models.
Method: Gumbel Distillation uses the Gumbel-Max trick to create a deterministic mapping from a latent Gumbel noise space to the output tokens of a high-performing AR teacher model. This model-agnostic technique can be integrated with various parallel decoding architectures like MDLM and BD3-LM.
Result: Experiments on LM1B and OpenWebText show substantial improvements: 30.0% improvement in MAUVE score and 10.5% improvement in generative perplexity over MDLM trained on OpenWebText dataset.
Conclusion: Gumbel Distillation effectively narrows the performance gap between autoregressive and parallel decoding models by enabling parallel decoders to learn complex token distributions from AR teachers, making parallel generation more viable.
Abstract: The slow, sequential nature of autoregressive (AR) language models has driven the adoption of parallel decoding methods. However, these non-AR models often sacrifice generation quality as they struggle to model the complex joint distribution of token sequences. To narrow this performance gap, we introduce Gumbel Distillation, a novel distillation technique that enables parallel decoders to learn this distribution effectively. Our method leverages the Gumbel-Max trick to create a deterministic mapping from a latent Gumbel noise space to the output tokens of a high-performing AR teacher. As a model-agnostic technique, Gumbel Distillation seamlessly integrates with diverse parallel decoding architectures, including MDLM and BD3-LM. Experiments on LM1B and OpenWebText show that Gumbel Distillation substantially improves the generation quality of parallel language models, achieving a 30.0% improvement in MAUVE score and 10.5% in generative perplexity over MDLM trained on OpenWebText dataset. Code available at https://github.com/hxixixh/gumbel-distill.
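The Gumbel-Max trick the method builds on states that argmax_i (log p_i + G_i), with G_i i.i.d. standard Gumbel noise, is an exact sample from Categorical(p). The sketch below verifies this empirically; the distillation mapping itself (from noise space to AR teacher outputs) is more involved.

```python
# Empirical check of the Gumbel-Max trick.
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.2, 0.3, 0.5])

def gumbel_max_sample(log_p, size):
    """Draw `size` categorical samples via argmax of log_p + Gumbel noise."""
    g = rng.gumbel(size=(size, len(log_p)))   # standard Gumbel noise
    return np.argmax(log_p + g, axis=1)

draws = gumbel_max_sample(np.log(p), 100_000)
freqs = np.bincount(draws, minlength=3) / len(draws)
print(np.round(freqs, 2))                     # close to [0.2, 0.3, 0.5]
```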
[102] MemDLM: Memory-Enhanced DLM Training
Zehua Pei, Hui-Ling Zhen, Weizhe Lin, Sinno Jialin Pan, Yunhe Wang, Mingxuan Yuan, Bei Yu
Main category: cs.CL
TL;DR: MemDLM enhances Diffusion Language Models by embedding simulated denoising into training via bi-level optimization with parametric memory, improving convergence and enabling in-weight retrieval at inference.
Details
Motivation: Diffusion Language Models suffer from train-inference mismatch: trained with static single-step masked prediction but deployed through multi-step progressive denoising. This gap limits their effectiveness compared to Auto-Regressive models.
Method: MemDLM uses bi-level optimization with inner loop updating fast weights (parametric memory) to capture local trajectory experience per sample, and outer loop updating base model conditioned on this memory. This embeds simulated denoising into training.
Result: MemDLM achieves faster convergence, lower training loss, and improved long-context understanding. At inference, parametric memory acts as emergent in-weight retrieval mechanism, reducing token-level attention bottlenecks on Needle-in-a-Haystack tasks.
Conclusion: MemDLM effectively bridges the train-inference gap in DLMs through parametric memory and bi-level optimization, offering better performance and emergent retrieval capabilities while maintaining DLM advantages over AR models.
Abstract: Diffusion Language Models (DLMs) offer attractive advantages over Auto-Regressive (AR) models, such as full-attention parallel decoding and flexible generation. However, they suffer from a notable train-inference mismatch: DLMs are trained with a static, single-step masked prediction objective, but deployed through a multi-step progressive denoising trajectory. We propose MemDLM (Memory-Enhanced DLM), which narrows this gap by embedding a simulated denoising process into training via Bi-level Optimization. An inner loop updates a set of fast weights, forming a Parametric Memory that captures the local trajectory experience of each sample, while an outer loop updates the base model conditioned on this memory. By offloading memorization pressure from token representations to parameters, MemDLM yields faster convergence and lower training loss. Moreover, the inner loop can be re-enabled at inference time as an adaptation step, yielding additional gains on long-context understanding. We find that, when activated at inference time, this Parametric Memory acts as an emergent in-weight retrieval mechanism, helping MemDLM further reduce token-level attention bottlenecks on challenging Needle-in-a-Haystack retrieval tasks. Code: https://github.com/JarvisPei/MemDLM.
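The bi-level pattern, an inner loop that adapts fast weights from the base weights and an outer loop that updates the base weights through the adapted fast weights, can be shown on a toy scalar problem. This is a generic first-order, MAML-style sketch, not MemDLM's actual objective or memory mechanism.

```python
# Toy first-order bi-level optimization: fit w so that w*x = y (y = 2x).
def grad(w, x, y):
    """Gradient of the squared error (w*x - y)^2 with respect to w."""
    return 2 * (w * x - y) * x

base, inner_lr, outer_lr = 0.0, 0.1, 0.05
data = [(1.0, 2.0), (2.0, 4.0), (0.5, 1.0)]       # y = 2x, so w* = 2

for _ in range(200):
    for x, y in data:
        fast = base - inner_lr * grad(base, x, y)  # inner: fast weights
        base -= outer_lr * grad(fast, x, y)        # outer: base update
print(round(base, 2))  # converges toward 2.0
```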
[103] Greater accessibility can amplify discrimination in generative AI
Carolin Holtermann, Minh Duc Bui, Kaitlyn Zhou, Valentin Hofmann, Katharina von der Wense, Anne Lauscher
Main category: cs.CL
TL;DR: Audio-enabled LLMs exhibit systematic gender discrimination based on speaker voice, amplifying bias beyond text models and creating tension between accessibility and fairness in voice interfaces.
Details
Motivation: While voice interfaces promise to expand accessibility for users with literacy, motor, or device limitations, speech carries identity cues that users cannot easily mask, raising concerns about whether accessibility gains may come at the cost of equitable treatment and whether voice interfaces introduce distinct bias mechanisms.
Method: The study examines audio-enabled LLMs for systematic gender discrimination by analyzing how responses shift toward gender-stereotyped adjectives and occupations based solely on speaker voice. It includes complementary survey evidence (n=1,000) on user perceptions and demonstrates pitch manipulation as a potential mitigation strategy.
Result: Audio-enabled LLMs exhibit systematic gender discrimination, shifting responses toward gender-stereotyped adjectives and occupations based on speaker voice, amplifying bias beyond text-based interaction. Survey shows infrequent chatbot users are most hesitant about attribute inference and most likely to disengage when such practices are revealed. Pitch manipulation can systematically regulate gender-discriminatory outputs.
Conclusion: Voice interfaces do not merely extend text models to a new modality but introduce distinct bias mechanisms tied to paralinguistic cues, creating a critical tension in AI development where efforts to expand accessibility through voice interfaces simultaneously create new pathways for discrimination, demanding that fairness and accessibility be addressed in tandem.
Abstract: Hundreds of millions of people rely on large language models (LLMs) for education, work, and even healthcare. Yet these models are known to reproduce and amplify social biases present in their training data. Moreover, text-based interfaces remain a barrier for many, for example, users with limited literacy, motor impairments, or mobile-only devices. Voice interaction promises to expand accessibility, but unlike text, speech carries identity cues that users cannot easily mask, raising concerns about whether accessibility gains may come at the cost of equitable treatment. Here we show that audio-enabled LLMs exhibit systematic gender discrimination, shifting responses toward gender-stereotyped adjectives and occupations solely on the basis of speaker voice, and amplifying bias beyond that observed in text-based interaction. Thus, voice interfaces do not merely extend text models to a new modality but introduce distinct bias mechanisms tied to paralinguistic cues. Complementary survey evidence ($n=1,000$) shows that infrequent chatbot users are most hesitant about undisclosed attribute inference and most likely to disengage when such practices are revealed. To demonstrate a potential mitigation strategy, we show that pitch manipulation can systematically regulate gender-discriminatory outputs. Overall, our findings reveal a critical tension in AI development: efforts to expand accessibility through voice interfaces simultaneously create new pathways for discrimination, demanding that fairness and accessibility be addressed in tandem.
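The pitch-manipulation idea can be illustrated with a naive resampling-based pitch shift on a synthetic tone. This is not the paper's audio pipeline: reading the waveform faster raises pitch by the same factor but also shortens the clip, a known artifact of this toy method.

```python
import math

def sine(freq, sr=8000, dur=1.0):
    n = int(sr * dur)
    return [math.sin(2 * math.pi * freq * i / sr) for i in range(n)]

def resample(signal, factor):
    # Read the signal `factor` times faster (linear interpolation):
    # pitch rises by `factor`, duration shrinks by the same factor.
    out, pos = [], 0.0
    while pos < len(signal) - 1:
        i = int(pos)
        frac = pos - i
        out.append(signal[i] * (1 - frac) + signal[i + 1] * frac)
        pos += factor
    return out

def zero_crossings(signal):
    # Upward zero crossings: one per cycle of a (near-)sinusoid.
    return sum(1 for a, b in zip(signal, signal[1:]) if a < 0 <= b)

voice = sine(220.0)                       # ~220 Hz tone as a proxy "voice"
shifted = resample(voice, 2 ** (4 / 12))  # raise pitch by 4 semitones
f0 = zero_crossings(voice)                # crossings over 1.0 s
f1 = zero_crossings(shifted) / (len(shifted) / 8000)
print(round(f1 / f0, 2))                  # pitch ratio, close to 2^(4/12) = 1.26
```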
[104] Edu-Values: Towards Evaluating the Chinese Education Values of Large Language Models
Peiyi Zhang, Yazhou Zhang, Bo Wang, Lu Rong, Prayag Tiwari, Jing Qin
Main category: cs.CL
TL;DR: Edu-Values is the first Chinese education values evaluation benchmark with 1,418 questions across 7 core values, testing 21 LLMs and finding Chinese models outperform English ones, with Qwen 2 scoring highest at 81.37.
Details
Motivation: There is a need for specialized evaluation benchmarks for Chinese education values to assess how well LLMs understand and align with Chinese educational culture, professional ethics, and regulations.
Method: Created a comprehensive benchmark with 1,418 questions covering 7 core education values using multiple question formats (multiple-choice, multi-modal QA, subjective analysis, adversarial prompts, Chinese traditional culture). Evaluated 21 state-of-the-art LLMs using human feedback based automatic evaluation.
Result: Chinese LLMs outperform English LLMs due to educational culture differences (Qwen 2 ranked first with 81.37). LLMs struggle most with teachers’ professional ethics and professional philosophy. Using Edu-Values as an external knowledge repository for RAG significantly improves LLM alignment.
Conclusion: Edu-Values is an effective benchmark for evaluating LLMs on Chinese education values, revealing cultural differences in performance and demonstrating practical utility for improving LLM alignment through RAG applications.
Abstract: In this paper, we present Edu-Values, the first Chinese education values evaluation benchmark that includes seven core values: professional philosophy, teachers’ professional ethics, education laws and regulations, cultural literacy, educational knowledge and skills, basic competencies and subject knowledge. We meticulously design 1,418 questions, covering multiple-choice, multi-modal question answering, subjective analysis, adversarial prompts, and Chinese traditional culture (short answer) questions. We conduct human feedback based automatic evaluation over 21 state-of-the-art (SoTA) LLMs, and highlight three main findings: (1) due to differences in educational culture, Chinese LLMs outperform English LLMs, with Qwen 2 ranking first with a score of 81.37; (2) LLMs often struggle with teachers’ professional ethics and professional philosophy; (3) leveraging Edu-Values to build an external knowledge repository for RAG significantly improves LLMs’ alignment. This demonstrates the effectiveness of the proposed benchmark.
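Using a benchmark as an external knowledge repository for RAG amounts to retrieving relevant entries and prepending them to the prompt. A minimal sketch with a toy word-overlap retriever; the corpus snippets and query are invented for illustration, not actual Edu-Values items.

```python
# Toy retrieval-augmented prompting over a small values corpus.

def retrieve(query, corpus, k=1):
    """Rank corpus entries by word overlap with the query (toy retriever)."""
    q = set(query.lower().split())
    scored = sorted(corpus,
                    key=lambda doc: len(q & set(doc.lower().split())),
                    reverse=True)
    return scored[:k]

corpus = [
    "Teachers' professional ethics require fairness and care toward every student.",
    "Education laws and regulations govern compulsory schooling requirements.",
    "Cultural literacy includes knowledge of Chinese traditional culture.",
]
query = "What do professional ethics require of teachers toward a student?"
context = retrieve(query, corpus)[0]
prompt = f"Context: {context}\nQuestion: {query}"  # fed to the LLM
print(context)
```

A real system would swap the overlap score for dense embeddings, but the prompt-assembly step is the same.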
[105] Multi-Session Client-Centered Treatment Outcome Evaluation in Psychotherapy
Hongbin Na, Tao Shen, Shumao Yu, Ling Chen
Main category: cs.CL
TL;DR: IPAEval is a framework for automated treatment outcome evaluation in psychotherapy from the client’s perspective using clinical interviews, incorporating cross-session contextual assessment and session-focused dynamics assessment.
Details
Motivation: Existing LLM approaches for therapeutic outcome assessment focus on therapist-centered, single-session evaluations, neglecting client's subjective experience and longitudinal progress across multiple sessions.
Method: Proposes IPAEval framework with two-stage prompt scheme that maps client information onto psychometric test items for interpretable psychological assessments, integrating cross-session client-contextual assessment and session-focused client-dynamics assessment.
Result: Experiments on TheraPhase dataset (400 paired client records) show IPAEval effectively tracks symptom severity and treatment outcomes over multiple sessions, outperforming baselines across both closed-source and open-source models.
Conclusion: IPAEval enables comprehensive client-centered therapeutic outcome evaluation with interpretable assessments, validating benefits of items-aware reasoning mechanisms for longitudinal mental health monitoring.
Abstract: In psychotherapy, therapeutic outcome assessment, or treatment outcome evaluation, is essential to mental health care by systematically evaluating therapeutic processes and outcomes. Existing large language model approaches often focus on therapist-centered, single-session evaluations, neglecting the client’s subjective experience and longitudinal progress across multiple sessions. To address these limitations, we propose IPAEval, a client-Informed Psychological Assessment-based Evaluation framework, which automates treatment outcome evaluations from the client’s perspective using clinical interviews. It integrates cross-session client-contextual assessment and session-focused client-dynamics assessment for a comprehensive understanding of therapeutic progress. Specifically, IPAEval employs a two-stage prompt scheme that maps client information onto psychometric test items, enabling interpretable and structured psychological assessments. Experiments on our new TheraPhase dataset, comprising 400 paired initial and completion stage client records, demonstrate that IPAEval effectively tracks symptom severity and treatment outcomes over multiple sessions, outperforming baseline approaches across both closed-source and open-source models, and validating the benefits of items-aware reasoning mechanisms.
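The two-stage idea of mapping client information onto psychometric items can be sketched as follows. The items, keyword cues, and scoring rule below are invented stand-ins for IPAEval's LLM-driven stages; the point is the structure: stage 1 collects per-item evidence, stage 2 turns evidence into an item score, and repeating this per session tracks change.

```python
# Toy two-stage item-based assessment over two sessions (illustrative only).

ITEMS = {
    "sleep": ["insomnia", "sleep", "tired"],
    "mood": ["sad", "hopeless", "down"],
}

def stage1_evidence(transcript, items=ITEMS):
    """Map each psychometric item to the sentences that mention its cues."""
    sentences = [s.strip() for s in transcript.split(".") if s.strip()]
    return {item: [s for s in sentences if any(k in s.lower() for k in kws)]
            for item, kws in items.items()}

def stage2_score(evidence):
    """Score each item 0-2 by evidence volume (a stand-in for LLM judgment)."""
    return {item: min(len(ev), 2) for item, ev in evidence.items()}

session1 = "I feel sad most days. I can't sleep at night. I am always tired."
session2 = "My mood is better now. Sleep is still hard some nights."
scores = [stage2_score(stage1_evidence(t)) for t in (session1, session2)]
print(scores)  # per-session item scores show symptom change across sessions
```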
[106] Instructional Text Across Disciplines: A Survey of Representations, Downstream Tasks, and Open Challenges Toward Capable AI Agents
Abdulfattah Safa, Tamta Kapanadze, Arda Uzunoğlu, Gözde Gül Şahin
Main category: cs.CL
TL;DR: A comprehensive survey paper analyzing the landscape of complex instruction understanding and processing in large language models, examining 181 papers to identify trends, challenges, and opportunities in this emerging field.
Details
Motivation: Real-world tasks often involve complex, multi-step instructions that remain challenging for current NLP systems, despite advances in instruction tuning. Robust understanding of such instructions is essential for deploying LLMs as general-purpose agents that can be programmed in natural language to perform complex tasks across domains like robotics, business automation, and interactive systems.
Method: Conducted a systematic literature review of 181 papers, analyzing available resources, representation schemes, and downstream tasks related to instructional text, and examining the trends, challenges, and opportunities in this emerging field.
Result: The survey provides AI/NLP researchers with essential background knowledge and a unified view of various approaches to complex instruction understanding. It bridges gaps between different research directions and highlights future research opportunities in this emerging field.
Conclusion: This comprehensive survey systematically analyzes the landscape of complex instruction understanding, offering researchers a foundation for advancing LLMs as general-purpose agents capable of handling real-world, multi-step instructions across various domains.
Abstract: Recent advances in large language models have demonstrated promising capabilities in following simple instructions through instruction tuning. However, real-world tasks often involve complex, multi-step instructions that remain challenging for current NLP systems. Robust understanding of such instructions is essential for deploying LLMs as general-purpose agents that can be programmed in natural language to perform complex, real-world tasks across domains like robotics, business automation, and interactive systems. Despite growing interest in this area, there is a lack of a comprehensive survey that systematically analyzes the landscape of complex instruction understanding and processing. Through a systematic review of the literature, we analyze available resources, representation schemes, and downstream tasks related to instructional text. Our study examines 181 papers, identifying trends, challenges, and opportunities in this emerging field. We provide AI/NLP researchers with essential background knowledge and a unified view of various approaches to complex instruction understanding, bridging gaps between different research directions and highlighting future research opportunities.
[107] The MediaSpin Dataset: Post-Publication News Headline Edits Annotated for Media Bias
Preetika Verma, Kokil Jaidka
Main category: cs.CL
TL;DR: The MediaSpin dataset is the first to characterize bias in how news outlets editorialize headlines after publication, with 78,910 headline pairs annotated for 13 types of media bias using human-supervised LLM labeling.
Details
Motivation: Online news editability significantly shapes public perception through dynamic headline framing, but there's no systematic way to identify what types of media bias are editorialized in/out of headlines after publication.
Method: Created MediaSpin dataset with 78,910 headline pairs annotated for 13 distinct media bias types using human-supervised LLM labeling, enabling systematic bias identification and analysis.
Result: Dataset provides linguistic insights into editorial bias patterns and enables applications for bias prediction and user behavior analysis in news headline editing.
Conclusion: MediaSpin enables systematic study of editorial bias in headline editing, offering tools for bias prediction and understanding how news framing evolves post-publication.
Abstract: The editability of online news content has become a significant factor in shaping public perception, as social media platforms introduce new affordances for dynamic and adaptive news framing. Edits to news headlines can refocus audience attention, add or remove emotional language, and shift the framing of events in subtle yet impactful ways. What types of media bias are editorialized in and out of news headlines, and how can they be systematically identified? This study introduces the MediaSpin dataset, the first to characterize the bias in how prominent news outlets editorialize news headlines after publication. The dataset includes 78,910 pairs of headlines annotated with 13 distinct types of media bias, using human-supervised LLM labeling. We discuss the linguistic insights it affords and show its applications for bias prediction and user behavior analysis.
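The unit MediaSpin annotates is a before/after headline pair. A small helper using Python's standard difflib can surface exactly what an edit changed, which is the span a bias annotator would look at; the example headlines here are invented.

```python
import difflib

def edited_spans(before, after):
    """Return (op, before_words, after_words) for each non-equal diff region."""
    a, b = before.split(), after.split()
    sm = difflib.SequenceMatcher(a=a, b=b)
    return [(op, " ".join(a[i1:i2]), " ".join(b[j1:j2]))
            for op, i1, i2, j1, j2 in sm.get_opcodes() if op != "equal"]

before = "City council approves new housing plan"
after = "City council rams through controversial housing plan"
print(edited_spans(before, after))
# [('replace', 'approves new', 'rams through controversial')]
```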
[108] Levels of Analysis for Large Language Models
Alexander Y. Ku, Declan Campbell, Xuechunzi Bai, Jiayi Geng, Ryan Liu, Raja Marjieh, R. Thomas McCoy, Andrew Nam, Ilia Sucholutsky, Veniamin Veselovsky, Liyi Zhang, Jian-Qiao Zhu, Thomas L. Griffiths
Main category: cs.CL
TL;DR: Applying cognitive science methods from Marr’s levels of analysis to understand large language models as complex information processing systems
Details
Motivation: Large language models are increasingly powerful but opaque, similar to historical challenges in understanding the human mind. The authors recognize that cognitive science has developed methods for studying complex information processing systems that could be applied to LLMs.
Method: Proposes a framework based on David Marr’s three levels of analysis (computational, algorithmic, implementational) for studying information processing systems. Revisits established cognitive science techniques relevant to each level and illustrates how they can provide insights into LLM behavior and internal organization.
Result: The paper provides a conceptual toolkit for making sense of LLMs by bridging cognitive science methodology with AI system analysis, though specific empirical results aren’t mentioned in the abstract.
Conclusion: Cognitive science methods developed for understanding the human mind can be productively applied to understand large language models, offering a systematic approach to demystifying these increasingly powerful but opaque AI systems.
Abstract: Modern artificial intelligence systems, such as large language models, are increasingly powerful but also increasingly hard to understand. Recognizing this problem as analogous to the historical difficulties in understanding the human mind, we argue that methods developed in cognitive science can be useful for understanding large language models. We propose a framework for applying these methods based on the levels of analysis that David Marr proposed for studying information processing systems. By revisiting established cognitive science techniques relevant to each level and illustrating their potential to yield insights into the behavior and internal organization of large language models, we aim to provide a toolkit for making sense of these new kinds of minds.
[109] Breaking the Silence: A Dataset and Benchmark for Bangla Text-to-Gloss Translation
Sharif Mohammad Abdullah, Abhijit Paul, Shubhashis Roy Dipta, Zarif Masud, Shebuti Rayana, Ahmedul Kabir
Main category: cs.CL
TL;DR: First dataset and model for Bangla text-to-gloss translation using synthetic data generation and comparative analysis of LLMs.
Details
Motivation: Address the gap in Bangla Sign Language (BdSL) research by creating the first text-to-gloss translation dataset and models for a low-resource language with 3+ million deaf/hard-of-hearing population.
Method: Constructed dataset with 1,000 manual and 4,000 synthetic Bangla sentence-gloss pairs, plus 159 expert-annotated test pairs. Compared fine-tuned open-source models (mBART) vs. closed-source LLMs (GPT-5.4, Qwen-3) for translation performance.
Result: GPT-5.4 achieved best overall performance, fine-tuned mBART performed competitively despite being 100x smaller, and Qwen-3 outperformed all in human evaluation. Synthetic data generation proved effective for low-resource translation.
Conclusion: Established first dataset and model for Bangla text-to-gloss translation, demonstrating synthetic data’s value for low-resource sign language translation and showing competitive performance from smaller fine-tuned models.
Abstract: Gloss is a written approximation that bridges Sign Language (SL) and its corresponding spoken language. Despite a deaf and hard-of-hearing population of at least 3 million in Bangladesh, Bangla Sign Language (BdSL) remains largely understudied, with no prior work on Bangla text-to-gloss translation and no publicly available datasets. To address this gap, we construct the first Bangla text-to-gloss dataset, consisting of 1,000 manually annotated and 4,000 synthetically generated Bangla sentence-gloss pairs, along with 159 expert human-annotated pairs used as a test set. Our experimental framework performs a comparative analysis between several fine-tuned open-source models and a leading closed-source LLM to evaluate their performance in low-resource BdSL translation. GPT-5.4 achieves the best overall performance, while a fine-tuned mBART model performs competitively despite being approximately 100x smaller. Qwen-3 outperforms all other models in human evaluation. This work introduces the first dataset and trained model for Bangla text-to-gloss translation. It also demonstrates the effectiveness of systematically generated synthetic data for addressing challenges in low-resource sign language translation.
[110] Must Read: A Comprehensive Survey of Computational Persuasion
Nimet Beyza Bozdag, Shuhaib Mehri, Xiaocheng Yang, Hyeonjeong Ha, Zirui Cheng, Esin Durmus, Jiaxuan You, Heng Ji, Gokhan Tur, Dilek Hakkani-Tür
Main category: cs.CL
TL;DR: A comprehensive survey on AI and persuasion covering AI as persuader, persuadee, and judge, with taxonomy and future challenges for ethical AI persuasion.
Details
Motivation: Persuasion is fundamental to communication and decision-making across contexts. The rise of conversational AI expands persuasion's scope, creating both opportunities (beneficial applications) and risks (unethical influence). AI systems can both persuade and be persuaded, making them vulnerable to attacks and bias. Understanding effective persuasion remains limited due to its subjective, context-dependent nature.
Method: Survey paper providing comprehensive overview structured around three perspectives: (1) AI as Persuader (AI-generated persuasive content and applications), (2) AI as Persuadee (AI’s susceptibility to influence and manipulation), (3) AI as Persuasion Judge (AI’s role in evaluating persuasive strategies, detecting manipulation, ensuring ethical persuasion). Introduces taxonomy for persuasion research.
Result: Presents structured framework for understanding AI’s role in persuasion across three dimensions. Identifies key challenges for future research to enhance safety, fairness, and effectiveness of AI-powered persuasion while addressing risks from increasingly capable language models.
Conclusion: AI’s role in persuasion is multi-faceted - as persuader, persuadee, and judge. Future research must address ethical challenges, safety concerns, and develop frameworks for responsible AI persuasion as language models become more capable.
Abstract: Persuasion is a fundamental aspect of communication, influencing decision-making across diverse contexts, from everyday conversations to high-stakes scenarios such as politics, marketing, and law. The rise of conversational AI systems has significantly expanded the scope of persuasion, introducing both opportunities and risks. AI-driven persuasion can be leveraged for beneficial applications, but also poses threats through unethical influence. Moreover, AI systems are not only persuaders, but also susceptible to persuasion, making them vulnerable to adversarial attacks and bias reinforcement. Despite rapid advancements in AI-generated persuasive content, our understanding of what makes persuasion effective remains limited due to its inherently subjective and context-dependent nature. In this survey, we provide a comprehensive overview of persuasion, structured around three key perspectives: (1) AI as a Persuader, which explores AI-generated persuasive content and its applications; (2) AI as a Persuadee, which examines AI’s susceptibility to influence and manipulation; and (3) AI as a Persuasion Judge, which analyzes AI’s role in evaluating persuasive strategies, detecting manipulation, and ensuring ethical persuasion. We introduce a taxonomy for persuasion research and discuss key challenges for future research to enhance the safety, fairness, and effectiveness of AI-powered persuasion while addressing the risks posed by increasingly capable language models.
[111] MobileIPL: Enhancing Mobile Agents Thinking Process via Iterative Preference Learning
Kun Huang, Weikai Xu, Yuxuan Liu, Quandong Wang, Pengzhi Gao, Wei Liu, Jian Luan, Bin Wang, Bo An
Main category: cs.CL
TL;DR: MobileIPL uses iterative preference learning with CoaT-tree sampling and thinking-level DPO to improve VLM-based mobile GUI agents, achieving SOTA performance on benchmarks with strong generalization.
Details
Motivation: Current CoaT paradigms for VLM-based mobile agents suffer from limited diverse trajectories, and existing self-training methods either ignore intermediate reasoning correctness or require expensive process-level annotations.
Method: Proposes Iterative Preference Learning (IPL): constructs CoaT-tree through iterative sampling, scores leaf nodes with rule-based rewards, backpropagates feedback to create Thinking-level DPO pairs, and uses three-stage instruction evolution with GPT-4o for diverse Q&A generation.
Result: MobileIPL outperforms strong baselines including OS-ATLAS and UI-TARS, achieves state-of-the-art performance across three standard Mobile GUI-Agent benchmarks, and shows strong generalization to out-of-domain scenarios.
Conclusion: The IPL framework effectively addresses data scarcity in CoaT trajectories through iterative preference learning and instruction evolution, enabling improved reasoning and generalization for VLM-based mobile GUI agents.
Abstract: The Chain of Action-Planning Thoughts (CoaT) paradigm has been shown to improve the reasoning performance of VLM-based mobile agents in GUI tasks. However, the scarcity of diverse CoaT trajectories limits the expressiveness and generalization ability of such agents. While self-training is commonly employed to address data scarcity, existing approaches either overlook the correctness of intermediate reasoning steps or depend on expensive process-level annotations to construct process reward models (PRM). To address the above problems, we propose an Iterative Preference Learning (IPL) that constructs a CoaT-tree through iterative sampling, scores leaf nodes using rule-based reward, and backpropagates feedback to derive Thinking-level Direct Preference Optimization (T-DPO) pairs. To prevent overfitting during warm-up supervised fine-tuning, we further introduce a three-stage instruction evolution, which leverages GPT-4o to generate diverse Q&A pairs based on real mobile UI screenshots, enhancing both generality and layout understanding. Experiments on three standard Mobile GUI-Agent benchmarks demonstrate that our agent MobileIPL outperforms strong baselines, including continual pretraining models such as OS-ATLAS and UI-TARS. It achieves state-of-the-art performance across these benchmarks and shows strong generalization to out-of-domain scenarios.
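The core IPL loop (sample candidate branches, score leaves with a rule-based reward, pair high-reward against low-reward branches as preference data) can be sketched in a few lines. The task, branch format, and reward rule below are invented placeholders, not the paper's implementation.

```python
import random

random.seed(0)  # reproducible sampling for the sketch

def sample_branches(state, n=4):
    """Stand-in for sampling n candidate thought/action branches from the agent."""
    return [f"{state} -> action{random.randint(0, 9)}" for _ in range(n)]

def rule_reward(leaf, goal):
    """Rule-based reward: 1 if the trajectory ends at the goal action."""
    return 1.0 if leaf.endswith(goal) else 0.0

def build_dpo_pair(state, goal):
    branches = sample_branches(state)
    scored = sorted(branches, key=lambda b: rule_reward(b, goal), reverse=True)
    # Pair the highest-reward branch (chosen) with the lowest (rejected);
    # these pairs would then feed a DPO-style preference objective.
    return {"prompt": state, "chosen": scored[0], "rejected": scored[-1]}

pair = build_dpo_pair("open settings screen", goal="action6")
print(pair)
```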
[112] MolLangBench: A Comprehensive Benchmark for Language-Prompted Molecular Structure Recognition, Editing, and Generation
Feiyang Cai, Jiahui Bai, Tao Tang, Guijuan He, Joshua Luo, Tianyu Zhu, Srikanth Pilla, Gang Li, Ling Liu, Feng Luo
Main category: cs.CL
TL;DR: MolLangBench is a benchmark for evaluating molecule-language interface tasks including recognition, editing, and generation, revealing significant limitations in current AI models like GPT-5.
Details
Motivation: There's a need for precise recognition, editing, and generation of molecules for both chemists and AI systems, but current benchmarks don't adequately evaluate fundamental molecule-language interface tasks across different molecular representations.
Method: Created a comprehensive benchmark with recognition tasks using automated cheminformatics tools, and editing/generation tasks through expert annotation and validation. Supports evaluation across multiple molecular representations: linear strings, molecular images, and molecular graphs.
Result: State-of-the-art models show significant limitations: GPT-5 achieves 86.2% accuracy on recognition and 85.5% on editing (tasks intuitive for humans), and only 43.0% on generation tasks.
Conclusion: Current AI systems have substantial shortcomings in handling basic molecular recognition and manipulation tasks, and MolLangBench should catalyze research toward more effective AI systems for chemical applications.
Abstract: Precise recognition, editing, and generation of molecules are essential prerequisites for both chemists and AI systems tackling various chemical tasks. We present MolLangBench, a comprehensive benchmark designed to evaluate fundamental molecule-language interface tasks: language-prompted molecular structure recognition, editing, and generation. To ensure high-quality, unambiguous, and deterministic outputs, we construct the recognition tasks using automated cheminformatics tools, and curate editing and generation tasks through rigorous expert annotation and validation. MolLangBench supports the evaluation of models that interface language with different molecular representations, including linear strings, molecular images, and molecular graphs. Evaluations of state-of-the-art models reveal significant limitations: the strongest model (GPT-5) achieves 86.2% and 85.5% accuracy on recognition and editing tasks, which are intuitively simple for humans, and performs even worse on the generation task, reaching only 43.0% accuracy. These results highlight the shortcomings of current AI systems in handling even preliminary molecular recognition and manipulation tasks. We hope MolLangBench will catalyze further research toward more effective and reliable AI systems for chemical applications. The dataset and code can be accessed at https://huggingface.co/datasets/ChemFM/MolLangBench and https://github.com/TheLuoFengLab/MolLangBench, respectively.
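The recognition tasks are built with automated cheminformatics tools so that every question has a deterministic ground-truth answer. A much-simplified sketch of that construction idea, using a naive SMILES token counter instead of a real toolkit (it handles only plain SMILES: no bracket atoms, charges, or two-letter elements beyond Cl/Br):

```python
import re

def count_atoms(smiles, symbol):
    """Count occurrences of an element in simple SMILES (Cl/Br aware)."""
    tokens = re.findall(r"Cl|Br|[A-Z][a-z]?|[a-z]", smiles)
    return sum(1 for t in tokens if t == symbol)

def make_question(smiles, symbol):
    """Recognition task with a deterministic, tool-derived answer."""
    return (f"How many {symbol} atoms are in {smiles}?",
            count_atoms(smiles, symbol))

q, answer = make_question("CC(=O)OCC", "C")  # ethyl acetate: 4 carbons
print(q, "->", answer)
```

A production pipeline would use a real cheminformatics library for parsing and canonicalization; the sketch only shows how deterministic answers are derived from the linear string.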
[113] Theory-Grounded Evaluation of Human-Like Fallacy Patterns in LLM Reasoning
Andrew Keenan Richardson, Ryan Othniel Kearns, Sean Moss, Vincent Wang-Mascianica, Philipp Koralus
Main category: cs.CL
TL;DR: Language models’ logical reasoning errors follow human fallacy patterns, with more capable models making proportionally more theory-predicted fallacies, and premise order affecting fallacy rates similarly to humans.
Details
Motivation: To understand whether language models' logical reasoning errors follow established human fallacy patterns from cognitive theory, rather than just measuring overall correctness rates.
Method: Used Erotetic Theory of Reasoning (ETR) and its open-source implementation PyETR to generate 383 formally specified reasoning problems, evaluated 38 models, judged logical correctness and whether incorrect responses matched ETR-predicted fallacies, and tested premise order effects.
Result: Two key findings: 1) As model capability (Chatbot Arena Elo) increases, a larger share of incorrect answers are ETR-predicted fallacies, while overall correctness shows no correlation with capability; 2) Reversing premise order significantly reduces fallacy production for many models, mirroring human order effects.
Conclusion: Language models’ reasoning errors systematically follow human fallacy patterns, with error composition (not just rate) providing insights into model reasoning, and PyETR enables contamination-resistant reasoning tests linked to cognitive theory.
Abstract: We study logical reasoning in language models by asking whether their errors follow established human fallacy patterns. Using the Erotetic Theory of Reasoning (ETR) and its open-source implementation, PyETR, we programmatically generate 383 formally specified reasoning problems and evaluate 38 models. For each response, we judge logical correctness and, when incorrect, whether it matches an ETR-predicted fallacy. Two results stand out: (i) as a capability proxy (Chatbot Arena Elo) increases, a larger share of a model’s incorrect answers are ETR-predicted fallacies (ρ = 0.360, p = 0.0265), while overall correctness on this dataset shows no correlation with capability; (ii) reversing premise order significantly reduces fallacy production for many models, mirroring human order effects. Methodologically, PyETR provides an open-source pipeline for unbounded, synthetic, contamination-resistant reasoning tests linked to a cognitive theory, enabling analyses that focus on error composition rather than error rate.
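The headline analysis is a rank correlation between a capability proxy and the theory-predicted share of each model's errors. A self-contained Spearman computation on made-up numbers (not the paper's measurements) shows the shape of that error-composition analysis:

```python
# Spearman rank correlation from scratch (no-ties case); data is invented.

def rankdata(xs):
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    for r, i in enumerate(order):
        ranks[i] = float(r + 1)
    return ranks

def spearman(xs, ys):
    rx, ry = rankdata(xs), rankdata(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))  # valid only without ties

elo           = [1000, 1100, 1200, 1300, 1400]  # capability proxy per model
fallacy_share = [0.20, 0.35, 0.30, 0.50, 0.60]  # ETR-predicted / all errors
print(round(spearman(elo, fallacy_share), 2))   # 0.9 on this toy data
```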
[114] Knowledge Fusion via Bidirectional Information Aggregation
Songlin Zhai, Guilin Qi, Yue Wang, Yuan Meng
Main category: cs.CL
TL;DR: KGA is a novel inference-time framework that dynamically integrates external knowledge graphs into LLMs without parameter modification, using neuroscience-inspired bottom-up knowledge fusion and top-down attention guidance pathways.
Details
Motivation: LLMs remain static after pre-training, causing outdated knowledge that limits utility in time-sensitive applications. Current KG integration methods rely on parameter-invasive fine-tuning which risks catastrophic forgetting and cannot keep pace with evolving KGs in dynamic web environments.
Method: KGA introduces two synergistic pathways: 1) Bottom-up knowledge fusion pathway that dynamically integrates external KGs into input representations via input-driven KG fusion (similar to stimulus-driven attention in human brain), and 2) Top-down attention guidance pathway that assesses contextual relevance of each triple through goal-directed verification, suppressing task-irrelevant signals and amplifying knowledge-relevant patterns.
Result: Extensive experiments on four benchmarks verify KGA’s strong fusion performance and efficiency, demonstrating effective real-time knowledge integration without parameter modification.
Conclusion: KGA provides a novel framework for dynamic KG integration into LLMs at inference-time, addressing limitations of static models and parameter-invasive methods while supporting real-time knowledge fusion for evolving web environments.
Abstract: Knowledge graphs (KGs) are the cornerstone of the semantic web, offering up-to-date representations of real-world entities and relations. Yet large language models (LLMs) remain largely static after pre-training, causing their internal knowledge to become outdated and limiting their utility in time-sensitive web applications. To bridge this gap between dynamic knowledge and static models, a prevalent approach is to enhance LLMs with KGs. However, prevailing methods typically rely on parameter-invasive fine-tuning, which risks catastrophic forgetting and often degrades LLMs’ general capabilities. Moreover, their static integration frameworks cannot keep pace with the continuous evolution of real-world KGs, hindering their deployment in dynamic web environments. To bridge this gap, we introduce KGA (Knowledge Graph-guided Attention), a novel framework that dynamically integrates external KGs into LLMs exclusively at inference-time without any parameter modification. Inspired by research on neuroscience, we rewire the self-attention module by innovatively introducing two synergistic pathways: a bottom-up knowledge fusion pathway and a top-down attention guidance pathway. The bottom-up pathway dynamically integrates external knowledge into input representations via input-driven KG fusion, which is akin to the stimulus-driven attention process in the human brain. Complementarily, the top-down pathway aims to assess the contextual relevance of each triple through a goal-directed verification process, thereby suppressing task-irrelevant signals and amplifying knowledge-relevant patterns. By synergistically combining these two pathways, our method supports real-time knowledge fusion. Extensive experiments on four benchmarks verify KGA’s strong fusion performance and efficiency.
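On a single attention step, the two pathways reduce to: bias the attention logits with each triple's relevance (top-down guidance) and mix the weighted triple values into the representation (bottom-up fusion). The toy below is illustrative only; the vectors and relevance scores are invented, and the real method rewires self-attention inside an LLM rather than a standalone function.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    return [e / sum(es) for e in es]

def kga_attend(query, triple_keys, triple_values, relevance):
    # Top-down guidance: log-relevance biases the attention logits, so
    # task-irrelevant triples are suppressed before they mix in.
    logits = [sum(q * k for q, k in zip(query, key)) + math.log(rel)
              for key, rel in zip(triple_keys, relevance)]
    weights = softmax(logits)
    # Bottom-up fusion: weighted triple values join the input representation.
    return [sum(w * v[i] for w, v in zip(weights, triple_values))
            for i in range(len(triple_values[0]))]

query  = [1.0, 0.0]
keys   = [[1.0, 0.0], [1.0, 0.0]]  # two triples, equally similar to the query...
values = [[1.0, 0.0], [0.0, 1.0]]
relev  = [0.9, 0.1]                # ...but only one is task-relevant
out = kga_attend(query, keys, values, relev)
print([round(x, 2) for x in out])  # [0.9, 0.1]: relevance steers the mixture
```

With equal logits, the log-relevance bias makes the attention weights exactly proportional to the relevance scores, which is why the relevant triple dominates the output.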
[115] A Theory of Adaptive Scaffolding for LLM-Based Pedagogical Agents
Clayton Cohn, Surya Rayala, Namrata Srivastava, Joyce Horn Fonteles, Shruti Jain, Xinying Luo, Divya Mereddy, Naveeduddin Mohammed, Gautam Biswas
Main category: cs.CL
TL;DR: A framework combining Evidence-Centered Design, Social Cognitive Theory, and Zone of Proximal Development for creating theory-grounded LLM-based pedagogical agents for STEM+C education, implemented as the Inquizzitor formative assessment agent.
Details
Motivation: Current LLM systems in classrooms lack solid theoretical foundations compared to earlier intelligent tutoring systems, creating a need to bridge this gap with principled, theory-driven approaches to educational AI.
Method: Proposed framework combining Evidence-Centered Design with Social Cognitive Theory and Zone of Proximal Development for adaptive scaffolding. Instantiated as Inquizzitor, an LLM-based formative assessment agent integrating human-AI hybrid intelligence with cognitive science principles.
Result: Inquizzitor delivers high-quality assessment and interaction aligned with core learning theories, providing effective guidance that students value, demonstrating potential for theory-driven LLM integration in education.
Conclusion: The research shows that LLM-based pedagogical agents can provide adaptive and principled instruction when grounded in established learning theories, offering a promising direction for educational AI development.
Abstract: Large language models (LLMs) present new opportunities for creating pedagogical agents that engage in meaningful dialogue to support student learning. However, current LLM systems used in classrooms often lack the solid theoretical foundations found in earlier intelligent tutoring systems. To bridge this gap, we propose a framework that combines Evidence-Centered Design with Social Cognitive Theory and Zone of Proximal Development for adaptive scaffolding in LLM-based agents focused on STEM+C learning. We instantiate this framework with Inquizzitor, an LLM-based formative assessment agent that integrates human-AI hybrid intelligence and provides feedback grounded in cognitive science principles. Our findings show that Inquizzitor delivers high-quality assessment and interaction aligned with core learning theories, offering effective guidance that students value. This research demonstrates the potential for theory-driven LLM integration in education, highlighting the ability of these systems to provide adaptive and principled instruction.
[116] SafeConstellations: Mitigating Over-Refusals in LLMs Through Task-Aware Representation Steering
Utsav Maskey, Sumit Yadav, Mark Dras, Usman Naseem
Main category: cs.CL
TL;DR: SafeConstellations: An inference-time method that reduces LLM over-refusal by tracking task-specific trajectory patterns in embedding space and guiding representations toward non-refusal pathways, achieving up to 73% reduction in over-refusal rates.
Details
Motivation: LLMs increasingly exhibit over-refusal behavior where safety mechanisms cause models to reject benign instructions that resemble harmful content, diminishing utility in production applications that rely on common prompt templates or specific tasks like sentiment analysis and language translation.
Method: Mechanistic analysis reveals LLMs follow distinct “constellation” patterns in embedding space as representations traverse layers, with each NLP task maintaining consistent trajectories. SafeConstellations is an inference-time trajectory-shifting approach that tracks task-specific trajectory patterns and guides representations toward non-refusal pathways, selectively applied only to tasks prone to over-refusal.
Result: The method reduces over-refusal rates by up to 73% with minimal impact on utility, offering a principled and conditional approach to mitigating over-refusals while maintaining safety on genuinely harmful content.
Conclusion: SafeConstellations provides an effective inference-time solution to LLM over-refusal by leveraging task-specific trajectory patterns in embedding space, enabling conditional mitigation that preserves both utility and safety.
Abstract: LLMs increasingly exhibit over-refusal behavior, where safety mechanisms cause models to reject benign instructions that seemingly resemble harmful content. This phenomenon diminishes utility in production applications that repeatedly rely on common prompt templates or applications that frequently rely on LLMs for specific tasks (e.g. sentiment analysis, language translation). Through extensive evaluation, we demonstrate that LLMs persist in refusing inputs containing harmful content, even when they are reframed with tasks that have benign intent. Our mechanistic analysis reveals that LLMs follow distinct “constellation” patterns in embedding space as representations traverse layers, with each NLP task maintaining consistent trajectories that shift predictably between refusal and non-refusal cases. We introduce SafeConstellations, an inference-time trajectory-shifting approach that tracks task-specific trajectory patterns and guides representations toward non-refusal pathways. By selectively guiding model behavior only on tasks prone to over-refusal, our method reduces over-refusal rates by up to 73% with minimal impact on utility – offering a principled and conditional approach to mitigating over-refusals.
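Trajectory shifting can be sketched as moving each layer's hidden state part-way toward a reference non-refusal trajectory recorded for the task. A minimal sketch, assuming linear interpolation and a fixed strength; the paper's actual steering is calibrated per task and layer.

```python
import numpy as np

def trajectory_shift(layer_states, nonrefusal_traj, strength=0.5):
    """Shift each layer's representation toward the mean non-refusal
    trajectory for this task (both arrays: layers x hidden_dim).
    `strength` in [0, 1] interpolates; 0 leaves states untouched,
    1 replaces them with the reference trajectory."""
    layer_states = np.asarray(layer_states, dtype=float)
    nonrefusal_traj = np.asarray(nonrefusal_traj, dtype=float)
    return layer_states + strength * (nonrefusal_traj - layer_states)
```

Because the shift is applied only when the task is flagged as over-refusal-prone, unrelated behavior is left unchanged (strength 0).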
[117] Prompt-Induced Linguistic Fingerprints for LLM-Generated Fake News Detection
Chi Wang, Min Gao, Zongwei Wang, Junwei Yin, Kai Shu, Chenghua Lin
Main category: cs.CL
TL;DR: LIFE method detects LLM-generated fake news by analyzing prompt-induced linguistic fingerprints in word probability distributions, achieving state-of-the-art performance.
Details
Motivation: With the ease of generating fake news using large language models, there's an urgent need for reliable detection methods. Current approaches focus on textual content but struggle because LLM-generated fake news often appears coherent and factually consistent, making subtle falsification traces difficult to detect.
Method: The paper proposes Linguistic Fingerprints Extraction (LIFE), which reconstructs word-level probability distributions to find discriminative patterns. Through distributional divergence analysis, the authors discover prompt-induced linguistic fingerprints - statistically distinct probability shifts between LLM-generated real and fake news when maliciously prompted. They also use key-fragment techniques to amplify these subtle linguistic differences.
Result: Experiments show that LIFE achieves state-of-the-art performance in detecting LLM-generated fake news and maintains high performance on human-written fake news. The method effectively identifies subtle linguistic patterns that distinguish fake from real news.
Conclusion: LIFE provides an effective approach for detecting LLM-generated fake news by leveraging linguistic fingerprints in probability distributions, addressing the growing threat of AI-generated misinformation.
Abstract: With the rapid development of large language models, the generation of fake news has become increasingly effortless, posing a growing societal threat and underscoring the urgent need for reliable detection methods. Early efforts to identify LLM-generated fake news have predominantly focused on the textual content itself; however, because much of that content may appear coherent and factually consistent, the subtle traces of falsification are often difficult to uncover. Through distributional divergence analysis, we uncover prompt-induced linguistic fingerprints: statistically distinct probability shifts between LLM-generated real and fake news when maliciously prompted. Based on this insight, we propose a novel method named Linguistic Fingerprints Extraction (LIFE). By reconstructing word-level probability distributions, LIFE can find discriminative patterns that facilitate the detection of LLM-generated fake news. To further amplify these fingerprint patterns, we also leverage key-fragment techniques that accentuate subtle linguistic differences, thereby improving detection reliability. Our experiments show that LIFE achieves state-of-the-art performance in LLM-generated fake news and maintains high performance in human-written fake news. The code and data are available at https://anonymous.4open.science/r/LIFE-E86A.
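The underlying signal is a divergence between word-level probability distributions. A minimal sketch of such a comparison; the smoothing, the per-position averaging, and the function names are illustrative assumptions, not LIFE's actual extraction pipeline.

```python
import math

def kl_divergence(p, q, eps=1e-10):
    """KL(p || q) between two next-word distributions over a shared
    vocabulary; eps smooths zero probabilities."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def fingerprint_score(article_dists, reference_dists):
    """Average per-position divergence of an article's word
    distributions from reference (real-news) distributions; larger
    values suggest a prompt-induced probability shift."""
    divs = [kl_divergence(p, q) for p, q in zip(article_dists, reference_dists)]
    return sum(divs) / len(divs)
```

A threshold on such a score (or a classifier over the per-position divergences) would then separate maliciously prompted text from ordinary generations.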
[118] Long Chain-of-Thought Reasoning Across Languages
Josh Barua, Seun Eisape, Kayo Yin, Alane Suhr
Main category: cs.CL
TL;DR: Systematic investigation of multilingual chain-of-thought reasoning capabilities across model development stages, comparing English vs. target-language reasoning for non-English languages.
Details
Motivation: While large reasoning models excel at English chain-of-thought reasoning, there's limited understanding of how these capabilities transfer to other languages, creating a gap in multilingual reasoning research.
Method: Analyzes four model development stages (scaling, pretraining, post-training, inference) across nine non-English languages, comparing En-CoT (English reasoning) vs. Target-CoT (target-language reasoning) settings.
Result: Scaling improves En-CoT but Target-CoT lags, especially for complex reasoning; specialized reasoning pretraining helps En-CoT but hurts Target-CoT; translated English reasoning traces outperform distilled target-language traces; language-specific efficiency disparities and failure modes identified.
Conclusion: Multilingual reasoning capabilities differ significantly from English, with Target-CoT performance lagging behind En-CoT, highlighting the need for specialized approaches to develop robust multilingual reasoning models.
Abstract: While large reasoning models have shown remarkable ability to generate long chains-of-thought (CoTs) in English, we still lack understanding of how these long-form reasoning abilities transfer to the vast majority of the world’s languages. In this work, we systematically investigate four key stages of model development–scaling, pretraining, post-training, and inference–to understand how long CoT capabilities extend beyond English. We compare two reasoning settings across nine non-English target languages: En-CoT, where models process target-language inputs, but reason in English; and Target-CoT, where models both process inputs and generate long CoTs in the target language. We find that scaling reasoning model size improves multilingual task performance in En-CoT, but Target-CoT performance lags behind. This gap widens for tasks requiring long, multi-step CoTs such as mathematical reasoning. Shifting to pretraining, we find that adding a specialized reasoning stage enhances En-CoT performance but degrades Target-CoT, whereas broad multilingual pretraining improves both modes simultaneously. Given the scarcity of high-quality reasoning traces in languages other than English, we explore synthetic data curation approaches for post-training. We demonstrate that fine-tuning on reasoning traces automatically translated from gold English traces outperforms fine-tuning on target-language traces distilled from large reasoning models. Finally, we report disparities in inference efficiency between languages and uncover language-specific failure modes in CoTs. We release models, datasets, and code to foster further research.
[119] Sycophancy Is Not One Thing: Causal Separation of Sycophantic Behaviors in LLMs
Daniel Vennemeyer, Phan Anh Duong, Tiffany Zhan, Tianyu Jiang
Main category: cs.CL
TL;DR: LLM sycophancy decomposes into distinct agreement and praise behaviors with independent neural representations that can be separately manipulated.
Details
Motivation: To understand whether LLM sycophantic behaviors (excessive agreement/flattery) arise from a single mechanism or multiple distinct processes, and to characterize their neural representations.
Method: Decomposed sycophancy into sycophantic agreement and sycophantic praise, contrasting with genuine agreement. Used difference-in-means directions, activation additions, and subspace geometry analysis across multiple models and datasets.
Result: Three behaviors are encoded along distinct linear directions in latent space; each can be independently amplified/suppressed without affecting others; representational structure is consistent across model families and scales.
Conclusion: Sycophantic behaviors correspond to distinct, independently steerable representations rather than a unified mechanism.
Abstract: Large language models (LLMs) often exhibit sycophantic behaviors – such as excessive agreement with or flattery of the user – but it is unclear whether these behaviors arise from a single mechanism or multiple distinct processes. We decompose sycophancy into sycophantic agreement and sycophantic praise, contrasting both with genuine agreement. Using difference-in-means directions, activation additions, and subspace geometry across multiple models and datasets, we show that: (1) the three behaviors are encoded along distinct linear directions in latent space; (2) each behavior can be independently amplified or suppressed without affecting the others; and (3) their representational structure is consistent across model families and scales. These results suggest that sycophantic behaviors correspond to distinct, independently steerable representations.
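The two core tools here, difference-in-means directions and activation additions, are simple to state. A minimal sketch under the assumption of synthetic activations; real use would collect hidden states from contrastive prompt pairs.

```python
import numpy as np

def diff_in_means_direction(pos_acts, neg_acts):
    """Unit-norm direction separating two behavior classes, computed as
    the difference of their mean hidden activations (samples x dim)."""
    d = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def steer(hidden, direction, alpha):
    """Activation addition: shift a hidden state along a behavior
    direction; alpha > 0 amplifies the behavior, alpha < 0 suppresses it."""
    return hidden + alpha * direction
```

Because the paper finds the three behaviors lie along distinct directions, steering along one such direction leaves projections onto the others essentially unchanged.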
[120] Explore Briefly, Then Decide: Mitigating LLM Overthinking via Cumulative Entropy Regulation
Yi Bin, Tianyi Jiang, Yujuan Ding, Kainian Zhu, Fei Ma, Jingkuan Song, Yang Yang, Heng Tao Shen
Main category: cs.CL
TL;DR: A novel reasoning paradigm called “Explore Briefly, Then Decide” with Cumulative Entropy Regulation (CER) mechanism that uses Token Entropy Cumulative Average (TECA) metric to help LLMs dynamically determine optimal stopping points in reasoning, reducing overthinking and response length by up to 71% on simpler problems.
Details
Motivation: LLMs often suffer from overthinking - generating unnecessarily lengthy reasoning steps for simpler problems, which degrades efficiency and makes it difficult to adapt reasoning depth to problem complexity.
Method: Proposes Token Entropy Cumulative Average (TECA) metric to measure exploration extent in reasoning process, and introduces “Explore Briefly, Then Decide” paradigm with Cumulative Entropy Regulation (CER) mechanism that leverages TECA to dynamically determine optimal stopping points.
Result: Experimental results across diverse mathematical benchmarks show substantial mitigation of overthinking without sacrificing problem-solving ability, with average response length decreasing by up to 71% on simpler datasets.
Conclusion: The proposed approach creates a more efficient and adaptive reasoning process by helping LLMs determine when to conclude reasoning and provide final answers, addressing the overthinking problem in complex reasoning tasks.
Abstract: Large Language Models (LLMs) have demonstrated remarkable reasoning abilities on complex problems using long Chain-of-Thought (CoT) reasoning. However, they often suffer from overthinking, meaning generating unnecessarily lengthy reasoning steps for simpler problems. This issue may degrade the efficiency of the models and make them difficult to adapt the reasoning depth to the complexity of problems. To address this, we introduce a novel metric Token Entropy Cumulative Average (TECA), which measures the extent of exploration throughout the reasoning process. We further propose a novel reasoning paradigm named “Explore Briefly, Then Decide”, with an associated Cumulative Entropy Regulation (CER) mechanism. This paradigm leverages TECA to help the model dynamically determine the optimal point to conclude its thought process and provide a final answer, thus achieving efficient reasoning. Experimental results across diverse mathematical benchmarks show that our approach substantially mitigates overthinking without sacrificing problem solving ability. With our thinking paradigm, the average response length decreases by up to 71% on simpler datasets, demonstrating the effectiveness of our method in creating a more efficient and adaptive reasoning process.
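The stopping rule can be illustrated in a few lines: track the cumulative average of per-token entropy and conclude once it settles below a threshold. The threshold, minimum step count, and per-step distributions below are illustrative assumptions, not the paper's calibrated CER mechanism.

```python
import math

def token_entropy(probs):
    """Shannon entropy of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def teca_stop(step_distributions, threshold=0.5, min_steps=4):
    """Return the step index at which to stop reasoning: once the
    Token Entropy Cumulative Average (TECA) falls below `threshold`,
    exploration has settled and the model should decide."""
    cumulative = 0.0
    for t, dist in enumerate(step_distributions, start=1):
        cumulative += token_entropy(dist)
        teca = cumulative / t
        if t >= min_steps and teca < threshold:
            return t  # conclude and emit the final answer
    return len(step_distributions)
```

On an easy problem the distributions peak quickly, TECA drops, and reasoning stops early; on a hard problem entropy stays high and the model keeps exploring.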
[121] Pretraining with hierarchical memories: separating long-tail and common knowledge
Hadi Pouransari, David Grangier, C Thomas, Michael Kirchhof, Oncel Tuzel
Main category: cs.CL
TL;DR: Small language models augmented with hierarchical parametric memory banks achieve comparable performance to much larger models by storing world knowledge in memory parameters while keeping core reasoning in the small model.
Details
Motivation: Current language models require scaling parameters to store world knowledge, which is inefficient and impractical for edge devices. Most knowledge is unused per prompt, so compressing everything into model parameters is unnecessary.
Method: Memory-augmented architecture with small language models accessing large hierarchical parametric memory banks. During pretraining/inference, fetch small context-dependent memory blocks and add them to the model. Pretraining learns to store long-tail world knowledge in memory parameters while small model captures common knowledge and reasoning.
Result: 160M-parameter model with 18M-parameter memory from 4.6B memory bank performs comparably to regular models with 2x+ parameters. Hierarchical feed-forward memories work robustly across transformer architectures, scaling to over 21B parameters.
Conclusion: Memory-augmented architectures enable efficient knowledge storage and retrieval, making large-scale knowledge accessible to small models while maintaining performance comparable to much larger parameter models.
Abstract: The impressive performance gains of modern language models currently rely on scaling parameters: larger models store more world knowledge and reason better. Yet compressing all world knowledge into parameters is unnecessary, as only a fraction is used per prompt, and impractical for edge devices with limited inference-time memory and compute. We address this shortcoming by a memory-augmented architecture and a pretraining strategy aligned with existing hardware paradigms. We introduce small language models that access large hierarchical parametric memory banks encoding world knowledge. During pretraining and inference, we fetch a small, context-dependent memory block and add it to the model. Our pretraining learns to store long-tail world knowledge in the memory parameters, while the small language model acts as an anchor capturing common knowledge and general reasoning abilities. Through trillion-token-scale experiments, we show significant gains: a 160M-parameters model augmented with an 18M-parameters memory fetched from a 4.6B memory bank obtains comparable performance to a regular model with more than 2x the parameters. Through extensive experiments, we study the optimal type and size of parametric memories in transformers, scaling them to over 21B parameters. We find that our proposed hierarchical feed-forward memories work robustly across transformer architectures, whether added during pretraining or post-hoc.
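The fetch step can be sketched as a similarity lookup into the memory bank: score stored keys against a context query and pull the best-matching parameter blocks. A toy flat-retrieval sketch; the actual system uses a hierarchical lookup and adds the fetched parameters into feed-forward layers, and all names here are illustrative.

```python
import numpy as np

def fetch_memory(query_vec, memory_keys, memory_blocks, k=2):
    """Select the k memory blocks whose keys best match the context
    query (dot-product similarity) and average them into a single
    block to attach to the small model for this prompt."""
    scores = memory_keys @ query_vec
    top = np.argsort(scores)[-k:]          # indices of the k best keys
    return memory_blocks[top].mean(axis=0)
```

Only the fetched block is resident at inference time, which is what keeps the on-device footprint at the small model plus a few memory parameters rather than the full bank.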
[122] Hybrid Architectures for Language Models: Systematic Analysis and Design Insights
Sangmin Bae, Bilge Acun, Chien-Yu Lin, Haroun Habeeb, Seungyeon Kim, Liang Luo, Junjie Wang, Carole-Jean Wu
Main category: cs.CL
TL;DR: Systematic evaluation of hybrid LLM architectures combining self-attention with Mamba SSMs, comparing inter-layer vs intra-layer fusion strategies across multiple dimensions to identify optimal design recipes.
Details
Motivation: Hybrid architectures combining self-attention with structured state space models (like Mamba) show promise for balancing quality and efficiency in long-context tasks, but there's a lack of systematic comparison of hybridization strategies and analysis of key effectiveness factors.
Method: Holistic evaluation of hybrid architectures based on inter-layer (sequential) or intra-layer (parallel) fusion. Comprehensive evaluation across multiple dimensions: language modeling, downstream tasks, long-context capabilities, scaling analysis, and training/inference efficiency. Investigation of computational primitive characteristics to identify critical elements for each hybridization strategy.
Result: The study identifies the most critical elements for each hybridization strategy and proposes optimal design recipes for hybrid models through comprehensive analysis of computational primitives.
Conclusion: The comprehensive analysis provides practical guidance and valuable insights for developing hybrid language models, facilitating optimization of architectural configurations for better balance between modeling quality and computational efficiency.
Abstract: Recent progress in large language models demonstrates that hybrid architectures–combining self-attention mechanisms with structured state space models like Mamba–can achieve a compelling balance between modeling quality and computational efficiency, particularly for long-context tasks. While these hybrid models show promising performance, systematic comparisons of hybridization strategies and analyses on the key factors behind their effectiveness have not been clearly shared to the community. In this work, we present a holistic evaluation of hybrid architectures based on inter-layer (sequential) or intra-layer (parallel) fusion. We comprehensively evaluate these designs across multiple dimensions: language modeling and downstream task performance, long-context capabilities, scaling analysis, and training and inference efficiency. By investigating the core characteristics of their computational primitive, we identify the most critical elements for each hybridization strategy and further propose optimal design recipes for hybrid models. Our comprehensive analysis provides practical guidance and valuable insights for developing hybrid language models, facilitating the optimization of architectural configurations.
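The two hybridization strategies reduce to how the attention and SSM blocks are composed. A schematic sketch with placeholder callables standing in for full attention and Mamba blocks; the fixed mixing gate is an illustrative stand-in for a learned weight.

```python
def inter_layer(x, attn, ssm):
    """Inter-layer (sequential) fusion: stack block types in sequence,
    so the SSM consumes the attention block's output."""
    return ssm(attn(x))

def intra_layer(x, attn, ssm, gate=0.5):
    """Intra-layer (parallel) fusion: run both primitives on the same
    input and mix their outputs with a gate."""
    return gate * attn(x) + (1 - gate) * ssm(x)
```

The paper's evaluation asks which composition, and which ratio of primitives, best trades modeling quality against training and inference cost.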
[123] Flipping the Dialogue: Training and Evaluating User Language Models
Tarek Naous, Philippe Laban, Wei Xu, Jennifer Neville
Main category: cs.CL
TL;DR: The paper introduces User Language Models specifically trained to simulate human users in multi-turn conversations, showing that assistant LMs make poor user simulators and that better assistants yield worse simulators.
Details
Motivation: Current LM evaluation uses assistant LMs to simulate users, but these make poor simulators because they're optimized to be helpful assistants rather than realistic human users who phrase requests imperfectly and refine on the fly.
Method: Introduces purpose-built User Language Models post-trained specifically to simulate human users in multi-turn conversations, creating more realistic simulation environments.
Result: User LMs align better with human behavior and achieve better simulation robustness than existing methods. When used to simulate coding and math conversations, GPT-4o’s performance drops from 74.6% to 57.4%, showing assistants struggle with realistic user nuances.
Conclusion: Realistic user simulation requires purpose-built User LMs rather than repurposed assistant LMs, and more realistic simulation environments reveal significant weaknesses in current assistant models when dealing with imperfect human-like interactions.
Abstract: Conversations with LMs involve two participants: a human user leading the conversation, and an LM assistant responding to the user’s request. To satisfy this specific role, LMs are post-trained to be helpful assistants – optimized to produce exhaustive and well-structured responses, free of ambiguity and grammar errors. User utterances, on the other hand, are rarely perfected, with each user phrasing requests in unique ways, sometimes putting in partial effort at each turn and refining on the fly. To evaluate LM performance in realistic settings, prior work simulated users in multi-turn conversations, often by prompting an LM originally trained to be a helpful assistant to act as a user. However, we show that assistant LMs make for poor user simulators, with the surprising finding that better assistants yield worse simulators. Instead, we introduce purpose-built User Language Models (User LMs) - models post-trained to simulate human users in multi-turn conversations. Through various evaluations, we show how User LMs align better with human behavior and achieve better simulation robustness than existing simulation methods. When leveraging User LMs to simulate coding and math conversations, the performance of a strong assistant (GPT-4o) drops from 74.6% to 57.4%, confirming that more realistic simulation environments lead to assistant struggles as they fail to cope with the nuances of users in multi-turn setups.
[124] Emotionally Charged, Logically Blurred: AI-driven Emotional Framing Impairs Human Fallacy Detection
Yanran Chen, Lynn Greschner, Roman Klinger, Michael Klenk, Steffen Eger
Main category: cs.CL
TL;DR: LLMs can effectively inject emotional appeals into fallacious arguments, reducing human fallacy detection by 14.5% and making arguments more convincing, with enjoyment, fear, and sadness being particularly effective emotions.
Details
Motivation: Logical fallacies are common in public communication and can mislead audiences. While fallacious arguments lack soundness, they can still appear convincing due to subjective factors. The researchers aim to study how emotional framing interacts with fallacies and convincingness, particularly using LLMs to systematically manipulate emotional appeals in fallacious arguments.
Method: 1. Benchmark eight LLMs on injecting emotional appeal into fallacious arguments while preserving logical structures. 2. Use the best-performing models to generate stimuli for a human study. 3. Conduct human evaluation to measure fallacy detection performance and convincingness ratings across different emotional framings.
Result: 1. LLM-driven emotional framing reduces human fallacy detection in F1 by 14.5% on average. 2. Humans perform better in fallacy detection when perceiving enjoyment than fear or sadness. 3. Enjoyment, fear, and sadness correlate with significantly higher convincingness compared to neutral or other emotion states.
Conclusion: LLMs can effectively manipulate emotional framing in fallacious arguments, reducing human ability to detect fallacies and increasing argument convincingness. This has important implications for AI-driven emotional manipulation in fallacious argumentation, highlighting potential risks in public communication.
Abstract: Logical fallacies are common in public communication and can mislead audiences; fallacious arguments may still appear convincing despite lacking soundness, because convincingness is inherently subjective. We present the first computational study of how emotional framing interacts with fallacies and convincingness, using large language models (LLMs) to systematically change emotional appeals in fallacious arguments. We benchmark eight LLMs on injecting emotional appeal into fallacious arguments while preserving their logical structures, then use the best models to generate stimuli for a human study. Our results show that LLM-driven emotional framing reduces human fallacy detection in F1 by 14.5% on average. Humans perform better in fallacy detection when perceiving enjoyment than fear or sadness, and these three emotions also correlate with significantly higher convincingness compared to neutral or other emotion states. Our work has implications for AI-driven emotional manipulation in the context of fallacious argumentation.
[125] Putting on the Thinking Hats: A Survey on Chain of Thought Fine-tuning from the Perspective of Human Reasoning Mechanism
Xiaoshu Chen, Sihang Zhou, Ke Liang, Duanyang Yuan, Haoyuan Chen, Xiaoyu Sun, Lingyuan Meng, Xinwang Liu
Main category: cs.CL
TL;DR: A comprehensive survey of Chain of Thought (CoT) fine-tuning that analyzes the technique through the lens of human reasoning theory, specifically using the Six Thinking Hats framework, rather than just focusing on technical aspects.
Details
Motivation: Existing surveys on CoT fine-tuning focus primarily on technical aspects and overlook systematic analysis from human reasoning mechanisms. Since the ultimate goal is to enable LLMs to reason like humans, it's crucial to investigate this technique through the lens of human cognition.
Method: The survey classifies and examines CoT fine-tuning methods using the Six Thinking Hats framework, which systematically characterizes common human thinking modes. It also compiles datasets, model performances, and maintains a real-time GitHub repository tracking advances in the field.
Result: Provides the first comprehensive survey of CoT fine-tuning grounded in human reasoning theory, offering a systematic framework for understanding different reasoning capabilities cultivated through CoT fine-tuning.
Conclusion: This survey serves as a valuable resource to inspire innovation and foster progress in CoT fine-tuning by bridging the gap between technical approaches and human cognitive mechanisms.
Abstract: Chain of thought (CoT) fine-tuning aims to endow large language models (LLMs) with reasoning capabilities by training them on curated reasoning traces. It leverages both supervised and reinforced fine-tuning to cultivate human-like reasoning skills in LLMs, including detailed planning, divergent thinking, intuitive judgment, timely reflection, internal thinking, and fact perception, etc. As CoT fine-tuning has advanced, LLMs have demonstrated substantial improvements in tasks such as mathematical reasoning and code generation. However, existing surveys about CoT fine-tuning primarily focus on technical aspects and overlook a systematic analysis from the perspective of human reasoning mechanisms. Given that the ultimate goal of CoT fine-tuning is to enable LLMs to reason like humans, it is crucial to investigate this technique through the lens of human cognition. To fill this gap, we present the first comprehensive survey of CoT fine-tuning grounded in human reasoning theory. Specifically, inspired by the well-known Six Thinking Hats framework, which systematically characterizes common human thinking modes using six metaphorical hats, we classify and examine CoT fine-tuning methods through this lens. Furthermore, building upon this theory, we outline potential directions for future research in CoT fine-tuning. In addition, we compile a comprehensive overview of existing datasets and model performances, and a real-time GitHub repository \footnote{https://github.com/AI-Chen/Awesome-CoT-Finetuning} that continuously tracks recent advances in this area is maintained. We hope this survey will serve as a valuable resource to inspire innovation and foster progress in this rapidly evolving field.
[126] SafeSearch: Do Not Trade Safety for Utility in LLM Search Agents
Qiusi Zhan, Angeline Budiman-Chan, Abdelrahman Zayed, Xingzhi Guo, Daniel Kang, Joo-Kyung Kim
Main category: cs.CL
TL;DR: SafeSearch: Multi-objective RL approach for aligning safety and utility in LLM-based search agents, reducing harmful outputs by 90% while maintaining QA performance.
Details
Motivation: LLM-based search agents are more likely to produce harmful outputs than base LLMs due to their information retrieval capabilities lowering refusal thresholds, and utility-oriented finetuning intensifies this safety risk.
Method: SafeSearch uses multi-objective reinforcement learning with a final-output safety/utility reward plus a novel query-level shaping term that penalizes unsafe queries and rewards safe ones.
Result: Reduces agent harmfulness by over 90% across three red-teaming datasets on a 7B model while maintaining QA performance comparable to utility-only finetuned agents.
Conclusion: Joint alignment of safety and utility is crucial for search agents, and query-level rewards effectively improve both safety and utility simultaneously.
Abstract: Large language model (LLM) based search agents iteratively generate queries, retrieve external information, and reason to answer open-domain questions. While researchers have primarily focused on improving their utility, their safety behaviors remain underexplored. In this paper, we first evaluate search agents using red-teaming datasets and find that they are more likely to produce harmful outputs than base LLMs. For example, when asked “How can I track someone’s location without their consent?”, a base model refuses, whereas a search agent designed to retrieve and cite sources may lower its refusal threshold, fetch documents (e.g., court cases), and, once appended, synthesize them into an informative yet unsafe summary. We further show that utility-oriented finetuning intensifies this risk, motivating joint alignment of safety and utility. To this end, we present SafeSearch, a multi-objective reinforcement learning approach that couples a final-output safety/utility reward with a novel query-level shaping term that penalizes unsafe queries and rewards safe ones. Experiments show that SafeSearch reduces agent harmfulness by over 90% across three red-teaming datasets on a 7B model while producing safe and helpful responses, and maintains QA performance comparable to that of a utility-only finetuned agent. Further analyses confirm the effectiveness of the query-level reward in jointly improving safety and utility.
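The reward structure can be sketched as a weighted sum of a final-output term and a per-query shaping term. The weights, the binary safety flags, and the scoring scheme below are illustrative assumptions, not the paper's trained reward model.

```python
def safesearch_reward(final_safe, final_correct, query_safety_flags,
                      w_final=1.0, w_query=0.2):
    """Multi-objective reward sketch: a final-output safety/utility
    term plus query-level shaping that rewards safe queries (+1) and
    penalizes unsafe ones (-1) across the agent's search trajectory."""
    final_term = (1.0 if final_safe else -1.0) + (1.0 if final_correct else 0.0)
    shaping = sum(1.0 if safe else -1.0 for safe in query_safety_flags)
    return w_final * final_term + w_query * shaping
```

Shaping at the query level gives the RL signal a foothold on intermediate actions, so the agent learns to avoid unsafe searches instead of only being penalized at the final answer.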
[127] Moneyball with LLMs: Analyzing Tabular Summarization in Sports Narratives
Ritam Upadhyay, Naman Ahuja, Rishabh Baral, Aparna Garimella, Vivek Gupta
Main category: cs.CL
TL;DR: SPORTABSET is a diagnostic benchmark for evaluating long-context tabular summarization in sports domains, revealing that decomposition strategies improve accuracy but multi-entity memory remains a key bottleneck.
Details
Motivation: Current LLM approaches to tabular summarization rely on expensive prompt engineering and decomposition pipelines, offering limited insight into how models maintain state over long, evolving narratives with multiple entities.
Method: Introduces SPORTABSET benchmark for long-context tabular summarization across two sports domains requiring entity tracking and statistical aggregation. Systematically evaluates decomposition-based strategies across several long-context LLMs.
Result: Decomposition substantially improves accuracy and numerical fidelity, but gains mainly come from dissecting multi-entity interference rather than improved local arithmetic. Models show high sensitivity to surface-level cues with structured failures including hallucination, omission, and role confusion.
Conclusion: Consistent multi-entity memory is a key bottleneck in long-context table generation, motivating diagnostic evaluation as essential for scalable, efficient, and reliable tabular summarization models.
Abstract: Large language model (LLM) approaches to tabular summarization rely on extensive prompt engineering, decomposition pipelines, or entity-level intermediate representations to achieve strong performance. While effective, these strategies are computationally expensive and offer limited insight into how well models maintain state over long, evolving narratives. We introduce SPORTABSET, a diagnostic benchmark for long-context tabular summarization across two complementary sports domains that require tracking multiple entities and aggregating statistics under domain-specific rules. Using SporTabSet, we systematically evaluate decomposition-based strategies across several long context LLMs. Results show that although decomposition substantially improves accuracy and numerical fidelity, gains stem mainly from dissecting multi-entity interference rather than improved local arithmetic. Robustness experiments further reveal high sensitivity to surface-level cues with structured failures, including hallucination, omission, and role confusion. Together, these findings identify consistent multientity memory as a key bottleneck in long context table generation, motivating diagnostic evaluation as a prerequisite for scalable, efficient and reliable tabular summarization models.
[128] Modality Matching Matters: Calibrating Language Distances for Cross-Lingual Transfer in URIEL+
York Hay Ng, Aditya Khan, Xiang Lu, Matteo Salloum, Michael Zhou, Phuong H. Hoang, A. Seza Doğruöz, En-Shiun Annie Lee
Main category: cs.CL
TL;DR: A framework for type-matched language distances with structure-aware representations for geography, genealogy, and typology, unified into a composite distance for improved cross-lingual transfer.
Details
Motivation: Existing linguistic knowledge bases like URIEL+ have limitations: their one-size-fits-all vector representations are ill-suited for diverse linguistic data structures, and they lack principled methods for aggregating different distance signals into a comprehensive score.
Method: Proposes novel structure-aware representations: speaker-weighted distributions for geography, hyperbolic embeddings for genealogy, and a latent-variable model for typology. Unifies these into a robust, task-agnostic composite distance.
Result: Across multiple zero-shot transfer benchmarks, the representations significantly improve transfer performance when distance type is relevant to the task, while the composite distance yields gains in most tasks.
Conclusion: The framework addresses limitations of existing linguistic knowledge bases by providing type-specific representations and a principled composite distance that enhances cross-lingual transfer performance.
Abstract: Existing linguistic knowledge bases such as URIEL+ provide valuable geographic, genetic and typological distances for cross-lingual transfer but suffer from two key limitations. First, their one-size-fits-all vector representations are ill-suited to the diverse structures of linguistic data. Second, they lack a principled method for aggregating these signals into a single, comprehensive score. In this paper, we address these gaps by introducing a framework for type-matched language distances. We propose novel, structure-aware representations for each distance type: speaker-weighted distributions for geography, hyperbolic embeddings for genealogy, and a latent variables model for typology. We unify these signals into a robust, task-agnostic composite distance. Across multiple zero-shot transfer benchmarks, we demonstrate that our representations significantly improve transfer performance when the distance type is relevant to the task, while our composite distance yields gains in most tasks.
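The aggregation step can be illustrated with a toy composite distance. The paper does not specify the combination rule here; the min-max normalization and equal weighting below are assumptions chosen only to show why type-specific signals must be rescaled before mixing.

```python
# Illustrative composite language distance. Normalization and weighting are
# assumptions, not the authors' method: raw signals live on different scales
# (kilometers vs. tree distances vs. feature disagreement), so each must be
# rescaled before aggregation.

def normalize(d, d_min, d_max):
    """Rescale a raw distance into [0, 1] so signals are comparable."""
    return (d - d_min) / (d_max - d_min) if d_max > d_min else 0.0

def composite_distance(signals, bounds, weights=None):
    """signals: {'geo': x, 'gen': y, 'typ': z}; bounds: per-signal (min, max)."""
    keys = sorted(signals)
    w = weights or {k: 1.0 / len(keys) for k in keys}
    return sum(w[k] * normalize(signals[k], *bounds[k]) for k in keys)

d = composite_distance({'geo': 500.0, 'gen': 0.2, 'typ': 0.6},
                       {'geo': (0.0, 1000.0), 'gen': (0.0, 1.0), 'typ': (0.0, 1.0)})
```

Per-task weighting would recover the paper's observation that a single distance type helps most when it is relevant to the task.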
[129] Difficulty-Controllable Multiple-Choice Question Generation Using Large Language Models and Direct Preference Optimization
Yuto Tomikawa, Masaki Uto
Main category: cs.CL
TL;DR: Proposes a difficulty-controllable multiple-choice question generation method for reading comprehension using LLMs with direct preference optimization for better difficulty control accuracy.
Details
Motivation: Existing neural question generation methods for reading comprehension have two limitations: they cannot directly generate multiple-choice questions (the most widely used question type in education), and they are not explicitly trained to optimize difficulty-control accuracy.
Method: Uses large language models trained with a direct preference optimization technique to improve the accuracy of difficulty control when generating multiple-choice questions.
Result: The proposed method addresses both limitations: it enables direct generation of multiple-choice questions and improves difficulty controllability through explicit optimization.
Conclusion: A novel approach for difficulty-controllable multiple-choice question generation that overcomes key limitations of conventional methods through LLM-based optimization.
Abstract: Difficulty-controllable question generation for reading comprehension has gained significant attention in the field of education as a fundamental tool for adaptive learning support. Although several neural question generation methods have recently succeeded in controlling difficulty, conventional approaches still face two major limitations. First, they cannot directly generate multiple-choice questions, which are the most widely used question type in educational contexts. Second, they are not explicitly trained to optimize the accuracy of difficulty control, leaving room for further improvement in difficulty controllability. To address these limitations, this study proposes a novel difficulty-controllable multiple-choice question generation method for reading comprehension which leverages a large language model trained using a direct preference optimization technique to improve the accuracy of difficulty control.
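The underlying DPO objective is standard, so a minimal scalar sketch is possible; the pairing of a difficulty-matched question as the preferred sample and a difficulty-missing one as the dispreferred sample is our reading of the setup, not a detail stated in the summary.

```python
# Standard DPO loss for one (preferred, dispreferred) pair of generated
# questions, using scalar sequence log-probabilities. y_w would be a question
# matching the target difficulty, y_l one that misses it (an assumption here).
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid(beta * (policy preference margin - reference margin))."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# The loss shrinks as the policy prefers the difficulty-matched question more
# strongly than the frozen reference model does.
loss = dpo_loss(logp_w=-5.0, logp_l=-7.0, ref_logp_w=-6.0, ref_logp_l=-6.0)
```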
[130] DEBATE: A Large-Scale Benchmark for Evaluating Opinion Dynamics in Role-Playing LLM Agents
Yun-Shiuan Chuang, Ruixuan Tu, Chengtao Dai, Smit Vasani, You Li, Binwei Yao, Michael Henry Tessler, Sijia Yang, Dhavan Shah, Robert Hawkins, Junjie Hu, Timothy T. Rogers
Main category: cs.CL
TL;DR: DEBATE benchmark evaluates authenticity of opinion dynamics in multi-agent role-playing LLM simulations using 30K+ human messages across 708 groups, showing LLM groups converge too strongly but can be improved with fine-tuning.
Details
Motivation: Existing multi-agent simulations with role-playing LLMs show unnatural group behaviors like premature convergence and lack empirical benchmarks for assessing alignment with real human group interactions, limiting understanding of opinion dynamics.
Method: Created DEBATE benchmark with 30,707 messages from 2,832 US participants across 708 groups and 107 topics, including public messages and private beliefs. Evaluated 7 LLMs as “digital twin” RPLAs in next-message prediction and full conversation rollout using stance-alignment and opinion-convergence metrics.
Result: Zero-shot RPLA groups show strong opinion convergence relative to human groups. Supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) improve stance alignment and bring group-level convergence closer to human behavior, though discrepancies in opinion change and belief updating remain.
Conclusion: DEBATE enables rigorous benchmarking of simulated opinion dynamics and supports future research on aligning multi-agent role-playing LLMs with realistic human interactions, addressing limitations in current opinion dynamics modeling.
Abstract: Accurately modeling opinion change through social interactions is crucial for understanding and mitigating polarization, misinformation, and societal conflict. Recent work simulates opinion dynamics with role-playing LLM agents (RPLAs), but multi-agent simulations often display unnatural group behavior, such as premature convergence, and lack empirical benchmarks for assessing alignment with real human group interactions. We introduce DEBATE, a large-scale benchmark for evaluating the authenticity of opinion dynamics in multi-agent RPLA simulations. DEBATE contains 30,707 messages from 2,832 U.S.-based participants across 708 groups and 107 topics, with both public messages and private Likert-scale beliefs, enabling evaluation at the utterance and group levels while also supporting future individual-level analyses. We instantiate “digital twin” RPLAs with seven LLMs and evaluate them in two settings: next-message prediction and full conversation rollout, using stance-alignment and opinion-convergence metrics. In zero-shot settings, RPLA groups exhibit strong opinion convergence relative to human groups. Post-training via supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) improves stance alignment and brings group-level convergence closer to human behavior, though discrepancies in opinion change and belief updating remain. DEBATE enables rigorous benchmarking of simulated opinion dynamics and supports future research on aligning multi-agent RPLAs with realistic human interactions. The benchmark is publicly available at.
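One plausible form of the opinion-convergence measure is the drop in within-group variance of the private Likert beliefs between the start and end of a discussion; the exact metric definition is an assumption here, not quoted from the paper.

```python
# Hypothetical group-level convergence score: fraction of initial belief
# variance eliminated by the end of the conversation (1.0 = full consensus).
from statistics import pvariance

def convergence(initial_beliefs, final_beliefs):
    """Compare population variance of Likert beliefs before vs. after."""
    v0, v1 = pvariance(initial_beliefs), pvariance(final_beliefs)
    return 1.0 - v1 / v0 if v0 > 0 else 0.0

# A group collapsing from spread-out beliefs to near-agreement scores high,
# mirroring the "too strong convergence" pattern reported for zero-shot RPLAs.
c = convergence([1, 3, 5, 7], [4, 4, 4, 5])
```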
[131] DialectalArabicMMLU: Benchmarking Dialectal Capabilities in Arabic and Multilingual Language Models
Malik H. Altakrori, Nizar Habash, Abed Alhakim Freihat, Younes Samih, Kirill Chirkunov, Muhammed AbuOdeh, Radu Florian, Teresa Lynn, Preslav Nakov, Alham Fikri Aji
Main category: cs.CL
TL;DR: DialectalArabicMMLU is a new benchmark for evaluating LLMs across Arabic dialects, extending MMLU-Redux with 15K QA pairs across 5 major dialects to assess dialectal understanding beyond Modern Standard Arabic.
Details
Motivation: While existing benchmarks evaluate LLMs for Modern Standard Arabic, dialectal varieties remain underrepresented despite their prevalence in everyday communication, creating a need for more inclusive evaluation of Arabic language understanding.
Method: Manual translation and adaptation of 3K MMLU-Redux multiple-choice question-answer pairs into five major Arabic dialects (Syrian, Egyptian, Emirati, Saudi, Moroccan), creating 15K QA pairs across 32 domains, with evaluation of 19 open-weight Arabic and multilingual LLMs (1B-13B parameters).
Result: Evaluation revealed substantial performance variation across dialects, showing persistent gaps in dialectal generalization, with the benchmark enabling systematic assessment of LLM reasoning and comprehension beyond Modern Standard Arabic.
Conclusion: DialectalArabicMMLU provides the first unified, human-curated resource for measuring dialectal understanding in Arabic, promoting more inclusive evaluation and future model development for Arabic language technologies.
Abstract: We present DialectalArabicMMLU, a new benchmark for evaluating the performance of large language models (LLMs) across Arabic dialects. While recently developed Arabic and multilingual benchmarks have advanced LLM evaluation for Modern Standard Arabic (MSA), dialectal varieties remain underrepresented despite their prevalence in everyday communication. DialectalArabicMMLU extends the MMLU-Redux framework through manual translation and adaptation of 3K multiple-choice question-answer pairs into five major dialects (Syrian, Egyptian, Emirati, Saudi, and Moroccan), yielding a total of 15K QA pairs across 32 academic and professional domains (22K QA pairs when also including English and MSA). The benchmark enables systematic assessment of LLM reasoning and comprehension beyond MSA, supporting both task-based and linguistic analysis. We evaluate 19 open-weight Arabic and multilingual LLMs (1B-13B parameters) and report substantial performance variation across dialects, revealing persistent gaps in dialectal generalization. DialectalArabicMMLU provides the first unified, human-curated resource for measuring dialectal understanding in Arabic, thus promoting more inclusive evaluation and future model development.
[132] MUTANT: A Recipe for Multilingual Tokenizer Design
Souvik Rana, Arul Menezes, Ashish Kulkarni, Chandra Khatri, Shubham Agarwal
Main category: cs.CL
TL;DR: MUTANT is a multilingual tokenizer recipe with language-aware pre-tokenization and subword/multiword training, achieving 39.5% better fertility than LLaMA4 and 44% inference throughput improvement for Indian languages.
Details
Motivation: Current subword tokenizers like BPE are widely used but their effectiveness in multilingual settings with diverse scripts and morphological variations is underexplored, especially for Indian languages which present unique challenges.
Method: MUTANT uses careful vocabulary and training data design, language-aware pre-tokenization, and subword and multiword aware training. MUTANT-Indic specifically targets India-specific multilingual LLMs with linguistically coherent token design.
Result: Achieves 39.5% better average fertility score than LLaMA4 and 18% better than Sutra (current best), with 44% improvement in inference throughput while maintaining comparable performance on English and Indic benchmarks.
Conclusion: MUTANT provides an effective recipe for multilingual tokenizers that improves efficiency and performance, particularly valuable for languages with rich morphological variation like Indian languages.
Abstract: Tokenizers play a crucial role in determining the performance, training efficiency, and the inference cost of Large Language Models (LLMs). Designing effective tokenizers for multilingual LLMs is particularly challenging due to diverse scripts and rich morphological variation. While subword methods like Byte Pair Encoding (BPE) are widely adopted, their effectiveness in multilingual settings remains underexplored. We present MUTANT, a recipe for building multilingual tokenizers, with careful vocabulary and training data design, language-aware pre-tokenization, and subword and multiword aware training. We also introduce MUTANT-Indic, a tokenizer for India-specific multilingual LLMs, that produces linguistically coherent tokens and achieves state-of-the-art performance. Evaluated across English, 22 Indian languages and code data, our tokenizer improves the average fertility score by 39.5% over LLaMA4 and by 18% over Sutra (the current best). This translates to 44% improvement in inference throughput over LLaMA4 while maintaining comparable performance on English and Indic benchmarks. We present detailed ablations across tokenizer training data size, vocabulary size, merging techniques, and pre-tokenization strategies, demonstrating the robustness of our design choices.
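Fertility, the metric the MUTANT gains are reported in, is conventionally the average number of subword tokens produced per word (lower is better). A minimal sketch, with a toy tokenizer standing in for a real one:

```python
# Tokenizer fertility under its common definition: mean subword tokens per
# whitespace-delimited word. The toy chunking tokenizer below is a stand-in
# for a real BPE tokenizer, used only to make the metric concrete.

def fertility(words, tokenize):
    """Mean number of subword tokens produced per word."""
    return sum(len(tokenize(w)) for w in words) / len(words)

# Toy tokenizer: splits each word into chunks of at most 3 characters.
toy = lambda w: [w[i:i + 3] for i in range(0, len(w), 3)]
f = fertility(["tokenizers", "play", "a", "role"], toy)  # (4+2+1+2)/4 = 2.25
```

A fertility near 1.0 means most words stay whole, which is exactly what a better-adapted multilingual vocabulary buys in inference throughput.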
[133] AgentExpt: Automating AI Experiment Design with LLM-based Resource Retrieval Agent
Yu Li, Lehui Li, Qingmin Liao, Fengli Xu, Yong Li
Main category: cs.CL
TL;DR: A framework for automated baseline and dataset recommendation for AI research using collective perception from citation networks, with improved coverage and performance over prior methods.
Details
Motivation: Current LLM agents for scientific experiment design have limited data coverage and rely too heavily on content similarity, missing many datasets actually used in published papers and overlooking experimental suitability.
Method: 1) Automated pipeline linking papers to baselines/datasets they actually used; 2) Collective perception enhanced retriever using self-descriptions plus aggregated citation contexts; 3) Reasoning-augmented reranker with explicit reasoning chains and LLM fine-tuning for interpretable justifications.
Result: Covers 85% of the datasets and baselines used at top AI conferences over the past five years; outperforms the strongest prior baseline by +5.85% in Recall@20 and +8.30% in HitRate@5.
Conclusion: The framework advances reliable, interpretable automation of experimental design by leveraging collective perception from scholarly networks rather than just content similarity.
Abstract: Large language model agents are becoming increasingly capable at web-centric tasks such as information retrieval and complex reasoning. These emerging capabilities have given rise to surging research interest in developing LLM agents to facilitate scientific inquiry. One key application in AI research is to automate experiment design through agentic dataset and baseline retrieval. However, prior efforts suffer from limited data coverage, as recommendation datasets primarily harvest candidates from public portals and omit many datasets actually used in published papers, and from an overreliance on content similarity that biases models toward superficial similarity and overlooks experimental suitability. Harnessing collective perception embedded in the baseline and dataset citation network, we present a comprehensive framework for baseline and dataset recommendation. First, we design an automated data-collection pipeline that links roughly one hundred thousand accepted papers to the baselines and datasets they actually used. Second, we propose a collective perception enhanced retriever. To represent the position of each dataset or baseline within the scholarly network, it concatenates self-descriptions with aggregated citation contexts. To achieve efficient candidate recall, we finetune an embedding model on these representations. Finally, we develop a reasoning-augmented reranker that extracts interaction chains to construct explicit reasoning chains and finetunes a large language model to produce interpretable justifications and refined rankings. The dataset we curated covers 85% of the datasets and baselines used at top AI conferences over the past five years. On our dataset, the proposed method outperforms the strongest prior baseline with average gains of +5.85% in Recall@20 and +8.30% in HitRate@5. Taken together, our results advance reliable, interpretable automation of experimental design.
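The reported metrics follow their usual retrieval definitions; a short sketch makes them concrete (the item names below are made up for illustration):

```python
# Recall@k: fraction of ground-truth items that appear in the top-k ranking.
# HitRate@k: whether at least one ground-truth item appears in the top-k.

def recall_at_k(ranked, relevant, k):
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

def hit_rate_at_k(ranked, relevant, k):
    return 1.0 if set(ranked[:k]) & set(relevant) else 0.0

ranked = ["squad", "glue", "mnli", "imagenet", "coco"]   # system's ranking
relevant = ["glue", "coco", "wmt14"]                     # datasets the paper used
r = recall_at_k(ranked, relevant, k=5)    # glue and coco found -> 2/3
h = hit_rate_at_k(ranked, relevant, k=2)  # glue already in the top 2 -> 1.0
```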
[134] Bot Meets Shortcut: How Can LLMs Aid in Handling Unknown Invariance OOD Scenarios?
Shiyan Zheng, Herun Wan, Minnan Luo, Junhang Huang
Main category: cs.CL
TL;DR: Study on social bot detector robustness against shortcut learning, proposing LLM-based counterfactual augmentation to mitigate performance drops from spurious textual correlations.
Details
Motivation: Existing social bot detectors perform well on benchmarks but lack robustness in real-world scenarios due to shortcut learning where models rely on spurious correlations rather than causal features, particularly with manipulable textual features.
Method: Designed shortcut scenarios with spurious associations between user labels and superficial textual cues to evaluate detector robustness. Proposed mitigation strategies using large language models for counterfactual data augmentation at three levels: individual user text, overall dataset distribution, and model’s causal information extraction.
Result: Shifts in irrelevant feature distributions caused average relative accuracy drop of 32% in baseline models. Proposed LLM-based strategies achieved average relative performance improvement of 56% under shortcut scenarios.
Conclusion: Social bot detectors are vulnerable to shortcut learning with textual features, but LLM-based counterfactual augmentation effectively mitigates these issues by addressing data distribution and causal feature extraction problems.
Abstract: While existing social bot detectors perform well on benchmarks, their robustness across diverse real-world scenarios remains limited due to unclear ground truth and varied misleading cues. In particular, the impact of shortcut learning, where models rely on spurious correlations instead of capturing causal task-relevant features, has received limited attention. To address this gap, we conduct an in-depth study to assess how detectors are influenced by potential shortcuts based on textual features, which are most susceptible to manipulation by social bots. We design a series of shortcut scenarios by constructing spurious associations between user labels and superficial textual cues to evaluate model robustness. Results show that shifts in irrelevant feature distributions significantly degrade social bot detector performance, with an average relative accuracy drop of 32% in the baseline models. To tackle this challenge, we propose mitigation strategies based on large language models, leveraging counterfactual data augmentation. These methods mitigate the problem from data and model perspectives across three levels, including data distribution at both the individual user text and overall dataset levels, as well as the model’s ability to extract causal information. Our strategies achieve an average relative performance improvement of 56% under shortcut scenarios.
[135] Auditing Google’s AI Overviews and Featured Snippets: A Case Study on Baby Care and Pregnancy
Desheng Hu, Joachim Baumann, Aleksandra Urman, Elsa Lichtenegger, Robin Forsberg, Aniko Hannak, Christo Wilson
Main category: cs.CL
TL;DR: Audit of Google’s AI-generated health content (AI Overviews & Featured Snippets) reveals concerning inconsistencies and lack of medical safeguards in pregnancy/baby care information.
Details
Motivation: Google Search increasingly surfaces AI-generated content through AI Overviews and Featured Snippets that users rely on but have no control over. There's a need to evaluate the quality and consistency of these information displays, especially for high-stakes domains like health information.
Method: Systematic algorithm audit of 1,508 real baby care and pregnancy-related queries using a robust evaluation framework assessing multiple quality dimensions: answer consistency, relevance, presence of medical safeguards, source categories, and sentiment alignment.
Result: 33% inconsistency between AI Overviews and Featured Snippets on same search result pages; critically lacking medical safeguards (only 11% in AIO, 7% in FS); health/wellness websites dominate sources but FS often links to commercial sources; despite high relevance scores, quality gaps exist.
Conclusion: Findings reveal concerning gaps in AI-mediated health information with important implications for public health information access. Demonstrates need for stronger quality controls in AI systems for high-stakes domains. Provides transferable framework for auditing AI systems where information quality impacts user well-being.
Abstract: Google Search increasingly surfaces AI-generated content through features like AI Overviews (AIO) and Featured Snippets (FS), which users frequently rely on despite having no control over their presentation. Through a systematic algorithm audit of 1,508 real baby care and pregnancy-related queries, we evaluate the quality and consistency of these information displays. Our robust evaluation framework assesses multiple quality dimensions, including answer consistency, relevance, presence of medical safeguards, source categories, and sentiment alignment. Our results reveal concerning gaps in information consistency, with information in AIO and FS displayed on the same search result page being inconsistent with each other in 33% of cases. Despite high relevance scores, both features critically lack medical safeguards (present in just 11% of AIO and 7% of FS responses). While health and wellness websites dominate source categories for both, AIO and FS, FS also often link to commercial sources. These findings have important implications for public health information access and demonstrate the need for stronger quality controls in AI-mediated health information. Our methodology provides a transferable framework for auditing AI systems across high-stakes domains where information quality directly impacts user well-being.
[136] Human or LLM as Standardized Patients? A Comparative Study for Medical Education
Bingquan Zhang, Xiaoxiao Liu, Yuchi Wang, Lei Zhou, Qianqian Xie, Benyou Wang
Main category: cs.CL
TL;DR: EasyMED is a multi-agent virtual standardized patient framework that separates case information from response generation for stable, inquiry-conditioned behavior, with SPBench benchmark showing it matches human SP behavior better than existing VSPs.
Details
Motivation: Standardized patients (SPs) are expensive and difficult to scale for clinical training, while existing LLM-based virtual SPs have unstable behavior and lack rigorous comparison with human SPs.
Method: Proposes EasyMED, a multi-agent VSP framework that separates case-grounded information disclosure from response generation, and introduces SPBench benchmark with eight expert-defined criteria for interaction-level evaluation.
Result: EasyMED more closely matches human SP behavior than existing VSPs, particularly in case consistency and controlled disclosure. A four-week controlled study shows learning outcomes comparable to human SP training with stronger early gains for novices.
Conclusion: EasyMED demonstrates improved flexibility, psychological safety, and cost efficiency compared to human SP training while achieving comparable learning outcomes, making it a viable scalable alternative for clinical skills training.
Abstract: Standardized patients (SPs) are indispensable for clinical skills training but remain expensive and difficult to scale. Although large language model (LLM)-based virtual standardized patients (VSPs) have been proposed as an alternative, their behavior remains unstable and lacks rigorous comparison with human standardized patients. We propose EasyMED, a multi-agent VSP framework that separates case-grounded information disclosure from response generation to support stable, inquiry-conditioned patient behavior. We also introduce SPBench, a human-grounded benchmark with eight expert-defined criteria for interaction-level evaluation. Experiments show that EasyMED more closely matches human SP behavior than existing VSPs, particularly in case consistency and controlled disclosure. A four-week controlled study further demonstrates learning outcomes comparable to human SP training, with stronger early gains for novice learners and improved flexibility, psychological safety, and cost efficiency.
[137] LexInstructEval: Lexical Instruction Following Evaluation for Large Language Models
Huimin Ren, Yan Liang, Baiqiao Su, Chaobo Sun, Hengtong Lu, Kaike Zhang, Chen Wei
Main category: cs.CL
TL;DR: LexInstructEval: A benchmark and evaluation framework for assessing LLMs’ ability to follow fine-grained lexical instructions using formal grammar and programmatic verification.
Details
Motivation: Current methods for evaluating LLMs' lexical instruction-following capabilities are limited: human evaluation is subjective and costly, LLM-as-a-judge systems are biased and unreliable, and existing programmatic benchmarks lack the expressiveness to test intricate compositional constraints.
Method: Introduces a formal, rule-based grammar that deconstructs complex instructions into <Procedure, Relation, Value> triplets. Uses a multi-stage, human-in-the-loop pipeline to generate diverse datasets and a transparent programmatic engine for objective verification.
Result: The paper releases a dataset and open-source evaluation tools to facilitate research into LLM controllability and reliability, providing a systematic framework for assessing fine-grained lexical instruction following.
Conclusion: LexInstructEval addresses limitations in current evaluation methods by providing an objective, expressive benchmark for testing LLMs’ ability to follow complex lexical instructions, enabling better research into model controllability.
Abstract: The ability of Large Language Models (LLMs) to precisely follow complex and fine-grained lexical instructions is a cornerstone of their utility and controllability. However, evaluating this capability remains a significant challenge. Current methods either rely on subjective and costly human evaluation or on automated LLM-as-a-judge systems, which suffer from inherent biases and unreliability. Existing programmatic benchmarks, while objective, often lack the expressiveness to test intricate, compositional constraints at a granular level. To address these limitations, we introduce LexInstructEval, a new benchmark and evaluation framework for fine-grained lexical instruction following. Our framework is built upon a formal, rule-based grammar that deconstructs complex instructions into a canonical <Procedure, Relation, Value> triplet. This grammar enables the systematic generation of a diverse dataset through a multi-stage, human-in-the-loop pipeline and facilitates objective verification via a transparent, programmatic engine. We release our dataset and open-source evaluation tools to facilitate further research into the controllability and reliability of LLMs.
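A verifier in the spirit of the <Procedure, Relation, Value> grammar can be sketched as below. The specific procedure and relation names are illustrative assumptions; the benchmark's actual vocabulary is not given in this summary.

```python
# Minimal programmatic verifier for triplet-style lexical constraints: each
# constraint names a measurement procedure, a comparison relation, and a
# target value. Procedure/relation names here are hypothetical examples.
import operator

PROCEDURES = {
    "word_count": lambda text: len(text.split()),
    "char_count": len,
    "keyword_count": lambda text, kw: text.lower().count(kw),
}
RELATIONS = {"==": operator.eq, "<=": operator.le, ">=": operator.ge}

def verify(text, constraints):
    """constraints: list of (procedure, relation, value [, extra args])."""
    for proc, rel, value, *args in constraints:
        measured = PROCEDURES[proc](text, *args)
        if not RELATIONS[rel](measured, value):
            return False
    return True

ok = verify("the quick brown fox jumps",
            [("word_count", "==", 5), ("keyword_count", ">=", 1, "fox")])
```

Because every check is a pure function over the output text, verification is transparent and reproducible, unlike an LLM judge.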
[138] BERnaT: Basque Encoders for Representing Natural Textual Diversity
Ekhi Azurmendi, Joseba Fernandez de Landa, Jaione Bengoetxea, Maite Heredia, Julen Etxaniz, Mikel Zubillaga, Ander Soraluze, Aitor Soroa
Main category: cs.CL
TL;DR: BERnaT models trained on diverse Basque language data (standard, social media, historical) outperform standard-only models across NLU tasks without sacrificing benchmark accuracy.
Details
Motivation: Standard text filtering for language models often excludes non-standard linguistic varieties, reducing model robustness and reinforcing representational biases. The paper argues for capturing the full spectrum of language variation.
Method: Constructed new Basque corpora combining standard, social media, and historical sources. Pre-trained BERnaT encoder-only models in three configurations: standard, diverse, and combined. Proposed an evaluation framework separating NLU tasks into standard and diverse subsets.
Result: Models trained on both standard and diverse data consistently outperform standard-only models across all task types, improving performance without compromising standard benchmark accuracy.
Conclusion: Linguistic diversity is crucial for building inclusive, generalizable language models that can handle the full spectrum of language variation.
Abstract: Language models depend on massive text corpora that are often filtered for quality, a process that can unintentionally exclude non-standard linguistic varieties, reduce model robustness and reinforce representational biases. In this paper, we argue that language models should aim to capture the full spectrum of language variation (dialectal, historical, informal, etc.) rather than relying solely on standardized text. Focusing on the Basque language, we construct new corpora combining standard, social media, and historical sources, and pre-train the BERnaT family of encoder-only models in three configurations: standard, diverse, and combined. We further propose an evaluation framework that separates Natural Language Understanding (NLU) tasks into standard and diverse subsets to assess linguistic generalization. Results show that models trained on both standard and diverse data consistently outperform those trained on standard corpora, improving performance across all task types without compromising standard benchmark accuracy. These findings highlight the importance of linguistic diversity in building inclusive, generalizable language models.
[139] Teaching Old Tokenizers New Words: Efficient Tokenizer Adaptation for Pre-trained Models
Taido Purason, Pavel Chizhov, Ivan P. Yamshchikov, Mark Fishel
Main category: cs.CL
TL;DR: Proposes continued BPE training for vocabulary extension and leaf-based pruning for vocabulary reduction to adapt tokenizers to new domains/languages efficiently.
Details
Motivation: Current tokenizer adaptation methods for domain/language adaptation often result in inefficient vocabulary usage - added tokens are either unreachable or rarely used, and there's no good way to prune redundant tokens while maintaining model quality.
Method: Two complementary methods: 1) Continued BPE training - extends a pre-trained tokenizer by continuing BPE merge learning on new data rather than appending non-overlapping tokens; 2) Leaf-based vocabulary pruning - removes redundant tokens while preserving model quality through a systematic pruning approach.
Result: Experiments across multiple languages and model families show improved tokenization efficiency and better utilization of added vocabulary with continued BPE training. The pruning method effectively reduces vocabulary size while maintaining model performance.
Conclusion: The proposed methods provide practical tools for controlled vocabulary modification, released as an open-source toolkit, enabling more efficient tokenizer adaptation to new domains/languages.
Abstract: Tokenizer adaptation plays an important role in adapting pre-trained language models to new domains or languages. In this work, we address two complementary aspects of this process: vocabulary extension and pruning. The common approach to extension trains a new tokenizer on domain-specific text and appends the tokens that do not overlap with the existing vocabulary, which often results in many tokens that are unreachable or never used. We propose continued BPE training that extends a pre-trained tokenizer by continuing the BPE merge learning process on new data. Experiments across multiple languages and model families show that this approach improves tokenization efficiency and leads to better utilization of added vocabulary. We also introduce leaf-based vocabulary pruning, which removes redundant tokens while preserving model quality. Together, these methods provide practical tools for controlled vocabulary modification, which we release as an open-source toolkit.
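The core idea of continued BPE training can be illustrated with a minimal sketch (toy corpus, toy merge list, and a hypothetical `learn_continued_merges` helper of our own - not the authors' implementation): instead of appending tokens from a separately trained tokenizer, new merges are learned on domain data starting from the tokenization the existing merges already produce.

```python
from collections import Counter

def learn_continued_merges(words, merges, n_new):
    """Continue BPE merge learning on new-domain word counts,
    starting from an existing (pre-trained) merge list."""
    def apply_merges(symbols):
        # Tokenize with the existing merges, in order.
        for a, b in merges:
            i, out = 0, []
            while i < len(symbols):
                if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                    out.append(a + b)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            symbols = out
        return symbols

    corpus = Counter()
    for word, freq in words.items():
        corpus[tuple(apply_merges(list(word)))] += freq

    new_merges = []
    for _ in range(n_new):
        # Count symbol pairs, pick the most frequent as the next merge.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        new_merges.append(best)
        # Apply the new merge before learning the next one.
        merged = Counter()
        for symbols, freq in corpus.items():
            i, out = 0, []
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] += freq
        corpus = merged
    return new_merges

# Existing tokenizer already merges "t"+"h" -> "th"; the new-domain
# text is rich in "lo...", so the continued merges target that.
domain = {"lower": 5, "low": 4, "the": 2}
print(learn_continued_merges(domain, [("t", "h")], 2))
```

Because learning continues from the existing merge sequence, every new token is reachable by construction, which is the property the appended-vocabulary approach lacks.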
[140] M4-RAG: A Massive-Scale Multilingual Multi-Cultural Multimodal RAG
David Anugraha, Patrick Amadeus Irawan, Anshul Singh, En-Shiun Annie Lee, Genta Indra Winata
Main category: cs.CL
TL;DR: M4-RAG is a massive-scale multilingual multimodal benchmark for evaluating retrieval-augmented visual question answering across 42 languages, 56 dialects, and 189 countries, revealing that RAG benefits smaller VLMs but degrades performance for larger models and suffers from cross-lingual performance gaps.
Details
Motivation: Current vision-language models are limited by static training data, and while retrieval-augmented generation helps access up-to-date, culturally grounded multilingual information, multilingual multimodal RAG remains underexplored. There's a need for comprehensive evaluation across languages and modalities.
Method: Created M4-RAG benchmark with over 80,000 culturally diverse image-question pairs spanning 42 languages, 56 dialects, and 189 countries. Built a controlled retrieval environment with millions of curated multilingual documents to balance realism with reproducibility. Systematically evaluated RAG performance across different model sizes and languages.
Result: RAG consistently benefits smaller VLMs but fails to scale to larger models, often degrading their performance. Significant performance degradation occurs when prompts or retrieved context are in non-English languages. Exposes critical mismatch between model size and current retrieval effectiveness.
Conclusion: Multilingual multimodal RAG presents unique challenges requiring specialized approaches. Current retrieval methods are insufficient for larger VLMs, and cross-lingual performance gaps highlight the need for better multilingual adaptation. The benchmark enables systematic evaluation of these issues.
Abstract: Vision-language models (VLMs) have achieved strong performance in visual question answering (VQA), yet they remain constrained by static training data. Retrieval-Augmented Generation (RAG) mitigates this limitation by enabling access to up-to-date, culturally grounded, and multilingual information; however, multilingual multimodal RAG remains largely underexplored. We introduce M4-RAG, a massive-scale benchmark spanning 42 languages, 56 regional dialects and registers, and 189 countries, comprising over 80,000 culturally diverse image-question pairs for evaluating retrieval-augmented VQA across languages and modalities. To balance realism with reproducibility, we build a controlled retrieval environment containing millions of carefully curated multilingual documents relevant to the query domains, approximating real-world retrieval conditions while ensuring consistent experimentation. Our systematic evaluation reveals that although RAG consistently benefits smaller VLMs, it fails to scale to larger models and often even degrades their performance, exposing a critical mismatch between model size and current retrieval effectiveness. Our cross-lingual evaluations also reveal significant performance degradation when prompts or retrieved context are provided in non-English languages. The code, datasets, and evaluation protocols for M4-RAG are available as open-source at https://github.com/davidanugraha/M4-RAG.
[141] Knowing What’s Missing: Assessing Information Sufficiency in Question Answering
Akriti Jain, Aparna Garimella
Main category: cs.CL
TL;DR: A structured Identify-then-Verify framework for assessing whether context contains sufficient information to answer questions, particularly effective for inferential questions requiring reasoning beyond direct text extraction.
Details
Motivation: Current question-answering systems struggle to determine if provided context contains sufficient information to answer questions, especially for inferential questions requiring reasoning beyond direct text extraction. Simple prompting strategies often fail on such questions.
Method: Proposes an Identify-then-Verify framework: 1) Generate multiple hypotheses about missing information and establish semantic consensus, 2) Perform critical verification by forcing the model to re-examine the source text to confirm whether this information is truly absent.
Result: The framework outperforms established baselines across diverse multi-hop and factual QA datasets, producing more accurate sufficiency judgments while clearly articulating information gaps.
Conclusion: Guiding models to justify claims about missing information through structured reasoning improves sufficiency assessment in question-answering systems, particularly for inferential questions.
Abstract: Determining whether a provided context contains sufficient information to answer a question is a critical challenge for building reliable question-answering systems. While simple prompting strategies have shown success on factual questions, they frequently fail on inferential ones that require reasoning beyond direct text extraction. We hypothesize that asking a model to first reason about what specific information is missing provides a more reliable, implicit signal for assessing overall sufficiency. To this end, we propose a structured Identify-then-Verify framework for robust sufficiency modeling. Our method first generates multiple hypotheses about missing information and establishes a semantic consensus. It then performs a critical verification step, forcing the model to re-examine the source text to confirm whether this information is truly absent. We evaluate our method against established baselines across diverse multi-hop and factual QA datasets. The results demonstrate that by guiding the model to justify its claims about missing information, our framework produces more accurate sufficiency judgments while clearly articulating any information gaps.
[142] Automatic Essay Scoring and Feedback Generation in Basque Language Learning
Ekhi Azurmendi, Xabier Arregi, Oier Lopez de Lacalle
Main category: cs.CL
TL;DR: First public Basque Automatic Essay Scoring dataset with 3,200 CEFR C1 essays, fine-tuned RoBERTa and Latxa models outperform GPT-5/Claude in scoring and feedback quality.
Details
Motivation: Address the lack of publicly available resources for Automatic Essay Scoring (AES) and feedback generation in low-resource languages like Basque, particularly for CEFR C1 proficiency level assessment.
Method: Created a dataset of 3,200 Basque essays with expert annotations (scores + feedback), fine-tuned RoBERTa-EusCrawl and Latxa 8B/70B models for scoring and explanation generation, and proposed a novel evaluation methodology combining automatic consistency metrics with expert validation.
Result: Encoder models reliable for AES, fine-tuned Latxa surpasses GPT-5 and Claude Sonnet 4.5 in scoring consistency and feedback quality, identifies wider range of error types than proprietary models.
Conclusion: Establishes foundation for transparent, reproducible NLP research in low-resource languages, demonstrates effectiveness of fine-tuned open-source models for educational applications.
Abstract: This paper introduces the first publicly available dataset for Automatic Essay Scoring (AES) and feedback generation in Basque, targeting the CEFR C1 proficiency level. The dataset comprises 3,200 essays from HABE, each annotated by expert evaluators with criterion specific scores covering correctness, richness, coherence, cohesion, and task alignment enriched with detailed feedback and error examples. We fine-tune open-source models, including RoBERTa-EusCrawl and Latxa 8B/70B, for both scoring and explanation generation. Our experiments show that encoder models remain highly reliable for AES, while supervised fine-tuning (SFT) of Latxa significantly enhances performance, surpassing state-of-the-art (SoTA) closed-source systems such as GPT-5 and Claude Sonnet 4.5 in scoring consistency and feedback quality. We also propose a novel evaluation methodology for assessing feedback generation, combining automatic consistency metrics with expert-based validation of extracted learner errors. Results demonstrate that the fine-tuned Latxa model produces criterion-aligned, pedagogically meaningful feedback and identifies a wider range of error types than proprietary models. This resource and benchmark establish a foundation for transparent, reproducible, and educationally grounded NLP research in low-resource languages such as Basque.
[143] SciLaD: A Large-Scale, Transparent, Reproducible Dataset for Natural Scientific Language Processing
Luca Foppiano, Sotaro Takeshita, Pedro Ortiz Suarez, Ekaterina Borisova, Raia Abu Ahmad, Malte Ostendorff, Fabio Barth, Julian Moreno-Schneider, Georg Rehm
Main category: cs.CL
TL;DR: SciLaD is a large-scale scientific language dataset with 10M+ English publications and 35M+ multilingual TEI XML publications, created using open-source tools, with a pre-trained RoBERTa model achieving comparable performance to similar scientific language models.
Details
Motivation: To create a comprehensive, open-source scientific language dataset that enables large-scale scientific data curation while maintaining high quality, addressing the need for accessible scientific language resources for NLP research.
Method: Constructed the dataset using open-source frameworks and publicly available data sources, with a curated English split and a multilingual TEI XML split. Developed an extensible pipeline for dataset generation and pre-trained a RoBERTa model on the dataset.
Result: Created SciLaD dataset with over 10 million English publications and 35+ million multilingual publications. The pre-trained RoBERTa model achieved performance comparable to other scientific language models of similar size across comprehensive benchmarks.
Conclusion: SciLaD demonstrates that open-source tools can enable large-scale scientific data curation with high quality. The dataset and evaluation pipeline promote reproducibility and further research in scientific language processing and understanding.
Abstract: SciLaD is a novel, large-scale dataset of scientific language constructed entirely using open-source frameworks and publicly available data sources. It comprises a curated English split containing over 10 million scientific publications and a multilingual, unfiltered TEI XML split including more than 35 million publications. We also publish the extensible pipeline for generating SciLaD. The dataset construction and processing workflow demonstrates how open-source tools can enable large-scale, scientific data curation while maintaining high data quality. Finally, we pre-train a RoBERTa model on our dataset and evaluate it across a comprehensive set of benchmarks, achieving performance comparable to other scientific language models of similar size, validating the quality and utility of SciLaD. We publish the dataset and evaluation pipeline to promote reproducibility, transparency, and further research in natural scientific language processing and understanding, including scholarly document processing.
[144] Mining Legal Arguments to Study Judicial Formalism
Tomáš Koref, Lena Held, Mahammad Namazov, Harun Kumru, Yassine Thlija, Ivan Habernal
Main category: cs.CL
TL;DR: Automated judicial reasoning analysis using NLP to detect argument types and formalism in Czech court decisions, achieving high accuracy with transformer models and a three-stage pipeline.
Details
Motivation: To systematically analyze judicial reasoning at scale in Central and Eastern Europe, testing claims about formalistic judging by developing automated methods using NLP.
Method: Created the MADON dataset of 272 Czech Supreme Court decisions with expert annotations, adapted transformer LLMs to the Czech legal domain through continued pretraining, used asymmetric loss and class weighting for imbalance, and developed a three-stage pipeline combining ModernBERT, Llama 3.1, and traditional ML.
Result: Best models achieved 82.6% Bal-F1 for argument detection, 77.5% Bal-F1 for legal argument classification, and 83.8% Bal-F1 for formalism classification. The pipeline reduced computational costs while increasing explainability.
Conclusion: Legal argument mining enables judicial philosophy classification and challenges prevailing narratives about CEE formalism. The methodology is transferable across jurisdictions and all resources are publicly available.
Abstract: Courts must justify their decisions, but systematically analyzing judicial reasoning at scale remains difficult. This study tests claims about formalistic judging in Central and Eastern Europe (CEE) by developing automated methods to detect and classify judicial reasoning in decisions of Czech Supreme Courts using state-of-the-art natural language processing methods. We create the MADON dataset of 272 decisions from two Czech Supreme Courts with expert annotations of 9,183 paragraphs with eight argument types and holistic formalism labels for supervised training and evaluation. Using a corpus of 300,511 Czech court decisions, we adapt transformer LLMs to Czech legal domain through continued pretraining and we experiment with methods to address dataset imbalance including asymmetric loss and class weighting. The best models can detect argumentative paragraphs (82.6% Bal-F1), classify traditional types of legal argument (77.5% Bal-F1), and classify decisions as formalistic/non-formalistic (83.8% Bal-F1). Our three-stage pipeline combining ModernBERT, Llama 3.1, and traditional feature-based machine learning achieves promising results for decision classification while reducing computational costs and increasing explainability. Empirically, we challenge prevailing narratives about CEE formalism. We demonstrate that legal argument mining enables promising judicial philosophy classification and highlight its potential for other important tasks in computational legal studies. Our methodology can be used across jurisdictions, and our entire pipeline, datasets, guidelines, models, and source codes are available at https://github.com/trusthlt/madon.
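The Bal-F1 figures above are not defined in this summary; assuming they denote a class-balanced (macro-averaged) F1, which weights rare and frequent argument types equally, the metric can be sketched as follows (toy labels and our own `macro_f1` helper, not the paper's evaluation code):

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: per-class F1 averaged with equal class
    weight, so rare classes count as much as frequent ones."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# Imbalanced toy labels: class "b" is rare, yet contributes
# half of the final score.
print(macro_f1(["a", "a", "a", "b"], ["a", "a", "b", "b"]))
```

A balanced metric matters here precisely because the paper reports heavy dataset imbalance across the eight argument types.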
[145] Hidden State Poisoning Attacks against Mamba-based Language Models
Alexandre Le Mercier, Chris Develder, Thomas Demeester
Main category: cs.CL
TL;DR: State space models (SSMs) like Mamba are vulnerable to Hidden State Poisoning Attacks (HiSPA), in which short input phrases can irreversibly overwrite hidden states and cause information loss, unlike Transformers, which are resistant.
Details
Motivation: While SSMs offer efficient alternatives to Transformers with linear time complexity, their adversarial robustness remains unexplored. The paper aims to investigate whether SSMs are vulnerable to specific attacks that can corrupt their hidden states and impair information retrieval capabilities.
Method: 1) Introduces HiSPA (Hidden State Poisoning Attack) where specific short phrases induce partial amnesia by overwriting hidden states. 2) Creates the RoBench-25 benchmark to evaluate information retrieval under HiSPAs. 3) Tests various models including Mamba, Jamba (hybrid SSM-Transformer), and pure Transformers. 4) Extends analysis to Mamba-2 and Nemotron-3-Nano. 5) Conducts an interpretability study of Mamba’s hidden layers during attacks.
Result: SSMs are vulnerable to HiSPAs while pure Transformers are not. Jamba-1.7-Mini (52B hybrid) collapses on RoBench-25 under HiSPA triggers. HiSPA triggers also weaken Jamba on Open-Prompt-Injections benchmark. Theoretical and empirical findings extend to Mamba-2 and Nemotron-3-Nano. Interpretability reveals patterns in Mamba’s hidden layers during attacks.
Conclusion: SSMs have critical security vulnerabilities to hidden state poisoning attacks that don’t affect Transformers, highlighting a fundamental difference in robustness between these architectures. The findings suggest the need for mitigation systems and raise concerns about deploying SSMs in security-critical applications.
Abstract: State space models (SSMs) like Mamba offer efficient alternatives to Transformer-based language models, with linear time complexity. Yet, their adversarial robustness remains critically unexplored. This paper studies the phenomenon whereby specific short input phrases induce a partial amnesia effect in such models, by irreversibly overwriting information in their hidden states, referred to as a Hidden State Poisoning Attack (HiSPA). Our benchmark RoBench-25 allows evaluating a model’s information retrieval capabilities when subject to HiSPAs, and confirms the vulnerability of SSMs against such attacks. Even the recent Jamba-1.7-Mini SSM–Transformer (a 52B hybrid model) collapses on RoBench-25 under some HiSPA triggers, whereas pure Transformers do not. We also observe that HiSPA triggers significantly weaken the Jamba model on the popular Open-Prompt-Injections benchmark, unlike pure Transformers. We further show that the theoretical and empirical findings extend to Mamba-2, and also analyse a Mamba-2-based hybrid (Nemotron-3-Nano). Finally, our interpretability study reveals patterns in Mamba’s hidden layers during HiSPAs that could be used to build a HiSPA mitigation system. The full code and data to reproduce the experiments can be found at https://anonymous.4open.science/r/hispa_anonymous-5DB0.
[146] Agent-Dice: Disentangling Knowledge Updates via Geometric Consensus for Agent Continual Learning
Zheng Wu, Xingyu Lou, Xinbei Ma, Yansi Li, Weiwen Liu, Weinan Zhang, Jun Wang, Zhuosheng Zhang
Main category: cs.CL
TL;DR: Agent-Dice: A parameter fusion framework for LLM-based agents that addresses catastrophic forgetting in continual learning by distinguishing between shared common knowledge and conflicting task-specific knowledge through directional consensus evaluation.
Details
Motivation: LLM-based agents face the stability-plasticity dilemma when learning new tasks continually, suffering from catastrophic forgetting. The core issue is the failure to distinguish between knowledge shared across tasks and conflicting knowledge from task-specific interference.
Method: Two-stage parameter fusion framework: 1) Geometric consensus filtering to prune conflicting gradients, and 2) Curvature-based importance weighting to amplify shared semantics. Uses directional consensus evaluation to disentangle knowledge updates.
Result: Extensive experiments on GUI agents and tool-use agent domains show outstanding continual learning performance with minimal computational overhead and parameter updates.
Conclusion: Agent-Dice effectively addresses the stability-plasticity dilemma in LLM-based agents by explicitly distinguishing between shared and conflicting knowledge, enabling better continual learning without catastrophic forgetting.
Abstract: Large Language Model (LLM)-based agents significantly extend the utility of LLMs by interacting with dynamic environments. However, enabling agents to continually learn new tasks without catastrophic forgetting remains a critical challenge, known as the stability-plasticity dilemma. In this work, we argue that this dilemma fundamentally arises from the failure to explicitly distinguish between common knowledge shared across tasks and conflicting knowledge introduced by task-specific interference. To address this, we propose Agent-Dice, a parameter fusion framework based on directional consensus evaluation. Concretely, Agent-Dice disentangles knowledge updates through a two-stage process: geometric consensus filtering to prune conflicting gradients, and curvature-based importance weighting to amplify shared semantics. We provide a rigorous theoretical analysis that establishes the validity of the proposed fusion scheme and offers insight into the origins of the stability-plasticity dilemma. Extensive experiments on GUI agents and tool-use agent domains demonstrate that Agent-Dice exhibits outstanding continual learning performance with minimal computational overhead and parameter updates. The codes are available at https://github.com/Wuzheng02/Agent-Dice.
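The directional-consensus idea can be illustrated with a rough coordinate-wise sketch (our own `consensus_filter` toy, not the paper's algorithm, which also applies curvature-based weighting): task-specific parameter updates are fused only where their directions agree, and conflicting coordinates are pruned.

```python
def consensus_filter(updates):
    """Fuse per-task update vectors: keep coordinates where all
    nonzero updates agree in sign (shared knowledge, averaged);
    zero out coordinates with sign conflicts (task interference)."""
    dim = len(updates[0])
    fused = []
    for i in range(dim):
        coords = [u[i] for u in updates]
        signs = {(c > 0) - (c < 0) for c in coords if c != 0}
        if len(signs) == 1:  # directional consensus
            fused.append(sum(coords) / len(coords))
        else:                # conflict (or all zero): prune
            fused.append(0.0)
    return fused

# Two tasks agree on the first and third coordinates but pull the
# second in opposite directions, so it is zeroed.
print(consensus_filter([[0.4, -0.2, 0.1], [0.2, 0.3, 0.1]]))
```

Sign-agreement filtering of this kind is a common building block in parameter-fusion methods; the point of the sketch is only the distinction between shared and conflicting update directions.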
[147] FACTUM: Mechanistic Detection of Citation Hallucination in Long-Form RAG
Maxime Dassen, Rebecca Kotula, Kenton Murray, Andrew Yates, Dawn Lawrie, Efsun Kayi, James Mayfield, Kevin Duh
Main category: cs.CL
TL;DR: FACTUM framework analyzes citation hallucinations in RAG models as coordination failures between attention and feed-forward pathways, using four mechanistic scores to detect trustworthy citations.
Details
Motivation: Current RAG models suffer from citation hallucinations where models cite sources that don't support their claims. Existing work oversimplifies this as parametric knowledge over-reliance, but the authors argue it's actually a complex coordination failure between different model pathways that evolves with scale.
Method: Introduces the FACTUM framework with four mechanistic scores: Contextual Alignment Score (CAS), Attention Sink Usage (BAS), Parametric Force Score (PFS), and Pathway Alignment Score (PAS). Analyzes how attention and feed-forward network pathways coordinate during citation generation across different model scales (3B to 8B parameters).
Result: Correct citations show higher parametric force (PFS) and greater attention sink usage (BAS). The signature of correctness evolves with scale: 3B models rely on high pathway alignment (PAS), while 8B models shift to specialized strategies with orthogonal information from different pathways. FACTUM outperforms state-of-the-art baselines by up to 37.5% in AUC.
Conclusion: Citation hallucinations are complex coordination failures between model pathways, not simple over-reliance on parametric knowledge. High parametric force can be constructive when properly coordinated with attention pathways. The FACTUM framework enables more nuanced and reliable RAG systems by capturing these mechanistic interactions.
Abstract: Retrieval-Augmented Generation (RAG) models are critically undermined by citation hallucinations, a deceptive failure where a model cites a source that fails to support its claim. While existing work attributes hallucination to a simple over-reliance on parametric knowledge, we reframe this failure as an evolving, scale-dependent coordination failure between the Attention (reading) and Feed-Forward Network (recalling) pathways. We introduce FACTUM (Framework for Attesting Citation Trustworthiness via Underlying Mechanisms), a framework of four mechanistic scores: Contextual Alignment (CAS), Attention Sink Usage (BAS), Parametric Force (PFS), and Pathway Alignment (PAS). Our analysis reveals that correct citations are consistently marked by higher parametric force (PFS) and greater use of the attention sink (BAS) for information synthesis. Crucially, we find that “one-size-fits-all” theories are insufficient as the signature of correctness evolves with scale: while the 3B model relies on high pathway alignment (PAS), our best-performing 8B detector identifies a shift toward a specialized strategy where pathways provide distinct, orthogonal information. By capturing this complex interplay, FACTUM outperforms state-of-the-art baselines by up to 37.5% in AUC. Our results demonstrate that high parametric force is constructive when successfully coordinated with the Attention pathway, paving the way for more nuanced and reliable RAG systems.
[148] Pantagruel: Unified Self-Supervised Encoders for French Text and Speech
Phuong-Hang Le, Valentin Pelloin, Arnault Chatelain, Maryem Bouziane, Mohammed Ghennai, Qianwen Guan, Kirill Milintsevich, Salima Mdhaffar, Aidan Mannion, Nils Defauw, Shuyue Gu, Alexandre Audibert, Marco Dinarelli, Yannick Estève, Lorraine Goeuriot, Steffen Lalande, Nicolas Hervé, Maximin Coavoux, François Portet, Étienne Ollion, Marie Candito, Maxime Peyrard, Solange Rossato, Benjamin Lecouteux, Aurélie Nardy, Gilles Sérasset, Vincent Segonne, Solène Evain, Diandra Fabre, Didier Schwab
Main category: cs.CL
TL;DR: Pantagruel: Self-supervised encoder models for French text and speech that learn contextualized target representations in feature space, enabling effective multimodal understanding with competitive performance on French benchmarks.
Details
Motivation: To create effective multimodal models for French that can handle both text and speech inputs seamlessly, addressing the need for French-specific multimodal representation learning beyond modality-specific targets like textual tokens or speech units.
Method: Uses self-supervised learning with feature-space objectives where modality-specific encoders learn contextualized target representations. Pre-trained on large-scale French corpora: Wikipedia, OSCAR, and CroissantLLM for text; MultilingualLibriSpeech, LeBenchmark, and the newly introduced INA-100k (100k hours of French audio from national archives) for speech.
Result: Pantagruel models show competitive or superior performance compared to strong French baselines (CamemBERT, FlauBERT, LeBenchmark2.0) across various downstream tasks in both modalities, while maintaining a shared architecture that handles speech or text inputs seamlessly.
Conclusion: Feature-space self-supervised objectives are effective for French representation learning, and Pantagruel serves as a robust foundation for multimodal speech-text understanding in French.
Abstract: We release Pantagruel models, a new family of self-supervised encoder models for French text and speech. Instead of predicting modality-tailored targets such as textual tokens or speech units, Pantagruel learns contextualized target representations in the feature space, allowing modality-specific encoders to capture linguistic and acoustic regularities more effectively. Separate models are pre-trained on large-scale French corpora, including Wikipedia, OSCAR and CroissantLLM for text, together with MultilingualLibriSpeech, LeBenchmark, and INA-100k for speech. INA-100k is a newly introduced 100,000-hour corpus of French audio derived from the archives of the Institut National de l’Audiovisuel (INA), the national repository of French radio and television broadcasts, providing highly diverse audio data. We evaluate Pantagruel across a broad range of downstream tasks spanning both modalities, including those from the standard French benchmarks such as FLUE or LeBenchmark. Across these tasks, Pantagruel models show competitive or superior performance compared to strong French baselines such as CamemBERT, FlauBERT, and LeBenchmark2.0, while maintaining a shared architecture that can seamlessly handle either speech or text inputs. These results confirm the effectiveness of feature-space self-supervised objectives for French representation learning and highlight Pantagruel as a robust foundation for multimodal speech-text understanding.
[149] Measuring Iterative Temporal Reasoning with Time Puzzles
Zhengxiang Wang, Zeyu Dong
Main category: cs.CL
TL;DR: Time Puzzles: A benchmark for evaluating iterative temporal reasoning with tools in LLMs, using algorithmically generated date inference tasks that combine factual anchors with calendar relations.
Details
Motivation: Existing benchmarks evaluate temporal reasoning in static, non-tool-using settings, which poorly reflects how LLMs perform temporal reasoning in practice with tools like web search.
Method: Introduces Time Puzzles - constraint-based date inference tasks algorithmically generated to combine factual temporal anchors with cross-cultural calendar relations, enabling controlled evaluation of iterative temporal reasoning with tools.
Result: Across 13 LLMs, best model (GPT-5) achieves only 55.3% accuracy without tools, despite using easily searchable facts. Web search improves performance, but models perform substantially better when constraints are rewritten with explicit dates, removing factual lookup needs.
Conclusion: Reveals a gap in reliable tool use for iterative temporal reasoning, showing that while tools help, models struggle with the iterative reasoning process needed for temporal constraint satisfaction.
Abstract: Tool use, such as web search, has become a standard capability even in freely available large language models (LLMs). However, existing benchmarks evaluate temporal reasoning mainly in static, non-tool-using settings, which poorly reflect how LLMs perform temporal reasoning in practice. We introduce Time Puzzles, a constraint-based date inference task for evaluating iterative temporal reasoning with tools. Each puzzle combines factual temporal anchors with (cross-cultural) calendar relations and may admit one or multiple valid dates. The puzzles are algorithmically generated, enabling controlled and continual evaluation. Across 13 LLMs, even the best model (GPT-5) achieves only 55.3% accuracy without tools, despite using easily searchable facts. While web search improves performance, models perform substantially better when constraints are rewritten with explicit dates, removing the need for factual lookup. These results reveal a gap in reliable tool use for iterative temporal reasoning.
[150] Script Sensitivity: Benchmarking Language Models on Unicode, Romanized and Mixed-Script Sinhala
Minuri Rajapakse, Ruvan Weerasinghe
Main category: cs.CL
TL;DR: Benchmarking 24 open-source LMs on Sinhala reveals severe script sensitivity - 300x performance drop from Unicode to Romanized text, with model size not correlating with script-handling ability.
Details
Motivation: To evaluate LM performance on low-resource, morphologically rich languages like Sinhala that exhibit script duality (Unicode for formal contexts vs Romanized text for social media), where mixed-script usage is common in practice.
Method: Benchmarked 24 open-source language models on Unicode, Romanized, and mixed-script Sinhala using perplexity evaluation across diverse text sources.
Result: Found substantial script sensitivity with median performance degradation exceeding 300 times from Unicode to Romanized text. Model size showed no correlation with script-handling competence, and Unicode performance strongly predicted mixed-script robustness but not Romanized capability.
Conclusion: Single-script evaluation substantially underestimates real-world deployment challenges for LMs in multi-script low-resource environments. Provides baseline capabilities for Sinhala and practical guidance for model selection.
Abstract: The performance of Language Models (LMs) on low-resource, morphologically rich languages like Sinhala remains largely unexplored, particularly regarding script variation in digital communication. Sinhala exhibits script duality, with Unicode used in formal contexts and Romanized text dominating social media, while mixed-script usage is common in practice. This paper benchmarks 24 open-source LMs on Unicode, Romanized and mixed-script Sinhala using perplexity evaluation across diverse text sources. Results reveal substantial script sensitivity, with median performance degradation exceeding 300 times from Unicode to Romanized text. Critically, model size shows no correlation with script-handling competence, as smaller models often outperform architectures 28 times larger. Unicode performance strongly predicts mixed-script robustness but not Romanized capability, demonstrating that single-script evaluation substantially underestimates real-world deployment challenges. These findings establish baseline LM capabilities for Sinhala and provide practical guidance for model selection in multi-script low-resource environments.
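Perplexity, the evaluation metric used throughout, is the exponentiated mean negative log-likelihood per token. A minimal sketch with invented token log-probabilities standing in for model outputs:

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood per token."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Hypothetical per-token log-probs for the same content in two scripts:
unicode_lp   = [-1.2, -0.8, -1.5, -1.0]   # model is fairly confident
romanized_lp = [-6.5, -7.2, -6.9, -7.4]   # same content, poorly handled script

ppl_uni = perplexity(unicode_lp)
ppl_rom = perplexity(romanized_lp)
print(f"{ppl_uni:.1f} vs {ppl_rom:.1f}, ratio {ppl_rom / ppl_uni:.0f}x")
```

With these made-up numbers the ratio is a few hundred, the order of magnitude the paper reports between Unicode and Romanized text.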
[151] Different Demographic Cues Yield Inconsistent Conclusions About LLM Personalization and Bias
Manuel Tonneau, Neil K. R. Seghal, Niyati Malhotra, Sharif Kazemi, Victor Orozco-Olvera, Ana María Muñoz Boudet, Lakshmi Subramanian, Samuel P. Fraiberger, Sharath Chandra Guntuku, Valentin Hofmann
Main category: cs.CL
TL;DR: Demographic cue-based evaluation of LLMs shows that different cues (e.g., names) for the same demographic group produce inconsistent model responses, challenging the assumption that cues are interchangeable proxies for identity-conditioned behavior.
Details
Motivation: To test the assumption that different demographic cues (like names) are interchangeable operationalizations of the same identity-conditioned behavior in LLMs, and to understand how cue selection affects conclusions about demographic adaptation and bias.
Method: Analyzed 14.8 million prompts in realistic advice-seeking interactions focusing on race and gender in a U.S. context, testing multiple demographic cues for the same groups to measure consistency in model responses.
Result: Different cues for the same demographic group induce only partially overlapping changes in model responses, leading to inconsistent conclusions about personalization and unstable bias assessments. Inconsistencies reflect differences in cue-group association strength and bundled linguistic features.
Conclusion: Demographic conditioning in LLMs is not cue-invariant but depends fundamentally on how identity is cued, reflecting responses to linguistic signals rather than stable demographic categories. Advocates for multi-cue, mechanism-aware evaluations for robust claims about demographic variation.
Abstract: Demographic cue-based evaluation is widely used to study how large language models (LLMs) adapt their responses to signaled demographic attributes within and across groups. This approach typically relies on a single cue (e.g., names) as a proxy for group membership, implicitly treating different cues as interchangeable operationalizations of the same identity-conditioned behavior. We test this assumption in realistic advice-seeking interactions spanning 14.8 million prompts, focusing on race and gender in a U.S. context. We find that cues for the same group induce only partially overlapping changes in model responses, yielding inconsistent conclusions about personalization, while bias conclusions are unstable, with both magnitude and direction of group differences varying across cues. We further show that these inconsistencies reflect differences in cue-group association strength and linguistic features bundled within cues that shape model responses. Together, our findings suggest that demographic conditioning in LLMs is not a cue-invariant category-level parameter but depends fundamentally on how identity is cued, reflecting responses to linguistic signals rather than stable demographic categories. We therefore advocate multi-cue, mechanism-aware evaluations for robust and interpretable claims about demographic variation in LLM responses.
[152] LinguaMap: Which Layers of LLMs Speak Your Language and How to Tune Them?
J. Ben Tamo, Daniel Carlander-Reuterfelt, Jonathan Rubin, Dezhi Hong, Mingxian Wang, Oleg Poliannikov
Main category: cs.CL
TL;DR: Selective fine-tuning of final layers improves language control in multilingual LLMs while maintaining task accuracy with minimal parameter updates.
Details
Motivation: Multilingual LLMs struggle with language control (responding in the intended language even when the task response is correct), which calls for efficient adaptation methods.
Method: Four-scenario evaluation protocol, logit lens analysis for language probability tracking, cross-lingual semantic similarity analysis, and selective fine-tuning of final layers responsible for language control.
Result: Achieved over 98% language consistency across six languages while fine-tuning only 3-5% of parameters, matching full-scope fine-tuning performance with fraction of computational resources.
Conclusion: Layer-localization of language control enables efficient multilingual adaptation, with final layers being key for language-specific generation while early/middle layers handle semantic alignment and task reasoning.
Abstract: Despite multilingual pretraining, large language models often struggle with non-English tasks, particularly in language control, the ability to respond in the intended language. We identify and characterize two key failure modes: the multilingual transfer bottleneck (correct language, incorrect task response) and the language consistency bottleneck (correct task response, wrong language). To systematically surface these issues, we design a four-scenario evaluation protocol spanning MMLU, MGSM, and XQuAD benchmarks. To probe these issues with interpretability, we extend logit lens analysis to track language probabilities layer by layer and compute cross-lingual semantic similarity of hidden states. The results reveal a three-phase internal structure: early layers align inputs into a shared semantic space, middle layers perform task reasoning, and late layers drive language-specific generation. Guided by these insights, we introduce selective fine-tuning of only the final layers responsible for language control. On Qwen-3-32B and Bloom-7.1B, this method achieves over 98 percent language consistency across six languages while fine-tuning only 3-5 percent of parameters, without sacrificing task accuracy. Importantly, this result is nearly identical to that of full-scope fine-tuning (for example, above 98 percent language consistency for both methods across all prompt scenarios) but uses a fraction of the computational resources. To the best of our knowledge, this is the first approach to leverage layer-localization of language control for efficient multilingual adaptation.
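The logit-lens analysis tracks how much probability each layer assigns to tokens of the target language. A toy illustration with fabricated per-layer logits over a two-way English/target-language vocabulary split (the three-phase pattern is the paper's finding; the numbers are made up):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Fabricated (English, target-language) logits at representative layers.
layer_logits = {
    "early":  (0.1, 0.1),   # shared semantic space: no language preference
    "middle": (1.5, 0.5),   # task reasoning drifts toward an English pivot
    "late":   (0.2, 4.0),   # final layers commit to the target language
}

for name, logits in layer_logits.items():
    p_en, p_target = softmax(logits)
    print(f"{name:6s} P(target-language token) = {p_target:.2f}")
```

The late-layer commitment is what motivates fine-tuning only the final layers: that is where language-specific generation is decided.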
[153] Detecting AI-Generated Content in Academic Peer Reviews
Siyuan Shen, Kai Wang
Main category: cs.CL
TL;DR: Study examines AI-generated content in peer reviews over time, finding minimal detection before 2022 but substantial increases through 2025, with ~20% of ICLR and 12% of Nature Communications reviews classified as AI-generated in 2025.
Details
Motivation: To understand the temporal emergence and prevalence of AI-generated content in academic peer review as large language models become more available, and to examine the implications for scholarly evaluation.
Method: Applied a detection model trained on historical reviews to later review cycles at International Conference on Learning Representations (ICLR) and Nature Communications (NC), tracking temporal patterns from pre-2022 through 2025.
Result: Minimal AI-generated content detected before 2022, followed by substantial increase through 2025: approximately 20% of ICLR reviews and 12% of Nature Communications reviews classified as AI-generated in 2025. Most pronounced growth in NC occurred between Q3 and Q4 2024.
Conclusion: Evidence suggests rapidly increasing presence of AI-assisted content in peer review, highlighting need for further study of implications for scholarly evaluation.
Abstract: The growing availability of large language models (LLMs) has raised questions about their role in academic peer review. This study examines the temporal emergence of AI-generated content in peer reviews by applying a detection model trained on historical reviews to later review cycles at International Conference on Learning Representations (ICLR) and Nature Communications (NC). We observe minimal detection of AI-generated content before 2022, followed by a substantial increase through 2025, with approximately 20% of ICLR reviews and 12% of Nature Communications reviews classified as AI-generated in 2025. The most pronounced growth of AI-generated reviews in NC occurs between the third and fourth quarter of 2024. Together, these findings provide suggestive evidence of a rapidly increasing presence of AI-assisted content in peer review and highlight the need for further study of its implications for scholarly evaluation.
[154] Semantic Self-Distillation for Language Model Uncertainty
Edward Phillips, Sean Wu, Fredrik K. Gustafsson, Boyan Gao, David A. Clifton
Main category: cs.CL
TL;DR: Semantic Self-Distillation (SSD) distills semantic uncertainty from LLMs into lightweight student models for efficient uncertainty quantification without expensive sampling.
Details
Motivation: Large language models need principled uncertainty quantification, but semantic dispersion (variance in the meaning of sampled answers) is computationally expensive for latency-critical applications.
Method: Distill sampled semantic distributions into lightweight student models that estimate a prompt-conditioned density before the LLM generates answer tokens. Student models predict semantic distributions over possible answers.
Result: Student models perform competitively relative to sampling-based semantic dispersion baselines on hallucination prediction tasks (TriviaQA, MMLU), while offering additional uncertainty primitives for out-of-domain detection and multiple-choice answer selection.
Conclusion: SSD provides an efficient framework for distilling predictive uncertainty in complex output spaces beyond language, enabling prompt-level uncertainty signals and answer-level reliability evaluation without expensive sampling.
Abstract: Large language models present challenges for principled uncertainty quantification, in part due to their complexity and the diversity of their outputs. Semantic dispersion, or the variance in the meaning of sampled answers, has been proposed as a useful proxy for model uncertainty, but the associated computational cost prohibits its use in latency-critical applications. We show that sampled semantic distributions can be distilled into lightweight student models which estimate a prompt-conditioned density before the language model generates an answer token. The student model predicts a semantic distribution over possible answers; the entropy of this distribution provides a prompt-level uncertainty signal, and the probability density allows answer-level reliability evaluation. Across experiments on TriviaQA and MMLU, we find our student models perform competitively relative to sampling-based semantic dispersion baselines on a hallucination prediction task, whilst offering additional uncertainty primitives for out-of-domain detection and multiple-choice answer selection. We term this technique Semantic Self-Distillation (SSD), which can serve as a general framework for distilling predictive uncertainty in complex output spaces beyond language.
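The prompt-level uncertainty signal is simply the entropy of the student's predicted semantic distribution. A minimal sketch, with invented distributions over semantic answer clusters:

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0)

# Hypothetical student predictions over semantic answer clusters:
confident = [0.92, 0.05, 0.03]        # low entropy -> answer likely reliable
uncertain = [0.30, 0.25, 0.25, 0.20]  # high entropy -> hallucination risk

print(f"confident: {entropy(confident):.2f} nats")
print(f"uncertain: {entropy(uncertain):.2f} nats")
```

The density the student assigns to a particular candidate answer then serves as the answer-level reliability score mentioned in the abstract.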
[155] On-Policy Context Distillation for Language Models
Tianzhu Ye, Li Dong, Xun Wu, Shaohan Huang, Furu Wei
Main category: cs.CL
TL;DR: OPCD is a framework that combines on-policy distillation with context distillation, enabling language models to internalize in-context knowledge by training on their own generated trajectories while minimizing reverse KL divergence against a context-conditioned teacher.
Details
Motivation: The paper aims to bridge on-policy distillation with context distillation to help language models better internalize in-context knowledge into their parameters, addressing limitations of existing methods for knowledge consolidation from historical solution traces and system prompts.
Method: Proposes On-Policy Context Distillation (OPCD) framework where a student model is trained on its own generated trajectories while minimizing reverse Kullback-Leibler divergence against a context-conditioned teacher model. This combines on-policy learning with context distillation principles.
Result: OPCD consistently outperforms baseline methods across mathematical reasoning, text-based games, and domain-specific tasks, achieving higher task accuracy while better preserving out-of-distribution capabilities. It also enables effective cross-size distillation where smaller models can internalize experiential knowledge from larger teachers.
Conclusion: OPCD provides an effective framework for knowledge internalization in language models, successfully bridging on-policy and context distillation approaches, with applications in experiential knowledge distillation and system prompt distillation.
Abstract: Context distillation enables language models to internalize in-context knowledge into their parameters. In our work, we propose On-Policy Context Distillation (OPCD), a framework that bridges on-policy distillation with context distillation by training a student model on its own generated trajectories while minimizing reverse Kullback-Leibler divergence against a context-conditioned teacher. We demonstrate the effectiveness of OPCD on two important applications: experiential knowledge distillation, where models extract and consolidate transferable knowledge from their historical solution traces, and system prompt distillation, where models internalize beneficial behaviors encoded in optimized prompts. Across mathematical reasoning, text-based games, and domain-specific tasks, OPCD consistently outperforms baseline methods, achieving higher task accuracy while better preserving out-of-distribution capabilities. We further show that OPCD enables effective cross-size distillation, where smaller student models can internalize experiential knowledge from larger teachers.
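The training signal is the reverse KL divergence, KL(student || teacher), evaluated on student-sampled trajectories. A toy per-position sketch with invented next-token distributions (not the paper's implementation):

```python
import math

def reverse_kl(student, teacher):
    """KL(student || teacher) over a shared token vocabulary."""
    return sum(s * math.log(s / t) for s, t in zip(student, teacher) if s > 0)

# Toy next-token distributions at one position of a student-sampled trajectory.
student = [0.70, 0.20, 0.10]
teacher_with_context = [0.60, 0.30, 0.10]   # teacher conditioned on the context
teacher_no_context   = [0.34, 0.33, 0.33]   # same teacher without the context

print(reverse_kl(student, teacher_with_context))  # small: behaviors agree
print(reverse_kl(student, teacher_no_context))    # larger: context not internalized
```

Because the divergence is computed on the student's own samples (on-policy) against the context-conditioned teacher, driving it to zero means the student reproduces the teacher's in-context behavior without the context.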
[156] Rethinking Soft Compression in Retrieval-Augmented Generation: A Query-Conditioned Selector Perspective
Yunhao Liu, Zian Jia, Xinyu Gao, Kanjun Xu, Yun Xiong
Main category: cs.CL
TL;DR: SeleCom is a selector-based soft compression framework for RAG that uses query-conditioned information selection instead of full document compression, achieving better performance while reducing computation by 33.8%-84.6%.
Details
Motivation: Current soft context compression methods for RAG underperform non-compressed approaches because they use auto-encoder-like full compression that forces encoding of all document information regardless of query relevance, leading to information dilution and conflict with LLM generation behavior.
Method: SeleCom redefines the encoder’s role as a decoder-only, query-conditioned information selector trained with a massive, diverse, difficulty-graded synthetic QA dataset using curriculum learning, focusing only on relevant information rather than full document compression.
Result: SeleCom significantly outperforms existing soft compression approaches and achieves competitive or superior performance to non-compression baselines while reducing computation and latency by 33.8%-84.6%.
Conclusion: Query-conditioned information selection is more effective than full document compression for RAG, addressing scalability issues while maintaining or improving performance through selective encoding of relevant information.
Abstract: Retrieval-Augmented Generation (RAG) effectively grounds Large Language Models (LLMs) with external knowledge and is widely applied to Web-related tasks. However, its scalability is hindered by excessive context length and redundant retrievals. Recent research on soft context compression aims to address this by encoding long documents into compact embeddings, yet they often underperform non-compressed RAG due to their reliance on auto-encoder-like full-compression that forces the encoder to compress all document information regardless of relevance to the input query. In this work, we conduct an analysis on this paradigm and reveal two fundamental limitations: (I) Infeasibility, full-compression conflicts with the LLM’s downstream generation behavior; and (II) Non-necessity: full-compression is unnecessary and dilutes task-relevant information density. Motivated by these insights, we introduce SeleCom, a selector-based soft compression framework for RAG that redefines the encoder’s role as query-conditioned information selector. The selector is decoder-only and is trained with a massive, diverse and difficulty-graded synthetic QA dataset with curriculum learning. Extensive experiments show that SeleCom significantly outperforms existing soft compression approaches and achieves competitive or superior performance to non-compression baselines, while reducing computation and latency by 33.8%~84.6%.
[157] Why Agent Caching Fails and How to Fix It: Structured Intent Canonicalization with Few-Shot Learning
Abhinaba Basu
Main category: cs.CL
TL;DR: The paper introduces W5H2, a structured intent decomposition framework for efficient caching in personal AI agents, achieving high accuracy with minimal latency and significant cost reduction.
Details
Motivation: Personal AI agents incur high costs from repeated LLM calls, and existing caching methods fail because they optimize the wrong property: classification accuracy, whereas cache effectiveness requires key consistency and precision.
Method: Proposes W5H2 structured intent decomposition framework, applies V-measure decomposition to separate clustering properties, uses SetFit with few-shot learning, and implements a five-tier cascade with risk-controlled selective prediction guarantees via RCPS.
Result: Achieves 91.1% accuracy on MASSIVE dataset in ~2ms (vs 37.9% for GPTCache and 68.8% for 20B-parameter LLM at 3,447ms), handles 85% of interactions locally, projects 97.5% cost reduction, and demonstrates cross-lingual transfer across 30 languages.
Conclusion: W5H2 framework enables efficient caching for AI agents by focusing on cache-specific properties rather than general classification accuracy, achieving high performance with minimal latency and significant cost savings.
Abstract: Personal AI agents incur substantial cost via repeated LLM calls. We show existing caching methods fail: GPTCache achieves 37.9% accuracy on real benchmarks; APC achieves 0-12%. The root cause is optimizing for the wrong property – cache effectiveness requires key consistency and precision, not classification accuracy. We observe cache-key evaluation reduces to clustering evaluation and apply V-measure decomposition to separate these on n=8,682 points across MASSIVE, BANKING77, CLINC150, and NyayaBench v2, our new 8,514-entry multilingual agentic dataset (528 intents, 20 W5H2 classes, 63 languages). We introduce W5H2, a structured intent decomposition framework. Using SetFit with 8 examples per class, W5H2 achieves 91.1%+/-1.7% on MASSIVE in ~2ms – vs 37.9% for GPTCache and 68.8% for a 20B-parameter LLM at 3,447ms. On NyayaBench v2 (20 classes), SetFit achieves 55.3%, with cross-lingual transfer across 30 languages. Our five-tier cascade handles 85% of interactions locally, projecting 97.5% cost reduction. We provide risk-controlled selective prediction guarantees via RCPS with nine bound families.
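V-measure decomposition, used above to separate cache-key precision from consistency, derives homogeneity and completeness from conditional entropies over a labels-vs-keys contingency. A self-contained sketch (toy data, not the paper's code):

```python
import math
from collections import Counter

def _entropy(counts, n):
    return -sum(c / n * math.log(c / n) for c in counts if c)

def v_measure(labels, keys):
    """Homogeneity, completeness, and their harmonic mean (V-measure)."""
    n = len(labels)
    h_l = _entropy(Counter(labels).values(), n)
    h_k = _entropy(Counter(keys).values(), n)
    h_joint = _entropy(Counter(zip(labels, keys)).values(), n)
    h_l_given_k = h_joint - h_k   # H(L|K) = H(L,K) - H(K)
    h_k_given_l = h_joint - h_l
    hom = 1 - h_l_given_k / h_l if h_l else 1.0
    com = 1 - h_k_given_l / h_k if h_k else 1.0
    v = 2 * hom * com / (hom + com) if hom + com else 0.0
    return hom, com, v

# Toy intents vs cache keys: key "k1" mixes two intents (hurts homogeneity,
# i.e. cache precision); "set_alarm" is split across keys (hurts completeness).
labels = ["play_music", "play_music", "set_alarm", "set_alarm"]
keys   = ["k1",         "k1",         "k1",        "k2"]
print(v_measure(labels, keys))
```

Low homogeneity means a cache key serves mixed intents (wrong cache hits); low completeness means one intent fragments across keys (missed cache hits), which is exactly the split the paper argues plain classification accuracy conflates.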
[158] Unlocking Multimodal Document Intelligence: From Current Triumphs to Future Frontiers of Visual Document Retrieval
Yibo Yan, Jiahao Huo, Guanbo Feng, Mingdong Ou, Yi Cao, Xin Zou, Shuliang Liu, Yuanhuiyi Lyu, Yu Huang, Jungang Li, Kening Zheng, Xu Zheng, Philip S. Yu, James Kwok, Xuming Hu
Main category: cs.CL
TL;DR: A comprehensive survey of Visual Document Retrieval (VDR) focusing on Multimodal Large Language Models, covering benchmarks, methods (embedding models, rerankers, RAG/Agentic systems), and future directions.
Details
Motivation: With the growth of multimodal information, VDR has become crucial for extracting precise information from visually rich documents that contain dense text, complex layouts, and fine-grained semantics, requiring specialized approaches beyond traditional image retrieval.
Method: The paper presents a survey methodology: 1) Examination of the benchmark landscape, 2) Categorization of methods into multimodal embedding models, multimodal reranker models, and integration of RAG/Agentic systems for document intelligence.
Result: Provides a comprehensive overview of the VDR field, identifies key methodological approaches, and establishes a framework for understanding current state and future directions in multimodal document intelligence.
Conclusion: The survey offers a clear roadmap for future multimodal document intelligence by identifying persistent challenges and outlining promising research directions in the MLLM era.
Abstract: With the rapid proliferation of multimodal information, Visual Document Retrieval (VDR) has emerged as a critical frontier in bridging the gap between unstructured visually rich data and precise information acquisition. Unlike traditional natural image retrieval, visual documents exhibit unique characteristics defined by dense textual content, intricate layouts, and fine-grained semantic dependencies. This paper presents the first comprehensive survey of the VDR landscape, specifically through the lens of the Multimodal Large Language Model (MLLM) era. We begin by examining the benchmark landscape, and subsequently dive into the methodological evolution, categorizing approaches into three primary aspects: multimodal embedding models, multimodal reranker models, and the integration of Retrieval-Augmented Generation (RAG) and Agentic systems for complex document intelligence. Finally, we identify persistent challenges and outline promising future directions, aiming to provide a clear roadmap for future multimodal document intelligence.
[159] Feature Resemblance: Towards a Theoretical Understanding of Analogical Reasoning in Transformers
Ruichen Xu, Wenjing Yan, Ying-Jun Angela Zhang
Main category: cs.CL
TL;DR: Transformers learn analogical reasoning through aligned representations when jointly trained on similarity and attribution premises, with specific curriculum requirements and data dependencies.
Details
Motivation: To isolate and understand analogical reasoning in transformers, separate from other reasoning types that are often conflated in evaluations, and analyze its emergence mechanisms.
Method: Theoretical analysis of transformer architectures with proofs of three key results about analogical reasoning, validated through experiments with models up to 1.5B parameters.
Result: Transformers encode entities with similar properties into similar representations, enabling property transfer through feature alignment. Joint training on similarity and attribution premises is necessary, and two-hop reasoning reduces to analogical reasoning with identity bridges.
Conclusion: Analogical reasoning in transformers emerges through representational geometry and feature alignment, with specific training requirements and data dependencies that reveal a unified mechanism for inductive reasoning.
Abstract: Understanding reasoning in large language models is complicated by evaluations that conflate multiple reasoning types. We isolate analogical reasoning (inferring shared properties between entities based on known similarities) and analyze its emergence in transformers. We theoretically prove three key results: (1) Joint training on similarity and attribution premises enables analogical reasoning through aligned representations; (2) Sequential training succeeds only when similarity structure is learned before specific attributes, revealing a necessary curriculum; (3) Two-hop reasoning ($a \to b, b \to c \implies a \to c$) reduces to analogical reasoning with identity bridges ($b = b$), which must appear explicitly in training data. These results reveal a unified mechanism: transformers encode entities with similar properties into similar representations, enabling property transfer through feature alignment. Experiments with architectures up to 1.5B parameters validate our theory and demonstrate how representational geometry shapes inductive reasoning capabilities.
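The proposed mechanism, property transfer through aligned representations, can be sketched with fabricated toy embeddings: entities stated to be similar end up with nearby vectors, so an attribute probe learned for one fires for the other:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Fabricated representations after joint training on similarity and
# attribution premises: "sparrow" and "robin" were stated to be similar.
emb = {
    "sparrow": [0.9, 0.1, 0.2],
    "robin":   [0.8, 0.2, 0.3],
    "trout":   [0.1, 0.9, 0.1],
}
attribute_readout = [1.0, 0.0, 0.0]   # toy linear probe for "can fly"

def has_property(name):
    return sum(a * b for a, b in zip(emb[name], attribute_readout)) > 0.5

# The property learned for "sparrow" transfers to the aligned "robin"
# but not to the distant "trout".
print(cosine(emb["sparrow"], emb["robin"]), has_property("robin"), has_property("trout"))
```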
[160] Mind the Gap: Pitfalls of LLM Alignment with Asian Public Opinion
Hari Shankar, Vedanta S P, Sriharini Margapuri, Debjani Mazumder, Ponnurangam Kumaraguru, Abhijnan Chakraborty
Main category: cs.CL
TL;DR: LLMs show cultural misalignment with non-Western societies, especially on religious topics, despite performing well on general social issues.
Details
Motivation: LLMs are deployed globally but trained on English-centric data, risking misalignment with diverse cultural values, particularly in sensitive domains like religion across India, East Asia, and Southeast Asia.
Method: Multilingual audit of GPT-4o-Mini, Gemini-2.5-Flash, Llama 3.2, Mistral and Gemma 3 using internal representation analysis (log-probs/logits) to compare model opinion distributions against ground-truth public attitudes, plus bias benchmark evaluations (CrowS-Pairs, IndiBias, ThaiCLI, KoBBQ).
Result: Models generally align with public opinion on broad social issues but consistently fail on religious viewpoints, especially minority groups, often amplifying negative stereotypes. Lightweight interventions (demographic priming, native language prompting) help partially but don’t eliminate gaps.
Conclusion: Systematic, regionally grounded audits are urgently needed for equitable global LLM deployment due to persistent cultural misalignment in sensitive contexts.
Abstract: Large Language Models (LLMs) are increasingly being deployed in multilingual, multicultural settings, yet their reliance on predominantly English-centric training data risks misalignment with the diverse cultural values of different societies. In this paper, we present a comprehensive, multilingual audit of the cultural alignment of contemporary LLMs including GPT-4o-Mini, Gemini-2.5-Flash, Llama 3.2, Mistral and Gemma 3 across India, East Asia and Southeast Asia. Our study specifically focuses on the sensitive domain of religion as the prism for broader alignment. To facilitate this, we conduct a multi-faceted analysis of every LLM’s internal representations, using log-probs/logits, to compare the model’s opinion distributions against ground-truth public attitudes. We find that while the popular models generally align with public opinion on broad social issues, they consistently fail to accurately represent religious viewpoints, especially those of minority groups, often amplifying negative stereotypes. Lightweight interventions, such as demographic priming and native language prompting, partially mitigate but do not eliminate these cultural gaps. We further show that downstream evaluations on bias benchmarks (such as CrowS-Pairs, IndiBias, ThaiCLI, KoBBQ) reveal persistent harms and under-representation in sensitive contexts. Our findings underscore the urgent need for systematic, regionally grounded audits to ensure equitable global deployment of LLMs.
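Comparing a model's log-prob-derived opinion distribution with survey ground truth requires a distance between discrete distributions; total variation distance is one standard choice (the paper does not specify its exact measure, and all numbers below are invented):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def total_variation(p, q):
    """Half the L1 distance between two distributions (0 = identical, 1 = disjoint)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

# Hypothetical answer-option logits read off the model, vs. survey shares.
model_logits = [2.0, 0.5, -1.0]          # agree / neutral / disagree
model_dist = softmax(model_logits)
survey_dist = [0.35, 0.30, 0.35]         # ground-truth public attitudes

print(f"TV distance: {total_variation(model_dist, survey_dist):.2f}")
```

Here the model concentrates mass on "agree" while the population is split, the kind of gap the audit surfaces on religious questions.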
[161] Rethinking Evaluation in Retrieval-Augmented Personalized Dialogue: A Cognitive and Linguistic Perspective
Tianyi Zhang, David Traum
Main category: cs.CL
TL;DR: The paper critiques current evaluation metrics for personalized dialogue systems, using LAPDOG as a case study to show that surface-level similarity metrics (BLEU, ROUGE, F1) fail to capture deeper conversational qualities like coherence and consistency, and proposes more cognitively grounded evaluation methods.
Details
Motivation: Current evaluation practices for open-domain and personalized dialogue systems rely heavily on surface-level similarity metrics that don't capture the deeper aspects of conversational quality emphasized in cognitive science and linguistic theory, such as coherence, consistency, and shared understanding.
Method: The researchers re-examine LAPDOG, a retrieval-augmented framework for personalized dialogue, as a case study. They use both human judges and LLM-based judges to evaluate dialogue quality, identifying specific limitations like corrupted dialogue histories, contradictions between retrieved stories and persona, and incoherent response generation.
Result: Human and LLM judgments align closely with each other but diverge significantly from lexical similarity metrics. The study reveals specific failure modes in current systems that surface-level metrics miss, highlighting the inadequacy of current evaluation practices.
Conclusion: The work calls for more cognitively grounded evaluation methods for retrieval-augmented dialogue systems that better reflect principles of natural human communication, and charts a path toward more reliable assessment frameworks.
Abstract: In cognitive science and linguistic theory, dialogue is not seen as a chain of independent utterances but rather as a joint activity sustained by coherence, consistency, and shared understanding. However, many systems for open-domain and personalized dialogue use surface-level similarity metrics (e.g., BLEU, ROUGE, F1) as one of their main reporting measures, which fail to capture these deeper aspects of conversational quality. We re-examine a notable retrieval-augmented framework for personalized dialogue, LAPDOG, as a case study for evaluation methodology. Using both human and LLM-based judges, we identify limitations in current evaluation practices, including corrupted dialogue histories, contradictions between retrieved stories and persona, and incoherent response generation. Our results show that human and LLM judgments align closely but diverge from lexical similarity metrics, underscoring the need for cognitively grounded evaluation methods. Broadly, this work charts a path toward more reliable assessment frameworks for retrieval-augmented dialogue systems that better reflect the principles of natural human communication.
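The shortcoming of lexical metrics is easy to reproduce: SQuAD-style token-overlap F1 can score a degenerate echo above a coherent paraphrase. A toy illustration with invented responses:

```python
def token_f1(prediction: str, reference: str) -> float:
    """Multiset token-overlap F1 between two strings."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    ref_pool = list(ref)
    common = 0
    for tok in pred:
        if tok in ref_pool:
            ref_pool.remove(tok)
            common += 1
    if common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(ref)
    return 2 * precision * recall / (precision + recall)

reference = "i love hiking in the mountains every summer"
paraphrase = "summer trips to the peaks are my favorite thing"   # coherent, on-persona
echo = "i love love hiking hiking the the the"                   # incoherent copy

print(token_f1(paraphrase, reference), token_f1(echo, reference))
```

The incoherent echo outscores the faithful paraphrase, which is precisely why the paper argues for cognitively grounded judges over lexical similarity.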
[162] PA3: Policy-Aware Agent Alignment through Chain-of-Thought
Shubhashis Roy Dipta, Daniel Bis, Kun Zhou, Lichao Wang, Benjamin Z. Yao, Chenlei Guo, Ruhi Sarikaya
Main category: cs.CL
TL;DR: Multi-stage alignment method teaches LLMs to recall and apply relevant business policies during chain-of-thought reasoning without including full policies in-context, improving performance while reducing context length.
Details
Motivation: LLMs struggle with adhering to complex business-specific rules in conversational assistants. Including all policies in context causes high latency, wasted compute, and performance degradation due to the "needle-in-a-haystack" problem in long contexts.
Method: Proposes a multi-stage alignment method that teaches models to recall and apply relevant business policies during chain-of-thought reasoning at inference time without full policy inclusion. Introduces PolicyRecall reward based on Jaccard score and Hallucination Penalty for GRPO training.
Result: Best model outperforms baseline by 16 points and surpasses comparable in-context baselines of similar model size by 3 points, while using 40% fewer words.
Conclusion: The approach enables efficient business policy adherence in LLM-powered conversational assistants by teaching selective policy recall rather than full policy inclusion, addressing latency and performance issues.
Abstract: Conversational assistants powered by large language models (LLMs) excel at tool-use tasks but struggle with adhering to complex, business-specific rules. While models can reason over business rules provided in context, including all policies for every query introduces high latency and wastes compute. Furthermore, these lengthy prompts lead to long contexts, harming overall performance due to the “needle-in-the-haystack” problem. To address these challenges, we propose a multi-stage alignment method that teaches models to recall and apply relevant business policies during chain-of-thought reasoning at inference time, without including the full business policy in-context. Furthermore, we introduce a novel PolicyRecall reward based on the Jaccard score and a Hallucination Penalty for GRPO training. Altogether, our best model outperforms the baseline by 16 points and surpasses comparable in-context baselines of similar model size by 3 points, while using 40% fewer words.
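The paper does not publish the exact form of its PolicyRecall reward, but a Jaccard-based recall score combined with a hallucination penalty could be sketched as follows. All function and parameter names here are hypothetical, and the penalty form is an assumption for illustration only:

```python
def jaccard(a, b):
    """Jaccard similarity between two sets; defined as 1.0 when both are empty."""
    a, b = set(a), set(b)
    if not (a or b):
        return 1.0
    return len(a & b) / len(a | b)

def policy_recall_reward(recalled, relevant, catalog, penalty=0.25):
    """Reward = Jaccard overlap between the policy IDs the model recalled
    and the IDs actually relevant to the query, minus a fixed penalty for
    each recalled ID that does not exist in the policy catalog at all
    (a hallucinated policy)."""
    hallucinated = [p for p in recalled if p not in catalog]
    return jaccard(recalled, relevant) - penalty * len(hallucinated)
```

Under this sketch, a model that recalls exactly the relevant policies scores 1.0, while citing nonexistent policy IDs drives the reward down even when some relevant policies are recovered.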
[163] Seamless Deception: Larger Language Models Are Better Knowledge Concealers
Dhananjay Ashok, Ruth-Ann Armstrong, Jonathan May
Main category: cs.CL
TL;DR: Classifiers can detect when language models are concealing knowledge, but they don’t generalize well across architectures/topics and fail completely on models larger than 70B parameters, exposing limitations in black-box auditing.
Details
Motivation: Language models may acquire harmful knowledge and feign ignorance during audits, so researchers aim to develop methods to detect when LMs are actively concealing knowledge they possess.
Method: Trained classifiers to detect concealment behavior in LMs, comparing gradient-based vs prompt-based concealment methods, and testing generalization across different model architectures and topics.
Result: Classifiers outperformed human evaluators at detecting concealment in smaller models, but failed to generalize to unseen architectures/topics and performed no better than random on models exceeding 70B parameters.
Conclusion: Current black-box auditing methods have critical limitations for detecting knowledge concealment in large LMs, highlighting the need for more robust detection techniques.
Abstract: Language Models (LMs) may acquire harmful knowledge, and yet feign ignorance of these topics when under audit. Inspired by the recent discovery of deception-related behaviour patterns in LMs, we aim to train classifiers that detect when a LM is actively concealing knowledge. Initial findings on smaller models show that classifiers can detect concealment more reliably than human evaluators, with gradient-based concealment proving easier to identify than prompt-based methods. However, contrary to prior work, we find that the classifiers do not reliably generalize to unseen model architectures and topics of hidden knowledge. Most concerningly, the identifiable traces associated with concealment become fainter as the models increase in scale, with the classifiers achieving no better than random performance on any model exceeding 70 billion parameters. Our results expose a key limitation in black-box-only auditing of LMs and highlight the need to develop robust methods to detect models that are actively hiding the knowledge they contain.
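Concealment detectors in this line of work are typically probes trained on model internals. A minimal sketch of one plausible setup, a logistic-regression probe over pooled hidden-state features, is below; the feature extraction, hyperparameters, and all names are illustrative assumptions, not the paper's actual classifier:

```python
import numpy as np

def train_probe(X, y, lr=0.1, steps=500):
    """Fit a logistic-regression probe by gradient descent.
    X: (n_examples, hidden_dim) pooled activations; y: 1 = concealing."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid predictions
        w -= lr * (X.T @ (p - y)) / len(y)        # gradient of log-loss
        b -= lr * float(np.mean(p - y))
    return w, b

def predict(w, b, X):
    """Hard labels from the trained probe."""
    return (1.0 / (1.0 + np.exp(-(X @ w + b))) > 0.5).astype(int)
```

The paper's finding would correspond to such probes separating the two classes well on small models but degrading to chance as the probed model scales past 70B parameters.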
[164] CoVerRL: Breaking the Consensus Trap in Label-Free Reasoning via Generator-Verifier Co-Evolution
Teng Pan, Yuchen Yan, Zixuan Wang, Ruiqing Zhang, Guiyang Hou, Wenqi Zhang, Weiming Lu, Jun Xiao, Yongliang Shen
Main category: cs.CL
TL;DR: CoVerRL: A framework where a single LLM alternates between generator and verifier roles to escape the “consensus trap” in label-free RL, where models reinforce systematic errors as output diversity collapses.
Details
Motivation: Label-free RL for LLMs uses majority-voted answers as pseudo-labels, but suffers from "consensus trap" where maximizing self-consistency causes output diversity collapse, leading models to confidently reinforce systematic errors that evade detection.
Method: CoVerRL framework where a single model alternates between generator and verifier roles. Majority voting provides noisy supervision for training the verifier, while the improving verifier progressively filters self-consistent errors from pseudo-labels, creating a co-evolution cycle.
Result: Outperforms label-free baselines by 4.7-5.9% on mathematical reasoning benchmarks across Qwen and Llama model families. Self-verification accuracy improves from ~55% to over 85%, confirming genuine co-evolution of both capabilities.
Conclusion: CoVerRL successfully escapes the consensus trap by enabling a single model to co-evolve generator and verifier capabilities through alternating roles, maintaining high reward accuracy throughout training while preventing systematic error reinforcement.
Abstract: Label-free reinforcement learning enables large language models to improve reasoning capabilities without ground-truth supervision, typically by treating majority-voted answers as pseudo-labels. However, we identify a critical failure mode: as training maximizes self-consistency, output diversity collapses, causing the model to confidently reinforce systematic errors that evade detection. We term this the consensus trap. To escape it, we propose CoVerRL, a framework where a single model alternates between generator and verifier roles, with each capability bootstrapping the other. Majority voting provides noisy but informative supervision for training the verifier, while the improving verifier progressively filters self-consistent errors from pseudo-labels. This co-evolution creates a virtuous cycle that maintains high reward accuracy throughout training. Experiments across Qwen and Llama model families demonstrate that CoVerRL outperforms label-free baselines by 4.7-5.9% on mathematical reasoning benchmarks. Moreover, self-verification accuracy improves from around 55% to over 85%, confirming that both capabilities genuinely co-evolve.
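The core pseudo-labeling idea, letting a verifier veto a self-consistent but wrong consensus, can be sketched as below. This is a simplified illustration of verifier-filtered majority voting, not CoVerRL's actual training loop; the fallback rule and names are assumptions:

```python
from collections import Counter

def filtered_pseudo_label(samples, verify, min_votes=2):
    """Pick a pseudo-label by majority vote over sampled answers, but let
    a verifier veto it: walk candidates from most- to least-voted and
    return the first one the verifier accepts (None if none qualifies)."""
    for answer, votes in Counter(samples).most_common():
        if votes >= min_votes and verify(answer):
            return answer
    return None
```

Plain majority voting would commit to the most frequent answer regardless of correctness; here a systematically wrong consensus can be rejected, which is the failure mode the paper calls the consensus trap.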
[165] How Psychological Learning Paradigms Shaped and Constrained Artificial Intelligence
Alex Anvi Eponon, Ildar Batyrshin, Christian E. Maldonado-Sifuentes, Grigori Sidorov
Main category: cs.CL
TL;DR: The paper critiques how AI paradigms inherited limitations from their psychological inspirations (behaviorism→RL, cognitivism→deep learning, constructivism→compositional approaches) and proposes ReSynth, a trimodular framework separating reasoning, purpose, and knowledge to address systematicity and adaptability challenges in AGI.
Details
Motivation: Current AI paradigms have inherited structural limitations from their psychological inspirations: RL lacks internal knowledge structure, deep learning has opaque representations resistant to principled updates, and integrative approaches lack formal accounts of constructing new understanding. The paper aims to address these limitations for achieving artificial general intelligence.
Method: Introduces ReSynth, a trimodular framework that architecturally separates reasoning (Intellect), purpose (Identity), and knowledge (Memory) as independent components. The approach draws on cross-cultural perspectives on learning (Eastern conceptions of memorization) and critiques from the systematicity debate, aiming to create an architecture where systematic behavior emerges necessarily rather than accidentally.
Result: The paper presents a theoretical framework that diagnoses inherited limitations from psychological paradigms and proposes an architectural solution. While empirical results aren’t specified, the framework offers a principled approach to address adaptability and systematicity challenges in AGI development.
Conclusion: Adaptability, the central challenge of AGI, requires a representational architecture where systematic behavior is a necessary consequence rather than an accidental property. The ReSynth framework provides a pathway to overcome limitations inherited from psychological paradigms by separating reasoning, purpose, and knowledge components.
Abstract: The dominant paradigms of artificial intelligence were shaped by learning theories from psychology: behaviorism inspired reinforcement learning, cognitivism gave rise to deep learning and memory-augmented architectures, and constructivism influenced curriculum learning and compositional approaches. This paper argues that each AI paradigm inherited not only the strengths but the structural limitations of the psychological theory that inspired it. Reinforcement learning cannot account for the internal structure of knowledge, deep learning compresses representations into opaque parameter spaces resistant to principled update, and current integrative approaches lack a formal account of how new understanding is constructed from existing components. The paper further examines a cross-cultural divergence in the interpretation of rote learning, arguing that the Eastern conception of memorization as a structured, multi-phase precursor to understanding offers an underexploited bridge between psychological theory and AI methodology. Drawing on the systematicity debate and critique of Aizawa of both classicism and connectionism, this paper introduces ReSynth, a trimodular framework that separates reasoning (Intellect), purpose (Identity), and knowledge (Memory) as architecturally independent components. The paper traces the genealogy from psychological paradigm to AI method, diagnoses the inherited limitations at each stage, and argues that adaptability, the central challenge of artificial general intelligence requires a representational architecture in which systematic behavior is a necessary consequence rather than an accidental property.
[166] Evaluating LLM-Generated Lessons from the Language Learning Students’ Perspective: A Short Case Study on Duolingo
Carlos Rafael Catalan, Patricia Nicole Monderin, Lheane Marie Dizon, Gap Estrella, Raymund John Sarmimento, Marie Antoinette Patalagsa
Main category: cs.CL
TL;DR: Language learning apps like Duolingo use LLMs for lessons but lack profession-specific content, hindering professional fluency. Survey of Filipino employees shows general scenarios are effective for basics but domain-specific content is needed for professional communication. Recommendation: personalized, domain-specific lessons while maintaining foundational general scenarios.
Details
Motivation: Current language learning applications using LLMs primarily focus on general real-world scenarios, creating a gap in supporting professional-level fluency. Professional fluency requires domain-specific vocabulary and work-related communication skills that existing apps don't adequately address.
Method: Surveyed five employees from a multinational company in the Philippines about their experiences with Duolingo. Analyzed frequency of encountering general vs. work-related scenarios, effectiveness of different lesson types, and collected suggestions for improvement.
Result: Respondents encountered general scenarios more frequently than work-related ones. General scenarios were relatable and effective for building foundational grammar, vocabulary, and cultural knowledge. Work-related scenarios help bridge the gap toward professional fluency with domain-specific vocabulary. Participants suggested diverse lesson scenarios when analyzed collectively.
Conclusion: Language learning applications should generate personalized, domain-specific lesson scenarios that adapt to individual needs while maintaining foundational support through general, relatable lesson scenarios to achieve professional fluency.
Abstract: Popular language learning applications such as Duolingo use large language models (LLMs) to generate lessons for its users. Most lessons focus on general real-world scenarios such as greetings, ordering food, or asking directions, with limited support for profession-specific contexts. This gap can hinder learners from achieving professional-level fluency, which we define as the ability to communicate comfortably various work-related and domain-specific information in the target language. We surveyed five employees from a multinational company in the Philippines on their experiences with Duolingo. Results show that respondents encountered general scenarios more frequently than work-related ones, and that the former are relatable and effective in building foundational grammar, vocabulary, and cultural knowledge. The latter helps bridge the gap toward professional fluency as it contains domain-specific vocabulary. Each participant suggested lesson scenarios that diverge in contexts when analyzed in aggregate. With this understanding, we propose that language learning applications should generate lessons that adapt to an individual’s needs through personalized, domain specific lesson scenarios while maintaining foundational support through general, relatable lesson scenarios.
[167] Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation
Zhuolin Yang, Zihan Liu, Yang Chen, Wenliang Dai, Boxin Wang, Sheng-Chieh Lin, Chankyu Lee, Yangyi Chen, Dongfu Jiang, Jiafan He, Renjie Pi, Grace Lam, Nayeon Lee, Alexander Bukharin, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping
Main category: cs.CL
TL;DR: Nemotron-Cascade 2 is a 30B MoE model with 3B activated parameters that achieves state-of-the-art reasoning and agentic capabilities, approaching frontier model performance in mathematics and coding with significantly fewer parameters.
Details
Motivation: To create an efficient, high-performance open-weight language model that delivers best-in-class reasoning and agentic capabilities while maintaining compact size, addressing the need for intelligent models with high parameter efficiency.
Method: Uses Mixture of Experts (MoE) architecture with 30B total parameters but only 3B activated. After supervised fine-tuning on curated data, employs expanded Cascade RL covering broader reasoning and agentic domains, plus multi-domain on-policy distillation from strongest intermediate teacher models throughout training.
Result: Achieves Gold Medal-level performance in IMO, IOI, and ICPC World Finals - only the second open-weight LLM to do so. Delivers reasoning performance approaching frontier models with 20x fewer parameters, demonstrating high intelligence density.
Conclusion: Nemotron-Cascade 2 shows that compact models can achieve elite reasoning capabilities through careful architecture design and training techniques, making high-performance AI more accessible through open-weight release.
Abstract: We introduce Nemotron-Cascade 2, an open 30B MoE model with 3B activated parameters that delivers best-in-class reasoning and strong agentic capabilities. Despite its compact size, its mathematical and coding reasoning performance approaches that of frontier open models. It is the second open-weight LLM, after DeepSeekV3.2-Speciale-671B-A37B, to achieve Gold Medal-level performance in the 2025 International Mathematical Olympiad (IMO), the International Olympiad in Informatics (IOI), and the ICPC World Finals, demonstrating remarkably high intelligence density with 20x fewer parameters. In contrast to Nemotron-Cascade 1, the key technical advancements are as follows. After SFT on a meticulously curated dataset, we substantially expand Cascade RL to cover a much broader spectrum of reasoning and agentic domains. Furthermore, we introduce multi-domain on-policy distillation from the strongest intermediate teacher models for each domain throughout the Cascade RL process, allowing us to efficiently recover benchmark regressions and sustain strong performance gains along the way. We release the collection of model checkpoints and training data.
[168] Automatic Analysis of Collaboration Through Human Conversational Data Resources: A Review
Yi Yu, Maria Boritchev, Chloé Clavel
Main category: cs.CL
TL;DR: A review paper on using task-oriented human-human conversational data for collaboration analysis, covering theories, coding schemes, tasks, and modeling approaches.
Details
Motivation: Collaboration is a fundamental high-level human behavior where conversation serves as the primary medium for information exchange. The authors aim to understand how to utilize task-oriented conversational data for analyzing collaborative processes, given its value as a resource for automatic collaboration analysis.
Method: The paper conducts a comprehensive review of existing literature on collaboration analysis using task-oriented conversation resources. It systematically examines related theories, coding schemes, tasks, and modeling approaches in the field.
Result: The review provides a practical resource for researchers working on collaboration analysis and identifies unexplored areas for future research in this domain.
Conclusion: Task-oriented conversational data is a valuable resource for collaboration analysis, and the review serves as both a practical guide and a roadmap for future research directions in this area.
Abstract: Collaboration is a task-oriented, high-level human behavior. In most cases, conversation serves as the primary medium for information exchange and coordination, making conversational data a valuable resource for the automatic analysis of collaborative processes. In this paper, we focus on verbal aspects of collaboration and conduct a review of collaboration analysis using task-oriented conversation resources, encompassing related theories, coding schemes, tasks, and modeling approaches. We aim to address the question of how to utilize task-oriented human-human conversational data for collaboration analysis. We hope our review will serve as a practical resource and illuminate unexplored areas for future collaboration analysis.
[169] Scalable Prompt Routing via Fine-Grained Latent Task Discovery
Yunyi Zhang, Soji Adeshina, Sheng Guan, Ashwin Ganesh, Zhen Han, Vassilis N. Ioannidis, Huzefa Rangwala, George Karypis
Main category: cs.CL
TL;DR: A two-stage prompt routing system that uses automated task discovery and task-aware quality estimation to dynamically select the most appropriate LLM from a pool of frontier models, achieving better performance than the strongest individual model at less than half the cost.
Details
Motivation: As model pools scale with dozens of frontier models having narrow performance gaps, existing routing approaches fail: manual task taxonomies can't capture fine-grained capability distinctions, and monolithic routers struggle to differentiate subtle differences across diverse tasks.
Method: Two-stage architecture: 1) Graph-based clustering discovers latent task types and trains a classifier for prompt assignment; 2) Mixture-of-experts with task-specific prediction heads for specialized quality estimation. Inference aggregates both stages to balance task-level stability with prompt-specific adaptability.
Result: Evaluated on 10 benchmarks with 11 frontier models, the method consistently outperforms existing baselines and surpasses the strongest individual model while incurring less than half its cost.
Conclusion: The proposed two-stage routing architecture effectively addresses limitations of existing approaches by enabling automated fine-grained task discovery and task-aware quality estimation for optimal model selection in large model pools.
Abstract: Prompt routing dynamically selects the most appropriate large language model from a pool of candidates for each query, optimizing performance while managing costs. As model pools scale to include dozens of frontier models with narrow performance gaps, existing approaches face significant challenges: manually defined task taxonomies cannot capture fine-grained capability distinctions, while monolithic routers struggle to differentiate subtle differences across diverse tasks. We propose a two-stage routing architecture that addresses these limitations through automated fine-grained task discovery and task-aware quality estimation. Our first stage employs graph-based clustering to discover latent task types and trains a classifier to assign prompts to discovered tasks. The second stage uses a mixture-of-experts architecture with task-specific prediction heads for specialized quality estimates. At inference, we aggregate predictions from both stages to balance task-level stability with prompt-specific adaptability. Evaluated on 10 benchmarks with 11 frontier models, our method consistently outperforms existing baselines and surpasses the strongest individual model while incurring less than half its cost.
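The inference-time aggregation could work roughly as follows: blend a task-level quality prior with the prompt-specific estimate, then trade quality off against cost. The blending weights, the linear cost term, and all names are hypothetical; the paper does not specify its exact aggregation rule:

```python
def route(prompt_scores, task_prior, costs, alpha=0.5, lam=0.01):
    """Pick the model maximizing blended quality minus a cost term.
    prompt_scores: per-model quality from the prompt-specific expert head;
    task_prior: per-model quality at the discovered-task level;
    alpha: weight on the (more stable) task-level prior;
    lam: cost sensitivity per unit of cost."""
    best, best_util = None, float("-inf")
    for model, p_score in prompt_scores.items():
        quality = alpha * task_prior[model] + (1 - alpha) * p_score
        util = quality - lam * costs[model]
        if util > best_util:
            best, best_util = model, util
    return best
```

A cheap model that is nearly as good on the discovered task can then win the route even when a frontier model has the higher raw quality estimate, which is how a router can beat the strongest single model on cost.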
[170] Current LLMs still cannot ’talk much’ about grammar modules: Evidence from syntax
Mohammed Q. Shormani
Main category: cs.CL
TL;DR: LLMs like ChatGPT struggle with accurate translation of technical syntax terminology from English to Arabic, achieving only 25% accuracy on core generative syntax terms.
Details
Motivation: To evaluate how well Large Language Models can handle specialized linguistic terminology translation, specifically examining whether LLMs can accurately translate core syntax properties and technical terms from generative syntax literature.
Method: Collected 44 technical terms from generative syntax literature, had them translated by both human experts and ChatGPT-5, then conducted comparative analysis using analytical and comparative approaches to evaluate translation accuracy.
Result: Only 25% of ChatGPT translations were accurate, 38.6% were inaccurate, and 36.4% were partially correct. The study reveals significant syntactic and semantic challenges in LLM translation of technical linguistic terminology.
Conclusion: LLMs still cannot effectively handle specialized linguistic translation tasks, requiring closer collaboration between AI specialists and linguists to improve translation mechanisms for technical terminology.
Abstract: We aim to examine the extent to which Large Language Models (LLMs) can ’talk much’ about grammar modules, providing evidence from syntax core properties translated by ChatGPT into Arabic. We collected 44 terms from generative syntax previous works, including books and journal articles, as well as from our experience in the field. These terms were translated by humans, and then by ChatGPT-5. We then analyzed and compared both translations. We used an analytical and comparative approach in our analysis. Findings unveil that LLMs still cannot ’talk much’ about the core syntax properties embedded in the terms under study involving several syntactic and semantic challenges: only 25% of ChatGPT translations were accurate, while 38.6% were inaccurate, and 36.4% were partially correct, which we consider appropriate. Based on these findings, a set of actionable strategies were proposed, the most notable of which is a close collaboration between AI specialists and linguists to better LLMs’ working mechanism for accurate or at least appropriate translation.
[171] Conflict-Aware Fusion: Mitigating Logic Inertia in Large Language Models via Structured Cognitive Priors
Qiming Bao, Xiaoxuan Fu, Michael Witbrock
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2512.06393 returned HTTP 429 (rate limited).
[172] Collusive Pricing Under LLM
Shengyu Cao, Ming Hu
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2601.01279 returned HTTP 429 (rate limited).
[173] APEX-SWE
Abhi Kottamasu, Chirag Mahapatra, Sam Lee, Ben Pan, Aakash Barthwal, Akul Datta, Anurag Gupta, Pranav Mehta, Ajay Arun, Silas Alberti, Adarsh Hiremath, Brendan Foody, Bertie Vidgen
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2601.08806 returned HTTP 429 (rate limited).
[174] C$^2$-Cite: Contextual-Aware Citation Generation for Attributed Large Language Models
Yue Yu, Ting Bai, HengZhi Lan, Li Qian, Li Peng, Jie Wu, Wei Liu, Jian Luan, Chuan Shi
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2602.00004 returned HTTP 429 (rate limited).
[175] Measuring Complexity at the Requirements Stage: Spectral Metrics as Development Effort Predictors
Maximilian Vierlboeck, Antonio Pugliese, Roshanak Nilchian, Paul Grogan, Rashika Sugganahalli Natesh Babu
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2602.07182 returned HTTP 429 (rate limited).
[176] MHPO: Modulated Hazard-aware Policy Optimization for Stable Reinforcement Learning
Hongjun Wang, Wei Liu, Weibo Gu, Xing Sun, Kai Han
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2603.16929 returned HTTP 429 (rate limited).
[177] FinTradeBench: A Financial Reasoning Benchmark for LLMs
Yogesh Agrawal, Aniruddha Dutta, Md Mahadi Hasan, Santu Karmaker, Aritra Dutta
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2603.19225 returned HTTP 429 (rate limited).
cs.CV
[178] Efficient AI-Driven Multi-Section Whole Slide Image Analysis for Biochemical Recurrence Prediction in Prostate Cancer
Yesung Cho, Dongmyung Shin, Sujeong Hong, Jooyeon Lee, Seongmin Park, Geongyu Lee, Jongbae Park, Hong Koo Ha
Main category: cs.CV
TL;DR: AI framework for prostate cancer BCR prediction using multi-section pathology slides, outperforming clinical benchmarks with validated prognostic value.
Details
Motivation: Precise prediction of biochemical recurrence (BCR) after radical prostatectomy is challenging due to prostate cancer's multifocality and tumor distribution throughout the gland.
Method: Novel AI framework that simultaneously processes multi-section pathology slides to capture comprehensive tumor landscape across entire prostate gland, using patch and slide sub-sampling strategies to reduce computational cost.
Result: Strong predictive performance for 1- and 2-year BCR prediction, substantially outperforming clinical benchmarks; AI-derived risk score validated as most potent independent prognostic factor; computational cost reduced without compromising performance; external validation confirmed generalizability.
Conclusion: The AI-based multi-section slide analysis demonstrates clinical feasibility and prognostic value as a scalable tool for post-operative management in prostate cancer.
Abstract: Prostate cancer is one of the most frequently diagnosed malignancies in men worldwide. However, precise prediction of biochemical recurrence (BCR) after radical prostatectomy remains challenging due to the multifocality of tumors distributed throughout the prostate gland. In this paper, we propose a novel AI framework that simultaneously processes a series of multi-section pathology slides to capture the comprehensive tumor landscape across the entire prostate gland. To develop this predictive AI model, we curated a large-scale dataset of 23,451 slides from 789 patients. The proposed framework demonstrated strong predictive performance for 1- and 2-year BCR prediction, substantially outperforming established clinical benchmarks. The AI-derived risk score was validated as the most potent independent prognostic factor in a multivariable Cox proportional hazards analysis, surpassing conventional clinical markers such as pre-operative PSA and Gleason score. Furthermore, we demonstrated that integrating patch and slide sub-sampling strategies significantly reduces computational cost during both training and inference without compromising predictive performance, and generalizability of AI was confirmed through external validation. Collectively, these results highlight the clinical feasibility and prognostic value of the proposed AI-based multi-section slide analysis as a scalable tool for post-operative management in prostate cancer.
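The slide and patch sub-sampling strategy that bounds compute per case could be sketched as simple random sub-sampling at both levels; the specific sampling scheme and all names below are illustrative assumptions, not the paper's implementation:

```python
import random

def subsample_case(slides, slides_per_case, patches_per_slide, seed=0):
    """Bound per-case compute by randomly keeping at most
    `slides_per_case` slides, and at most `patches_per_slide`
    patches from each kept slide. `slides` is a list of
    per-slide patch lists."""
    rng = random.Random(seed)
    kept = rng.sample(slides, min(slides_per_case, len(slides)))
    return [rng.sample(p, min(patches_per_slide, len(p))) for p in kept]
```

Since the cost of a whole-slide-image model scales with the number of patches processed, capping both levels turns an unbounded per-patient cost (tens of slides, each with thousands of patches) into a fixed budget per training or inference step.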
[179] Understanding Pruning Regimes in Vision-Language Models Through Domain-Aware Layer Selection
Saeed Khaki, Nima Safaei, Kamal Ginotra
Main category: cs.CV
TL;DR: Transformer VLMs have depth redundancy; domain-aware pruning based on activation similarity reveals three pruning regimes and enables targeted layer removal without sacrificing math or general multimodal capabilities.
Details
Motivation: Vision-language models contain substantial depth redundancy, but the effect of removing specific decoder layers is poorly understood, especially for domains requiring tight coupling between perception and multi-step reasoning like mathematics.
Method: Study structured decoder layer pruning through domain-aware activation similarity, measuring how strongly each layer transforms representations for math vs non-math inputs. Develop math-aware, non-math-aware, and mixed ranking criteria to identify layers with least activation change within target domains.
Result: Across two state-of-the-art VLMs and math/general multimodal benchmarks, discover consistent three-regime structure: low pruning budgets show high sensitivity to which layers are removed; moderate budgets see method convergence as structural damage accumulates; high budgets favor spacing-aware strategies. Domain-aware rankings achieve strongest stability in ranking-sensitive regime while matching/exceeding structure-aware baselines at larger budgets.
Conclusion: Results provide clearer picture of how depth contributes to domain-specific behavior in VLMs and offer practical, interpretable approach to reducing model depth without sacrificing essential mathematical or general vision-language capabilities.
Abstract: Transformer-based vision-language models (VLMs) contain substantial depth redundancy, yet the effect of removing specific decoder layers remains poorly understood, especially for domains that require tight coupling between perception and multi-step reasoning. We study structured decoder layer pruning through the lens of domain-aware activation similarity, measuring how strongly each layer transforms representations for math versus non-math inputs. This yields simple math-aware, non-math-aware, and mixed ranking criteria that identify layers whose input-output activations change least within a target domain. Across two state-of-the-art VLMs and a broad suite of math and general multimodal benchmarks, we uncover a consistent three-regime structure: at low pruning budgets, performance is highly sensitive to which layers are removed; at moderate budgets, methods converge as structural damage accumulates; and at high budgets, structural continuity dominates, favoring spacing-aware strategies. Our domain-aware rankings achieve the strongest stability in the ranking-sensitive regime, while matching or exceeding structure-aware baselines at larger budgets. These results provide a clearer picture of how depth contributes to domain-specific behavior in VLMs and offer a practical, interpretable approach to reducing model depth without sacrificing essential mathematical or general vision-language capabilities.
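The ranking criterion above can be sketched in a few lines. This is an illustrative reconstruction, assuming cosine similarity between a layer's input and output activations as the similarity measure; the function name and data layout are hypothetical.

```python
import numpy as np

def rank_layers_by_similarity(inputs, outputs):
    """Rank layers by mean cosine similarity between their input and
    output activations on a target domain; higher similarity means the
    layer changes representations least, making it a pruning candidate.

    inputs/outputs: lists of (n_tokens, d) arrays, one pair per layer.
    """
    scores = []
    for x, y in zip(inputs, outputs):
        num = np.sum(x * y, axis=1)
        den = np.linalg.norm(x, axis=1) * np.linalg.norm(y, axis=1) + 1e-8
        scores.append(float(np.mean(num / den)))
    # layer indices sorted by descending similarity (prune these first)
    return sorted(range(len(scores)), key=lambda i: -scores[i])

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))
acts_in = [x, x, x]
acts_out = [x + 0.01 * rng.normal(size=x.shape),  # near-identity layer
            x + 1.0 * rng.normal(size=x.shape),   # strong transform
            x + 0.1 * rng.normal(size=x.shape)]   # mild transform
print(rank_layers_by_similarity(acts_in, acts_out))  # [0, 2, 1]
```

Feeding math-only versus non-math-only activations into the same ranking yields the domain-aware criteria the paper compares.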
[180] Mix-and-Match Pruning: Globally Guided Layer-Wise Sparsification of DNNs
Danial Monachan, Samira Nazari, Mahdi Taheri, Ali Azarpeyvand, Milos Krstic, Michael Huebner, Christian Herglotz
Main category: cs.CV
TL;DR: Mix-and-Match Pruning: A globally guided, layer-wise sparsification framework that uses sensitivity scores and architectural rules to generate diverse pruning configurations for efficient DNN deployment on edge devices.
Details
Motivation: Different layers and architectures respond differently to pruning, making single-strategy approaches suboptimal. There’s a need for efficient compression with minimal accuracy loss for edge deployment of DNNs.
Method: Uses sensitivity scores (magnitude, gradient, or combination) and architectural rules to derive architecture-aware sparsity ranges. Systematically samples these ranges to produce ten strategies per sensitivity signal, eliminating repeated pruning runs.
Result: Demonstrates Pareto-optimal results on CNNs and Vision Transformers, reducing accuracy degradation on Swin-Tiny by 40% relative to standard single-criterion pruning.
Conclusion: Coordinating existing pruning signals enables more reliable and efficient compressed models than introducing new criteria, offering deployment-ready accuracy-sparsity trade-offs.
Abstract: Deploying deep neural networks (DNNs) on edge devices requires strong compression with minimal accuracy loss. This paper introduces Mix-and-Match Pruning, a globally guided, layer-wise sparsification framework that leverages sensitivity scores and simple architectural rules to generate diverse, high-quality pruning configurations. The framework addresses a key limitation that different layers and architectures respond differently to pruning, making single-strategy approaches suboptimal. Mix-and-Match derives architecture-aware sparsity ranges, e.g., preserving normalization layers while pruning classifiers more aggressively, and systematically samples these ranges to produce ten strategies per sensitivity signal (magnitude, gradient, or their combination). This eliminates repeated pruning runs while offering deployment-ready accuracy-sparsity trade-offs. Experiments on CNNs and Vision Transformers demonstrate Pareto-optimal results, with Mix-and-Match reducing accuracy degradation on Swin-Tiny by 40% relative to standard single-criterion pruning. These findings show that coordinating existing pruning signals enables more reliable and efficient compressed models than introducing new criteria.
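The range-then-sample idea can be sketched as follows; the layer types and sparsity ranges below are hypothetical stand-ins for the paper's architecture-aware rules (which, for example, preserve normalization layers while pruning classifiers more aggressively).

```python
import numpy as np

# Hypothetical architecture-aware sparsity ranges per layer type:
# normalization layers are preserved, classifier heads pruned hardest.
SPARSITY_RANGES = {
    "norm": (0.0, 0.0),
    "attention": (0.2, 0.6),
    "mlp": (0.3, 0.7),
    "classifier": (0.5, 0.9),
}

def sample_strategies(layer_types, n_strategies, seed=0):
    """Sample per-layer sparsity configurations from the ranges above,
    producing a diverse batch of strategies without repeated pruning runs."""
    rng = np.random.default_rng(seed)
    strategies = []
    for _ in range(n_strategies):
        cfg = {}
        for name, kind in layer_types.items():
            lo, hi = SPARSITY_RANGES[kind]
            cfg[name] = float(rng.uniform(lo, hi))
        strategies.append(cfg)
    return strategies

layers = {"ln1": "norm", "attn1": "attention", "fc1": "mlp", "head": "classifier"}
strategies = sample_strategies(layers, n_strategies=10)
print(len(strategies))       # 10
print(strategies[0]["ln1"])  # 0.0 (norm layers never pruned)
```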
[181] STAC: Plug-and-Play Spatio-Temporal Aware Cache Compression for Streaming 3D Reconstruction
Runze Wang, Yuxuan Song, Youcheng Cai, Ligang Liu
Main category: cs.CV
TL;DR: STAC: A cache compression framework for streaming 3D reconstruction that reduces memory usage 10x and speeds inference 4x by exploiting spatio-temporal sparsity in attention mechanisms.
Details
Motivation: Streaming 3D reconstruction requires long-term temporal consistency and efficient memory, but current causal transformers with KV cache grow linearly with stream length, creating memory bottlenecks that degrade reconstruction quality when cache eviction occurs under limited memory budgets.
Method: STAC framework with three components: (1) Working Temporal Token Caching using decayed cumulative attention scores to preserve long-term informative tokens; (2) Long-term Spatial Token Caching that compresses spatially redundant tokens into voxel-aligned representations; (3) Chunk-based Multi-frame Optimization for joint processing of consecutive frames to improve temporal coherence and GPU efficiency.
Result: Achieves state-of-the-art reconstruction quality while reducing memory consumption by nearly 10x and accelerating inference by 4x, substantially improving scalability for real-time 3D reconstruction in streaming settings.
Conclusion: STAC effectively addresses memory bottlenecks in streaming 3D reconstruction by exploiting intrinsic spatio-temporal sparsity in attention mechanisms, enabling efficient and scalable real-time 3D reconstruction.
Abstract: Online 3D reconstruction from streaming inputs requires both long-term temporal consistency and efficient memory usage. Although causal VGGT transformers address this challenge through a key-value (KV) cache mechanism, the cache grows linearly with the stream length, creating a major memory bottleneck. Under limited memory budgets, early cache eviction significantly degrades reconstruction quality and temporal consistency. In this work, we observe that attention in causal transformers for 3D reconstruction exhibits intrinsic spatio-temporal sparsity. Based on this insight, we propose STAC, a Spatio-Temporally Aware Cache Compression framework for streaming 3D reconstruction with large causal transformers. STAC consists of three key components: (1) a Working Temporal Token Caching mechanism that preserves long-term informative tokens using decayed cumulative attention scores; (2) a Long-term Spatial Token Caching scheme that compresses spatially redundant tokens into voxel-aligned representations for memory-efficient storage; and (3) a Chunk-based Multi-frame Optimization strategy that jointly processes consecutive frames to improve temporal coherence and GPU efficiency. Extensive experiments show that STAC achieves state-of-the-art reconstruction quality while reducing memory consumption by nearly 10x and accelerating inference by 4x, substantially improving the scalability of real-time 3D reconstruction in streaming settings.
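The eviction rule behind the Working Temporal Token Caching component can be sketched roughly as below; the exact scoring function and decay schedule are assumptions for illustration.

```python
import numpy as np

def evict_by_decayed_attention(attn_history, budget, decay=0.9):
    """Keep the `budget` cached tokens with the highest decayed
    cumulative attention score, a sketch of attention-guided KV cache
    eviction under a fixed memory budget.

    attn_history: (n_steps, n_tokens) attention mass each cached token
    received at each past generation step (oldest step first).
    """
    n_steps, n_tokens = attn_history.shape
    # older observations are down-weighted exponentially
    weights = decay ** np.arange(n_steps - 1, -1, -1)
    scores = weights @ attn_history
    return np.sort(np.argsort(scores)[-budget:])

attn = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.1, 0.8],
                 [0.1, 0.2, 0.7]])
print(evict_by_decayed_attention(attn, budget=2))  # [0 2]
```

Token 1 consistently receives the least attention, so it is evicted first even though token 0 was only important early in the stream.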
[182] UniAnimate-DiT: Human Image Animation with Large-Scale Video Diffusion Transformer
Xiang Wang, Shiwei Zhang, Longxiang Tang, Yingya Zhang, Changxin Gao, Yuehuan Wang, Nong Sang
Main category: cs.CV
TL;DR: UniAnimate-DiT is a human image animation system that fine-tunes the Wan2.1 model using LoRA and a lightweight pose encoder to generate high-fidelity, temporally consistent animations that can upscale from 480p to 720p.
Details
Motivation: To create consistent human image animation while preserving the robust generative capabilities of existing diffusion models, with efficient parameter tuning and enhanced pose alignment.
Method: Uses Low-Rank Adaptation (LoRA) to fine-tune minimal parameters of Wan2.1 model, designs lightweight pose encoder with stacked 3D convolutional layers, and integrates reference appearance via concatenation with pose information for better alignment.
Result: Achieves visually appealing and temporally consistent high-fidelity animations with strong generalization capabilities, successfully upscaling from 480p training to 720p inference.
Conclusion: UniAnimate-DiT demonstrates effective human image animation with efficient parameter tuning and good generalization, making it a practical solution for high-quality video generation.
Abstract: This report presents UniAnimate-DiT, an advanced project that leverages the cutting-edge and powerful capabilities of the open-source Wan2.1 model for consistent human image animation. Specifically, to preserve the robust generative capabilities of the original Wan2.1 model, we implement the Low-Rank Adaptation (LoRA) technique to fine-tune a minimal set of parameters, significantly reducing training memory overhead. A lightweight pose encoder consisting of multiple stacked 3D convolutional layers is designed to encode the motion information of driving poses. Furthermore, we adopt a simple concatenation operation to integrate the reference appearance into the model and incorporate the pose information of the reference image for enhanced pose alignment. Experimental results show that our approach achieves visually appealing and temporally consistent high-fidelity animations. Trained on 480p (832x480) videos, UniAnimate-DiT demonstrates strong generalization capabilities, seamlessly upscaling to 720p (1280x720) during inference. The training and inference code is publicly available at https://github.com/ali-vilab/UniAnimate-DiT.
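LoRA itself, the parameter-efficient mechanism used here, amounts to adding a trainable low-rank product to each frozen weight matrix; a minimal numpy sketch (dimensions and initialization are illustrative, not Wan2.1's actual configuration):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=1.0):
    """LoRA forward pass: frozen weight W plus low-rank update B @ A.

    W: (d_out, d_in), frozen; A: (r, d_in) and B: (d_out, r), trainable.
    Only r * (d_in + d_out) parameters are tuned per adapted matrix."""
    return x @ (W + alpha * (B @ A)).T

rng = np.random.default_rng(0)
d_in, d_out, r = 64, 64, 4
W = rng.normal(size=(d_out, d_in))       # pretrained, frozen
A = rng.normal(size=(r, d_in)) * 0.01    # small random init
B = np.zeros((d_out, r))                 # zero init: update starts as a no-op
x = rng.normal(size=(2, d_in))
assert np.allclose(lora_forward(x, W, A, B), x @ W.T)  # identity at init
```

With r=4 the trainable update has 512 parameters versus 4096 in W, which is why LoRA keeps training memory overhead low.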
[183] Efficient Visual Anomaly Detection at the Edge: Enabling Real-Time Industrial Inspection on Resource-Constrained Devices
Arianna Stropeni, Fabrizio Genilotti, Francesco Borsatti, Manuel Barusco, Davide Dalle Pezze, Gian Antonio Susto
Main category: cs.CV
TL;DR: Efficient visual anomaly detection methods (PatchCore-Lite and Padim-Lite) optimized for edge deployment with reduced memory footprint and faster inference while maintaining performance on industrial defect detection benchmarks.
Details
Motivation: Real production lines require visual anomaly detection systems that meet strict real-time and privacy requirements, necessitating edge deployment. However, edge devices have limited memory and computational resources, creating challenges for existing VAD methods.
Method: Two efficient VAD methods: PatchCore-Lite uses product-quantized memory bank with coarse search followed by exact search on decoded subset. Padim-Lite uses diagonal covariance to convert Mahalanobis distance into efficient element-wise computation.
Result: PatchCore-Lite achieves 79% reduction in total memory footprint. Padim-Lite achieves 77% reduction in total memory and 31% decrease in inference time. Both maintain effectiveness on MVTec AD and VisA benchmarks.
Conclusion: VAD can be effectively deployed on edge devices, enabling real-time, private, and cost-efficient industrial inspection through efficient algorithm design that reduces computational and memory requirements.
Abstract: Visual Anomaly Detection (VAD) is essential for industrial quality control, enabling automatic defect detection in manufacturing. In real production lines, VAD systems must satisfy strict real-time and privacy requirements, necessitating a shift from cloud-based processing to local edge deployment. However, processing data locally on edge devices introduces new challenges because edge hardware has limited memory and computational resources. To overcome these limitations, we propose two efficient VAD methods designed for edge deployment: PatchCore-Lite and Padim-Lite, based on the popular PatchCore and PaDiM models. PatchCore-Lite first runs a coarse search on a product-quantized memory bank, then an exact search on a decoded subset. Padim-Lite is sped up using diagonal covariance, turning Mahalanobis distance into efficient element-wise computation. We evaluate our methods on the MVTec AD and VisA benchmarks and show their suitability for edge environments. PatchCore-Lite achieves a remarkable 79% reduction in total memory footprint, while PaDiM-Lite achieves substantial efficiency gains with a 77% reduction in total memory and a 31% decrease in inference time. These results show that VAD can be effectively deployed on edge devices, enabling real-time, private, and cost-efficient industrial inspection.
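The Padim-Lite speed-up rests on a standard identity: with a diagonal covariance, the matrix inverse inside the Mahalanobis distance collapses into element-wise division. A minimal check:

```python
import numpy as np

def mahalanobis_full(x, mu, cov):
    """Standard Mahalanobis distance with a full covariance matrix."""
    d = x - mu
    return np.sqrt(d @ np.linalg.inv(cov) @ d)

def mahalanobis_diag(x, mu, var):
    """Padim-Lite-style distance: a diagonal covariance reduces the
    matrix inverse to element-wise division by per-dimension variances."""
    return np.sqrt(np.sum((x - mu) ** 2 / var))

rng = np.random.default_rng(0)
mu = rng.normal(size=8)
var = rng.uniform(0.5, 2.0, size=8)
x = rng.normal(size=8)
full = mahalanobis_full(x, mu, np.diag(var))
diag = mahalanobis_diag(x, mu, var)
assert np.allclose(full, diag)  # identical when covariance is diagonal
```

The diagonal version avoids storing and inverting a d x d matrix per patch position, which is where the memory and latency savings come from.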
[184] EARTalking: End-to-end GPT-style Autoregressive Talking Head Synthesis with Frame-wise Control
Yuzhe Weng, Haotian Wang, Yuanhong Yu, Jun Du, Shan He, Xiaoyan Wu, Haoran Xu
Main category: cs.CV
TL;DR: EARTalking: An end-to-end GPT-style autoregressive model for interactive audio-driven talking head generation with frame-by-frame streaming control
Details
Motivation: Existing AR-based methods rely on intermediate facial representations limiting expressiveness, while diffusion methods generate clip-by-clip with latency and lack fine-grained control. Need for interactive, streaming generation with identity consistency and diverse control signals.
Method: Proposes EARTalking with novel frame-by-frame in-context audio-driven streaming generation. Introduces Sink Frame Window Attention (SFA) for variable-length video generation with identity consistency, and streaming Frame Condition In-Context (FCIC) scheme for injecting diverse control signals in streaming manner.
Result: Outperforms existing autoregressive methods and achieves performance comparable to diffusion-based methods. Demonstrates feasibility of in-context streaming autoregressive control for flexible, efficient generation.
Conclusion: EARTalking unlocks scalable direction for flexible, efficient audio-driven talking head generation with interactive control at every frame and arbitrary moments.
Abstract: Audio-driven talking head generation aims to create vivid and realistic videos from a static portrait and speech. Existing AR-based methods rely on intermediate facial representations, which limit their expressiveness and realism. Meanwhile, diffusion-based methods generate clip-by-clip, lacking fine-grained control and causing inherent latency due to overall denoising across the window. To address these limitations, we propose EARTalking, a novel end-to-end, GPT-style autoregressive model for interactive audio-driven talking head generation. Our method introduces a novel frame-by-frame, in-context, audio-driven streaming generation paradigm. For inherently supporting variable-length video generation with identity consistency, we propose the Sink Frame Window Attention (SFA) mechanism. Furthermore, to avoid the complex, separate networks that prior works required for diverse control signals, we propose a streaming Frame Condition In-Context (FCIC) scheme. This scheme efficiently injects diverse control signals in a streaming, in-context manner, enabling interactive control at every frame and at arbitrary moments. Experiments demonstrate that EARTalking outperforms existing autoregressive methods and achieves performance comparable to diffusion-based methods. Our work demonstrates the feasibility of in-context streaming autoregressive control, unlocking a scalable direction for flexible, efficient generation. The code will be released for reproducibility.
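The Sink Frame Window Attention mechanism is not specified in detail here; a plausible sketch, assuming it combines a persistent "sink" reference frame with a sliding window over recent frames (in the spirit of attention sinks), so memory stays bounded for variable-length generation:

```python
import numpy as np

def sink_window_mask(n_frames, window, n_sink=1):
    """Causal attention mask: each frame attends to the first `n_sink`
    (identity-anchoring) frames plus the most recent `window` frames.
    A hypothetical reconstruction of SFA's attention pattern."""
    mask = np.zeros((n_frames, n_frames), dtype=bool)
    for q in range(n_frames):
        mask[q, :min(n_sink, q + 1)] = True            # sink frame(s)
        mask[q, max(0, q - window + 1): q + 1] = True  # sliding window
    return mask

m = sink_window_mask(n_frames=6, window=3)
print(m.astype(int))
```

Under this pattern the last frame still attends to frame 0 (the identity anchor) even though frames 1 and 2 have slid out of the window.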
[185] Remote Sensing Image Dehazing: A Systematic Review of Progress, Challenges, and Prospects
Heng Zhou, Xiaoxiong Liu, Zhenxi Zhang, Jieheng Yun, Chengyang Li, Yunchu Yang, Dongyi Xia, Chunna Tian, Xiao-Jun Wu
Main category: cs.CV
TL;DR: Comprehensive survey of remote sensing image dehazing methods, categorizing approaches into three evolutionary stages and evaluating performance across multiple datasets and metrics.
Details
Motivation: Remote sensing images are often degraded by atmospheric conditions like haze, fog, and thin clouds, which obscure surface information and hinder downstream applications. There's a need to systematically review and benchmark existing dehazing methods for remote sensing imagery.
Method: The study categorizes existing approaches into three evolutionary stages: handcrafted physical priors, data-driven deep restoration, and hybrid physical-intelligent generation. It summarizes over 30 representative methods across CNNs, GANs, Transformers, and diffusion models. Large-scale quantitative experiments were conducted on five public datasets using 12 evaluation metrics.
Result: Transformer- and diffusion-based models improve SSIM by 12%~18% and reduce perceptual errors by 20%~35% on average. Hybrid physics-guided designs achieve higher radiometric stability, with models using explicit transmission or airlight constraints reducing color bias by up to 27%.
Conclusion: The survey identifies open challenges including dynamic atmospheric modeling, multimodal fusion, lightweight deployment, data scarcity, and joint degradations. It outlines future directions for developing trustworthy, controllable, and efficient (TCE) dehazing systems for remote sensing applications.
Abstract: Remote sensing images (RSIs) are frequently degraded by haze, fog, and thin clouds, which obscure surface reflectance and hinder downstream applications. This study presents the first systematic and unified survey of RSIs dehazing, integrating methodological evolution, benchmark assessment, and physical consistency analysis. We categorize existing approaches into a three-stage progression: from handcrafted physical priors, to data-driven deep restoration, and finally to hybrid physical-intelligent generation, and summarize more than 30 representative methods across CNNs, GANs, Transformers, and diffusion models. To provide a reliable empirical reference, we conduct large-scale quantitative experiments on five public datasets using 12 metrics, including PSNR, SSIM, CIEDE, LPIPS, FID, SAM, ERGAS, UIQI, QNR, NIQE, and HIST. Cross-domain comparison reveals that recent Transformer- and diffusion-based models improve SSIM by 12%~18% and reduce perceptual errors by 20%~35% on average, while hybrid physics-guided designs achieve higher radiometric stability. A dedicated physical radiometric consistency experiment further demonstrates that models with explicit transmission or airlight constraints reduce color bias by up to 27%. Based on these findings, we summarize open challenges: dynamic atmospheric modeling, multimodal fusion, lightweight deployment, data scarcity, and joint degradations, and outline promising research directions for future development of trustworthy, controllable, and efficient (TCE) dehazing systems. All reviewed resources, including source code, benchmark datasets, evaluation metrics, and reproduction configurations are publicly available at https://github.com/VisionVerse/RemoteSensing-Restoration-Survey.
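The "explicit transmission or airlight constraints" mentioned above refer to the standard atmospheric scattering model, which can be inverted directly once transmission t and airlight A are estimated; a minimal sketch on synthetic data:

```python
import numpy as np

def dehaze(I, t, A, t_min=0.1):
    """Invert the atmospheric scattering model I = J*t + A*(1 - t)
    to recover the haze-free scene radiance J, given an estimated
    transmission map t and a global airlight A."""
    t = np.maximum(t, t_min)  # clamp to avoid division blow-up in dense haze
    return (I - A * (1.0 - t)) / t

# round-trip check: synthesize haze, then remove it
rng = np.random.default_rng(0)
J = rng.uniform(0, 1, size=(4, 4))      # clear image
t = rng.uniform(0.3, 0.9, size=(4, 4))  # transmission map
A = 0.8                                  # airlight
I = J * t + A * (1 - t)                  # hazy observation
assert np.allclose(dehaze(I, t, A), J)
```

Models that predict t and A explicitly are constrained by this physics, which is the mechanism behind the reported radiometric stability and reduced color bias.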
[186] Transparent Fragments Contour Estimation via Visual-Tactile Fusion for Autonomous Reassembly
Qihao Lin, Borui Chen, Yuping Zhou, Jianing Wu, Yulan Guo, Weishi Zheng, Chongkun Xia
Main category: cs.CV
TL;DR: A visual-tactile fusion framework for estimating contours of transparent fragments, with applications in reassembly tasks for optical instruments, cultural relics, and precious devices.
Details
Motivation: Contour estimation of transparent fragments is crucial for autonomous reassembly in precision optical instrument repair, cultural relic restoration, and identification of broken devices. Transparent fragments pose unique challenges due to their optical properties, irregular shapes, and edges that make visual-only approaches insufficient.
Method: 1) Created TransFrag27K dataset with multiscene synthetic data of broken transparent fragments and scalable generation pipeline. 2) Developed TransFragNet for visual grasping position detection. 3) Used Gelsight Mini sensors on two-finger gripper to obtain tactile information of fragment edges. 4) Proposed visual-tactile fusion material classifier. 5) Introduced visual-tactile fusion contour estimation framework inspired by human perception. 6) Developed multi-dimensional similarity metrics for contour matching and reassembly.
Result: The framework demonstrates strong performance in real-world validation, providing a reproducible benchmark for evaluating visual-tactile contour estimation and fragment reassembly. Experimental results validate the proposed approach.
Conclusion: The proposed visual-tactile fusion framework effectively addresses the challenging problem of transparent fragment contour estimation, offering a comprehensive solution with dataset, detection network, fusion classifier, and reassembly algorithm for practical applications in reassembly tasks.
Abstract: The contour estimation of transparent fragments is very important for autonomous reassembly, especially in precision optical instrument repair, cultural relic restoration, and the identification of broken precious devices. Unlike intact transparent objects, transparent fragments pose greater contour-estimation challenges due to their strict optical properties and irregular shapes and edges. To address this issue, a general transparent fragment contour estimation framework based on visual-tactile fusion is proposed in this paper. First, we construct the transparent fragment dataset named TransFrag27K, which includes multiscene synthetic data of broken fragments from multiple types of transparent objects and a scalable synthetic data generation pipeline. Second, we propose a visual grasping position detection network named TransFragNet to identify, locate, and segment candidate grasping positions. We then use a two-finger gripper with Gelsight Mini sensors to obtain reconstructed tactile information of the lateral edges of the fragments; by fusing this tactile information with visual cues, a visual-tactile fusion material classifier is proposed. Inspired by the way humans estimate a fragment’s contour by combining vision and touch, we introduce a general transparent fragment contour estimation framework based on visual-tactile fusion, which demonstrates strong performance in real-world validation. Finally, a contour matching and reassembly algorithm based on multi-dimensional similarity metrics is proposed, providing a reproducible benchmark for evaluating visual-tactile contour estimation and fragment reassembly. The experimental results demonstrate the validity of the proposed framework. The dataset and codes are available at https://github.com/Keithllin/Transparent-Fragments-Contour-Estimation.
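One plausible ingredient of a multi-dimensional contour similarity metric, sketched under the assumption of a pose- and scale-normalized chamfer distance (the paper's actual metrics are not specified here):

```python
import numpy as np

def contour_distance(a, b):
    """Toy symmetric chamfer distance between two 2D contours after
    centering and scale normalization, so matching is invariant to
    fragment pose and scale. a, b: (n, 2) arrays of contour points."""
    def norm(c):
        c = c - c.mean(axis=0)
        return c / (np.linalg.norm(c, axis=1).max() + 1e-8)
    a, b = norm(a), norm(b)
    d_ab = np.min(np.linalg.norm(a[:, None] - b[None], axis=2), axis=1)
    d_ba = np.min(np.linalg.norm(b[:, None] - a[None], axis=2), axis=1)
    return float(d_ab.mean() + d_ba.mean())

theta = np.linspace(0, 2 * np.pi, 64, endpoint=False)
circle = np.stack([np.cos(theta), np.sin(theta)], axis=1)
shifted = circle * 3.0 + np.array([5.0, -2.0])   # scaled + translated copy
assert contour_distance(circle, shifted) < 1e-6  # invariant to pose/scale
```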
[187] Scene Representation using 360° Saliency Graph and its Application in Vision-based Indoor Navigation
Preeti Meena, Himanshu Kumar, Sandeep Yadav
Main category: cs.CV
TL;DR: Proposes a 360° saliency graph representation for scenes that encodes visual, contextual, semantic, and geometric information as nodes and edges, enabling improved scene localization and vision-based indoor navigation.
Details
Motivation: Existing scene representations (RGB-D, LiDAR, keypoints, etc.) may not be efficient for applications like scene indexing and vision-based navigation. They often lack explicit encoding of relevant information and are sensitive to scene view changes, illumination variations, occlusions, and shadows.
Method: Develops a 360° saliency graph representation that explicitly encodes visual, contextual, semantic, and geometric information as nodes, edges, edge weights, and angular positions. This representation is robust to view changes and environmental challenges. Uses this representation for vision-based navigation by first localizing query scenes in topological maps, then estimating movement directions using embedded geometric information.
Result: Experimental results demonstrate the efficacy of the proposed 360° saliency graph representation in enhancing both scene localization and vision-based indoor navigation compared to existing methods.
Conclusion: The 360° saliency graph provides a rich, efficient representation that addresses limitations of traditional scene representations and improves performance in vision-based navigation applications.
Abstract: A scene can be represented visually in many formats, such as RGB-D, LiDAR scans, keypoints, and rectangular, spherical, or multi-view images, but the information relevant to applications such as scene indexing and vision-based navigation is only implicitly embedded in these representations. Thus, they may not be efficient for such applications. This paper proposes a novel 360° saliency graph representation of scenes. This rich representation explicitly encodes the relevant visual, contextual, semantic, and geometric information of the scene as nodes, edges, edge weights, and angular positions in the 360° graph. The representation is also robust to changes in scene view and addresses indoor-environment challenges, such as varied illumination, occlusions, and shadows, that hamper existing traditional methods. We utilize this rich and efficient representation for vision-based navigation and compare it with existing navigation methods that use 360° scenes, which suffer from poor scene representations lacking scene-specific information. This work first uses the proposed representation to localize the query scene in a given topological map, and then facilitates 2D navigation by estimating the next movement directions toward the target destination in the topological map using the geometric information embedded in the 360° saliency graph. Experimental results demonstrate the efficacy of the proposed 360° saliency graph representation in enhancing both scene localization and vision-based indoor navigation.
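A toy sketch of the node/edge structure described above, assuming nodes carry semantic labels and angular positions and edges between angular neighbors are weighted by angular separation (the real graph also encodes visual and contextual features):

```python
def build_saliency_graph(detections):
    """Build a toy 360-degree saliency graph: each salient region is a
    node carrying its semantic label and angular position; each edge
    links angular neighbors and is weighted by their separation.

    detections: list of (label, angle_deg) pairs, one per salient region.
    """
    nodes = sorted(detections, key=lambda d: d[1])  # order around the circle
    edges = []
    for i in range(len(nodes)):
        a, b = nodes[i], nodes[(i + 1) % len(nodes)]  # wrap past 360
        sep = (b[1] - a[1]) % 360.0
        edges.append((a[0], b[0], sep))
    return nodes, edges

nodes, edges = build_saliency_graph(
    [("door", 10.0), ("window", 100.0), ("desk", 250.0)])
print(edges)  # [('door', 'window', 90.0), ('window', 'desk', 150.0), ('desk', 'door', 120.0)]
```

The angular separations sum to 360°, and the relative angles are what a navigation policy would use to estimate the next movement direction.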
[188] HSI Image Enhancement Classification Based on Knowledge Distillation: A Study on Forgetting
Songfeng Zhu
Main category: cs.CV
TL;DR: A teacher-based knowledge retention method for incremental hyperspectral image classification that mitigates catastrophic forgetting without needing old category samples, using mask-based partial knowledge distillation to filter misleading information.
Details
Motivation: Address catastrophic forgetting in incremental classification of hyperspectral images, where traditional memory recall methods require old category samples which may not be available, creating a need for methods that work without such samples.
Method: Teacher-based knowledge retention method that uses incremental category samples instead of old ones, combined with mask-based partial category knowledge distillation that decouples and filters potentially misleading information to prevent misguiding the student model.
Result: Comparative and ablation experiments show robust performance, demonstrating effectiveness in mitigating catastrophic forgetting and enhancing overall accuracy in incremental hyperspectral image classification.
Conclusion: The proposed approach successfully addresses catastrophic forgetting in incremental hyperspectral image classification without requiring old category samples, using teacher-based knowledge retention and selective knowledge distillation.
Abstract: In incremental classification tasks for hyperspectral images, catastrophic forgetting is an unavoidable challenge. While memory recall methods can mitigate this issue, they heavily rely on samples from old categories. This paper proposes a teacher-based knowledge retention method for incremental image classification. It alleviates model forgetting of old category samples by utilizing incremental category samples, without depending on old category samples. Additionally, this paper introduces a mask-based partial category knowledge distillation algorithm. By decoupling knowledge distillation, this approach filters out potentially misleading information that could misguide the student model, thereby enhancing overall accuracy. Comparative and ablation experiments demonstrate the proposed method’s robust performance.
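The mask-based partial distillation idea, restricting the distillation loss to a trusted subset of classes so misleading teacher signal never reaches the student, can be sketched as below; the masking granularity and temperature are assumptions.

```python
import numpy as np

def masked_distillation_loss(student_logits, teacher_logits, mask, T=2.0):
    """KL distillation restricted to a subset of classes.

    `mask` selects the old (reliable) classes; logits for the remaining
    classes are dropped before the softmax, so potentially misleading
    teacher signal on those classes is filtered out."""
    s = student_logits[:, mask] / T
    t = teacher_logits[:, mask] / T
    log_p_s = s - np.log(np.sum(np.exp(s), axis=1, keepdims=True))
    log_p_t = t - np.log(np.sum(np.exp(t), axis=1, keepdims=True))
    p_t = np.exp(log_p_t)
    return float(np.mean(np.sum(p_t * (log_p_t - log_p_s), axis=1)))

rng = np.random.default_rng(0)
teacher = rng.normal(size=(8, 10))
student = teacher.copy()
old_classes = np.arange(6)  # distill only on the first 6 (old) classes
assert masked_distillation_loss(student, teacher, old_classes) < 1e-9
```

In training this term would be added to the cross-entropy loss on the new incremental categories, retaining old-class knowledge without any old samples.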
[189] DSCSNet: A Dynamic Sparse Compression Sensing Network for Closely-Spaced Infrared Small Target Unmixing
Zhiyang Tang, Yiming Zhu, Ruimin Huang, Meng Yang, Yong Ma, Jun Huang, Fan Fan
Main category: cs.CV
TL;DR: DSCSNet: A deep-unfolded network combining ADMM with learnable parameters for infrared small target unmixing, using ℓ₁-norm sparsity constraints and dynamic thresholding to balance model-driven rigor and data-driven adaptability.
Details
Motivation: Infrared small targets often appear as mixed spots due to hardware limitations, requiring unmixing of individual targets. Existing methods struggle to balance the rigorous sparsity guarantees of model-driven approaches with the dynamic scene adaptability of data-driven methods.
Method: Proposes Dynamic Sparse Compressed Sensing Network (DSCSNet), a deep-unfolded network that couples ADMM with learnable parameters. Uses ℓ₁-norm sparsity constraints in auxiliary variable updates instead of traditional ℓ₂-norm, and integrates self-attention-based dynamic thresholding for adaptive sparsification. Modules are jointly optimized end-to-end across three ADMM iterative steps.
Result: Extensive experiments on synthetic infrared dataset CSIST-100K show DSCSNet outperforms state-of-the-art methods in key metrics like CSO-mAP and sub-pixel localization error.
Conclusion: DSCSNet achieves robust sparsity induction and scene adaptability while retaining compressed sensing physical logic, enhancing unmixing accuracy and generalization in complex infrared scenarios.
Abstract: Due to the limitations of optical lens focal length and detector resolution, distant clustered infrared small targets often appear as mixed spots. The Close Small Object Unmixing (CSOU) task aims to recover the number, sub-pixel positions, and radiant intensities of individual targets from these spots, which is a highly ill-posed inverse problem. Existing methods struggle to balance the rigorous sparsity guarantees of model-driven approaches and the dynamic scene adaptability of data-driven methods. To address this dilemma, this paper proposes a Dynamic Sparse Compressed Sensing Network (DSCSNet), a deep-unfolded network that couples the Alternating Direction Method of Multipliers (ADMM) with learnable parameters. Specifically, we embed a strict $\ell_1$-norm sparsity constraint into the auxiliary variable update step of ADMM to replace the traditional $\ell_2$-norm smoothness-promoting terms, which effectively preserves the discrete energy peaks of small targets. We also integrate a self-attention-based dynamic thresholding mechanism into the reconstruction stage, which adaptively adjusts the sparsification intensity using the sparsity-enhanced information from the iterative process. These modules are jointly optimized end-to-end across the three iterative steps of ADMM. Retaining the physical logic of compressed sensing, DSCSNet achieves robust sparsity induction and scene adaptability, thus enhancing the unmixing accuracy and generalization in complex infrared scenarios. Extensive experiments on the synthetic infrared dataset CSIST-100K demonstrate that DSCSNet outperforms state-of-the-art methods in key metrics such as CSO-mAP and sub-pixel localization error.
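The ℓ₁-norm auxiliary-variable update in ADMM has a closed form: the soft-thresholding (shrinkage) operator, which the paper's dynamic mechanism makes an adaptively learned threshold. The static version:

```python
import numpy as np

def soft_threshold(x, lam):
    """Proximal operator of the l1 norm: shrinks every entry toward
    zero by lam and zeroes anything smaller, preserving the discrete
    energy peaks of small targets that an l2 update would smear."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

y = soft_threshold(np.array([-2.0, -0.3, 0.0, 0.4, 3.0]), 0.5)
assert np.allclose(y, [-1.5, 0.0, 0.0, 0.0, 2.5])  # small entries zeroed
```

In DSCSNet's dynamic variant, lam would be predicted per entry by a self-attention module rather than fixed, while the surrounding ADMM structure is kept intact.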
[190] Thermal is Always Wild: Characterizing and Addressing Challenges in Thermal-Only Novel View Synthesis
M. Kerem Aydin, Vishwanath Saragadam, Emma Alexander
Main category: cs.CV
TL;DR: A novel preprocessing and splatting pipeline for thermal-only novel view synthesis that addresses thermal imagery’s low dynamic range and photometric instability to achieve state-of-the-art performance without dataset-specific tuning.
Details
Motivation: Thermal cameras provide visibility in darkness and adverse conditions, but thermal imagery is significantly harder to use for novel view synthesis than visible-light images due to two main challenges: extremely low dynamic range (weakening appearance cues and limiting optimization gradients) and rapid photometric fluctuations with slow radiometric drift (destabilizing correspondence estimation and creating floater artifacts).
Method: Introduces a lightweight preprocessing and splatting pipeline that expands usable dynamic range and stabilizes per-frame photometry. The approach specifically addresses thermal imagery’s unique challenges without requiring any dataset-specific tuning.
Result: Achieves state-of-the-art performance across thermal-only novel view synthesis benchmarks, demonstrating effective handling of thermal imagery’s limitations for view synthesis tasks.
Conclusion: The proposed preprocessing and splatting pipeline successfully addresses the core challenges of thermal imagery for novel view synthesis, enabling reliable thermal-only view synthesis without RGB guidance beyond camera pose.
Abstract: Thermal cameras provide reliable visibility in darkness and adverse conditions, but thermal imagery remains significantly harder to use for novel view synthesis (NVS) than visible-light images. This difficulty stems primarily from two characteristics of affordable thermal sensors. First, thermal images have extremely low dynamic range, which weakens appearance cues and limits the gradients available for optimization. Second, thermal data exhibit rapid frame-to-frame photometric fluctuations together with slow radiometric drift, both of which destabilize correspondence estimation and create high-frequency floater artifacts during view synthesis, particularly when no RGB guidance (beyond camera pose) is available. Guided by these observations, we introduce a lightweight preprocessing and splatting pipeline that expands usable dynamic range and stabilizes per-frame photometry. Our approach achieves state-of-the-art performance across thermal-only NVS benchmarks, without requiring any dataset-specific tuning.
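The two operations the abstract names, dynamic-range expansion and per-frame photometric stabilization, can be illustrated generically: percentile-stretch each low-range frame, then match its statistics to a reference to damp frame-to-frame jumps. This is a hedged sketch of those generic ideas, not the paper's pipeline; the function name and parameters are ours:

```python
import numpy as np

def expand_and_stabilize(frame, ref, lo=1.0, hi=99.0):
    """Percentile-stretch a low-dynamic-range thermal frame, then match its
    mean/std to a reference frame to damp frame-to-frame photometric jumps."""
    a, b = np.percentile(frame, [lo, hi])
    stretched = np.clip((frame - a) / max(b - a, 1e-6), 0.0, 1.0)
    out = (stretched - stretched.mean()) / (stretched.std() + 1e-6)
    return out * ref.std() + ref.mean()
```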
[191] InjectFlow: Weak Guides Strong via Orthogonal Injection for Flow Matching
Dayu Wang, Jiaye Yang, Weikang Li, Jiahui Liang, Yang Li
Main category: cs.CV
TL;DR: InjectFlow: Training-free method to mitigate dataset bias in Flow Matching models by injecting orthogonal semantics during initial velocity field computation, preventing trajectory lock-in toward majority modes.
Details
Motivation: Flow Matching models are highly sensitive to dataset biases, causing severe semantic degradation when generating out-of-distribution or minority-class samples. This performance drop is driven by conditional expectation smoothing that leads to trajectory lock-in during inference.
Method: Introduces InjectFlow, a training-free method that injects orthogonal semantics during the initial velocity field computation without changing random seeds. This prevents latent drift toward majority modes while maintaining generative quality.
Result: Extensive experiments show InjectFlow successfully fixes 75% of prompts that standard flow matching models fail to generate correctly on the GenEval dataset.
Conclusion: Provides theoretical analysis and algorithm for building more fair and robust visual foundation models, offering a ready-to-use solution for bias mitigation in Flow Matching.
Abstract: Flow Matching (FM) has recently emerged as a leading approach for high-fidelity visual generation, offering a robust continuous-time alternative to ordinary differential equation (ODE) based models. However, despite their success, FM models are highly sensitive to dataset biases, which cause severe semantic degradation when generating out-of-distribution or minority-class samples. In this paper, we provide a rigorous mathematical formalization of the "Bias Manifold" within the FM framework. We identify that this performance drop is driven by conditional expectation smoothing, a mechanism that inevitably leads to trajectory lock-in during inference. To resolve this, we introduce InjectFlow, a novel, training-free method that injects orthogonal semantics during the initial velocity field computation, without requiring any changes to the random seeds. This design effectively prevents the latent drift toward majority modes while maintaining high generative quality. Extensive experiments demonstrate the effectiveness of our approach. Notably, on the GenEval dataset, InjectFlow successfully fixes 75% of the prompts that standard flow matching models fail to generate correctly. Ultimately, our theoretical analysis and algorithm provide a ready-to-use solution for building more fair and robust visual foundation models.
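One plausible reading of "injecting orthogonal semantics" is a Gram-Schmidt step: remove the component of a semantic direction parallel to the initial velocity before adding it, so the original trajectory direction is untouched. A minimal sketch under that assumption (function name, scale `alpha`, and formulation are ours, not the paper's algorithm):

```python
import numpy as np

def inject_orthogonal(v0, s, alpha=0.5):
    """Add only the component of semantic direction s that is orthogonal
    to the initial velocity v0, keeping the original flow direction."""
    v = v0.ravel().astype(float)
    d = s.ravel().astype(float)
    d_orth = d - (d @ v) / (v @ v) * v  # Gram-Schmidt projection
    return v0 + alpha * d_orth.reshape(v0.shape)
```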
[192] StreamingEval: A Unified Evaluation Protocol towards Realistic Streaming Video Understanding
Guowei Tang, Tianwen Qian, Huanran Zheng, Yifei Wang, Xiaoling Wang
Main category: cs.CV
TL;DR: StreamingEval: A unified evaluation framework for assessing Video-LLMs’ streaming video understanding capabilities under realistic resource constraints, benchmarking efficiency-storage-accuracy trade-offs.
Details
Motivation: Existing streaming video understanding research focuses on isolated aspects like accuracy or encoding efficiency, overlooking practical deployability under realistic resource constraints. There's a need for standardized evaluation of Video-LLMs' streaming capabilities.
Method: Introduces StreamingEval framework with fixed-capacity memory bank to normalize accessible historical visual context. Jointly evaluates visual encoding efficiency, text decoding latency, and task performance to quantify overall system deployability under standardized protocol.
Result: Extensive experiments across multiple datasets reveal substantial gaps between current Video-LLMs and requirements of realistic streaming applications, providing systematic basis for future research.
Conclusion: StreamingEval addresses the need for practical evaluation of streaming video understanding, highlighting current limitations and providing a standardized framework for future Video-LLM development in real-world interactive applications.
Abstract: Real-time, continuous understanding of visual signals is essential for real-world interactive AI applications, and poses a fundamental system-level challenge. Existing research on streaming video understanding, however, typically focuses on isolated aspects such as question-answering accuracy under limited visual context or improvements in encoding efficiency, while largely overlooking practical deployability under realistic resource constraints. To bridge this gap, we introduce StreamingEval, a unified evaluation framework for assessing the streaming video understanding capabilities of Video-LLMs under realistic constraints. StreamingEval benchmarks both mainstream offline models and recent online video models under a standardized protocol, explicitly characterizing the trade-off between efficiency, storage and accuracy. Specifically, we adopt a fixed-capacity memory bank to normalize accessible historical visual context, and jointly evaluate visual encoding efficiency, text decoding latency, and task performance to quantify overall system deployability. Extensive experiments across multiple datasets reveal substantial gaps between current Video-LLMs and the requirements of realistic streaming applications, providing a systematic basis for future research in this direction. Code will be released at https://github.com/wwgTang-111/StreamingEval1.
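A fixed-capacity memory bank that normalizes accessible history is, at its simplest, a bounded FIFO buffer of frame features. A minimal sketch of that idea (class and method names are ours; the paper's bank may store richer state):

```python
from collections import deque

class MemoryBank:
    """Fixed-capacity FIFO store: every model sees the same bounded slice
    of history, making storage/accuracy trade-offs directly comparable."""
    def __init__(self, capacity):
        self.frames = deque(maxlen=capacity)

    def push(self, frame_tokens):
        self.frames.append(frame_tokens)  # oldest frame is evicted for free

    def context(self):
        return list(self.frames)
```

`deque(maxlen=...)` handles eviction automatically, so pushing a new frame silently drops the oldest once capacity is reached.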
[193] The Universal Normal Embedding
Chen Tasker, Roy Betser, Eyal Gofer, Meir Yossef Levi, Guy Gilboa
Main category: cs.CV
TL;DR: The paper proposes a Universal Normal Embedding (UNE) hypothesis that connects generative models and vision encoders through shared Gaussian latent spaces, showing that diffusion noise and encoder embeddings are noisy linear projections of the same underlying Gaussian source.
Details
Motivation: To bridge the gap between generative models and vision encoders, which have developed separately despite sharing fundamental Gaussian latent space properties, by hypothesizing they are views of a shared latent source.
Method: Introduced NoiseZoo dataset containing per-image DDIM-inverted diffusion noise and matching encoder representations (CLIP, DINO). Used linear probes to analyze semantic information in both spaces and performed orthogonalization for disentangled image editing.
Result: Linear probes show strong, aligned attribute predictions in both spaces, revealing that generative noise encodes meaningful semantics. Simple orthogonalization enables faithful, controllable image edits (smile, gender, age) without architectural changes.
Conclusion: Provides empirical support for the UNE hypothesis, revealing a shared Gaussian-like latent geometry that concretely links encoding and generation, enabling new editing capabilities.
Abstract: Generative models and vision encoders have largely advanced on separate tracks, optimized for different goals and grounded in different mathematical principles. Yet, they share a fundamental property: latent space Gaussianity. Generative models map Gaussian noise to images, while encoders map images to semantic embeddings whose coordinates empirically behave as Gaussian. We hypothesize that both are views of a shared latent source, the Universal Normal Embedding (UNE): an approximately Gaussian latent space from which encoder embeddings and DDIM-inverted noise arise as noisy linear projections. To test our hypothesis, we introduce NoiseZoo, a dataset of per-image latents comprising DDIM-inverted diffusion noise and matching encoder representations (CLIP, DINO). On CelebA, linear probes in both spaces yield strong, aligned attribute predictions, indicating that generative noise encodes meaningful semantics along linear directions. These directions further enable faithful, controllable edits (e.g., smile, gender, age) without architectural changes, where simple orthogonalization mitigates spurious entanglements. Taken together, our results provide empirical support for the UNE hypothesis and reveal a shared Gaussian-like latent geometry that concretely links encoding and generation. Code and data are available at https://rbetser.github.io/UNE/
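The "simple orthogonalization" that mitigates entanglement can be sketched as projecting an edit direction off the nuisance attribute directions, so that, say, a smile edit does not also shift age. A toy sketch under that reading (names are ours; sequential projection is exact only when the nuisance directions are mutually orthogonal):

```python
import numpy as np

def disentangle(edit_dir, nuisance_dirs):
    """Remove the component of edit_dir along each nuisance attribute
    direction (e.g. keep a 'smile' edit from also moving 'age')."""
    d = np.asarray(edit_dir, dtype=float).copy()
    for n in nuisance_dirs:
        n = np.asarray(n, dtype=float)
        n = n / np.linalg.norm(n)
        d = d - (d @ n) * n
    return d
```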
[194] Transferable Multi-Bit Watermarking Across Frozen Diffusion Models via Latent Consistency Bridges
Hong-Hanh Nguyen-Le, Van-Tuan Tran, Thuc D. Nguyen, Nhien-An Le-Khac
Main category: cs.CV
TL;DR: DiffMark: A plug-and-play watermarking method for diffusion models that enables single-pass multi-bit detection, per-image key flexibility, and cross-model transferability without model fine-tuning.
Details
Motivation: Existing diffusion model watermarking methods face challenges: sampling-based approaches require costly DDIM inversion for detection, while fine-tuning-based methods couple watermarks to specific model checkpoints requiring retraining for each architecture.
Method: Instead of encoding watermarks into initial noise, DiffMark injects a persistent learned perturbation at every denoising step of a frozen diffusion model. Uses Latent Consistency Models as a differentiable training bridge to avoid backpropagating through the full denoising chain, reducing gradient steps from 50 (DDIM) to 4 (LCM).
Result: Achieves single-pass detection at 16.4 ms (45x speedup over sampling-based methods), maintains competitive watermark robustness against distortion/regeneration/adversarial attacks, and provides per-image key flexibility with cross-model transferability.
Conclusion: DiffMark offers a practical plug-and-play watermarking solution for diffusion models that balances efficiency, flexibility, and robustness while enabling cross-architecture compatibility without per-model fine-tuning.
Abstract: As diffusion models (DMs) enable photorealistic image generation at unprecedented scale, watermarking techniques have become essential for provenance establishment and accountability. Existing methods face challenges: sampling-based approaches operate on frozen models but require costly $N$-step Denoising Diffusion Implicit Models (DDIM) inversion (typically N=50) for zero-bit-only detection; fine-tuning-based methods achieve fast multi-bit extraction but couple the watermark to a specific model checkpoint, requiring retraining for each architecture. We propose DiffMark, a plug-and-play watermarking method that offers three key advantages over existing approaches: single-pass multi-bit detection, per-image key flexibility, and cross-model transferability. Rather than encoding the watermark into the initial noise vector, DiffMark injects a persistent learned perturbation $\delta$ at every denoising step of a completely frozen DM. The watermark signal accumulates in the final denoised latent $z_0$ and is recovered in a single forward pass. The central challenge of backpropagating gradients through a frozen UNet without traversing the full denoising chain is addressed by employing Latent Consistency Models (LCM) as a differentiable training bridge. This reduces the number of gradient steps from 50 (DDIM) to 4 (LCM) and enables single-pass detection at 16.4 ms, a 45x speedup over sampling-based methods. Moreover, by this design, the encoder learns to map any runtime secret to a unique perturbation at inference time, providing genuine per-image key flexibility and transferability to unseen diffusion-based architectures without per-model fine-tuning. While achieving these advantages, DiffMark also maintains competitive watermark robustness against distortion, regeneration, and adversarial attacks.
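The core sampling loop is simple to picture: run the frozen denoiser as usual, but re-add the learned perturbation $\delta$ after every step so the signal accumulates into $z_0$. A toy sketch of that loop (the trivial `step_fn` below stands in for a real frozen denoising step; names are ours):

```python
import numpy as np

def sample_with_watermark(z_T, delta, step_fn, steps=4):
    """Run a frozen denoiser step_fn for `steps` iterations, re-injecting
    the learned perturbation delta after each step so it persists in z_0."""
    z = z_T
    for _ in range(steps):
        z = step_fn(z) + delta
    return z
```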
[195] Cross-Scenario Deraining Adaptation with Unpaired Data: Superpixel Structural Priors and Multi-Stage Pseudo-Rain Synthesis
Kangbo Zhao, Miaoxin Guan, Xiang Chen, Yukai Shi, Jinshan Pan
Main category: cs.CV
TL;DR: Proposes a cross-scenario deraining adaptation framework that uses only rain-free background images from target domain, generates pseudo-data via superpixel priors and resolution-adaptive fusion, and employs pseudo-label re-synthesis with multi-stage noise generation for realistic rain simulation.
Details
Motivation: Current deep learning deraining methods fail in Out-of-Distribution scenarios due to domain discrepancy between synthetic training data and real-world rain dynamics. Need for adaptation without requiring paired rainy observations in target domain.
Method: 1) Superpixel Generation module extracts structural priors from source domain using Simple Linear Iterative Clustering. 2) Resolution-adaptive Fusion aligns source structures with target backgrounds via texture similarity. 3) Pseudo-label re-Synthesize mechanism uses multi-stage noise generation to simulate realistic rain streaks. Framework is plug-and-play for arbitrary deraining architectures.
Result: Achieves remarkable PSNR gains of 32% to 59% in OOD domains while significantly accelerating training convergence. Extensive experiments on state-of-the-art models demonstrate effectiveness.
Conclusion: Proposed framework successfully addresses domain adaptation challenges in image deraining without requiring paired rainy observations, enabling robust performance in unseen scenarios through realistic pseudo-data generation.
Abstract: Image deraining plays a pivotal role in low-level computer vision, serving as a prerequisite for robust outdoor surveillance and autonomous driving systems. While deep learning paradigms have achieved remarkable success in firmly aligned settings, they often suffer from severe performance degradation when generalized to unseen Out-of-Distribution (OOD) scenarios. This failure stems primarily from the significant domain discrepancy between synthetic training datasets and the complex physical dynamics of real-world rain. To address these challenges, this paper proposes a pioneering cross-scenario deraining adaptation framework. Diverging from conventional approaches, our method obviates the requirements for paired rainy observations in the target domain, leveraging exclusively rain-free background images. We design a Superpixel Generation (Sup-Gen) module to extract stable structural priors from the source domain using Simple Linear Iterative Clustering. Subsequently, a Resolution-adaptive Fusion strategy is introduced to align these source structures with target backgrounds through texture similarity, ensuring the synthesis of diverse and realistic pseudo-data. Finally, we implement a pseudo-label re-Synthesize mechanism that employs multi-stage noise generation to simulate realistic rain streaks. This framework functions as a versatile plug-and-play module capable of seamless integration into arbitrary deraining architectures. Extensive experiments on state-of-the-art models demonstrate that our approach yields remarkable PSNR gains of 32% to 59% in OOD domains while significantly accelerating training convergence.
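"Multi-stage noise generation" for rain streaks is commonly built from repeated smear-and-threshold passes over noise: each stage elongates the noise along the fall direction and re-sparsifies it. A purely illustrative toy sketch of that generic idea (not the paper's mechanism; all names and stage choices are ours):

```python
import numpy as np

def synth_rain(h, w, stages=3, seed=0):
    """Toy multi-stage rain-streak layer: Gaussian noise is repeatedly
    smeared vertically (motion) and re-thresholded (sparse streaks)."""
    rng = np.random.default_rng(seed)
    layer = rng.standard_normal((h, w))
    for _ in range(stages):
        layer = 0.5 * (layer + np.roll(layer, 1, axis=0))  # vertical smear
        keep = layer > layer.mean() + layer.std()          # sparsify
        layer = np.where(keep, layer, 0.0)
    return np.clip(layer, 0.0, 1.0)
```

The resulting sparse layer would be composited onto a rain-free background to form a pseudo-rainy training pair.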
[196] A Latent Representation Learning Framework for Hyperspectral Image Emulation in Remote Sensing
Chedly Ben Azizi, Claire Guilloteau, Gilles Roussel, Matthieu Puigt
Main category: cs.CV
TL;DR: A latent representation framework for synthetic hyperspectral image generation that outperforms classical regression-based emulators in accuracy and spectral fidelity.
Details
Motivation: Traditional radiative transfer models for hyperspectral image generation are computationally expensive and often limited to spectrum-level outputs, creating a need for more efficient and flexible emulation methods.
Method: Proposes a latent representation-based framework using either direct one-step training or a two-step strategy combining VAE pretraining with parameter-to-latent interpolation for both spectrum-level and spatial-spectral emulation.
Result: Outperforms classical regression-based emulators in reconstruction accuracy, spectral fidelity, and robustness to real-world spatial variability on PROSAIL-simulated vegetation data and Sentinel-3 OLCI imagery.
Conclusion: The method enables efficient hyperspectral emulation that preserves downstream biophysical parameter retrieval performance, making it practically relevant for remote sensing applications.
Abstract: Synthetic hyperspectral image (HSI) generation is essential for large-scale simulation, algorithm development, and mission design, yet traditional radiative transfer models remain computationally expensive and often limited to spectrum-level outputs. In this work, we propose a latent representation-based framework for hyperspectral emulation that learns a latent generative representation of hyperspectral data. The proposed approach supports both spectrum-level and spatial-spectral emulation and can be trained either in a direct one-step formulation or in a two-step strategy that couples variational autoencoder (VAE) pretraining with parameter-to-latent interpolation. Experiments on PROSAIL-simulated vegetation data and Sentinel-3 OLCI imagery demonstrate that the method outperforms classical regression-based emulators in reconstruction accuracy, spectral fidelity, and robustness to real-world spatial variability. We further show that emulated HSIs preserve performance in downstream biophysical parameter retrieval, highlighting the practical relevance of emulated data for remote sensing applications.
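"Parameter-to-latent interpolation" in its simplest one-dimensional form blends the latent codes of the two training parameter values bracketing a query. A hedged sketch of that idea (the actual paper likely uses a richer multi-dimensional mapping; names and the linear scheme are ours):

```python
import numpy as np

def param_to_latent(theta, params, latents):
    """1-D parameter-to-latent interpolation: blend the VAE latent codes
    of the two training parameters bracketing theta. `params` is sorted."""
    params = np.asarray(params, dtype=float)
    i = int(np.clip(np.searchsorted(params, theta), 1, len(params) - 1))
    t = (theta - params[i - 1]) / (params[i] - params[i - 1])
    return (1.0 - t) * latents[i - 1] + t * latents[i]
```

The interpolated latent would then be decoded by the pretrained VAE to emulate a spectrum for the queried biophysical parameter.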
[197] The Global-Local loop: what is missing in bridging the gap between geospatial data from numerous communities?
Clément Mallet, Ana-Maria Raimond
Main category: cs.CV
TL;DR: The paper discusses challenges in geospatial data fusion, criticizing current “master-slave” approaches and advocating for more symmetric, mutually beneficial integration of multiple data sources across scales and communities.
Details
Motivation: Current geospatial data fusion approaches operate under a "master-slave" paradigm where one source dominates, lacking mutual benefits and carrying community biases. There's a need for more symmetric exploitation of multiple data sources across different scales and communities.
Method: Proposes establishing relevant interaction schemes through illustrative use cases and discusses under-explored research directions for leveraging available data through multiple extents and communities.
Result: Identifies gaps in current geospatial data fusion approaches and proposes a framework for more symmetric, mutually beneficial integration of diverse data sources.
Conclusion: Current geospatial data fusion is limited by asymmetric approaches; future research should focus on creating more balanced, mutually beneficial integration schemes across scales and communities.
Abstract: We face an unprecedented amount of geospatial data, describing directly or indirectly the Earth's surface at multiple spatial, temporal, and semantic scales, and stemming from numerous contributors, from satellites to citizens. The main challenge in all the geospatial-related communities lies in suitably leveraging a combination of some of the sources for either a generic or a thematic application. Certain data fusion schemes are predominantly exploited: they correspond to popular tasks with mainstream data sources, e.g., free archives of Sentinel images coupled with OpenStreetMap data under an open and widespread deep-learning backbone for land-cover mapping purposes. Most of these approaches unfortunately operate under a “master-slave” paradigm, where one source is basically integrated to help process the “main” source, without mutual advantages (e.g., large-scale estimation of a given biophysical variable using in-situ observations) and under a specific community bias. We argue that numerous key data fusion configurations, and in particular the effort in symmetrizing the exploitation of multiple data sources, are insufficiently addressed while being highly beneficial for generic or thematic applications. Bridges and retroactions between scales, communities and their respective sources are lacking, neglecting the utmost potential of such a “global-local loop”. In this paper, we propose to establish the most relevant interaction schemes through illustrative use cases. We subsequently discuss under-explored research directions that could take advantage of leveraging available data through multiple extents and communities.
[198] GraphiContact: Pose-aware Human-Scene Robust Contact Perception for Interactive Systems
Xiaojian Lin, Yaomin Shen, Junyuan Ma, Yujie Sun, Chengqing Bu, Wenxin Zhang, Zongzheng Zhang, Hao Fei, Lei Jin, Hao Zhao
Main category: cs.CV
TL;DR: GraphiContact: A pose-aware framework for joint monocular vertex-level human-scene contact prediction and 3D human mesh reconstruction using complementary human priors from pretrained Transformers with uncertainty training for robustness.
Details
Motivation: Existing approaches either focus on contact prediction without sufficiently exploiting explicit 3D human priors, or emphasize pose/mesh reconstruction without directly optimizing robust vertex-level contact inference under occlusion and perceptual noise.
Method: Proposes GraphiContact framework that transfers complementary human priors from two pretrained Transformer encoders to predict per-vertex human-scene contact on reconstructed mesh. Introduces Single-Image Multi-Infer Uncertainty (SIMU) training strategy with token-level adaptive routing to simulate occlusion and noisy observations during training while preserving efficient single-branch inference at test time.
Result: Experiments on five benchmark datasets show that GraphiContact achieves consistent gains on both contact prediction and 3D human reconstruction.
Conclusion: GraphiContact provides a comprehensive solution for 3D human reconstruction and interaction analysis, addressing the gap between contact prediction and mesh reconstruction with improved robustness to real-world challenges like occlusion and noise.
Abstract: Monocular vertex-level human-scene contact prediction is a fundamental capability for interactive systems such as assistive monitoring, embodied AI, and rehabilitation analysis. In this work, we study this task jointly with single-image 3D human mesh reconstruction, using reconstructed body geometry as a scaffold for contact reasoning. Existing approaches either focus on contact prediction without sufficiently exploiting explicit 3D human priors, or emphasize pose/mesh reconstruction without directly optimizing robust vertex-level contact inference under occlusion and perceptual noise. To address this gap, we propose GraphiContact, a pose-aware framework that transfers complementary human priors from two pretrained Transformer encoders and predicts per-vertex human-scene contact on the reconstructed mesh. To improve robustness in real-world scenarios, we further introduce a Single-Image Multi-Infer Uncertainty (SIMU) training strategy with token-level adaptive routing, which simulates occlusion and noisy observations during training while preserving efficient single-branch inference at test time. Experiments on five benchmark datasets show that GraphiContact achieves consistent gains on both contact prediction and 3D human reconstruction. Our code, based on the GraphiContact method, provides comprehensive 3D human reconstruction and interaction analysis, and will be publicly available at https://github.com/Aveiro-Lin/GraphiContact.
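The "multi-infer" idea behind SIMU can be pictured as running the predictor several times under random token dropout (a stand-in for occlusion) and aggregating the passes into a mean prediction plus a per-output uncertainty. A toy sketch of that generic pattern only (it omits the paper's token-level adaptive routing; all names are ours):

```python
import numpy as np

def multi_infer(predict, tokens, n=4, drop=0.3, seed=0):
    """Run `predict` n times with random token dropout (toy occlusion)
    and aggregate: mean prediction plus per-output variance."""
    rng = np.random.default_rng(seed)
    outs = []
    for _ in range(n):
        mask = (rng.random(tokens.shape) >= drop).astype(tokens.dtype)
        outs.append(predict(tokens * mask))
    outs = np.stack(outs)
    return outs.mean(axis=0), outs.var(axis=0)
```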
[199] FeatDistill: A Feature Distillation Enhanced Multi-Expert Ensemble Framework for Robust AI-generated Image Detection
Zhilin Tu, Kemou Li, Fengpeng Li, Jianwei Fei, Jiamin Zhang, Haiwei Wu
Main category: cs.CV
TL;DR: FeatDistill: A deepfake detection framework using multi-expert ViT ensemble with feature distillation for robust detection under real-world degradations and unseen generators.
Details
Motivation: Address challenges in AI-generated image detection caused by rapid deepfake technology advancement, focusing on real-world bottlenecks: degradation interference, insufficient feature representation, and limited generalization.
Method: Four-backbone ViT ensemble (CLIP and SigLIP variants) with expanded training data and degradation modeling. Two-stage training: binary classification followed by feature-level self-distillation for representation alignment.
Result: Achieves strong robustness and generalization in NTIRE challenge setting, handling diverse “in-the-wild” conditions with stable predictions across unseen generators and complex degradations.
Conclusion: FeatDistill offers an effective practical solution for real-world deepfake detection, balancing performance with efficiency (10GB GPU memory).
Abstract: The rapid iteration and widespread dissemination of deepfake technology have posed severe challenges to information security, making robust and generalizable detection of AI-generated forged images increasingly important. In this paper, we propose FeatDistill, an AI-generated image detection framework that integrates feature distillation with a multi-expert ensemble, developed for the NTIRE Challenge on Robust AI-Generated Image Detection in the Wild. The framework explicitly targets three practical bottlenecks in real-world forensics: degradation interference, insufficient feature representation, and limited generalization. Concretely, we build a four-backbone Vision Transformer (ViT) ensemble composed of CLIP and SigLIP variants to capture complementary forensic cues. To improve data coverage, we expand the training set and introduce comprehensive degradation modeling, which exposes the detector to diverse quality variations and synthesis artifacts commonly encountered in unconstrained scenarios. We further adopt a two-stage training paradigm: the model is first optimized with a standard binary classification objective, then refined by dense feature-level self-distillation for representation alignment. This design effectively mitigates overfitting and enhances semantic consistency of learned features. At inference time, the final prediction is obtained by averaging the probabilities from four independently trained experts, yielding stable and reliable decisions across unseen generators and complex degradations. Despite the ensemble design, the framework remains efficient, requiring only about 10 GB peak GPU memory. Extensive evaluations in the NTIRE challenge setting demonstrate that FeatDistill achieves strong robustness and generalization under diverse "in-the-wild" conditions, offering an effective and practical solution for real-world deepfake image detection.
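The stated inference rule, averaging the four experts' probabilities, is a one-liner; a minimal sketch with a decision threshold (the threshold value is our assumption, not stated in the abstract):

```python
import numpy as np

def ensemble_predict(expert_probs, threshold=0.5):
    """Average the 'fake' probabilities from independently trained experts
    and threshold the mean for the final real/fake decision."""
    p = float(np.mean(expert_probs))
    return p, p >= threshold
```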
[200] VGS-Decoding: Visual Grounding Score Guided Decoding for Hallucination Mitigation in Medical VLMs
Govinda Kolli, Adinath Madhavrao Dukre, Behzad Bozorgtabar, Dwarikanath Mahapatra, Imran Razzak
Main category: cs.CV
TL;DR: VGS-Decoding is a training-free method that reduces hallucinations in medical VLMs by measuring token visual dependency and adaptively reweighting probabilities during inference.
Details
Motivation: Medical VLMs often hallucinate by relying on language priors rather than visual evidence, which poses serious risks in clinical applications where accuracy is critical.
Method: Proposes Visual Grounding Score Guided Decoding (VGS-Decoding) which introduces a Visual Grounding Score (VGS) that measures each token’s visual dependency by comparing probability distributions from original vs. distorted images. During decoding, probabilities are reweighted to amplify visually grounded tokens and suppress hallucinations.
Result: Experiments on MIMIC-Diff-VQA and VQA-RAD datasets across LLaVA-Med, CheXagent, and MedGemma models show consistent improvements: up to +9.12% overall gain and +8.98% in open-ended recall, with only 2× inference overhead and no additional training.
Conclusion: VGS-Decoding effectively mitigates hallucinations in medical VLMs through adaptive per-token probability reweighting based on visual grounding scores, offering a practical, training-free solution for clinical deployment.
Abstract: Medical Vision-Language Models (VLMs) often hallucinate by generating responses based on language priors rather than visual evidence, posing risks in clinical applications. We propose Visual Grounding Score Guided Decoding (VGS-Decoding), a training-free method to mitigate hallucinations during inference. Our key insight is that hallucinated tokens maintain or increase their probability when visual information is degraded, while visually grounded tokens decrease in probability. We introduce the Visual Grounding Score (VGS), which measures each token’s visual dependency by comparing distributions from original and distorted images. During decoding, we reweight probabilities by amplifying visually grounded tokens while suppressing hallucinations. Unlike fixed-weight contrastive methods, VGS-Decoding provides per-token adaptive control. Experiments on MIMIC-Diff-VQA and VQA-RAD across LLaVA-Med, CheXagent, and MedGemma demonstrate consistent improvements, with up to +9.12% overall gain and +8.98% in open-ended recall, while introducing only 2× inference overhead and no additional training, making it practical for clinical deployment. Upon acceptance, code will be released publicly to facilitate reproducibility.
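The key insight translates directly into a per-token score and reweighting: the VGS is the probability drop under image degradation, and tokens with a large drop (visually grounded) get boosted. A minimal sketch; the additive log-space reweighting and the `beta` strength are our assumed functional form, not the paper's exact formula:

```python
import numpy as np

def vgs_decode(p_orig, p_distorted, beta=2.0):
    """Per-token Visual Grounding Score: how much a token's probability
    drops when the image is degraded. Grounded tokens (large drop) are
    amplified; tokens that survive degradation (language priors) are not."""
    vgs = p_orig - p_distorted                  # > 0 for grounded tokens
    logits = np.log(p_orig + 1e-12) + beta * vgs
    e = np.exp(logits - logits.max())           # stable softmax
    return e / e.sum()
```

The 2× inference overhead in the abstract corresponds to the second forward pass needed to obtain `p_distorted`.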
[201] Understanding Temporal Logic Consistency in Video-Language Models through Cross-Modal Attention Discriminability
Chengzhi Li, Heyan Huang, Ping Jian, Zhen Yang, Yaning Tian, Zhongbin Guo
Main category: cs.CV
TL;DR: Proposes TCAS method to improve temporal logic consistency in Video-LLMs by enhancing cross-modal attention heads’ ability to distinguish video tokens across timestamps.
Details
Motivation: Video-LLMs often generate self-contradictory outputs when answering rephrased questions about video content, which undermines reliability and practical adoption. The underlying causes of this temporal inconsistency phenomenon are not well understood.
Method: Uses interpretability-driven analysis to identify that cross-modal attention heads fail to distinguish video tokens across different timestamps. Proposes Temporally Conditioned Attention Sharpening (TCAS) - an attention enhancement method that constructs an objective based on attention distinctions to improve temporal resolution capability.
Result: TCAS significantly enhances temporal logic consistency of Video-LLMs. Further analysis shows it improves temporal discriminability of attention heads. The method also achieves performance improvements in general video temporal grounding tasks.
Conclusion: Temporal logic consistency is crucial for temporal understanding in Video-LLMs. The proposed TCAS method effectively addresses inconsistency issues by enhancing attention mechanisms’ temporal discrimination capabilities.
Abstract: Large language models (LLMs) often generate self-contradictory outputs, which severely impacts their reliability and hinders their adoption in practical applications. In video-language models (Video-LLMs), this phenomenon has recently drawn the attention of researchers. Specifically, these models fail to provide logically consistent responses to rephrased questions based on their grounding outputs. However, the underlying causes of this phenomenon remain underexplored. In this work, we adopt an interpretability-driven approach to analyze, statistically summarize, and intervene on the potential factors behind the phenomenon. We find that one of the primary reasons for the inconsistency in responses lies in the inability of cross-modal attention heads to effectively distinguish video tokens across different timestamps. To address this, we propose an attention enhancement method called Temporally Conditioned Attention Sharpening (TCAS), which constructs an enhancement objective based on attention distinctions to improve the model’s temporal resolution capability, thereby improving the logical consistency of its temporal understanding. Experimental results demonstrate that our method significantly enhances the temporal logic consistency of Video-LLMs. Further analyses reveal that our method indeed improves the temporal discriminability of attention heads, validating our conclusions. Additionally, our method even achieves performance improvements in general video temporal grounding tasks, suggesting that temporal logic consistency is an important factor in temporal understanding.
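As a rough illustration of what an attention-sharpening objective can look like, the proxy below penalizes a cross-modal head whose attention mass is spread uniformly across timestamps; TCAS's actual objective is built on attention distinctions and will differ, so treat this purely as a stand-in:

```python
import numpy as np

def temporal_sharpening_loss(attn, frame_ids):
    """Entropy of a head's per-timestamp attention mass (illustrative proxy).

    attn: attention weights of one cross-modal head over video tokens.
    frame_ids: timestamp index of each video token.
    The loss is low when attention concentrates on few timestamps, i.e. when
    the head discriminates between frames instead of blurring them together.
    """
    frames = np.unique(frame_ids)
    mass = np.array([attn[frame_ids == f].sum() for f in frames])
    dist = mass / mass.sum()
    return float(-(dist * np.log(dist + 1e-9)).sum())

frame_ids = np.array([0, 0, 1, 1])
blurred = temporal_sharpening_loss(np.array([0.25, 0.25, 0.25, 0.25]), frame_ids)
sharp = temporal_sharpening_loss(np.array([0.45, 0.45, 0.05, 0.05]), frame_ids)
```

Minimizing such a term during fine-tuning pushes heads toward temporally discriminative attention patterns, which is the behavior the paper's analysis identifies as missing.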
[202] Which Workloads Belong in Orbit? A Workload-First Framework for Orbital Data Centers Using Semantic Abstraction
Durgendra Narayan Singh
Main category: cs.CV
TL;DR: Space-based computing framework for AI workloads using semantic reduction to decide orbit vs. terrestrial processing, demonstrated with Earth observation pipelines achieving 99%+ data reduction.
Details
Motivation: With falling launch costs and growing AI workloads, there's potential for space-based computing. The paper addresses the challenge of deciding which AI tasks should be processed in orbit versus terrestrial clouds, focusing on data-intensive workloads like Earth observation.
Method: Proposes a workload-centric framework with phased adoption model tied to orbital data center maturity. Grounds the framework with in-orbit semantic-reduction prototypes: 1) Earth-observation pipeline on Sentinel-2 imagery converting raw imagery to compact semantic artifacts, 2) Multi-pass stereo reconstruction prototype reducing raw data to derived 3D representations.
Result: Achieved 99.7-99.99% payload reduction for Earth-observation pipeline on Seattle and Bengaluru imagery. Multi-pass stereo reconstruction reduced ~306 MB to ~1.57 MB (99.49% reduction). Demonstrates semantic abstraction drives early workload suitability more than raw compute scale.
Conclusion: Supports a workload-first view where semantic abstraction, not raw compute scale, determines early suitability for space-based processing. The framework helps decide orbit vs. terrestrial allocation for AI workloads based on data reduction potential.
Abstract: Space-based compute is becoming plausible as launch costs fall and data-intensive AI workloads grow. This paper proposes a workload-centric framework for deciding which tasks belong in orbit versus terrestrial cloud, along with a phased adoption model tied to orbital data center maturity. We ground the framework with in-orbit semantic-reduction prototypes. An Earth-observation pipeline on Sentinel-2 imagery from Seattle and Bengaluru (formerly Bangalore) achieves 99.7-99.99% payload reduction by converting raw imagery to compact semantic artifacts. A multi-pass stereo reconstruction prototype reduces ~306 MB to ~1.57 MB of derived 3D representations (99.49% reduction). These results support a workload-first view in which semantic abstraction, not raw compute scale, drives early workload suitability.
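The reduction figures above follow directly from comparing artifact size to raw payload size; a minimal sketch of that accounting (the artifact schema and tile identifier here are hypothetical, only the ~306 MB raw size comes from the paper):

```python
import json

def reduction_ratio(raw_bytes, artifact_bytes):
    """Fraction of the downlink payload eliminated by semantic reduction."""
    return 1.0 - artifact_bytes / raw_bytes

# A detection-style semantic artifact standing in for a raw image capture.
artifact = json.dumps({
    "tile": "hypothetical_tile_id",
    "objects": [{"cls": "building", "bbox": [120, 44, 160, 70], "conf": 0.91}],
}).encode("utf-8")

raw_bytes = 306 * 1024 * 1024  # ~306 MB raw capture, as in the stereo prototype
ratio = reduction_ratio(raw_bytes, len(artifact))
```

The framework's point is that when a workload admits an artifact this small relative to its raw input, in-orbit processing becomes attractive regardless of raw compute scale.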
[203] NCSTR: Node-Centric Decoupled Spatio-Temporal Reasoning for Video-based Human Pose Estimation
Quang Dang Huynh, Xuefei Yin, Andrew Busch, Hugo G. Espinosa, Alan Wee-Chung Liew, Matthew T. O. Worsey, Yanming Zhu
Main category: cs.CV
TL;DR: A novel node-centric framework for video-based human pose estimation that integrates visual, temporal, and structural reasoning through visuo-temporal velocity embeddings, attention-driven pose queries, and dual-branch spatio-temporal attention graphs.
Details
Motivation: Existing video-based human pose estimation methods struggle with motion blur, occlusion, and complex spatiotemporal dynamics. Current approaches relying on heatmaps or implicit feature aggregation limit joint topology expressiveness and weaken cross-frame consistency.
Method: 1) Visuo-temporal velocity-based joint embedding fusing sub-pixel joint cues and inter-frame motion; 2) Attention-driven pose-query encoder mapping joint representations to pose-aware node space; 3) Dual-branch decoupled spatio-temporal attention graph modeling temporal propagation and spatial constraints; 4) Node-space expert fusion module adaptively combining branch outputs.
Result: Extensive experiments on three widely used video pose benchmarks demonstrate superior performance over state-of-the-art methods, highlighting the value of explicit node-centric reasoning.
Conclusion: The proposed node-centric framework offers a new perspective for advancing video-based human pose estimation by explicitly integrating visual, temporal, and structural reasoning, addressing key challenges in motion blur, occlusion, and spatiotemporal dynamics.
Abstract: Video-based human pose estimation remains challenged by motion blur, occlusion, and complex spatiotemporal dynamics. Existing methods often rely on heatmaps or implicit spatio-temporal feature aggregation, which limits joint topology expressiveness and weakens cross-frame consistency. To address these problems, we propose a novel node-centric framework that explicitly integrates visual, temporal, and structural reasoning for accurate pose estimation. First, we design a visuo-temporal velocity-based joint embedding that fuses sub-pixel joint cues and inter-frame motion to build appearance- and motion-aware representations. Then, we introduce an attention-driven pose-query encoder, which applies attention over joint-wise heatmaps and frame-wise features to map the joint representations into a pose-aware node space, generating image-conditioned joint-aware node embeddings. Building upon these node embeddings, we propose a dual-branch decoupled spatio-temporal attention graph that models temporal propagation and spatial constraint reasoning in specialized local and global branches. Finally, a node-space expert fusion module is proposed to adaptively fuse the complementary outputs from both branches, integrating local and global cues for final joint predictions. Extensive experiments on three widely used video pose benchmarks demonstrate that our method outperforms state-of-the-art methods. The results highlight the value of explicit node-centric reasoning, offering a new perspective for advancing video-based human pose estimation.
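The velocity-based joint embedding can be illustrated with simple frame differencing; the paper's fusion is richer (sub-pixel cues, learned appearance features), so this is a minimal sketch of the position-plus-motion idea only:

```python
import numpy as np

def velocity_joint_embedding(joints):
    """joints: (T, J, 2) per-frame 2D joint estimates.
    Returns (T, J, 4): position fused with inter-frame velocity."""
    vel = np.diff(joints, axis=0, prepend=joints[:1])  # zero velocity at t = 0
    return np.concatenate([joints, vel], axis=-1)

seq = np.array([[[0.0, 0.0]], [[1.0, 0.5]], [[2.0, 1.0]]])  # 1 joint over 3 frames
emb = velocity_joint_embedding(seq)
```

Each node then carries both where a joint is and how it is moving, which is the appearance- and motion-aware representation the downstream graph branches consume.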
[204] DCG-Net: Dual Cross-Attention with Concept-Value Graph Reasoning for Interpretable Medical Diagnosis
Getamesay Dagnaw, Xuefei Yin, Muhammad Hassan Maqsood, Yanming Zhu, Alan Wee-Chung Liew
Main category: cs.CV
TL;DR: DCG-Net is an interpretable medical image analysis framework that combines multimodal alignment with structured concept reasoning through dual cross-attention and parametric concept graphs.
Details
Motivation: Current Concept Bottleneck Models (CBMs) for medical image analysis lack interpretability of internal decision processes and typically overlook contextual dependencies among clinical concepts, limiting their clinical utility.
Method: Proposes DCG-Net with: 1) Dual Cross-Attention module for bidirectional attention between visual tokens and textual concept-value prototypes, enabling localized evidence attribution; 2) Parametric Concept Graph initialized with Positive Pointwise Mutual Information priors and refined through sparsity-controlled message passing to capture relational structure among clinical concepts.
Result: Experiments on white blood cell morphology and skin lesion diagnosis show DCG-Net achieves state-of-the-art classification performance while producing clinically interpretable diagnostic explanations.
Conclusion: DCG-Net provides an end-to-end interpretable framework that integrates multimodal alignment with structured concept reasoning, offering both high performance and clinically meaningful explanations for medical image analysis.
Abstract: Deep learning models have achieved strong performance in medical image analysis, but their internal decision processes remain difficult to interpret. Concept Bottleneck Models (CBMs) partially address this limitation by structuring predictions through human-interpretable clinical concepts. However, existing CBMs typically overlook the contextual dependencies among concepts. To address these issues, we propose an end-to-end interpretable framework DCG-Net that integrates multimodal alignment with structured concept reasoning. DCG-Net introduces a Dual Cross-Attention module that replaces cosine similarity matching with bidirectional attention between visual tokens and canonicalized textual concept-value prototypes, enabling spatially localized evidence attribution. To capture the relational structure inherent to clinical concepts, we develop a Parametric Concept Graph initialized with Positive Pointwise Mutual Information priors and refined through sparsity-controlled message passing. This formulation models inter-concept dependencies in a manner consistent with clinical domain knowledge. Experiments on white blood cell morphology and skin lesion diagnosis demonstrate that DCG-Net achieves state-of-the-art classification performance while producing clinically interpretable diagnostic explanations.
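PPMI itself is a standard quantity; a minimal sketch of the graph prior, assuming edge weights are derived from concept co-occurrence counts (the counting scheme is an assumption):

```python
import numpy as np

def ppmi(cooc):
    """Positive Pointwise Mutual Information from a co-occurrence matrix."""
    cooc = np.asarray(cooc, dtype=float)
    p_ij = cooc / cooc.sum()
    p_i = p_ij.sum(axis=1, keepdims=True)
    p_j = p_ij.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore"):
        pmi = np.log(p_ij / (p_i * p_j))
    return np.maximum(pmi, 0.0)  # keep only positive association as an edge prior

# Concepts 0 and 1 co-occur above chance; concept 2 never co-occurs with them.
counts = np.array([[10, 8, 0],
                   [8, 10, 0],
                   [0, 0, 10]])
prior = ppmi(counts)
```

Clipping at zero means the initial graph encodes only positive clinical associations; the sparsity-controlled message passing then refines these edges during training.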
[205] Polarimetric Light Transport Analysis for Specular Inter-reflection
Ryota Maeda, Shinsaku Hiura
Main category: cs.CV
TL;DR: A novel polarization-based method decomposes specular inter-reflections in metal objects by analyzing rotation direction of linear polarization to distinguish direct from inter-reflections.
Details
Motivation: Existing polarization-based decomposition methods focus only on direct reflections and overlook multiple reflections, particularly specular inter-reflection in metal objects, which limits accurate analysis of complex light transport phenomena.
Method: The method uses the rotation direction of linear polarization as a discriminative factor between direct and inter-reflection. By actively rotating the linear polarization of incident light and analyzing the rotation direction of reflected light, it decomposes reflectance components in metal objects.
Result: Evaluation with synthetic and real data demonstrates effectiveness in decomposing specular inter-reflections of metal objects. The method can be combined with other decomposition techniques for detailed light transport analysis and improves 3D measurement accuracy against strong specular inter-reflection.
Conclusion: The proposed polarization-based decomposition method successfully addresses the limitation of existing methods by handling specular inter-reflection in metal objects, offering practical applications in 3D measurement and detailed light transport analysis.
Abstract: Polarization is well known for its ability to decompose diffuse and specular reflections. However, the existing decomposition methods only focus on direct reflection and overlook multiple reflections, especially specular inter-reflection. In this paper, we propose a novel decomposition method for handling specular inter-reflection of metal objects by using a unique polarimetric feature: the rotation direction of linear polarization. This rotation direction serves as a discriminative factor between direct and inter-reflection on specular surfaces. To decompose the reflectance components, we actively rotate the linear polarization of incident light and analyze the rotation direction of the reflected light. We evaluate our method using both synthetic and real data, demonstrating its effectiveness in decomposing specular inter-reflections of metal objects. Furthermore, we demonstrate that our method can be combined with other decomposition methods for a detailed analysis of light transport. As a practical application, we show its effectiveness in improving the accuracy of 3D measurement against strong specular inter-reflection.
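A cartoon of the discriminative cue: assume each mirror-like bounce negates the angle of linear polarization (AoLP), so one bounce reverses the rotation direction of the incident polarization and a second bounce restores it. This toy model is an assumption for illustration, not the paper's full polarimetric analysis:

```python
def bounce_aolp(theta_in, n_bounces):
    """Toy model: each specular bounce negates the polarization angle."""
    return ((-1) ** n_bounces) * theta_in

def classify(theta_a, theta_b, aolp_a, aolp_b):
    """'direct' (odd bounces) if the reflected AoLP rotates against the
    incident rotation; 'inter-reflection' (even bounces) if it rotates with it."""
    incident_up = theta_b > theta_a
    reflected_up = aolp_b > aolp_a
    return "inter-reflection" if incident_up == reflected_up else "direct"

a, b = 0.1, 0.3  # two incident polarization angles (radians)
direct = classify(a, b, bounce_aolp(a, 1), bounce_aolp(b, 1))
inter = classify(a, b, bounce_aolp(a, 2), bounce_aolp(b, 2))
```

Actively rotating the incident polarization and watching which way the reflected AoLP rotates is what lets the method separate single-bounce from two-bounce specular paths.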
[206] Prompt-Free Lightweight SAM Adaptation for Histopathology Nuclei Segmentation with Strong Cross-Dataset Generalization
Muhammad Hassan Maqsood, Yanming Zhu, Alfred Lam, Getamesay Dagnaw, Xuefei Yin, Alan Wee-Chung Liew
Main category: cs.CV
TL;DR: A prompt-free, lightweight SAM adaptation for histopathology nuclei segmentation using multi-level encoder features and residual decoding with only 4.1M trainable parameters via LoRA fine-tuning.
Details
Motivation: Existing nuclei segmentation methods are computationally heavy and lack generalization across datasets, limiting practical deployment. SAM-based approaches rely on prompts or complex decoders, making them unsuitable for dense histopathology images with heterogeneous appearances.
Method: Proposes a prompt-free SAM adaptation that leverages multi-level encoder features and residual decoding. Only fine-tunes LoRA modules within the frozen SAM encoder, requiring just 4.1M trainable parameters for efficient adaptation.
Result: Achieves state-of-the-art performance on three benchmark datasets (TNBC, MoNuSeg, PanNuke) and demonstrates strong cross-dataset generalization capabilities.
Conclusion: The proposed framework provides an effective and practical solution for histopathology nuclei segmentation with computational efficiency and strong generalization across datasets.
Abstract: Histopathology nuclei segmentation is crucial for quantitative tissue analysis and cancer diagnosis. Although existing segmentation methods have achieved strong performance, they are often computationally heavy and show limited generalization across datasets, which constrains their practical deployment. Recent SAM-based approaches have shown great potential in general and medical imaging, but typically rely on prompt guidance or complex decoders, making them less suitable for histopathology images with dense nuclei and heterogeneous appearances. We propose a prompt-free and lightweight SAM adaptation that leverages multi-level encoder features and residual decoding for accurate and efficient nuclei segmentation. The framework fine-tunes only LoRA modules within the frozen SAM encoder, requiring just 4.1M trainable parameters. Experiments on three benchmark datasets TNBC, MoNuSeg, and PanNuke demonstrate state-of-the-art performance and strong cross-dataset generalization, highlighting the effectiveness and practicality of the proposed framework for histopathology applications.
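LoRA's low-rank update is standard; a minimal NumPy sketch of an adapter wrapping a frozen weight (shapes are illustrative, not SAM's, and the 4.1M total comes from the paper, not this toy):

```python
import numpy as np

class LoRALinear:
    """A frozen linear layer plus a trainable low-rank update (standard LoRA)."""

    def __init__(self, W_frozen, rank=4, alpha=4.0, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = W_frozen.shape
        self.W = W_frozen                             # frozen pretrained weight
        self.A = rng.normal(0.0, 0.01, (rank, d_in))  # trainable down-projection
        self.B = np.zeros((d_out, rank))              # trainable up-projection, zero-init
        self.scale = alpha / rank

    def __call__(self, x):
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

W = np.eye(8)          # stand-in for a frozen encoder weight
layer = LoRALinear(W)
x = np.ones((1, 8))
y = layer(x)
```

With B zero-initialized the adapter starts as an exact identity update, so fine-tuning begins from the pretrained encoder's behavior while training only the small A and B matrices.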
[207] From Scale to Speed: Adaptive Test-Time Scaling for Image Editing
Xiangyan Qu, Zhenlong Yuan, Jing Tang, Rui Chen, Datao Tang, Meng Yu, Lei Sun, Yancheng Bai, Xiangxiang Chu, Gaopeng Gou, Gang Xiong, Yujun Cai
Main category: cs.CV
TL;DR: ADE-CoT is an adaptive test-time scaling framework for image editing that improves efficiency by dynamically allocating resources based on edit difficulty, using edit-specific verification for early pruning, and opportunistic stopping when intent-aligned results are found.
Details
Motivation: Image-CoT methods work well for text-to-image generation but are inefficient for image editing due to constrained solution spaces, fixed sampling budgets, unreliable MLLM verification, and redundant results. The paper aims to adapt Image-CoT principles specifically for image editing tasks.
Method: ADE-CoT incorporates three strategies: 1) difficulty-aware resource allocation that assigns dynamic budgets based on estimated edit difficulty, 2) edit-specific verification using region localization and caption consistency for early pruning, and 3) depth-first opportunistic stopping guided by an instance-specific verifier.
Result: Experiments on three state-of-the-art editing models (Step1X-Edit, BAGEL, FLUX.1 Kontext) across three benchmarks show ADE-CoT achieves superior performance-efficiency trade-offs, obtaining better performance with more than 2x speedup over Best-of-N with comparable sampling budgets.
Conclusion: ADE-CoT successfully adapts Image-CoT principles to image editing by addressing the unique challenges of goal-directed editing, demonstrating significant improvements in both efficiency and performance through adaptive resource allocation and verification strategies.
Abstract: Image Chain-of-Thought (Image-CoT) is a test-time scaling paradigm that improves image generation by extending inference time. Most Image-CoT methods focus on text-to-image (T2I) generation. Unlike T2I generation, image editing is goal-directed: the solution space is constrained by the source image and instruction. This mismatch causes three challenges when applying Image-CoT to editing: inefficient resource allocation with fixed sampling budgets, unreliable early-stage verification using general MLLM scores, and redundant edited results from large-scale sampling. To address this, we propose ADaptive Edit-CoT (ADE-CoT), an on-demand test-time scaling framework to enhance editing efficiency and performance. It incorporates three key strategies: (1) a difficulty-aware resource allocation that assigns dynamic budgets based on estimated edit difficulty; (2) edit-specific verification in early pruning that uses region localization and caption consistency to select promising candidates; and (3) depth-first opportunistic stopping, guided by an instance-specific verifier, that terminates when intent-aligned results are found. Extensive experiments on three SOTA editing models (Step1X-Edit, BAGEL, FLUX.1 Kontext) across three benchmarks show that ADE-CoT achieves superior performance-efficiency trade-offs. With comparable sampling budgets, ADE-CoT obtains better performance with more than 2x speedup over Best-of-N.
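The control flow of the three strategies can be sketched as a single loop; the difficulty estimator, verifier, and budget schedule below are placeholders, not ADE-CoT's actual components:

```python
import random

def ade_cot(edit_fn, verify_fn, difficulty, max_budget=8, threshold=0.9):
    """Difficulty-aware sampling with early stopping (control-flow sketch)."""
    budget = max(1, round(max_budget * difficulty))  # (1) difficulty-aware budget
    best, best_score = None, -1.0
    for _ in range(budget):
        candidate = edit_fn()
        score = verify_fn(candidate)                 # (2) edit-specific verification
        if score > best_score:
            best, best_score = candidate, score
        if score >= threshold:                       # (3) opportunistic stopping
            break
    return best, best_score

random.seed(0)
result, score = ade_cot(
    edit_fn=lambda: random.random(),  # stand-in for one sampled edit
    verify_fn=lambda c: c,            # stand-in for the verifier score
    difficulty=0.5,
)
```

Easy edits consume a small budget and stop as soon as an intent-aligned candidate appears, which is where the speedup over fixed-budget Best-of-N comes from.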
[208] High-fidelity Multi-view Normal Integration with Scale-encoded Neural Surface Representation
Tongyu Yang, Heng Guo, Yasuyuki Matsushita, Fumio Okura, Yu Luo, Xin Fan
Main category: cs.CV
TL;DR: Scale-encoded neural surface representation for multi-view normal integration that addresses pixel coverage variation across distances to preserve high-frequency details.
Details
Motivation: Existing multi-view normal integration methods sample single rays per pixel without considering pixel coverage area variation with camera distance, causing multi-view normal inconsistency and blurring of high-frequency surface details.
Method: Proposes scale-encoded neural surface representation incorporating pixel coverage area, associating each 3D point with spatial scale and calculating normals from hybrid grid-based encoding. Includes mesh extraction module assigning optimal local scale to each vertex based on training observations.
Result: Method consistently yields high-fidelity surface reconstruction from normals observed at varying distances, outperforming existing multi-view normal integration methods.
Conclusion: Scale-aware approach effectively addresses multi-view normal inconsistency caused by varying camera distances, enabling preservation of high-frequency surface details in reconstruction.
Abstract: Previous multi-view normal integration methods typically sample a single ray per pixel, without considering the spatial area covered by each pixel, which varies with camera intrinsics and the camera-to-object distance. Consequently, when the target object is captured at different distances, the normals at corresponding pixels may differ across views. This multi-view surface normal inconsistency results in the blurring of high-frequency details in the reconstructed surface. To address this issue, we propose a scale-encoded neural surface representation that incorporates the pixel coverage area into the neural representation. By associating each 3D point with a spatial scale and calculating its normal from a hybrid grid-based encoding, our method effectively represents multi-scale surface normals captured at varying distances. Furthermore, to enable scale-aware surface reconstruction, we introduce a mesh extraction module that assigns an optimal local scale to each vertex based on the training observations. Experimental results demonstrate that our approach consistently yields high-fidelity surface reconstruction from normals observed at varying distances, outperforming existing multi-view normal integration methods.
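The core geometric observation, that a pixel's surface footprint grows with camera distance, follows from the pinhole model; a minimal sketch (the simple depth-over-focal-length formula is the standard approximation, not the paper's full scale encoding):

```python
def pixel_footprint(depth, focal_length_px):
    """Approximate side length (scene units) of the surface patch that one
    pixel covers at a given camera-to-surface depth, under a pinhole model."""
    return depth / focal_length_px

near = pixel_footprint(0.5, 1000.0)  # close-up capture
far = pixel_footprint(2.0, 1000.0)   # the same surface seen from 4x farther
```

Because the far view averages normals over a 4x larger patch, naively mixing the two views blurs detail; associating each 3D point with this spatial scale is what lets the representation keep them consistent.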
[209] Toward a Multi-View Brain Network Foundation Model: Cross-View Consistency Learning Across Arbitrary Atlases
Jiaxing Xu, Jingying Ma, Xin Lin, Yuxiao Liu, Kai He, Qika Lin, Yiping Ke, Yang Li, Dinggang Shen, Mengling Feng
Main category: cs.CV
TL;DR: MV-BrainFM is a multi-view brain network foundation model that learns generalizable representations from fMRI data across different brain atlases by incorporating anatomical priors and cross-view consistency learning.
Details
Motivation: Existing brain network foundation models are limited by atlas dependency, insufficient exploitation of multiple network views, and weak incorporation of anatomical priors, which hinders their generalizability and scalability across different brain atlases.
Method: Proposes MV-BrainFM with anatomical distance information incorporated into Transformer-based modeling to guide inter-regional interactions, and introduces unsupervised cross-view consistency learning to align representations from multiple atlases of the same subject in a shared latent space.
Result: Extensive experiments on 20K+ subjects from 17 fMRI datasets show MV-BrainFM consistently outperforms 14 existing brain network foundation models and task-specific baselines under both single-atlas and multi-atlas settings.
Conclusion: MV-BrainFM provides a scalable, atlas-agnostic foundation model for brain network analysis that effectively captures complementary information across heterogeneous network views while maintaining anatomical awareness.
Abstract: Brain network analysis provides an interpretable framework for characterizing brain organization and has been widely used for neurological disorder identification. Recent advances in self-supervised learning have motivated the development of brain network foundation models. However, existing approaches are often limited by atlas dependency, insufficient exploitation of multiple network views, and weak incorporation of anatomical priors. In this work, we propose MV-BrainFM, a multi-view brain network foundation model designed to learn generalizable and scalable representations from brain networks constructed with arbitrary atlases. MV-BrainFM explicitly incorporates anatomical distance information into Transformer-based modeling to guide inter-regional interactions, and introduces an unsupervised cross-view consistency learning strategy to align representations from multiple atlases of the same subject in a shared latent space. By jointly enforcing within-view robustness and cross-view alignment during pretraining, the model effectively captures complementary information across heterogeneous network views while remaining atlas-aware. In addition, MV-BrainFM adopts a unified multi-view pretraining paradigm that enables simultaneous learning from multiple datasets and atlases, significantly improving computational efficiency compared to conventional sequential training strategies. The proposed framework also demonstrates strong scalability, consistently benefiting from increasing data diversity while maintaining stable performance across unseen atlas configurations. Extensive experiments on more than 20K subjects from 17 fMRI datasets show that MV-BrainFM consistently outperforms 14 existing brain network foundation models and task-specific baselines under both single-atlas and multi-atlas settings.
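The cross-view alignment idea can be sketched as a simple consistency loss pulling same-subject embeddings from different atlases together in the shared latent space; the actual pretraining objective is an assumption here:

```python
import numpy as np

def cross_view_consistency_loss(z_atlas_a, z_atlas_b):
    """Mean squared distance between L2-normalized embeddings of the same
    subjects computed under two different atlases (illustrative sketch)."""
    a = z_atlas_a / np.linalg.norm(z_atlas_a, axis=-1, keepdims=True)
    b = z_atlas_b / np.linalg.norm(z_atlas_b, axis=-1, keepdims=True)
    return float(((a - b) ** 2).mean())

# 4 subjects, 16-dim embeddings from two hypothetical atlas views.
z = np.random.default_rng(0).normal(size=(4, 16))
aligned = cross_view_consistency_loss(z, z)
misaligned = cross_view_consistency_loss(z, -z)
```

Minimizing such a term during pretraining forces the encoder to produce atlas-agnostic subject representations, which is what enables transfer across unseen atlas configurations.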
[210] Uni-Classifier: Leveraging Video Diffusion Priors for Universal Guidance Classifier
Yujie Zhou, Pengyang Ling, Jiazi Bu, Bingjie Gao, Li Niu
Main category: cs.CV
TL;DR: Uni-Classifier (Uni-C) is a plug-and-play module that uses video diffusion priors to align outputs between chained generative models, addressing distributional mismatches in AI workflows.
Details
Motivation: Complex AI workflows often chain multiple generative models (e.g., 2D image → video/3D generation), but distributional mismatches between upstream outputs and downstream input expectations degrade overall quality.
Method: Uni-C leverages video diffusion priors to guide the denoising process of preceding models, aligning their outputs with downstream requirements. It can be used as a plug-and-play module in workflows or independently to enhance individual model outputs.
Result: Extensive experiments across video and 3D generation tasks show Uni-C consistently improves generation quality in both workflow-based and standalone settings.
Conclusion: Uni-C demonstrates versatility and strong generalization capability for improving generative model outputs and addressing distributional mismatches in chained AI workflows.
Abstract: In practical AI workflows, complex tasks often involve chaining multiple generative models, such as using a video or 3D generation model after a 2D image generator. However, distributional mismatches between the output of upstream models and the expected input of downstream models frequently degrade overall generation quality. To address this issue, we propose Uni-Classifier (Uni-C), a simple yet effective plug-and-play module that leverages video diffusion priors to guide the denoising process of preceding models, thereby aligning their outputs with downstream requirements. Uni-C can also be applied independently to enhance the output quality of individual generative models. Extensive experiments across video and 3D generation tasks demonstrate that Uni-C consistently improves generation quality in both workflow-based and standalone settings, highlighting its versatility and strong generalization capability.
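Guiding an upstream denoising process with a downstream prior can be sketched as classifier-style guidance: each denoising step is nudged by a gradient supplied by the prior. The toy denoiser and quadratic prior below are illustrative assumptions standing in for Uni-C's video-diffusion components:

```python
import numpy as np

def guided_denoise_step(x, denoise_fn, guidance_grad_fn, guidance_scale=0.2):
    """One denoising step nudged by a gradient from the downstream prior."""
    x_denoised = denoise_fn(x)
    return x_denoised - guidance_scale * guidance_grad_fn(x_denoised)

# Toy setup: the "downstream prior" prefers samples near `target`.
target = np.array([1.0, 2.0])
toy_denoiser = lambda x: 0.9 * x + 0.1 * target
prior_grad = lambda x: 2.0 * (x - target)  # gradient of ||x - target||^2

x = np.zeros(2)
for _ in range(60):
    x = guided_denoise_step(x, toy_denoiser, prior_grad)
```

The upstream model keeps its own denoising trajectory while the guidance term steers its output distribution toward what the downstream model expects, which is why the module is plug-and-play.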
[211] Multi-Stage Fine-Tuning of Pathology Foundation Models with Head-Diverse Ensembling for White Blood Cell Classification
Antony Gitau, Martin Paulson, Bjørn-Jostein Singstad, Karl Thomas Hjelmervik, Ola Marius Lysaker, Veralia Gabriela Sanchez
Main category: cs.CV
TL;DR: Multi-stage fine-tuning approach for 13-class WBC classification using DINOBloom-base with ensemble of specialized classifier heads to address morphological continuum confusion in leukemia diagnosis
Details
Motivation: Automated white blood cell classification for leukemia diagnosis faces challenges including class imbalance, domain shift, and morphological continuum confusion where adjacent maturation stages have subtle, overlapping features.
Method: Multi-stage fine-tuning of DINOBloom-base model with three classifier head families (linear, cosine, MLP), then constructing ensemble where MLP is primary predictor with confusion pair resolution when two other heads agree.
Result: Different heads specialized for different granulocyte classes: cosine best for mature (BNE F1=0.470), linear for immature (MMY F1=0.585), MLP for most immature (PMY F1=0.733). Ensemble improved performance on confusion pairs
Conclusion: Head-diverse ensemble leverages class-specific specialization to address morphological continuum confusion, with consistent misclassifications revealing probable labeling errors or inherent ambiguity
Abstract: The classification of white blood cells (WBCs) from peripheral blood smears is critical for the diagnosis of leukemia. However, automated approaches still struggle due to challenges including class imbalance, domain shift, and morphological continuum confusion, where adjacent maturation stages exhibit subtle, overlapping features. We present a multi-stage fine-tuning methodology for 13-class WBC classification in the WBCBench 2026 Challenge (ISBI 2026). Our best-performing model is a fine-tuned DINOBloom-base, on which we train multiple classifier head families (linear, cosine, and multilayer perceptron (MLP)). The cosine head performed best on the mature granulocyte boundary (Band neutrophil (BNE) F1 = 0.470), the linear head on more immature granulocyte classes (Metamyelocyte (MMY) F1 = 0.585), and the MLP head on the most immature granulocyte (Promyelocyte (PMY) F1 = 0.733), revealing class-specific specialization. Based on this specialization, we construct a head-diverse ensemble, where the MLP head acts as the primary predictor, and its predictions within the four predefined confusion pairs are replaced only when two other head families agree. We further show that cases consistently misclassified by all models are substantially enriched for probable labeling errors or inherent morphological ambiguity.
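The ensemble rule above is simple to state in code: the MLP head predicts by default, and inside a predefined confusion pair its prediction is overridden only when the other two head families agree. The pair set below is hypothetical (the paper defines four pairs but does not list them here):

```python
CONFUSION_PAIRS = [{"BNE", "SNE"}, {"MMY", "MY"}]  # hypothetical pair set

def ensemble(mlp_pred, linear_pred, cosine_pred):
    """MLP is the primary predictor; inside a confusion pair its prediction
    is replaced only when the linear and cosine heads agree on a pair member."""
    for pair in CONFUSION_PAIRS:
        if mlp_pred in pair and linear_pred == cosine_pred and linear_pred in pair:
            return linear_pred
    return mlp_pred

case_override = ensemble("BNE", "SNE", "SNE")  # two heads agree: override
case_keep = ensemble("BNE", "SNE", "BNE")      # no agreement: keep MLP
case_outside = ensemble("PMY", "MMY", "MMY")   # outside any pair: keep MLP
```

Requiring agreement from two independent head families before overriding keeps the correction conservative, so the MLP's overall strength is preserved outside the ambiguous boundaries.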
[212] Jigsaw Regularization in Whole-Slide Image Classification
So Won Jeong, Veronika Ročková
Main category: cs.CV
TL;DR: Graph-based MIL with vision foundation model embeddings and jigsaw regularization improves pathology slide classification by incorporating local and global spatial structure.
Details
Motivation: Current multiple instance learning (MIL) methods for computational pathology treat tissue patches as exchangeable, ignoring the rich spatial and topological structure inherent in whole-slide images, which limits classification performance.
Method: Two novel components: (1) using vision foundation model embeddings to capture local spatial structure within each patch, and (2) employing graph neural networks with jigsaw regularization to achieve across-patch spatial awareness in tissue images.
Result: The combination of vision foundation model embeddings and jigsaw regularization markedly improves classification performance over state-of-the-art attention-based MIL approaches on benchmark datasets for breast, head-and-neck, and colon cancer.
Conclusion: Incorporating both local patch-level spatial structure (via foundation model embeddings) and global across-patch spatial relationships (via graph neural networks with jigsaw regularization) significantly enhances computational pathology classification by better leveraging the inherent spatial organization of tissue images.
Abstract: Computational pathology involves the digitization of stained tissues into whole-slide images (WSIs) that contain billions of pixels arranged as contiguous patches. Statistical analysis of WSIs largely focuses on classification via multiple instance learning (MIL), in which slide-level labels are inferred from unlabeled patches. Most MIL methods treat patches as exchangeable, overlooking the rich spatial and topological structure that underlies tissue images. This work builds on recent graph-based methods that aim to incorporate spatial awareness into MIL. Our approach is new in two regards: (1) we deploy vision foundation-model embeddings to incorporate local spatial structure within each patch, and (2) achieve across-patch spatial awareness using graph neural networks together with a novel jigsaw regularization. We find that a combination of these two features markedly improves classification over state-of-the-art attention-based MIL approaches on benchmark datasets in breast, head-and-neck, and colon cancer.
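The paper's jigsaw regularization is novel and not specified in this summary; as a stand-in, the sketch below uses the classic jigsaw pretext task, asking the model to recover each shuffled patch's original position, to show the general shape of such a regularizer:

```python
import numpy as np

def jigsaw_loss(order_logits, true_positions):
    """Cross-entropy between predicted patch positions and the permutation
    actually applied to the patch embeddings before encoding."""
    logits = np.asarray(order_logits, dtype=float)
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    picked = log_probs[np.arange(len(true_positions)), true_positions]
    return float(-picked.mean())

# A model that recovers the true order gets near-zero loss; a confidently
# wrong one is penalized heavily.
good = jigsaw_loss(np.eye(3) * 10.0, [0, 1, 2])
bad = jigsaw_loss(np.eye(3) * 10.0, [1, 2, 0])
```

A term of this kind forces the graph network's patch representations to encode spatial position, which is one way to inject the across-patch spatial awareness the paper targets.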
[213] Monocular Models are Strong Learners for Multi-View Human Mesh Recovery
Haoyu Xie, Shengkai Xu, Cheng Guo, Muhammad Usama Saleem, Wenhan Wu, Chen Chen, Ahmed Helmy, Pu Wang, Hongfei Xue
Main category: cs.CV
TL;DR: Training-free multi-view human mesh recovery framework that uses pretrained single-view models as priors, eliminating need for multi-view training data or camera calibration.
Details
Motivation: Existing multi-view human mesh recovery methods have limitations: geometry-based approaches require cumbersome camera calibration, while learning-based methods generalize poorly to unseen camera configurations due to lack of multi-view training data. There's a need for calibration-free reconstruction that generalizes to arbitrary camera setups.
Method: Proposes a training-free framework leveraging pretrained single-view HMR models as strong priors. First constructs robust multi-view initialization from single-view predictions, then refines via test-time optimization guided by multi-view consistency and anatomical constraints.
Result: Extensive experiments demonstrate state-of-the-art performance on standard benchmarks, surpassing multi-view models trained with explicit multi-view supervision.
Conclusion: The proposed training-free framework enables calibration-free multi-view human mesh recovery that generalizes to arbitrary camera setups, overcoming limitations of existing geometry-based and learning-based approaches.
Abstract: Multi-view human mesh recovery (HMR) is broadly deployed in diverse domains where high accuracy and strong generalization are essential. Existing approaches can be broadly grouped into geometry-based and learning-based methods. However, geometry-based methods (e.g., triangulation) rely on cumbersome camera calibration, while learning-based approaches often generalize poorly to unseen camera configurations due to the lack of multi-view training data, limiting their performance in real-world scenarios. To enable calibration-free reconstruction that generalizes to arbitrary camera setups, we propose a training-free framework that leverages pretrained single-view HMR models as strong priors, eliminating the need for multi-view training data. Our method first constructs a robust and consistent multi-view initialization from single-view predictions, and then refines it via test-time optimization guided by multi-view consistency and anatomical constraints. Extensive experiments demonstrate state-of-the-art performance on standard benchmarks, surpassing multi-view models trained with explicit multi-view supervision.
[214] FAAR: Efficient Frequency-Aware Multi-Task Fine-Tuning via Automatic Rank Selection
Maxime Fontana, Michael Spratling, Miaojing Shi
Main category: cs.CV
TL;DR: FAAR introduces frequency-aware automatic rank adaptation for efficient multi-task learning fine-tuning with performance-driven rank shrinking and task-spectral pyramidal decoder
Details
Motivation: Traditional full fine-tuning is inefficient for multi-task learning as cost scales with tasks. Existing PEFT methods use fixed ranks and fail to capture spatial inter-task relationships needed for diverse task predictions.
Method: FAAR uses Performance-Driven Rank Shrinking (PDRS) to allocate optimal rank per adapter location and task, and Task-Spectral Pyramidal Decoder (TS-PD) that analyzes image frequency spectrum to inject input-specific context into spatial bias learning for cross-task relationships.
Result: FAAR reduces parameters by up to 9x compared to traditional MTL fine-tuning while improving overall performance on dense visual task benchmarks, outperforming other PEFT methods in both accuracy and efficiency.
Conclusion: FAAR provides an effective parameter-efficient fine-tuning approach for multi-task learning that dynamically adapts ranks and captures spatial inter-task relationships through frequency analysis.
Abstract: Adapting models pre-trained on large-scale datasets is a proven way to reach strong performance quickly for downstream tasks. However, the growth of state-of-the-art models makes traditional full fine-tuning unsuitable and difficult, especially for multi-task learning (MTL) where cost scales with the number of tasks. As a result, recent studies investigate parameter-efficient fine-tuning (PEFT) using low-rank adaptation to significantly reduce the number of trainable parameters. However, these existing methods use a single, fixed rank, which may not be optimal for different tasks or positions in the MTL architecture. Moreover, these methods fail to learn spatial information that captures inter-task relationships and helps to improve diverse task predictions. This paper introduces Frequency-Aware and Automatic Rank (FAAR) for efficient MTL fine-tuning. Our method introduces Performance-Driven Rank Shrinking (PDRS) to allocate the optimal rank per adapter location and per task. Moreover, by analyzing the image frequency spectrum, FAAR proposes a Task-Spectral Pyramidal Decoder (TS-PD) that injects input-specific context into spatial bias learning to better reflect cross-task relationships. Experiments performed on dense visual task benchmarks show the superiority of our method in terms of both accuracy and efficiency compared to other PEFT methods in MTL. FAAR reduces the number of parameters by up to 9 times compared to traditional MTL fine-tuning whilst improving overall performance. Our code is available.
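The low-rank adaptation mechanism whose rank PDRS allocates per location can be sketched as the standard LoRA update, where the rank `r` is free to differ per adapter. A minimal NumPy sketch with illustrative shapes, not the paper's code:

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=1.0):
    """Low-rank adapted linear layer: y = x W^T + (alpha/r) (x A^T) B^T.

    W: (out, in) frozen pretrained weight.
    A: (r, in), B: (out, r) trainable low-rank factors; the rank r is what
    a method like PDRS would tune per adapter location and per task.
    """
    r = A.shape[0]
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 32))
W = rng.normal(size=(64, 32))
A = rng.normal(size=(8, 32))        # rank 8 at this hypothetical location
B = np.zeros((64, 8))               # zero-init B: adapter starts as a no-op
y = lora_forward(x, W, A, B)
print(np.allclose(y, x @ W.T))      # True: the base model is untouched at init
```

Shrinking `r` (dropping rows of `A` and columns of `B`) is what trades parameters against capacity at each location.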
[215] PEARL: Personalized Streaming Video Understanding Model
Yuanhong Zheng, Ruichuan An, Xiaopeng Lin, Yuxing Liu, Sihan Yang, Huanyu Zhang, Haodong Li, Qintong Zhang, Renrui Zhang, Guopeng Li, Yifan Zhang, Yuheng Li, Wentao Zhang
Main category: cs.CV
TL;DR: Proposes Personalized Streaming Video Understanding (PSVU) task, introduces PEARL-Bench benchmark, and presents PEARL training-free method for real-time personalized video understanding.
Details
Motivation: Current multimodal personalization methods are limited to static images or offline videos, disconnected from continuous visual input and real-time feedback needed for interactive AI assistants.
Method: Formally defines PSVU task, creates PEARL-Bench with 132 videos and 2,173 fine-grained annotations, and proposes PEARL - a plug-and-play, training-free strategy for streaming video personalization.
Result: PEARL achieves state-of-the-art performance across 8 offline and online models, brings consistent PSVU improvements to 3 distinct architectures, proving to be effective and robust.
Conclusion: This work advances vision-language model personalization and inspires research into streaming personalized AI assistants, with PEARL serving as a strong baseline for PSVU.
Abstract: Human cognition of new concepts is inherently a streaming process: we continuously recognize new objects or identities and update our memories over time. However, current multimodal personalization methods are largely limited to static images or offline videos. This disconnects continuous visual input from instant real-world feedback, limiting their ability to provide the real-time, interactive personalized responses essential for future AI assistants. To bridge this gap, we first propose and formally define the novel task of Personalized Streaming Video Understanding (PSVU). To facilitate research in this new direction, we introduce PEARL-Bench, the first comprehensive benchmark designed specifically to evaluate this challenging setting. It evaluates a model’s ability to respond to personalized concepts at exact timestamps under two modes: (1) Frame-level, focusing on a specific person or object in discrete frames, and (2) a novel Video-level, focusing on personalized actions unfolding across continuous frames. PEARL-Bench comprises 132 unique videos and 2,173 fine-grained annotations with precise timestamps. Concept diversity and annotation quality are strictly ensured through a combined pipeline of automated generation and human verification. To tackle this challenging new setting, we further propose PEARL, a plug-and-play, training-free strategy that serves as a strong baseline. Extensive evaluations across 8 offline and online models demonstrate that PEARL achieves state-of-the-art performance. Notably, it brings consistent PSVU improvements when applied to 3 distinct architectures, proving to be a highly effective and robust strategy. We hope this work advances vision-language model (VLM) personalization and inspires further research into streaming personalized AI assistants. Code is available at https://github.com/Yuanhong-Zheng/PEARL.
[216] Benchmarking Efficient & Effective Camera Pose Estimation Strategies for Novel View Synthesis
Jhacson Meza, Martin R. Oswald, Torsten Sattler
Main category: cs.CV
TL;DR: A benchmark for efficient and accurate Structure-from-Motion (SfM) methods for novel view synthesis, showing that using fewer features accelerates classical SfM while maintaining accuracy, and that neural network initialization with classical refinement offers the best trade-off.
Details
Motivation: Recent neural SfM approaches are efficient but less accurate than classical bundle adjustment methods. There's a need for SfM methods that are both efficient AND effective for novel view synthesis applications.
Method: Develops a benchmark for SfM in novel view synthesis using existing datasets. Tests two strategies: 1) using fewer features to accelerate classical SfM, and 2) using feed-forward neural networks for initial estimates followed by classical refinement.
Result: 1) Fewer features significantly accelerate classical SfM while maintaining high pose accuracy. 2) Neural network initialization with classical refinement provides the best efficiency-effectiveness trade-off.
Conclusion: The benchmark demonstrates practical strategies for efficient SfM that maintain accuracy, with hybrid approaches offering optimal performance for novel view synthesis applications.
Abstract: Novel view synthesis (NVS) approaches such as NeRFs or 3DGS can produce photo-realistic 3D scene representation from a set of images with known extrinsic and intrinsic parameters. The necessary camera poses and calibrations are typically obtained from the images via Structure-from-Motion (SfM). Classical SfM approaches rely on local feature matches between the images to estimate both the poses and a sparse 3D model of the scene, using bundle adjustment to refine initial pose, intrinsics, and geometry estimates. In order to increase run-time efficiency, recent SfM systems forgo optimization via bundle adjustment. Instead, they train feed-forward (transformer-based) neural networks to directly regress camera parameters and the 3D structure. While orders of magnitude more efficient, such recent works produce significantly less accurate estimates. To stimulate research on developing SfM approaches that are both efficient \emph{and} effective, this paper develops a benchmark focused on SfM for novel view synthesis. Using existing datasets and two simple strategies for making the reconstruction process more efficient, we show that: (1) simply using fewer features already significantly accelerates classical SfM methods while maintaining high pose accuracy. (2) using feed-forward networks to obtain initial estimates and refining them using classical SfM techniques leads to the best efficiency-effectiveness trade-off. We will make our benchmark and code publicly available.
[217] Inverting Neural Networks: New Methods to Generate Neural Network Inputs from Prescribed Outputs
Rebecca Pattichis, Sebastian Janampa, Constantinos S. Pattichis, Marios S. Pattichis
Main category: cs.CV
TL;DR: The paper presents two methods for solving the inverse problem of finding input images that map to specific neural network classes, revealing vulnerabilities in networks by generating random-like images that achieve near-perfect classification scores.
Details
Motivation: Neural networks are complex black-box systems, and understanding what input images get mapped to specific classes is challenging. The authors aim to solve this inverse problem to reveal recognizable features associated with class classifications and expose network vulnerabilities.
Method: Two general methods: 1) Forward pass method using root-finding algorithm and Jacobian with respect to input images, 2) Backward pass method that iteratively inverts each layer from the top, adding random vectors sampled from the null-space of each linear layer.
Result: The methods successfully produce random-like input images that yield near-perfect classification scores on both transformer architectures and sequential networks based on linear layers, revealing vulnerabilities in the underlying networks.
Conclusion: The proposed methods provide more comprehensive coverage of input image spaces that solve the inverse mapping problem, demonstrating that networks can be fooled by seemingly random images that achieve high classification confidence.
Abstract: Neural network systems describe complex mappings that can be very difficult to understand. In this paper, we study the inverse problem of determining the input images that get mapped to specific neural network classes. Ultimately, we expect that these images contain recognizable features that are associated with their corresponding class classifications. We introduce two general methods for solving the inverse problem. In our forward pass method, we develop an inverse method based on a root-finding algorithm and the Jacobian with respect to the input image. In our backward pass method, we iteratively invert each layer, starting at the top. During the inversion process, we add random vectors sampled from the null-space of each linear layer. We demonstrate our new methods on both transformer architectures and sequential networks based on linear layers. Unlike previous methods, we show that our new methods are able to produce random-like input images that yield near-perfect classification scores in all cases, revealing vulnerabilities in the underlying networks. Hence, we conclude that the proposed methods provide a more comprehensive coverage of the input image spaces that solve the inverse mapping problem.
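The null-space step at the heart of the backward-pass method is easy to see in isolation: for a wide linear layer, any vector in the null space of its weight matrix can be added to a recovered input without changing the layer's output. A self-contained NumPy sketch of that one step (not the full inversion pipeline):

```python
import numpy as np

def null_space(W, tol=1e-10):
    """Orthonormal basis of the null space of W, via SVD.

    Rows of Vh beyond the numerical rank span ker(W); return them as columns.
    """
    _, s, vh = np.linalg.svd(W)
    rank = int((s > tol).sum())
    return vh[rank:].T                     # (in_dim, in_dim - rank)

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 64))              # wide layer: 48-dimensional null space
x = rng.normal(size=64)
N = null_space(W)
x_perturbed = x + N @ rng.normal(size=N.shape[1])  # add a random null-space vector
print(np.allclose(W @ x, W @ x_perturbed)) # True: same output, different input
```

Repeating this layer by layer is why many distinct, random-looking inputs can all map to the same top-level class scores.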
[218] CREG: Compass Relational Evidence for Interpreting Spatial Reasoning in Vision-Language Models
Kaizhen Tan
Main category: cs.CV
TL;DR: CREG is a training-free interpretability framework that reveals directional relations in vision-language models by projecting contrastive attributions into polar coordinates, outperforming standard attribution methods on spatial reasoning tasks.
Details
Motivation: While VLMs perform well on spatial reasoning, existing attribution methods (GradCAM, attention rollout) show where models attend but not what directional relations they infer between objects. There's a need for interpretability methods that specifically reveal directional reasoning in VLMs.
Method: CREG projects multi-layer contrastive Grad-times-Act attributions into a reference-centered polar coordinate system, creating directional evidence distributions over compass sectors. It’s training-free and uses three evaluation metrics: Direction Alignment Error, Edge Accuracy, and Causal Occlusion Score.
Result: On Qwen2-VL-7B across VSR and COCO-Pairs, CREG consistently outperforms standard attribution baselines. On COCO-Pairs, it achieves DAE of 55.5° and EA of 0.553, improving over attention rollout by 16.1° in angular error and 0.120 in EA. Causal occlusion experiments show COS ≥ +0.42, supporting faithfulness.
Conclusion: Contrastive, multi-layer attribution can expose directional evidence more faithfully than standard saliency-based explanations in VLM spatial reasoning. CREG benefits from more structured spatial representations that emerge at larger model scales.
Abstract: Vision-language models (VLMs) perform strongly on spatial reasoning benchmarks, yet how they encode directional relations remains poorly understood. Existing attribution methods such as GradCAM and attention rollout reveal where a model attends, but not what direction it infers between objects. We introduce CREG (Compass Relational Evidence Graph), a training-free interpretability framework that projects multi-layer contrastive Grad-times-Act attributions into a reference-centered polar coordinate system, producing a directional evidence distribution over compass sectors. To evaluate directional explanations, we propose three metrics: Direction Alignment Error (DAE), Edge Accuracy (EA), and Causal Occlusion Score (COS). On Qwen2-VL-7B across VSR and COCO-Pairs, CREG consistently outperforms standard attribution baselines; on COCO-Pairs, prediction-targeted CREG achieves a DAE of 55.5 degrees and an EA of 0.553, improving over attention rollout by 16.1 degrees in angular error and 0.120 in EA. Causal occlusion experiments on 540 samples across both datasets further support the faithfulness of these directional explanations, with COS greater than or equal to +0.42. The gains are smaller on Qwen2-VL-2B, suggesting that CREG benefits from the more structured spatial representations that emerge at larger scales. Overall, our results show that contrastive, multi-layer attribution can expose directional evidence more faithfully than standard saliency-based explanations in VLM spatial reasoning.
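The core geometric move, binning an attribution map into compass sectors around a reference object, can be sketched directly. This is an illustrative NumPy reconstruction under stated assumptions (8 sectors, sector 0 centered on "east"), not CREG's actual code:

```python
import numpy as np

def compass_evidence(attribution, center, n_sectors=8):
    """Bin a 2D attribution map into directional sectors around a center.

    attribution: (H, W) saliency/attribution values.
    center: (row, col) of the reference object.
    Returns per-sector evidence mass; the argmax sector is the direction
    the attributions point to. Sector 0 is centered on +x ('east').
    """
    h, w = attribution.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Image rows grow downward, so negate dy to get standard math angles.
    theta = np.arctan2(-(ys - center[0]), xs - center[1])   # in (-pi, pi]
    width = 2 * np.pi / n_sectors
    sector = np.floor((theta + width / 2) / width).astype(int) % n_sectors
    return np.array([attribution[sector == k].sum() for k in range(n_sectors)])

attr = np.zeros((32, 32))
attr[16, 24:] = 1.0                      # evidence strictly to the right of center
ev = compass_evidence(attr, center=(16, 16))
print(ev.argmax())                       # 0: the 'east' sector dominates
```

A metric like Direction Alignment Error would then compare the evidence-weighted angle against the ground-truth relation's angle.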
[219] Lessons and Open Questions from a Unified Study of Camera-Trap Species Recognition Over Time
Sooyoung Jeon, Hongjie Tian, Lemeng Wang, Zheda Mai, Vidhi Bakshi, Jiacheng Hou, Ping Zhang, Arpita Chowdhury, Jianyang Gu, Wei-Lun Chao
Main category: cs.CV
TL;DR: Study on camera-trap species recognition over time, introducing a streaming benchmark and analyzing challenges of temporal shifts in ecosystems for reliable automated monitoring.
Details
Motivation: Camera traps are crucial for biodiversity monitoring but automated analysis faces challenges beyond cross-domain generalization - specifically, maintaining reliable recognition at fixed sites over time as ecosystems change dynamically.
Method: Created realistic benchmark with 546 camera traps using streaming protocol evaluating models over chronologically ordered intervals; studied site-specific adaptation, model updating with past data, and identified drivers of difficulty including class imbalance and temporal shifts.
Result: Found biological foundation models underperform at many sites; adaptation is challenging and can degrade performance below zero-shot; identified class imbalance and temporal shifts as key difficulties; effective integration of model-update and post-processing improves accuracy but gaps remain.
Conclusion: Provides actionable deployment guidelines for ecological practitioners and establishes new research directions for vision and ML, highlighting need for better temporal adaptation methods and understanding when zero-shot models succeed.
Abstract: Camera traps are vital for large-scale biodiversity monitoring, yet accurate automated analysis remains challenging due to diverse deployment environments. While the computer vision community has mostly framed this challenge as cross-domain generalization, this perspective overlooks a primary challenge faced by ecological practitioners: maintaining reliable recognition at the fixed site over time, where the dynamic nature of ecosystems introduces profound temporal shifts in both background and animal distributions. To bridge this gap, we present the first unified study of camera-trap species recognition over time. We introduce a realistic benchmark comprising 546 camera traps with a streaming protocol that evaluates models over chronologically ordered intervals. Our end-user-centric study yields four key findings. (1) Biological foundation models (e.g., BioCLIP 2) underperform at numerous sites even in initial intervals, underscoring the necessity of site-specific adaptation. (2) Adaptation is challenging under realistic evaluation: when models are updated using past data and evaluated on future intervals (mirrors real deployment lifecycles), naive adaptation can even degrade below zero-shot performance. (3) We identify two drivers of this difficulty: severe class imbalance and pronounced temporal shift in both species distribution and backgrounds between consecutive intervals. (4) We find that effective integration of model-update and post-processing techniques can largely improve accuracy, though a gap from the upper bounds remains. Finally, we highlight critical open questions, such as predicting when zero-shot models will succeed at a new site and determining whether/when model updates are necessary. Our benchmark and analysis provide actionable deployment guidelines for ecological practitioners while establishing new directions for future research in vision and machine learning.
[220] End-to-End Optimization of Polarimetric Measurement and Material Classifier
Ryota Maeda, Naoki Arikawa, Yutaka No, Shinsaku Hiura
Main category: cs.CV
TL;DR: End-to-end optimization framework for material classification using optimal polarization measurement angles
Details
Motivation: Material classification is crucial for scene understanding, and polarization provides rich material information even at distances where high-resolution texture capture is impractical. However, traditional polarimetric measurements require multiple modulations of polarization states, making the process time-consuming. The optimal configuration of measurement angles for material classification remains unclear.
Method: Proposes an end-to-end optimization framework that jointly learns a material classifier and determines optimal combinations of rotation angles for polarization elements controlling both incident and reflected light states. Uses Mueller-matrix material dataset.
Result: Demonstrates high-accuracy material classification even with a limited number of measurements.
Conclusion: The proposed framework enables efficient material classification by optimizing polarization measurement angles, reducing measurement time while maintaining accuracy.
Abstract: Material classification is a fundamental problem in computer vision and plays a crucial role in scene understanding. Previous studies have explored various material recognition methods based on reflection properties such as color, texture, specularity, and scattering. Among these cues, polarization is particularly valuable because it provides rich material information and enables recognition even at distances where capturing high-resolution texture is impractical. However, measuring polarimetric reflectance properties typically requires multiple modulations of the polarization state of the incident light, making the process time-consuming and often unnecessary for certain recognition tasks. While material classification can be achieved using only a subset of polarimetric measurements, the optimal configuration of measurement angles remains unclear. In this study, we propose an end-to-end optimization framework that jointly learns a material classifier and determines the optimal combinations of rotation angles for polarization elements that control both the incident and reflected light states. Using our Mueller-matrix material dataset, we demonstrate that our method achieves high-accuracy material classification even with a limited number of measurements.
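To make the role of the rotation angles concrete, standard Mueller calculus gives the detected intensity for one (generator, analyzer) angle pair; these angles are the quantities the end-to-end framework would optimize. A sketch using the textbook ideal-linear-polarizer Mueller matrix (not the paper's code):

```python
import numpy as np

def linear_polarizer(theta):
    """Mueller matrix of an ideal linear polarizer at angle theta (radians)."""
    c, s = np.cos(2 * theta), np.sin(2 * theta)
    return 0.5 * np.array([[1, c,   s,   0],
                           [c, c*c, c*s, 0],
                           [s, c*s, s*s, 0],
                           [0, 0,   0,   0]])

def measured_intensity(M_sample, theta_in, theta_out):
    """Detected intensity for one generator/analyzer angle pair.

    M_sample is the material's 4x4 Mueller matrix. An end-to-end framework
    would differentiate through (theta_in, theta_out) jointly with a
    classifier to pick the most discriminative small set of measurements.
    """
    s_in = linear_polarizer(theta_in) @ np.array([1.0, 0, 0, 0])  # unpolarized source
    s_out = linear_polarizer(theta_out) @ M_sample @ s_in
    return s_out[0]                       # first Stokes component = intensity

I = measured_intensity(np.eye(4), 0.0, np.pi / 2)
print(np.isclose(I, 0.0))                 # True: crossed polarizers extinguish light
```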
[221] When Negation Is a Geometry Problem in Vision-Language Models
Fawaz Sammani, Tzoulio Chamiti, Paul Gavrikov, Nikos Deligiannis
Main category: cs.CV
TL;DR: The paper addresses CLIP’s failure to understand negation in text queries, proposes MLLM-based evaluation metrics, discovers a negation direction in CLIP’s embedding space, and shows test-time intervention can improve negation understanding without fine-tuning.
Details
Motivation: CLIP and similar vision-language models fail to understand negation in text queries (e.g., "a plain blue shirt with no logos"). Prior work uses data-centric approaches with flawed evaluation metrics that don't reliably measure actual negation understanding.
Method: 1) Identifies limitations of existing retrieval-based evaluation metrics; 2) Proposes Multimodal LLMs-as-a-judge evaluation framework using yes/no questions about image content; 3) Investigates whether negation direction exists in CLIP embedding space; 4) Uses representation engineering for test-time intervention to steer CLIP toward negation-aware behavior without fine-tuning; 5) Tests generalization on non-common image-text samples.
Result: Found evidence that a negation direction exists in CLIP’s embedding space. Showed that representation engineering can manipulate this direction to improve negation understanding without fine-tuning. Demonstrated the proposed MLLM-based evaluation provides fair assessment of negation understanding compared to flawed retrieval metrics.
Conclusion: Negation understanding in CLIP can be improved through test-time intervention using discovered negation directions in embedding space, bypassing need for large-scale fine-tuning. New MLLM-based evaluation framework provides more reliable assessment of negation understanding capabilities.
Abstract: Joint Vision-Language Embedding models such as CLIP typically fail at understanding negation in text queries - for example, failing to distinguish “no” in the query: “a plain blue shirt with no logos”. Prior work has largely addressed this limitation through data-centric approaches, fine-tuning CLIP on large-scale synthetic negation datasets. However, these efforts are commonly evaluated using retrieval-based metrics that cannot reliably reflect whether negation is actually understood. In this paper, we identify two key limitations of such evaluation metrics and investigate an alternative evaluation framework based on Multimodal LLMs-as-a-judge, which typically excel at understanding simple yes/no questions about image content, providing a fair evaluation of negation understanding in CLIP models. We then ask whether there already exists a direction in the CLIP embedding space associated with negation. We find evidence that such a direction exists, and show that it can be manipulated through test-time intervention via representation engineering to steer CLIP toward negation-aware behavior without any fine-tuning. Finally, we test negation understanding on non-common image-text samples to evaluate generalization under distribution shifts.
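A common representation-engineering recipe for the kind of intervention described here is to estimate the direction as a mean difference between embeddings of negated and affirmative captions, then project it out at test time. A minimal NumPy sketch on synthetic vectors, assuming this difference-of-means construction (the paper does not specify its exact estimator):

```python
import numpy as np

def negation_direction(neg_embs, pos_embs):
    """Estimate a 'negation direction' as the normalized mean difference
    between embeddings of negated captions and affirmative counterparts."""
    v = neg_embs.mean(axis=0) - pos_embs.mean(axis=0)
    return v / np.linalg.norm(v)

def steer(embedding, v, alpha=1.0):
    """Test-time intervention: remove (alpha=1) or rescale the component
    of an embedding along the negation direction, without any fine-tuning."""
    return embedding - alpha * (embedding @ v) * v

rng = np.random.default_rng(0)
base = rng.normal(size=(50, 128))                 # affirmative-caption embeddings
v_true = np.zeros(128); v_true[0] = 1.0           # synthetic ground-truth direction
neg = base + 2.0 * v_true                         # negated captions shifted along it
v = negation_direction(neg, base)
e = steer(neg[0], v)
print(abs(e @ v) < 1e-6)                          # True: component along v removed
```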
[222] Improving Diffusion Generalization with Weak-to-Strong Segmented Guidance
Liangyu Yuan, Yufei Huang, Mingkun Lei, Tong Zhao, Ruoyu Wang, Changxi Chi, Yiwei Wang, Chi Zhang
Main category: cs.CV
TL;DR: Proposes SGG, a hybrid guidance method for diffusion models that combines CFG and AutoGuidance under the weak-to-strong principle to improve generalization and reduce gradient error accumulation.
Details
Motivation: Diffusion models suffer from misalignment between simulation-free objectives and iterative refinement, causing accumulated gradient errors and poor generalization. Existing guidance methods like CFG and AG have unclear effective regimes, making selection ambiguous.
Method: First conducts synthetic comparisons to isolate effective regimes of CFG and AG from weak-to-strong principle perspective. Proposes SGG as hybrid instantiation combining benefits of both. Also migrates W2S principle into training objective to improve unguided diffusion models.
Result: SGG outperforms existing training-free guidance variants on SD3 and SD3.5. Training-time experiments on transformer architectures show effective migration and performance gains in both conditional and unconditional settings.
Conclusion: SGG provides effective hybrid guidance for diffusion models, addressing gradient error accumulation and improving generalization through weak-to-strong principle integration.
Abstract: Diffusion models generate synthetic images through an iterative refinement process. However, the misalignment between the simulation-free objective and the iterative process often causes accumulated gradient error along the sampling trajectory, which leads to unsatisfactory results and a failure to generalize. Guidance techniques like Classifier Free Guidance (CFG) and AutoGuidance (AG) alleviate this by extrapolating between the main and inferior signal for stronger generalization. Despite empirical success, the effective operational regimes of prevalent guidance methods are still under-explored, leading to ambiguity when selecting the appropriate guidance method given a precondition. In this work, we first conduct synthetic comparisons to isolate and demonstrate the effective regime of guidance methods represented by CFG and AG from the perspective of weak-to-strong principle. Based on this, we propose a hybrid instantiation called SGG under the principle, taking the benefits of both. Furthermore, we demonstrate that the W2S principle along with SGG can be migrated into the training objective, improving the generalization ability of unguided diffusion models. We validate our approach with comprehensive experiments. At inference time, evaluations on SD3 and SD3.5 confirm that SGG outperforms existing training-free guidance variants. Training-time experiments on transformer architectures demonstrate the effective migration and performance gains in both conditional and unconditional settings. Code is available at https://github.com/851695e35/SGG.
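Both CFG and AutoGuidance share the same one-line extrapolation, differing only in what plays the weak signal. A minimal NumPy sketch of that shared form (the well-known formulas, not the SGG hybrid itself):

```python
import numpy as np

def guided_eps(eps_weak, eps_strong, w):
    """Weak-to-strong guidance: extrapolate from a weak toward a strong signal.

    CFG:          eps_weak = unconditional, eps_strong = conditional prediction.
    AutoGuidance: eps_weak = a smaller/undertrained model, eps_strong = the main model.
    w = 1 recovers the strong prediction; w > 1 extrapolates past it.
    """
    return eps_weak + w * (eps_strong - eps_weak)

rng = np.random.default_rng(0)
e_weak, e_strong = rng.normal(size=16), rng.normal(size=16)
print(np.allclose(guided_eps(e_weak, e_strong, 1.0), e_strong))  # True
```

A hybrid like SGG can then be read as choosing, per regime, which pairing of weak and strong signals to extrapolate between.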
[223] RayMap3R: Inference-Time RayMap for Dynamic 3D Reconstruction
Feiran Wang, Zezhou Shang, Gaowen Liu, Yan Yan
Main category: cs.CV
TL;DR: RayMap3R: Training-free streaming framework for dynamic 3D scene reconstruction that identifies moving objects by contrasting RayMap-based predictions with image predictions to suppress artifacts and drift.
Details
Motivation: Streaming feed-forward 3D reconstruction models can be affected by moving objects, causing artifacts and drift due to lack of explicit dynamic reasoning. Current approaches struggle with dynamic scenes in real-time reconstruction.
Method: Proposes RayMap3R with dual-branch inference: identifies dynamic regions by contrasting RayMap predictions (which exhibit static-scene bias) with image predictions, suppresses dynamic interference during memory updates, and introduces reset metric alignment and state-aware smoothing for consistency.
Result: Achieves state-of-the-art performance among streaming approaches on dynamic scene reconstruction across multiple benchmarks.
Conclusion: RayMap3R provides an effective training-free solution for dynamic scene reconstruction in streaming settings by leveraging inherent static-scene bias in RayMap predictions for dynamic identification.
Abstract: Streaming feed-forward 3D reconstruction enables real-time joint estimation of scene geometry and camera poses from RGB images. However, without explicit dynamic reasoning, streaming models can be affected by moving objects, causing artifacts and drift. In this work, we propose RayMap3R, a training-free streaming framework for dynamic scene reconstruction. We observe that RayMap-based predictions exhibit a static-scene bias, providing an internal cue for dynamic identification. Based on this observation, we construct a dual-branch inference scheme that identifies dynamic regions by contrasting RayMap and image predictions, suppressing their interference during memory updates. We further introduce reset metric alignment and state-aware smoothing to preserve metric consistency and stabilize predicted trajectories. Our method achieves state-of-the-art performance among streaming approaches on dynamic scene reconstruction across multiple benchmarks.
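The contrast between the two branches reduces to thresholding their per-pixel disagreement: where the static-biased RayMap branch and the image branch differ strongly, the pixel is likely dynamic. A toy NumPy sketch of that idea on depth maps (the threshold and relative-error form are illustrative assumptions, not the paper's exact criterion):

```python
import numpy as np

def dynamic_mask(depth_raymap, depth_image, tau=0.1):
    """Flag pixels where the RayMap branch (static-scene biased) disagrees
    with the image branch; large relative disagreement suggests motion,
    and flagged pixels would be suppressed during memory updates."""
    rel = np.abs(depth_raymap - depth_image) / np.maximum(depth_image, 1e-6)
    return rel > tau

static = np.ones((8, 8))
image = static.copy()
image[2:4, 2:4] = 2.0                 # a moving object seen only by the image branch
mask = dynamic_mask(static, image)
print(int(mask.sum()))                # 4: only the moving region is flagged
```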
[224] GaussianPile: A Unified Sparse Gaussian Splatting Framework for Slice-based Volumetric Reconstruction
Di Kong, Yikai Wang, Wenjie Guo, Yifan Bu, Boya Zhang, Yuexin Duan, Xiawei Yue, Wenbiao Du, Yiman Zhong, Yuwen Chen, Cheng Ma
Main category: cs.CV
TL;DR: GaussianPile unifies 3D Gaussian splatting with imaging system-aware focus modeling for efficient compression and visualization of slice-based volumetric data.
Details
Motivation: Slice-based volumetric imaging needs representations that compress aggressively while preserving internal structure for analysis. Current methods lack efficiency in both compression and visualization.
Method: Three key innovations: 1) slice-aware piling strategy positioning anisotropic 3D Gaussians, 2) differentiable projection operator encoding finite-thickness point spread function, 3) compact encoding and joint optimization pipeline for simultaneous reconstruction and compression.
Result: Reduces storage and reconstruction costs, sustains diagnostic fidelity, enables fast 2D visualization and 3D voxelization. Delivers results 11x faster than NeRF-based approaches with 16x compression over voxel grids.
Conclusion: GaussianPile offers practical path to deployable compression and exploration of slice-based volumetric datasets with high efficiency and quality preservation.
Abstract: Slice-based volumetric imaging is widely applied and it demands representations that compress aggressively while preserving internal structure for analysis. We introduce GaussianPile, unifying 3D Gaussian splatting with an imaging system-aware focus model to address this challenge. Our proposed method introduces three key innovations: (i) a slice-aware piling strategy that positions anisotropic 3D Gaussians to model through-slice contributions, (ii) a differentiable projection operator that encodes the finite-thickness point spread function of the imaging acquisition system, and (iii) a compact encoding and joint optimization pipeline that simultaneously reconstructs and compresses the Gaussian sets. Our CUDA-based design retains the compression and real-time rendering efficiency of Gaussian primitives while preserving high-frequency internal volumetric detail. Experiments on microscopy and ultrasound datasets demonstrate that our method reduces storage and reconstruction cost, sustains diagnostic fidelity, and enables fast 2D visualization, along with 3D voxelization. In practice, it delivers high-quality results in as few as 3 minutes, up to 11x faster than NeRF-based approaches, and achieves consistent 16x compression over voxel grids, offering a practical path to deployable compression and exploration of slice-based volumetric datasets.
[225] ScaleEdit-12M: Scaling Open-Source Image Editing Data Generation via Multi-Agent Framework
Guanzhou Chen, Erfei Cui, Changyao Tian, Danni Yang, Ganlin Yang, Yu Qiao, Hongsheng Li, Gen Luo, Hongjie Zhang
Main category: cs.CV
TL;DR: ScaleEditor is an open-source hierarchical multi-agent framework for creating large-scale, high-quality image editing datasets without proprietary APIs, producing ScaleEdit-12M - the largest open-source image editing dataset.
Details
Motivation: Current image editing datasets either rely on costly proprietary APIs or have limited quality/generalizability from fixed synthetic pipelines, creating a need for scalable, open-source solutions.
Method: Three-component pipeline: 1) source image expansion with world-knowledge infusion, 2) adaptive multi-agent editing instruction-image synthesis, and 3) task-aware data quality verification mechanism.
Result: Created ScaleEdit-12M dataset spanning 23 task families; fine-tuning UniWorld-V1 and Bagel models yielded up to 10.4% improvement on ImgEdit, 35.1% on GEdit, and up to 150.0% on knowledge-infused benchmarks.
Conclusion: Open-source agentic pipelines can approach commercial-grade data quality while maintaining cost-effectiveness and scalability for multimodal model training.
Abstract: Instruction-based image editing has emerged as a key capability for unified multimodal models (UMMs), yet constructing large-scale, diverse, and high-quality editing datasets without costly proprietary APIs remains challenging. Previous image editing datasets either rely on closed-source models for annotation, which prevents cost-effective scaling, or employ fixed synthetic editing pipelines, which suffer from limited quality and generalizability. To address these challenges, we propose ScaleEditor, a fully open-source hierarchical multi-agent framework for end-to-end construction of large-scale, high-quality image editing datasets. Our pipeline consists of three key components: source image expansion with world-knowledge infusion, adaptive multi-agent editing instruction-image synthesis, and a task-aware data quality verification mechanism. Using ScaleEditor, we curate ScaleEdit-12M, the largest open-source image editing dataset to date, spanning 23 task families across diverse real and synthetic domains. Fine-tuning UniWorld-V1 and Bagel on ScaleEdit yields consistent gains, improving performance by up to 10.4% on ImgEdit and 35.1% on GEdit for general editing benchmarks and by up to 150.0% on RISE and 26.5% on KRIS-Bench for knowledge-infused benchmarks. These results demonstrate that open-source, agentic pipelines can approach commercial-grade data quality while retaining cost-effectiveness and scalability. Both the framework and dataset will be open-sourced.
[226] A Multihead Continual Learning Framework for Fine-Grained Fashion Image Retrieval with Contrastive Learning and Exponential Moving Average Distillation
Ling Xiao, Toshihiko Yamasaki
Main category: cs.CV
TL;DR: Multi-head continual learning framework for fine-grained fashion image retrieval using contrastive learning and EMA distillation to handle evolving classes efficiently.
Details
Motivation: Existing fine-grained fashion image retrieval methods require full retraining for new attributes, which is costly. Pretrained models suffer accuracy drops without supervision, and no prior work explores class-incremental learning for this task.
Method: Proposes MCL-FIR: multi-head design for evolving classes, reformulates triplets into doublets with InfoNCE for simpler training, and uses EMA distillation for efficient knowledge transfer.
Result: Outperforms CIL baselines at similar training cost and achieves performance comparable to static methods using only ~30% of their training cost across four datasets.
Conclusion: MCL-FIR provides scalable, efficient solution for dynamic fashion retrieval scenarios with strong balance between efficiency and accuracy.
Abstract: Most fine-grained fashion image retrieval (FIR) methods assume a static setting, requiring full retraining when new attributes appear, which is costly and impractical for dynamic scenarios. Although pretrained models support zero-shot inference, their accuracy drops without supervision, and no prior work explores class-incremental learning (CIL) for fine-grained FIR. We propose a multihead continual learning framework for fine-grained fashion image retrieval with contrastive learning and exponential moving average (EMA) distillation (MCL-FIR). MCL-FIR adopts a multi-head design to accommodate evolving classes across increments, reformulates triplet inputs into doublets with InfoNCE for simpler and more effective training, and employs EMA distillation for efficient knowledge transfer. Experiments across four datasets demonstrate that, beyond its scalability, MCL-FIR achieves a strong balance between efficiency and accuracy. It significantly outperforms CIL baselines under similar training cost, and compared with static methods, it delivers comparable performance while using only about 30% of the training cost. The source code is publicly available at https://github.com/Dr-LingXiao/MCL-FIR.
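The two generic building blocks named above, EMA distillation and InfoNCE over doublets, are standard and can be sketched in a few lines. This is a minimal illustration of those techniques in general, not MCL-FIR's code; the function names, the momentum value, and the cosine-similarity parameterization are assumptions.

```python
import numpy as np

def ema_update(teacher, student, momentum=0.999):
    """One EMA distillation step: the teacher's weights track a slow
    moving average of the student's, providing stable targets for
    knowledge transfer across class increments."""
    return {k: momentum * teacher[k] + (1.0 - momentum) * student[k]
            for k in teacher}

def info_nce(anchor, positive, negatives, tau=0.07):
    """InfoNCE over an (anchor, positive) doublet: the positive is
    pulled toward the anchor while negatives are pushed away."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    logits = np.array([cos(anchor, positive)] +
                      [cos(anchor, n) for n in negatives]) / tau
    logits -= logits.max()  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])  # cross-entropy with positive at index 0
```

Reformulating triplets into doublets in this way removes explicit hard-negative mining: every other sample in the batch can serve as a negative in the softmax denominator.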
[227] IBCapsNet: Information Bottleneck Capsule Network for Noise-Robust Representation Learning
Canqun Xiang, Chen Yang, Jiaoyan Zhao
Main category: cs.CV
TL;DR: IBCapsNet improves capsule networks using Information Bottleneck principle for better robustness and efficiency, replacing iterative routing with one-pass variational aggregation.
Details
Motivation: Capsule networks have hierarchical spatial modeling advantages but suffer from high computational cost due to iterative dynamic routing and poor robustness under input corruptions.
Method: Proposes IBCapsNet using the Information Bottleneck principle with one-pass variational aggregation: primary capsules are compressed into a global context, then processed by class-specific VAEs to infer latent capsules regularized by KL divergence.
Result: Matches CapsNet accuracy on clean data (99.41% MNIST, 92.01% SVHN), significantly outperforms under noise (+17.10% clamped additive, +14.54% multiplicative), 2.54x faster training, 3.64x higher inference throughput, 4.66% fewer parameters.
Conclusion: Bridges information-theoretic representation learning with capsule networks for robust, efficient, interpretable deep models.
Abstract: Capsule networks (CapsNets) are superior at modeling hierarchical spatial relationships but suffer from two critical limitations: high computational cost due to iterative dynamic routing and poor robustness under input corruptions. To address these issues, we propose IBCapsNet, a novel capsule architecture grounded in the Information Bottleneck (IB) principle. Instead of iterative routing, IBCapsNet employs a one-pass variational aggregation mechanism, where primary capsules are first compressed into a global context representation and then processed by class-specific variational autoencoders (VAEs) to infer latent capsules regularized by the KL divergence. This design enables efficient inference while inherently filtering out noise. Experiments on MNIST, Fashion-MNIST, SVHN and CIFAR-10 show that IBCapsNet matches CapsNet in clean-data accuracy (achieving 99.41% on MNIST and 92.01% on SVHN), yet significantly outperforms it under four types of synthetic noise - demonstrating average improvements of +17.10% and +14.54% for clamped additive and multiplicative noise, respectively. Moreover, IBCapsNet achieves 2.54x faster training and 3.64x higher inference throughput compared to CapsNet, while reducing model parameters by 4.66%. Our work bridges information-theoretic representation learning with capsule networks, offering a principled path toward robust, efficient, and interpretable deep models. Code is available at https://github.com/cxiang26/IBCapsnet
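The KL regularizer and reparameterized sampling that IBCapsNet applies to its latent capsules are the standard VAE ingredients, which can be written out directly. A sketch of those generic pieces, with function names and the diagonal-Gaussian prior as assumptions (the abstract confirms only that class-specific VAEs infer KL-regularized latent capsules):

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ): the Information
    Bottleneck term that compresses each latent capsule toward the
    prior, inherently filtering input noise."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)

def sample_latent_capsule(mu, logvar, rng):
    """Reparameterization trick: one-pass variational aggregation
    draws a latent capsule without any iterative routing."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps
```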
[228] MFSR: MeanFlow Distillation for One Step Real-World Image Super Resolution
Ruiqing Wang, Kai Zhang, Yuanzhi Zhu, Hanshu Yan, Shilin Lu, Jian Yang
Main category: cs.CV
TL;DR: MFSR is a distillation framework for image super-resolution that enables photorealistic results in one step while preserving optional multi-step refinement capability through MeanFlow learning targets.
Details
Motivation: Diffusion/flow models provide high-quality image super-resolution but suffer from slow multi-step inference. One-step distillation reduces cost but degrades quality and removes refinement options, motivating an efficient single-step solution with an optional improvement path.
Method: Uses MeanFlow as the learning target to approximate the average velocity between arbitrary PF-ODE states, capturing teacher dynamics without explicit rollouts. Improves the CFG formulation with teacher CFG distillation to better leverage pretrained generative priors.
Result: Achieves efficient, flexible, high-quality super-resolution on synthetic and real-world benchmarks, matching or exceeding multi-step teacher performance with much lower computational cost.
Conclusion: MFSR provides a practical distillation framework that balances efficiency and quality for real-world image super-resolution, enabling single-step inference while preserving optional refinement capability.
Abstract: Diffusion- and flow-based models have advanced Real-world Image Super-Resolution (Real-ISR), but their multi-step sampling makes inference slow and hard to deploy. One-step distillation alleviates the cost, yet often degrades restoration quality and removes the option to refine with more steps. We present Mean Flows for Super-Resolution (MFSR), a new distillation framework that produces photorealistic results in a single step while still allowing an optional few-step path for further improvement. Our approach uses MeanFlow as the learning target, enabling the student to approximate the average velocity between arbitrary states of the Probability Flow ODE (PF-ODE) and effectively capture the teacher’s dynamics without explicit rollouts. To better leverage pretrained generative priors, we additionally improve original MeanFlow’s Classifier-Free Guidance (CFG) formulation with teacher CFG distillation strategy, which enhances restoration capability and preserves fine details. Experiments on both synthetic and real-world benchmarks demonstrate that MFSR achieves efficient, flexible, and high-quality super-resolution, delivering results on par with or even better than multi-step teachers while requiring much lower computational cost.
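The "average velocity between arbitrary states of the PF-ODE" that MFSR distills can be made concrete. A sketch following the general MeanFlow formulation (symbols $z_t$, $v$, $u$ as used in that literature, not drawn from this abstract):

```latex
% Average velocity of the PF-ODE between times r < t:
u(z_t, r, t) \;=\; \frac{1}{t - r} \int_{r}^{t} v(z_\tau, \tau)\, d\tau
% Differentiating (t - r)\,u with respect to t gives the MeanFlow
% identity, which lets a student regress u without rolling out the
% teacher trajectory:
u(z_t, r, t) \;=\; v(z_t, t) \;-\; (t - r)\, \frac{d}{dt}\, u(z_t, r, t)
```

Setting $r = 0$, $t = 1$ recovers single-step generation from the learned average velocity, while intermediate $(r, t)$ pairs retain the optional few-step refinement path the summary mentions.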
[229] Satellite-to-Street: Synthesizing Post-Disaster Views from Satellite Imagery via Generative Vision Models
Yifan Yang, Lei Zou, Wendy Jepson
Main category: cs.CV
TL;DR: Satellite-to-street view synthesis for disaster assessment using VLM-guided and damage-sensitive MoE methods, with evaluation showing realism-fidelity trade-off.
Details
Motivation: Rapid situational awareness after natural disasters requires ground-level perspectives that satellite imagery lacks, but ground-level data is inaccessible during time-sensitive events.
Method: Two generative strategies: a VLM-guided approach and a damage-sensitive Mixture-of-Experts method. Benchmarked against Pix2Pix and ControlNet using a Structure-Aware Evaluation Framework with pixel-level assessment, ResNet-based semantic consistency, and a VLM-as-a-Judge.
Result: Experiments on 300 disaster scenarios reveal realism-fidelity trade-off: diffusion-based approaches achieve high perceptual realism but hallucinate details. ControlNet achieves highest semantic accuracy (0.71), while VLM-enhanced and MoE models excel in textural plausibility but struggle with semantic clarity.
Conclusion: Establishes baseline for trustworthy cross-view synthesis, emphasizing that visually realistic generations may fail to preserve critical structural information needed for reliable disaster assessment.
Abstract: In the immediate aftermath of natural disasters, rapid situational awareness is critical. Traditionally, satellite observations are widely used to estimate damage extent. However, they lack the ground-level perspective essential for characterizing specific structural failures and impacts. Meanwhile, ground-level data (e.g., street-view imagery) remains largely inaccessible during time-sensitive events. This study investigates Satellite-to-Street View Synthesis to bridge this data gap. We introduce two generative strategies to synthesize post-disaster street views from satellite imagery: a Vision-Language Model (VLM)-guided approach and a damage-sensitive Mixture-of-Experts (MoE) method. We benchmark these against general-purpose baselines (Pix2Pix, ControlNet) using a proposed Structure-Aware Evaluation Framework. This multi-tier protocol integrates (1) pixel-level quality assessment, (2) ResNet-based semantic consistency verification, and (3) a novel VLM-as-a-Judge for perceptual alignment. Experiments on 300 disaster scenarios reveal a critical realism–fidelity trade-off: while diffusion-based approaches (e.g., ControlNet) achieve high perceptual realism, they often hallucinate structural details. Quantitative results show that standard ControlNet achieves the highest semantic accuracy, 0.71, whereas VLM-enhanced and MoE models excel in textural plausibility but struggle with semantic clarity. This work establishes a baseline for trustworthy cross-view synthesis, emphasizing that visually realistic generations may still fail to preserve critical structural information required for reliable disaster assessment.
[230] Clinical Cognition Alignment for Gastrointestinal Diagnosis with Multimodal LLMs
Huan Zheng, Yucheng Zhou, Tianyi Yan, Dubing Chen, Hongbo Lu, Wenlong Liao, Tao He, Pai Peng, Jianbing Shen
Main category: cs.CV
TL;DR: CogAlign framework aligns MLLMs with clinical cognitive pathways for gastrointestinal endoscopy, using hierarchical clinical cognition SFT and counterfactual-driven reinforcement learning to eliminate visual bias and improve diagnostic accuracy.
Details
Motivation: Current MLLMs in medical image analysis face two key limitations in gastrointestinal endoscopy: misalignment between general model reasoning and standardized clinical cognitive pathways, and lack of causal association between visual features and diagnostic outcomes.
Method: 1) Construct hierarchical clinical cognition dataset and use Supervised Fine-Tuning to internalize expert diagnostic logic (anatomical localization → morphological evaluation → microvascular analysis). 2) Counterfactual-driven reinforcement learning strategy using lesion masking to generate counterfactual normal samples and clinical-cognition-centric rewards to enforce causal rectification.
Result: Achieves State-of-the-Art performance across multiple benchmarks, significantly enhancing diagnostic accuracy in complex clinical scenarios.
Conclusion: The CogAlign framework successfully addresses clinical-cognitive misalignment and visual bias in medical MLLMs, providing a robust approach for gastrointestinal endoscopy analysis with publicly available code and datasets.
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable potential in medical image analysis. However, their application in gastrointestinal endoscopy is currently hindered by two critical limitations: the misalignment between general model reasoning and standardized clinical cognitive pathways, and the lack of causal association between visual features and diagnostic outcomes. In this paper, we propose a novel Clinical-Cognitive-Aligned (CogAlign) framework to address these challenges. First, we endow the model with rigorous clinical analytical capabilities by constructing the hierarchical clinical cognition dataset and employing Supervised Fine-Tuning (SFT). Unlike conventional approaches, this strategy internalizes the hierarchical diagnostic logic of experts, ranging from anatomical localization and morphological evaluation to microvascular analysis, directly into the model. Second, to eliminate visual bias, we provide a theoretical analysis demonstrating that standard supervised tuning inevitably converges to spurious background correlations. Guided by this insight, we propose a counterfactual-driven reinforcement learning strategy to enforce causal rectification. By generating counterfactual normal samples via lesion masking and optimizing through clinical-cognition-centric rewards, we constrain the model to strictly ground its diagnosis in causal lesion features. Extensive experiments demonstrate that our approach achieves State-of-the-Art (SoTA) performance across multiple benchmarks, significantly enhancing diagnostic accuracy in complex clinical scenarios. All source code and datasets will be made publicly available.
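The counterfactual-sample generation step can be sketched as a masking operation. This is hypothetical: the abstract says only that counterfactual normal samples are generated "via lesion masking", so the function name and the healthy-tissue-mean fill are illustrative assumptions, not the paper's procedure.

```python
import numpy as np

def counterfactual_normal(image, lesion_mask, fill_value=None):
    """Generate a counterfactual 'normal' sample by masking the lesion.
    If the model's diagnosis is unchanged on this sample, it was
    relying on spurious background correlations rather than the
    causal lesion features.

    image:       (H, W, C) float array
    lesion_mask: (H, W) boolean array, True inside the lesion
    """
    out = image.copy()
    if fill_value is None:
        # Assumed fill: mean of the non-lesion (healthy) pixels.
        fill_value = image[~lesion_mask].mean(axis=0)
    out[lesion_mask] = fill_value
    return out
```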
[231] High-Quality and Efficient Turbulence Mitigation with Events
Xiaoran Zhang, Jian Ding, Yuxing Duan, Haoyue Liu, Gang Chen, Yi Chang, Luxin Yan
Main category: cs.CV
TL;DR: EHETM uses event cameras for efficient turbulence mitigation by leveraging event polarity patterns and event-tube motion priors to disentangle dynamic objects from turbulence with minimal frames.
Details
Motivation: Traditional turbulence mitigation methods face accuracy-efficiency trade-offs requiring multiple frames, while event cameras offer microsecond temporal resolution and efficient dynamic sensing to break this bottleneck.
Method: Two complementary modules: 1) polarity-weighted gradients for scene refinement, using turbulence-induced event polarity alternation patterns, and 2) event-tube constraints for motion decoupling, using the spatiotemporally coherent event patterns of dynamic objects.
Result: Outperforms state-of-the-art methods, especially with dynamic objects, while reducing data overhead by ~77.3% and system latency by ~89.5%. Two real-world event-frame turbulence datasets created.
Conclusion: Event cameras enable high-quality turbulence mitigation with few frames by exploiting event polarity patterns and motion priors, overcoming traditional accuracy-efficiency tradeoffs.
Abstract: Turbulence mitigation (TM) is highly ill-posed due to the stochastic nature of atmospheric turbulence. Most methods rely on multiple frames recorded by conventional cameras to capture stable patterns in natural scenarios. However, they inevitably suffer from a trade-off between accuracy and efficiency: more frames enhance restoration at the cost of higher system latency and larger data overhead. Event cameras, equipped with microsecond temporal resolution and efficient sensing of dynamic changes, offer an opportunity to break the bottleneck. In this work, we present EHETM, a high-quality and efficient TM method inspired by the superiority of events to model motions in continuous sequences. We discover two key phenomena: (1) turbulence-induced events exhibit distinct polarity alternation correlated with sharp image gradients, providing structural cues for restoring scenes; and (2) dynamic objects form spatiotemporally coherent "event tubes" in contrast to irregular patterns within turbulent events, providing motion priors for disentangling objects from turbulence. Based on these insights, we design two complementary modules that respectively leverage polarity-weighted gradients for scene refinement and event-tube constraints for motion decoupling, achieving high-quality restoration with few frames. Furthermore, we construct two real-world event-frame turbulence datasets covering atmospheric and thermal cases. Experiments show that EHETM outperforms SOTA methods, especially under scenes with dynamic objects, while reducing data overhead and system latency by approximately 77.3% and 89.5%, respectively. Our code is available at: https://github.com/Xavier667/EHETM.
[232] The Role and Relationship of Initialization and Densification in 3D Gaussian Splatting
Ivan Desiatov, Torsten Sattler
Main category: cs.CV
TL;DR: Systematic study of 3D Gaussian Splatting initialization methods and densification schemes, showing current densification approaches fail to fully leverage dense initialization data.
Details
Motivation: 3DGS typically uses sparse Structure-from-Motion point clouds as initialization and relies on densification to create dense Gaussian clouds. The paper investigates whether current densification methods can effectively utilize denser initialization data like laser scans, stereo point clouds, or monocular depth estimates.
Method: Created a new benchmark to systematically study combinations of different initialization types (dense laser scans, dense multi-view stereo point clouds, dense monocular depth estimates, sparse SfM point clouds) with various densification schemes. Analyzed how well current densification approaches can improve reconstruction quality when starting from different initialization densities.
Result: Current densification approaches are unable to take full advantage of dense initialization data and often fail to significantly improve over sparse SfM-based initialization. The benchmark reveals limitations in existing densification schemes when provided with higher-quality initial point clouds.
Conclusion: There is a need for improved densification methods that can better leverage dense initialization data to achieve higher quality 3D reconstructions with 3D Gaussian Splatting.
Abstract: 3D Gaussian Splatting (3DGS) has become the method of choice for photo-realistic 3D reconstruction of scenes, as it can efficiently and accurately recover scene appearance and geometry from images. 3DGS represents the scene through a set of 3D Gaussians, parameterized by their position, spatial extent, and view-dependent color. Starting from an initial point cloud, 3DGS refines the Gaussians' parameters so as to reconstruct a set of training images as accurately as possible. Typically, a sparse Structure-from-Motion point cloud is used as initialization. In order to obtain dense Gaussian clouds, 3DGS methods thus rely on a densification stage. In this paper, we systematically study the relation between densification and initialization. Proposing a new benchmark, we study combinations of different types of initializations (dense laser scans, dense (multi-view) stereo point clouds, dense monocular depth estimates, sparse SfM point clouds) and different densification schemes. We show that current densification approaches are not able to take full advantage of dense initialization as they are often unable to (significantly) improve over sparse SfM-based initialization. We will make our benchmark publicly available.
[233] Cross-modal Fuzzy Alignment Network for Text-Aerial Person Retrieval and A Large-scale Benchmark
Yifei Deng, Chenglong Li, Yuyang Zhang, Guyue Hu, Jin Tang
Main category: cs.CV
TL;DR: A Cross-modal Fuzzy Alignment Network for text-aerial person retrieval that uses fuzzy logic for token-level reliability and ground-view images as bridge agents to address visual degradation in UAV images.
Details
Motivation: Text-aerial person retrieval faces challenges due to degraded visual information in UAV-captured images from varying viewing angles and altitudes, making semantic alignment with text descriptions difficult.
Method: Proposes a Cross-modal Fuzzy Alignment Network with a Fuzzy Token Alignment module using fuzzy membership functions to model token-level associations, and a Context-Aware Dynamic Alignment module incorporating ground-view images as bridge agents for adaptive alignment.
Result: Experiments on the constructed AERI-PEDES dataset and TBAPR benchmark demonstrate the superiority of the proposed method.
Conclusion: The approach effectively addresses the challenges of text-aerial person retrieval by improving semantic alignment through fuzzy logic and bridge agents, with validation on new benchmark datasets.
Abstract: Text-aerial person retrieval aims to identify targets in UAV-captured images from eyewitness descriptions, supporting intelligent transportation and public security applications. Compared to ground-view text–image person retrieval, UAV-captured images often suffer from degraded visual information due to drastic variations in viewing angles and flight altitudes, making semantic alignment with textual descriptions very challenging. To address this issue, we propose a novel Cross-modal Fuzzy Alignment Network, which quantifies the token-level reliability by fuzzy logic to achieve accurate fine-grained alignment and incorporates ground-view images as a bridge agent to further mitigate the gap between aerial images and text descriptions, for text–aerial person retrieval. In particular, we design the Fuzzy Token Alignment module that employs the fuzzy membership function to dynamically model token-level association strength and suppress the influence of unobservable or noisy tokens. It can alleviate the semantic inconsistencies caused by missing visual cues and significantly enhance the robustness of token-level semantic alignment. Moreover, to further mitigate the gap between aerial images and text descriptions, we design a Context-Aware Dynamic Alignment module to incorporate the ground-view agent as a bridge in text–aerial alignment and adaptively combine direct alignment and agent-assisted alignment to improve the robustness. In addition, we construct a large-scale benchmark dataset called AERI-PEDES by using a chain-of-thought to decompose text generation into attribute parsing, initial captioning, and refinement, thus boosting textual accuracy and semantic consistency. Experiments on AERI-PEDES and TBAPR demonstrate the superiority of our method.
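The idea of weighting token-level similarities by a fuzzy membership can be sketched generically. The sigmoid membership shape, its parameters, and the function names below are assumptions for illustration; the abstract states only that a fuzzy membership function models token-level association strength and suppresses unreliable tokens.

```python
import numpy as np

def fuzzy_membership(sim, center=0.5, width=0.1):
    """Hypothetical sigmoid-shaped fuzzy membership over a token
    similarity score: low-reliability (e.g. unobservable or noisy)
    tokens get memberships near 0 and are suppressed."""
    return 1.0 / (1.0 + np.exp(-(sim - center) / width))

def fuzzy_token_alignment(sim_matrix):
    """Membership-weighted token-level alignment: each text-image
    token similarity contributes in proportion to its fuzzy
    reliability, damping the influence of missing visual cues."""
    w = fuzzy_membership(sim_matrix)
    return float((w * sim_matrix).sum() / (w.sum() + 1e-8))
```

Compared with a plain mean over token similarities, the weighted score is dominated by the reliable high-similarity tokens rather than dragged down by tokens whose visual evidence is absent.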
[234] Premier: Personalized Preference Modulation with Learnable User Embedding in Text-to-Image Generation
Zihao Wang, Yuxiang Wei, Xinpeng Zhou, Tianyu Zhang, Tao Liang, Yalong Bai, Hongzhi Zhang, Wangmeng Zuo
Main category: cs.CV
TL;DR: Premier is a preference modulation framework for personalized image generation that learns user-specific embeddings and uses a preference adapter to fuse them with text prompts for fine-grained control.
Details
Motivation: Current text-to-image generation struggles to capture nuanced user preferences. Existing methods using multimodal LLMs often fail to faithfully reflect user preferences in derived prompts or latent codes, leading to suboptimal personalization.
Method: Premier represents each user's preference as a learnable embedding and introduces a preference adapter that fuses user embeddings with text prompts. It modulates the generative process with fused preference embeddings and uses a dispersion loss to enhance distinctness of individual preferences and improve alignment with user-specific styles. For new users with scarce data, it represents them as linear combinations of existing preference embeddings.
Result: Experiments show Premier outperforms prior methods under the same history length, achieving stronger preference alignment and superior performance on text consistency, ViPer proxy metrics, and expert evaluations.
Conclusion: Premier provides an effective framework for personalized image generation that better captures and reflects user preferences through learnable embeddings and preference modulation.
Abstract: Text-to-image generation has advanced rapidly, yet it still struggles to capture the nuanced user preferences. Existing approaches typically rely on multimodal large language models to infer user preferences, but the derived prompts or latent codes rarely reflect them faithfully, leading to suboptimal personalization. We present Premier, a novel preference modulation framework for personalized image generation. Premier represents each user’s preference as a learnable embedding and introduces a preference adapter that fuses the user embedding with the text prompt. To enable accurate and fine-grained preference control, the fused preference embedding is further used to modulate the generative process. To enhance the distinctness of individual preference and improve alignment between outputs and user-specific styles, we incorporate a dispersion loss that enforces separation among user embeddings. When user data are scarce, new users are represented as linear combinations of existing preference embeddings learned during training, enabling effective generalization. Experiments show that Premier outperforms prior methods under the same history length, achieving stronger preference alignment and superior performance on text consistency, ViPer proxy metrics, and expert evaluations.
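The cold-start mechanism, representing a new user as a linear combination of existing preference embeddings, can be sketched as follows. The softmax-over-similarity weighting and all names here are hypothetical; the abstract confirms only that new users are expressed as linear combinations of embeddings learned during training.

```python
import numpy as np

def embed_new_user(history_feat, bank_feats, bank_embeddings):
    """Represent a cold-start user as a convex combination of known
    users' preference embeddings (assumed weighting: softmax over
    similarity between the new user's sparse history feature and each
    known user's feature centroid).

    history_feat:    (D,)  feature summarizing the new user's history
    bank_feats:      (U, D) feature centroids of known users
    bank_embeddings: (U, E) learned preference embeddings
    """
    sims = bank_feats @ history_feat       # (U,) similarity per known user
    w = np.exp(sims - sims.max())
    w /= w.sum()                           # softmax weights, sum to 1
    return w @ bank_embeddings             # (E,) new-user embedding
```

The dispersion loss mentioned in the summary matters here: the more separated the existing embeddings are, the more expressive such convex combinations become.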
[235] Weakly supervised multimodal segmentation of acoustic borehole images with depth-aware cross-attention
Jose Luis Lima de Jesus Silva
Main category: cs.CV
TL;DR: Weakly supervised multimodal segmentation framework for acoustic borehole images using threshold-guided pseudo-labels with depth-aligned well-logs, achieving robust improvement through confidence-gated depth-aware cross-attention fusion.
Details
Motivation: Large-scale interpretation of acoustic borehole images is difficult due to lack of dense expert annotations and the intrinsically multimodal nature of subsurface information (2D image texture + 1D well-logs), creating a need for weakly supervised methods that combine these modalities without requiring extensive manual labeling.
Method: Weakly supervised multimodal segmentation framework that refines threshold-guided pseudo-labels through learned models. Preserves the annotation-free character of classical thresholding/clustering while adding denoising, confidence-aware pseudo-supervision, and physically structured fusion. Compares different fusion strategies: direct concatenation, depth-aware cross-attention, gated fusion, and confidence-aware modulation.
Result: Threshold-guided learned refinement provides most robust improvement over baselines. Best performing model is confidence-gated depth-aware cross-attention (CG-DCA), which consistently outperforms threshold-based, image-only, and earlier multimodal baselines. Performance depends on confidence-aware fusion and structured local depth interaction rather than model complexity alone. Cross-well analyses confirm stable performance.
Conclusion: Establishes practical, scalable framework for annotation-free segmentation showing multimodal improvement is maximized when auxiliary logs are incorporated selectively and depth-aware. Confidence-gated depth-aware cross-attention provides optimal performance for combining 2D image texture with 1D well-logs in weakly supervised setting.
Abstract: Acoustic borehole images provide high-resolution borehole-wall structure, but large-scale interpretation remains difficult because dense expert annotations are rarely available and subsurface information is intrinsically multimodal. The challenge is developing weakly supervised methods combining two-dimensional image texture with depth-aligned one-dimensional well-logs. Here, we introduce a weakly supervised multimodal segmentation framework that refines threshold-guided pseudo-labels through learned models. This preserves the annotation-free character of classical thresholding and clustering workflows while extending them with denoising, confidence-aware pseudo-supervision, and physically structured fusion. We establish that threshold-guided learned refinement provides the most robust improvement over raw thresholding, denoised thresholding, and latent clustering baselines. Multimodal performance depends strongly on fusion strategy: direct concatenation provides limited gains, whereas depth-aware cross-attention, gated fusion, and confidence-aware modulation substantially improve agreement with the weak supervisory reference. The strongest model, confidence-gated depth-aware cross-attention (CG-DCA), consistently outperforms threshold-based, image-only, and earlier multimodal baselines. Targeted ablations show its advantage depends specifically on confidence-aware fusion and structured local depth interaction rather than model complexity alone. Cross-well analyses confirm this performance is broadly stable. These results establish a practical, scalable framework for annotation-free segmentation, showing multimodal improvement is maximized when auxiliary logs are incorporated selectively and depth-aware.
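The confidence-aware pseudo-supervision idea can be illustrated with a toy version of threshold-guided labeling: pixels near the decision threshold are marked "ignore" so the learned refinement model is only supervised where the weak label is trustworthy. This is a hypothetical sketch of the gating principle, not the paper's CG-DCA module:

```python
def gated_pseudo_labels(amplitudes, threshold, margin):
    """Threshold-guided pseudo-labels with a confidence gate.
    Pixels whose amplitude lies within `margin` of the threshold are
    assigned -1 (ignore) and excluded from the training loss; only
    confident pixels supervise the learned model."""
    labels = []
    for a in amplitudes:
        if a >= threshold + margin:
            labels.append(1)    # confidently above threshold
        elif a <= threshold - margin:
            labels.append(0)    # confidently below threshold
        else:
            labels.append(-1)   # low confidence: excluded from the loss
    return labels

print(gated_pseudo_labels([0.1, 0.48, 0.52, 0.9], threshold=0.5, margin=0.05))
# [0, -1, -1, 1]
```

Depth-aware cross-attention would then decide, per depth, how much the 1-D well-logs are allowed to influence these image-derived labels.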
[236] VSD-MOT: End-to-End Multi-Object Tracking in Low-Quality Video Scenes Guided by Visual Semantic Distillation
Jun Du
Main category: cs.CV
TL;DR: VSD-MOT: A multi-object tracking framework using visual semantic distillation from CLIP to handle low-quality videos by compensating for information loss through knowledge distillation and dynamic weight regulation.
Details
Motivation: Existing multi-object tracking algorithms perform poorly in low-quality videos due to information loss from image degradation. Real-world scenarios often have varying video quality that causes significant tracking performance decline.
Method: Proposes VSD-MOT framework with CLIP Image Encoder as teacher model for visual semantic extraction. Uses Dual-Constraint Semantic Distillation (DCSD) to transfer semantic extraction capabilities to student model. Includes Dynamic Semantic Weight Regulation (DSWR) module to adaptively allocate fusion weights based on real-time frame quality assessment.
Result: Extensive experiments demonstrate effectiveness and superiority in low-quality video scenarios while maintaining good performance in conventional scenarios.
Conclusion: The proposed VSD-MOT framework successfully addresses multi-object tracking challenges in low-quality videos through visual semantic distillation and dynamic weight regulation, improving robustness in real-world scenarios.
Abstract: Existing multi-object tracking algorithms typically fail to adequately address the issues in low-quality videos, resulting in a significant decline in tracking performance when image quality deteriorates in real-world scenarios. This performance degradation is primarily due to the algorithms’ inability to effectively tackle the problems caused by information loss in low-quality images. To address the challenges of low-quality video scenarios, inspired by vision-language models, we propose a multi-object tracking framework guided by visual semantic distillation (VSD-MOT). Specifically, we introduce the CLIP Image Encoder to extract global visual semantic information from images to compensate for the loss of information in low-quality images. However, direct integration can substantially impact the efficiency of the multi-object tracking algorithm. Therefore, this paper proposes to extract visual semantic information from images through knowledge distillation. This method adopts a teacher-student learning framework, with the CLIP Image Encoder serving as the teacher model. To enable the student model to acquire the capability of extracting visual semantic information suitable for multi-object tracking tasks from the teacher model, we have designed the Dual-Constraint Semantic Distillation method (DCSD). Furthermore, to address the dynamic variation of frame quality in low-quality videos, we propose the Dynamic Semantic Weight Regulation (DSWR) module, which adaptively allocates fusion weights based on real-time frame quality assessment. Extensive experiments demonstrate the effectiveness and superiority of the proposed method in low-quality video scenarios in the real world. Meanwhile, our method can maintain good performance in conventional scenarios.
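The DSWR module's role, weighting distilled semantic features against appearance features by frame quality, can be sketched as a simple convex blend. The linear mapping from quality to weight is an assumption; the paper only states that fusion weights follow a real-time quality assessment:

```python
def dswr_fuse(appearance_feat, semantic_feat, frame_quality):
    """Dynamic Semantic Weight Regulation, sketched: the lower the
    estimated frame quality (in [0, 1]), the more weight the distilled
    visual-semantic feature receives to compensate for information
    loss in the degraded frame."""
    w_sem = 1.0 - frame_quality   # degraded frame -> lean on semantics
    w_app = frame_quality         # clean frame -> trust appearance
    return [w_app * a + w_sem * s for a, s in zip(appearance_feat, semantic_feat)]

app, sem = [1.0, 0.0], [0.0, 1.0]
print(dswr_fuse(app, sem, frame_quality=0.9))  # mostly appearance
print(dswr_fuse(app, sem, frame_quality=0.2))  # mostly semantics
```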
[237] SATTC: Structure-Aware Label-Free Test-Time Calibration for Cross-Subject EEG-to-Image Retrieval
Qunjie Huang, Weina Zhu
Main category: cs.CV
TL;DR: SATTC improves cross-subject EEG-to-image retrieval by addressing subject shift and hubness through geometric and structural calibration of similarity matrices without labels.
Details
Motivation: Cross-subject EEG-to-image retrieval faces challenges from subject shift (differences between individuals' brain signals) and hubness (certain items dominating similarity rankings), which distort similarity geometry and make small-k retrieval unreliable.
Method: SATTC (Structure-Aware Test-Time Calibration) is a label-free calibration head that operates on similarity matrices of frozen EEG and image encoders. It combines: 1) Geometric expert with subject-adaptive whitening of EEG embeddings and adaptive CSLS, and 2) Structural expert using mutual nearest neighbors, bidirectional top-k ranks, and class popularity, fused via Product-of-Experts rule.
Result: On THINGS-EEG dataset with leave-one-subject-out protocol, SATTC improves Top-1 and Top-5 accuracy, reduces hubness and per-class imbalance, and produces more reliable small-k shortlists compared to baseline methods. Gains transfer across multiple EEG encoders.
Conclusion: SATTC serves as an encoder-agnostic, label-free test-time calibration layer for cross-subject neural decoding, effectively addressing subject shift and hubness problems in EEG-to-image retrieval.
Abstract: Cross-subject EEG-to-image retrieval for visual decoding is challenged by subject shift and hubness in the embedding space, which distort similarity geometry and destabilize top-k rankings, making small-k shortlists unreliable. We introduce SATTC (Structure-Aware Test-Time Calibration), a label-free calibration head that operates directly on the similarity matrix of frozen EEG and image encoders. SATTC combines a geometric expert, subject-adaptive whitening of EEG embeddings with an adaptive variant of Cross-domain Similarity Local Scaling (CSLS), and a structural expert built from mutual nearest neighbors, bidirectional top-k ranks, and class popularity, fused via a simple Product-of-Experts rule. On THINGS-EEG under a strict leave-one-subject-out protocol, standardized inference with cosine similarities, L2-normalized embeddings, and candidate whitening already yields a strong cross-subject baseline over the original ATM retrieval setup. Building on this baseline, SATTC further improves Top-1 and Top-5 accuracy, reduces hubness and per-class imbalance, and produces more reliable small-k shortlists. These gains transfer across multiple EEG encoders, supporting SATTC as an encoder-agnostic, label-free test-time calibration layer for cross-subject neural decoding.
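The Product-of-Experts rule named in the method is standard: multiply the two experts' candidate probabilities and renormalize. A minimal sketch (per-expert temperatures or exponents that a real system might add are omitted):

```python
def product_of_experts(geo_probs, struct_probs, eps=1e-12):
    """Fuse a geometric expert and a structural expert over retrieval
    candidates by multiplying their probabilities and renormalizing.
    A candidate must be plausible to BOTH experts to rank highly."""
    fused = [max(g, eps) * max(s, eps) for g, s in zip(geo_probs, struct_probs)]
    z = sum(fused)
    return [f / z for f in fused]

geo    = [0.5, 0.3, 0.2]   # geometric expert (e.g., whitened + CSLS scores)
struct = [0.2, 0.6, 0.2]   # structural expert (mutual-NN / rank evidence)
fused = product_of_experts(geo, struct)
print(max(range(3), key=fused.__getitem__))  # 1 -- candidate 1 wins after fusion
```

Note how the fusion demotes candidate 0, the geometric expert's favorite, because the structural evidence disagrees; this veto behavior is what makes PoE useful against hub items.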
[238] Mamba Learns in Context: Structure-Aware Domain Generalization for Multi-Task Point Cloud Understanding
Jincen Jiang, Qianyu Zhou, Yuhang Li, Kui Su, Meili Wang, Jian Chang, Jian Jun Zhang, Xuequan Lu
Main category: cs.CV
TL;DR: SADG is a Mamba-based In-Context Learning framework for multi-task domain generalization in point cloud processing, featuring structure-aware serialization and hierarchical domain modeling to preserve structural hierarchy across domains.
Details
Motivation: Existing Transformer and Mamba architectures for point clouds are designed for single-task/single-domain settings and degrade in multi-task domain generalization. Transformers have quadratic attention costs and lack explicit structural ordering, while Mamba's coordinate-driven serialization is sensitive to viewpoint changes and missing regions, causing structural drift.
Method: Proposes Structure-Aware Domain Generalization (SADG) with: 1) Structure-aware serialization (SAS) using centroid-based topology and geodesic curvature continuity for transformation-invariant sequences; 2) Hierarchical domain-aware modeling (HDM) to stabilize cross-domain reasoning; 3) Lightweight spectral graph alignment (SGA) for test-time feature shifting without parameter updates.
Result: The approach improves structural fidelity and consistently outperforms state-of-the-art methods across multiple tasks including reconstruction, denoising, and registration. Also introduces MP3DObject dataset for multi-task DG evaluation.
Conclusion: SADG effectively addresses structural drift in multi-task domain generalization for point clouds by preserving structural hierarchy through transformation-invariant serialization and hierarchical domain modeling.
Abstract: While recent Transformer and Mamba architectures have advanced point cloud representation learning, they are typically developed for single-task or single-domain settings. Directly applying them to multi-task domain generalization (DG) leads to degraded performance. Transformers effectively model global dependencies but suffer from quadratic attention cost and lack explicit structural ordering, whereas Mamba offers linear-time recurrence yet often depends on coordinate-driven serialization, which is sensitive to viewpoint changes and missing regions, causing structural drift and unstable sequential modeling. In this paper, we propose Structure-Aware Domain Generalization (SADG), a Mamba-based In-Context Learning framework that preserves structural hierarchy across domains and tasks. We design structure-aware serialization (SAS) that generates transformation-invariant sequences using centroid-based topology and geodesic curvature continuity. We further devise hierarchical domain-aware modeling (HDM) that stabilizes cross-domain reasoning by consolidating intra-domain structure and fusing inter-domain relations. At test time, we introduce a lightweight spectral graph alignment (SGA) that shifts target features toward source prototypes in the spectral domain without updating model parameters, ensuring structure-preserving test-time feature shifting. In addition, we introduce MP3DObject, a real-scan object dataset for multi-task DG evaluation. Comprehensive experiments demonstrate that the proposed approach improves structural fidelity and consistently outperforms state-of-the-art methods across multiple tasks including reconstruction, denoising, and registration.
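SGA's parameter-free test-time shifting can be illustrated in its crudest form: move target features toward a source prototype without touching any model weights. The real module operates in the spectral (graph) domain, which this plain feature-space sketch deliberately does not capture:

```python
def shift_toward_prototype(target_feats, source_prototype, alpha=0.5):
    """Parameter-free test-time alignment, sketched: interpolate each
    target feature toward a source prototype by a factor alpha. No
    gradients, no weight updates -- only the features move. (Crude
    stand-in for SGA, which does this in the spectral domain.)"""
    return [[x + alpha * (p - x) for x, p in zip(f, source_prototype)]
            for f in target_feats]

proto = [0.0, 0.0]
print(shift_toward_prototype([[2.0, 2.0]], proto, alpha=0.5))
# [[1.0, 1.0]] -- halfway toward the source prototype
```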
[239] CTCal: Rethinking Text-to-Image Diffusion Models via Cross-Timestep Self-Calibration
Xiefan Guo, Xinzhu Ma, Haiyu Zhang, Di Huang
Main category: cs.CV
TL;DR: CTCal introduces cross-timestep self-calibration to improve text-image alignment in diffusion models by using reliable alignment from smaller timesteps to calibrate learning at larger timesteps.
Details
Motivation: Current text-to-image diffusion models struggle with precise text-image alignment due to limitations of conventional diffusion loss, which provides only implicit supervision for fine-grained correspondence.
Method: CTCal leverages reliable cross-attention maps from smaller timesteps (less noise) to calibrate representation learning at larger timesteps (more noise), with timestep-aware adaptive weighting to balance CTCal and diffusion loss.
Result: Extensive experiments on T2I-Compbench++ and GenEval benchmarks demonstrate effectiveness and generalizability across both diffusion-based (SD 2.1) and flow-based (SD 3) approaches.
Conclusion: CTCal provides explicit supervision for text-image alignment, is model-agnostic, and can be seamlessly integrated into existing text-to-image diffusion models to improve alignment quality.
Abstract: Recent advancements in text-to-image synthesis have been largely propelled by diffusion-based models, yet achieving precise alignment between text prompts and generated images remains a persistent challenge. We find that this difficulty arises primarily from the limitations of conventional diffusion loss, which provides only implicit supervision for modeling fine-grained text-image correspondence. In this paper, we introduce Cross-Timestep Self-Calibration (CTCal), founded on the supporting observation that establishing accurate text-image alignment within diffusion models becomes progressively more difficult as the timestep increases. CTCal leverages the reliable text-image alignment (i.e., cross-attention maps) formed at smaller timesteps with less noise to calibrate the representation learning at larger timesteps with more noise, thereby providing explicit supervision during training. We further propose a timestep-aware adaptive weighting to achieve a harmonious integration of CTCal and diffusion loss. CTCal is model-agnostic and can be seamlessly integrated into existing text-to-image diffusion models, encompassing both diffusion-based (e.g., SD 2.1) and flow-based approaches (e.g., SD 3). Extensive experiments on T2I-Compbench++ and GenEval benchmarks demonstrate the effectiveness and generalizability of the proposed CTCal. Our code is available at https://github.com/xiefan-guo/ctcal.
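The timestep-aware adaptive weighting can be sketched as a schedule that upweights the calibration term at larger (noisier) timesteps, where alignment is hardest. The linear ramp below is a hypothetical choice; the paper only specifies that the weighting is timestep-aware:

```python
def total_loss(diff_loss, ctcal_loss, t, t_max=1000, lam=1.0):
    """Combine diffusion loss with a cross-timestep calibration term
    whose weight grows with the timestep t, since text-image alignment
    becomes progressively harder as noise increases."""
    w = lam * (t / t_max)          # hypothetical linear ramp in [0, lam]
    return diff_loss + w * ctcal_loss

print(total_loss(1.0, 0.5, t=100))  # 1.05 -- calibration barely weighted
print(total_loss(1.0, 0.5, t=900))  # 1.45 -- calibration weighted heavily
```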
[240] Smart Operation Theatre: An AI-based System for Surgical Gauze Counting
Saraf Krish, Cai Yiyu, Huang Li Hui
Main category: cs.CV
TL;DR: AI-based surgical gauze counting system using YOLOv5 for real-time video monitoring to prevent gossypiboma (retained surgical items)
Details
Motivation: Prevent gossypiboma (retained surgical gauzes), which causes serious complications and legal issues; manual counting is time-consuming and diverts nursing resources from patient care.
Method: Real-time video surveillance with YOLOv5 deep learning model tracks gauzes on “In” and “Out” trays; integrated model detects both humans and gauzes using 11,000 training images; supports manual count adjustments
Result: Improved accuracy and increased frame rate from 8 FPS to 15 FPS; system now supports manual adjustments based on doctor feedback
Conclusion: AI-based gauze counting system offers reliable prevention method for gossypiboma, addressing limitations of manual counting and previous iterations
Abstract: During surgeries, there is a risk of medical gauzes being left inside patients’ bodies, a condition known as “Gossypiboma”, which can cause serious complications for patients and expose hospitals to legal problems from malpractice lawsuits and regulatory penalties. Diagnosis depends on imaging methods such as X-rays or CT scans, and the usual treatment involves surgical excision. Prevention methods, such as manual counts and RFID-integrated gauzes, aim to minimize gossypiboma risks. However, manual tallying of hundreds of gauzes by nurses is time-consuming and diverts resources from patient care. In partnership with Singapore General Hospital (SGH), we have developed a new prevention method: an AI-based system for gauze counting in surgical settings. Utilizing real-time video surveillance and object recognition technology powered by YOLOv5, a Deep Learning model was designed to monitor gauzes on two designated trays labelled “In” and “Out”. Gauzes are tracked from the “In” tray prior to their use in the patient’s body and in the “Out” tray post-use, ensuring accurate counting and verifying that no gauze remains inside the patient at the end of the surgery. We have trained it using numerous images from Operation Theatres and augmented them to cover all possible scenarios. This study has also addressed the shortcomings of previous project iterations. Previously, the project employed two models: one for human detection and another for gauze detection, trained on a total of 2,800 images. Now we have an integrated model capable of identifying both humans and gauzes, using a training set of 11,000 images. This has led to improvements in accuracy and increased the frame rate from 8 FPS to 15 FPS. Incorporating doctors’ feedback, the system now also supports manual count adjustments, enhancing its reliability in actual surgeries.
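The In/Out reconciliation logic around the detector can be sketched as a small counter with the manual-adjustment hook the paper mentions. This is illustrative only (detector integration, tray localization, and all names are omitted or hypothetical, not SGH's deployed code):

```python
class GauzeCounter:
    """Reconciles per-frame YOLO detections on the 'In' and 'Out'
    trays: gauzes that leave the 'In' tray are assumed to be in use
    until they reappear on the 'Out' tray."""
    def __init__(self):
        self.taken = 0      # gauzes removed from the 'In' tray so far
        self.returned = 0   # gauzes currently counted on the 'Out' tray

    def update(self, in_tray_count, prev_in_tray_count, out_tray_count):
        # A drop in the 'In' tray count means gauzes were taken for use.
        self.taken += max(0, prev_in_tray_count - in_tray_count)
        self.returned = out_tray_count

    def manual_adjust(self, delta):
        self.returned += delta   # doctor/nurse correction hook

    def unaccounted(self):
        return self.taken - self.returned

c = GauzeCounter()
c.update(in_tray_count=8, prev_in_tray_count=10, out_tray_count=1)
print(c.unaccounted())  # 1 -- one gauze still unaccounted for
c.update(in_tray_count=8, prev_in_tray_count=8, out_tray_count=2)
print(c.unaccounted())  # 0 -- all gauzes reconciled, safe to close
```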
[241] Memory-Efficient Fine-Tuning Diffusion Transformers via Dynamic Patch Sampling and Block Skipping
Sunghyun Park, Jeongho Kim, Hyoungwoo Park, Debasmit Das, Sungrack Yun, Munawar Hayat, Jaegul Choo, Fatih Porikli, Seokeon Choi
Main category: cs.CV
TL;DR: DiT-BlockSkip: Memory-efficient fine-tuning framework for Diffusion Transformers using dynamic patch sampling and block skipping for on-device personalization.
Details
Motivation: Fine-tuning Diffusion Transformers for personalized text-to-image generation requires substantial computational resources and memory, limiting practical deployment on resource-constrained devices like smartphones and IoT devices.
Method: Proposes DiT-BlockSkip with two key techniques: 1) Timestep-aware dynamic patch sampling that adjusts patch sizes based on diffusion timesteps and resizes to lower resolution, 2) Block skipping mechanism that selectively fine-tunes essential transformer blocks while precomputing residual features for skipped blocks, using cross-attention masking for block selection.
Result: Achieves competitive personalization performance both qualitatively and quantitatively while substantially reducing memory usage, enabling on-device feasibility for large-scale diffusion transformers.
Conclusion: DiT-BlockSkip provides an effective memory-efficient fine-tuning framework that maintains performance while reducing computational requirements, moving toward practical deployment of personalized diffusion transformers on resource-constrained devices.
Abstract: Diffusion Transformers (DiTs) have significantly enhanced text-to-image (T2I) generation quality, enabling high-quality personalized content creation. However, fine-tuning these models requires substantial computational complexity and memory, limiting practical deployment under resource constraints. To tackle these challenges, we propose a memory-efficient fine-tuning framework called DiT-BlockSkip, integrating timestep-aware dynamic patch sampling and block skipping by precomputing residual features. Our dynamic patch sampling strategy adjusts patch sizes based on the diffusion timestep, then resizes the cropped patches to a fixed lower resolution. This approach reduces forward & backward memory usage while allowing the model to capture global structures at higher timesteps and fine-grained details at lower timesteps. The block skipping mechanism selectively fine-tunes essential transformer blocks and precomputes residual features for the skipped blocks, significantly reducing training memory. To identify vital blocks for personalization, we introduce a block selection strategy based on cross-attention masking. Evaluations demonstrate that our approach achieves competitive personalization performance qualitatively and quantitatively, while reducing memory usage substantially, moving toward on-device feasibility (e.g., smartphones, IoT devices) for large-scale diffusion transformers.
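The timestep-aware patch sampling can be sketched as a schedule mapping timestep to patch size, with every sampled patch then resized to one fixed low resolution before the forward pass. The linear interpolation and the concrete sizes below are assumptions:

```python
def patch_size_for_timestep(t, t_max=1000, min_patch=64, max_patch=512):
    """Timestep-aware patch sampling schedule, sketched: sample large
    patches at high (noisy) timesteps to capture global structure and
    small patches at low timesteps to capture fine-grained detail.
    Each sampled patch is later resized to a fixed low resolution, so
    memory use per step stays constant regardless of patch size."""
    frac = t / t_max
    return int(min_patch + frac * (max_patch - min_patch))

print(patch_size_for_timestep(1000))  # 512 -- global structure
print(patch_size_for_timestep(0))     # 64  -- fine-grained details
```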
[242] PiLoT: Neural Pixel-to-3D Registration for UAV-based Ego and Target Geo-localization
Xiaoya Cheng, Long Wang, Yan Liu, Xinyi Liu, Hanlin Tan, Yu Liu, Maojun Zhang, Shen Yan
Main category: cs.CV
TL;DR: PiLoT is a unified framework for UAV-based ego and target geo-localization that directly registers live video against geo-referenced 3D maps, eliminating dependency on GNSS and active sensors.
Details
Motivation: Current UAV localization methods rely on GNSS and VIO fusion for ego-pose estimation and active sensors like laser rangefinders for target localization, making them vulnerable in GNSS-denied environments and requiring expensive hardware.Method: 1) Dual-Thread Engine decouples map rendering from core localization for low latency and drift-free accuracy; 2) Large-scale synthetic dataset with precise annotations enables training lightweight network for zero-shot simulation-to-real generalization; 3) Joint Neural-Guided Stochastic-Gradient Optimizer (JNGO) ensures robust convergence under aggressive motion.
Result: PiLoT outperforms state-of-the-art methods on comprehensive benchmarks while running over 25 FPS on NVIDIA Jetson Orin platform.
Conclusion: PiLoT provides a robust, accurate, real-time solution for UAV geo-localization that works in GNSS-denied environments without expensive active sensors, enabled by novel architectural and optimization techniques.
Abstract: We present PiLoT, a unified framework that tackles UAV-based ego and target geo-localization. Conventional approaches rely on decoupled pipelines that fuse GNSS and Visual-Inertial Odometry (VIO) for ego-pose estimation, and active sensors like laser rangefinders for target localization. However, these methods are susceptible to failure in GNSS-denied environments and incur substantial hardware costs and complexity. PiLoT breaks this paradigm by directly registering the live video stream against a geo-referenced 3D map. To achieve robust, accurate, and real-time performance, we introduce three key contributions: 1) a Dual-Thread Engine that decouples map rendering from the core localization thread, ensuring low latency while maintaining drift-free accuracy; 2) a large-scale synthetic dataset with precise geometric annotations (camera pose, depth maps). This dataset enables the training of a lightweight network that generalizes in a zero-shot manner from simulation to real data; and 3) a Joint Neural-Guided Stochastic-Gradient Optimizer (JNGO) that achieves robust convergence even under aggressive motion. Evaluations on a comprehensive set of public and newly collected benchmarks show that PiLoT outperforms state-of-the-art methods while running over 25 FPS on the NVIDIA Jetson Orin platform. Our code and dataset are available at: https://github.com/Choyaa/PiLoT.
[243] MEMO: Human-like Crisp Edge Detection Using Masked Edge Prediction
Jiaxin Cheng, Yue Wu, Yicong Zhou
Main category: cs.CV
TL;DR: MEMO is a method for producing crisp, human-like edge detection using only cross-entropy loss through masked training and progressive inference based on confidence gradients.
Details
Motivation: Existing edge detection models trained with cross-entropy loss produce thick edges that don’t match human-annotated single-pixel edges. Previous solutions required specialized loss functions or architecture changes, but this work shows training/inference strategy alone can achieve crisp edges.
Method: Proposes Masked Edge Prediction Model (MEMO) with: 1) Large-scale synthetic edge dataset for pre-training, 2) Lightweight fine-tuning module (1.2% additional params), 3) Training with varying input masking ratios, 4) Progressive inference strategy that finalizes edge predictions based on confidence gradients (high center, lower boundaries).
Result: Achieves visually appealing, post-processing-free, human-like edge maps. Outperforms prior methods on crispness-aware evaluations.
Conclusion: Carefully designed training and inference strategy alone (without specialized losses or architecture changes) is sufficient to achieve crisp, human-like edge quality using only cross-entropy loss.
Abstract: Learning-based edge detection models trained with cross-entropy loss often suffer from thick edge predictions, which deviate from the crisp, single-pixel annotations typically provided by humans. While previous approaches to achieving crisp edges have focused on designing specialized loss functions or modifying network architectures, we show that a carefully designed training and inference strategy alone is sufficient to achieve human-like edge quality. In this work, we introduce the Masked Edge Prediction MOdel (MEMO), which produces both accurate and crisp edges using only cross-entropy loss. We first construct a large-scale synthetic edge dataset to pre-train MEMO, enhancing its generalization ability. Subsequent fine-tuning on downstream datasets requires only a lightweight module comprising 1.2% additional parameters. During training, MEMO learns to predict edges under varying ratios of input masking. A key insight guiding our inference is that thick edge predictions typically exhibit a confidence gradient: high in the center and lower toward the boundaries. Leveraging this, we propose a novel progressive prediction strategy that sequentially finalizes edge predictions in order of prediction confidence, resulting in thinner and more precise contours. Our method achieves visually appealing, post-processing-free, human-like edge maps and outperforms prior methods on crispness-aware evaluations.
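The progressive inference idea, finalizing pixels in descending confidence order so a thick prediction collapses to its high-confidence center, can be sketched on a 1-D cross-section of an edge. The `keep_ratio` stopping rule is a hypothetical stand-in for the paper's actual sequential criterion:

```python
def progressive_thin(confidences, keep_ratio=0.2):
    """Progressive prediction, sketched: finalize edge pixels in
    descending confidence order. Because a thick predicted edge is
    most confident at its center and less confident toward its
    boundaries, committing only the top fraction recovers a thin,
    human-like single-pixel contour."""
    order = sorted(range(len(confidences)), key=lambda i: -confidences[i])
    k = max(1, round(keep_ratio * len(confidences)))
    kept = set(order[:k])
    return [1 if i in kept else 0 for i in range(len(confidences))]

# Cross-section of a 3-pixel-thick edge: confidence peaks at the center.
profile = [0.1, 0.6, 0.9, 0.6, 0.1]
print(progressive_thin(profile, keep_ratio=0.2))  # [0, 0, 1, 0, 0]
```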
[244] ME-IQA: Memory-Enhanced Image Quality Assessment via Re-Ranking
Kanglong Fan, Tianhe Wu, Wen Wen, Jianzhao Liu, Le Yang, Yabin Zhang, Yiting Liao, Junlin Li, Li Zhang
Main category: cs.CV
TL;DR: ME-IQA: A memory-enhanced re-ranking framework that improves reasoning-induced vision-language models for image quality assessment by addressing discrete collapse through memory retrieval, probabilistic comparison, and gated reflection.
Details
Motivation: Reasoning-induced VLMs advance IQA with textual reasoning, but their scalar scores suffer from discrete collapse (limited sensitivity, collapsing to few values) and lack distortion sensitivity.
Method: Plug-and-play test-time framework that: (1) builds memory bank and retrieves semantically/perceptually aligned neighbors using reasoning summaries, (2) reframes VLM as probabilistic comparator for pairwise preference probabilities, fusing ordinal evidence with initial scores using Thurstone’s Case V model, (3) performs gated reflection and consolidates memory for future decisions.
Result: Produces denser, distortion-sensitive predictions that mitigate discrete collapse. Experiments across multiple IQA benchmarks show consistent gains over reasoning-induced VLM baselines, existing non-reasoning IQA methods, and test-time scaling alternatives.
Conclusion: ME-IQA effectively addresses discrete collapse in reasoning-induced VLMs for IQA through memory-enhanced re-ranking, improving sensitivity and performance across benchmarks.
Abstract: Reasoning-induced vision-language models (VLMs) advance image quality assessment (IQA) with textual reasoning, yet their scalar scores often lack sensitivity and collapse to a few values, so-called discrete collapse. We introduce ME-IQA, a plug-and-play, test-time memory-enhanced re-ranking framework. It (i) builds a memory bank and retrieves semantically and perceptually aligned neighbors using reasoning summaries, (ii) reframes the VLM as a probabilistic comparator to obtain pairwise preference probabilities and fuse this ordinal evidence with the initial score under Thurstone’s Case V model, and (iii) performs gated reflection and consolidates memory to improve future decisions. This yields denser, distortion-sensitive predictions and mitigates discrete collapse. Experiments across multiple IQA benchmarks show consistent gains over strong reasoning-induced VLM baselines, existing non-reasoning IQA methods, and test-time scaling alternatives.
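Thurstone's Case V, which ME-IQA uses to turn pairwise preference probabilities into quality scale values, has a standard least-squares solution: each score is the mean inverse-normal-CDF of its row of preference probabilities. A self-contained sketch (the fusion of these scores with the VLM's initial scalar prediction is omitted):

```python
from statistics import NormalDist

def thurstone_case_v(pref):
    """Thurstone Case V scaling: pref[i][j] is the probability that
    candidate i is preferred over candidate j. Under Case V,
    s_i - s_j = Phi^{-1}(pref[i][j]); averaging over j gives the
    least-squares estimate of each scale value."""
    nd = NormalDist()
    n = len(pref)
    scores = []
    for i in range(n):
        zs = [nd.inv_cdf(min(max(pref[i][j], 1e-6), 1 - 1e-6))  # clamp 0/1
              for j in range(n) if j != i]
        scores.append(sum(zs) / len(zs))
    return scores

# Image 0 beats both others; image 2 loses to both.
pref = [[0.5, 0.8, 0.9],
        [0.2, 0.5, 0.7],
        [0.1, 0.3, 0.5]]
s = thurstone_case_v(pref)
print(s[0] > s[1] > s[2])  # True -- scores follow the pairwise evidence
```

Because the resulting scores are continuous, fusing them with the VLM's discrete initial score naturally densifies the prediction, which is the mechanism behind mitigating discrete collapse.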
[245] Does Peer Observation Help? Vision-Sharing Collaboration for Vision-Language Navigation
Qunchao Jin, Yiliao Song, Qi Wu
Main category: cs.CV
TL;DR: Co-VLN enables collaborative vision-language navigation by allowing multiple agents to share perceptual memories when they traverse common locations, expanding each agent’s receptive field without additional exploration cost.
Details
Motivation: Traditional VLN systems suffer from partial observability since agents can only gather information from locations they personally visit. With multiple robots increasingly coexisting in shared environments, the authors investigate whether agents can benefit from sharing observations with peers navigating the same space.
Method: Co-VLN is a minimalist, model-agnostic framework where independently navigating agents exchange structured perceptual memory when they identify common traversed locations. This expands each agent’s receptive field without requiring additional exploration. The framework is validated on the R2R benchmark using two representative paradigms: learning-based DUET and zero-shot MapGPT.
Result: Vision-sharing enabled models yield substantial performance improvements across both paradigms. The framework establishes a strong foundation for collaborative embodied navigation research and systematically reveals the underlying dynamics of peer observation sharing in VLN.
Conclusion: Co-VLN demonstrates that peer observation sharing can significantly benefit VLN systems by overcoming partial observability constraints, enabling agents to leverage each other’s perceptual experiences for improved navigation performance.
Abstract: Vision-Language Navigation (VLN) systems are fundamentally constrained by partial observability, as an agent can only accumulate knowledge from locations it has personally visited. As multiple robots increasingly coexist in shared environments, a natural question arises: can agents navigating the same space benefit from each other’s observations? In this work, we introduce Co-VLN, a minimalist, model-agnostic framework for systematically investigating whether and how peer observations from concurrently navigating agents can benefit VLN. When independently navigating agents identify common traversed locations, they exchange structured perceptual memory, effectively expanding each agent’s receptive field at no additional exploration cost. We validate our framework on the R2R benchmark under two representative paradigms (the learning-based DUET and the zero-shot MapGPT), and conduct extensive analytical experiments to systematically reveal the underlying dynamics of peer observation sharing in VLN. Results demonstrate that the vision-sharing-enabled model yields substantial performance improvements across both paradigms, establishing a strong foundation for future research in collaborative embodied navigation.
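The exchange rule, agents that share at least one traversed location adopt each other's observations for places they have not visited, can be sketched over plain location-to-observation maps (the structured perceptual memory format is simplified away; names are illustrative):

```python
def share_memory(mem_a, mem_b):
    """Peer memory exchange, sketched: if two agents' trajectories
    overlap in at least one location, each adopts the other's
    observations for locations it has not visited itself, expanding
    its receptive field at zero extra exploration cost."""
    if not (mem_a.keys() & mem_b.keys()):
        return mem_a, mem_b          # no common location: no exchange
    new_a = {**mem_b, **mem_a}       # an agent's own observations win
    new_b = {**mem_a, **mem_b}
    return new_a, new_b

a = {"hall": "obs_a1", "kitchen": "obs_a2"}
b = {"hall": "obs_b1", "garden": "obs_b2"}
a2, b2 = share_memory(a, b)
print(sorted(a2))   # ['garden', 'hall', 'kitchen'] -- receptive field grew
print(a2["hall"])   # 'obs_a1' -- own observation kept for shared spots
```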
[246] Less is More in Semantic Space: Intrinsic Decoupling via Clifford-M for Fundus Image Classification
Yifeng Zheng
Main category: cs.CV
TL;DR: Clifford-M is a lightweight backbone for multi-label fundus diagnosis that uses sparse geometric interaction instead of explicit frequency decomposition, achieving competitive performance with minimal parameters.
Details
Motivation: Multi-label fundus diagnosis requires capturing both fine-grained lesions and large-scale retinal structure. Existing multi-scale medical vision models use explicit frequency decomposition, but ablation studies show limited benefits with increased computational cost.
Method: Proposes Clifford-M, a lightweight backbone replacing feed-forward expansion and frequency-splitting modules with sparse geometric interaction. Uses Clifford-style rolling product to jointly capture alignment and structural variation with linear complexity, enabling efficient cross-scale fusion in a compact dual-resolution architecture.
Result: Without pre-training, achieves mean AUC-ROC of 0.8142 and mean macro-F1 of 0.5481 on ODIR-5K using only 0.85M parameters, outperforming larger CNN baselines. On RFMiD without fine-tuning, attains 0.7425 macro AUC and 0.7610 micro AUC, showing robustness to cross-dataset shift.
Conclusion: Competitive and efficient fundus diagnosis can be achieved without explicit frequency engineering by designing core feature interaction to capture multi-scale structure directly through sparse geometric interaction.
Abstract: Multi-label fundus diagnosis requires features that capture both fine-grained lesions and large-scale retinal structure. Many multi-scale medical vision models address this challenge through explicit frequency decomposition, but our ablation studies show that such heuristics provide limited benefit in this setting: replacing the proposed simple dual-resolution stem with Octave Convolution increased parameters by 35% and computation 2.23-fold without improving mean accuracy, while a fixed wavelet-based variant performed substantially worse. Motivated by these findings, we propose Clifford-M, a lightweight backbone that replaces both feed-forward expansion and frequency-splitting modules with sparse geometric interaction. The model is built on a Clifford-style rolling product that jointly captures alignment and structural variation with linear complexity, enabling efficient cross-scale fusion and self-refinement in a compact dual-resolution architecture. Without pre-training, Clifford-M achieves a mean AUC-ROC of 0.8142 and a mean macro-F1 (optimal threshold) of 0.5481 on ODIR-5K using only 0.85M parameters, outperforming substantially larger mid-scale CNN baselines under the same training protocol. When evaluated on RFMiD without fine-tuning, it attains 0.7425 +/- 0.0198 macro AUC and 0.7610 +/- 0.0344 micro AUC, indicating reasonable robustness to cross-dataset shift. These results suggest that competitive and efficient fundus diagnosis can be achieved without explicit frequency engineering, provided that the core feature interaction is designed to capture multi-scale structure directly.
[247] Predictive Regularization Against Visual Representation Degradation in Multimodal Large Language Models
Enguang Wang, Qiang Wang, Yuanchen Wu, Ke Yan, Xinbin Yuan, Shouhong Ding, Xialei Liu, Ming-Ming Cheng
Main category: cs.CV
TL;DR: MLLMs suffer from visual representation degradation in middle layers due to text-generation optimization, and Predictive Regularization (PRe) helps maintain visual fidelity to improve multimodal performance.
Details
Motivation: The paper investigates whether MLLMs' language-driven training compromises their internal visual foundational competence, seeking to understand if visual representation degradation occurs and how it affects multimodal understanding.
Method: Conducts diagnostic analysis of visual representation degradation in MLLMs, identifies degradation in middle layers, and proposes Predictive Regularization (PRe) to force degraded intermediate features to predict initial visual features, maintaining visual attributes.
Result: Extensive experiments show that mitigating visual degradation through PRe effectively boosts vision-language performance, confirming the importance of robust internal visual representations for comprehensive multimodal understanding.
Conclusion: Visual representation degradation is a pervasive issue in MLLMs driven by text-generation optimization, and maintaining visual fidelity through techniques like PRe is crucial for building robust multimodal models with both strong cross-modal reasoning and core visual competence.
Abstract: While Multimodal Large Language Models (MLLMs) excel at vision-language tasks, the cost of their language-driven training on internal visual foundational competence remains unclear. In this paper, we conduct a detailed diagnostic analysis to unveil a pervasive issue: visual representation degradation in MLLMs. Specifically, we find that compared to the initial visual features, the visual representation in the middle layers of LLM exhibits both a degradation in global function and patch structure. We attribute this phenomenon to a visual sacrifice driven by the singular text-generation objective, where the model compromises its visual fidelity to optimize for answer generation. We argue that a robust MLLM requires both strong cross-modal reasoning and core visual competence, and propose Predictive Regularization (PRe) to force degraded intermediate features to predict initial visual features, thereby maintaining the inherent visual attributes of the MLLM’s internal representations. Extensive experiments confirm that mitigating this visual degradation effectively boosts vision-language performance, underscoring the critical importance of fostering robust internal visual representations within MLLMs for comprehensive multimodal understanding.
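The PRe idea above — an auxiliary objective forcing degraded mid-layer features to predict the initial visual features — can be illustrated with a minimal NumPy sketch. The linear prediction head, the loss weight, and the toy features below are all hypothetical; the paper's actual architecture and formulation may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def pre_regularizer(intermediate, initial, W):
    """Hypothetical PRe-style auxiliary loss: a linear head W maps
    degraded intermediate features back toward the initial visual
    features; the MSE penalizes loss of visual information."""
    predicted = intermediate @ W
    return float(np.mean((predicted - initial) ** 2))

# toy features: 16 visual tokens with 32-dim embeddings
initial = rng.normal(size=(16, 32))                        # features entering the LLM
intermediate = 0.1 * initial + rng.normal(size=(16, 32))   # "degraded" mid-layer features
W = 0.1 * rng.normal(size=(32, 32))                        # prediction head (would be learned)

aux = pre_regularizer(intermediate, initial, W)
lm_loss = 2.0                      # placeholder text-generation loss
total = lm_loss + 0.1 * aux        # auxiliary term added to the training objective
print(round(aux, 4), round(total, 4))
```

During training, gradients from the auxiliary term would flow back into the LLM's intermediate layers, discouraging the "visual sacrifice" the paper describes.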
[248] Lean Learning Beyond Clouds: Efficient Discrepancy-Conditioned Optical-SAR Fusion for Semantic Segmentation
Chenxing Meng, Wuzhou Quan, Yingjie Cai, Liqun Cao, Liyan Zhang, Mingqiang Wei
Main category: cs.CV
TL;DR: EDC: An efficiency-oriented optical-SAR semantic segmentation framework that uses tri-stream encoding with carrier tokens and discrepancy-conditioned fusion to handle cloud occlusion while reducing computational overhead.
Details
Motivation: Cloud occlusion degrades optical remote sensing imagery, and while SAR provides complementary data, existing methods suffer from computational inefficiency and noise propagation when using dense global attention for cross-modal fusion.
Method: Proposes EDC with: 1) Tri-stream encoder with Carrier Tokens for compact global context modeling, 2) Discrepancy-Conditioned Hybrid Fusion (DCHF) to selectively suppress unreliable regions during global aggregation, and 3) Auxiliary cloud removal branch with teacher-guided distillation.
Result: Achieves superior accuracy and efficiency: improves mIoU by 0.56% on M3M-CR and 0.88% on WHU-OPT-SAR datasets, while reducing parameters by 46.7% and accelerating inference by 1.98×.
Conclusion: EDC effectively addresses the efficiency-reliability trade-off in optical-SAR semantic segmentation under cloud occlusion through compact global modeling and selective fusion mechanisms.
Abstract: Cloud occlusion severely degrades the semantic integrity of optical remote sensing imagery. While incorporating Synthetic Aperture Radar (SAR) provides complementary observations, achieving efficient global modeling and reliable cross-modal fusion under cloud interference remains challenging. Existing methods rely on dense global attention to capture long-range dependencies, yet such aggregation indiscriminately propagates cloud-induced noise. Improving robustness typically entails enlarging model capacity, which further increases computational overhead. Given the large-scale and high-resolution nature of remote sensing applications, such computational demands hinder practical deployment, leading to an efficiency-reliability trade-off. To address this dilemma, we propose EDC, an efficiency-oriented and discrepancy-conditioned optical-SAR semantic segmentation framework. A tri-stream encoder with Carrier Tokens enables compact global context modeling with reduced complexity. To prevent noise contamination, we introduce a Discrepancy-Conditioned Hybrid Fusion (DCHF) mechanism that selectively suppresses unreliable regions during global aggregation. In addition, an auxiliary cloud removal branch with teacher-guided distillation enhances semantic consistency under occlusion. Extensive experiments demonstrate that EDC achieves superior accuracy and efficiency, improving mIoU by 0.56% and 0.88% on M3M-CR and WHU-OPT-SAR, respectively, while reducing the number of parameters by 46.7% and accelerating inference by 1.98$\times$. Our implementation is available at https://github.com/mengcx0209/EDC.
[249] PlanaReLoc: Camera Relocalization in 3D Planar Primitives via Region-Based Structure Matching
Hanqiao Ye, Yuzhou Liu, Yangdong Liu, Shuhan Shen
Main category: cs.CV
TL;DR: PlanaReLoc: A novel camera relocalization method using planar primitives and 3D planar maps for 6-DoF pose estimation in structured environments without requiring textured maps or per-scene training.
Details
Motivation: Traditional structure-based relocalizers rely on point correspondences, but planar primitives offer richer structural and semantic information while being fundamental in projective geometry. The authors aim to leverage planes for more robust cross-modal structural correspondences in camera relocalization.
Method: Introduces PlanaReLoc, a plane-centric paradigm where a deep matcher associates planar primitives between query images and 3D planar maps in a learned unified embedding space. The 6-DoF pose is then solved and refined under a robust framework.
Result: Comprehensive experiments on ScanNet and 12Scenes datasets across hundreds of scenes show superiority of planar primitives in facilitating reliable cross-modal structural correspondences and achieving effective camera relocalization without requiring textured maps, pose priors, or per-scene training.
Conclusion: Planar primitives provide effective region-based representations for camera relocalization, enabling reliable structural correspondences and robust pose estimation in structured environments with minimal requirements.
Abstract: While structure-based relocalizers have long strived for point correspondences when establishing or regressing query-map associations, in this paper, we pioneer the use of planar primitives and 3D planar maps for lightweight 6-DoF camera relocalization in structured environments. Planar primitives, beyond being fundamental entities in projective geometry, also serve as region-based representations that encapsulate both structural and semantic richness. This motivates us to introduce PlanaReLoc, a streamlined plane-centric paradigm where a deep matcher associates planar primitives across the query image and the map within a learned unified embedding space, after which the 6-DoF pose is solved and refined under a robust framework. Through comprehensive experiments on the ScanNet and 12Scenes datasets across hundreds of scenes, our method demonstrates the superiority of planar primitives in facilitating reliable cross-modal structural correspondences and achieving effective camera relocalization without requiring realistically textured/colored maps, pose priors, or per-scene training. The code and data are available at https://github.com/3dv-casia/PlanaReLoc .
[250] EruDiff: Refactoring Knowledge in Diffusion Models for Advanced Text-to-Image Synthesis
Xiefan Guo, Xinzhu Ma, Haoxiang Ma, Zihao Zhou, Di Huang
Main category: cs.CV
TL;DR: EruDiff improves diffusion models’ ability to handle implicit prompts requiring world knowledge by refactoring knowledge structures through distribution matching and reinforcement learning.
Details
Motivation: Current text-to-image diffusion models excel with explicit prompts but fail with implicit prompts requiring deep world knowledge, leading to counter-factual synthesis due to chaotic knowledge organization.
Method: Proposes EruDiff with two components: 1) Diffusion Knowledge Distribution Matching (DK-DM) to align knowledge distributions of implicit prompts with explicit anchors, and 2) Negative-Only Reinforcement Learning (NO-RL) to correct biases in explicit prompt rendering.
Result: Significantly enhances performance of leading diffusion models (FLUX and Qwen-Image) on both scientific knowledge (Science-T2I) and world knowledge (WISE) benchmarks, demonstrating effectiveness and generalizability.
Conclusion: EruDiff successfully addresses the knowledge dislocation problem in diffusion models, enabling better handling of implicit prompts requiring world knowledge through knowledge refactoring and bias correction.
Abstract: Text-to-image diffusion models have achieved remarkable fidelity in synthesizing images from explicit text prompts, yet exhibit a critical deficiency in processing implicit prompts that require deep-level world knowledge, ranging from natural sciences to cultural commonsense, resulting in counter-factual synthesis. This paper traces the root of this limitation to a fundamental dislocation of the underlying knowledge structures, manifesting as a chaotic organization of implicit prompts compared to their explicit counterparts. In this paper, we propose EruDiff, which aims to refactor the knowledge within diffusion models. Specifically, we develop the Diffusion Knowledge Distribution Matching (DK-DM) to register the knowledge distribution of intractable implicit prompts with that of well-defined explicit anchors. Furthermore, to rectify the inherent biases in explicit prompt rendering, we employ the Negative-Only Reinforcement Learning (NO-RL) strategy for fine-grained correction. Rigorous empirical evaluations demonstrate that our method significantly enhances the performance of leading diffusion models, including FLUX and Qwen-Image, across both the scientific knowledge benchmark (i.e., Science-T2I) and the world knowledge benchmark (i.e., WISE), underscoring the effectiveness and generalizability. Our code is available at https://github.com/xiefan-guo/erudiff.
[251] MERIT: Multi-domain Efficient RAW Image Translation
Wenjun Huang, Shenghao Fu, Yian Jin, Yang Ni, Ziteng Cui, Hanning Chen, Yirui He, Yezi Liu, Sanggeon Yun, SungHeon Jeong, Ryozo Masukawa, William Youngwoo Chung, Mohsen Imani
Main category: cs.CV
TL;DR: MERIT is a unified framework for multi-domain RAW image translation using a single model to handle translations across different camera sensors, addressing domain shifts in spectral responses, noise, and tone behaviors.
Details
Motivation: RAW images from different camera sensors have substantial domain shifts due to varying spectral responses, noise characteristics, and tone behaviors, making them difficult to use directly in computer vision tasks. Existing methods require training separate translators for each source-target camera pair, which doesn't scale to real-world scenarios with multiple commercial cameras.
Method: MERIT introduces a unified framework with: 1) sensor-aware noise modeling loss that aligns signal-dependent noise statistics of generated images with target domain, 2) conditional multi-scale large kernel attention module for improved context and sensor-aware feature modeling, and 3) MDRAW dataset for multi-domain RAW image translation evaluation.
Result: Extensive experiments show MERIT outperforms prior models with 5.56 dB improvement in quality and 80% reduction in training iterations. The framework demonstrates superior scalability and performance across multiple camera domains.
Conclusion: MERIT provides an effective unified solution for multi-domain RAW image translation that addresses the scalability limitations of previous approaches while improving translation quality through sensor-aware noise modeling and attention mechanisms.
Abstract: RAW images captured by different camera sensors exhibit substantial domain shifts due to varying spectral responses, noise characteristics, and tone behaviors, complicating their direct use in downstream computer vision tasks. Prior methods address this problem by training domain-specific RAW-to-RAW translators for each source-target pair, but such approaches do not scale to real-world scenarios involving multiple types of commercial cameras. In this work, we introduce MERIT, the first unified framework for multi-domain RAW image translation, which leverages a single model to perform translations across arbitrary camera domains. To address domain-specific noise discrepancies, we propose a sensor-aware noise modeling loss that explicitly aligns the signal-dependent noise statistics of the generated images with those of the target domain. We further enhance the generator with a conditional multi-scale large kernel attention module for improved context and sensor-aware feature modeling. To facilitate standardized evaluation, we introduce MDRAW, the first dataset tailored for multi-domain RAW image translation, comprising both paired and unpaired RAW captures from five diverse camera sensors across a wide range of scenes. Extensive experiments demonstrate that MERIT outperforms prior models in both quality (5.56 dB improvement) and scalability (80% reduction in training iterations).
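The sensor-aware noise modeling idea — matching signal-dependent noise statistics between generated and target-domain RAW images — can be sketched with a standard heteroscedastic-noise diagnostic: bin pixels by clean intensity, measure residual variance per bin, and penalize the distance between the two variance-vs-signal curves. The binning scheme and L1 distance below are illustrative choices, not MERIT's exact loss.

```python
import numpy as np

rng = np.random.default_rng(1)

def noise_curve(clean, noisy, bins=8):
    """Estimate signal-dependent noise variance: bin pixels by clean
    intensity and measure the residual variance within each bin."""
    resid = noisy - clean
    edges = np.linspace(clean.min(), clean.max(), bins + 1)
    idx = np.clip(np.digitize(clean, edges) - 1, 0, bins - 1)
    return np.array([resid[idx == b].var() if np.any(idx == b) else 0.0
                     for b in range(bins)])

def noise_alignment_loss(gen_curve, tgt_curve):
    # L1 distance between the two variance-vs-signal curves
    return float(np.abs(gen_curve - tgt_curve).mean())

clean = rng.uniform(0, 1, size=(64, 64))
# target domain: shot noise (variance grows with signal) plus read noise
target = clean + rng.normal(scale=np.sqrt(0.01 * clean + 1e-4))
# generated image with flat, signal-independent noise (mismatched statistics)
generated = clean + rng.normal(scale=0.05, size=clean.shape)

loss = noise_alignment_loss(noise_curve(clean, generated), noise_curve(clean, target))
print(round(loss, 6))
```

A nonzero loss here flags the mismatch between flat noise and the target's signal-dependent noise; minimizing it would push the generator toward the target sensor's noise profile.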
[252] Dodgersort: Uncertainty-Aware VLM-Guided Human-in-the-Loop Pairwise Ranking
Yujin Park, Haejun Chung, Ikbeom Jang
Main category: cs.CV
TL;DR: Dodgersort: A method for efficient pairwise comparison labeling using CLIP-based hierarchical pre-ordering, neural ranking, probabilistic ensembles, and information-theoretic pair selection to reduce annotation costs while improving ranking reliability.
Details
Motivation: Pairwise comparison labeling yields higher inter-rater reliability than conventional classification, but exhaustive comparisons require quadratic cost. Need methods to reduce human comparisons while maintaining or improving ranking reliability.
Method: Combines CLIP-based hierarchical pre-ordering, neural ranking head, probabilistic ensemble (Elo, BTL, GP), epistemic-aleatoric uncertainty decomposition, and information-theoretic pair selection to intelligently select which pairs to compare.
Result: Achieves 11-16% annotation reduction while improving inter-rater reliability in visual ranking tasks (medical imaging, historical dating, aesthetics). Extracts 5-20× more ranking information per comparison than baselines in FG-NET with ground-truth ages.
Conclusion: Dodgersort provides Pareto-optimal accuracy-efficiency trade-offs for ranking tasks, with neural adaptation and ensemble uncertainty being key components for cross-domain performance gains.
Abstract: Pairwise comparison labeling is emerging as it yields higher inter-rater reliability than conventional classification labeling, but exhaustive comparisons require quadratic cost. We propose Dodgersort, which leverages CLIP-based hierarchical pre-ordering, a neural ranking head and probabilistic ensemble (Elo, BTL, GP), epistemic–aleatoric uncertainty decomposition, and information-theoretic pair selection. It reduces human comparisons while improving the reliability of the rankings. In visual ranking tasks in medical imaging, historical dating, and aesthetics, Dodgersort achieves an 11–16% annotation reduction while improving inter-rater reliability. Cross-domain ablations across four datasets show that neural adaptation and ensemble uncertainty are key to this gain. In FG-NET with ground-truth ages, the framework extracts 5–20$\times$ more ranking information per comparison than baselines, yielding Pareto-optimal accuracy–efficiency trade-offs.
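The core loop — maintain probabilistic rankings and spend human comparisons only on the most informative pairs — can be sketched with an Elo model, picking the pair whose outcome entropy is highest (win probability nearest 0.5). This is a simple stand-in for the paper's information-theoretic selection over an Elo/BTL/GP ensemble; the function names and K-factor are illustrative.

```python
import math

def elo_expected(ra, rb):
    """Win probability of a over b under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))

def elo_update(ra, rb, a_wins, k=32.0):
    e = elo_expected(ra, rb)
    s = 1.0 if a_wins else 0.0
    return ra + k * (s - e), rb - k * (s - e)

def most_informative_pair(ratings):
    """Pick the pair with highest outcome entropy, i.e. expected win
    probability closest to 0.5 (a simple information-theoretic criterion)."""
    best, best_h = None, -1.0
    items = list(ratings)
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            p = elo_expected(ratings[items[i]], ratings[items[j]])
            h = -(p * math.log(p) + (1 - p) * math.log(1 - p))
            if h > best_h:
                best, best_h = (items[i], items[j]), h
    return best

ratings = {"a": 1600.0, "b": 1400.0, "c": 1590.0}
pair = most_informative_pair(ratings)   # "a" vs "c": nearly tied, most informative
ratings[pair[0]], ratings[pair[1]] = elo_update(ratings[pair[0]], ratings[pair[1]], True)
print(pair)
```

Querying the annotator only on such near-tied pairs is what lets active-selection schemes cut comparison counts relative to exhaustive or random pairing.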
[253] GOLDMARK: Governed Outcome-Linked Diagnostic Model Assessment Reference Kit
Chad Vanderbilt, Gabriele Campanella, Siddharth Singi, Swaraj Nanda, Jie-Fu Chen, Ali Kamali, Amir Momeni Boroujeni, David Kim, Mohamed Yakoub, Jamal Benhamida, Meera Hameed, Neeraj Kumar, Gregory Goldgof
Main category: cs.CV
TL;DR: GOLDMARK is a standardized benchmarking framework for computational pathology that provides structured data formats, feature embeddings, and evaluation metrics for reproducible AI-based biomarker development from histopathology images.
Details
Motivation: Computational pathology lacks standardized intermediate data formats, provenance tracking, checkpointing conventions, and reproducible evaluation metrics needed for clinical-grade deployment of AI-based biomarkers from whole-slide images.
Method: Developed GOLDMARK framework with curated TCGA cohort and clinically actionable biomarker labels. Provides structured intermediate representations including tile coordinate maps, per-slide feature embeddings from pathology foundation models, quality-control metadata, patient-level splits, trained models, and evaluation outputs.
Result: Across 33 tumor-biomarker tasks, mean AUROC was 0.689 (TCGA) and 0.630 (MSKCC). For eight highest-performing tasks, mean AUROCs were 0.831 and 0.801 respectively, showing stable cross-site performance for established morphologic-genomic associations.
Conclusion: GOLDMARK establishes a shared experimental substrate for computational pathology, enabling reproducible benchmarking and direct comparison of methods across datasets and models.
Abstract: Computational biomarkers (CBs) are histopathology-derived patterns extracted from hematoxylin-eosin (H&E) whole-slide images (WSIs) using artificial intelligence (AI) to predict therapeutic response or prognosis. Recently, slide-level multiple-instance learning (MIL) with pathology foundation models (PFMs) has become the standard baseline for CB development. While these methods have improved predictive performance, computational pathology lacks standardized intermediate data formats, provenance tracking, checkpointing conventions, and reproducible evaluation metrics required for clinical-grade deployment. We introduce GOLDMARK (https://artificialintelligencepathology.org), a standardized benchmarking framework built on a curated TCGA cohort with clinically actionable OncoKB level 1-3 biomarker labels. GOLDMARK releases structured intermediate representations, including tile coordinate maps, per-slide feature embeddings from canonical PFMs, quality-control metadata, predefined patient-level splits, trained slide-level models, and evaluation outputs. Models are trained on TCGA and evaluated on an independent MSKCC cohort with reciprocal testing. Across 33 tumor-biomarker tasks, mean AUROC was 0.689 (TCGA) and 0.630 (MSKCC). Restricting to the eight highest-performing tasks yielded mean AUROCs of 0.831 and 0.801, respectively. These tasks correspond to established morphologic-genomic associations (e.g., LGG IDH1, COAD MSI/BRAF, THCA BRAF/NRAS, BLCA FGFR3, UCEC PTEN) and showed the most stable cross-site performance. Differences between canonical encoders were modest relative to task-specific variability. GOLDMARK establishes a shared experimental substrate for computational pathology, enabling reproducible benchmarking and direct comparison of methods across datasets and models.
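AUROC, the headline metric in the results above, has a useful probabilistic reading: the chance that a randomly chosen positive case outscores a randomly chosen negative one (ties counting half). A tiny self-contained sketch of that definition, on made-up toy scores:

```python
def auroc(labels, scores):
    """AUROC as the probability that a random positive outscores a
    random negative, with ties counted as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# toy biomarker prediction scores (not GOLDMARK data)
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.5, 0.6, 0.2, 0.8, 0.4]
print(auroc(labels, scores))   # 8 of 9 positive-negative pairs ranked correctly
```

Under this reading, the reported mean AUROC of 0.689 on TCGA means a random biomarker-positive slide outscores a random negative one about 69% of the time.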
[254] Glove2Hand: Synthesizing Natural Hand-Object Interaction from Multi-Modal Sensing Gloves
Xinyu Zhang, Ziyi Kou, Chuan Qin, Mia Huang, Ergys Ristani, Ankit Kumar, Lele Chen, Kun He, Abdeslam Boularias, Li Guan
Main category: cs.CV
TL;DR: Glove2Hand translates multi-modal sensing glove videos into photorealistic bare hands while preserving physical interaction dynamics, enabling creation of HandSense dataset with synchronized tactile/IMU signals for enhanced hand-object interaction applications.
Details
Motivation: Conventional hand videos lack essential physical information like contact forces and motion signals, and suffer from frequent occlusions. There's a need for better hand-object interaction understanding with physical dynamics preserved.
Method: 1) Novel 3D Gaussian hand model for temporal rendering consistency; 2) Diffusion-based hand restorer for seamless scene integration and handling complex interactions/deformations; 3) Framework translates multi-modal sensing glove videos to photorealistic bare hands.
Result: Created HandSense, the first multi-modal HOI dataset with glove-to-hand videos and synchronized tactile/IMU signals. Demonstrated significant enhancement of downstream applications including video-based contact estimation and hand tracking under severe occlusion.
Conclusion: Glove2Hand enables creation of physically-informed hand-object interaction datasets that overcome limitations of conventional hand videos, advancing computer vision, robotics, and AR/VR applications.
Abstract: Understanding hand-object interaction (HOI) is fundamental to computer vision, robotics, and AR/VR. However, conventional hand videos often lack essential physical information such as contact forces and motion signals, and are prone to frequent occlusions. To address the challenges, we present Glove2Hand, a framework that translates multi-modal sensing glove HOI videos into photorealistic bare hands, while faithfully preserving the underlying physical interaction dynamics. We introduce a novel 3D Gaussian hand model that ensures temporal rendering consistency. The rendered hand is seamlessly integrated into the scene using a diffusion-based hand restorer, which effectively handles complex hand-object interactions and non-rigid deformations. Leveraging Glove2Hand, we create HandSense, the first multi-modal HOI dataset featuring glove-to-hand videos with synchronized tactile and IMU signals. We demonstrate that HandSense significantly enhances downstream bare-hand applications, including video-based contact estimation and hand tracking under severe occlusion.
[255] Ensemble of Small Classifiers For Imbalanced White Blood Cell Classification
Siddharth Srivastava, Adam Smith, Scott Brooks, Jack Bacon, Till Bretschneider
Main category: cs.CV
TL;DR: Lightweight ensemble of pretrained vision models (SwinV2-Tiny, DinoBloom-Small, ConvNeXT-V2-Tiny) achieves excellent performance on white blood cell classification for leukemia diagnosis, addressing challenges of rare cell types and dataset imbalance.
Details
Motivation: Automating white blood cell classification for leukemia diagnosis is needed to reduce reliance on time-consuming expert pathologist examination. Current algorithms struggle with rare cell types due to variations in staining, scanning, and inter-patient heterogeneity.
Method: Proposes a lightweight ensemble approach using pretrained SwinV2-Tiny, DinoBloom-Small, and ConvNeXT-V2-Tiny models. Uses dataset expansion to address class imbalance, trains 3 instantiations of each architecture in stratified 3-fold cross-validation, and aggregates predictions through logit averaging from all 9 models.
Result: The ensemble achieves excellent performance on the challenging white blood cell classification dataset, though the model shows weaknesses in confusing similar-looking myelocytes in granulopoiesis and lymphocytes in lymphopoiesis.
Conclusion: A simple ensemble of lightweight pretrained vision models can effectively automate white blood cell classification for leukemia diagnosis, demonstrating strong performance despite challenges with rare cell types and dataset imbalances.
Abstract: Automating white blood cell classification for diagnosis of leukaemia is a promising alternative to time-consuming and resource-intensive examination of cells by expert pathologists. However, designing robust algorithms for classification of rare cell types remains challenging due to variations in staining, scanning and inter-patient heterogeneity. We propose a lightweight ensemble approach for classification of cells during Haematopoiesis, with a focus on the biology of Granulopoiesis, Monocytopoiesis and Lymphopoiesis. Through dataset expansion to alleviate some class imbalance, we demonstrate that a simple ensemble of lightweight pretrained SwinV2-Tiny, DinoBloom-Small and ConvNeXT-V2-Tiny models achieves excellent performance on this challenging dataset. We train 3 instantiations of each architecture in a stratified 3-fold cross-validation framework; for an input image, we forward-pass through all 9 models and aggregate through logit averaging. We further reason on the weaknesses of our model in confusing similar-looking myelocytes in granulopoiesis and lymphocytes in lymphopoiesis. Code: https://gitlab.com/siddharthsrivastava/wbc-bench-2026.
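The aggregation step described above (forward-pass through all 9 models, then average logits) is straightforward to sketch; the toy logit values below are invented for illustration.

```python
import numpy as np

def ensemble_predict(all_logits):
    """Average logits from every model (here 9 = 3 architectures x 3
    cross-validation folds, as in the paper), then take the argmax class."""
    mean_logits = np.mean(all_logits, axis=0)   # shape: (n_classes,)
    return int(np.argmax(mean_logits))

# toy logits for 9 models over 5 cell classes: each model favors class 3,
# with a small per-model offset standing in for model-to-model variation
base = np.array([0.1, -0.2, 0.0, 1.5, 0.3])
logits = np.tile(base, (9, 1)) + np.linspace(-0.1, 0.1, 9)[:, None]

print(ensemble_predict(logits))
```

Averaging logits (rather than hard votes) lets confident models outweigh uncertain ones, which helps when rare cell types give near-tied predictions.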
[256] Fast and Robust Deformable 3D Gaussian Splatting
Han Jiao, Jiakai Sun, Lei Zhao, Zhanjie Zhang, Wei Xing, Huaizhong Lin
Main category: cs.CV
TL;DR: FRoG is an efficient framework for dynamic scene reconstruction using 3D Gaussian Splatting with per-Gaussian embedding, coarse-to-fine temporal strategy, depth-guided sampling, and opacity modulation to improve rendering speed and quality.
Details
Motivation: Existing deformation field-based methods for dynamic 3D Gaussian Splatting suffer from slow rendering speeds, heavy dependence on initial point clouds, and vulnerability to local optima in dim scenes. The authors aim to overcome these limitations.
Method: FRoG integrates per-Gaussian embedding with coarse-to-fine temporal embedding strategy for early fusion acceleration. It introduces depth- and error-guided sampling for robust initialization, and modulates opacity variations to mitigate local optima in dim scenes.
Result: Comprehensive experiments show FRoG achieves accelerated rendering speeds while maintaining state-of-the-art visual quality for dynamic scene reconstruction.
Conclusion: FRoG presents an efficient and robust framework that addresses key limitations in dynamic 3D Gaussian Splatting, achieving both speed and quality improvements.
Abstract: 3D Gaussian Splatting has demonstrated remarkable real-time rendering capabilities and superior visual quality in novel view synthesis for static scenes. Building upon these advantages, researchers have progressively extended 3D Gaussians to dynamic scene reconstruction. Deformation field-based methods have emerged as a promising approach among various techniques. These methods maintain 3D Gaussian attributes in a canonical field and employ the deformation field to transform this field across temporal sequences. Nevertheless, these approaches frequently encounter challenges such as suboptimal rendering speeds, significant dependence on initial point clouds, and vulnerability to local optima in dim scenes. To overcome these limitations, we present FRoG, an efficient and robust framework for high-quality dynamic scene reconstruction. FRoG integrates per-Gaussian embedding with a coarse-to-fine temporal embedding strategy, accelerating rendering through the early fusion of temporal embeddings. Moreover, to enhance robustness against sparse initializations, we introduce a novel depth- and error-guided sampling strategy. This strategy populates the canonical field with new 3D Gaussians at low-deviation initial positions, significantly reducing the optimization burden on the deformation field and improving detail reconstruction in both static and dynamic regions. Furthermore, by modulating opacity variations, we mitigate the local optima problem in dim scenes, improving color fidelity. Comprehensive experimental results validate that our method achieves accelerated rendering speeds while maintaining state-of-the-art visual quality.
[257] Restoring Neural Network Plasticity for Faster Transfer Learning
Xander Coetzer, Arné Schreuder, Anna Sergeevna Bosman
Main category: cs.CV
TL;DR: Targeted weight re-initialization strategy to restore neural plasticity in transfer learning for image classification tasks, improving performance for both CNNs and ViTs.
Details
Motivation: Pretrained models on ImageNet can lose neural plasticity during transfer learning, especially when downstream datasets are atypical, leading to saturated weights and insignificant gradients that hinder effective adaptation to new tasks.
Method: Proposes a targeted weight re-initialization strategy to restore neural plasticity before fine-tuning. The method selectively re-initializes certain weights in pretrained models to break saturation and enable better gradient flow during transfer learning.
Result: Both convolutional neural networks (CNNs) and vision transformers (ViTs) benefit from the approach, achieving higher test accuracy with faster convergence on several image classification benchmarks. The method adds negligible computational overhead and is compatible with standard transfer learning pipelines.
Conclusion: Targeted weight re-initialization effectively addresses neural plasticity loss in transfer learning, improving adaptation to downstream tasks, especially for atypical datasets, while maintaining computational efficiency.
Abstract: Transfer learning with models pretrained on ImageNet has become a standard practice in computer vision. Transfer learning refers to fine-tuning pretrained weights of a neural network on a downstream task, typically unrelated to ImageNet. However, pretrained weights can become saturated and may yield insignificant gradients, failing to adapt to the downstream task. This hinders the ability of the model to train effectively, and is commonly referred to as loss of neural plasticity. Loss of plasticity may prevent the model from fully adapting to the target domain, especially when the downstream dataset is atypical in nature. While this issue has been widely explored in continual learning, it remains relatively understudied in the context of transfer learning. In this work, we propose the use of a targeted weight re-initialization strategy to restore neural plasticity prior to fine-tuning. Our experiments show that both convolutional neural networks (CNNs) and vision transformers (ViTs) benefit from this approach, yielding higher test accuracy with faster convergence on several image classification benchmarks. Our method introduces negligible computational overhead and is compatible with common transfer learning pipelines.
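A targeted re-initialization can be sketched by re-initializing the fraction of weights whose gradient magnitude is smallest (a common plasticity heuristic; the paper's actual targeting criterion and re-init distribution may differ, and all names below are illustrative).

```python
import numpy as np

rng = np.random.default_rng(3)

def restore_plasticity(W, grad, frac=0.2):
    """Hypothetical targeted re-init: replace the `frac` of weights with
    the smallest gradient magnitude (the most "saturated" ones) with
    fresh He-style draws, leaving the rest of the pretrained weights intact."""
    flat = np.abs(grad).ravel()
    k = max(1, int(frac * flat.size))
    idx = np.argsort(flat)[:k]               # weights receiving no useful gradient
    out = W.copy().ravel()
    out[idx] = rng.normal(scale=np.sqrt(2.0 / out.size), size=k)
    return out.reshape(W.shape)

W = rng.normal(size=(4, 4))                  # pretrained layer weights
grad = np.abs(rng.normal(size=(4, 4)))       # gradient magnitudes from a probe batch
grad[0, 0] = 0.0                             # a dead weight with no gradient
W_new = restore_plasticity(W, grad, frac=0.1)
print(int(np.sum(W_new != W)))               # only the targeted weights changed
```

Only the dead entries are replaced, so the pretrained representation is largely preserved while the saturated weights regain gradient flow before fine-tuning begins.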
[258] TAFG-MAN: Timestep-Adaptive Frequency-Gated Latent Diffusion for Efficient and High-Quality Low-Dose CT Image Denoising
Tangtangfang Fang, Yang Jiao, Xiangjian He, Jingxi Hu, Jiaqi Yang
Main category: cs.CV
TL;DR: TAFG-MAN is a latent diffusion framework for low-dose CT denoising that uses timestep-adaptive frequency-gated conditioning to balance noise suppression with anatomical detail preservation.
Details
Motivation: Low-dose CT reduces radiation exposure but introduces substantial noise and structural degradation, making it difficult to suppress noise without erasing subtle anatomical details that are crucial for diagnosis.
Method: Combines perceptually optimized autoencoder, conditional latent diffusion restoration in compact latent space, and lightweight Timestep-Adaptive Frequency-Gated (TAFG) conditioning. TAFG decomposes condition features into low/high-frequency components, predicts timestep-adaptive gates from current denoising feature and timestep embedding, and progressively releases high-frequency guidance in later denoising stages.
Result: TAFG-MAN achieves favorable quality-efficiency trade-off against baselines. Compared to base variant without TAFG, it improves detail preservation and perceptual quality while maintaining same inference cost. Ablation confirms effectiveness of proposed conditioning mechanism.
Conclusion: The TAFG-MAN framework effectively addresses the LDCT denoising challenge by balancing noise suppression with anatomical detail preservation through timestep-adaptive frequency-gated conditioning in latent diffusion models.
Abstract: Low-dose computed tomography (LDCT) reduces radiation exposure but also introduces substantial noise and structural degradation, making it difficult to suppress noise without erasing subtle anatomical details. In this paper, we present TAFG-MAN, a latent diffusion framework for efficient and high-quality LDCT image denoising. The framework combines a perceptually optimized autoencoder, conditional latent diffusion restoration in a compact latent space, and a lightweight Timestep-Adaptive Frequency-Gated (TAFG) conditioning design. TAFG decomposes condition features into low- and high-frequency components, predicts timestep-adaptive gates from the current denoising feature and timestep embedding, and progressively releases high-frequency guidance in later denoising stages before cross-attention. In this way, the model relies more on stable structural guidance at early reverse steps and introduces fine details more cautiously as denoising proceeds, improving the balance between noise suppression and detail preservation. Experiments show that TAFG-MAN achieves a favorable quality-efficiency trade-off against representative baselines. Compared with its base variant without TAFG, it further improves detail preservation and perceptual quality while maintaining essentially the same inference cost, and ablation results confirm the effectiveness of the proposed conditioning mechanism.
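The progressive release of high-frequency guidance can be sketched numerically. This is a toy illustration that substitutes a fixed sigmoid schedule for the paper's learned gate predictor (which conditions on the denoising feature and timestep embedding); the function names and parameters are hypothetical.

```python
import math

def tafg_gate(t, T, sharpness=10.0, midpoint=0.5):
    """Toy timestep-adaptive gate: near 0 early in the reverse process
    (rely on stable low-frequency structure), near 1 late (release detail).
    t counts reverse denoising steps from T (noisy) down to 0 (clean)."""
    progress = 1.0 - t / T  # 0 at the start of denoising, 1 at the end
    return 1.0 / (1.0 + math.exp(-sharpness * (progress - midpoint)))

def gated_condition(low_freq, high_freq, t, T):
    """Blend condition features: always pass low-frequency guidance,
    scale high-frequency guidance by the timestep-adaptive gate."""
    g = tafg_gate(t, T)
    return [lf + g * hf for lf, hf in zip(low_freq, high_freq)]
```

Early in denoising the gate suppresses fine-detail conditioning; late in denoising it opens, matching the abstract's description of cautious detail injection.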
[259] Scene Graph-guided SegCaptioning Transformer with Fine-grained Alignment for Controllable Video Segmentation and Captioning
Xu Zhang, Jin Yuan, BinHong Yang, Xuan Liu, Qianjun Zhang, Yuyi Wang, Zhiyong Li, Hanwang Zhang
Main category: cs.CV
TL;DR: SG-FSCFormer enables controllable video segmentation and captioning where users can provide prompts (e.g., bounding boxes) to generate precise masks and captions aligned with their intent.
Details
Motivation: Existing video multimodal interpretation methods focus on global comprehension with limited user interaction. The paper addresses the need for more controllable and interactive video understanding systems that can respond to specific user prompts.
Method: Proposes SG-FSCFormer with two key components: 1) Prompt-guided Temporal Graph Former that captures user intent via adaptive prompt adaptor, and 2) Fine-grained Mask-linguistic Decoder that collaboratively predicts caption-mask pairs using Multi-entity Contrastive loss for fine-grained alignment.
Result: Comprehensive experiments on two benchmark datasets show SG-FSCFormer achieves remarkable performance, effectively capturing user intent and generating precise multimodal outputs tailored to user specifications.
Conclusion: The proposed Controllable Video Segmentation and Captioning task and SG-FSCFormer framework successfully enable interactive video understanding with precise user intent alignment, advancing video multimodal interpretation capabilities.
Abstract: Recent advancements in multimodal large models have significantly bridged the representation gap between diverse modalities, catalyzing the evolution of video multimodal interpretation, which enhances users’ understanding of video content by generating correlated modalities. However, most existing video multimodal interpretation methods primarily concentrate on global comprehension with limited user interaction. To address this, we propose a novel task, Controllable Video Segmentation and Captioning (SegCaptioning), which empowers users to provide specific prompts, such as a bounding box around an object of interest, to simultaneously generate correlated masks and captions that precisely embody user intent. We design an innovative framework, the Scene Graph-guided Fine-grained SegCaptioning Transformer (SG-FSCFormer), which integrates a Prompt-guided Temporal Graph Former to effectively capture and represent user intent through an adaptive prompt adaptor, ensuring that the generated content aligns well with the user’s requirements. Furthermore, our model introduces a Fine-grained Mask-linguistic Decoder to collaboratively predict high-quality caption-mask pairs using a Multi-entity Contrastive loss, and to provide fine-grained alignment between each mask and its corresponding caption tokens, thereby enhancing users’ comprehension of videos. Comprehensive experiments conducted on two benchmark datasets demonstrate that SG-FSCFormer achieves remarkable performance, effectively capturing user intent and generating precise multimodal outputs tailored to user specifications. Our code is available at https://github.com/XuZhang1211/SG-FSCFormer.
[260] GraPHFormer: A Multimodal Graph Persistent Homology Transformer for the Analysis of Neuroscience Morphologies
Uzair Shah, Marco Agus, Mahmoud Gamal, Mahmood Alzubaidi, Corrado Cali, Pierre J. Magistretti, Abdesselam Bouzerdoum, Mowafa Househ
Main category: cs.CV
TL;DR: GraPHFormer is a multimodal architecture that unifies topological and graph-based analysis of neuronal morphology using CLIP-style contrastive learning, achieving state-of-the-art performance on multiple benchmarks.
Details
Motivation: Current methods analyze neuronal morphology either through topology or graph structure in isolation, missing complementary information. There's a need for unified multimodal approaches that can leverage both topological and geometric perspectives for better understanding of circuit function, development, and disease.
Method: GraPHFormer uses a multimodal architecture with two branches: 1) Vision branch processes three-channel persistence images (unweighted, persistence-weighted, radius-weighted topological densities) via DINOv2-ViT-S, 2) TreeLSTM encoder captures geometric and radial attributes from skeleton graphs. Both project to a shared embedding space trained with symmetric InfoNCE loss, augmented by persistence-space transformations that preserve topological semantics.
Result: Achieves state-of-the-art performance on five out of six benchmarks (BIL-6, ACT-4, JML-4, N7, M1-Cell, M1-REG) spanning self-supervised and supervised settings, significantly outperforming topology-only, graph-only, and morphometrics baselines. Demonstrates practical utility in discriminating glial morphologies across cortical regions and species, and detecting signatures of developmental and degenerative processes.
Conclusion: GraPHFormer successfully unifies topological and graph-based analysis of neuronal morphology through multimodal contrastive learning, providing a powerful framework for comprehensive morphological analysis with applications in neuroscience research.
Abstract: Neuronal morphology encodes critical information about circuit function, development, and disease, yet current methods analyze topology or graph structure in isolation. We introduce GraPHFormer, a multimodal architecture that unifies these complementary views through CLIP-style contrastive learning. Our vision branch processes a novel three-channel persistence image encoding unweighted, persistence-weighted, and radius-weighted topological densities via DINOv2-ViT-S. In parallel, a TreeLSTM encoder captures geometric and radial attributes from skeleton graphs. Both project to a shared embedding space trained with symmetric InfoNCE loss, augmented by persistence-space transformations that preserve topological semantics. Evaluated on six benchmarks (BIL-6, ACT-4, JML-4, N7, M1-Cell, M1-REG) spanning self-supervised and supervised settings, GraPHFormer achieves state-of-the-art performance on five benchmarks, significantly outperforming topology-only, graph-only, and morphometrics baselines. We demonstrate practical utility by discriminating glial morphologies across cortical regions and species, and detecting signatures of developmental and degenerative processes. Code: https://github.com/Uzshah/GraPHFormer
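The symmetric InfoNCE objective that aligns the two branches can be sketched in plain Python, assuming in-batch negatives as in CLIP-style training. This is an illustration of the loss shape, not the paper's implementation; embeddings are assumed nonzero.

```python
import math

def cosine(u, v):
    """Cosine similarity of two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce(anchors, positives, temperature=0.07):
    """One-directional InfoNCE: for each anchor, the same-index embedding
    in the other modality is the positive; the rest of the batch are
    negatives. Cross-entropy over temperature-scaled similarities."""
    loss = 0.0
    for i, a in enumerate(anchors):
        logits = [cosine(a, p) / temperature for p in positives]
        log_z = math.log(sum(math.exp(l) for l in logits))
        loss += log_z - logits[i]
    return loss / len(anchors)

def symmetric_info_nce(vision_emb, graph_emb, temperature=0.07):
    """Average of vision-to-graph and graph-to-vision InfoNCE, as in CLIP."""
    return 0.5 * (info_nce(vision_emb, graph_emb, temperature)
                  + info_nce(graph_emb, vision_emb, temperature))
```

Matched pairs that are mutually nearest in the shared space drive the loss toward zero; mismatched pairings are penalized, which is what pulls the persistence-image and skeleton-graph views of the same neuron together.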
[261] Consistent but Dangerous: Per-Sample Safety Classification Reveals False Reliability in Medical Vision-Language Models
Binesh Sadanandan, Vahid Behzadan
Main category: cs.CV
TL;DR: Medical VLMs can achieve perfect consistency by ignoring images and relying on text patterns, creating a false reliability trap where dangerous samples appear accurate but aren’t image-reliant.
Details
Motivation: Consistency under paraphrase is increasingly used as a proxy for reliability in medical VLMs, but this proxy is fundamentally flawed because models can achieve perfect consistency by ignoring images and relying solely on text patterns.
Method: Introduces a four-quadrant safety taxonomy evaluating both consistency (stable predictions across paraphrased prompts) and image reliance (predictions that change when image is removed). Evaluates five medical VLM configurations across two chest X-ray datasets (MIMIC-CXR, PadChest) using this framework.
Result: LoRA fine-tuning dramatically reduces flip rates but shifts majority of samples into Dangerous quadrant (consistent but not image-reliant). LLaVA-Rad Base achieves 1.5% flip rate while 98.5% of samples are Dangerous. Dangerous samples exhibit high accuracy (up to 99.6%) and low entropy, making them invisible to standard confidence-based screening.
Conclusion: Consistency alone is insufficient for reliability assessment. Deployment evaluations must pair consistency checks with text-only baselines to expose the false reliability trap where models ignore images while appearing consistent and accurate.
Abstract: Consistency under paraphrase, the property that semantically equivalent prompts yield identical predictions, is increasingly used as a proxy for reliability when deploying medical vision-language models (VLMs). We show this proxy is fundamentally flawed: a model can achieve perfect consistency by relying on text patterns rather than the input image. We introduce a four-quadrant per-sample safety taxonomy that jointly evaluates consistency (stable predictions across paraphrased prompts) and image reliance (predictions that change when the image is removed). Samples are classified as Ideal (consistent and image-reliant), Fragile (inconsistent but image-reliant), Dangerous (consistent but not image-reliant), or Worst (inconsistent and not image-reliant). Evaluating five medical VLM configurations across two chest X-ray datasets (MIMIC-CXR, PadChest), we find that LoRA fine-tuning dramatically reduces flip rates but shifts a majority of samples into the Dangerous quadrant: LLaVA-Rad Base achieves a 1.5% flip rate on PadChest while 98.5% of its samples are Dangerous. Critically, Dangerous samples exhibit high accuracy (up to 99.6%) and low entropy, making them invisible to standard confidence-based screening. We observe a negative correlation between flip rate and Dangerous fraction (r = -0.89, n=10) and recommend that deployment evaluations always pair consistency checks with a text-only baseline: a single additional forward pass that exposes the false reliability trap.
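The four-quadrant taxonomy reduces to a small decision rule. A minimal sketch, assuming image reliance is proxied by whether the text-only prediction differs from the with-image majority answer; the paper's exact per-sample criterion may differ.

```python
def safety_quadrant(preds_paraphrases, pred_no_image):
    """Classify one sample per the four-quadrant safety taxonomy.
    preds_paraphrases: predictions across paraphrased prompts (with image).
    pred_no_image: prediction when the image is removed."""
    consistent = len(set(preds_paraphrases)) == 1
    # Image-reliant if removing the image changes the majority answer.
    majority = max(set(preds_paraphrases), key=preds_paraphrases.count)
    image_reliant = pred_no_image != majority
    if consistent and image_reliant:
        return "Ideal"
    if not consistent and image_reliant:
        return "Fragile"
    if consistent and not image_reliant:
        return "Dangerous"
    return "Worst"
```

The "Dangerous" branch is the paper's false reliability trap: a consistent answer that survives removal of the image, which a single extra text-only forward pass exposes.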
[262] SkinCLIP-VL: Consistency-Aware Vision-Language Learning for Multimodal Skin Cancer Diagnosis
Zhixiang Lu, Shijie Xu, Kaicheng Yan, Xuyue Cai, Chong Zhang, Yulong Li, Angelos Stefanidis, Anh Nguyen, Jionglong Su
Main category: cs.CV
TL;DR: SkinCLIP-VL is a resource-efficient framework for trustworthy skin cancer diagnosis that addresses computational costs, data scarcity, and interpretability issues in vision-language models for dermatology.
Details
Motivation: The paper addresses three key challenges in deploying vision-language models for dermatology: high computational costs, extreme data scarcity in medical domains, and the black-box nature of deep learning models that limits clinical trust.
Method: Proposes SkinCLIP-VL framework with frozen CLIP encoder integrated with lightweight, quantized Qwen2.5-VL via LoRA. Introduces Consistency-aware Focal Alignment (CFA) Loss to align visual regions with clinical semantics under long-tailed distributions, combining focal re-weighting, semantic alignment, and calibration.
Result: On ISIC and Derm7pt benchmarks, SkinCLIP-VL surpasses 13B-parameter baselines by 4.3-6.2% in accuracy with 43% fewer parameters. Blinded expert evaluation and out-of-distribution testing show visually grounded rationales significantly enhance clinical trust compared to traditional saliency maps.
Conclusion: SkinCLIP-VL provides a resource-efficient, interpretable solution for medical vision-language tasks, demonstrating that smaller, well-designed models can outperform larger baselines while improving clinical trust through better interpretability.
Abstract: The deployment of vision-language models (VLMs) in dermatology is hindered by the trilemma of high computational costs, extreme data scarcity, and the black-box nature of deep learning. To address these challenges, we present SkinCLIP-VL, a resource-efficient framework that adapts foundation models for trustworthy skin cancer diagnosis. Adopting a frozen perception, adaptive reasoning paradigm, we integrate a frozen CLIP encoder with a lightweight, quantized Qwen2.5-VL via low-rank adaptation (LoRA). To strictly align visual regions with clinical semantics under long-tailed distributions, we propose the Consistency-aware Focal Alignment (CFA) Loss. This objective synergizes focal re-weighting, semantic alignment, and calibration. On ISIC and Derm7pt benchmarks, SkinCLIP-VL surpasses 13B-parameter baselines by 4.3-6.2% in accuracy with 43% fewer parameters. Crucially, blinded expert evaluation and out-of-distribution testing confirm that our visually grounded rationales significantly enhance clinical trust compared to traditional saliency maps.
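Of the three ingredients in the CFA loss, only focal re-weighting has a standard closed form; a sketch of that component follows (the semantic-alignment and calibration terms are not specified in the abstract and are omitted here).

```python
import math

def focal_weighted_ce(p_true, gamma=2.0):
    """Focal re-weighting of cross-entropy: the (1 - p)^gamma factor
    down-weights easy examples (high p_true), so rare, hard classes in a
    long-tailed distribution dominate the gradient.
    p_true: model probability assigned to the ground-truth class."""
    return -((1.0 - p_true) ** gamma) * math.log(p_true)
```

With gamma = 0 this reduces to plain cross-entropy; larger gamma widens the gap between hard and easy examples, which is the mechanism that helps under long-tailed skin-lesion classes.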
[263] LPNSR: Prior-Enhanced Diffusion Image Super-Resolution via LR-Guided Noise Prediction
Shuwei Huang, Shizhuo Liu, Zijun Wei
Main category: cs.CV
TL;DR: LPNSR is a prior-enhanced efficient diffusion framework for image super-resolution that addresses the trade-off between inference efficiency and reconstruction quality by optimizing intermediate noise and initialization in residual-shifting diffusion.
Details
Motivation: Current diffusion-based SR methods face a fundamental trade-off between inference efficiency and reconstruction quality. Residual-shifting diffusion achieves efficient 4-step inference but suffers from performance degradation due to suboptimal random Gaussian noise in intermediate steps and initialization bias from naive bicubic upsampling.
Method: 1) Derives closed-form analytical solution for optimal intermediate noise in residual-shifting diffusion paradigm; 2) Designs LR-guided multi-input-aware noise predictor to replace random Gaussian noise, embedding LR structural priors while preserving efficient residual-shifting mechanism; 3) Uses high-quality pre-upsampling network to mitigate initial bias; 4) Optimizes end-to-end with compact 4-step trajectory.
Result: Extensive experiments show LPNSR achieves state-of-the-art perceptual performance on both synthetic and real-world datasets without relying on large-scale text-to-image priors.
Conclusion: LPNSR successfully addresses the efficiency-quality trade-off in diffusion-based SR by optimizing intermediate noise and initialization, achieving efficient high-quality super-resolution without external priors.
Abstract: Diffusion-based image super-resolution (SR), which aims to reconstruct high-resolution (HR) images from corresponding low-resolution (LR) observations, faces a fundamental trade-off between inference efficiency and reconstruction quality. The state-of-the-art residual-shifting diffusion framework achieves efficient 4-step inference, yet suffers from severe performance degradation in compact sampling trajectories. This is mainly attributed to two core limitations: the inherent suboptimality of unconstrained random Gaussian noise in intermediate steps, which leads to error accumulation and insufficient LR prior guidance, and the initialization bias caused by naive bicubic upsampling. In this paper, we propose LPNSR, a prior-enhanced efficient diffusion framework to address these issues. We first mathematically derive the closed-form analytical solution of the optimal intermediate noise for the residual-shifting diffusion paradigm, and accordingly design an LR-guided multi-input-aware noise predictor to replace random Gaussian noise, embedding LR structural priors into the reverse process while fully preserving the framework’s core efficient residual-shifting mechanism. We further mitigate initial bias with a high-quality pre-upsampling network to optimize the diffusion starting point. With a compact 4-step trajectory, LPNSR can be optimized in an end-to-end manner. Extensive experiments demonstrate that LPNSR achieves state-of-the-art perceptual performance on both synthetic and real-world datasets, without relying on any large-scale text-to-image priors. The source code of our method can be found at https://github.com/Faze-Hsw/LPNSR.
[264] SpatialFly: Geometry-Guided Representation Alignment for UAV Vision-and-Language Navigation in Urban Environments
Wen Jiang, Kangyao Huang, Li Wang, Wang Xu, Wei Fan, Jinyuan Liu, Shaoyu Liu, Hanfang Liang, Hongwei Duan, Bin Xu, Xiangyang Ji
Main category: cs.CV
TL;DR: SpatialFly: A geometry-guided spatial representation framework for UAV visual-language navigation that addresses the 2D-3D representation mismatch without explicit 3D reconstruction.
Details
Motivation: UAV visual-language navigation in complex 3D environments is challenging due to structural representation mismatch between 2D visual perception and 3D trajectory decision space, limiting spatial reasoning capabilities.
Method: Proposes a geometry-guided spatial representation framework with two key modules: 1) geometric prior injection module that injects global structural cues into 2D semantic tokens, and 2) geometry-aware reparameterization module that aligns 2D semantic tokens with 3D geometric tokens through cross-modal attention with gated residual fusion.
Result: Consistently outperforms state-of-the-art UAV VLN baselines across both seen and unseen environments, reducing navigation error by 4.03m and improving success rate by 1.27% over strongest baseline on unseen Full split. Produces trajectories with better path alignment and smoother, more stable motion.
Conclusion: SpatialFly effectively bridges the 2D-3D representation gap for UAV visual-language navigation through geometry-guided spatial representation, enabling better spatial reasoning without explicit 3D reconstruction.
Abstract: UAVs play an important role in applications such as autonomous exploration, disaster response, and infrastructure inspection. However, UAV vision-and-language navigation (VLN) in complex 3D environments remains challenging. A key difficulty is the structural representation mismatch between 2D visual perception and the 3D trajectory decision space, which limits spatial reasoning. To this end, we propose SpatialFly, a geometry-guided spatial representation framework for UAV VLN. Operating on RGB observations without explicit 3D reconstruction, SpatialFly introduces a geometry-guided 2D representation alignment mechanism. Specifically, the geometric prior injection module injects global structural cues into 2D semantic tokens to provide scene-level geometric guidance. The geometry-aware reparameterization module then aligns 2D semantic tokens with 3D geometric tokens through cross-modal attention, followed by gated residual fusion to preserve semantic discrimination. Experimental results show that SpatialFly consistently outperforms state-of-the-art UAV VLN baselines across both seen and unseen environments, reducing navigation error (NE) by 4.03m and improving success rate (SR) by 1.27% over the strongest baseline on the unseen Full split. Additional trajectory-level analysis shows that SpatialFly produces trajectories with better path alignment and smoother, more stable motion.
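The gated residual fusion step admits a simple sketch. The per-dimension scalar gates here are an assumption (the paper does not specify the gate's granularity), and the names are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_residual_fusion(semantic, attended, gate_logits):
    """Gated residual fusion sketch: keep the 2D semantic token as a
    residual and add the cross-attended geometric token scaled by a
    learned gate, so geometric guidance cannot wash out the semantic
    discrimination of the original token."""
    return [s + sigmoid(g) * a
            for s, a, g in zip(semantic, attended, gate_logits)]
```

With the gate driven toward zero the semantic token passes through unchanged; driven toward one, the full geometric update is added, letting the network interpolate per feature.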
[265] When Minor Edits Matter: LLM-Driven Prompt Attack for Medical VLM Robustness in Ultrasound
Yasamin Medghalchi, Milad Yazdani, Amirhossein Dabiriaghdam, Moein Heidari, Mojan Izadkhah, Zahra Kavian, Giuseppe Carenini, Lele Wang, Dena Shahriari, Ilker Hacihaliloglu
Main category: cs.CV
TL;DR: A framework for evaluating adversarial robustness of medical vision-language models in ultrasound using LLM-generated clinically plausible prompt variations
Details
Motivation: Ultrasound AI analysis is operator-dependent and needs robust assistance. While vision-language models show promise in medical imaging, they have trustworthiness concerns, especially adversarial robustness since they operate via natural-language instructions that are vulnerable to small variations in prompts.
Method: Proposes a scalable adversarial evaluation framework using a large language model to generate clinically plausible adversarial prompt variants via “humanized” rewrites and minimal edits mimicking routine clinical communication. Evaluates state-of-the-art medical VLMs on ultrasound multiple-choice question answering benchmarks.
Result: Systematically assesses vulnerability of SOTA Med-VLMs to these attacks, examines how attacker LLM capacity influences attack success, analyzes relationship between attack success and model confidence, and identifies consistent failure patterns across models.
Conclusion: Results highlight realistic robustness gaps that must be addressed for safe clinical translation of medical vision-language models in ultrasound applications.
Abstract: Ultrasound is widely used in clinical practice due to its portability, cost-effectiveness, safety, and real-time imaging capabilities. However, image acquisition and interpretation remain highly operator dependent, motivating the development of robust AI-assisted analysis methods. Vision-language models (VLMs) have recently demonstrated strong multimodal reasoning capabilities and competitive performance in medical image analysis, including ultrasound. However, emerging evidence highlights significant concerns about their trustworthiness. In particular, adversarial robustness is critical because Med-VLMs operate via natural-language instructions, rendering prompt formulation a realistic and practically exploitable point of vulnerability. Small variations (typos, shorthand, underspecified requests, or ambiguous wording) can meaningfully shift model outputs. We propose a scalable adversarial evaluation framework that leverages a large language model (LLM) to generate clinically plausible adversarial prompt variants via “humanized” rewrites and minimal edits that mimic routine clinical communication. Using ultrasound multiple-choice question answering benchmarks, we systematically assess the vulnerability of SOTA Med-VLMs to these attacks, examine how attacker LLM capacity influences attack success, analyze the relationship between attack success and model confidence, and identify consistent failure patterns across models. Our results highlight realistic robustness gaps that must be addressed for safe clinical translation. Code will be released publicly following the review process.
[266] A Two-stage Transformer Framework for Temporal Localization of Distracted Driver Behaviors
Gia-Bao Doan, Nam-Khoa Huynh, Minh-Nhat-Huy Ho, Khanh-Thanh-Khoa Nguyen, Thanh-Hai Le
Main category: cs.CV
TL;DR: A temporal action localization framework for driver monitoring that combines VideoMAE-based feature extraction with Augmented Self-Mask Attention detector and Spatial Pyramid Pooling-Fast module to balance accuracy and computational efficiency.
Details
Motivation: Current temporal action localization techniques struggle to balance accuracy with computational efficiency for identifying hazardous driving behaviors from in-cabin video streams, which is essential for road safety and traffic violation detection.
Method: Two-stage pipeline: 1) VideoMAE-based feature extraction using ViT-Giant or ViT-based backbones, 2) Augmented Self-Mask Attention detector enhanced with Spatial Pyramid Pooling-Fast module to capture multi-scale temporal features.
Result: ViT-Giant achieves 88.09% Top-1 accuracy but with high computational cost (1584.06 GFLOPs/segment), while ViT-based variant achieves 82.55% accuracy with much lower cost (101.85 GFLOPs/segment). SPPF integration consistently improves performance, with ViT-Giant + SPPF achieving 92.67% mAP.
Conclusion: The framework demonstrates a clear trade-off between model capacity and efficiency, offering both high-performance (ViT-Giant + SPPF) and lightweight (ViT-based) options suitable for different driver monitoring scenarios like safety checkpoints or fleet management.
Abstract: The identification of hazardous driving behaviors from in-cabin video streams is essential for enhancing road safety and supporting the detection of traffic violations and unsafe driver actions. However, current temporal action localization techniques often struggle to balance accuracy with computational efficiency. In this work, we develop and evaluate a temporal action localization framework tailored for driver monitoring scenarios, particularly suitable for periodic inspection settings such as transportation safety checkpoints or fleet management assessment systems. Our approach follows a two-stage pipeline that combines VideoMAE-based feature extraction with an Augmented Self-Mask Attention (AMA) detector, enhanced by a Spatial Pyramid Pooling-Fast (SPPF) module to capture multi-scale temporal features. Experimental results reveal a distinct trade-off between model capacity and efficiency. At the feature extraction stage, the ViT-Giant backbone delivers richer representations, achieving 88.09% Top-1 test accuracy, while the ViT-based variant proves to be a practical alternative, achieving 82.55% accuracy with significantly lower computational fine-tuning costs (101.85 GFLOPs/segment compared to 1584.06 GFLOPs/segment for Giant). In the downstream localization task, the integration of SPPF consistently improves performance across all configurations. Notably, the ViT-Giant + SPPF model achieves a peak mAP of 92.67%, while the lightweight ViT-based configuration maintains robust results.
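The SPPF design, which stacks three identical max-pools so their effective receptive fields grow to k, 2k-1, and 3k-2 at the cost of one pooling pass each, can be sketched over a 1-D sequence of scalar temporal features. This toy omits what the real module does around the pooling (it operates on channel maps and fuses the concatenation through a convolution).

```python
def max_pool1d(seq, k):
    """'Same'-padded 1-D max pooling with stride 1 (odd kernel k)."""
    pad = k // 2
    padded = [float("-inf")] * pad + list(seq) + [float("-inf")] * pad
    return [max(padded[i:i + k]) for i in range(len(seq))]

def sppf_1d(features, k=5):
    """SPPF sketch: three successive k-sized max-pools, then concatenate
    the input with all three pooled outputs channel-wise, giving
    multi-scale temporal context at each position."""
    p1 = max_pool1d(features, k)
    p2 = max_pool1d(p1, k)
    p3 = max_pool1d(p2, k)
    return [list(t) for t in zip(features, p1, p2, p3)]
```

Reusing one small kernel three times is cheaper than running three large-kernel pools in parallel while covering the same range of temporal scales.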
[267] SGAD-SLAM: Splatting Gaussians at Adjusted Depth for Better Radiance Fields in RGBD SLAM
Pengchong Hu, Zhizhong Han
Main category: cs.CV
TL;DR: SGAD-SLAM improves 3D Gaussian Splatting for RGBD SLAM by using pixel-aligned Gaussians with adjustable positions along rays and Gaussian depth distributions for faster tracking, achieving better rendering quality and efficiency.
Details
Motivation: Current 3DGS methods for RGBD SLAM use either too flexible 3D Gaussians (slow convergence) or too limited view-tied 3D Gaussians (poor rendering quality). There's a need for a representation that balances flexibility and constraints to improve both convergence speed and rendering quality.
Method: 1) Use pixel-aligned Gaussians that can adjust positions along their rays to maximize rendering quality while maintaining scalability. 2) Model depth distribution around each pixel as a Gaussian distribution to enable fast frame-to-scene alignment for tracking. 3) Optimize the system for runtime and storage efficiency.
Result: The method shows advantages over state-of-the-art methods in view rendering quality, camera tracking accuracy, runtime efficiency, and storage complexity on widely used benchmarks.
Conclusion: SGAD-SLAM successfully addresses the limitations of existing 3DGS approaches by introducing adjustable pixel-aligned Gaussians and Gaussian depth distributions, achieving better performance across multiple metrics in RGBD SLAM applications.
Abstract: 3D Gaussian Splatting (3DGS) has made remarkable progress in RGBD SLAM. Current methods usually use 3D Gaussians or view-tied 3D Gaussians to represent radiance fields in tracking and mapping. However, these Gaussians are either too flexible or too limited in movements, resulting in slow convergence or limited rendering quality. To resolve this issue, we adopt pixel-aligned Gaussians but allow each Gaussian to adjust its position along its ray to maximize the rendering quality, even if Gaussians are simplified to improve system scalability. To speed up the tracking, we model the depth distribution around each pixel as a Gaussian distribution, and then use these distributions to align each frame to the 3D scene quickly. We report our evaluations on widely used benchmarks, justify our designs, and show advantages over the latest methods in view rendering, camera tracking, runtime, and storage complexity. Please see our project page for code and videos at https://machineperceptionlab.github.io/SGAD-SLAM-Project .
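The frame-to-scene alignment based on per-pixel Gaussian depth distributions can be sketched as a log-likelihood score over a candidate pose. A minimal illustration assuming independence across pixels; the function names are hypothetical, and the paper's actual alignment objective may differ.

```python
import math

def gaussian_loglik(depth, mu, sigma):
    """Log-likelihood of an observed depth under a pixel's Gaussian
    depth distribution N(mu, sigma^2)."""
    return (-0.5 * ((depth - mu) / sigma) ** 2
            - math.log(sigma) - 0.5 * math.log(2.0 * math.pi))

def frame_alignment_score(observed_depths, scene_gaussians):
    """Score a candidate camera pose by how well the frame's observed
    depths fall under the scene's per-pixel (mu, sigma) depth Gaussians;
    tracking would pick the pose maximizing this score."""
    return sum(gaussian_loglik(d, mu, s)
               for d, (mu, s) in zip(observed_depths, scene_gaussians))
```

Because each pixel's score is a cheap closed-form evaluation rather than a rendering pass, comparing candidate poses this way is fast, which is the point of the depth-distribution model for tracking.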
[268] Single-Eye View: Monocular Real-time Perception Package for Autonomous Driving
Haixi Zhang, Aiyinsi Zuo, Zirui Li, Chunshu Wu, Tong Geng, Zhiyao Duan
Main category: cs.CV
TL;DR: LRHPerception is a real-time monocular perception system for autonomous driving that combines end-to-end learning efficiency with detailed local mapping, achieving 29 FPS processing with integrated object tracking, road segmentation, and depth estimation.
Details
Motivation: Current camera-based autonomous driving systems prioritize effectiveness over computational efficiency. The authors aim to address this gap by creating a real-time perception system that balances both performance and speed.
Method: The system uses single-view camera video to interpret the environment, combining end-to-end learning efficiency with local mapping detail. It processes monocular images into a five-channel tensor (RGB + road segmentation + depth estimation) augmented with object detection and trajectory prediction in a unified framework.
Result: Achieves real-time processing at 29 FPS on a single GPU, representing a 555% speedup over the fastest mapping-based approach, with strong performance in object tracking, prediction, road segmentation, and depth estimation.
Conclusion: LRHPerception successfully demonstrates that real-time monocular perception for autonomous driving can be achieved without sacrificing performance, offering a computationally efficient alternative to existing approaches.
Abstract: Amidst the rapid advancement of camera-based autonomous driving technology, effectiveness is often prioritized with limited attention to computational efficiency. To address this issue, this paper introduces LRHPerception, a real-time monocular perception package for autonomous driving that uses single-view camera video to interpret the surrounding environment. The proposed system combines the computational efficiency of end-to-end learning with the rich representational detail of local mapping methodologies. With significant improvements in object tracking and prediction, road segmentation, and depth estimation integrated into a unified framework, LRHPerception processes monocular image data into a five-channel tensor consisting of RGB, road segmentation, and pixel-level depth estimation, augmented with object detection and trajectory prediction. Experimental results demonstrate strong performance, achieving real-time processing at 29 FPS on a single GPU, representing a 555% speedup over the fastest mapping-based approach.
[269] Two Experts Are Better Than One Generalist: Decoupling Geometry and Appearance for Feed-Forward 3D Gaussian Splatting
Hwasik Jeong, Seungryong Lee, Gyeongjin Kang, Seungkwon Yang, Xiangyu Sun, Seungtae Nam, Eunbyung Park
Main category: cs.CV
TL;DR: 2Xplat is a pose-free feed-forward 3D Gaussian Splatting framework that separates geometry estimation from appearance modeling using a two-expert design, outperforming unified approaches.
Details
Motivation: Current pose-free 3DGS methods use unified monolithic architectures that entangle geometric reasoning and appearance modeling, which may be suboptimal for high-fidelity 3D generation. The authors propose that separating these tasks could be more effective.
Method: A two-expert design: 1) Geometry expert predicts camera poses from uncalibrated multi-view images, 2) Appearance expert uses these poses to synthesize 3D Gaussian representations. The modular approach explicitly separates geometry estimation from Gaussian generation.
Result: The framework achieves performance on par with state-of-the-art posed methods and substantially outperforms prior pose-free feed-forward 3DGS approaches, requiring fewer than 5K training iterations.
Conclusion: The modular two-expert design proves highly effective, challenging the prevailing unified paradigm and suggesting advantages of separation for complex 3D geometric estimation and appearance synthesis tasks.
Abstract: Pose-free feed-forward 3D Gaussian Splatting (3DGS) has opened a new frontier for rapid 3D modeling, enabling high-quality Gaussian representations to be generated from uncalibrated multi-view images in a single forward pass. The dominant approach in this space adopts unified monolithic architectures, often built on geometry-centric 3D foundation models, to jointly estimate camera poses and synthesize 3DGS representations within a single network. While architecturally streamlined, such “all-in-one” designs may be suboptimal for high-fidelity 3DGS generation, as they entangle geometric reasoning and appearance modeling within a shared representation. In this work, we introduce 2Xplat, a pose-free feed-forward 3DGS framework based on a two-expert design that explicitly separates geometry estimation from Gaussian generation. A dedicated geometry expert first predicts camera poses, which are then explicitly passed to a powerful appearance expert that synthesizes 3D Gaussians. Despite its conceptual simplicity, being largely underexplored in prior works, the proposed approach proves highly effective. In fewer than 5K training iterations, the proposed two-experts pipeline substantially outperforms prior pose-free feed-forward 3DGS approaches and achieves performance on par with state-of-the-art posed methods. These results challenge the prevailing unified paradigm and suggest the potential advantages of modular design principles for complex 3D geometric estimation and appearance synthesis tasks.
[270] NoOVD: Novel Category Discovery and Embedding for Open-Vocabulary Object Detection
Yupeng Zhang, Ruize Han, Zhiwei Chen, Wei Feng, Liang Wan
Main category: cs.CV
TL;DR: NoOVD: A novel training framework for open-vocabulary object detection that addresses the training-testing gap by using self-distillation from frozen vision-language models and adjusting proposal scores during inference.
Details
Motivation: Current open-vocabulary object detection methods suffer from a significant gap between training and testing phases. During training, novel-category objects are often misclassified as background by RPN and RoI heads, causing proposals to be filtered out. This leads to poor recall and weakened novel-category detection performance during testing.
Method: Proposes NoOVD framework with two key components: 1) K-FPN leverages pretrained knowledge from frozen vision-language models to guide novel-category object discovery and enable self-distillation without additional data, preventing forced alignment of novel objects with background. 2) R-RPN adjusts confidence scores of proposals during inference to improve recall of novel-category objects.
Result: Cross-dataset evaluations on OV-LVIS, OV-COCO, and Objects365 demonstrate consistent superior performance across multiple metrics compared to existing approaches.
Conclusion: The proposed NoOVD framework effectively bridges the training-testing gap in open-vocabulary object detection by leveraging frozen VLMs for knowledge distillation and adjusting inference mechanisms, leading to improved novel-category detection performance.
Abstract: Despite the remarkable progress in open-vocabulary object detection (OVD), a significant gap remains between the training and testing phases. During training, the RPN and RoI heads often misclassify unlabeled novel-category objects as background, causing some proposals to be prematurely filtered out by the RPN while others are further misclassified by the RoI head. During testing, these proposals again receive low scores and are removed in post-processing, leading to a significant drop in recall and ultimately weakening novel-category detection performance. To address these issues, we propose a novel training framework, NoOVD, which innovatively integrates a self-distillation mechanism grounded in the knowledge of frozen vision-language models (VLMs). Specifically, we design K-FPN, which leverages the pretrained knowledge of VLMs to guide the model in discovering novel-category objects and facilitates knowledge distillation, without requiring additional data, thus preventing forced alignment of novel objects with background. Additionally, we introduce R-RPN, which adjusts the confidence scores of proposals during inference to improve the recall of novel-category objects. Cross-dataset evaluations on OV-LVIS, OV-COCO, and Objects365 demonstrate that our approach consistently achieves superior performance across multiple metrics.
[271] CTFS: Collaborative Teacher Framework for Forward-Looking Sonar Image Semantic Segmentation with Extremely Limited Labels
Ping Guo, Chengzhou Li, Guanchen Meng, Qi Jia, Jinyuan Liu, Zhu Liu, Yu Liu, Zhongxuan Luo, Xin Fan
Main category: cs.CV
TL;DR: Proposes a collaborative teacher-student framework for sonar image semantic segmentation that uses multiple teachers (one general + multiple sonar-specific) with cross-teacher reliability assessment to handle noisy pseudo-labels in limited data scenarios.
Details
Motivation: Forward-looking sonar images suffer from severe speckle noise, low texture contrast, acoustic shadows, and geometric distortions, making traditional teacher-student frameworks ineffective for semantic segmentation with limited labeled data.
Method: Introduces a multi-teacher collaborative mechanism with one general teacher and multiple sonar-specific teachers, using multi-teacher alternating guidance and a cross-teacher reliability assessment mechanism to quantify pseudo-label reliability based on consistency across multiple views and teachers.
Result: Achieves 5.08% improvement in mIoU compared to state-of-the-art approaches on the FLSMD dataset when only 2% of data is labeled.
Conclusion: The proposed collaborative teacher framework effectively addresses sonar image challenges by combining general semantic learning with sonar-specific characteristics and mitigating noisy pseudo-labels through reliability assessment.
Abstract: As one of the most important underwater sensing technologies, forward-looking sonar exhibits unique imaging characteristics. Sonar images are often affected by severe speckle noise, low texture contrast, acoustic shadows, and geometric distortions. These factors make it difficult for traditional teacher-student frameworks to achieve satisfactory performance in sonar semantic segmentation tasks under extremely limited labeled data conditions. To address this issue, we propose a Collaborative Teacher Semantic Segmentation Framework for forward-looking sonar images. This framework introduces a multi-teacher collaborative mechanism composed of one general teacher and multiple sonar-specific teachers. By adopting a multi-teacher alternating guidance strategy, the student model can learn general semantic representations while simultaneously capturing the unique characteristics of sonar images, thereby achieving more comprehensive and robust feature modeling. Considering the challenges of sonar images, which can lead teachers to generate a large number of noisy pseudo-labels, we further design a cross-teacher reliability assessment mechanism. This mechanism dynamically quantifies the reliability of pseudo-labels by evaluating the consistency and stability of predictions across multiple views and multiple teachers, thereby mitigating the negative impact caused by noisy pseudo-labels. Notably, on the FLSMD dataset, when only 2% of the data is labeled, our method achieves a 5.08% improvement in mIoU compared to other state-of-the-art approaches.
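One natural way to quantify the cross-teacher consistency described above is per-pixel majority agreement among teacher predictions. The paper's mechanism also incorporates multi-view stability; this sketch covers only the cross-teacher term, with names and the agreement measure as assumptions:

```python
import numpy as np

def teacher_agreement(pred_maps):
    """Pixel-wise fraction of teachers agreeing with the majority label.

    pred_maps: (T, H, W) integer label maps from T teachers.
    Returns an (H, W) reliability map in [1/T, 1]; low values flag
    pixels whose pseudo-labels are likely noisy.
    """
    T = pred_maps.shape[0]
    agree = np.zeros(pred_maps.shape[1:])
    for label in np.unique(pred_maps):
        votes = (pred_maps == label).sum(axis=0)  # per-pixel vote count
        agree = np.maximum(agree, votes)          # keep the majority count
    return agree / T

preds = np.array([
    [[0, 1], [1, 1]],
    [[0, 1], [0, 1]],
    [[0, 2], [1, 1]],
])
rel = teacher_agreement(preds)
```

A reliability map like this could then down-weight low-agreement pixels in the student's pseudo-label loss.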
[272] CoVFT: Context-aware Visual Fine-tuning for Multimodal Large Language Models
Nan Zhou, Huiqun Wang, Yaoyan Zheng, Di Huang
Main category: cs.CV
TL;DR: CoVFT framework addresses visual preference conflicts in MLLMs by incorporating multimodal context into visual fine-tuning, achieving SOTA performance with improved stability.
Details
Motivation: There's inconsistency in whether to fine-tune or freeze vision encoders in MLLMs, with existing methods failing to consistently outperform frozen baselines due to visual preference conflicts from context-agnostic vision encoders.
Method: Proposes Context-aware Visual Fine-tuning (CoVFT) with Context Vector Extraction (CVE) and Contextual Mixture-of-Experts (CoMoE) modules to incorporate multimodal context into visual adaptation, decomposing conflicting optimization signals.
Result: CoVFT achieves state-of-the-art performance on 12 multimodal benchmarks with superior stability. Fine-tuning a 7B MLLM with CoVFT surpasses average performance of 13B counterparts, revealing untapped potential in visual encoder optimization.
Conclusion: Explicitly incorporating multimodal context into visual fine-tuning resolves visual preference conflicts and enables stable, context-sensitive visual updates in MLLMs, unlocking substantial performance gains.
Abstract: Multimodal large language models (MLLMs) achieve remarkable progress in cross-modal perception and reasoning, yet a fundamental question remains unresolved: should the vision encoder be fine-tuned or frozen? Despite the success of models such as LLaVA and Qwen-VL, inconsistent design choices and heterogeneous training setups hinder a unified understanding of visual fine-tuning (VFT) in MLLMs. Through a configuration-aligned benchmark, we find that existing VFT methods fail to consistently outperform the frozen baseline across multimodal tasks. Our analysis suggests that this instability arises from visual preference conflicts, where the context-agnostic nature of vision encoders induces divergent parameter updates under diverse multimodal context. To address this issue, we propose the Context-aware Visual Fine-tuning (CoVFT) framework, which explicitly incorporates multimodal context into visual adaptation. By integrating a Context Vector Extraction (CVE) and a Contextual Mixture-of-Experts (CoMoE) module, CoVFT decomposes conflicting optimization signals and enables stable, context-sensitive visual updates. Extensive experiments on 12 multimodal benchmarks demonstrate that CoVFT achieves state-of-the-art performance with superior stability. Notably, fine-tuning a 7B MLLM with CoVFT surpasses the average performance of its 13B counterpart, revealing substantial untapped potential in visual encoder optimization within MLLMs.
[273] Hierarchical Text-Guided Brain Tumor Segmentation via Sub-Region-Aware Prompts
Bahram Mohammadi, Ta Duc Huy, Afrouz Sheikholeslami, Qi Chen, Vu Minh Hieu Phan, Sam White, Minh-Son To, Xuyun Zhang, Amin Beheshti, Luping Zhou, Yuankai Qi
Main category: cs.CV
TL;DR: TextCSP: A hierarchical text-guided framework for brain tumor segmentation that uses text-modulated soft cascade decoding with sub-region-aware prompt tuning and text-semantic channel modulators to improve segmentation of ambiguous tumor sub-regions.
Details
Motivation: Brain tumor segmentation is challenging due to ambiguous visual boundaries between sub-regions (whole tumor, tumor core, enhancing tumor). Existing multimodal approaches compress radiological reports into single global embeddings, overlooking distinct clinical characteristics of each sub-region.
Method: Three novel components: (1) text-modulated soft cascade decoder predicting WT→TC→ET in coarse-to-fine anatomical hierarchy, (2) sub-region-aware prompt tuning with learnable soft prompts and LoRA-adapted BioBERT encoder for specialized text representations per sub-region, (3) text-semantic channel modulators converting representations into channel-wise refinement signals to emphasize clinically described patterns.
Result: Experiments on TextBraTS dataset show consistent improvements across all sub-regions against state-of-the-art methods by 1.7% on Dice and 6% on HD95 metrics.
Conclusion: TextCSP effectively integrates radiological text descriptions with imaging through hierarchical text guidance, addressing limitations of global text embeddings and improving brain tumor segmentation performance.
Abstract: Brain tumor segmentation remains challenging because the three standard sub-regions, i.e., whole tumor (WT), tumor core (TC), and enhancing tumor (ET), often exhibit ambiguous visual boundaries. Integrating radiological description texts with imaging has shown promise. However, most multimodal approaches typically compress a report into a single global text embedding shared across all sub-regions, overlooking their distinct clinical characteristics. We propose TextCSP (text-modulated soft cascade architecture), a hierarchical text-guided framework that builds on the TextBraTS baseline with three novel components: (1) a text-modulated soft cascade decoder that predicts WT->TC->ET in a coarse-to-fine manner consistent with their anatomical containment hierarchy; (2) sub-region-aware prompt tuning, which uses learnable soft prompts with a LoRA-adapted BioBERT encoder to generate specialized text representations tailored for each sub-region; and (3) text-semantic channel modulators that convert the aforementioned representations into channel-wise refinement signals, enabling the decoder to emphasize features aligned with clinically described patterns. Experiments on the TextBraTS dataset demonstrate consistent improvements across all sub-regions, outperforming state-of-the-art methods by 1.7% in Dice and 6% in HD95.
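The WT→TC→ET containment hierarchy lends itself to a soft cascade in which each finer sub-region's probability is gated by its parent's. This is a hypothetical sketch of that gating, not the paper's decoder; the multiplicative form is an assumption:

```python
import numpy as np

def soft_cascade(p_wt, p_tc_raw, p_et_raw):
    """Gate each finer sub-region by its parent so that, per pixel,
    p_et <= p_tc <= p_wt, mirroring the anatomical containment
    TC within WT and ET within TC.

    All inputs are per-pixel probabilities in [0, 1] with the same shape.
    """
    p_tc = p_tc_raw * p_wt   # tumor core cannot exceed whole tumor
    p_et = p_et_raw * p_tc   # enhancing tumor cannot exceed core
    return p_wt, p_tc, p_et

wt, tc, et = soft_cascade(np.array([0.9]), np.array([0.8]), np.array([1.0]))
```

Even with a maximal raw ET prediction, the cascade caps it at the gated TC probability.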
[274] Taming Sampling Perturbations with Variance Expansion Loss for Latent Diffusion Models
Qifan Li, Xingyu Zhou, Jinhua Zhang, Weiyi You, Shuhang Gu
Main category: cs.CV
TL;DR: Latent diffusion models suffer from sensitivity to sampling perturbations due to overly compact latent spaces from β-VAE tokenizers; proposed Variance Expansion loss improves robustness while maintaining reconstruction fidelity.
Details
Motivation: While latent diffusion models excel at image generation, they're sensitive to sampling perturbations due to overly compact latent spaces from β-VAE tokenizers, causing visual degradation despite good reconstruction accuracy.
Method: Introduces Variance Expansion loss to counteract variance collapse in latent spaces, using adversarial interplay between reconstruction and variance expansion to balance fidelity and robustness to stochastic sampling.
Result: Extensive experiments show the approach consistently enhances generation quality across different latent diffusion architectures, confirming robustness in latent space is crucial for stable diffusion sampling.
Conclusion: Robustness to sampling perturbations is a key missing ingredient in latent diffusion models; the proposed method improves generation quality by addressing variance collapse while maintaining reconstruction fidelity.
Abstract: Latent diffusion models have emerged as the dominant framework for high-fidelity and efficient image generation, owing to their ability to learn diffusion processes in compact latent spaces. However, while previous research has focused primarily on reconstruction accuracy and semantic alignment of the latent space, we observe that another critical factor, robustness to sampling perturbations, also plays a crucial role in determining generation quality. Through empirical and theoretical analyses, we show that the commonly used $\beta$-VAE-based tokenizers in latent diffusion models tend to produce overly compact latent manifolds that are highly sensitive to stochastic perturbations during diffusion sampling, leading to visual degradation. To address this issue, we propose a simple yet effective solution that constructs a latent space robust to sampling perturbations while maintaining strong reconstruction fidelity. This is achieved by introducing a Variance Expansion loss that counteracts variance collapse and leverages the adversarial interplay between reconstruction and variance expansion to achieve an adaptive balance that preserves reconstruction accuracy while improving robustness to stochastic sampling. Extensive experiments demonstrate that our approach consistently enhances generation quality across different latent diffusion architectures, confirming that robustness in latent space is a key missing ingredient for stable and faithful diffusion sampling.
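The variance-expansion idea can be sketched as a hinge penalty pulling per-dimension latent variance up toward a target, opposing the compacting pressure of the reconstruction term. The loss form, target value, and function name here are assumptions for illustration, not the paper's exact formulation:

```python
import numpy as np

def variance_expansion_loss(latents, target_var=1.0):
    """Penalize per-dimension latent variance falling below a target.

    latents: (N, D) batch of latent codes.
    Returns a scalar; zero once every dimension reaches target_var.
    """
    var = latents.var(axis=0)  # (D,) variance per latent dimension
    return np.maximum(target_var - var, 0.0).mean()

# A collapsed latent batch is penalized; a well-spread one is not.
collapsed = np.zeros((8, 4))                              # zero variance
spread = np.array([[-1.0], [1.0]] * 4).repeat(4, axis=1)  # unit variance
```

In training this term would be weighted against the VAE reconstruction loss, producing the adversarial interplay the abstract describes.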
[275] DGRNet: Disagreement-Guided Refinement for Uncertainty-Aware Brain Tumor Segmentation
Bahram Mohammadi, Yanqiu Wu, Vu Minh Hieu Phan, Sam White, Minh-Son To, Jian Yang, Michael Sheng, Yang Song, Yuankai Qi
Main category: cs.CV
TL;DR: DGRNet: A brain tumor segmentation framework that uses multi-view disagreement for uncertainty estimation and text-conditioned refinement of ambiguous regions guided by radiology reports.
Details
Motivation: Address two limitations in brain tumor segmentation: lack of reliable uncertainty quantification in single-model predictions (critical for clinical deployment) and under-utilization of rich information in radiology reports that could guide segmentation in ambiguous regions.
Method: Proposes Disagreement-Guided Refinement Network (DGRNet) with: 1) Four lightweight view-specific adapters attached to shared encoder-decoder for diverse predictions and uncertainty quantification in single forward pass, 2) Disagreement maps to identify high uncertainty regions, 3) Text-conditioned refinement using clinical reports, 4) Diversity-preserving training with pairwise similarity penalties and gradient isolation to prevent view collapse.
Result: On TextBraTS dataset, DGRNet improves state-of-the-art segmentation accuracy by 2.4% in Dice and 11% in HD95 metrics, while providing meaningful uncertainty estimates.
Conclusion: DGRNet effectively addresses uncertainty quantification and leverages textual clinical information to improve brain tumor segmentation accuracy, making it suitable for clinical deployment.
Abstract: Accurate brain tumor segmentation from MRI scans is critical for diagnosis and treatment planning. Despite the strong performance of recent deep learning approaches, two fundamental limitations remain: (1) the lack of reliable uncertainty quantification in single-model predictions, which is essential for clinical deployment because the level of uncertainty may impact treatment decision-making, and (2) the under-utilization of rich information in radiology reports that can guide segmentation in ambiguous regions. In this paper, we propose the Disagreement-Guided Refinement Network (DGRNet), a novel framework that addresses both limitations through multi-view disagreement-based uncertainty estimation and text-conditioned refinement. DGRNet generates diverse predictions via four lightweight view-specific adapters attached to a shared encoder-decoder, enabling efficient uncertainty quantification within a single forward pass. Afterward, we build disagreement maps to identify regions of high segmentation uncertainty, which are then selectively refined according to clinical reports. Moreover, we introduce a diversity-preserving training strategy that combines pairwise similarity penalties and gradient isolation to prevent view collapse. The experimental results on the TextBraTS dataset show that DGRNet favorably improves state-of-the-art segmentation accuracy by 2.4% and 11% in main metrics Dice and HD95, respectively, while providing meaningful uncertainty estimates.
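A disagreement map of the kind described can be sketched as the per-pixel variance across the view-specific adapters' predictions, thresholded to select pixels for text-conditioned refinement. The variance statistic and the threshold value are assumptions, not DGRNet's exact definition:

```python
import numpy as np

def disagreement_map(probs, threshold=0.05):
    """Per-pixel disagreement as variance across view predictions.

    probs: (V, H, W) foreground probabilities from V view-specific
    adapters. Returns the disagreement map and a binary mask of
    pixels flagged for refinement.
    """
    dis = probs.var(axis=0)
    return dis, dis > threshold

views = np.stack([
    np.array([[0.9, 0.5]]),   # four adapters agree on pixel 0
    np.array([[0.9, 0.1]]),   # but scatter on pixel 1
    np.array([[0.9, 0.9]]),
    np.array([[0.9, 0.5]]),
])
dis, mask = disagreement_map(views)
```

Only the flagged pixels would then be revisited under guidance from the clinical report, keeping refinement selective.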
[276] Representation-Level Adversarial Regularization for Clinically Aligned Multitask Thyroid Ultrasound Assessment
Dina Salama, Mohamed Mahmoud, Nourhan Bayasi, David Liu, Ilker Hacihaliloglu
Main category: cs.CV
TL;DR: A multitask framework for thyroid ultrasound analysis that jointly predicts nodule segmentation masks and TI-RADS risk categories, using representation-level adversarial regularization to handle annotator variability.
Details
Motivation: Thyroid ultrasound analysis suffers from inconsistent supervision due to radiologist variability in both contouring style and risk grading (TI-RADS categories), which degrades standard learning pipelines. The paper aims to address this clinical workflow with a unified approach.
Method: Proposes a clinically guided multitask framework that jointly predicts nodule masks and TI-RADS categories. Uses TI-RADS-aligned radiomics targets to ground risk prediction in clinically meaningful evidence. Introduces RLAR (representation-level adversarial gradient regularizer) to handle gradient competition between tasks by penalizing excessive angular alignment between task-specific adversarial directions in latent space.
Result: On a public TI-RADS dataset, the clinically guided multitask model with RLAR consistently improves risk stratification while maintaining segmentation quality compared to single-task training and conventional multitask baselines.
Conclusion: The proposed framework effectively addresses clinical workflow needs by jointly handling segmentation and risk classification, with RLAR mitigating gradient competition issues under annotator variability.
Abstract: Thyroid ultrasound is the first-line exam for assessing thyroid nodules and determining whether biopsy is warranted. In routine reporting, radiologists produce two coupled outputs: a nodule contour for measurement and a TI-RADS risk category based on sonographic criteria. Yet both contouring style and risk grading vary across readers, creating inconsistent supervision that can degrade standard learning pipelines. In this paper, we address this workflow with a clinically guided multitask framework that jointly predicts the nodule mask and TI-RADS category within a single model. To ground risk prediction in clinically meaningful evidence, we guide the classification embedding using a compact TI-RADS aligned radiomics target during training, while preserving complementary deep features for discriminative performance. However, under annotator variability, naive multitask optimization often fails not because the tasks are unrelated, but because their gradients compete within the shared representation. To make this competition explicit and controllable, we introduce RLAR, a representation-level adversarial gradient regularizer. Rather than performing parameter-level gradient surgery, RLAR uses each task’s normalized adversarial direction in latent space as a geometric probe of task sensitivity and penalizes excessive angular alignment between task-specific adversarial directions. On a public TI-RADS dataset, our clinically guided multitask model with RLAR consistently improves risk stratification while maintaining segmentation quality compared to single-task training and conventional multitask baselines. Code and pretrained models will be released.
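The angular-alignment penalty at RLAR's core can be sketched as a hinged cosine similarity between the two tasks' normalized adversarial directions in the shared latent space. The squared hinge form and margin are assumptions; the paper specifies only that excessive alignment is penalized:

```python
import numpy as np

def rlar_penalty(dir_a, dir_b, margin=0.0):
    """Penalize excessive angular alignment between two tasks'
    normalized adversarial directions in the shared latent space.
    Zero penalty while cosine similarity stays below `margin`.
    """
    a = dir_a / np.linalg.norm(dir_a)
    b = dir_b / np.linalg.norm(dir_b)
    cos = float(a @ b)
    return max(cos - margin, 0.0) ** 2

aligned = rlar_penalty(np.array([1.0, 0.0]), np.array([1.0, 0.0]))
orthogonal = rlar_penalty(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
```

Because the probe lives in latent space rather than parameter space, this avoids the parameter-level gradient surgery the abstract contrasts against.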
[277] Learning Progressive Adaptation for Multi-Modal Tracking
He Wang, Tianyang Xu, Zhangyong Tang, Xiao-Jun Wu, Josef Kittler
Main category: cs.CV
TL;DR: PATrack introduces progressive adaptation with modality-dependent, modality-entangled, and task-level adapters to bridge RGB pre-trained models to multi-modal tracking tasks.
Details
Motivation: Existing multi-modal trackers use simple fine-tuning of RGB models, overlooking advanced adaptations for specific modalities, cross-modal interactions, and prediction head optimization.
Method: Progressive adaptation framework with three adapters: modality-dependent (enhances intra-modal features via frequency decomposition), modality-entangled (enables cross-modal interactions via cross-attention), and task-level (adapts prediction head to fused information).
Result: Extensive experiments on RGB+Thermal, RGB+Depth, and RGB+Event tracking tasks demonstrate impressive performance against state-of-the-art methods.
Conclusion: PATrack effectively bridges RGB pre-trained networks to multi-modal data through progressive adaptation, integrating intra-modal, inter-modal, and task-level adaptations in a unified framework.
Abstract: Due to the limited availability of paired multi-modal data, multi-modal trackers are typically built by adopting pre-trained RGB models with parameter-efficient fine-tuning modules. However, these fine-tuning methods overlook advanced adaptations for applying RGB pre-trained models and fail to modulate a single specific modality, cross-modal interactions, and the prediction head. To address the issues, we propose to perform Progressive Adaptation for Multi-Modal Tracking (PATrack). This innovative approach incorporates modality-dependent, modality-entangled, and task-level adapters, effectively bridging the gap in adapting RGB pre-trained networks to multi-modal data through a progressive strategy. Specifically, modality-specific information is enhanced through the modality-dependent adapter, decomposing the high- and low-frequency components, which ensures a more robust feature representation within each modality. The inter-modal interactions are introduced in the modality-entangled adapter, which implements a cross-attention operation guided by inter-modal shared information, ensuring the reliability of features conveyed between modalities. Additionally, recognising that the strong inductive bias of the prediction head does not adapt to the fused information, a task-level adapter specific to the prediction head is introduced. In summary, our design integrates intra-modal, inter-modal, and task-level adapters into a unified framework. Extensive experiments on RGB+Thermal, RGB+Depth, and RGB+Event tracking tasks demonstrate that our method shows impressive performance against state-of-the-art methods. Code is available at https://github.com/ouha1998/Learning-Progressive-Adaptation-for-Multi-Modal-Tracking.
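The high/low-frequency decomposition in the modality-dependent adapter can be illustrated with a simple low-pass split; the moving-average filter here is a stand-in, since the paper's exact decomposition is not specified in this summary:

```python
import numpy as np

def freq_decompose(feat, k=3):
    """Split a 1-D feature signal into low- and high-frequency parts.

    A moving average of width k acts as the low-pass filter; the
    high-frequency part is the residual, so the two components sum
    back to the original signal by construction.
    """
    kernel = np.ones(k) / k
    low = np.convolve(feat, kernel, mode="same")
    high = feat - low
    return low, high

feat = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 6.0])
low, high = freq_decompose(feat)
```

Each component could then be enhanced separately before recombination, which is the intent of the modality-dependent adapter.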
[278] Frequency Switching Mechanism for Parameter-Efficient Multi-Task Learning
Shih-Wen Liu, Yen-Chang Chen, Wei-Ta Chu, Fu-En Yang, Yu-Chiang Frank Wang
Main category: cs.CV
TL;DR: Free Sinewich is a parameter-efficient multi-task learning framework that uses frequency switching to enable near-zero-cost weight modulation for different tasks, achieving state-of-the-art performance-efficiency trade-offs.
Details
Motivation: Current parameter-efficient fine-tuning (PEFT) methods are largely limited to single-task adaptation, creating a need for efficient multi-task learning approaches that can handle multiple tasks with minimal parameter overhead.
Method: Introduces Sine-AWB layers that combine low-rank factors and convolutional priors into a single kernel, modulated elementwise by sinusoidal transformations to produce task-specialized weights. A lightweight Clock Net produces bounded frequencies to stabilize modulation during training.
Result: Achieves state-of-the-art performance-efficiency trade-offs on dense prediction benchmarks, with up to +5.39% improvement over single-task fine-tuning using only 6.53M trainable parameters.
Conclusion: Free Sinewich offers a compact and scalable paradigm for multi-task learning based on frequency-based parameter sharing, enabling efficient adaptation to multiple tasks with minimal parameter overhead.
Abstract: Multi-task learning (MTL) aims to enable a single model to solve multiple tasks efficiently; however, current parameter-efficient fine-tuning (PEFT) methods remain largely limited to single-task adaptation. We introduce \textbf{Free Sinewich}, a parameter-efficient multi-task learning framework that enables near-zero-cost weight modulation via frequency switching (\textbf{Free}). Specifically, a \textbf{Sine-AWB (Sinewich)} layer combines low-rank factors and convolutional priors into a single kernel, which is then modulated elementwise by a sinusoidal transformation to produce task-specialized weights. A lightweight Clock Net is introduced to produce bounded frequencies that stabilize this modulation during training. Theoretically, sine modulation enhances the rank of low-rank adapters, while frequency separation decorrelates the weights of different tasks. On dense prediction benchmarks, Free Sinewich achieves state-of-the-art performance-efficiency trade-offs (e.g., up to +5.39% improvement over single-task fine-tuning with only 6.53M trainable parameters), offering a compact and scalable paradigm based on frequency-based parameter sharing. Project page: \href{https://casperliuliuliu.github.io/projects/Free-Sinewich/}{https://casperliuliuliu.github.io/projects/Free-Sinewich}.
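One plausible reading of the elementwise sinusoidal modulation is sketched below. The exact functional form, and how the Clock Net's bounded frequency enters it, are not specified in this summary, so everything here is an assumption for intuition only:

```python
import numpy as np

def sine_modulate(base_kernel, freq, phase=0.0):
    """Produce task-specialized weights by elementwise sinusoidal
    modulation of a shared kernel. Each task supplies its own bounded
    scalar frequency (here standing in for a hypothetical Clock Net
    output), so different tasks read different weights from one kernel.
    """
    return base_kernel * np.sin(freq * base_kernel + phase)

shared = np.linspace(-1.0, 1.0, 6)   # one shared kernel
task_a = sine_modulate(shared, freq=1.0)
task_b = sine_modulate(shared, freq=2.5)
```

Switching tasks then costs only a change of frequency, which is the "near-zero-cost weight modulation" the TL;DR refers to.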
[279] CVT-Bench: Counterfactual Viewpoint Transformations Reveal Unstable Spatial Representations in Multimodal LLMs
Shanmukha Vellamcheti, Uday Kiran Kothapalli, Disharee Bhowmick, Sathyanarayanan N. Aakur
Main category: cs.CV
TL;DR: MLLMs show strong single-view spatial reasoning but fail to maintain consistent spatial representations under counterfactual viewpoint changes, with systematic degradation in relational consistency across viewpoint transformations.
Details
Motivation: To evaluate whether multimodal large language models maintain stable spatial state representations under counterfactual viewpoint changes, going beyond single-view spatial reasoning performance.
Method: Created a controlled diagnostic benchmark with 100 synthetic scenes and 6,000 relational queries to evaluate relational consistency under hypothetical camera orbit transformations without re-rendering images. Measured viewpoint consistency, 360° cycle agreement, and relational stability over sequential transformations.
Result: Despite high single-view accuracy, state-of-the-art MLLMs exhibit systematic degradation under counterfactual viewpoint changes, with frequent violations of cycle consistency and rapid decay in relational stability. Increasing representational structure (from visual input to textual bounding boxes to structured scene graphs) improves stability.
Conclusion: Single-view spatial accuracy overestimates the robustness of induced spatial representations, and representation structure plays a critical role in counterfactual spatial reasoning for MLLMs.
Abstract: Multimodal large language models (MLLMs) achieve strong performance on single-view spatial reasoning tasks, yet it remains unclear whether they maintain stable spatial state representations under counterfactual viewpoint changes. We introduce a controlled diagnostic benchmark that evaluates relational consistency under hypothetical camera orbit transformations without re-rendering images. Across 100 synthetic scenes and 6,000 relational queries, we measure viewpoint consistency, 360° cycle agreement, and relational stability over sequential transformations. Despite high single-view accuracy, state-of-the-art MLLMs exhibit systematic degradation under counterfactual viewpoint changes, with frequent violations of cycle consistency and rapid decay in relational stability. We further evaluate multiple input representations, visual input, textual bounding boxes, and structured scene graphs, and show that increasing representational structure improves stability. Our results suggest that single-view spatial accuracy overestimates the robustness of induced spatial representations and that representation structure plays a critical role in counterfactual spatial reasoning.
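The cycle-agreement criterion can be illustrated with egocentric relations under 90° camera orbits. The relation names and the direction of the mapping (one sign convention for a clockwise orbit) are illustrative assumptions; what matters is that four successive 90° orbits must return the original relation:

```python
# How a left/right/front/behind relation transforms under a 90°
# clockwise camera orbit, under one sign convention. Four successive
# orbits compose to the identity, which is the 360° cycle-agreement
# check the benchmark applies to model answers.
ORBIT_90_CW = {"left": "behind", "behind": "right",
               "right": "front", "front": "left"}

def orbit(relation, steps=1):
    """Apply `steps` successive 90° clockwise orbits to a relation."""
    for _ in range(steps % 4):
        relation = ORBIT_90_CW[relation]
    return relation
```

A model whose answers after a full 360° orbit differ from its original answers violates cycle consistency regardless of which convention it adopted.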
[280] LiFR-Seg: Anytime High-Frame-Rate Segmentation via Event-Guided Propagation
Xiaoshan Wu, Xiaoyang Lyu, Yifei Yu, Bo Wang, Zhongrui Wang, Xiaojuan Qi
Main category: cs.CV
TL;DR: LiFR-Seg enables dense semantic segmentation at any time using single past RGB frame + event stream, achieving HFR-like performance with LFR hardware via uncertainty-aware feature warping guided by event-driven motion fields.
Details
Motivation: Standard cameras' low-frame-rate creates perceptual gaps in dynamic environments, limiting dense semantic segmentation. Need to achieve high-frame-rate perception using only low-frame-rate hardware.
Method: Proposes LiFR-Seg framework with uncertainty-aware warping process guided by event-driven motion field and explicit confidence learning. Includes temporal memory attention module for coherence in dynamic scenes.
Result: Achieves 73.82% mIoU on DSEC dataset, statistically indistinguishable from HFR upper-bound (within 0.09%). Validated on DSEC and new synthetic benchmark SHF-DSEC.
Conclusion: Presents efficient paradigm for robust high-frame-rate perception with low-frame-rate hardware through event-guided feature propagation.
Abstract: Dense semantic segmentation in dynamic environments is fundamentally limited by the low-frame-rate (LFR) nature of standard cameras, which creates critical perceptual gaps between frames. To solve this, we introduce Anytime Interframe Semantic Segmentation: a new task for predicting segmentation at any arbitrary time using only a single past RGB frame and a stream of asynchronous event data. This task presents a core challenge: how to robustly propagate dense semantic features using a motion field derived from sparse and often noisy event data, all while mitigating feature degradation in highly dynamic scenes. We propose LiFR-Seg, a novel framework that directly addresses these challenges by propagating deep semantic features through time. The core of our method is an uncertainty-aware warping process, guided by an event-driven motion field and its learned, explicit confidence. A temporal memory attention module further ensures coherence in dynamic scenarios. We validate our method on the DSEC dataset and a new high-frequency synthetic benchmark (SHF-DSEC) we contribute. Remarkably, our LFR system achieves performance (73.82% mIoU on DSEC) that is statistically indistinguishable from an HFR upper-bound (within 0.09%) that has full access to the target frame. This work presents a new, efficient paradigm for achieving robust, high-frame-rate perception with low-frame-rate hardware. Project Page: https://candy-crusher.github.io/LiFR_Seg_Proj/#; Code: https://github.com/Candy-Crusher/LiFR-Seg.git.
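The uncertainty-aware propagation at the heart of LiFR-Seg can be sketched in a few lines of numpy. The nearest-neighbour warping, the blending rule, and all names below are illustrative stand-ins for the paper's learned components, not its actual implementation:

```python
import numpy as np

def warp_features(feat, flow):
    """Backward-warp a feature map (C, H, W) with a per-pixel flow (2, H, W).

    Nearest-neighbour sampling keeps the sketch short; the paper's warping
    details are not reproduced here.
    """
    C, H, W = feat.shape
    ys, xs = np.mgrid[0:H, 0:W]
    src_y = np.clip(np.round(ys + flow[1]).astype(int), 0, H - 1)
    src_x = np.clip(np.round(xs + flow[0]).astype(int), 0, W - 1)
    return feat[:, src_y, src_x]

def uncertainty_aware_propagate(feat, flow, confidence):
    """Blend warped features with the unwarped ones by a confidence map.

    Where confidence ~ 1 the event-driven motion is trusted; where it ~ 0
    the original feature is kept, mitigating degradation from noisy events.
    """
    warped = warp_features(feat, flow)
    return confidence[None] * warped + (1.0 - confidence[None]) * feat

# Toy check: a uniform shift of one pixel in x with full confidence
# moves the bright first column to the second column.
feat = np.zeros((1, 4, 4)); feat[0, :, 0] = 1.0
flow = np.zeros((2, 4, 4)); flow[0] = -1.0   # sample each pixel from x-1
conf = np.ones((4, 4))
out = uncertainty_aware_propagate(feat, flow, conf)
```

The confidence map is what makes the propagation robust: a learned gate decides per pixel whether to trust the sparse, noisy event-derived motion or to fall back on the source feature.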
[281] ReDiffuse: Rotation Equivariant Diffusion Model for Multi-focus Image Fusion
Bo Li, Tingting Bao, Lingling Zhang, Weiping Fu, Yaxian Wang, Jun Liu
Main category: cs.CV
TL;DR: ReDiffuse: A rotation-equivariant diffusion model for multi-focus image fusion that preserves geometric structure consistency by embedding rotation equivariance into diffusion networks.
Details
Motivation: Diffusion models show promise for multi-focus image fusion, but defocus blur can warp geometric structures like textures and edges, causing artifacts. Rotation equivariance is needed to preserve original orientation and structural consistency in fusion results.
Method: Constructs basic diffusion architectures to achieve end-to-end rotation equivariance, with rigorous theoretical analysis of intrinsic equivariance error to validate the embedding of equivariance structures.
Result: Comprehensive evaluation against various MFIF methods across four datasets (Lytro, MFFW, MFI-WHU, Road-MF) shows competitive performance with improvements of 0.28-6.64% across six evaluation metrics.
Conclusion: ReDiffuse successfully addresses the structural preservation challenge in diffusion-based multi-focus image fusion through rotation-equivariant design, achieving state-of-the-art performance while maintaining geometric consistency.
Abstract: Diffusion models have achieved impressive performance on multi-focus image fusion (MFIF). However, a key challenge in applying diffusion models to the ill-posed MFIF problem is that defocus blur can make common symmetric geometric structures (e.g., textures and edges) appear warped and deformed, often leading to unexpected artifacts in the fused images. Therefore, embedding rotation equivariance into diffusion networks is essential, as it enables the fusion results to faithfully preserve the original orientation and structural consistency of geometric patterns underlying the input images. Motivated by this, we propose ReDiffuse, a rotation-equivariant diffusion model for MFIF. Specifically, we carefully construct the basic diffusion architectures to achieve end-to-end rotation equivariance. We also provide a rigorous theoretical analysis to evaluate its intrinsic equivariance error, demonstrating the validity of embedding equivariance structures. ReDiffuse is comprehensively evaluated against various MFIF methods across four datasets (Lytro, MFFW, MFI-WHU, and Road-MF). Results demonstrate that ReDiffuse achieves competitive performance, with improvements of 0.28-6.64% across six evaluation metrics. The code is available at https://github.com/MorvanLi/ReDiffuse.
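The equivariance property ReDiffuse builds into its architecture, f(Rx) = Rf(x), is easy to test numerically for 90° rotations; the toy operators below are for illustration only and unrelated to the paper's networks:

```python
import numpy as np

def rot90(x, k=1):
    """Rotate an image (H, W) by k * 90 degrees."""
    return np.rot90(x, k)

def equivariance_error(f, x, k=1):
    """max |f(R x) - R f(x)| -- zero iff f is equivariant to this rotation.

    A crude analogue of the paper's intrinsic equivariance error, evaluated
    only for exact 90-degree rotations on toy operators.
    """
    return float(np.abs(f(rot90(x, k)) - rot90(f(x), k)).max())

x = np.random.default_rng(0).random((8, 8))
pointwise = lambda img: np.tanh(img)            # acts per pixel: equivariant
col_shift = lambda img: np.roll(img, 1, axis=1) # shifts columns: not equivariant

err_eq = equivariance_error(pointwise, x)
err_ne = equivariance_error(col_shift, x)
```

A pointwise nonlinearity commutes with rotation exactly, while a direction-dependent operation does not; the paper's contribution is making every layer of a diffusion backbone behave like the first case.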
[282] One Pool Is Not Enough: Multi-Cluster Memory for Practical Test-Time Adaptation
Yu-Wen Tseng, Xingyi Zheng, Ya-Chen Wu, I-Bin Liao, Yung-Hui Li, Hong-Han Shuai, Wen-Huang Cheng
Main category: cs.CV
TL;DR: MCM introduces multi-cluster memory organization for practical test-time adaptation, addressing the multi-modal nature of test streams through cluster assignment, consolidation, and balanced retrieval.
Details
Motivation: Existing TTA methods use single unstructured memory pools, which are fundamentally mismatched to practical TTA settings where test streams are temporally correlated, non-i.i.d., and inherently multi-modal.
Method: Multi-Cluster Memory (MCM) organizes stored samples into multiple clusters using pixel-level statistical descriptors, with three mechanisms: descriptor-based cluster assignment, Adjacent Cluster Consolidation to bound memory, and Uniform Cluster Retrieval for balanced supervision.
Result: MCM achieves consistent improvements across 12 configurations on CIFAR-10-C, CIFAR-100-C, ImageNet-C, and DomainNet, with gains up to 5.00% on ImageNet-C and 12.13% on DomainNet, scaling with distributional complexity.
Conclusion: Memory organization is a key design axis for practical test-time adaptation, and multi-cluster approaches better match the inherent multi-modality of real-world test streams.
Abstract: Test-time adaptation (TTA) adapts pre-trained models to distribution shifts at inference using only unlabeled test data. Under the Practical TTA (PTTA) setting, where test streams are temporally correlated and non-i.i.d., memory has become an indispensable component for stable adaptation, yet existing methods universally store samples in a single unstructured pool. We show that this single-cluster design is fundamentally mismatched to PTTA: a stream clusterability analysis reveals that test streams are inherently multi-modal, with the optimal number of mixture components consistently far exceeding one. To close this structural gap, we propose Multi-Cluster Memory (MCM), a plug-and-play framework that organizes stored samples into multiple clusters using lightweight pixel-level statistical descriptors. MCM introduces three complementary mechanisms: descriptor-based cluster assignment to capture distinct distributional modes, Adjacent Cluster Consolidation (ACC) to bound memory usage by merging the most similar temporally adjacent clusters, and Uniform Cluster Retrieval (UCR) to ensure balanced supervision across all modes during adaptation. Integrated with three contemporary TTA methods on CIFAR-10-C, CIFAR-100-C, ImageNet-C, and DomainNet, MCM achieves consistent improvements across all 12 configurations, with gains up to 5.00% on ImageNet-C and 12.13% on DomainNet. Notably, these gains scale with distributional complexity: larger label spaces with greater multi-modality benefit most from multi-cluster organization. GMM-based memory diagnostics further confirm that MCM maintains near-optimal distributional balance, entropy, and mode coverage, whereas single-cluster memory exhibits persistent imbalance and progressive mode loss. These results establish memory organization as a key design axis for practical test-time adaptation.
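MCM's three mechanisms (descriptor-based assignment, Adjacent Cluster Consolidation, Uniform Cluster Retrieval) can be caricatured in plain Python. The per-sample mean/std descriptor, the thresholds, and the merge rule below are hypothetical simplifications of the paper's pixel-level statistical descriptors, not its design:

```python
import random
import numpy as np

class MultiClusterMemory:
    """Toy sketch of a multi-cluster memory for illustration only."""

    def __init__(self, max_clusters=4, tau=1.0):
        self.clusters = []           # each: {"centroid", "samples"}
        self.max_clusters = max_clusters
        self.tau = tau               # distance threshold to open a cluster

    @staticmethod
    def descriptor(x):
        # Stand-in for the paper's pixel-level statistical descriptor.
        return np.array([x.mean(), x.std()])

    def add(self, x):
        d = self.descriptor(x)
        # Descriptor-based cluster assignment: join the nearest cluster
        # if it is close enough, otherwise open a new one.
        if self.clusters:
            dists = [np.linalg.norm(d - c["centroid"]) for c in self.clusters]
            i = int(np.argmin(dists))
            if dists[i] < self.tau:
                c = self.clusters[i]
                n = len(c["samples"])
                c["centroid"] = (c["centroid"] * n + d) / (n + 1)
                c["samples"].append(x)
                return
        self.clusters.append({"centroid": d, "samples": [x]})
        # Adjacent Cluster Consolidation: merge the most similar
        # neighbouring clusters to bound memory.
        while len(self.clusters) > self.max_clusters:
            gaps = [np.linalg.norm(self.clusters[j]["centroid"] -
                                   self.clusters[j + 1]["centroid"])
                    for j in range(len(self.clusters) - 1)]
            j = int(np.argmin(gaps))
            a, b = self.clusters[j], self.clusters.pop(j + 1)
            a["samples"] += b["samples"]
            a["centroid"] = (a["centroid"] + b["centroid"]) / 2

    def retrieve(self, k):
        # Uniform Cluster Retrieval: equal draws from every cluster.
        per = max(1, k // max(1, len(self.clusters)))
        return [random.choice(c["samples"])
                for c in self.clusters for _ in range(per)]

# A stream with two distributional modes lands in two clusters.
mem = MultiClusterMemory(max_clusters=2, tau=0.5)
rng = np.random.default_rng(1)
for _ in range(10):
    mem.add(rng.normal(0.0, 0.01, 16))
    mem.add(rng.normal(5.0, 0.01, 16))
```

The point of the retrieval rule is that supervision stays balanced across modes even when the stream itself is heavily skewed toward one of them.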
[283] MS-CustomNet: Controllable Multi-Subject Customization with Hierarchical Relational Semantics
Pengxiang Cai, Mengyang Li
Main category: cs.CV
TL;DR: MS-CustomNet enables multi-subject customization in text-to-image generation with explicit user control over hierarchical arrangements and spatial relationships between subjects.
Details
Motivation: Current diffusion-based text-to-image methods lack fine-grained control over multi-subject compositions, struggling with explicit user-defined control over compositional structure and precise spatial relationships between multiple distinct subjects.
Method: Introduces MS-CustomNet framework for zero-shot integration of multiple user-provided objects with explicit hierarchical arrangement and spatial placement control. Also presents MSI dataset derived from COCO for training on complex multi-subject compositions.
Result: Achieves DINO-I score of 0.61 for identity preservation and YOLO-L score of 0.94 for positional control in multi-subject customization tasks, demonstrating superior capability in generating high-fidelity images with precise user-directed compositions.
Conclusion: MS-CustomNet offers enhanced fine-grained control over multi-subject image generation, addressing the challenge of maintaining subject identity while enabling explicit user control over compositional structure and spatial relationships.
Abstract: Diffusion-based text-to-image generation has advanced significantly, yet customizing scenes with multiple distinct subjects while maintaining fine-grained control over their interactions remains challenging. Existing methods often struggle to provide explicit user-defined control over the compositional structure and precise spatial relationships between subjects. To address this, we introduce MS-CustomNet, a novel framework for multi-subject customization. MS-CustomNet allows zero-shot integration of multiple user-provided objects and, crucially, empowers users to explicitly define these hierarchical arrangements and spatial placements within the generated image. Our approach ensures individual subject identity preservation while learning and enacting these user-specified inter-subject compositions. We also present the MSI dataset, derived from COCO, to facilitate training on such complex multi-subject compositions. MS-CustomNet offers enhanced, fine-grained control over multi-subject image generation. Our method achieves a DINO-I score of 0.61 for identity preservation and a YOLO-L score of 0.94 for positional control in multi-subject customization tasks, demonstrating its superior capability in generating high-fidelity images with precise, user-directed multi-subject compositions and spatial control.
[284] Incentivizing Generative Zero-Shot Learning via Outcome-Reward Reinforcement Learning with Visual Cues
Wenjin Hou, Xiaoxiao Sun, Hehe Fan
Main category: cs.CV
TL;DR: RLVC is a reinforcement learning framework with visual cues for generative zero-shot learning that improves feature synthesis through outcome-based rewards and class-wise visual alignment.
Details
Motivation: Current generative ZSL methods produce task-agnostic features and struggle with semantically similar but visually distinct classes, leading to degraded performance.
Method: Uses reinforcement learning with outcome-based rewards to evolve the generative model, incorporates class-wise visual cues to align synthesized features with visual prototypes, and employs a novel cold-start training strategy.
Result: Achieves state-of-the-art results on three ZSL benchmarks with a 4.7% performance gain.
Conclusion: RLVC effectively addresses limitations of generative ZSL by making synthesized features more task-relevant and visually aligned, demonstrating the value of reinforcement learning for feature generation.
Abstract: Recent advances in zero-shot learning (ZSL) have demonstrated the potential of generative models. Typically, generative ZSL synthesizes visual features conditioned on semantic prototypes to model the data distribution of unseen classes, followed by training a classifier on the synthesized data. However, the synthesized features often remain task-agnostic, leading to degraded performance. Moreover, inferring a faithful distribution from semantic prototypes alone is insufficient for classes that are semantically similar but visually distinct. To address these issues and advance ZSL, we propose RLVC, an outcome-reward reinforcement learning (RL) framework with visual cues for generative ZSL. At its core, RL empowers the generative model to self-evolve, implicitly enhancing its generation capability. In particular, RLVC updates the generative model using an outcome-based reward, encouraging the synthesis of task-relevant features. Furthermore, we introduce class-wise visual cues that (i) align synthesized features with visual prototypes and (ii) stabilize the RL training updates. For the training process, we present a novel cold-start strategy. Comprehensive experiments and analyses on three prevalent ZSL benchmarks demonstrate that RLVC achieves state-of-the-art results with a 4.7% gain.
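A cartoon of the outcome-reward idea: evolve a feature generator with a REINFORCE-style update driven by a binary downstream reward, plus a pull toward a class-wise visual prototype. Everything here (the 2-D features, the reward radius, the step sizes) is a hypothetical toy, not RLVC itself:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: a 2-D "visual prototype" for one class and a generator
# that synthesizes features as Gaussian samples around a learned mean.
prototype = np.array([2.0, -1.0])   # class-wise visual cue
mu = np.zeros(2)                    # generator parameter to evolve

def outcome_reward(feats):
    """Binary outcome reward: 1 if a synthesized feature lands near the
    prototype (i.e. a downstream classifier would get it right)."""
    return (np.linalg.norm(feats - prototype, axis=1) < 1.5).astype(float)

for step in range(200):
    feats = mu + rng.normal(size=(64, 2))    # synthesize a batch
    r = outcome_reward(feats)
    baseline = r.mean()
    # REINFORCE-style update: move mu toward samples with above-average
    # reward; the cue term pulls toward the visual prototype and
    # stabilizes the noisy RL updates.
    grad = ((r - baseline)[:, None] * (feats - mu)).mean(axis=0)
    mu += 0.5 * grad + 0.05 * (prototype - mu)
```

The two ingredients mirror the summary above: the outcome reward makes the synthesized features task-relevant, and the visual-cue term anchors them to the class's visual prototype.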
[285] Training-Free Instance-Aware 3D Scene Reconstruction and Diffusion-Based View Synthesis from Sparse Images
Jiatong Xia, Lingqiao Liu
Main category: cs.CV
TL;DR: Training-free 3D indoor scene reconstruction from sparse unposed RGB images using point cloud reconstruction, instance lifting, and diffusion-based rendering.
Details
Motivation: Traditional radiance field methods require dense views and per-scene optimization, limiting efficiency. The authors aim to enable high-fidelity 3D reconstruction and editing without training or pose preprocessing.
Method: Three-stage pipeline: 1) Robust point cloud reconstruction with warping-based anomaly removal, 2) Warping-guided 2D-to-3D instance lifting for consistent instance-aware representation, 3) Novel rendering via point cloud projection refined with 3D-aware diffusion model.
Result: Achieves high-fidelity 3D reconstruction from sparse unposed images, supports object-level scene editing (instance removal) by modifying point cloud, and generates consistent edited views without retraining.
Conclusion: Establishes a new direction for efficient, editable 3D content generation without scene-specific optimization, leveraging diffusion models to compensate for sparse geometry.
Abstract: We introduce a novel, training-free system for reconstructing, understanding, and rendering 3D indoor scenes from a sparse set of unposed RGB images. Unlike traditional radiance field approaches that require dense views and per-scene optimization, our pipeline achieves high-fidelity results without any training or pose preprocessing. The system integrates three key innovations: (1) A robust point cloud reconstruction module that filters unreliable geometry using a warping-based anomaly removal strategy; (2) A warping-guided 2D-to-3D instance lifting mechanism that propagates 2D segmentation masks into a consistent, instance-aware 3D representation; and (3) A novel rendering approach that projects the point cloud into new views and refines the renderings with a 3D-aware diffusion model. Our method leverages the generative power of diffusion to compensate for missing geometry and enhances realism, especially under sparse input conditions. We further demonstrate that object-level scene editing such as instance removal can be naturally supported in our pipeline by modifying only the point cloud, enabling the synthesis of consistent, edited views without retraining. Our results establish a new direction for efficient, editable 3D content generation without relying on scene-specific optimization. Project page: https://jiatongxia.github.io/TID3R/
[286] GIDE: Unlocking Diffusion LLMs for Precise Training-Free Image Editing
Zifeng Zhu, Jiaming Han, Jiaxiang Zhao, Minnan Luo, Xiangyu Yue
Main category: cs.CV
TL;DR: GIDE is a training-free image editing framework for Diffusion Large Language Models that enables precise editing using text, point, and box instructions while preserving background structure through discrete noise inversion.
Details
Motivation: Diffusion Large Language Models (DLLMs) show strong multi-modal generation capabilities but struggle with precise, training-free image editing due to discrete tokenization preventing standard noise inversion techniques, leading to structural degradation.
Method: GIDE introduces a unified framework with Discrete Noise Inversion to capture latent noise patterns in discrete token space, decomposing editing into grounding, inversion, and refinement stages to support various editing instructions while preserving unedited background.
Result: GIDE significantly outperforms prior training-free methods on GIDE-Bench (805 compositional editing scenarios), improving Semantic Correctness by 51.83% and Perceptual Quality by 50.39%, and shows consistent gains over trained baselines on ImgEdit-Bench.
Conclusion: GIDE successfully bridges the gap between DLLMs and precise image editing through discrete noise inversion, enabling training-free editing with various instructions while maintaining background integrity, validated by comprehensive benchmarks.
Abstract: While Diffusion Large Language Models (DLLMs) have demonstrated remarkable capabilities in multi-modal generation, performing precise, training-free image editing remains an open challenge. Unlike continuous diffusion models, the discrete tokenization inherent in DLLMs hinders the application of standard noise inversion techniques, often leading to structural degradation during editing. In this paper, we introduce GIDE (Grounded Inversion for DLLM Image Editing), a unified framework designed to bridge this gap. GIDE incorporates a novel Discrete Noise Inversion mechanism that accurately captures latent noise patterns within the discrete token space, ensuring high-fidelity reconstruction. We then decompose the editing pipeline into grounding, inversion, and refinement stages. This design enables GIDE to support various editing instructions (text, point, and box) and operations while strictly preserving the unedited background. Furthermore, to overcome the limitations of existing single-step evaluation protocols, we introduce GIDE-Bench, a rigorous benchmark comprising 805 compositional editing scenarios guided by diverse multi-modal inputs. Extensive experiments on GIDE-Bench demonstrate that GIDE significantly outperforms prior training-free methods, improving Semantic Correctness by 51.83% and Perceptual Quality by 50.39%. Additional evaluations on ImgEdit-Bench confirm its broad applicability, demonstrating consistent gains over trained baselines and yielding photorealistic consistency on par with leading models.
[287] Boundary-Aware Instance Segmentation in Microscopy Imaging
Thomas Mendelson, Joshua Francois, Galit Lahav, Tammy Riklin-Raviv
Main category: cs.CV
TL;DR: A prompt-free, boundary-aware instance segmentation framework for microscopy videos that predicts signed distance functions (SDFs) instead of binary masks to better separate touching or overlapping cell instances.
Details
Motivation: Accurate delineation of individual cells in microscopy videos is essential for studying cellular dynamics, but separating touching or overlapping instances remains challenging. While foundation models like SAM have improved accessibility, they still struggle with dense microscopy scenes without extensive prompting.
Method: Proposes a prompt-free, boundary-aware instance segmentation framework that predicts signed distance functions (SDFs) instead of binary masks, enabling smooth and geometry-consistent modeling of cell contours. Uses learned sigmoid mapping to convert SDFs into probability maps for sharp boundary localization. Training is guided by a unified Modified Hausdorff Distance (MHD) loss integrating region- and boundary-based terms.
Result: Evaluations on both public and private high-throughput microscopy datasets demonstrate improved boundary accuracy and instance-level performance compared to recent SAM-based and foundation-model approaches.
Conclusion: The proposed framework effectively addresses the challenge of separating touching cell instances in dense microscopy scenes without requiring extensive prompting, offering improved boundary accuracy and instance separation compared to existing methods.
Abstract: Accurate delineation of individual cells in microscopy videos is essential for studying cellular dynamics, yet separating touching or overlapping instances remains a persistent challenge. Although foundation models for segmentation such as SAM have broadened the accessibility of image segmentation, they still struggle to separate nearby cell instances in dense microscopy scenes without extensive prompting. We propose a prompt-free, boundary-aware instance segmentation framework that predicts signed distance functions (SDFs) instead of binary masks, enabling smooth and geometry-consistent modeling of cell contours. A learned sigmoid mapping converts SDFs into probability maps, yielding sharp boundary localization and robust separation of adjacent instances. Training is guided by a unified Modified Hausdorff Distance (MHD) loss that integrates region- and boundary-based terms. Evaluations on both public and private high-throughput microscopy datasets demonstrate improved boundary accuracy and instance-level performance compared to recent SAM-based and foundation-model approaches. Source code is available at: https://github.com/ThomasMendelson/BAISeg.git
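The SDF-over-masks idea is concrete enough to demo: per-instance signed distances keep a boundary between touching cells where a binary union mask would fuse them, and a sigmoid with a steepness parameter turns the SDF into a probability map (learned in the paper, fixed here for illustration):

```python
import numpy as np

def sdf_to_probability(sdf, alpha=4.0):
    """Map a signed distance function (negative inside a cell) to a
    foreground probability; `alpha` stands in for the paper's learned
    sigmoid mapping."""
    return 1.0 / (1.0 + np.exp(alpha * sdf))

# Two overlapping circles as toy "cells": per-instance SDFs still
# separate them at the locus where the two distances are equal.
ys, xs = np.mgrid[0:32, 0:64].astype(float)
sdf_a = np.hypot(ys - 16, xs - 24) - 10   # circle A, radius 10
sdf_b = np.hypot(ys - 16, xs - 43) - 10   # circle B, radius 10

prob_a = sdf_to_probability(sdf_a)
foreground = np.minimum(sdf_a, sdf_b) < 0           # union of both cells
label = np.where(foreground,
                 np.where(sdf_a < sdf_b, 1, 2), 0)  # nearest-instance label
```

The zero-level set of each SDF gives a sharp, sub-pixel boundary, which is exactly the property the MHD loss in the paper is designed to supervise.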
[288] JANUS: A Lightweight Framework for Jailbreaking Text-to-Image Models via Distribution Optimization
Haolun Zheng, Yu He, Tailun Chen, Shuo Shao, Zhixuan Chu, Hongbin Zhou, Lan Tao, Zhan Qin, Kui Ren
Main category: cs.CV
TL;DR: JANUS is a lightweight jailbreak framework for text-to-image models that optimizes structured prompt distributions using black-box rewards from T2I systems and safety filters, achieving higher attack success rates than existing methods.
Details
Motivation: Current T2I models remain vulnerable to jailbreak attacks despite safety filters. Existing attacks either use proxy-loss optimization instead of true end-to-end objectives, or rely on large-scale, costly RL-trained generators. There's a need for more efficient and effective jailbreak methods.
Method: JANUS formulates jailbreak as optimizing a structured prompt distribution under black-box, end-to-end rewards from T2I systems and safety filters. It replaces high-capacity generators with a low-dimensional mixing policy over two semantically anchored prompt distributions, enabling efficient exploration while preserving target semantics.
Result: JANUS outperforms state-of-the-art jailbreak methods on modern T2I models, improving ASR-8 from 25.30% to 43.15% on Stable Diffusion 3.5 Large Turbo with consistently higher CLIP and NSFW scores. It succeeds across both open-source and commercial models.
Conclusion: The findings expose structural weaknesses in current T2I safety pipelines and motivate stronger, distribution-aware defenses. The paper warns that it contains potentially offensive model outputs.
Abstract: Text-to-image (T2I) models such as Stable Diffusion and DALLE remain susceptible to generating harmful or Not-Safe-For-Work (NSFW) content under jailbreak attacks despite deployed safety filters. Existing jailbreak attacks either rely on proxy-loss optimization instead of the true end-to-end objective, or depend on large-scale and costly RL-trained generators. Motivated by these limitations, we propose JANUS, a lightweight framework that formulates jailbreak as optimizing a structured prompt distribution under a black-box, end-to-end reward from the T2I system and its safety filters. JANUS replaces a high-capacity generator with a low-dimensional mixing policy over two semantically anchored prompt distributions, enabling efficient exploration while preserving the target semantics. On modern T2I models, we outperform state-of-the-art jailbreak methods, improving ASR-8 from 25.30% to 43.15% on Stable Diffusion 3.5 Large Turbo with consistently higher CLIP and NSFW scores. JANUS succeeds across both open-source and commercial models. These findings expose structural weaknesses in current T2I safety pipelines and motivate stronger, distribution-aware defenses. Warning: This paper contains model outputs that may be offensive.
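The "low-dimensional mixing policy" can be pictured as a single scalar optimized against a black-box reward. The 1-D Gaussians standing in for the two anchored prompt distributions, the reward, and the finite-difference update are all illustrative; JANUS's actual policy and reward come from a real T2I system and its filters:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_prompt(w, n=2048):
    """Mix two fixed, semantically anchored prompt distributions (1-D
    Gaussians standing in for prompt embeddings) with weight w in [0, 1]."""
    pick = rng.random(n) < w
    return np.where(pick, rng.normal(3.0, 0.5, n), rng.normal(0.0, 0.5, n))

def black_box_reward(prompts):
    """End-to-end reward observed only as a scalar, like a T2I system plus
    safety filter (toy: fraction of prompts in a 'bypass' region > 2)."""
    return float((prompts > 2.0).mean())

w, lr, eps = 0.5, 0.8, 0.05
for _ in range(30):
    # Two-point finite-difference estimate of d(reward)/dw: no gradients
    # from the scored system are needed, matching the black-box setting.
    g = (black_box_reward(sample_prompt(min(w + eps, 1.0)))
         - black_box_reward(sample_prompt(max(w - eps, 0.0)))) / (2 * eps)
    w = float(np.clip(w + lr * g, 0.0, 1.0))
```

Because the whole search space is a single mixing weight rather than a generator's parameters, exploration stays cheap, which is the paper's stated advantage over RL-trained prompt generators.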
[289] Positional Segmentor-Guided Counterfactual Fine-Tuning for Spatially Localized Image Synthesis
Tian Xia, Matthew Sinclair, Andreas Schuh, Fabio De Sousa Ribeiro, Raghav Mehta, Rajat Rasal, Esther Puyol-Antón, Samuel Gerber, Kersten Petersen, Michiel Schaap, Ben Glocker
Main category: cs.CV
TL;DR: Positional Seg-CFT enables localized counterfactual image generation by subdividing anatomical structures into regional segments for independent measurement and intervention, allowing spatially precise modifications.
Details
Motivation: Existing counterfactual image generation methods are limited to global interventions or require tedious manual mask creation, failing to produce localized structural changes without artifacts.
Method: Extends Seg-CFT by subdividing each anatomical structure into regional segments and deriving independent measurements per region, enabling spatially localized and anatomically coherent counterfactuals.
Result: Experiments on coronary CT angiography show realistic, region-specific modifications with finer spatial control for modeling disease progression.
Conclusion: Positional Seg-CFT provides a practical approach for generating localized counterfactuals with anatomical coherence, advancing capabilities for disease modeling and data augmentation.
Abstract: Counterfactual image generation enables controlled data augmentation, bias mitigation, and disease modeling. However, existing methods guided by external classifiers or regressors are limited to subject-level factors (e.g., age) and fail to produce localized structural changes, often resulting in global artifacts. Pixel-level guidance using segmentation masks has been explored, but requires user-defined counterfactual masks, which are tedious and impractical. Segmentor-guided Counterfactual Fine-Tuning (Seg-CFT) addressed this by using segmentation-derived measurements to supervise structure-specific variables, yet it remains restricted to global interventions. We propose Positional Seg-CFT, which subdivides each structure into regional segments and derives independent measurements per region, enabling spatially localized and anatomically coherent counterfactuals. Experiments on coronary CT angiography show that Pos-Seg-CFT generates realistic, region-specific modifications, providing finer spatial control for modeling disease progression.
[290] Reframing Long-Tailed Learning via Loss Landscape Geometry
Shenghan Chen, Yiming Liu, Yanzhen Wang, Yujia Wang, Xiankai Lu
Main category: cs.CV
TL;DR: A framework addressing long-tail classification by preventing tail performance degradation through grouped knowledge preservation and sharpness-aware optimization to find flatter minima.
Details
Motivation: The paper addresses the challenge of balancing performance trade-offs on long-tail data distributions, where models tend to overfit head classes while forgetting tail classes (tail performance degradation). The authors observe that different classes converge to divergent points in the loss landscape, especially when models settle into sharp, non-robust minima rather than shared, flat solutions beneficial for all classes.
Method: Proposes a continual learning inspired framework with two key modules: 1) Grouped Knowledge Preservation module to memorize group-specific convergence parameters and promote convergence toward shared solutions without inefficient per-class parameter preservation, and 2) Grouped Sharpness Aware module to explicitly address loss landscape geometry and seek flatter minima. The framework requires no external training samples or pre-trained models.
Result: Extensive experiments on four benchmarks demonstrate significant performance gains over state-of-the-art methods.
Conclusion: The proposed framework effectively addresses tail performance degradation in long-tail classification by promoting convergence to shared, flat minima through grouped knowledge preservation and sharpness-aware optimization, achieving superior performance without requiring additional data or pre-trained models.
Abstract: Balancing the performance trade-off on long-tail (LT) data distributions remains a long-standing challenge. In this paper, we posit that this dilemma stems from a phenomenon called “tail performance degradation” (the model tends to severely overfit on head classes while quickly forgetting tail classes) and pose a solution from a loss landscape perspective. We observe that different classes possess divergent convergence points in the loss landscape. Besides, this divergence is aggravated when the model settles into sharp and non-robust minima, rather than a shared and flat solution that is beneficial for all classes. In light of this, we propose a continual-learning-inspired framework to prevent “tail performance degradation”. To avoid inefficient per-class parameter preservation, a Grouped Knowledge Preservation module is proposed to memorize group-specific convergence parameters, promoting convergence towards a shared solution. Concurrently, our framework integrates a Grouped Sharpness Aware module to seek flatter minima by explicitly addressing the geometry of the loss landscape. Notably, our framework requires neither external training samples nor pre-trained models, facilitating broad applicability. Extensive experiments on four benchmarks demonstrate significant performance gains over state-of-the-art methods. The code is available at: https://gkp-gsa.github.io/.
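The flat-versus-sharp distinction the framework exploits can be made concrete with a toy 1-D loss. The grid-search sharpness proxy below is illustrative; the paper's Grouped Sharpness Aware module, like sharpness-aware optimizers generally, estimates the worst case with a gradient ascent step instead:

```python
import numpy as np

# Toy 1-D loss surface with a sharp minimum at w=0 and a flat one at w=4.
loss = lambda w: np.minimum(50.0 * w**2, (w - 4.0)**2)

def sharpness(w, rho=0.3, grid=101):
    """Worst-case loss increase within an L2 ball of radius rho around w,
    the quantity sharpness-aware optimization penalizes. Grid search is
    used here only because the example is 1-D."""
    deltas = np.linspace(-rho, rho, grid)
    return float(np.max(loss(w + deltas) - loss(w)))

sharp_min = sharpness(0.0)   # large: tiny perturbations blow the loss up
flat_min = sharpness(4.0)    # small: the basin tolerates perturbations
```

Both points have zero loss, so ordinary training cannot tell them apart; only the perturbed loss distinguishes them, which is why seeking flat, shared minima can protect tail classes from being forgotten.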
[291] A Large-Scale Remote Sensing Dataset and VLM-based Algorithm for Fine-Grained Road Hierarchy Classification
Ting Han, Xiangyi Xie, Yiping Chen, Yumeng Du, Jin Ma, Aiguang Li, Jiaan Liu, Yin Gao
Main category: cs.CV
TL;DR: SYSU-HiRoads dataset with hierarchical road annotations and RoadReasoner framework for multi-grade road mapping from remote sensing imagery using vision-language-geometry approach
Details
Motivation: Need for automated road mapping with hierarchy classification from remote sensing imagery to support transport infrastructure management and inventory updating.
Method: Vision-language-geometry framework combining frequency-sensitive cues, multi-scale context, and skeleton-segment hierarchy inference with geometric descriptors and geometry-aware textual prompts queried by vision-language models.
Result: RoadReasoner surpasses state-of-the-art baselines with 72.6% OA, 64.2% F1 score, and 60.6% SegAcc on SYSU-HiRoads and CHN6-CUG datasets
Conclusion: Framework enables accurate, semantically consistent road hierarchy mapping with public dataset release to support infrastructure applications
Abstract: In this work, we present SYSU-HiRoads, a large-scale hierarchical road dataset, and RoadReasoner, a vision-language-geometry framework for automatic multi-grade road mapping from remote sensing imagery. SYSU-HiRoads is built from GF-2 imagery covering 3631 km2 in Henan Province, China, and contains 1079 image tiles at 0.8 m spatial resolution. Each tile is annotated with dense road masks, vectorized centerlines, and three-level hierarchy labels, enabling the joint training and evaluation of segmentation, topology reconstruction, and hierarchy classification. Building on this dataset, RoadReasoner is designed to generate robust road surface masks, topology-preserving road networks, and semantically coherent hierarchy assignments. We strengthen road feature representation and network connectivity by explicitly enhancing frequency-sensitive cues and multi-scale context. Moreover, we perform hierarchy inference at the skeleton-segment level with geometric descriptors and geometry-aware textual prompts, queried by vision-language models to obtain linguistically interpretable grade decisions. Experiments on SYSU-HiRoads and the CHN6-CUG dataset show that RoadReasoner surpasses state-of-the-art road extraction baselines and produces accurate and semantically consistent road hierarchy maps with 72.6% OA, 64.2% F1 score, and 60.6% SegAcc. The dataset and code will be publicly released to support automated transport infrastructure mapping, road inventory updating, and broader infrastructure management applications.
[292] Plant Taxonomy Meets Plant Counting: A Fine-Grained, Taxonomic Dataset for Counting Hundreds of Plant Species
Jinyu Xu, Tianqi Hu, Xiaonan Hu, Letian Zhou, Songliang Cao, Meng Zhang, Hao Lu
Main category: cs.CV
TL;DR: TPC-268 is the first plant counting benchmark with taxonomy-aware annotations for fine-grained class-agnostic counting across multiple scales from remote sensing to microscopy.
Details
Motivation: Existing counting benchmarks focus on rigid objects like crowds and traffic, but fine-grained plant counting with taxonomy awareness remains underexplored despite plants having nonrigid morphologies and appearance variations across growth stages and environments.
Method: Created TPC-268 dataset with 10,000 images and 678,050 point annotations covering 268 countable plant categories across 242 species, with Linnaean taxonomy labels (kingdom to species) and organ categories. Provides taxonomy-consistent, scale-aware data splits and benchmarks state-of-the-art regression- and detection-based class-agnostic counting approaches.
Result: Dataset includes diverse plant and fungi categories across multiple observation scales (canopy-level remote sensing to tissue-level microscopy). Provides biologically grounded testbed for evaluating fine-grained counting methods with hierarchical reasoning capabilities.
Conclusion: TPC-268 fills the gap in taxonomy-aware plant counting benchmarks, enabling hierarchical reasoning and species-aware evaluation to advance fine-grained class-agnostic counting in computer vision.
Abstract: Visually cataloging and quantifying the natural world requires pushing the boundaries of both detailed visual classification and counting at scale. Despite significant progress, particularly in crowd and traffic analysis, the fine-grained, taxonomy-aware plant counting remains underexplored in vision. In contrast to crowds, plants exhibit nonrigid morphologies and physical appearance variations across growth stages and environments. To fill this gap, we present TPC-268, the first plant counting benchmark incorporating plant taxonomy. Our dataset couples instance-level point annotations with Linnaean labels (kingdom -> species) and organ categories, enabling hierarchical reasoning and species-aware evaluation. The dataset features 10,000 images with 678,050 point annotations, includes 268 countable plant categories over 242 plant species in Plantae and Fungi, and spans observation scales from canopy-level remote sensing imagery to tissue-level microscopy. We follow the problem setting of class-agnostic counting (CAC), provide taxonomy-consistent, scale-aware data splits, and benchmark state-of-the-art regression- and detection-based CAC approaches. By capturing the biodiversity, hierarchical structure, and multi-scale nature of botanical and mycological taxa, TPC-268 provides a biologically grounded testbed to advance fine-grained class-agnostic counting. Dataset and code are available at https://github.com/tiny-smart/TPC-268.
[293] QMoP: Query Guided Mixture-of-Projector for Efficient Visual Token Compression
Zhongyang Li, Yaqian Li, Faming Fang, Rinyoichi Takezoe, Zi-Hao Bo, Cheng Qian, Mo Guang, Guixu Zhang, Kaiwen Long
Main category: cs.CV
TL;DR: QMoP is a flexible framework that adaptively compresses visual tokens in multimodal LLMs using three collaborative branches coordinated by a query-guided router, with a new benchmark VTCBench for evaluation.
Details
Motivation: Multimodal LLMs face computational and memory bottlenecks due to excessive visual tokens. Existing methods use fixed heuristics for compression, limiting adaptability across diverse scenarios.
Method: Proposes Query Guided Mixture-of-Projector (QMoP) with three branches: pooling-based for global semantics, resampler for high-level semantics, and pruning-based for fine-grained detail. Uses Query Guided Router to dynamically select/weight outputs based on visual input and textual queries, with MoE-style fusion.
Result: QMoP outperforms strong baselines, delivers significant savings in memory, computation, and inference time. Also introduces VTCBench benchmark for evaluating visual token compression.
Conclusion: QMoP provides an effective adaptive framework for visual token compression in multimodal LLMs, addressing computational bottlenecks while maintaining performance.
Abstract: Multimodal large language models suffer from severe computational and memory bottlenecks, as the number of visual tokens far exceeds that of textual tokens. While recent methods employ projector modules to align and compress visual tokens into text-aligned features, they typically depend on fixed heuristics that limit adaptability across diverse scenarios. In this paper, we first propose Query Guided Mixture-of-Projector (QMoP), a novel and flexible framework that adaptively compresses visual tokens via three collaborative branches: (1) a pooling-based branch for coarse-grained global semantics, (2) a resampler branch for extracting high-level semantic representations, and (3) a pruning-based branch for fine-grained token selection to preserve critical visual detail. To adaptively coordinate these branches, we introduce the Query Guided Router (QGR), which dynamically selects and weights the outputs from different branches based on both visual input and textual queries. A Mixture-of-Experts-style fusion mechanism is designed to aggregate the outputs, harnessing the strengths of each strategy while suppressing noise. To systematically evaluate the effects of Visual Token Compression, we also develop VTCBench, a dedicated benchmark for evaluating the information loss induced by visual token compression. Extensive experiments demonstrate that despite relying on fundamental compression modules, QMoP outperforms strong baselines and delivers significant savings in memory, computation, and inference time.
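The MoE-style fusion at the heart of QMoP reduces to a softmax-weighted combination of branch outputs, with the weights coming from the query-guided router. The sketch below uses hypothetical names and assumes all three branches emit the same number of compressed tokens (the paper's branches need not); it illustrates the fusion idea, not the paper's implementation.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def fuse_branches(branch_tokens, router_logits):
    """Softmax-weight each branch's compressed tokens and sum them,
    MoE-style. In the paper the logits would come from the Query
    Guided Router, conditioned on the image and the text query."""
    w = softmax(np.asarray(router_logits, dtype=float))   # (B,)
    t = np.asarray(branch_tokens, dtype=float)            # (B, N, D)
    return np.einsum("b,bnd->nd", w, t)                   # (N, D)

# Three toy "branches", each already compressed to 4 tokens of dim 2.
branches = np.ones((3, 4, 2)) * np.array([1.0, 2.0, 3.0])[:, None, None]
fused = fuse_branches(branches, [0.0, 0.0, 0.0])  # equal logits -> mean of branches
```

With equal router logits the fusion is a plain average; a confident router (one large logit) effectively selects a single branch, which is how the same mechanism can also "suppress noise" from unhelpful branches.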
[294] DepthTCM: High Efficient Depth Compression via Physics-aware Transformer-CNN Mixed Architecture
Young-Seo Chang, Yatong An, Jae-Sang Hyun
Main category: cs.CV
TL;DR: DepthTCM: A physics-aware end-to-end framework for depth map compression using multiwavelength depth encoding and Transformer-CNN mixed neural networks.
Details
Motivation: To develop an efficient depth map compression method that preserves high accuracy while reducing bitrate, inspired by physical sinusoidal fringe pattern profilometry systems.
Method: Converts high-bit depth maps to 3-channel images using multiwavelength depth encoding, quantizes to 4 bits per channel, then compresses with a learned codec combining convolutional and Transformer layers.
Result: Achieves 0.307 bpp while preserving 99.38% accuracy on Middlebury 2014, with 41.48 ms encoder and 47.45 ms decoder inference times on ScanNet++ iPhone RGB-D subset.
Conclusion: DepthTCM provides efficient depth compression with high fidelity, demonstrating the effectiveness of physics-aware encoding and Transformer-CNN hybrid architectures.
Abstract: We propose DepthTCM, a physics-aware end-to-end framework for depth map compression. In DepthTCM, the high-bit depth map is first converted losslessly to a conventional 3-channel image representation using a method inspired by a physical sinusoidal-fringe-pattern-based profilometry system; the 3-channel color image is then encoded and decoded by a recently developed Transformer-CNN mixed neural network architecture. Specifically, DepthTCM maps depth to a smooth 3-channel representation using multiwavelength depth (MWD) encoding, globally quantizes the MWD-encoded representation to 4 bits per channel to reduce entropy, and finally compresses it using a learned codec that combines convolutional and Transformer layers. Experimental results demonstrate the advantage of our proposed method. On Middlebury 2014, DepthTCM reaches 0.307 bpp while preserving 99.38% accuracy, a level of fidelity commensurate with lossless PNG. We additionally demonstrate practical efficiency and scalability, reporting average end-to-end inference times of 41.48 ms (encoder) and 47.45 ms (decoder) on the ScanNet++ iPhone RGB-D subset. Ablations validate our design choices: relative to 8-bit quantization, 4-bit quantization reduces bitrate by 66% while maintaining comparable reconstruction quality, with only a marginal 0.68 dB PSNR change and a 0.04% accuracy difference. In addition, Transformer-CNN blocks further improve PSNR by up to 0.75 dB over CNN-only architectures.
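The multiwavelength fringe idea can be illustrated with a toy sketch: depth is passed through sinusoids of increasing wavelength to produce three smooth channels, each then quantized to 4 bits. The wavelengths below are arbitrary and the sketch is lossy; the paper's MWD encoding is constructed to be lossless with calibrated fringe parameters, so this shows only the flavor of the transform.

```python
import numpy as np

def mwd_encode(depth, wavelengths=(8.0, 32.0, 128.0), bits=4):
    """Toy multiwavelength fringe encoding: one sinusoidal channel per
    wavelength, rescaled to [0, 1] and snapped to 2**bits levels."""
    levels = 2 ** bits - 1
    chans = [(np.sin(2.0 * np.pi * depth / lam) + 1.0) / 2.0
             for lam in wavelengths]
    img = np.stack(chans, axis=-1)          # (H, W, 3), values in [0, 1]
    return np.round(img * levels) / levels  # 4-bit quantization per channel

depth = np.linspace(0.0, 100.0, 64).reshape(8, 8)  # synthetic depth ramp
img = mwd_encode(depth)
```

The longest wavelength disambiguates the phase of the shorter ones, which is the same trick multiwavelength profilometry uses to resolve depth without phase-unwrapping ambiguity.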
[295] Enhancing Brain Tumor Classification Using Vision Transformers with Colormap-Based Feature Representation on BRISC2025 Dataset
Faisal Ahmed
Main category: cs.CV
TL;DR: Vision Transformers enhanced with colormap-based feature representation achieve state-of-the-art performance (98.90% accuracy) for multi-class brain tumor classification from MRI scans.
Details
Motivation: Accurate brain tumor classification from MRI is critical for early diagnosis and treatment planning, but existing methods may not fully capture important structural and intensity variations in medical images.
Method: Proposes a deep learning framework combining Vision Transformers (ViT) with colormap-based feature representation to capture long-range dependencies while emphasizing important structural and intensity variations in MRI scans.
Result: Achieves 98.90% classification accuracy and 99.97% AUC on the BRISC2025 dataset (glioma, meningioma, pituitary tumor, non-tumor), outperforming ResNet50, ResNet101, and EfficientNetB2 baselines.
Conclusion: The combination of Vision Transformers with colormap-based feature enhancement provides accurate and robust brain tumor classification with strong potential for clinical decision support applications.
Abstract: Accurate classification of brain tumors from magnetic resonance imaging (MRI) plays a critical role in early diagnosis and effective treatment planning. In this study, we propose a deep learning framework based on Vision Transformers (ViT) enhanced with colormap-based feature representation to improve multi-class brain tumor classification performance. The proposed approach leverages the ability of transformer architectures to capture long-range dependencies while incorporating color mapping techniques to emphasize important structural and intensity variations within MRI scans. Experiments are conducted on the BRISC2025 dataset, which includes four classes: glioma, meningioma, pituitary tumor, and non-tumor cases. The model is trained and evaluated using standard performance metrics such as accuracy, precision, recall, F1-score, and area under the receiver operating characteristic curve (AUC). The proposed method achieves a classification accuracy of 98.90%, outperforming baseline convolutional neural network models including ResNet50, ResNet101, and EfficientNetB2. In addition, the model demonstrates strong generalization capability with an AUC of 99.97%, indicating high discriminative performance across all classes. These results highlight the effectiveness of combining Vision Transformers with colormap-based feature enhancement for accurate and robust brain tumor classification and suggest strong potential for clinical decision support applications.
[296] CornOrb: A Multimodal Dataset of Orbscan Corneal Topography and Clinical Annotations for Keratoconus Detection
Mohammed El Amine Lazouni, Leila Ryma Lazouni, Zineb Aziza Elaouaber, Mohammed Ammar, Sofiane Zehar, Mohammed Youcef Bouayad Agha, Ahmed Lazouni, Amel Feroui, Ali H. Al-Timemy, Siamak Yousefi, Mostafa El Habib Daho
Main category: cs.CV
TL;DR: CornOrb: A multimodal dataset of Orbscan corneal topography images with clinical annotations from Algerian patients for AI-driven keratoconus detection
Details
Motivation: To create a publicly accessible multimodal dataset for AI research on keratoconus detection, addressing the lack of large-scale Orbscan-based resources from Africa and enabling robust AI-driven analysis using multimodal data.
Method: Retrospective collection of 1,454 eyes from 744 patients (889 normal eyes, 565 keratoconus cases) with four corneal maps per eye (axial curvature, anterior elevation, posterior elevation, pachymetry) and structured tabular clinical data. Data were anonymized and pre-processed into standardized PNG and CSV formats.
Result: Created a comprehensive multimodal dataset with 1,454 eyes, including four types of corneal topography images and clinical parameters, openly available at Zenodo for AI research purposes.
Conclusion: CornOrb represents one of the first large-scale Orbscan-based multimodal datasets from Africa, specifically designed to enable AI-driven detection and analysis of keratoconus using both image and tabular data.
Abstract: In this paper, we present CornOrb, a publicly accessible multimodal dataset of Orbscan corneal topography images and clinical annotations collected from patients in Algeria. The dataset comprises 1,454 eyes from 744 patients, including 889 normal eyes and 565 keratoconus cases. For each eye, four corneal maps are provided (axial curvature, anterior elevation, posterior elevation, and pachymetry), together with structured tabular data including demographic information and key clinical parameters such as astigmatism, maximum keratometry (Kmax), central and thinnest pachymetry, and anterior/posterior asphericity. All data were retrospectively acquired, fully anonymized, and pre-processed into standardized PNG and CSV formats to ensure direct usability for artificial intelligence research. This dataset represents one of the first large-scale Orbscan-based resources from Africa, specifically built to enable robust AI-driven detection and analysis of keratoconus using multimodal data. The data are openly available at Zenodo.
[297] Focus on Background: Exploring SAM’s Potential in Few-shot Medical Image Segmentation with Background-centric Prompting
Yuntian Bo, Yazhou Zhu, Piotr Koniusz, Haofeng Zhang
Main category: cs.CV
TL;DR: FoB reformulates SAM-based few-shot medical image segmentation as a prompt localization task, using background-centric prompts to constrain SAM’s over-segmentation in medical images.
Details
Motivation: Direct application of Segment Anything Model (SAM) to medical images leads to over-segmentation due to ambiguous anatomical boundaries, and conventional few-shot medical image segmentation approaches face performance bottlenecks limiting clinical applicability.
Method: Proposes FoB (Focus on Background), a background-centric prompt generator that treats segmentation as prompt localization. It generates category-agnostic background prompts from support images and localizes them in query images, modeling contextual information for spatial dependencies and using structural patterns of background prompts as constraints for progressive refinement.
Result: FoB outperforms other baselines by large margins on three diverse medical image datasets, achieving state-of-the-art performance on few-shot medical image segmentation and exhibiting strong cross-domain generalization.
Conclusion: Reformulating SAM-based few-shot segmentation as prompt localization with background-centric constraints effectively addresses over-segmentation in medical images, enabling better clinical applicability through improved performance and generalization.
Abstract: Conventional few-shot medical image segmentation (FSMIS) approaches face performance bottlenecks that hinder broader clinical applicability. Although the Segment Anything Model (SAM) exhibits strong category-agnostic segmentation capabilities, its direct application to medical images often leads to over-segmentation due to ambiguous anatomical boundaries. In this paper, we reformulate SAM-based FSMIS as a prompt localization task and propose FoB (Focus on Background), a background-centric prompt generator that provides accurate background prompts to constrain SAM’s over-segmentation. Specifically, FoB bridges the gap between segmentation and prompt localization by category-agnostic generation of support background prompts and localizing them directly in the query image. To address the challenge of prompt localization for novel categories, FoB models rich contextual information to capture foreground-background spatial dependencies. Moreover, inspired by the inherent structural patterns of background prompts in medical images, FoB models this structure as a constraint to progressively refine background prompt predictions. Experiments on three diverse medical image datasets demonstrate that FoB outperforms other baselines by large margins, achieving state-of-the-art performance on FSMIS, and exhibiting strong cross-domain generalization. Our code is available at https://github.com/primebo1/FoB_SAM.
[298] When Models Judge Themselves: Unsupervised Self-Evolution for Multimodal Reasoning
Zhengxian Wu, Kai Shi, Chuanrui Zhang, Zirui Liao, Jun Yang, Ni Yang, Qiuying Peng, Luyuan Zhang, Hangrui Xu, Tianhuang Su, Zhenyu Yang, Haonan Lu, Haoqian Wang
Main category: cs.CV
TL;DR: Unsupervised self-evolution framework for multimodal reasoning using group structure modeling and self-consistency signals without human annotations or external rewards.
Details
Motivation: Current multimodal LLM improvements rely on costly high-quality annotated data or teacher-model distillation, which are difficult to scale, motivating scalable unsupervised training methods.
Method: Samples multiple reasoning trajectories per input and models their within-group structure, uses the Actor's self-consistency as a training prior, introduces a bounded Judge-based modulation to reweight trajectories, converts absolute scores to relative advantages within each group, and trains with Group Relative Policy Optimization (GRPO) on unlabeled data.
Result: Consistently improves reasoning performance and generalization on five mathematical reasoning benchmarks using only unlabeled data.
Conclusion: Offers scalable path toward self-evolving multimodal models through unsupervised self-evolution training framework without human annotations or external reward models.
Abstract: Recent progress in multimodal large language models has led to strong performance on reasoning tasks, but these improvements largely rely on high-quality annotated data or teacher-model distillation, both of which are costly and difficult to scale. To address this, we propose an unsupervised self-evolution training framework for multimodal reasoning that achieves stable performance improvements without using human-annotated answers or external reward models. For each input, we sample multiple reasoning trajectories and jointly model their within-group structure. We use the Actor's self-consistency signal as a training prior, and introduce a bounded Judge-based modulation to continuously reweight trajectories of different quality. We further model the modulated scores as a group-level distribution and convert absolute scores into relative advantages within each group, enabling more robust policy updates. Trained with Group Relative Policy Optimization (GRPO) on unlabeled data, our method consistently improves reasoning performance and generalization on five mathematical reasoning benchmarks, offering a scalable path toward self-evolving multimodal models. The code is available at https://dingwu1021.github.io/SelfJudge/.
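The group-relative advantage step used by GRPO-style training is simple to sketch: within each sampled group of trajectories, absolute scores are centered and scaled by the group's own statistics, so no external value model or reward baseline is needed. A minimal sketch, using raw rewards as a stand-in for the paper's Judge-modulated scores:

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """Turn absolute per-trajectory scores within one group into relative
    advantages by normalizing with the group mean and standard deviation
    (the core of GRPO's group-level baseline)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Four sampled trajectories for the same input; higher score = better.
adv = group_relative_advantages([0.9, 0.4, 0.4, 0.1])
```

Advantages sum to roughly zero by construction, so a group of uniformly mediocre trajectories produces no net push; only the relative ranking within the group drives the policy update.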
[299] BHDD: A Burmese Handwritten Digit Dataset
Swan Htet Aung, Hein Htet, Htoo Say Wah Khaing, Thuya Myo Nyunt
Main category: cs.CV
TL;DR: A new Burmese handwritten digit dataset (BHDD) with 87,561 MNIST-format grayscale images collected from 150+ contributors, with baseline models achieving up to 99.83% accuracy.
Details
Motivation: To create a publicly available dataset for Burmese handwritten digit recognition, addressing the lack of such resources for the Myanmar script, which has unique challenges due to its round shapes and potential confusion between certain digit pairs.
Method: Collected 87,561 handwritten Burmese digit samples from over 150 contributors of different ages and backgrounds, organized into MNIST-compatible 28x28 grayscale images with 60,000 training and 27,561 test samples. Analyzed class distribution, pixel statistics, and morphological variation. Evaluated with simple baselines: MLP, two-layer CNN, and improved CNN with batch normalization and data augmentation.
Result: Baseline models achieved 99.40% (MLP), 99.75% (two-layer CNN), and 99.83% (improved CNN with batch normalization and augmentation) test accuracy. Identified digit pairs that are easily confused due to the round shapes of Myanmar script.
Conclusion: BHDD provides a valuable resource for Burmese handwritten digit recognition research, with baseline results demonstrating the dataset’s quality and the challenges posed by Myanmar script’s characteristics.
Abstract: We introduce the Burmese Handwritten Digit Dataset (BHDD), a collection of 87,561 grayscale images of handwritten Burmese digits in ten classes. Each image is 28x28 pixels, following the MNIST format. The training set has 60,000 samples split evenly across classes; the test set has 27,561 samples with class frequencies as they arose during collection. Over 150 people of different ages and backgrounds contributed samples. We analyze the dataset’s class distribution, pixel statistics, and morphological variation, and identify digit pairs that are easily confused due to the round shapes of the Myanmar script. Simple baselines (an MLP, a two-layer CNN, and an improved CNN with batch normalization and augmentation) reach 99.40%, 99.75%, and 99.83% test accuracy respectively. BHDD is available under CC BY-SA 4.0 at https://github.com/baseresearch/BHDD
[300] Text-Image Conditioned 3D Generation
Jiazhong Cen, Jiemin Fang, Sikuang Li, Guanjun Wu, Chen Yang, Taoran Yi, Zanwei Zhou, Zhikuan Bao, Lingxi Xie, Wei Shen, Qi Tian
Main category: cs.CV
TL;DR: TIGON introduces a dual-branch 3D generation model that combines text and image conditioning to overcome limitations of single-modality approaches, achieving better results through cross-modal fusion.
Details
Motivation: Existing 3D generators use either image conditioning (high visual fidelity but viewpoint bias) or text conditioning (broad semantic guidance but lacks visual detail). The paper explores combining both modalities for more flexible and faithful 3D generation.
Method: TIGON uses a minimalist dual-branch architecture with separate image- and text-conditioned backbones, plus lightweight cross-modal fusion for joint reasoning over visual exemplars and textual specifications.
Result: Text-image conditioning consistently outperforms single-modality methods, demonstrating strong cross-modal complementarity and improved 3D generation quality.
Conclusion: Combining vision and language guidance is a promising direction for future 3D generation research, addressing limitations of current single-modality approaches.
Abstract: High-quality 3D assets are essential for VR/AR, industrial design, and entertainment, motivating growing interest in generative models that create 3D content from user prompts. Most existing 3D generators, however, rely on a single conditioning modality: image-conditioned models achieve high visual fidelity by exploiting pixel-aligned cues but suffer from viewpoint bias when the input view is limited or ambiguous, while text-conditioned models provide broad semantic guidance yet lack low-level visual detail. This limits how users can express intent and raises a natural question: can these two modalities be combined for more flexible and faithful 3D generation? Our diagnostic study shows that even simple late fusion of text- and image-conditioned predictions outperforms single-modality models, revealing strong cross-modal complementarity. We therefore formalize Text-Image Conditioned 3D Generation, which requires joint reasoning over a visual exemplar and a textual specification. To address this task, we introduce TIGON, a minimalist dual-branch baseline with separate image- and text-conditioned backbones and lightweight cross-modal fusion. Extensive experiments show that text-image conditioning consistently improves over single-modality methods, highlighting complementary vision-language guidance as a promising direction for future 3D generation research. Project page: https://jumpat.github.io/tigon-page
[301] Identity-Consistent Video Generation under Large Facial-Angle Variations
Bin Hu, Zipeng Qi, Guoxi Huang, Zunnan Xu, Ruicheng Zhang, Chongjie Ye, Jun Zhou, Xiu Li, Jingdong Wang
Main category: cs.CV
TL;DR: Mv²ID: A multi-view conditioned framework for reference-to-video generation that improves identity consistency under large facial-angle variations while maintaining motion naturalness through region-masking training and reference decoupled-RoPE.
Details
Motivation: Single-view reference-to-video methods struggle with identity consistency under large facial-angle variations. Multi-view references help but exacerbate the "view-dependent copy-paste" artifact that reduces facial motion naturalness. Cross-paired data can help but is costly to collect.
Method: Proposes Mv²ID framework with: 1) Region-masking training strategy to prevent shortcut learning and extract essential identity features by aggregating complementary identity cues across views; 2) Reference decoupled-RoPE mechanism assigning distinct positional encoding to video and conditioning tokens for better modeling of heterogeneous properties; 3) Construction of a large-scale dataset with diverse facial-angle variations.
Result: Extensive experiments demonstrate significant improvement in identity consistency while maintaining motion naturalness, outperforming existing approaches trained with cross-paired data. Dedicated evaluation metrics for identity consistency and motion naturalness were also proposed.
Conclusion: Mv²ID effectively balances identity consistency and motion naturalness in multi-view reference-to-video generation, addressing the view-dependent copy-paste problem without requiring costly cross-paired data.
Abstract: Single-view reference-to-video methods often struggle to preserve identity consistency under large facial-angle variations. This limitation naturally motivates the incorporation of multi-view facial references. However, simply introducing additional reference images exacerbates the copy-paste problem, particularly the view-dependent copy-paste artifact, which reduces facial motion naturalness. Although cross-paired data can alleviate this issue, collecting such data is costly. To balance consistency and naturalness, we propose Mv²ID, a multi-view conditioned framework under in-paired supervision. We introduce a region-masking training strategy to prevent shortcut learning and extract essential identity features by encouraging the model to aggregate complementary identity cues across views. In addition, we design a reference decoupled-RoPE mechanism that assigns distinct positional encoding to video and conditioning tokens for better modeling of their heterogeneous properties. Furthermore, we construct a large-scale dataset with diverse facial-angle variations and propose dedicated evaluation metrics for identity consistency and motion naturalness. Extensive experiments demonstrate that our method significantly improves identity consistency while maintaining motion naturalness, outperforming existing approaches trained with cross-paired data.
[302] F4Splat: Feed-Forward Predictive Densification for Feed-Forward 3D Gaussian Splatting
Injae Kim, Chaehyeon Kim, Minseong Bae, Minseok Joo, Hyunwoo J. Kim
Main category: cs.CV
TL;DR: F4Splat introduces a feed-forward predictive densification method for 3D Gaussian Splatting that adaptively allocates Gaussians based on spatial complexity and multi-view overlap, enabling explicit control over Gaussian budget while maintaining reconstruction quality.
Details
Motivation: Current feed-forward 3D Gaussian Splatting methods use rigid allocation pipelines that uniformly distribute Gaussians, leading to redundancy across views and lacking mechanisms to control the total number of Gaussians while preserving reconstruction fidelity.
Method: F4Splat uses a densification-score-guided allocation strategy that predicts per-region densification scores to estimate required Gaussian density based on spatial complexity and multi-view overlap, allowing explicit control over the final Gaussian budget without retraining.
Result: The model achieves superior novel-view synthesis performance compared to prior uncalibrated feed-forward methods while using significantly fewer Gaussians, reducing redundancy in simple regions and minimizing duplicate Gaussians across overlapping views.
Conclusion: F4Splat provides an effective solution for compact yet high-quality 3D representations through spatially adaptive Gaussian allocation, addressing limitations of rigid allocation pipelines in feed-forward 3D Gaussian Splatting.
Abstract: Feed-forward 3D Gaussian Splatting methods enable single-pass reconstruction and real-time rendering. However, they typically adopt rigid pixel-to-Gaussian or voxel-to-Gaussian pipelines that uniformly allocate Gaussians, leading to redundant Gaussians across views. Moreover, they lack an effective mechanism to control the total number of Gaussians while maintaining reconstruction fidelity. To address these limitations, we present F4Splat, which performs Feed-Forward predictive densification for Feed-Forward 3D Gaussian Splatting, introducing a densification-score-guided allocation strategy that adaptively distributes Gaussians according to spatial complexity and multi-view overlap. Our model predicts per-region densification scores to estimate the required Gaussian density and allows explicit control over the final Gaussian budget without retraining. This spatially adaptive allocation reduces redundancy in simple regions and minimizes duplicate Gaussians across overlapping views, producing compact yet high-quality 3D representations. Extensive experiments demonstrate that our model achieves superior novel-view synthesis performance compared to prior uncalibrated feed-forward methods, while using significantly fewer Gaussians.
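Allocating a fixed Gaussian budget in proportion to predicted densification scores can be approximated by a largest-remainder scheme. In the toy sketch below the per-region scores are given directly (in the paper they are predicted by the network); the function name and rounding rule are illustrative only.

```python
import numpy as np

def allocate_budget(scores, budget):
    """Distribute an integer Gaussian budget across regions in proportion
    to their densification scores, resolving rounding with a
    largest-remainder rule so the budget is met exactly."""
    s = np.asarray(scores, dtype=float)
    raw = s / s.sum() * budget              # ideal fractional allocation
    alloc = np.floor(raw).astype(int)
    rem = budget - int(alloc.sum())
    # hand leftover Gaussians to regions with the largest fractional parts
    order = np.argsort(raw - alloc)[::-1]
    alloc[order[:rem]] += 1
    return alloc

alloc = allocate_budget([0.2, 0.3, 0.5], budget=7)
```

Because the budget enters only through this renormalization, the same predicted scores can be reused for any target Gaussian count, which matches the paper's claim of budget control without retraining.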
[303] Privacy-Preserving Federated Action Recognition via Differentially Private Selective Tuning and Efficient Communication
Idris Zakariyya, Pai Chet Ng, Kaushik Bhargav Sivangi, S. Mohammad Sheikholeslami, Konstantinos N. Plataniotis, Fani Deligianni
Main category: cs.CV
TL;DR: FedDP-STECAR: A federated learning framework for video action recognition that uses selective layer fine-tuning with differential privacy to reduce privacy leakage and communication overhead by 99%.
Details
Motivation: Federated video action recognition faces two key challenges: model exposure (gradients can leak private motion patterns) and communication overhead (full-model synchronization of high-dimensional video networks causes significant bandwidth costs).
Method: Proposes FedDP-STECAR framework that selectively fine-tunes and perturbs only a small subset of task-relevant layers under Differential Privacy (DP), reducing information leakage while preserving temporal coherence in video features. Only transmits tuned layers during aggregation.
Result: Achieves up to 70.2% higher accuracy under strict privacy (ε=0.65) in centralized settings and 48% faster training with 73.1% accuracy in federated setups. Communication traffic reduced by over 99% compared to full-model updates.
Conclusion: FedDP-STECAR enables scalable and privacy-preserving video action recognition by addressing both privacy leakage and communication efficiency challenges in federated learning for video understanding.
Abstract: Federated video action recognition enables collaborative model training without sharing raw video data, yet remains vulnerable to two key challenges: model exposure and communication overhead. Gradients exchanged between clients and the server can leak private motion patterns, while full-model synchronization of high-dimensional video networks causes significant bandwidth and communication costs. To address these issues, we propose Federated Differential Privacy with Selective Tuning and Efficient Communication for Action Recognition, namely FedDP-STECAR. Our FedDP-STECAR framework selectively fine-tunes and perturbs only a small subset of task-relevant layers under Differential Privacy (DP), reducing the surface of information leakage while preserving temporal coherence in video features. By transmitting only the tuned layers during aggregation, communication traffic is reduced by over 99% compared to full-model updates. Experiments on the UCF-101 dataset using the MViT-B-16x4 transformer show that FedDP-STECAR achieves up to 70.2% higher accuracy under strict privacy (ε = 0.65) in centralized settings and 48% faster training with 73.1% accuracy in federated setups, enabling scalable and privacy-preserving video action recognition. Code available at https://github.com/izakariyya/mvit-federated-videodp
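The selective-tuning-plus-DP idea can be sketched as DP-SGD-style clipping and Gaussian noising applied only to the chosen task-relevant layers, with every other layer dropped from the transmitted payload (hence the communication savings). A minimal sketch with illustrative clip and sigma values; calibrating the noise to a target (ε, δ) requires a privacy accountant, which is omitted here, and the function and layer names are hypothetical.

```python
import numpy as np

def dp_selective_update(updates, tuned_layers, clip=1.0, sigma=0.5, seed=0):
    """Clip each selected layer's update to L2 norm <= clip, add Gaussian
    noise scaled to the clip, and transmit only those layers."""
    rng = np.random.default_rng(seed)
    payload = {}
    for name in tuned_layers:
        u = np.asarray(updates[name], dtype=float)
        u = u / max(1.0, np.linalg.norm(u) / clip)   # norm clipping
        payload[name] = u + rng.normal(0.0, sigma * clip, size=u.shape)
    return payload

# Only the small "head" update is perturbed and sent; the large
# "backbone" update never leaves the client.
updates = {"head": np.ones(4) * 3.0, "backbone": np.ones(1000)}
payload = dp_selective_update(updates, tuned_layers=["head"])
```

Restricting both the noise and the transmission to a few layers is what lets the method shrink the leakage surface and the bandwidth cost at the same time.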
[304] Test-Time Adaptation via Cache Personalization for Facial Expression Recognition in Videos
Masoumeh Sharafi, Muhammad Osama Zeeshan, Soufiane Belharbi, Alessandro Lameiras Koerich, Marco Pedersoli, Eric Granger
Main category: cs.CV
TL;DR: TTA-CaP: A cache-based test-time adaptation method for personalizing vision-language models in video facial expression recognition, using three coordinated caches and tri-gate mechanism for efficient gradient-free adaptation.
Details
Motivation: FER in videos suffers from inter-subject variations; while VLMs offer good transfer, they degrade under distribution shifts. Existing TTA methods require computationally expensive parameter optimization, making them impractical for real-world deployment.
Method: Proposes TTA-CaP with three coordinated caches: personalized source cache (source prototypes), positive target cache (reliable subject samples), and negative target cache (low-confidence cases). Uses tri-gate mechanism (temporal stability, confidence, consistency) for cache updates and fusion of embeddings for refined predictions.
Result: Outperforms state-of-the-art TTA methods on three challenging video FER datasets (BioVid, StressID, BAH) under subject-specific and environmental shifts while maintaining low computational and memory overhead.
Conclusion: TTA-CaP enables cost-effective personalization of VLMs for video FER through cache-based adaptation, addressing computational limitations of traditional TTA methods while improving performance under distribution shifts.
Abstract: Facial expression recognition (FER) in videos requires model personalization to capture the considerable variations across subjects. Vision-language models (VLMs) offer strong transfer to downstream tasks through image-text alignment, but their performance can still degrade under inter-subject distribution shifts. Personalizing models using test-time adaptation (TTA) methods can mitigate this challenge. However, most state-of-the-art TTA methods rely on unsupervised parameter optimization, introducing computational overhead that is impractical in many real-world applications. This paper introduces TTA through Cache Personalization (TTA-CaP), a cache-based TTA method that enables cost-effective (gradient-free) personalization of VLMs for video FER. Prior cache-based TTA methods rely solely on dynamic memories that store test samples, which can accumulate errors and drift due to noisy pseudo-labels. TTA-CaP leverages three coordinated caches: a personalized source cache that stores source-domain prototypes, a positive target cache that accumulates reliable subject-specific samples, and a negative target cache that stores low-confidence cases as negative samples to reduce the impact of noisy pseudo-labels. Cache updates and replacement are controlled by a tri-gate mechanism based on temporal stability, confidence, and consistency with the personalized cache. Finally, TTA-CaP refines predictions through fusion of embeddings, yielding refined representations that support temporally stable video-level predictions. Our experiments on three challenging video FER datasets, BioVid, StressID, and BAH, indicate that TTA-CaP can outperform state-of-the-art TTA methods under subject-specific and environmental shifts, while maintaining low computational and memory overhead for real-world deployment.
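A toy version of the tri-gate routing might look like this (all thresholds are invented placeholders, not values from the paper):

```python
def tri_gate(confidence, temporal_stability, source_consistency,
             conf_hi=0.8, stab_hi=0.9, cons_hi=0.7, conf_lo=0.5):
    """Route a test sample to one of the caches. Thresholds are illustrative."""
    if (confidence >= conf_hi and temporal_stability >= stab_hi
            and source_consistency >= cons_hi):
        return "positive"   # reliable subject-specific sample
    if confidence < conf_lo:
        return "negative"   # stored as a negative to counter noisy pseudo-labels
    return "skip"           # ambiguous: caches unchanged
```

Storing low-confidence cases as explicit negatives is what distinguishes this from prior single-cache TTA methods that accumulate drift.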
[305] KHMP: Frequency-Domain Kalman Refinement for High-Fidelity Human Motion Prediction
Wenhan Wu, Zhishuai Guo, Chen Chen, Srijan Das, Hongfei Xue, Pu Wang, Aidong Lu
Main category: cs.CV
TL;DR: KHMP introduces an adaptive Kalman filter in DCT domain with SNR-based parameter adjustment and physics constraints to generate smooth, physically plausible human motion predictions while reducing jitter artifacts.
Details
Motivation: Existing stochastic human motion prediction methods often produce predictions corrupted by high-frequency jitter and temporal discontinuities, which degrade motion quality and physical plausibility.
Method: KHMP applies an adaptive Kalman filter in the DCT domain, treating high-frequency DCT coefficients as noisy signals. The filter’s noise parameters are dynamically adjusted based on estimated SNR, enabling aggressive denoising for jittery predictions and conservative filtering for clean motions. This is complemented by training-time physical constraints (temporal smoothness and joint angle limits) that encode biomechanical principles.
Result: Experiments on Human3.6M and HumanEva-I datasets show KHMP achieves state-of-the-art accuracy and effectively mitigates jitter artifacts to produce smooth and physically plausible motions.
Conclusion: KHMP establishes a new paradigm integrating adaptive signal processing with physics-informed learning for high-fidelity human motion prediction, addressing key challenges of jitter and temporal discontinuities.
Abstract: Stochastic human motion prediction aims to generate diverse, plausible futures from observed sequences. Despite advances in generative modeling, existing methods often produce predictions corrupted by high-frequency jitter and temporal discontinuities. To address these challenges, we introduce KHMP, a novel framework featuring an adaptive Kalman filter applied in the DCT domain to generate high-fidelity human motion predictions. By treating high-frequency DCT coefficients as a frequency-indexed noisy signal, the Kalman filter recursively suppresses noise while preserving motion details. Notably, its noise parameters are dynamically adjusted based on estimated Signal-to-Noise Ratio (SNR), enabling aggressive denoising for jittery predictions and conservative filtering for clean motions. This refinement is complemented by training-time physical constraints (temporal smoothness and joint angle limits) that encode biomechanical principles into the generative model. Together, these innovations establish a new paradigm integrating adaptive signal processing with physics-informed learning. Experiments on the Human3.6M and HumanEva-I datasets demonstrate that KHMP achieves state-of-the-art accuracy, effectively mitigating jitter artifacts to produce smooth and physically plausible motions.
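The SNR-adaptive Kalman refinement can be illustrated with a 1-D scalar filter (a sketch under assumed noise parameters; the paper operates on DCT coefficients of motion sequences):

```python
import numpy as np

def adaptive_kalman(coeffs, process_var=1e-2):
    """Scalar Kalman recursion over a coefficient sequence. The measurement-
    noise parameter is set from a crude SNR estimate, so noisy sequences are
    smoothed aggressively and clean ones pass through nearly untouched."""
    coeffs = np.asarray(coeffs, float)
    signal_power = float(np.mean(coeffs ** 2)) + 1e-12
    noise_power = float(np.var(np.diff(coeffs))) / 2 + 1e-12  # diff-based noise proxy
    snr = signal_power / noise_power
    meas_var = signal_power / (1.0 + snr)  # low SNR -> large R -> heavy smoothing

    x, p = coeffs[0], 1.0
    out = [x]
    for z in coeffs[1:]:
        p = p + process_var        # predict
        k = p / (p + meas_var)     # Kalman gain
        x = x + k * (z - x)        # correct toward the measurement
        p = (1.0 - k) * p
        out.append(x)
    return np.array(out)

rng = np.random.default_rng(0)
clean = np.sin(np.linspace(0, 4, 64))
jittery = clean + rng.normal(0.0, 0.3, 64)
smoothed = adaptive_kalman(jittery)
```

On a jittery input the estimated SNR is low, so the filter smooths heavily; on a clean input the same recursion leaves the sequence nearly unchanged.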
[306] EmoTaG: Emotion-Aware Talking Head Synthesis on Gaussian Splatting with Few-Shot Personalization
Haolan Xu, Keli Cheng, Lei Wang, Ning Bi, Xiaoming Liu
Main category: cs.CV
TL;DR: EmoTaG: A few-shot emotion-aware 3D talking head synthesis framework using structured FLAME parameter space and gated residual motion network for stable, expressive audio-driven facial animation.
Details
Motivation: Existing few-shot 3D talking head methods suffer from geometric instability and audio-emotion mismatch during expressive facial motion, highlighting the need for better emotion-aware motion modeling.
Method: Uses Pretrain-and-Adapt paradigm with structured FLAME parameter space instead of direct 3D Gaussian deformation. Introduces Gated Residual Motion Network (GRMN) to capture emotional prosody from audio while supplementing head pose and upper-face cues.
Result: Achieves state-of-the-art performance in emotional expressiveness, lip synchronization, visual realism, and motion stability through extensive experiments.
Conclusion: EmoTaG provides an effective framework for few-shot emotion-aware 3D talking head synthesis with improved geometric stability and emotional coherence.
Abstract: Audio-driven 3D talking head synthesis has advanced rapidly with Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS). By leveraging rich pre-trained priors, few-shot methods enable instant personalization from just a few seconds of video. However, under expressive facial motion, existing few-shot approaches often suffer from geometric instability and audio-emotion mismatch, highlighting the need for more effective emotion-aware motion modeling. In this work, we present EmoTaG, a few-shot emotion-aware 3D talking head synthesis framework built on the Pretrain-and-Adapt paradigm. Our key insight is to reformulate motion prediction in a structured FLAME parameter space rather than directly deforming 3D Gaussians, thereby introducing explicit geometric priors that improve motion stability. Building upon this, we propose a Gated Residual Motion Network (GRMN), which captures emotional prosody from audio while supplementing head pose and upper-face cues absent from audio, enabling expressive and coherent motion generation. Extensive experiments demonstrate that EmoTaG achieves state-of-the-art performance in emotional expressiveness, lip synchronization, visual realism, and motion stability.
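The gated residual idea can be sketched as a simple elementwise gate over motion parameters (hypothetical gate logits and dimensions; not the GRMN architecture):

```python
import numpy as np

def gated_residual(base_motion, audio_residual, gate_logits):
    """Gated residual update: a gate (here a fixed logit vector for
    illustration) decides how much audio-driven emotional residual to add
    on top of the base FLAME motion parameters."""
    gate = 1.0 / (1.0 + np.exp(-gate_logits))  # sigmoid in [0, 1]
    return base_motion + gate * audio_residual

base = np.zeros(4)                              # neutral FLAME expression params
residual = np.array([0.5, -0.2, 0.1, 0.3])      # audio-predicted emotional offset
out = gated_residual(base, residual, gate_logits=np.array([10.0, -10.0, 0.0, 10.0]))
# gate ≈ [1, 0, 0.5, 1]: first and last residual dims pass, the second is suppressed
```

Predicting residuals in the FLAME parameter space, rather than deforming Gaussians directly, is what provides the explicit geometric prior described above.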
[307] ThinkJEPA: Empowering Latent World Models with Large Vision-Language Reasoning Model
Haichao Zhang, Yijiang Li, Shwai He, Tushar Nagarajan, Mingfei Chen, Jianglin Lu, Ang Li, Yun Fu
Main category: cs.CV
TL;DR: A VLM-guided JEPA framework combines dense-frame dynamics modeling with long-horizon semantic guidance through dual-temporal pathways for improved video prediction.
Details
Motivation: Current latent world models (like V-JEPA2) focus on dense prediction from short observation windows, limiting temporal context and biasing toward local extrapolation, while VLMs provide semantic grounding but have sparse sampling and language-output bottlenecks. The goal is to combine both approaches for better long-horizon video prediction.
Method: Proposes a dual-temporal pathway framework: 1) dense JEPA branch for fine-grained motion and interaction cues, and 2) uniformly sampled VLM “thinker” branch with larger temporal stride for knowledge-rich guidance. Uses hierarchical pyramid representation extraction to aggregate multi-layer VLM representations into guidance features compatible with latent prediction.
Result: Outperforms both VLM-only and JEPA-predictor baselines on hand-manipulation trajectory prediction, yielding more robust long-horizon rollout behavior.
Conclusion: The VLM-guided JEPA framework successfully combines the strengths of dense dynamics modeling and semantic guidance, demonstrating improved performance in video prediction tasks requiring both fine-grained motion understanding and long-horizon reasoning.
Abstract: Recent progress in latent world models (e.g., V-JEPA2) has shown promising capability in forecasting future world states from video observations. Nevertheless, dense prediction from a short observation window limits temporal context and can bias predictors toward local, low-level extrapolation, making it difficult to capture long-horizon semantics and reducing downstream utility. Vision–language models (VLMs), in contrast, provide strong semantic grounding and general knowledge by reasoning over uniformly sampled frames, but they are not ideal as standalone dense predictors due to compute-driven sparse sampling, a language-output bottleneck that compresses fine-grained interaction states into text-oriented representations, and a data-regime mismatch when adapting to small action-conditioned datasets. We propose a VLM-guided JEPA-style latent world modeling framework that combines dense-frame dynamics modeling with long-horizon semantic guidance via a dual-temporal pathway: a dense JEPA branch for fine-grained motion and interaction cues, and a uniformly sampled VLM \emph{thinker} branch with a larger temporal stride for knowledge-rich guidance. To transfer the VLM’s progressive reasoning signals effectively, we introduce a hierarchical pyramid representation extraction module that aggregates multi-layer VLM representations into guidance features compatible with latent prediction. Experiments on hand-manipulation trajectory prediction show that our method outperforms both a strong VLM-only baseline and a JEPA-predictor baseline, and yields more robust long-horizon rollout behavior.
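The aggregation of multi-layer VLM features into guidance features can be sketched as a softmax-weighted pooling over layers (an assumed simplification of the pyramid module):

```python
import numpy as np

def pyramid_guidance(layer_feats, weights=None):
    """Aggregate multi-layer VLM features (layers x tokens x dim) into one
    guidance vector per token via a softmax-weighted sum over layers."""
    n_layers = layer_feats.shape[0]
    w = np.ones(n_layers) if weights is None else np.asarray(weights, float)
    w = np.exp(w - w.max())
    w /= w.sum()                                       # softmax over layers
    return np.tensordot(w, layer_feats, axes=(0, 0))   # -> (tokens, dim)

feats = np.random.default_rng(1).normal(size=(4, 16, 32))  # 4 layers, 16 tokens
guide = pyramid_guidance(feats)
```

With uniform weights this reduces to a plain mean over layers; learned weights would let the predictor emphasize the VLM depths carrying the most useful reasoning signal.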
[308] Efficient Coarse-to-Fine Diffusion Models with Time Step Sequence Redistribution
Yu-Shan Tai, An-Yeu Wu
Main category: cs.CV
TL;DR: Proposes Coarse-to-Fine Diffusion Models with Time Step Sequence Redistribution to reduce computational overhead in diffusion models while maintaining near-lossless performance.
Details
Motivation: Diffusion models have high computational costs due to multi-step denoising, limiting deployment on resource-constrained edge devices. Existing methods overlook input redundancy and require lengthy search times.
Method: 1) Coarse-to-Fine Denoising (C2F) reduces computation during early-stage coarse feature generation where images are indistinguishable. 2) Time Step Sequence Redistribution (TRD) efficiently adjusts sampling trajectories with less than 10 minutes search time.
Result: Achieves near-lossless performance with 80% to 90% reduction in computation on CIFAR10 and LSUN-Church datasets.
Conclusion: The proposed methods effectively reduce computational overhead in diffusion models while maintaining generation quality, making them suitable for edge device deployment.
Abstract: Recently, diffusion models (DMs) have made significant strides in high-quality image generation. However, the multi-step denoising process often results in considerable computational overhead, impeding deployment on resource-constrained edge devices. Existing methods mitigate this issue by compressing models and adjusting the time step sequence. However, they overlook input redundancy and require lengthy search times. In this paper, we propose Coarse-to-Fine Diffusion Models with Time Step Sequence Redistribution. Recognizing indistinguishable early-stage generated images, we introduce Coarse-to-Fine Denoising (C2F) to reduce computation during coarse feature generation. Furthermore, we design Time Step Sequence Redistribution (TRD) for efficient sampling trajectory adjustment, requiring less than 10 minutes for search. Experimental results demonstrate that the proposed methods achieve near-lossless performance with an 80% to 90% reduction in computation on CIFAR10 and LSUN-Church.
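One way to picture a redistributed time step sequence, taking large jumps through the coarse high-noise phase and small steps in the fine phase (the exponent gamma is an illustrative knob, not the paper's searched schedule):

```python
import numpy as np

def redistribute_timesteps(T=1000, n_steps=20, gamma=2.0):
    """Sampling trajectory skewed away from early (coarse) timesteps:
    gamma > 1 concentrates steps near t = 0, where fine detail forms."""
    u = np.linspace(0.0, 1.0, n_steps)
    ts = np.round((1.0 - u) ** gamma * (T - 1)).astype(int)
    return ts  # descending from T-1 to 0

ts = redistribute_timesteps()
```

The first jump here spans roughly a hundred timesteps while the last spans only a few, reflecting the observation that early-stage generated images are nearly indistinguishable.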
[309] WorldCache: Content-Aware Caching for Accelerated Video World Models
Umair Nawaz, Ahmed Heakl, Ufaq Khan, Abdelrahman Shaker, Salman Khan, Fahad Shahbaz Khan
Main category: cs.CV
TL;DR: WorldCache: A training-free feature caching framework for diffusion transformers that improves video generation efficiency by introducing motion-adaptive thresholds, saliency-weighted drift estimation, and optimal feature reuse strategies.
Details
Motivation: Diffusion Transformers (DiTs) for video generation are computationally expensive due to sequential denoising and spatio-temporal attention. Existing training-free feature caching methods rely on Zero-Order Hold assumptions that cause ghosting artifacts, blur, and motion inconsistencies in dynamic scenes.
Method: WorldCache introduces a Perception-Constrained Dynamical Caching framework with: 1) motion-adaptive thresholds, 2) saliency-weighted drift estimation, 3) optimal approximation via blending and warping, and 4) phase-aware threshold scheduling across diffusion steps.
Result: On Cosmos-Predict2.5-2B evaluated on PAI-Bench, WorldCache achieves 2.3× inference speedup while preserving 99.4% of baseline quality, substantially outperforming prior training-free caching approaches.
Conclusion: WorldCache enables adaptive, motion-consistent feature reuse without retraining, significantly accelerating video generation inference while maintaining high quality.
Abstract: Diffusion Transformers (DiTs) power high-fidelity video world models but remain computationally expensive due to sequential denoising and costly spatio-temporal attention. Training-free feature caching accelerates inference by reusing intermediate activations across denoising steps; however, existing methods largely rely on a Zero-Order Hold assumption i.e., reusing cached features as static snapshots when global drift is small. This often leads to ghosting artifacts, blur, and motion inconsistencies in dynamic scenes. We propose \textbf{WorldCache}, a Perception-Constrained Dynamical Caching framework that improves both when and how to reuse features. WorldCache introduces motion-adaptive thresholds, saliency-weighted drift estimation, optimal approximation via blending and warping, and phase-aware threshold scheduling across diffusion steps. Our cohesive approach enables adaptive, motion-consistent feature reuse without retraining. On Cosmos-Predict2.5-2B evaluated on PAI-Bench, WorldCache achieves \textbf{2.3$\times$} inference speedup while preserving \textbf{99.4%} of baseline quality, substantially outperforming prior training-free caching approaches. Our code can be accessed on \href{https://umair1221.github.io/World-Cache/}{World-Cache}.
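The when-to-reuse decision can be sketched as a drift test against a motion-adaptive threshold (illustrative threshold values and drift metric; the paper additionally uses saliency weighting, warping, and phase-aware scheduling):

```python
import numpy as np

def should_reuse(prev_feat, curr_feat, motion_score, base_thresh=0.05):
    """Reuse the cached feature only if relative drift stays under a
    threshold that tightens as scene motion grows, so dynamic content
    gets recomputed more often."""
    drift = np.linalg.norm(curr_feat - prev_feat) / (np.linalg.norm(prev_feat) + 1e-8)
    thresh = base_thresh / (1.0 + motion_score)  # high motion -> stricter reuse
    return drift < thresh

f = np.ones(64)
```

A static scene tolerates drift that a high-motion scene would not, which is how the method avoids the ghosting caused by naive Zero-Order Hold reuse.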
[310] Respiratory Status Detection with Video Transformers
Thomas Savage, Evan Madill
Main category: cs.CV
TL;DR: Video transformers can recognize respiratory distress from video by analyzing temporal patterns in breathing recovery after exercise
Details
Motivation: Recognition of respiratory distress through visual inspection is a life-saving clinical skill, and AI systems could help detect early signs of respiratory deterioration for earlier intervention.
Method: Collected videos of healthy volunteers recovering after strenuous exercise, used natural recovery to create labeled dataset, split videos into clips with earlier clips corresponding to more shortness of breath, designed temporal ordering challenge, used ViViT encoder with Lie Relative Encodings (LieRE) and Motion Guided Masking, combined with embedding-based comparison strategy
Result: Achieved F1 score of 0.81 on the respiratory distress recognition task, suggesting modern video transformers can recognize subtle changes in respiratory mechanics
Conclusion: Video transformers show promise for recognizing respiratory distress from video, potentially enabling AI systems to assist in clinical monitoring and early intervention
Abstract: Recognition of respiratory distress through visual inspection is a life-saving clinical skill. Clinicians can detect early signs of respiratory deterioration, creating a valuable window for earlier intervention. In this study, we evaluate whether recent advances in video transformers can enable Artificial Intelligence systems to recognize the signs of respiratory distress from video. We collected videos of healthy volunteers recovering after strenuous exercise and used the natural recovery of each participant's respiratory status to create a labeled dataset for respiratory distress. Splitting the video into short clips, with earlier clips corresponding to more shortness of breath, we designed a temporal ordering challenge to assess whether an AI system can detect respiratory distress. We found a ViViT encoder augmented with Lie Relative Encodings (LieRE) and Motion Guided Masking, combined with an embedding-based comparison strategy, can achieve an F1 score of 0.81 on this task. Our findings suggest that modern video transformers can recognize subtle changes in respiratory mechanics.
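A hypothetical reading of the embedding-based comparison strategy: the clip whose embedding lies farther from a fully-recovered reference is judged more short of breath, i.e. earlier in the recovery (embeddings below are made up):

```python
import numpy as np

def order_pair(emb_a, emb_b, recovered_emb):
    """Temporal ordering via embedding distance to a recovered-state
    reference: the farther clip is judged the earlier (more distressed) one."""
    da = np.linalg.norm(emb_a - recovered_emb)
    db = np.linalg.norm(emb_b - recovered_emb)
    return "a_earlier" if da > db else "b_earlier"

recovered = np.zeros(4)
early_clip = np.array([1.0, 1.0, 0.0, 0.0])   # far from the recovered state
late_clip = np.array([0.1, 0.0, 0.1, 0.0])    # close to the recovered state
```

This is only one plausible instantiation; the paper does not spell out the exact comparison rule in the summary above.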
[311] FluidGaussian: Propagating Simulation-Based Uncertainty Toward Functionally-Intelligent 3D Reconstruction
Yuqiu Liu, Jialin Song, Marissa Ramirez de Chanlatte, Rochishnu Chowdhury, Rushil Paresh Desai, Wuyang Chen, Daniel Martin, Michael Mahoney
Main category: cs.CV
TL;DR: FluidGaussian: A 3D reconstruction method that incorporates fluid-structure interactions to improve both visual and physical fidelity of reconstructed objects.
Details
Motivation: Current 3D reconstruction methods focus only on visual fidelity, ignoring physical plausibility and object functionality, leading to unphysical interactions.
Method: Proposes a plug-and-play method that couples geometry reconstruction with fluid-structure interactions, using simulation-based uncertainty metrics and active learning to prioritize views that improve both visual and physical fidelity.
Result: Achieves up to +8.6% visual PSNR improvement and -62.3% velocity divergence during fluid simulations on NeRF Synthetic, Mip-NeRF 360, and DrivAerNet++ datasets.
Conclusion: FluidGaussian successfully integrates physical interaction awareness into 3D reconstruction, improving both visual quality and physical plausibility of reconstructed objects.
Abstract: Real objects that inhabit the physical world follow physical laws and thus behave plausibly during interaction with other physical objects. However, current methods that perform 3D reconstructions of real-world scenes from multi-view 2D images optimize primarily for visual fidelity, i.e., they train with photometric losses and reason about uncertainty in the image or representation space. This appearance-centric view overlooks body contacts and couplings, conflates function-critical regions (e.g., aerodynamic or hydrodynamic surfaces) with ornamentation, and reconstructs structures suboptimally, even when physical regularizers are added. All these can lead to unphysical and implausible interactions. To address this, we consider the question: How can 3D reconstruction become aware of real-world interactions and underlying object functionality, beyond visual cues? To answer this question, we propose FluidGaussian, a plug-and-play method that tightly couples geometry reconstruction with ubiquitous fluid-structure interactions to assess surface quality at high granularity. We define a simulation-based uncertainty metric induced by fluid simulations and integrate it with active learning to prioritize views that improve both visual and physical fidelity. In an empirical evaluation on NeRF Synthetic (Blender), Mip-NeRF 360, and DrivAerNet++, our FluidGaussian method yields up to +8.6% visual PSNR (Peak Signal-to-Noise Ratio) and -62.3% velocity divergence during fluid simulations. Our code is available at https://github.com/delta-lab-ai/FluidGaussian.
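The active-learning view selection can be sketched as scoring candidate views by a combination of visual and simulation-based uncertainty (field names and weights are invented):

```python
def select_next_view(views, w_vis=0.5, w_phys=0.5):
    """Acquire the candidate view whose combined visual and physical
    (fluid-simulation) uncertainty is highest."""
    def score(v):
        return w_vis * v["visual_unc"] + w_phys * v["sim_unc"]
    return max(views, key=score)

views = [
    {"id": 0, "visual_unc": 0.2, "sim_unc": 0.1},
    {"id": 1, "visual_unc": 0.1, "sim_unc": 0.9},  # function-critical surface
]
best = select_next_view(views)
```

The second view wins despite lower visual uncertainty, mirroring the paper's point that appearance-centric criteria alone miss function-critical regions such as aerodynamic surfaces.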
[312] Relax Forcing: Relaxed KV-Memory for Consistent Long Video Generation
Zengqun Zhao, Yanzuo Lu, Ziquan Liu, Jifei Song, Jiankang Deng, Ioannis Patras
Main category: cs.CV
TL;DR: Relax Forcing: A structured temporal memory mechanism for autoregressive video diffusion that decomposes historical context into functional roles (Sink, Tail, History) to improve long-horizon video generation by mitigating error accumulation while preserving motion evolution.
Details
Motivation: Current self-forcing strategies for autoregressive video diffusion still struggle with minute-scale generation due to progressive temporal degradation. The authors find this limitation stems not from insufficient memory but from how temporal memory is utilized during inference, suggesting temporal memory should not be treated as a homogeneous buffer.
Method: Relax Forcing decomposes temporal context into three functional roles: Sink for global stability, Tail for short-term continuity, and dynamically selected History for structural motion guidance. This structured approach selectively incorporates only the most relevant past information instead of attending to dense generated history.
Result: Experiments on VBench-Long demonstrate that Relax Forcing improves motion dynamics and overall temporal consistency while reducing attention overhead. The approach mitigates error accumulation during extrapolation while preserving motion evolution.
Conclusion: Structured temporal memory is essential for scalable long video generation, complementing existing forcing-based training strategies. The work shows that how temporal memory is utilized, not just its quantity, is critical for long-horizon video synthesis.
Abstract: Autoregressive (AR) video diffusion has recently emerged as a promising paradigm for long video generation, enabling causal synthesis beyond the limits of bidirectional models. To address training-inference mismatch, a series of self-forcing strategies have been proposed to improve rollout stability by conditioning the model on its own predictions during training. While these approaches substantially mitigate exposure bias, extending generation to minute-scale horizons remains challenging due to progressive temporal degradation. In this work, we show that this limitation is not primarily caused by insufficient memory, but by how temporal memory is utilised during inference. Through empirical analysis, we find that increasing memory does not consistently improve long-horizon generation, and that the temporal placement of historical context significantly influences motion dynamics while leaving visual quality largely unchanged. These findings suggest that temporal memory should not be treated as a homogeneous buffer. Motivated by this insight, we introduce Relax Forcing, a structured temporal memory mechanism for AR diffusion. Instead of attending to the dense generated history, Relax Forcing decomposes temporal context into three functional roles: Sink for global stability, Tail for short-term continuity, and dynamically selected History for structural motion guidance, and selectively incorporates only the most relevant past information. This design mitigates error accumulation during extrapolation while preserving motion evolution. Experiments on VBench-Long demonstrate that Relax Forcing improves motion dynamics and overall temporal consistency while reducing attention overhead. Our results suggest that structured temporal memory is essential for scalable long video generation, complementing existing forcing-based training strategies.
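The Sink/Tail/History decomposition can be sketched as an index-selection rule over past frames (relevance scores are supplied directly here for illustration; the paper derives them dynamically):

```python
import numpy as np

def select_memory(n_frames, sink=2, tail=4, history_scores=None, k_hist=3):
    """Keep a fixed Sink of the earliest frames, a Tail of the most recent
    frames, and the top-k History frames ranked by a relevance score."""
    sink_idx = list(range(min(sink, n_frames)))
    tail_idx = list(range(max(0, n_frames - tail), n_frames))
    mid = [i for i in range(n_frames) if i not in sink_idx and i not in tail_idx]
    if history_scores is None:
        history_scores = np.zeros(n_frames)
    hist_idx = sorted(mid, key=lambda i: -history_scores[i])[:k_hist]
    return sorted(set(sink_idx + hist_idx + tail_idx))

keep = select_memory(20, history_scores=np.arange(20) % 5)
```

Attention then runs over only these kept frames, which is how the method reduces overhead while retaining both global anchoring and structural motion guidance.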
[313] HamVision: Hamiltonian Dynamics as Inductive Bias for Medical Image Analysis
Mohamed A Mabrok
Main category: cs.CV
TL;DR: HamVision: A medical image analysis framework using damped harmonic oscillator as structured inductive bias for segmentation and classification, achieving SOTA results on multiple medical imaging benchmarks.
Details
Motivation: To leverage fundamental signal processing principles (damped harmonic oscillator) as structured inductive bias for medical image analysis, providing interpretable representations that emerge from dynamics rather than supervision.
Method: Uses phase-space decomposition of oscillator dynamics to yield three representations: position (feature content), momentum (spatial gradients for boundaries/texture), and energy (parameter-free saliency). For segmentation (HamSeg), energy gates skip connections while momentum injects boundary info. For classification (HamCls), representations are pooled and concatenated into phase-space feature vector.
Result: Achieves SOTA Dice scores on ISIC 2018 (89.38%), ISIC 2017 (88.40%), TN3K (87.05%), ACDC (92.40%) for segmentation. SOTA accuracy on BloodMNIST (98.85%) and PathMNIST (96.65%) for classification, with competitive results on other MedMNIST datasets. Only 8.57M parameters.
Conclusion: The damped harmonic oscillator provides effective structured inductive bias for medical image analysis, yielding interpretable representations that emerge from Hamiltonian dynamics and achieve strong performance across diverse imaging modalities.
Abstract: We present HamVision, a framework for medical image analysis that uses the damped harmonic oscillator, a fundamental building block of signal processing, as a structured inductive bias for both segmentation and classification tasks. The oscillator’s phase-space decomposition yields three functionally distinct representations: position~$q$ (feature content), momentum~$p$ (spatial gradients that encode boundary and texture information), and energy $H = \tfrac{1}{2}|z|^2$ (a parameter-free saliency map). These representations emerge from the dynamics, not from supervision, and can be exploited by different task-specific heads without any modification to the oscillator itself. For segmentation, energy gates the skip connections while momentum injects boundary information at every decoder level (HamSeg). For classification, the three representations are globally pooled and concatenated into a phase-space feature vector (HamCls). We evaluate HamVision across ten medical imaging benchmarks spanning five imaging modalities. On segmentation, HamSeg achieves state-of-the-art Dice scores on ISIC 2018 (89.38%), ISIC 2017 (88.40%), TN3K (87.05%), and ACDC (92.40%), outperforming most baselines with only 8.57M parameters. On classification, HamCls achieves state-of-the-art accuracy on BloodMNIST (98.85%) and PathMNIST (96.65%), and competitive results on the remaining MedMNIST datasets against MedMamba and MedViT. Diagnostic analysis confirms that the oscillator’s momentum consistently encodes an interior $>$ boundary $>$ exterior gradient for segmentation and that the energy map correlates with discriminative regions for classification, properties that emerge entirely from the Hamiltonian dynamics. Code is available at https://github.com/Minds-R-Lab/hamvision.
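The phase-space readouts follow directly from the abstract's definitions: with $z = q + ip$, the energy saliency is $H = \tfrac{1}{2}|z|^2 = \tfrac{1}{2}(q^2 + p^2)$ per spatial location (toy 2x2 feature maps below):

```python
import numpy as np

def phase_space_maps(q, p):
    """Position q is the content map, momentum p the gradient/boundary map,
    and the energy H = 0.5 * (q^2 + p^2) a parameter-free saliency map."""
    energy = 0.5 * (q ** 2 + p ** 2)
    return q, p, energy

q = np.array([[0.0, 1.0], [2.0, 0.0]])
p = np.array([[1.0, 0.0], [0.0, 2.0]])
_, _, H = phase_space_maps(q, p)
# H = [[0.5, 0.5], [2.0, 2.0]]
```

Because $H$ requires no extra parameters, the same saliency map can gate decoder skip connections (HamSeg) or join the pooled classification vector (HamCls) unchanged.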
[314] An InSAR Phase Unwrapping Framework for Large-scale and Complex Events
Yijia Song, Juliet Biggs, Alin Achim, Robert Popescu, Simon Orrego, Nantheera Anantrasirichai
Main category: cs.CV
TL;DR: A diffusion model-based phase unwrapping framework for large-scale InSAR interferograms that handles deformation discontinuities and fault-related phase jumps.
Details
Motivation: Phase unwrapping in InSAR processing is challenging for complex deformation patterns like earthquakes, where surface-breaking faults create abrupt displacement discontinuities that disrupt phase continuity. Existing learning-based methods have fixed small input sizes, while real interferograms are large-scale and heterogeneous, limiting practical applicability.
Method: Proposes a phase unwrapping framework based on a diffusion model architecture designed to process large-scale interferograms and address phase discontinuities caused by deformation. The diffusion model can recover physically consistent unwrapped phase fields even with fault-related phase jumps.
Result: Experimental results on both synthetic and real datasets show the method effectively addresses discontinuities associated with near-surface deformation and scales well to large InSAR images.
Conclusion: The diffusion model-based approach offers a practical alternative to manual unwrapping in challenging scenarios involving complex deformation patterns and large-scale interferograms.
Abstract: Phase unwrapping remains a critical and challenging problem in InSAR processing, particularly in scenarios involving complex deformation patterns. In earthquake-related deformation, shallow sources can generate surface-breaking faults and abrupt displacement discontinuities, which severely disrupt phase continuity and often cause conventional unwrapping algorithms to fail. Another limitation of existing learning-based unwrapping methods is their reliance on fixed and relatively small input sizes, while real InSAR interferograms are typically large-scale and spatially heterogeneous. This mismatch restricts the applicability of many neural network approaches to real-world data. In this work, we present a phase unwrapping framework based on a diffusion model, developed to process large-scale interferograms and to address phase discontinuities caused by deformation. By leveraging a diffusion model architecture, the proposed method can recover physically consistent unwrapped phase fields even in the presence of fault-related phase jumps. Experimental results on both synthetic and real datasets demonstrate that the method effectively addresses discontinuities associated with near-surface deformation and scales well to large InSAR images, offering a practical alternative to manual unwrapping in challenging scenarios.
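For context, the classical 1-D unwrapping baseline assumes adjacent samples differ by less than $\pi$, which is exactly the assumption that fault-related phase jumps violate:

```python
import numpy as np

# Ground-truth phase ramp, wrapped into (-pi, pi] as in an interferogram
true_phase = np.linspace(0.0, 6.0 * np.pi, 50)
wrapped = np.angle(np.exp(1j * true_phase))   # wrapped (interferometric) phase
recovered = np.unwrap(wrapped)                # Itoh-style 1-D unwrapping
```

Here the smooth ramp is recovered exactly; across a surface-breaking fault the true phase jump can exceed $\pi$, the smoothness assumption fails, and conventional algorithms break down, which is the gap the diffusion model targets.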
[315] Mitigating Objectness Bias and Region-to-Text Misalignment for Open-Vocabulary Panoptic Segmentation
Nikolay Kormushev, Josip Šarić, Matej Kristan
Main category: cs.CV
TL;DR: OVRCOAT is a modular framework for open-vocabulary panoptic segmentation that addresses mask selection bias and limited regional understanding in vision-language models through CLIP-conditioned objectness adjustment and mask-to-text refinement.
Details
Motivation: The paper addresses two key limitations in open-vocabulary panoptic segmentation: (1) mask selection bias where objectness heads trained on closed vocabularies suppress masks of unseen categories, and (2) limited regional understanding in vision-language models like CLIP that were optimized for global image classification rather than localized segmentation.
Method: OVRCOAT introduces two modular components: (1) CLIP-conditioned objectness adjustment (COAT) that updates background/foreground probabilities to preserve high-quality masks for out-of-vocabulary objects, and (2) open-vocabulary mask-to-text refinement (OVR) that strengthens CLIP’s region-level alignment to improve classification of both seen and unseen classes with lower memory cost than prior fine-tuning approaches.
Result: OVRCOAT achieves state-of-the-art performance on ADE20K (+5.5% PQ) and delivers significant gains on Mapillary Vistas (+7.1% PQ) and Cityscapes (+3% PQ). The framework improves both objectness estimation and mask recognition while maintaining simplicity and modularity.
Conclusion: The proposed OVRCOAT framework effectively addresses the coupled issues of mask selection bias and limited regional understanding in vision-language models, enabling consistent improvements in open-vocabulary panoptic segmentation across multiple benchmarks with a simple, modular approach.
Abstract: Open-vocabulary panoptic segmentation remains hindered by two coupled issues: (i) mask selection bias, where objectness heads trained on closed vocabularies suppress masks of categories not observed in training, and (ii) limited regional understanding in vision-language models such as CLIP, which were optimized for global image classification rather than localized segmentation. We introduce OVRCOAT, a simple, modular framework that tackles both. First, a CLIP-conditioned objectness adjustment (COAT) updates background/foreground probabilities, preserving high-quality masks for out-of-vocabulary objects. Second, an open-vocabulary mask-to-text refinement (OVR) strengthens CLIP’s region-level alignment to improve classification of both seen and unseen classes with markedly lower memory cost than prior fine-tuning schemes. The two components combine to jointly improve objectness estimation and mask recognition, yielding consistent panoptic gains. Despite its simplicity, OVRCOAT sets a new state of the art on ADE20K (+5.5% PQ) and delivers clear gains on Mapillary Vistas and Cityscapes (+7.1% and +3% PQ, respectively). The code is available at: https://github.com/nickormushev/OVRCOAT
[316] Knowledge Priors for Identity-Disentangled Open-Set Privacy-Preserving Video FER
Feng Xu, Xun Li, Lars Petersson, Yulei Sui, David Ahmedt Aristizabal, Dadong Wang
Main category: cs.CV
TL;DR: Two-stage framework for privacy-preserving facial expression recognition in videos without identity labels, using identity suppression and denoising modules.
Details
Motivation: Facial expression recognition uses facial data that exposes identity, raising privacy concerns. Current methods fail in realistic open-set video settings where identities are unknown and identity labels are unavailable.
Method: Proposes a two-stage framework: 1) Train identity-suppression network using intra- and inter-video knowledge priors from real-world videos without identity labels to anonymize identity while preserving expressive cues. 2) Denoising module restores expression-related information. Also introduces falsification-based validation method using recognition priors to evaluate privacy robustness without annotated identity labels.
Result: Experiments on three video datasets demonstrate the method effectively protects privacy while maintaining FER accuracy comparable to identity-supervised baselines.
Conclusion: The framework enables privacy-preserving facial expression recognition in challenging open-set video settings without requiring identity labels at any stage.
Abstract: Facial expression recognition relies on facial data that inherently expose identity and thus raise significant privacy concerns. Current privacy-preserving methods typically fail in realistic open-set video settings where identities are unknown, and identity labels are unavailable. We propose a two-stage framework for video-based privacy-preserving FER in challenging open-set settings that requires no identity labels at any stage. To decouple privacy and utility, we first train an identity-suppression network using intra- and inter-video knowledge priors derived from real-world videos without identity labels. This network anonymizes identity while preserving expressive cues. A subsequent denoising module restores expression-related information and helps recover FER performance. Furthermore, we introduce a falsification-based validation method that uses recognition priors to rigorously evaluate privacy robustness without requiring annotated identity labels. Experiments on three video datasets demonstrate that our method effectively protects privacy while maintaining FER accuracy comparable to identity-supervised baselines.
[317] Uncertainty-Aware Knowledge Distillation for Multimodal Large Language Models
Jingchen Sun, Shaobo Han, Deep Patel, Wataru Kohno, Can Jin, Changyou Chen
Main category: cs.CV
TL;DR: Beta-KD is an uncertainty-aware knowledge distillation framework that adaptively balances learning from data vs. teacher guidance using Bayesian principles, improving student VLMs on multimodal VQA tasks.
Details
Motivation: Traditional knowledge distillation struggles to balance data supervision and teacher guidance optimally, as some samples may be noisy while the teacher may be uncertain on others. There's a need for adaptive balancing between these two supervision sources.
Method: Proposes Beta-weighted Knowledge Distillation (Beta-KD) that formulates teacher-student learning from a Bayesian perspective, interpreting teacher supervision as a Gibbs prior over student activations. This yields a closed-form, uncertainty-aware weighting mechanism that supports arbitrary distillation objectives and their combinations.
Result: Extensive experiments on multimodal VQA benchmarks show that distilling student Vision-Language Models from large teacher VLMs consistently improves performance. Beta-KD outperforms existing knowledge distillation methods.
Conclusion: Beta-KD provides an effective uncertainty-aware framework for adaptive knowledge distillation that balances data and teacher supervision, demonstrating superior performance on multimodal VQA tasks.
Abstract: Knowledge distillation establishes a learning paradigm that leverages both data supervision and teacher guidance. However, determining the optimal balance between learning from data and learning from the teacher is challenging, as some samples may be noisy while others are subject to teacher uncertainty. This motivates the need for adaptively balancing data and teacher supervision. We propose Beta-weighted Knowledge Distillation (Beta-KD), an uncertainty-aware distillation framework that adaptively modulates how much the student relies on teacher guidance. Specifically, we formulate teacher–student learning from a unified Bayesian perspective and interpret teacher supervision as a Gibbs prior over student activations. This yields a closed-form, uncertainty-aware weighting mechanism and supports arbitrary distillation objectives and their combinations. Extensive experiments on multimodal VQA benchmarks demonstrate that distilling student Vision-Language Models from a large teacher VLM consistently improves performance. The results show that Beta-KD outperforms existing knowledge distillation methods. The code is available at https://github.com/Jingchensun/beta-kd.
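The abstract does not give Beta-KD's closed-form weights, but the underlying idea of trading off the data term against the teacher term per sample can be sketched. The following is a hypothetical illustration only: it weights the distillation term by teacher confidence (one minus normalized predictive entropy), not by the paper's Gibbs-prior derivation, and all function names are made up here.

```python
import numpy as np

def softmax(z, tau=1.0):
    z = z / tau
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def uncertainty_weighted_kd_loss(student_logits, teacher_logits, labels, tau=2.0):
    """Per-sample blend of cross-entropy (data) and KL-to-teacher (guidance),
    weighted by teacher confidence. Illustrative stand-in for Beta-KD."""
    p_s = softmax(student_logits)
    ce = -np.log(p_s[np.arange(len(labels)), labels] + 1e-12)
    p_t = softmax(teacher_logits, tau)
    log_ps_tau = np.log(softmax(student_logits, tau) + 1e-12)
    kd = (p_t * (np.log(p_t + 1e-12) - log_ps_tau)).sum(-1) * tau ** 2
    # Teacher confidence in [0, 1]: 1 - entropy / max-entropy.
    ent = -(p_t * np.log(p_t + 1e-12)).sum(-1)
    w = 1.0 - ent / np.log(p_t.shape[-1])
    # Confident teacher -> lean on KD; uncertain teacher -> lean on data.
    return ((1.0 - w) * ce + w * kd).mean()

rng = np.random.default_rng(0)
s, t = rng.normal(size=(8, 10)), rng.normal(size=(8, 10))
y = rng.integers(0, 10, size=8)
loss = uncertainty_weighted_kd_loss(s, t, y)
assert np.isfinite(loss) and loss > 0
```

The point of the sketch is the per-sample weight `w`: a single global KD coefficient cannot express "trust the teacher here, trust the labels there", which is the imbalance Beta-KD targets.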
[318] Image-Based Structural Analysis Using Computer Vision and LLMs: PhotoBeamSolver
Altamirano-Muñiz Emilio Fernando
Main category: cs.CV
TL;DR: A computer vision system that solves idealized beam models from hand-drawn diagrams using visual interpretation and statistical learning techniques.
Details
Motivation: To bridge the gap between hand-drawn structural diagrams and computational analysis by developing a system that can automatically interpret and solve beam models from drawings, addressing challenges in applying computer vision to civil engineering.
Method: Uses computer vision and statistical learning techniques for detection and visual interpretation of structural elements in drawings, implemented in the PhotoBeamSolver program.
Result: Development of a documented program capable of solving idealized beam models from drawings, with analysis of challenges and limitations in computer vision integration for structural analysis.
Conclusion: The system demonstrates potential for applying computer vision in civil engineering for structural analysis, infrastructure inspection, and decision-support systems, though challenges remain for reliable field application.
Abstract: This paper presents the development of a documented program capable of solving idealized beam models, such as those commonly used in textbooks and academic exercises, from drawings made by a person. The system is based on computer vision and statistical learning techniques for the detection and visual interpretation of structural elements. Likewise, the main challenges and limitations associated with the integration of computer vision into structural analysis are analyzed, as well as the requirements necessary for its reliable application in the field of civil engineering. In this context, the implementation of the PhotoBeamSolver program is explored, and the current state of computer vision in civil engineering is discussed, particularly in relation to structural analysis, infrastructure inspection, and engineering decision-support systems.
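For readers outside civil engineering, the "idealized beam models" the system solves are simple statics exercises; a minimal example of that back-end (not the paper's vision pipeline, and with illustrative numbers and function names) is the reaction calculation for a simply supported beam under a point load, from force and moment equilibrium:

```python
def simply_supported_reactions(L, P, a):
    """Support reactions for a simply supported beam of span L under a
    point load P applied at distance a from the left support."""
    R_right = P * a / L   # moment equilibrium about the left support
    R_left = P - R_right  # vertical force equilibrium
    return R_left, R_right

# A 10 m beam with a 100 kN load 4 m from the left support.
R_a, R_b = simply_supported_reactions(10.0, 100.0, 4.0)
assert (R_a, R_b) == (60.0, 40.0)
```

This is the kind of textbook model the computer-vision front-end must reconstruct (supports, loads, dimensions) before any solving can happen.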
[319] PAS3R: Pose-Adaptive Streaming 3D Reconstruction for Long Video Sequences
Lanbo Xu, Liang Guo, Caigui Jiang, Cheng Wang
Main category: cs.CV
TL;DR: PAS3R: A pose-adaptive streaming 3D reconstruction framework that dynamically modulates state updates based on camera motion and scene structure to address the stability-adaptation dilemma in online monocular reconstruction.
Details
Motivation: Address the fundamental stability-adaptation dilemma in online monocular 3D reconstruction, where existing approaches fail to handle abrupt viewpoint transitions, leading to trajectory drift and geometric inconsistencies over long video sequences.
Method: Introduces a motion-aware update mechanism using inter-frame pose variation and image frequency cues to estimate frame importance; trajectory-consistent training with relative pose constraints and acceleration regularization; and a lightweight online stabilization module.
Result: Significantly improves trajectory accuracy, depth estimation, and point cloud reconstruction quality in long video sequences while maintaining competitive performance on shorter sequences across multiple benchmarks.
Conclusion: PAS3R effectively addresses the stability-adaptation dilemma through pose-adaptive streaming, enabling more robust and accurate online 3D reconstruction from monocular video.
Abstract: Online monocular 3D reconstruction enables dense scene recovery from streaming video but remains fundamentally limited by the stability-adaptation dilemma: the reconstruction model must rapidly incorporate novel viewpoints while preserving previously accumulated scene structure. Existing streaming approaches rely on uniform or attention-based update mechanisms that often fail to account for abrupt viewpoint transitions, leading to trajectory drift and geometric inconsistencies over long sequences. We introduce PAS3R, a pose-adaptive streaming reconstruction framework that dynamically modulates state updates according to camera motion and scene structure. Our key insight is that frames contributing significant geometric novelty should exert stronger influence on the reconstruction state, while frames with minor viewpoint variation should prioritize preserving historical context. PAS3R operationalizes this principle through a motion-aware update mechanism that jointly leverages inter-frame pose variation and image frequency cues to estimate frame importance. To further stabilize long-horizon reconstruction, we introduce trajectory-consistent training objectives that incorporate relative pose constraints and acceleration regularization. A lightweight online stabilization module further suppresses high-frequency trajectory jitter and geometric artifacts without increasing memory consumption. Extensive experiments across multiple benchmarks demonstrate that PAS3R significantly improves trajectory accuracy, depth estimation, and point cloud reconstruction quality in long video sequences while maintaining competitive performance on shorter sequences.
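The "motion-aware update mechanism" combines inter-frame pose variation with image frequency cues to score each frame's importance. The paper's exact formulation is not given in the abstract; the sketch below is a hypothetical version (all weights, names, and the sigmoid gating are assumptions) showing how those two signals could be fused into an update gate in (0, 1):

```python
import numpy as np

def frame_importance(R_prev, t_prev, R_cur, t_cur, image, alpha=1.0, beta=0.1):
    """Score a frame by camera-motion magnitude plus high-frequency image
    content; illustrative stand-in for PAS3R's update weighting."""
    # Rotation change: geodesic angle between the two rotation matrices.
    R_rel = R_prev.T @ R_cur
    angle = np.arccos(np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0))
    trans = np.linalg.norm(t_cur - t_prev)
    # Frequency cue: share of spectral energy outside the lowest band.
    F = np.fft.fftshift(np.abs(np.fft.fft2(image)))
    h, w = image.shape
    low = F[h // 2 - 2:h // 2 + 3, w // 2 - 2:w // 2 + 3].sum()
    high_ratio = 1.0 - low / F.sum()
    score = alpha * (angle + trans) + beta * high_ratio
    return 1.0 / (1.0 + np.exp(-score))  # squash into (0, 1) as an update gate

I = np.eye(3)
# Identical poses on a flat image give a neutral gate...
g0 = frame_importance(I, np.zeros(3), I, np.zeros(3), np.ones((32, 32)))
# ...while a 90-degree viewpoint jump pushes the gate toward 1.
Rz = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
g1 = frame_importance(I, np.zeros(3), Rz, np.ones(3), np.ones((32, 32)))
assert g1 > g0
```

A gate like this realizes the abstract's key insight: geometrically novel frames get more influence on the reconstruction state, while near-static frames preserve history.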
[320] EpiMask: Leveraging Epipolar Distance Based Masks in Cross-Attention for Satellite Image Matching
Rahul Deshmukh, Aditya Chauhan, Avinash Kak
Main category: cs.CV
TL;DR: EpiMask is a semi-dense image matching network specifically designed for satellite images that incorporates patch-wise affine approximations to camera geometry and uses epipolar distance-based attention masks to improve matching accuracy by up to 30% compared to ground-based models.
Details
Motivation: Existing deep-learning based image matching networks are optimized for pinhole camera geometry from ground-based datasets, leading to suboptimal performance when applied to satellite images which have different camera geometry (moving satellite camera records one line at a time).
Method: EpiMask incorporates three key components: 1) patch-wise affine approximations to camera modeling geometry, 2) epipolar distance-based attention mask to restrict cross-attention to geometrically plausible regions, and 3) fine-tuning of a foundational pretrained image encoder for robust feature extraction.
Result: Experiments on the SatDepth dataset demonstrate up to 30% improvement in matching accuracy compared to re-trained ground-based models.
Conclusion: EpiMask effectively addresses the limitations of ground-based image matching networks for satellite imagery by incorporating domain-specific geometric constraints and attention mechanisms, significantly improving matching performance for satellite images.
Abstract: The deep-learning based image matching networks can now handle significantly larger variations in viewpoints and illuminations while providing matched pairs of pixels with sub-pixel precision. These networks have been trained with ground-based image datasets and, implicitly, their performance is optimized for the pinhole camera geometry. Consequently, such networks perform suboptimally when used to match satellite images, since those images are synthesized as a moving satellite camera records one line at a time of the points on the ground. In this paper, we present EpiMask, a semi-dense image matching network for satellite images that (1) incorporates patch-wise affine approximations to the camera modeling geometry; (2) uses an epipolar distance-based attention mask to restrict cross-attention to geometrically plausible regions; and (3) fine-tunes a foundational pretrained image encoder for robust feature extraction. Experiments on the SatDepth dataset demonstrate up to 30% improvement in matching accuracy compared to re-trained ground-based models.
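The core mechanism, restricting cross-attention to points near the epipolar line, is easy to illustrate. The sketch below uses a standard pinhole fundamental matrix as a stand-in for the paper's patch-wise affine model, and a rectified-stereo geometry chosen for clarity; thresholds and names are assumptions:

```python
import numpy as np

def epipolar_attention_mask(F, pts1, pts2, thresh=3.0):
    """pts1: (N, 2) query pixels in image 1, pts2: (M, 2) key pixels in
    image 2 -> (N, M) boolean mask. A query may only attend to keys within
    `thresh` pixels of its epipolar line l = F @ x1."""
    x1 = np.hstack([pts1, np.ones((len(pts1), 1))])  # homogeneous coords
    x2 = np.hstack([pts2, np.ones((len(pts2), 1))])
    lines = x1 @ F.T                                 # (N, 3) epipolar lines
    # Point-line distance |l . x2| / sqrt(a^2 + b^2) for every pair.
    num = np.abs(lines @ x2.T)
    den = np.linalg.norm(lines[:, :2], axis=1, keepdims=True)
    return num / np.maximum(den, 1e-9) < thresh

# Rectified pair (camera translated along x): epipolar lines are rows.
F = np.array([[0.0, 0.0, 0.0],
              [0.0, 0.0, -1.0],
              [0.0, 1.0, 0.0]])
q = np.array([[10.0, 20.0]])                  # query pixel on row 20
k = np.array([[50.0, 20.0], [50.0, 40.0]])    # same row vs. distant row
mask = epipolar_attention_mask(F, q, k)
assert mask[0, 0] and not mask[0, 1]
```

In the network, a mask like this would be applied inside cross-attention (e.g. setting disallowed logits to negative infinity), so matching capacity is spent only on geometrically plausible candidates.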
[321] ALADIN: Attribute-Language Distillation Network for Person Re-Identification
Wang Zhou, Boran Duan, Haojun Ai, Ruiqi Lan, Ziyue Zhou
Main category: cs.CV
TL;DR: ALADIN distills CLIP knowledge for person re-identification using fine-grained attribute-language alignment and scene-aware prompts, improving robustness to occlusions and generalization.
Details
Motivation: Current CLIP-guided ReID methods rely on global features and fixed prompts, limiting their ability to capture fine-grained attribute cues and adapt to diverse appearances. There's a need for more adaptive alignment between visual and textual attributes.
Method: Proposes ALADIN (Attribute-Language Distillation Network) that distills knowledge from frozen CLIP teacher to lightweight ReID student. Key components: 1) Fine-grained attribute-local alignment for adaptive text-visual correspondence, 2) Scene-Aware Prompt Generator for image-specific soft prompts, 3) Attribute-local distillation for consistency between textual attributes and local visual features, 4) Cross-modal contrastive and relation distillation to preserve attribute relationships, 5) Uses Multimodal LLMs to generate structured attribute descriptions converted to localized attention maps via CLIP.
Result: Experiments on Market-1501, DukeMTMC-reID, and MSMT17 show improvements over CNN-, Transformer-, and CLIP-based methods. Demonstrates better generalization and interpretability, with enhanced robustness under occlusions.
Conclusion: ALADIN effectively bridges the gap between global CLIP features and fine-grained ReID requirements through attribute-language distillation, achieving state-of-the-art performance with improved robustness and interpretability.
Abstract: Recent vision-language models such as CLIP provide strong cross-modal alignment, but current CLIP-guided ReID pipelines rely on global features and fixed prompts. This limits their ability to capture fine-grained attribute cues and adapt to diverse appearances. We propose ALADIN, an attribute-language distillation network that distills knowledge from a frozen CLIP teacher to a lightweight ReID student. ALADIN introduces fine-grained attribute-local alignment to establish adaptive text-visual correspondence and robust representation learning. A Scene-Aware Prompt Generator produces image-specific soft prompts to facilitate adaptive alignment. Attribute-local distillation enforces consistency between textual attributes and local visual features, significantly enhancing robustness under occlusions. Furthermore, we employ cross-modal contrastive and relation distillation to preserve the inherent structural relationships among attributes. To provide precise supervision, we leverage Multimodal LLMs to generate structured attribute descriptions, which are then converted into localized attention maps via CLIP. At inference, only the student is used. Experiments on Market-1501, DukeMTMC-reID, and MSMT17 show improvements over CNN-, Transformer-, and CLIP-based methods, with better generalization and interpretability.
[322] Which Concepts to Forget and How to Refuse? Decomposing Concepts for Continual Unlearning in Large Vision-Language Models
Hyundong Jin, Dongyoon Han, Eunwoo Kim
Main category: cs.CV
TL;DR: A continual unlearning framework for vision-language models that grounds refusal behavior in fine-grained visual-textual concepts to enable selective forgetting while preserving general utility.
Details
Motivation: Continual unlearning in vision-language models faces challenges where sequential deletion requests create spurious associations between vision-language pairs and refusal behaviors, leading to inappropriate refusals and distorted shared representations.
Method: Proposes a concept modulator to identify visual-linguistic concept combinations for forget categories, and a mixture of refusal experts (refusers) specialized for concept-aligned refusal generation. Uses multimodal concept-driven routing to reuse refusers for similar concepts and adapt underutilized ones for novel concepts.
Result: Extensive experiments on vision-language benchmarks show the framework outperforms existing methods by generating concept-grounded refusal responses and preserving general utility across unlearning sequences.
Conclusion: The proposed continual unlearning framework effectively addresses the challenge of selective refusal in vision-language models by grounding refusal behavior in fine-grained concepts, enabling precise identification of refusal targets while maintaining model utility.
Abstract: Continual unlearning poses the challenge of enabling large vision-language models to selectively refuse specific image-instruction pairs in response to sequential deletion requests, while preserving general utility. However, sequential unlearning updates distort shared representations, creating spurious associations between vision-language pairs and refusal behaviors that hinder precise identification of refusal targets, resulting in inappropriate refusals. To address this challenge, we propose a novel continual unlearning framework that grounds refusal behavior in fine-grained descriptions of visual and textual concepts decomposed from deletion targets. We first identify which visual-linguistic concept combinations characterize each forget category through a concept modulator, then determine how to generate appropriate refusal responses via a mixture of refusal experts, termed refusers, each specialized for concept-aligned refusal generation. To generate concept-specific refusal responses across sequential tasks, we introduce a multimodal, concept-driven routing scheme that reuses refusers for tasks sharing similar concepts and adapts underutilized ones for novel concepts. Extensive experiments on vision-language benchmarks demonstrate that the proposed framework outperforms existing methods by generating concept-grounded refusal responses and preserving the general utility across unlearning sequences.
[323] Learning Trajectory-Aware Multimodal Large Language Models for Video Reasoning Segmentation
Jingnan Luo, Mingqi Gao, Jun Liu, Bin-Bin Gao, Feng Zheng
Main category: cs.CV
TL;DR: TrajSeg is a unified MLLM-based framework for video reasoning segmentation that introduces bidirectional text-trajectory alignment to better perceive object trajectories in dynamic videos.
Details
Motivation: Existing video reasoning segmentation methods struggle with trajectory perception in dynamic videos due to unidirectional and implicit text-trajectory alignment. There's a need for better correspondence between text instructions and object trajectories.
Method: Proposes bidirectional text-trajectory alignment where MLLMs handle both grounding-intended (text-to-trajectory) and captioning-intended (trajectory-to-text) instructions. Uses frame-level content integration to adapt trajectory tokens to frame-specific information and a unified mask decoder for end-to-end training.
Result: Outperforms all existing video reasoning segmentation methods on all metrics across referring and reasoning video segmentation datasets.
Conclusion: TrajSeg demonstrates that bidirectional alignment enhances trajectory perception in MLLMs for video reasoning segmentation, achieving state-of-the-art performance with a simplified, end-to-end trainable framework.
Abstract: The prosperity of Multimodal Large Language Models (MLLMs) has stimulated the demand for video reasoning segmentation, which aims to segment video objects based on human instructions. Previous studies rely on unidirectional and implicit text-trajectory alignment, which struggles with trajectory perception when faced with severe video dynamics. In this work, we propose TrajSeg, a simple and unified framework built upon MLLMs. Concretely, we introduce bidirectional text-trajectory alignment, where MLLMs accept grounding-intended (text-to-trajectory) and captioning-intended (trajectory-to-text) instructions. This way, MLLMs can benefit from enhanced correspondence and better perceive object trajectories in videos. The mask generation from trajectories is achieved via a frame-level content integration (FCI) module and a unified mask decoder. The former adapts the MLLM-parsed trajectory-level token to frame-specific information. The latter unifies segmentation for all frames into a single structure, enabling the proposed framework to be simplified and end-to-end trainable. Extensive experiments on referring and reasoning video segmentation datasets demonstrate the effectiveness of TrajSeg, which outperforms all video reasoning segmentation methods on all metrics. The code will be publicly available at https://github.com/haodi19/TrajSeg.
[324] Parameter-efficient Prompt Tuning and Hierarchical Textual Guidance for Few-shot Whole Slide Image Classification
Jayanie Bogahawatte, Sachith Seneviratne, Saman Halgamuge
Main category: cs.CV
TL;DR: Parameter-efficient prompt tuning with feature scaling/shifting and soft hierarchical textual guidance for few-shot weakly supervised WSI classification, reducing parameters and improving performance.
Details
Motivation: Few-shot weakly supervised WSI classification is crucial due to costly instance-level annotations. Existing VLM-based methods have limitations: prompt tuning increases parameters/inference overhead, and hard instance filtering causes information loss.
Method: 1) Parameter-efficient prompt tuning via scaling/shifting features in text encoder. 2) WSI representation learning with soft hierarchical textual guidance (no hard instance filtering), leveraging both VLM knowledge and WSI hierarchical structure.
Result: Consistent improvements up to 10.9%, 7.8%, and 13.8% on breast, lung, and ovarian cancer datasets over SOTA. Reduces trainable parameters by 18.1% (breast/lung) and 5.8% (ovarian). Excels at weakly-supervised tumor localization.
Conclusion: Proposed method effectively addresses computational efficiency and information loss issues in FSWC, achieving superior performance with fewer parameters while maintaining localization capabilities.
Abstract: Whole Slide Images (WSIs) are giga-pixel in scale and are typically partitioned into small instances in WSI classification pipelines for computational feasibility. However, obtaining extensive instance-level annotations is costly, making few-shot weakly supervised WSI classification (FSWC) crucial for learning from limited slide-level labels. Recently, pre-trained vision-language models (VLMs) have been adopted in FSWC, yet they exhibit several limitations. Existing prompt tuning methods in FSWC substantially increase both the number of trainable parameters and inference overhead. Moreover, current methods discard instances with low alignment to text embeddings from VLMs, potentially leading to information loss. To address these challenges, we propose two key contributions. First, we introduce a new parameter-efficient prompt tuning method by scaling and shifting features in the text encoder, which significantly reduces the computational cost. Second, to leverage not only the pre-trained knowledge of VLMs, but also the inherent hierarchical structure of WSIs, we introduce a WSI representation learning approach with a soft hierarchical textual guidance strategy without utilizing hard instance filtering. Comprehensive evaluations on pathology datasets covering breast, lung, and ovarian cancer types demonstrate consistent improvements up to 10.9%, 7.8%, and 13.8%, respectively, over the state-of-the-art methods in FSWC. Our method reduces the number of trainable parameters by 18.1% on both breast and lung cancer datasets, and 5.8% on the ovarian cancer dataset, while also excelling at weakly-supervised tumor localization. Code at https://github.com/Jayanie/HIPSS.
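The "scaling and shifting features" idea is a well-known parameter-efficient pattern: instead of fine-tuning dense weight matrices, learn only a per-channel scale and shift applied to frozen features. A minimal sketch (the class name and exact placement inside the paper's text encoder are assumptions) makes the parameter savings concrete:

```python
import numpy as np

class ScaleShift:
    """Learn only a per-channel scale (gamma) and shift (beta), applied to
    frozen intermediate features: 2*d parameters instead of d*d."""
    def __init__(self, dim):
        self.gamma = np.ones(dim)   # identity scale at initialization
        self.beta = np.zeros(dim)   # zero shift at initialization

    def __call__(self, x):          # x: (..., dim) frozen features
        return x * self.gamma + self.beta

    def num_params(self):
        return self.gamma.size + self.beta.size

dim = 512
ssf = ScaleShift(dim)
x = np.random.default_rng(0).normal(size=(4, dim))
assert np.allclose(ssf(x), x)        # identity at initialization
assert ssf.num_params() == 2 * dim   # 1,024 vs. 262,144 for a full d x d layer
```

Initializing to the identity means training starts from the frozen model's behavior exactly, which is what makes such modules safe to insert into a pretrained encoder.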
[325] Back to Point: Exploring Point-Language Models for Zero-Shot 3D Anomaly Detection
Kaiqiang Li, Gang Li, Mingle Zhou, Min Li, Delong Han, Jin Wan
Main category: cs.CV
TL;DR: BTP is a zero-shot 3D anomaly detection framework that aligns 3D point cloud features with textual embeddings using pre-trained Point-Language Models, avoiding 2D rendering to preserve geometric details.
Details
Motivation: Existing approaches render 3D point clouds into 2D images and use Vision-Language Models, which discard geometric details and have limited sensitivity to local anomalies. The authors want to leverage intrinsic 3D representations and pre-trained Point-Language Models for better zero-shot 3D anomaly detection.
Method: Proposes BTP framework that aligns multi-granularity patch features from 3D point clouds with textual representations for localized anomaly detection. Incorporates geometric descriptors to enhance sensitivity to structural anomalies. Uses joint representation learning with auxiliary point cloud data to improve robustness and enrich anomaly semantics.
Result: Extensive experiments on Real3D-AD and Anomaly-ShapeNet datasets demonstrate superior performance in zero-shot 3D anomaly detection compared to existing methods.
Conclusion: BTP effectively leverages pre-trained Point-Language Models for zero-shot 3D anomaly detection by aligning 3D point cloud and textual embeddings while preserving geometric details, outperforming 2D rendering approaches.
Abstract: Zero-shot (ZS) 3D anomaly detection is crucial for reliable industrial inspection, as it enables detecting and localizing defects without requiring any target-category training data. Existing approaches render 3D point clouds into 2D images and leverage pre-trained Vision-Language Models (VLMs) for anomaly detection. However, such strategies inevitably discard geometric details and exhibit limited sensitivity to local anomalies. In this paper, we revisit intrinsic 3D representations and explore the potential of pre-trained Point-Language Models (PLMs) for ZS 3D anomaly detection. We propose BTP (Back To Point), a novel framework that effectively aligns 3D point cloud and textual embeddings. Specifically, BTP aligns multi-granularity patch features with textual representations for localized anomaly detection, while incorporating geometric descriptors to enhance sensitivity to structural anomalies. Furthermore, we introduce a joint representation learning strategy that leverages auxiliary point cloud data to improve robustness and enrich anomaly semantics. Extensive experiments on Real3D-AD and Anomaly-ShapeNet demonstrate that BTP achieves superior performance in ZS 3D anomaly detection. Code will be available at https://github.com/wistful-8029/BTP-3DAD.
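The text-anchored zero-shot scoring that frameworks like BTP build on can be sketched in a few lines: compare each patch feature against "normal" and "anomalous" text embeddings and take the softmax mass on the anomalous prompt as the patch's anomaly score. The embeddings below are random stand-ins for a real point-language model's outputs, and the temperature is an assumption:

```python
import numpy as np

def patch_anomaly_scores(patch_feats, text_normal, text_anom, tau=0.07):
    """Per-patch P(anomalous) from cosine similarity to two text anchors."""
    def unit(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)
    f = unit(patch_feats)
    t = unit(np.stack([text_normal, text_anom]))   # (2, d) text anchors
    logits = f @ t.T / tau                         # cosine similarity / temp
    logits -= logits.max(axis=-1, keepdims=True)
    p = np.exp(logits)
    p /= p.sum(axis=-1, keepdims=True)
    return p[:, 1]                                 # softmax mass on "anomalous"

rng = np.random.default_rng(1)
t_norm, t_anom = rng.normal(size=512), rng.normal(size=512)
# A patch aligned with the anomalous anchor scores high; one aligned with
# the normal anchor scores low.
feats = np.stack([t_norm, t_anom])
s = patch_anomaly_scores(feats, t_norm, t_anom)
assert s[0] < 0.1 and s[1] > 0.9
```

BTP's contributions sit on top of this primitive: multi-granularity patches for localization, geometric descriptors for structural sensitivity, and joint training for richer anomaly semantics.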
[326] VIGIL: Part-Grounded Structured Reasoning for Generalizable Deepfake Detection
Xinghan Li, Junhao Xu, Jingjing Chen
Main category: cs.CV
TL;DR: VIGIL is a part-centric structured forensic framework for interpretable deepfake detection using multimodal LLMs, featuring a plan-then-examine pipeline with stage-gated evidence injection and part-aware reinforcement learning.
Details
Motivation: Current MLLM-based deepfake detection methods combine evidence generation and manipulation localization into a unified step, which blurs the boundary between faithful observations and hallucinated explanations, leading to unreliable conclusions. The paper aims to create a more interpretable and reliable forensic framework inspired by expert forensic practice.
Method: VIGIL uses a part-centric structured forensic framework with: 1) Plan-then-examine pipeline: first plans which facial parts to inspect based on global visual cues, then examines each part with independently sourced forensic evidence; 2) Stage-gated injection mechanism: delivers part-level forensic evidence only during examination to prevent bias; 3) Progressive three-stage training with reinforcement learning using part-aware rewards for anatomical validity and evidence-conclusion coherence; 4) OmniFake benchmark: hierarchical 5-Level benchmark for rigorous generalizability evaluation.
Result: Extensive experiments on OmniFake and cross-dataset evaluations show VIGIL consistently outperforms both expert detectors and concurrent MLLM-based methods across all generalizability levels. The model trained on only three foundational generators demonstrates strong performance even on in-the-wild social-media data.
Conclusion: VIGIL provides a more reliable and interpretable approach to deepfake detection by structuring the forensic process into distinct planning and examination stages, preventing evidence contamination and hallucination while maintaining strong generalization capabilities across diverse manipulation scenarios.
Abstract: Multimodal large language models (MLLMs) offer a promising path toward interpretable deepfake detection by generating textual explanations. However, the reasoning process of current MLLM-based methods combines evidence generation and manipulation localization into a unified step. This combination blurs the boundary between faithful observations and hallucinated explanations, leading to unreliable conclusions. Building on this, we present VIGIL, a part-centric structured forensic framework inspired by expert forensic practice through a plan-then-examine pipeline: the model first plans which facial parts warrant inspection based on global visual cues, then examines each part with independently sourced forensic evidence. A stage-gated injection mechanism delivers part-level forensic evidence only during examination, ensuring that part selection remains driven by the model’s own perception rather than biased by external signals. We further propose a progressive three-stage training paradigm whose reinforcement learning stage employs part-aware rewards to enforce anatomical validity and evidence–conclusion coherence. To enable rigorous generalizability evaluation, we construct OmniFake, a hierarchical 5-Level benchmark where the model, trained on only three foundational generators, is progressively tested up to in-the-wild social-media data. Extensive experiments on OmniFake and cross-dataset evaluations demonstrate that VIGIL consistently outperforms both expert detectors and concurrent MLLM-based methods across all generalizability levels.
[327] PEARL: Geometry Aligns Semantics for Training-Free Open-Vocabulary Semantic Segmentation
Gensheng Pei, Xiruo Jiang, Xinhao Cai, Tao Chen, Yazhou Yao, Byeungwoo Jeon
Main category: cs.CV
TL;DR: PEARL is a training-free open-vocabulary semantic segmentation method that uses Procrustes alignment and text-aware Laplacian propagation for efficient cross-modal reasoning without retraining.
Details
Motivation: Existing training-free OVSS methods either rely on heavy post-processing, handle text and vision modalities in isolation, or introduce complex auxiliary pipelines that increase latency and compromise design simplicity. There's a need for a more efficient approach that better utilizes cross-modal geometry.
Method: PEARL follows an align-then-propagate principle: 1) Procrustes alignment performs orthogonal projection in the last self-attention block, rotating keys toward the query subspace via stable polar iteration; 2) Text-aware Laplacian propagation refines per-pixel logits on a small grid through a confidence-weighted, text-guided graph solve where text provides data-trust signals and neighbor gating, while image gradients preserve boundaries.
Result: PEARL sets new state-of-the-art in training-free OVSS without extra data or auxiliary backbones across standard benchmarks, achieving superior performance under both with-background and without-background protocols.
Conclusion: PEARL provides a compact, efficient training-free solution for open-vocabulary semantic segmentation that effectively leverages cross-modal geometry while maintaining simplicity and low latency.
Abstract: Training-free open-vocabulary semantic segmentation (OVSS) promises rapid adaptation to new label sets without retraining. Yet, many methods rely on heavy post-processing or handle text and vision in isolation, leaving cross-modal geometry underutilized. Others introduce auxiliary vision backbones or multi-model pipelines, which increase complexity and latency while compromising design simplicity. We present PEARL (Procrustes alignment with text-aware Laplacian propagation), a compact two-step inference that follows an align-then-propagate principle. The Procrustes alignment step performs an orthogonal projection inside the last self-attention block, rotating keys toward the query subspace via a stable polar iteration. The text-aware Laplacian propagation then refines per-pixel logits on a small grid through a confidence-weighted, text-guided graph solve: text provides both a data-trust signal and neighbor gating, while image gradients preserve boundaries. In this work, our method is fully training-free, plug-and-play, and uses only fixed constants, adding minimal latency with a small per-head projection and a few conjugate-gradient steps. Our approach, PEARL, sets a new state-of-the-art in training-free OVSS without extra data or auxiliary backbones across standard benchmarks, achieving superior performance under both with-background and without-background protocols.
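The orthogonal Procrustes problem behind the alignment step has a closed-form solution given by the orthogonal polar factor of K^T Q, which a Newton-type polar iteration computes without an explicit SVD. A minimal NumPy sketch, assuming square nonsingular inputs (illustrative, not the paper's implementation):

```python
import numpy as np

def polar_orthogonal(M, iters=20):
    """Orthogonal polar factor of a nonsingular matrix M via Newton's
    iteration X <- (X + X^{-T}) / 2; equals U @ Vt from the SVD of M."""
    X = M.copy()
    for _ in range(iters):
        X = 0.5 * (X + np.linalg.inv(X).T)
    return X

def align_keys(K, Q):
    """Rotate keys toward the query subspace: the rotation maximizing
    trace(R.T @ K.T @ Q) is the polar factor of K.T @ Q (Procrustes)."""
    return K @ polar_orthogonal(K.T @ Q)
```

The iteration is attractive inside an attention block because it is a few matrix products and inverses rather than a full SVD per head.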
[328] PROBE: Diagnosing Residual Concept Capacity in Erased Text-to-Video Diffusion Models
Yiwei Xie, Zheng Zhang, Ping Liu
Main category: cs.CV
TL;DR: PROBE introduces a diagnostic protocol to test if erased concepts in text-to-video diffusion models can be reactivated, revealing that current erasure methods only achieve output-level suppression rather than true representational removal.
Details
Motivation: Current evaluation of concept erasure in T2V models only checks if target concepts are absent from generated frames, treating output-level suppression as evidence of representational removal. This paper questions whether erased concepts can actually be reactivated, suggesting current methods may not truly remove representations.
Method: PROBE optimizes lightweight pseudo-token embeddings through denoising reconstruction with latent alignment constraints that anchor recovery to spatiotemporal structure. It uses multi-level evaluation including classifier detection, semantic similarity, temporal reactivation analysis, and human validation across three T2V architectures, three concept categories, and three erasure strategies.
Result: All tested erasure methods leave measurable residual capacity whose robustness correlates with intervention depth. The paper identifies temporal re-emergence as a video-specific failure mode where suppressed concepts progressively resurface across frames, invisible to frame-level metrics.
Conclusion: Current concept erasure methods achieve only output-level suppression rather than true representational removal. The PROBE protocol provides a reproducible safety auditing framework for T2V models.
Abstract: Concept erasure techniques for text-to-video (T2V) diffusion models report substantial suppression of sensitive content, yet current evaluation is limited to checking whether the target concept is absent from generated frames, treating output-level suppression as evidence of representational removal. We introduce PROBE, a diagnostic protocol that quantifies the reactivation potential of erased concepts in T2V models. With all model parameters frozen, PROBE optimizes a lightweight pseudo-token embedding through a denoising reconstruction objective combined with a novel latent alignment constraint that anchors recovery to the spatiotemporal structure of the original concept. We make three contributions: (1) a multi-level evaluation framework spanning classifier-based detection, semantic similarity, temporal reactivation analysis, and human validation; (2) systematic experiments across three T2V architectures, three concept categories, and three erasure strategies revealing that all tested methods leave measurable residual capacity whose robustness correlates with intervention depth; and (3) the identification of temporal re-emergence, a video-specific failure mode where suppressed concepts progressively resurface across frames, invisible to frame-level metrics. These findings suggest that current erasure methods achieve output-level suppression rather than representational removal. We release our protocol to support reproducible safety auditing. Our code is available at https://github.com/YiweiXie/PRObingBasedEvaluation.
[329] From Part to Whole: 3D Generative World Model with an Adaptive Structural Hierarchy
Bi’an Du, Daizong Liu, Pufan Li, Wei Hu
Main category: cs.CV
TL;DR: A novel part-to-whole 3D generative world model that autonomously discovers latent structural slots from image tokens, enabling adaptive part-whole hierarchy learning for single-image 3D generation across diverse categories.
Details
Motivation: Existing single-image 3D generation methods struggle with reliable generalization across diverse semantic categories and structural complexity, often leading to overfitting, fragmented components, and limited compositional generalization with novel object layouts.
Method: Proposes a part-to-whole 3D generative world model with an adaptive slot-gating mechanism that dynamically determines slot activation probabilities and consolidates redundant slots. Uses a class-agnostic prototype bank for cross-category shape sharing, and a lightweight 3D denoiser with unified diffusion objectives.
Result: Experiments show consistent gains in cross-category transfer and part-count extrapolation, with ablations confirming benefits of the prototype bank for shape-prior sharing and slot-gating for structural adaptation.
Conclusion: The approach successfully rethinks single-image 3D generation as learning adaptive part-whole hierarchies in flexible 3D latent space, enabling better generalization across diverse categories and structural complexity.
Abstract: Single-image 3D generation lies at the core of vision-to-graphics models in the real world. However, it remains a fundamental challenge to achieve reliable generalization across diverse semantic categories and highly variable structural complexity under sparse supervision. Existing approaches typically model objects in a monolithic manner or rely on a fixed number of parts, including recent part-aware models such as PartCrafter, which still require a labor-intensive user-specified part count. Such designs easily lead to overfitting, fragmented or missing structural components, and limited compositional generalization when encountering novel object layouts. To this end, this paper rethinks single-image 3D generation as learning an adaptive part-whole hierarchy in the flexible 3D latent space. We present a novel part-to-whole 3D generative world model that autonomously discovers latent structural slots by inferring soft and compositional masks directly from image tokens. Specifically, an adaptive slot-gating mechanism dynamically determines the slot-wise activation probabilities and smoothly consolidates redundant slots within different objects, ensuring that the emergent structure remains compact yet expressive across categories. Each distilled slot is then aligned to a learnable, class-agnostic prototype bank, enabling powerful cross-category shape sharing and denoising through universal geometric prototypes in the real world. Furthermore, a lightweight 3D denoiser is introduced to reconstruct geometry and appearance via unified diffusion objectives. Experiments show consistent gains in cross-category transfer and part-count extrapolation, and ablations confirm complementary benefits of the prototype bank for shape-prior sharing as well as slot-gating for structural adaptation.
[330] Revisiting Weakly-Supervised Video Scene Graph Generation via Pair Affinity Learning
Minseok Kang, Minhyeok Lee, Minjung Kim, Jungho Lee, Donghyeong Kim, Sungmin Woo, Inseok Jeon, Sangyoun Lee
Main category: cs.CV
TL;DR: Weakly-supervised video scene graph generation method that introduces learnable pair affinity to filter non-interactive object pairs, addressing noise from off-the-shelf detectors in video understanding.
Details
Motivation: Current weakly-supervised video scene graph generation methods rely on off-the-shelf detectors that detect all visible objects indiscriminately, creating noisy object pairs that overwhelm relation models, unlike fully-supervised pipelines that filter non-interactive objects.
Method: Proposes Pair Affinity Learning and Scoring (PALS) with learnable pair affinity to estimate interaction likelihood, Pair Affinity Modulation (PAM) for contextual reasoning, and Relation-Aware Matching (RAM) using vision-language grounding for cleaner supervision in pseudo-label generation.
Result: Extensive experiments on Action Genome dataset show substantial improvements across different baselines and backbones, achieving state-of-the-art performance in weakly-supervised video scene graph generation.
Conclusion: The proposed approach effectively addresses the fundamental discrepancy between weakly-supervised and fully-supervised pipelines by filtering non-interactive object pairs, leading to significant performance gains in video scene graph generation with reduced annotation costs.
Abstract: Weakly-supervised video scene graph generation (WS-VSGG) aims to parse video content into structured relational triplets without bounding box annotations and with only sparse temporal labeling, significantly reducing annotation costs. Without ground-truth bounding boxes, these methods rely on off-the-shelf detectors to generate object proposals, yet largely overlook a fundamental discrepancy from fully-supervised pipelines. Fully-supervised detectors implicitly filter out non-interactive objects, while off-the-shelf detectors indiscriminately detect all visible objects, overwhelming relation models with noisy pairs. We address this by introducing a learnable pair affinity that estimates the likelihood of interaction between subject-object pairs. Through Pair Affinity Learning and Scoring (PALS), pair affinity is incorporated into inference-time ranking and further integrated into contextual reasoning through Pair Affinity Modulation (PAM), enabling the model to suppress non-interactive pairs and focus on relationally meaningful ones. To provide cleaner supervision for pair affinity learning, we further propose Relation-Aware Matching (RAM), which leverages vision-language grounding to resolve class-level ambiguity in pseudo-label generation. Extensive experiments on Action Genome demonstrate that our approach consistently yields substantial improvements across different baselines and backbones, achieving state-of-the-art WS-VSGG performance.
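At inference time, the filtering idea reduces to ranking candidate subject-object pairs by a learned affinity and keeping only the most likely interactive fraction. A toy sketch with hypothetical pair tuples, scores, and a `keep_frac` parameter (the real PALS module learns the affinity end-to-end):

```python
import numpy as np

def rank_pairs(pairs, affinity, keep_frac=0.5):
    """Rank subject-object pairs by predicted interaction affinity and
    keep only the top fraction for downstream relation prediction."""
    order = np.argsort(-np.asarray(affinity))  # highest affinity first
    k = max(1, int(round(keep_frac * len(pairs))))
    return [pairs[i] for i in order[:k]]
```

This mirrors the implicit filtering a fully-supervised detector performs, but as an explicit, learnable scoring step.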
[331] Exploring Multimodal Prompts For Unsupervised Continuous Anomaly Detection
Mingle Zhou, Jiahui Liu, Jin Wan, Gang Li, Min Li
Main category: cs.CV
TL;DR: Proposes a multimodal prompting framework for unsupervised continual anomaly detection that integrates visual and textual information to better capture normality patterns across sequential tasks.
Details
Motivation: Existing unsupervised continual anomaly detection methods relying solely on visual information are insufficient for capturing complex normality manifolds, limiting anomaly detection accuracy in evolving environments.
Method: Introduces Continual Multimodal Prompt Memory Bank (CMPMB) to distill prototypical normal patterns from both visual and textual domains across tasks, and a Defect-Semantic-Guided Adaptive Fusion Mechanism (DSG-AFM) with Adaptive Normalization Module and Dynamic Fusion Strategy for enhanced accuracy and robustness.
Result: Achieves state-of-the-art performance on MVTec AD and VisA datasets for both image-level AUROC and pixel-level AUPR metrics.
Conclusion: Multimodal prompting effectively enhances unsupervised continual anomaly detection by leveraging both visual and textual information to better represent normality across sequential learning tasks.
Abstract: Unsupervised Continuous Anomaly Detection (UCAD) is gaining attention for effectively addressing the catastrophic forgetting and heavy computational burden issues in traditional Unsupervised Anomaly Detection (UAD). However, existing UCAD approaches that rely solely on visual information are insufficient to capture the manifold of normality in complex scenes, thereby impeding further gains in anomaly detection accuracy. To overcome this limitation, we propose an unsupervised continual anomaly detection framework grounded in multimodal prompting. Specifically, we introduce a Continual Multimodal Prompt Memory Bank (CMPMB) that progressively distills and retains prototypical normal patterns from both visual and textual domains across consecutive tasks, yielding a richer representation of normality. Furthermore, we devise a Defect-Semantic-Guided Adaptive Fusion Mechanism (DSG-AFM) that integrates an Adaptive Normalization Module (ANM) with a Dynamic Fusion Strategy (DFS) to jointly enhance detection accuracy and adversarial robustness. Benchmark experiments on MVTec AD and VisA datasets show that our approach achieves state-of-the-art (SOTA) performance on image-level AUROC and pixel-level AUPR metrics.
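A prototype memory bank turns unsupervised anomaly detection into a nearest-prototype distance query: features far from every remembered normal pattern score high. The sketch below is a deliberately simplified, unimodal stand-in for CMPMB (which stores visual and textual prompts per task), not the paper's mechanism:

```python
import numpy as np

def anomaly_score(feat, memory_bank):
    """Score a test feature by its distance to the nearest stored normal
    prototype; large distance from all prototypes suggests an anomaly."""
    dists = np.linalg.norm(memory_bank - feat, axis=1)
    return float(dists.min())
```

Continual learning then amounts to appending distilled prototypes per task instead of retraining, which is what sidesteps catastrophic forgetting.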
[332] Rethinking SAR ATR: A Target-Aware Frequency-Spatial Enhancement Framework with Noise-Resilient Knowledge Guidance
Yansong Lin, Zihan Cheng, Jielei Wang, Guoming Lua, Zongyong Cui
Main category: cs.CV
TL;DR: A target-aware frequency-spatial enhancement framework (FSCE) with noise-resilient knowledge guidance for SAR automatic target recognition, improving robustness to speckle noise through frequency-spatial feature enhancement and teacher-student distillation.
Details
Motivation: SAR imagery suffers from coherent speckle noise that obscures target features, degrading recognition accuracy and limiting model generalization in marine navigation and disaster monitoring applications.
Method: Proposes FSCE framework with frequency-spatial shallow feature adaptive enhancement (DSAF) module using spatial multi-scale convolution and frequency-domain wavelet convolution. Uses teacher-student learning with online knowledge distillation to guide focus on target regions and enhance noise robustness.
Result: DSAFNet-L achieves competitive/superior performance on MSTAR, FUSARShip and OpenSARShip datasets; DSAFNet-M reduces model complexity while maintaining comparable accuracy. Framework shows strong cross-model generalization.
Conclusion: The FSCE framework effectively enhances SAR target recognition under noisy conditions through collaborative optimization of attention transfer and noise-resilient representation learning.
Abstract: Synthetic aperture radar automatic target recognition (SAR ATR) is of considerable importance in marine navigation and disaster monitoring. However, the coherent speckle noise inherent in SAR imagery often obscures salient target features, leading to degraded recognition accuracy and limited model generalization. To address this issue, this paper proposes a target-aware frequency-spatial enhancement framework with noise-resilient knowledge guidance (FSCE) for SAR target recognition. The proposed framework incorporates a frequency-spatial shallow feature adaptive enhancement (DSAF) module, which processes shallow features through spatial multi-scale convolution and frequency-domain wavelet convolution. In addition, a teacher-student learning paradigm combined with an online knowledge distillation method (KD) is employed to guide the student network to focus more effectively on target regions, thereby enhancing its robustness to high-noise backgrounds. Through the collaborative optimization of attention transfer and noise-resilient representation learning, the proposed approach significantly improves the stability of target recognition under noisy conditions. Based on the FSCE framework, two network architectures with different performance emphases are developed: lightweight DSAFNet-M and high-precision DSAFNet-L. Extensive experiments are conducted on the MSTAR, FUSARShip and OpenSARShip datasets. The results show that DSAFNet-L achieves competitive or superior performance compared with various methods on three datasets; DSAFNet-M significantly reduces the model complexity while maintaining comparable accuracy. These results indicate that the proposed FSCE framework exhibits strong cross-model generalization.
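The frequency branch of DSAF builds on wavelet decompositions; a one-level orthonormal 2D Haar transform is the simplest instance of the low/high-frequency split involved (illustrative only; the paper's wavelet convolution is learned):

```python
import numpy as np

def haar_dwt2(x):
    """One-level orthonormal 2D Haar transform of an even-sized image:
    returns the low-frequency approximation and three detail bands."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    ll = (a + b + c + d) / 2   # approximation (low-low)
    lh = (a + b - c - d) / 2   # vertical detail
    hl = (a - b + c - d) / 2   # horizontal detail
    hh = (a - b - c + d) / 2   # diagonal detail
    return ll, lh, hl, hh
```

Speckle noise concentrates in the detail bands, which is why processing features in this domain can help separate targets from clutter.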
[333] CataractSAM-2: A Domain-Adapted Model for Anterior Segment Surgery Segmentation and Scalable Ground-Truth Annotation
Mohammad Eslami, Dhanvinkumar Ganeshkumar, Saber Kazeminasab, Michael G. Morley, Michael V. Boland, Michael M. Lin, John B. Miller, David S. Friedman, Nazlee Zebardast, Lucia Sobrin, Tobias Elze
Main category: cs.CV
TL;DR: CataractSAM-2 is a domain-adapted version of Segment Anything Model 2 for real-time semantic segmentation of cataract surgery videos, with an interactive annotation framework to reduce labeling burden and strong zero-shot generalization to other eye surgeries.
Details
Motivation: The need for precise intraoperative perception in robotic-assisted and computer-guided surgical systems, particularly for cataract ophthalmic surgery, combined with the burden of manual labeling for creating high-quality surgical video datasets.
Method: Domain adaptation of Meta's Segment Anything Model 2 (SAM-2) for cataract surgery videos, plus an interactive annotation framework combining sparse prompts with video-based mask propagation to accelerate dataset creation.
Result: Achieves real-time semantic segmentation with high accuracy for cataract surgery videos, significantly reduces annotation time, demonstrates strong zero-shot generalization to glaucoma trabeculectomy procedures, and releases model/toolkit as open-source.
Conclusion: CataractSAM-2 establishes a foundation for expanding anterior ophthalmic surgical datasets and advancing real-time AI-driven solutions in medical robotics and surgical video understanding through domain adaptation and efficient annotation tools.
Abstract: We present CataractSAM-2, a domain-adapted extension of Meta’s Segment Anything Model 2, designed for real-time semantic segmentation of cataract ophthalmic surgery videos with high accuracy. Positioned at the intersection of computer vision and medical robotics, CataractSAM-2 enables precise intraoperative perception crucial for robotic-assisted and computer-guided surgical systems. Furthermore, to alleviate the burden of manual labeling, we introduce an interactive annotation framework that combines sparse prompts with video-based mask propagation. This tool significantly reduces annotation time and facilitates the scalable creation of high-quality ground-truth masks, accelerating dataset development for ocular anterior segment surgeries. We also demonstrate the model’s strong zero-shot generalization to glaucoma trabeculectomy procedures, confirming its cross-procedural utility and potential for broader surgical applications. The trained model and annotation toolkit are released as open-source resources, establishing CataractSAM-2 as a foundation for expanding anterior ophthalmic surgical datasets and advancing real-time AI-driven solutions in medical robotics, as well as surgical video understanding.
[334] Rethinking Visual Privacy: A Compositional Privacy Risk Framework for Severity Assessment with VLMs
Efthymios Tsaprazlis, Tiantian Feng, Anil Ramakrishna, Sai Praneeth Karimireddy, Rahul Gupta, Shrikanth Narayanan
Main category: cs.CV
TL;DR: The paper introduces a compositional privacy risk taxonomy (CPRT) for visual privacy assessment, arguing that privacy is compositional rather than binary, and develops a framework with graded severity levels and scoring function for evaluating privacy risks in images.
Details
Motivation: Existing visual privacy benchmarks treat privacy as binary (private/non-private) based on visible sensitive content, but privacy is fundamentally compositional: benign attributes in isolation can combine to produce severe privacy violations. There's a need for a more nuanced framework that captures compositional privacy risks.
Method: Introduces Compositional Privacy Risk Taxonomy (CPRT), a regulation-aware framework organizing visual attributes by standalone identifiability and compositional harm potential. Defines four graded severity levels with an interpretable scoring function. Constructs a taxonomy-aligned dataset of 6.7K images with ground-truth compositional risk scores. Evaluates frontier and open-weight VLMs, and introduces a deployable 8B supervised fine-tuned model for compositional privacy assessment.
Result: Frontier models align well with compositional severity when given structured guidance but systematically underestimate composition-driven risks. Smaller models struggle with graded privacy reasoning. The introduced 8B SFT model closely matches frontier-level performance on compositional privacy assessment.
Conclusion: Privacy assessment requires compositional reasoning beyond binary classification. The CPRT framework provides a more nuanced approach to visual privacy evaluation, and the 8B SFT model demonstrates that deployable models can achieve frontier-level performance on compositional privacy assessment tasks.
Abstract: Existing visual privacy benchmarks largely treat privacy as a binary property, labeling images as private or non-private based on visible sensitive content. We argue that privacy is fundamentally compositional. Attributes that are benign in isolation may combine to produce severe privacy violations. We introduce the Compositional Privacy Risk Taxonomy (CPRT), a regulation-aware framework that organizes visual attributes according to standalone identifiability and compositional harm potential. CPRT defines four graded severity levels and is paired with an interpretable scoring function that assigns continuous privacy severity scores. We further construct a taxonomy-aligned dataset of 6.7K images and derive ground-truth compositional risk scores. By evaluating frontier and open-weight VLMs we find that frontier models align well with compositional severity when provided structured guidance, but systematically underestimate composition-driven risks. Smaller models struggle to internalize graded privacy reasoning. To bridge this gap, we introduce a deployable 8B supervised fine-tuned (SFT) model that closely matches frontier-level performance on compositional privacy assessment.
[335] HACMatch: Semi-Supervised Rotation Regression with Hardness-Aware Curriculum Pseudo Labeling
Mei Li, Huayi Zhou, Suizhi Huang, Yuxiang Lu, Yue Ding, Hongtao Lu
Main category: cs.CV
TL;DR: A hardness-aware curriculum learning framework for semi-supervised 3D rotation regression from 2D images, using dynamic pseudo-label selection and structured data augmentation.
Details
Motivation: Existing rotation regression models require large labeled datasets or additional information (point clouds, CAD models). Semi-supervised learning with limited labeled 2D images is valuable, but current methods like FisherMatch have rigid pseudo-label filtering that fails to distinguish reliable from unreliable samples.
Method: Proposes hardness-aware curriculum learning that dynamically selects pseudo-labeled samples based on difficulty (easy to complex). Includes multi-stage and adaptive curriculum strategies to replace fixed-threshold filtering. Also introduces structured data augmentation tailored for rotation estimation, assembling composite images from augmented patches while preserving geometric integrity.
Result: Comprehensive experiments on PASCAL3D+ and ObjectNet3D show the method outperforms existing supervised and semi-supervised baselines, particularly in low-data regimes.
Conclusion: The hardness-aware curriculum learning framework and structured augmentation approach are effective for semi-supervised 3D rotation regression from limited 2D images.
Abstract: Regressing 3D rotations of objects from 2D images is a crucial yet challenging task, with broad applications in autonomous driving, virtual reality, and robotic control. Existing rotation regression models often rely on large amounts of labeled data for training or require additional information beyond 2D images, such as point clouds or CAD models. Therefore, exploring semi-supervised rotation regression using only a limited number of labeled 2D images is highly valuable. While recent work FisherMatch introduces semi-supervised learning to rotation regression, it suffers from rigid entropy-based pseudo-label filtering that fails to effectively distinguish between reliable and unreliable unlabeled samples. To address this limitation, we propose a hardness-aware curriculum learning framework that dynamically selects pseudo-labeled samples based on their difficulty, progressing from easy to complex examples. We introduce both multi-stage and adaptive curriculum strategies to replace fixed-threshold filtering with more flexible, hardness-aware mechanisms. Additionally, we present a novel structured data augmentation strategy specifically tailored for rotation estimation, which assembles composite images from augmented patches to introduce feature diversity while preserving critical geometric integrity. Comprehensive experiments on PASCAL3D+ and ObjectNet3D demonstrate that our method outperforms existing supervised and semi-supervised baselines, particularly in low-data regimes, validating the effectiveness of our curriculum learning framework and structured augmentation approach.
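Hardness-aware curriculum selection can be sketched by treating prediction entropy as a difficulty proxy and widening the admitted fraction of pseudo-labeled samples over training; `start_frac` and `end_frac` are hypothetical parameters, not values from the paper:

```python
import numpy as np

def curriculum_select(entropies, step, total_steps, start_frac=0.2, end_frac=1.0):
    """Admit the easiest (lowest-entropy) pseudo-labeled samples first,
    linearly widening the admitted fraction as training progresses."""
    frac = start_frac + (end_frac - start_frac) * min(step / total_steps, 1.0)
    k = max(1, int(round(frac * len(entropies))))
    order = np.argsort(entropies)  # low entropy = confident = easy
    return order[:k]
```

Unlike a fixed entropy threshold, the admitted set adapts to the score distribution at every step, which is the flexibility the curriculum strategies aim for.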
[336] SARe: Structure-Aware Large-Scale 3D Fragment Reassembly
Hanze Jia, Chunshi Wang, Yuxiao Yang, Zhonghua Jiang, Yawei Luo, Shuainan Ye, Tan Tang
Main category: cs.CV
TL;DR: SARe is a generative framework for 3D fragment reassembly with explicit contact modeling, featuring SARe-Gen for assembly generation and SARe-Refine for inference-time refinement to handle many-fragment scenarios.
Details
Motivation: Existing 3D fragment reassembly methods struggle with many fragments due to unreliable contact reasoning and inaccurate fragment adjacencies, leading to cascading failures. The problem is challenging because the target shape is unknown and fragments provide weak semantic cues.
Method: SARe uses a two-stage approach: 1) SARe-Gen jointly predicts fracture-surface token probabilities and an inter-fragment contact graph using query-point-based conditioning and aligned local geometric tokens from a frozen geometry encoder; 2) SARe-Refine performs inference-time refinement by verifying candidate contacts with geometric-consistency checks, selecting reliable substructures, and resampling uncertain regions.
Result: State-of-the-art performance across synthetic fractures, simulated fractures from scanned real objects, and real physically fractured scans, with more graceful degradation and higher success rates as fragment count increases in large-scale reassembly.
Conclusion: SARe addresses key challenges in many-fragment 3D reassembly through explicit contact modeling and inference-time refinement, demonstrating robust performance across diverse fracture scenarios.
Abstract: 3D fragment reassembly aims to recover the rigid poses of unordered fragment point clouds or meshes in a common object coordinate system to reconstruct the complete shape. The problem becomes particularly challenging as the number of fragments grows, since the target shape is unknown and fragments provide weak semantic cues. Existing end-to-end approaches are prone to cascading failures due to unreliable contact reasoning, most notably inaccurate fragment adjacencies. To address this, we propose Structure-Aware Reassembly (SARe), a generative framework with SARe-Gen for Euclidean-space assembly generation and SARe-Refine for inference-time refinement, with explicit contact modeling. SARe-Gen jointly predicts fracture-surface token probabilities and an inter-fragment contact graph to localize contact regions and infer candidate adjacencies. It adopts a query-point-based conditioning scheme and extracts aligned local geometric tokens at query locations from a frozen geometry encoder, yielding queryable structural representations without additional structural pretraining. We further introduce an inference-time refinement stage, SARe-Refine. By verifying candidate contact edges with geometric-consistency checks, it selects reliable substructures and resamples the remaining uncertain regions while keeping verified parts fixed, leading to more stable and consistent assemblies in the many-fragment regime. We evaluate SARe across three settings, including synthetic fractures, simulated fractures from scanned real objects, and real physically fractured scans. The results demonstrate state-of-the-art performance, with more graceful degradation and higher success rates as the fragment count increases in challenging large-scale reassembly.
[337] AdaEdit: Adaptive Temporal and Channel Modulation for Flow-Based Image Editing
Guandong Li, Zhaobin Chu
Main category: cs.CV
TL;DR: AdaEdit is a training-free adaptive editing framework for flow matching models that addresses the injection dilemma through progressive injection schedules and channel-selective latent perturbation.
Details
Motivation: Existing inversion-based image editing methods use fixed injection strategies that ignore heterogeneous injection demands across temporal and channel dimensions, leading to background preservation vs. content generation trade-offs.
Method: Two innovations: 1) Progressive Injection Schedule with continuous decay functions (sigmoid, cosine, linear) instead of binary cutoffs; 2) Channel-Selective Latent Perturbation that estimates per-channel importance and applies differentiated perturbation strengths.
Result: On PIE-Bench benchmark (700 images, 10 editing types), AdaEdit achieves 8.7% reduction in LPIPS, 2.6% improvement in SSIM, and 2.3% improvement in PSNR over baselines while maintaining competitive CLIP similarity.
Conclusion: AdaEdit effectively resolves the injection dilemma in flow matching models through adaptive strategies, enabling better balance between source preservation and target generation without training.
Abstract: Inversion-based image editing in flow matching models has emerged as a powerful paradigm for training-free, text-guided image manipulation. A central challenge in this paradigm is the injection dilemma: injecting source features during denoising preserves the background of the original image but simultaneously suppresses the model’s ability to synthesize edited content. Existing methods address this with fixed injection strategies – binary on/off temporal schedules, uniform spatial mixing ratios, and channel-agnostic latent perturbation – that ignore the inherently heterogeneous nature of injection demand across both the temporal and channel dimensions. In this paper, we present AdaEdit, a training-free adaptive editing framework that resolves this dilemma through two complementary innovations. First, we propose a Progressive Injection Schedule that replaces hard binary cutoffs with continuous decay functions (sigmoid, cosine, or linear), enabling a smooth transition from source-feature preservation to target-feature generation and eliminating feature discontinuity artifacts. Second, we introduce Channel-Selective Latent Perturbation, which estimates per-channel importance based on the distributional gap between the inverted and random latents and applies differentiated perturbation strengths accordingly – strongly perturbing edit-relevant channels while preserving structure-encoding channels. Extensive experiments on the PIE-Bench benchmark (700 images, 10 editing types) demonstrate that AdaEdit achieves an 8.7% reduction in LPIPS, a 2.6% improvement in SSIM, and a 2.3% improvement in PSNR over strong baselines, while maintaining competitive CLIP similarity. AdaEdit is fully plug-and-play and compatible with multiple ODE solvers including Euler, RF-Solver, and FireFlow. Code is available at https://github.com/leeguandong/AdaEdit
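The Progressive Injection Schedule replaces a hard on/off cutoff with a continuous decay of the source-feature injection weight over the denoising steps. A minimal sketch of the three decay shapes named in the paper; the exact parameterization and the sigmoid steepness `k` are assumptions:

```python
import math

def injection_weight(t, T, schedule="sigmoid", k=10.0):
    """Weight in [0, 1] for blending source features at denoising step t of T:
    ~1 early (preserve source), ~0 late (let the model generate the edit).
    The blend itself would be: feat = w * source_feat + (1 - w) * model_feat."""
    s = t / (T - 1)  # normalized progress in [0, 1]
    if schedule == "linear":
        return 1.0 - s
    if schedule == "cosine":
        return 0.5 * (1.0 + math.cos(math.pi * s))
    if schedule == "sigmoid":
        return 1.0 / (1.0 + math.exp(k * (s - 0.5)))
    raise ValueError(schedule)

# Smooth, monotone decay over a 50-step trajectory:
weights = [injection_weight(t, 50, "cosine") for t in range(50)]
```

Because the weight never jumps, there is no feature discontinuity at a cutoff step, which is the artifact the schedule is meant to eliminate.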
[338] 4DGS360: 360° Gaussian Reconstruction of Dynamic Objects from a Single Video
Jae Won Jang, Yeonjin Chang, Wonsik Shin, Juhwan Cho, Nojun Kwak
Main category: cs.CV
TL;DR: 4DGS360 is a diffusion-free framework for 360° dynamic object reconstruction from monocular video using 3D-native initialization and AnchorTAP3D tracker to overcome geometric ambiguity in occluded regions.
Details
Motivation: Existing methods fail to reconstruct consistent 360° geometry from monocular video because they rely too heavily on 2D-native priors, causing overfitting to visible surfaces and poor reconstruction of occluded regions.
Method: Proposes 4DGS360 with AnchorTAP3D tracker that uses confident 2D track points as anchors to produce reinforced 3D point trajectories, suppressing drift and providing reliable 3D-native initialization that preserves geometry in occluded regions.
Result: Achieves state-of-the-art performance on iPhone360 (new benchmark), iPhone, and DAVIS datasets, both qualitatively and quantitatively, enabling coherent 360° 4D reconstructions.
Conclusion: 4DGS360 successfully addresses the challenge of 360° dynamic object reconstruction from monocular video through advanced 3D-native initialization, outperforming existing methods and introducing a new benchmark for comprehensive evaluation.
Abstract: We introduce 4DGS360, a diffusion-free framework for 360$^{\circ}$ dynamic object reconstruction from casual monocular video. Existing methods often fail to reconstruct consistent 360$^{\circ}$ geometry, as their heavy reliance on 2D-native priors causes initial points to overfit to the visible surfaces in each training view. 4DGS360 addresses this challenge through an advanced 3D-native initialization that mitigates the geometric ambiguity of occluded regions. Our proposed 3D tracker, AnchorTAP3D, produces reinforced 3D point trajectories by leveraging confident 2D track points as anchors, suppressing drift and providing reliable initialization that preserves geometry in occluded regions. This initialization, combined with optimization, yields coherent 360$^{\circ}$ 4D reconstructions. We further present iPhone360, a new benchmark where test cameras are placed up to 135$^{\circ}$ apart from training views, enabling 360$^{\circ}$ evaluation that existing datasets cannot provide. Experiments show that 4DGS360 achieves state-of-the-art performance on the iPhone360, iPhone, and DAVIS datasets, both qualitatively and quantitatively.
[339] Efficient Zero-Shot AI-Generated Image Detection
Ryosuke Sonoda, Ramya Srinivasan
Main category: cs.CV
TL;DR: A training-free AI-generated image detection method using frequency perturbation sensitivity analysis that is computationally efficient and achieves state-of-the-art performance.
Details
Motivation: Current AI-generated image detectors face challenges: training-based methods have limited generalization to unseen images, while training-free approaches lack sensitivity to subtle discrepancies between real and synthetic images. There's a need for robust, efficient detection methods as text-to-image models produce increasingly realistic content.
Method: Proposes a training-free detection method that measures representation sensitivity to structured frequency perturbations. The approach uses Fourier transforms to generate perturbations and analyzes how image representations respond to these frequency-based modifications to detect minute manipulations characteristic of AI-generated content.
Result: Achieves 1-2 orders of magnitude faster inference than most training-free detectors. On the OpenFake benchmark, improves AUC by nearly 10% compared to state-of-the-art methods while maintaining substantially lower computational cost.
Conclusion: The proposed frequency perturbation sensitivity method provides an effective, computationally lightweight solution for AI-generated image detection that outperforms existing approaches in both accuracy and efficiency.
Abstract: The rapid progress of text-to-image models has made AI-generated images increasingly realistic, posing significant challenges for accurate detection of generated content. While training-based detectors often suffer from limited generalization to unseen images, training-free approaches offer better robustness, yet struggle to capture subtle discrepancies between real and synthetic images. In this work, we propose a training-free AI-generated image detection method that measures representation sensitivity to structured frequency perturbations, enabling detection of minute manipulations. The proposed method is computationally lightweight, as perturbation generation requires only a single Fourier transform for an input image. As a result, it achieves one to two orders of magnitude faster inference than most training-free detectors. Extensive experiments on challenging benchmarks demonstrate the efficacy of our method over state-of-the-art (SoTA). In particular, on the OpenFake benchmark, our method improves AUC by nearly 10% compared to SoTA, while maintaining substantially lower computational cost.
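The core mechanism (perturb a frequency band, measure how much the representation moves) can be sketched as follows. The annular band, the perturbation magnitude `eps`, and the norm-based sensitivity score are illustrative assumptions, not the paper's exact design:

```python
import numpy as np

def frequency_perturb(img, band=(0.25, 0.5), eps=0.05, seed=0):
    """Nudge a mid-frequency band of a grayscale image via a single FFT."""
    rng = np.random.default_rng(seed)
    F = np.fft.fftshift(np.fft.fft2(img))          # one Fourier transform
    h, w = img.shape
    yy, xx = np.mgrid[-h // 2:h - h // 2, -w // 2:w - w // 2]
    r = np.sqrt((yy / (h / 2)) ** 2 + (xx / (w / 2)) ** 2) / np.sqrt(2)
    mask = (r >= band[0]) & (r < band[1])          # annular frequency band
    F_pert = F * (1 + eps * mask * rng.standard_normal(F.shape))
    return np.fft.ifft2(np.fft.ifftshift(F_pert)).real

def sensitivity(img, encoder, **kw):
    """Relative representation shift under perturbation; the zero-shot idea is
    that real vs. generated images yield systematically different shifts."""
    z0, z1 = encoder(img), encoder(frequency_perturb(img, **kw))
    return float(np.linalg.norm(z1 - z0) / (np.linalg.norm(z0) + 1e-8))

demo = np.random.default_rng(1).random((16, 16))
score = sensitivity(demo, lambda x: x.reshape(-1))
```

In practice `encoder` would be a pretrained vision backbone; a flattening lambda is used here only to keep the sketch self-contained.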
[340] PGR-Net: Prior-Guided ROI Reasoning Network for Brain Tumor MRI Segmentation
Jiacheng Lu, Hui Ding, Shiyu Zhang, Guoping Huo
Main category: cs.CV
TL;DR: PGR-Net is a brain tumor MRI segmentation framework that uses data-driven spatial priors and ROI reasoning to address spatial sparsity, achieving state-of-the-art performance with minimal parameters.
Details
Motivation: Brain tumor MRI segmentation faces challenges due to spatial sparsity (tumors occupy small volumetric space) and existing networks overlook clinically observed spatial priors, leading to redundant feature computation over background regions.
Method: Proposes PGR-Net with: 1) Data-driven spatial prior set capturing tumor distribution/scale characteristics, 2) Hierarchical Top-K ROI decision mechanism for progressive lesion candidate selection, 3) WinGS-ROI module using multi-window Gaussian templates with spatial decay for center-enhanced guidance maps, 4) Windowed RetNet backbone for enhanced localization reliability.
Result: Outperforms existing approaches on BraTS-2019/2023 and MSD Task01 with only 8.64M parameters, achieving Dice scores of 89.02%, 91.82%, and 89.67% on Whole Tumor region.
Conclusion: PGR-Net effectively addresses spatial sparsity in brain tumor segmentation by incorporating spatial priors and ROI reasoning, providing stable and precise segmentation with computational efficiency.
Abstract: Brain tumor MRI segmentation is essential for clinical diagnosis and treatment planning, enabling accurate lesion detection and radiotherapy target delineation. However, tumor lesions occupy only a small fraction of the volumetric space, resulting in severe spatial sparsity, while existing segmentation networks often overlook clinically observed spatial priors of tumor occurrence, leading to redundant feature computation over extensive background regions. To address this issue, we propose PGR-Net (Prior-Guided ROI Reasoning Network) - an explicit ROI-aware framework that incorporates a data-driven spatial prior set to capture the distribution and scale characteristics of tumor lesions, providing global guidance for more stable segmentation. Leveraging these priors, PGR-Net introduces a hierarchical Top-K ROI decision mechanism that progressively selects the most confident lesion candidate regions across encoder layers to improve localization precision. We further develop the WinGS-ROI (Windowed Gaussian-Spatial Decay ROI) module, which uses multi-window Gaussian templates with a spatial decay function to produce center-enhanced guidance maps, thus directing feature learning throughout the network. With these ROI features, a windowed RetNet backbone is adopted to enhance localization reliability. Experiments on BraTS-2019/2023 and MSD Task01 show that PGR-Net consistently outperforms existing approaches while using only 8.64M parameters, achieving Dice scores of 89.02%, 91.82%, and 89.67% on the Whole Tumor region. Code is available at https://github.com/CNU-MedAI-Lab/PGR-Net.
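A center-enhanced guidance map with spatial decay, as used by the WinGS-ROI module, is essentially a Gaussian bump around a candidate ROI center. A single-window 2D sketch (the paper uses multi-window templates over 3D volumes; the single window and the fixed sigma are simplifying assumptions):

```python
import math

def gaussian_guidance_map(h, w, center, sigma):
    """Guidance map whose value decays with squared distance from the ROI
    center, directing feature learning toward the candidate lesion region."""
    cy, cx = center
    return [[math.exp(-((y - cy) ** 2 + (x - cx) ** 2) / (2 * sigma ** 2))
             for x in range(w)] for y in range(h)]

g = gaussian_guidance_map(8, 8, center=(3, 4), sigma=2.0)
```

Multiplying feature maps by such a template suppresses background responses while preserving the peak at the ROI center.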
[341] Dual-level Adaptation for Multi-Object Tracking: Building Test-Time Calibration from Experience and Intuition
Wen Guo, Pengfei Zhao, Zongmeng Wang, Yufan Hu, Junyu Gao
Main category: cs.CV
TL;DR: TCEI is a test-time adaptation framework for Multiple Object Tracking that addresses distribution shifts by combining intuitive predictions from recent observations with experiential calibration from prior test videos.
Details
Motivation: Existing test-time adaptation methods fail in MOT because they focus only on frame-level adaptation while ignoring temporal consistency and identity association across frames. Distribution shifts in appearance, motion patterns, and categories cause significant performance degradation during online inference.
Method: Proposes a Test-time Calibration from Experience and Intuition (TCEI) framework with two systems: 1) Intuitive system uses transient memory to recall recently observed objects for rapid predictions, 2) Experiential system leverages accumulated experience from prior test videos to reassess and calibrate intuitive predictions. Both confident and uncertain objects are exploited as historical priors and reflective cases.
Result: Extensive experiments show TCEI consistently achieves superior performance across multiple benchmark datasets and significantly enhances model adaptability under distribution shifts.
Conclusion: TCEI effectively addresses distribution shifts in MOT by combining intuitive and experiential systems, outperforming existing test-time adaptation methods through better temporal consistency and identity association handling.
Abstract: Multiple Object Tracking (MOT) has long been a fundamental task in computer vision, with broad applications in various real-world scenarios. However, due to distribution shifts in appearance, motion pattern, and category between the training and testing data, model performance degrades considerably during online inference in MOT. Test-Time Adaptation (TTA) has emerged as a promising paradigm to alleviate such distribution shifts. However, existing TTA methods often fail to deliver satisfactory results in MOT, as they primarily focus solely on frame-level adaptation while neglecting temporal consistency and identity association across frames and videos. Inspired by the human decision-making process, this paper proposes a Test-time Calibration from Experience and Intuition (TCEI) framework. In this framework, the Intuitive system utilizes transient memory to recall recently observed objects for rapid predictions, while the Experiential system leverages the accumulated experience from prior test videos to reassess and calibrate these intuitive predictions. Furthermore, both confident and uncertain objects during online testing are exploited as historical priors and reflective cases, respectively, enabling the model to adapt to the testing environment and alleviate performance degradation. Extensive experiments demonstrate that the proposed TCEI framework consistently achieves superior performance across multiple benchmark datasets and significantly enhances the model’s adaptability under distribution shifts. The code will be released at https://github.com/1941Zpf/TCEI.
[342] No Dense Tensors Needed: Fully Sparse Object Detection on Event-Camera Voxel Grids
Mohamad Yazan Sadoun, Sarah Sharif, Yaser Mike Banad
Main category: cs.CV
TL;DR: SparseVoxelDet is the first fully sparse object detector for event cameras that processes only active voxels using 3D sparse convolutions, achieving high efficiency with minimal accuracy trade-off compared to dense methods.
Details
Motivation: Event cameras produce sparse, asynchronous data streams ideal for detecting fast-moving objects like drones, but existing detectors convert sparse events into dense tensors, losing the efficiency benefits of neuromorphic sensing. The authors aim to create a detector that preserves sparsity throughout the pipeline.
Method: Proposes SparseVoxelDet, a fully sparse object detector where backbone feature extraction, feature pyramid fusion, and detection head operate exclusively on occupied voxel positions using 3D sparse convolutions. No dense feature tensors are created at any stage, processing only active voxels in the scene.
Result: Achieves 83.38% mAP at IoU 0.50 on FRED benchmark (629,832 frames), processing only 14,900 active voxels per frame (0.23% of grid). Shows 858× GPU memory compression and 3,670× storage reduction vs dense 3D voxel tensor. Accuracy gap to dense baseline (87.68% mAP) is mostly due to box regression precision rather than detection capability.
Conclusion: Native sparse processing is viable for event-camera object detection, exploiting structural sparsity without requiring neuromorphic hardware. The framework’s representation cost scales with scene activity rather than pixel count, making it increasingly valuable as event cameras scale to higher resolutions.
Abstract: Event cameras produce asynchronous, high-dynamic-range streams well suited for detecting small, fast-moving drones, yet most event-based detectors convert the sparse event stream into dense tensors, discarding the representational efficiency of neuromorphic sensing. We propose SparseVoxelDet, to our knowledge the first fully sparse object detector for event cameras, in which backbone feature extraction, feature pyramid fusion, and the detection head all operate exclusively on occupied voxel positions through 3D sparse convolutions; no dense feature tensor is instantiated at any stage of the pipeline. On the FRED benchmark (629,832 annotated frames), SparseVoxelDet achieves 83.38% mAP at IoU 0.50 while processing only 14,900 active voxels per frame (0.23% of the T×H×W grid), compared to 409,600 pixels for the dense YOLOv11 baseline (87.68% mAP at IoU 0.50). Relaxing the IoU threshold from 0.50 to 0.40 recovers mAP to 89.26%, indicating that the remaining accuracy gap is dominated by box regression precision rather than detection capability. The sparse representation yields 858× GPU memory compression and 3,670× storage reduction relative to the equivalent dense 3D voxel tensor, with data-structure size that scales with scene dynamics rather than sensor resolution. Error forensics across 119,459 test frames confirms that 71% of failures are localization near-misses rather than missed targets. These results demonstrate that native sparse processing is a viable paradigm for event-camera object detection, exploiting the structural sparsity of neuromorphic sensor data without requiring neuromorphic computing hardware, and providing a framework whose representation cost is governed by scene activity rather than pixel count, a property that becomes increasingly valuable as event cameras scale to higher resolutions.
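The input side of a fully sparse pipeline (accumulating asynchronous events into a sparse T×H×W voxel grid that stores only occupied positions) can be sketched as below. The signed-polarity accumulation and microsecond timestamps are illustrative assumptions; the 3D sparse-convolution detector itself is not reproduced:

```python
from collections import defaultdict

def voxelize_events(events, t_bins, h, w, dt):
    """Accumulate events (t_us, y, x, polarity) into {(t_bin, y, x): count}.
    Memory scales with the number of active voxels, not with t_bins*h*w."""
    grid = defaultdict(float)
    for t, y, x, p in events:
        tb = min(t // dt, t_bins - 1)
        if 0 <= y < h and 0 <= x < w:
            grid[(tb, y, x)] += 1.0 if p else -1.0
    return dict(grid)

events = [(1000, 5, 7, 1), (1000, 5, 7, 1), (9000, 5, 7, 0), (4000, 2, 3, 1)]
voxels = voxelize_events(events, t_bins=5, h=16, w=16, dt=2000)
```

A grid of 5×16×16 = 1,280 cells is represented here by just three entries, mirroring the 0.23% occupancy the paper reports at full resolution.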
[343] FedCVU: Federated Learning for Cross-View Video Understanding
Shenghan Zhang, Run Ling, Ke Cao, Ao Ma, Zhanjie Zhang
Main category: cs.CV
TL;DR: FedCVU is a federated learning framework for cross-view video understanding that addresses view heterogeneity, representation misalignment, and communication overhead through view-specific normalization, contrastive alignment, and selective layer aggregation.
Details
Motivation: Applying federated learning to multi-camera video understanding faces challenges: (1) heterogeneous viewpoints create non-IID data distributions causing overfitting to view-specific patterns, (2) local distribution biases lead to misaligned representations that hinder cross-view semantic consistency, and (3) large video architectures create prohibitive communication overhead in federated settings.
Method: FedCVU proposes three components: VS-Norm preserves view-specific normalization parameters to handle different camera statistics; CV-Align uses lightweight contrastive regularization to improve cross-view representation alignment; and SLA (Selective Layer Aggregation) reduces communication by selectively aggregating only critical layers without sacrificing accuracy.
Result: Extensive experiments on action understanding and person re-identification tasks under cross-view protocols show FedCVU consistently boosts unseen-view accuracy while maintaining strong seen-view performance, outperforming state-of-the-art FL baselines and demonstrating robustness to domain heterogeneity and communication constraints.
Conclusion: FedCVU effectively addresses key challenges in federated cross-view video understanding by handling view heterogeneity, aligning representations across cameras, and reducing communication overhead, making it a practical solution for privacy-preserving multi-camera video analysis.
Abstract: Federated learning (FL) has emerged as a promising paradigm for privacy-preserving multi-camera video understanding. However, applying FL to cross-view scenarios faces three major challenges: (i) heterogeneous viewpoints and backgrounds lead to highly non-IID client distributions and overfitting to view-specific patterns, (ii) local distribution biases cause misaligned representations that hinder consistent cross-view semantics, and (iii) large video architectures incur prohibitive communication overhead. To address these issues, we propose FedCVU, a federated framework with three components: VS-Norm, which preserves normalization parameters to handle view-specific statistics; CV-Align, a lightweight contrastive regularization module to improve cross-view representation alignment; and SLA, a selective layer aggregation strategy that reduces communication without sacrificing accuracy. Extensive experiments on action understanding and person re-identification tasks under a cross-view protocol demonstrate that FedCVU consistently boosts unseen-view accuracy while maintaining strong seen-view performance, outperforming state-of-the-art FL baselines and showing robustness to domain heterogeneity and communication constraints.
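The VS-Norm idea (aggregate shared weights across clients but keep each view's normalization statistics local) reduces, in its simplest form, to a FedAvg that skips parameters matched as normalization layers. A sketch over plain dicts of lists; the keyword-based matching of norm layers is an illustrative assumption:

```python
def aggregate_with_local_norm(client_states, norm_keyword="norm"):
    """FedAvg-style elementwise averaging that leaves normalization
    parameters out of the global model, so each client retains its own
    view-specific statistics."""
    global_state = {}
    for k in client_states[0]:
        if norm_keyword in k:
            continue  # view-specific: kept per client, never averaged
        vals = [s[k] for s in client_states]
        global_state[k] = [sum(col) / len(vals) for col in zip(*vals)]
    return global_state

clients = [
    {"conv.weight": [1.0, 3.0], "norm.scale": [0.5, 0.5]},
    {"conv.weight": [3.0, 5.0], "norm.scale": [2.0, 2.0]},
]
agg = aggregate_with_local_norm(clients)
```

SLA would further restrict the averaged key set to a selected subset of layers, shrinking the communicated payload.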
[344] OmniFM: Toward Modality-Robust and Task-Agnostic Federated Learning for Heterogeneous Medical Imaging
Meilin Liu, Jiaying Wang, Jing Shan
Main category: cs.CV
TL;DR: OmniFM is a modality- and task-agnostic federated learning framework for medical image analysis that uses frequency-domain insights to handle heterogeneous imaging modalities and diverse downstream tasks without re-engineering optimization pipelines.
Details
Motivation: Existing federated learning frameworks for medical image analysis are tightly coupled to task-specific backbones and fragile under heterogeneous imaging modalities, hindering real-world deployment where institutions vary in modality distributions and need to support diverse downstream tasks.
Method: OmniFM leverages frequency-domain insights that low-frequency spectral components exhibit cross-modality consistency and encode modality-invariant anatomical structures. It integrates: (1) Global Spectral Knowledge Retrieval to inject global frequency priors, (2) Embedding-wise Cross-Attention Fusion to align representations, (3) Prefix-Suffix Spectral Prompting to jointly condition global and personalized cues, all regularized by a Spectral-Proximal Alignment objective that stabilizes aggregation.
Result: Experiments on real-world datasets show OmniFM consistently surpasses state-of-the-art FL baselines across intra- and cross-modality heterogeneity, achieving superior results under both fine-tuning and training-from-scratch setups.
Conclusion: OmniFM provides a unified federated learning framework that effectively handles modality heterogeneity and diverse medical imaging tasks through frequency-domain regularization and cross-modality alignment techniques.
Abstract: Federated learning (FL) has become a promising paradigm for collaborative medical image analysis, yet existing frameworks remain tightly coupled to task-specific backbones and are fragile under heterogeneous imaging modalities. Such constraints hinder real-world deployment, where institutions vary widely in modality distributions and must support diverse downstream tasks. To address this limitation, we propose OmniFM, a modality- and task-agnostic FL framework that unifies training across classification, segmentation, super-resolution, visual question answering, and multimodal fusion without re-engineering the optimization pipeline. OmniFM builds on a key frequency-domain insight: low-frequency spectral components exhibit strong cross-modality consistency and encode modality-invariant anatomical structures. Accordingly, OmniFM integrates (i) Global Spectral Knowledge Retrieval to inject global frequency priors, (ii) Embedding-wise Cross-Attention Fusion to align representations, and (iii) Prefix-Suffix Spectral Prompting to jointly condition global and personalized cues, together regularized by a Spectral-Proximal Alignment objective that stabilizes aggregation. Experiments on real-world datasets show that OmniFM consistently surpasses state-of-the-art FL baselines across intra- and cross-modality heterogeneity, achieving superior results under both fine-tuning and training-from-scratch setups.
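The frequency-domain premise (low-frequency spectral components carry the modality-invariant anatomical structure) corresponds to a simple low-pass step in the Fourier domain. A sketch of that extraction only; the circular cutoff `radius` is an illustrative assumption and none of OmniFM's modules are reproduced:

```python
import numpy as np

def low_frequency_component(img, radius=0.15):
    """Keep only the central (low-frequency) spectral components of a 2D
    image, zeroing everything outside a small circle around DC."""
    F = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.mgrid[-h // 2:h - h // 2, -w // 2:w - w // 2]
    mask = (yy / (h / 2)) ** 2 + (xx / (w / 2)) ** 2 <= radius ** 2
    return np.fft.ifft2(np.fft.ifftshift(F * mask)).real

flat = low_frequency_component(np.ones((16, 16)))
noisy = np.random.default_rng(0).random((16, 16))
smooth = low_frequency_component(noisy)
```

The retained component preserves coarse structure (and the image mean) while discarding the high-frequency texture where modality-specific appearance differences concentrate.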
[345] HumanOmni-Speaker: Identifying Who said What and When
Detao Bai, Shimin Yao, Weixuan Chen, Xihan Wei, Zhiheng Ma
Main category: cs.CV
TL;DR: HumanOmni-Speaker addresses multimodal LLMs’ inability to handle complex multi-person conversational dynamics by introducing a Visual Delta Encoder that captures fine-grained lip movements and speaker trajectories at 25 fps, enabling true spatio-temporal identity binding without visual shortcuts.
Details
Motivation: Current multimodal LLMs struggle with accurately answering "Who said what and when" in multi-person conversations due to reliance on visual biases in benchmarks and sparse sampling that destroys crucial high-frequency dynamics like lip movements.
Method: Proposes HumanOmni-Speaker with Visual Delta Encoder that samples raw video at 25 fps and compresses inter-frame motion residuals into just 6 tokens per frame to capture fine-grained visemes and speaker trajectories without token explosion.
Result: Demonstrates strong multimodal synergy, enabling end-to-end lip-reading and high-precision spatial localization without intrusive cropping, achieving superior performance across speaker-centric tasks.
Conclusion: The approach overcomes architectural perception gaps in multimodal LLMs by providing true spatio-temporal identity binding for complex conversational dynamics through efficient high-frequency visual processing.
Abstract: While Omni-modal Large Language Models have made strides in joint sensory processing, they fundamentally struggle with a cornerstone of human interaction: deciphering complex, multi-person conversational dynamics to accurately answer "Who said what and when." Current models suffer from an "illusion of competence" – they exploit visual biases in conventional benchmarks to bypass genuine cross-modal alignment, while relying on sparse, low-frame-rate visual sampling that destroys crucial high-frequency dynamics like lip movements. To shatter this illusion, we introduce Visual-Registered Speaker Diarization and Recognition (VR-SDR) and the HumanOmni-Speaker Benchmark. By strictly eliminating visual shortcuts, this rigorous paradigm demands true end-to-end spatio-temporal identity binding using only natural language queries. To overcome the underlying architectural perception gap, we propose HumanOmni-Speaker, powered by a Visual Delta Encoder. By sampling raw video at 25 fps and explicitly compressing inter-frame motion residuals into just 6 tokens per frame, it captures fine-grained visemes and speaker trajectories without triggering a catastrophic token explosion. Ultimately, HumanOmni-Speaker demonstrates strong multimodal synergy, natively enabling end-to-end lip-reading and high-precision spatial localization without intrusive cropping, and achieving superior performance across a wide spectrum of speaker-centric tasks.
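The Visual Delta Encoder's key move is to encode only the inter-frame motion residuals, compressed into a handful of tokens per frame. In this sketch a fixed random linear projection stands in for the learned encoder (an assumption); only the 6-token budget comes from the paper:

```python
import numpy as np

def delta_tokens(frames, n_tokens=6, seed=0):
    """Map each inter-frame residual of a (T, H, W) clip to n_tokens values.
    Static content contributes nothing: only motion (e.g. lip movement at
    high frame rate) survives, keeping the token count small."""
    rng = np.random.default_rng(seed)
    t, h, w = frames.shape
    proj = rng.standard_normal((h * w, n_tokens)) / np.sqrt(h * w)
    deltas = frames[1:] - frames[:-1]            # motion residuals only
    return deltas.reshape(t - 1, h * w) @ proj   # (t-1, n_tokens)

frames = np.stack([np.full((8, 8), float(i)) for i in range(4)])
tok = delta_tokens(frames)
```

At 25 fps and 6 tokens per frame, a 10-second clip costs about 1,500 visual tokens, versus hundreds of tokens per frame for full-frame encoding.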
[346] RefracGS: Novel View Synthesis Through Refractive Water Surfaces with 3D Gaussian Ray Tracing
Yiming Shao, Qiyu Dai, Chong Gao, Guanbin Li, Yeqiang Wang, He Sun, Qiong Zeng, Baoquan Chen, Wenzheng Chen
Main category: cs.CV
TL;DR: RefracGS is a novel framework for novel view synthesis through refractive surfaces that jointly reconstructs the refractive water surface and underlying scene using neural height fields and 3D Gaussian fields with refraction-aware ray tracing.
Details
Motivation: Traditional NeRF and 3D Gaussian Splatting methods assume straight-line ray propagation, which fails for scenes viewed through non-planar refractive surfaces like water waves, causing significant artifacts in novel view synthesis.
Method: The method decouples the refractive boundary from target objects: uses a neural height field to model the refractive water surface geometry, and a 3D Gaussian field for the underlying scene. It implements refraction-aware Gaussian ray tracing using Snell’s law to compute non-linear ray trajectories.
Result: RefracGS outperforms prior refractive methods in visual quality on both synthetic and real-world scenes with complex waves, achieves 15x faster training, and enables real-time rendering at 200 FPS.
Conclusion: The framework successfully addresses the challenge of novel view synthesis through refractive surfaces by jointly optimizing refractive surface and scene representations with physically accurate ray tracing, enabling high-fidelity rendering and view-consistent surface recovery.
Abstract: Novel view synthesis (NVS) through non-planar refractive surfaces presents fundamental challenges due to severe, spatially varying optical distortions. While recent representations like NeRF and 3D Gaussian Splatting (3DGS) excel at NVS, their assumption of straight-line ray propagation fails under these conditions, leading to significant artifacts. To overcome this limitation, we introduce RefracGS, a framework that jointly reconstructs the refractive water surface and the scene beneath the interface. Our key insight is to explicitly decouple the refractive boundary from the target objects: the refractive surface is modeled via a neural height field, capturing wave geometry, while the underlying scene is represented as a 3D Gaussian field. We formulate a refraction-aware Gaussian ray tracing approach that accurately computes non-linear ray trajectories using Snell’s law and efficiently renders the underlying Gaussian field while backpropagating the loss gradients to the parameterized refractive surface. Through end-to-end joint optimization of both representations, our method ensures high-fidelity NVS and view-consistent surface recovery. Experiments on both synthetic and real-world scenes with complex waves demonstrate that RefracGS outperforms prior refractive methods in visual quality, while achieving 15x faster training and real-time rendering at 200 FPS. The project page for RefracGS is available at https://yimgshao.github.io/refracgs/.
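The ray-bending step at the water surface follows the standard vector form of Snell's law. A standalone sketch of that single operation (the Gaussian ray tracing, the neural height field, and the gradient backpropagation are not reproduced):

```python
import math

def refract(d, n, eta_ratio):
    """Refract unit direction d through a surface with unit normal n,
    where eta_ratio = n_incident / n_transmitted (Snell's law in vector
    form). Returns None on total internal reflection."""
    cos_i = -sum(a * b for a, b in zip(d, n))
    sin2_t = eta_ratio ** 2 * (1.0 - cos_i ** 2)
    if sin2_t > 1.0:
        return None  # total internal reflection
    cos_t = math.sqrt(1.0 - sin2_t)
    return tuple(eta_ratio * di + (eta_ratio * cos_i - cos_t) * ni
                 for di, ni in zip(d, n))

# A ray entering water (air n=1.0 -> water n=1.33) straight down is unbent:
straight = refract((0.0, 0.0, -1.0), (0.0, 0.0, 1.0), 1.0 / 1.33)
```

During rendering, each camera ray is bent once at the height-field surface and then traced straight through the underwater Gaussian field.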
[347] PPGL-Swarm: Integrated Multimodal Risk Stratification and Hereditary Syndrome Detection in Pheochromocytoma and Paraganglioma
Zelin Liu, Xiangfu Yu, Jie Huang, Ge Wang, Yizhe Yuan, Zhenyu Yi, Jing Xie, Haotian Jiang, Lichi Zhang
Main category: cs.CV
TL;DR: PPGL-Swarm: An agentic diagnostic system for pheochromocytomas and paragangliomas that automates GAPP scoring, incorporates genotype risk analysis, and provides traceable reasoning through specialized agents.
Details
Motivation: Current PPGL diagnosis using GAPP scoring has limitations: high clinician workload due to manual evaluation of six components, subjective criteria for key components like cellularity and Ki-67, and failure to capture clinically relevant metastatic risk factors like SDHB mutations. Existing agent-driven diagnostic systems lack traceable reasoning and don't incorporate domain-specific knowledge like genotype information.
Method: PPGL-Swarm uses an agentic system that decomposes diagnosis into micro-tasks assigned to specialized agents. It includes automated GAPP scoring with quantified cellularity and Ki-67, genotype risk alerts, and multimodal report generation. The system employs knowledge enhancement for gene and table agents to interpret genotype and laboratory findings, and uses reinforcement learning during training to refine tool selection and task assignment.
Result: The system generates comprehensive diagnostic reports including automated GAPP scoring, genotype risk alerts, and integrated multimodal evidence. It provides an auditable reasoning trail through the decomposition of diagnosis into micro-tasks handled by specialized agents.
Conclusion: PPGL-Swarm addresses limitations of current PPGL diagnostic approaches by automating complex scoring, incorporating genotype information, and providing traceable reasoning through an agentic architecture with specialized agents and knowledge enhancement.
Abstract: Pheochromocytomas and paragangliomas (PPGLs) are rare neuroendocrine tumors, of which 15-25% develop metastatic disease with 5-year survival rates reported as low as 34%. PPGL may indicate hereditary syndromes requiring stricter, syndrome-specific treatment and surveillance, but clinicians often fail to recognize these associations in routine care. Clinical practice uses the GAPP score for PPGL grading, but several limitations remain for PPGL diagnosis: (1) GAPP scoring demands a high workload for clinicians because it requires the manual evaluation of six independent components; (2) key components such as cellularity and Ki-67 are often evaluated with subjective criteria; (3) several clinically relevant metastatic risk factors are not captured by GAPP, such as SDHB mutations, which have been associated with reported metastatic rates of 35-75%. Agent-driven diagnostic systems appear promising, but most lack traceable reasoning for decision-making and do not incorporate domain-specific knowledge such as PPGL genotype information. To address these limitations, we present PPGL-Swarm, an agentic PPGL diagnostic system that generates a comprehensive report, including automated GAPP scoring (with quantified cellularity and Ki-67), genotype risk alerts, and a multimodal report with integrated evidence. The system provides an auditable reasoning trail by decomposing diagnosis into micro-tasks, each assigned to a specialized agent. The gene and table agents use knowledge enhancement to better interpret genotype and laboratory findings, and during training we use reinforcement learning to refine tool selection and task assignment.
[348] Rethinking Token Reduction for Large Vision-Language Models
Yi Wang, Haofei Zhang, Qihan Huang, Anda Cao, Gongfan Fang, Wei Wang, Xuan Jin, Jie Song, Mingli Song, Xinchao Wang
Main category: cs.CV
TL;DR: MetaCompress is a learning-based token reduction method for multi-turn VQA that optimizes visual token compression through a unified learnable mapping, improving efficiency while maintaining accuracy across dialogue turns.
Details
Motivation: Current token reduction methods for LVLMs focus on single-turn VQA and are ineffective for multi-turn scenarios where subsequent questions are unknown and may refer to arbitrary image regions. Existing approaches either bias toward initial prompts or rely on suboptimal heuristic metrics.
Method: Formulates token reduction as a learnable compression mapping that unifies pruning and merging approaches. Introduces a data-efficient training paradigm to learn optimal compression mappings with limited computational costs, making it prompt-agnostic and suitable for multi-turn settings.
Result: Extensive experiments on MT-VQA benchmarks across multiple LVLM architectures show MetaCompress achieves superior efficiency-accuracy trade-offs while maintaining strong generalization across dialogue turns.
Conclusion: MetaCompress overcomes limitations of heuristic token reduction methods for multi-turn VQA, providing an effective learning-based approach that balances computational efficiency with visual reasoning accuracy in practical dialogue scenarios.
Abstract: Large Vision-Language Models (LVLMs) excel in visual understanding and reasoning, but the excessive visual tokens lead to high inference costs. Although recent token reduction methods mitigate this issue, they mainly target single-turn Visual Question Answering (VQA), leaving the more practical multi-turn VQA (MT-VQA) scenario largely unexplored. MT-VQA introduces additional challenges, as subsequent questions are unknown beforehand and may refer to arbitrary image regions, making existing reduction strategies ineffective. Specifically, current approaches fall into two categories: prompt-dependent methods, which bias toward the initial text prompt and discard information useful for subsequent turns; prompt-agnostic ones, which, though technically applicable to multi-turn settings, rely on heuristic reduction metrics such as attention scores, leading to suboptimal performance. In this paper, we propose a learning-based prompt-agnostic method, termed MetaCompress, overcoming the limitations of heuristic designs. We begin by formulating token reduction as a learnable compression mapping, unifying existing formats such as pruning and merging into a single learning objective. Upon this formulation, we introduce a data-efficient training paradigm capable of learning optimal compression mappings with limited computational costs. Extensive experiments on MT-VQA benchmarks and across multiple LVLM architectures demonstrate that MetaCompress achieves superior efficiency-accuracy trade-offs while maintaining strong generalization across dialogue turns. Our code is available at https://github.com/MArSha1147/MetaCompress.
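The unifying idea in MetaCompress — that pruning and merging are both special cases of one compression mapping from N visual tokens to M compressed tokens — can be illustrated with a minimal NumPy sketch. The function name and matrices below are illustrative, not the paper's implementation; in the paper the mapping is learned, whereas here it is hand-specified.

```python
import numpy as np

def compress_tokens(tokens, W):
    """Apply a compression mapping W (M x N, rows summing to 1) to
    N visual tokens of dimension d, yielding M compressed tokens."""
    assert np.allclose(W.sum(axis=1), 1.0)
    return W @ tokens  # (M, d)

# Four toy tokens of dimension 2.
tokens = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 4.0]])

# Pruning as a special case: one-hot rows keep tokens 0 and 2.
W_prune = np.array([[1.0, 0.0, 0.0, 0.0],
                    [0.0, 0.0, 1.0, 0.0]])

# Merging as a special case: rows average token groups {0,1} and {2,3}.
W_merge = np.array([[0.5, 0.5, 0.0, 0.0],
                    [0.0, 0.0, 0.5, 0.5]])
```

A learned, prompt-agnostic W would replace these hand-built matrices, which is what makes the formulation suitable for multi-turn VQA where the next question is unknown.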
[349] Getting to the Point: Why Pointing Improves LVLMs
Simone Alghisi, Massimo Rizzoli, Seyed Mahed Mousavi, Giuseppe Riccardi
Main category: cs.CV
TL;DR: Point-then-Count approach improves LVLMs’ zero-shot counting accuracy by grounding objects with coordinates before counting, enhancing generalization and revealing spatial biases.
Details
Motivation: To understand the mechanisms behind pointing's benefits in LVLMs and study its reliability as visual explanations, particularly in cognitive tasks like zero-shot counting.
Method: Fine-tuned state-of-the-art LVLMs with two approaches: Direct Counting (predict only total count) and Point-then-Count (generate object coordinates first, then count). Analyzed spatial biases and conducted mechanistic analyses.
Result: Point-then-Count achieved higher out-of-distribution generalization than Direct Counting. Predicted points were accurately grounded (over 89% F1), but performance varied across image regions revealing spatial biases. Gains attributed to spatial information in coordinates.
Conclusion: Pointing helps LVLMs learn generalizable skills rather than overfitting, coordinates provide valuable spatial information for counting, but spatial biases exist that affect reliability as visual explanations.
Abstract: Pointing increases the accuracy and explainability of Large Vision-Language Models (LVLMs) by modeling grounding and reasoning as explicit sequential steps. The model grounds the objects mentioned in the natural-language query by predicting their coordinates, and then generates an answer conditioned on these points. While pointing has been shown to increase LVLMs’ accuracy, it is unclear which mechanism supports these gains and its relevance in cognitive tasks. In addition, the reliability of the intermediate points remains understudied, limiting their use as visual explanations. In this work, we study the role of pointing in a cognitive task: zero-shot counting from a visual scene. We fine-tune state-of-the-art LVLMs following two approaches: Direct Counting, where models only predict the total number of objects, and Point-then-Count, where LVLMs generate the target objects’ coordinates followed by their count. The results show that Point-then-Count achieves higher out-of-distribution generalization, suggesting that coordinates help LVLMs learn skills rather than overfitting on narrow tasks. Although predicted points are accurately grounded in the image in over 89% of cases (as measured by F1), performance varies across image regions, revealing spatial biases. Finally, mechanistic analyses show that gains in counting arise from the spatial information encoded in the coordinates.
[350] Let’s Think with Images Efficiently! An Interleaved-Modal Chain-of-Thought Reasoning Framework with Dynamic and Precise Visual Thoughts
Xu Liu, Yongheng Zhang, Qiguang Chen, Yao Li, Sheng Wang, Libo Qin
Main category: cs.CV
TL;DR: DaP-ICoT improves multimodal reasoning by dynamically integrating visual information when needed and ensuring coherent visual representations, reducing token usage by 72.6% while achieving state-of-the-art performance.
Details
Motivation: Current Interleaved-modal Chain-of-Thought (ICoT) methods have two major limitations: (1) Static Visual Thought Positioning - visual information is inserted at fixed steps, making reasoning inefficient and inflexible; (2) Broken Visual Thought Representation - visual tokens are discontinuous and semantically incoherent.
Method: Introduces DaP-ICoT with two key components: (1) Dynamic Visual Thought Integration - adaptively introduces visual inputs based on reasoning needs to reduce redundancy; (2) Precise Visual Thought Guidance - ensures visual representations are semantically coherent and contextually aligned.
Result: DaP-ICoT achieves state-of-the-art performance across multiple benchmarks and models. It significantly reduces the number of inserted images, leading to a 72.6% decrease in token consumption, enabling more efficient ICoT reasoning.
Conclusion: DaP-ICoT addresses key limitations in current ICoT methods by making visual integration dynamic and precise, resulting in both improved performance and computational efficiency for multimodal reasoning.
Abstract: Recently, Interleaved-modal Chain-of-Thought (ICoT) reasoning has achieved remarkable success by leveraging both multimodal inputs and outputs, attracting increasing attention. While achieving promising performance, current ICoT methods still suffer from two major limitations: (1) Static Visual Thought Positioning, which statically inserts visual information at fixed steps, resulting in inefficient and inflexible reasoning; and (2) Broken Visual Thought Representation, which involves discontinuous and semantically incoherent visual tokens. To address these limitations, we introduce Interleaved-modal Chain-of-Thought reasoning with Dynamic and Precise Visual Thoughts (DaP-ICoT), which incorporates two key components: (1) Dynamic Visual Thought Integration adaptively introduces visual inputs based on reasoning needs, reducing redundancy and improving efficiency. (2) Precise Visual Thought Guidance ensures semantically coherent and contextually aligned visual representations. Experiments across multiple benchmarks and models demonstrate that DaP-ICoT achieves state-of-the-art performance. In addition, DaP-ICoT significantly reduces the number of inserted images, leading to a 72.6% decrease in token consumption, enabling more efficient ICoT reasoning.
[351] SHARP: Spectrum-aware Highly-dynamic Adaptation for Resolution Promotion in Remote Sensing Synthesis
Bingxuan Zhao, Qing Zhou, Chuang Yang, Qi Wang
Main category: cs.CV
TL;DR: RS-FLUX fine-tunes FLUX on remote sensing images and SHARP introduces dynamic RoPE rescaling for training-free resolution promotion in diffusion-based remote sensing image generation.
Details
Motivation: Remote sensing image synthesis lags behind general text-to-image generation due to lack of domain-specialized diffusion transformers and high training costs at large resolutions. Existing static RoPE rescaling methods harm RS imagery by uniformly compressing positional information, damaging fine structural details critical for aerial scene realism.
Method: Two main components: 1) RS-FLUX - fine-tuning FLUX diffusion model on 100k+ curated remote sensing images to build domain prior; 2) SHARP - training-free method with rational fractional time schedule k_rs(t) for dynamic RoPE rescaling that applies strong positional promotion early and relaxes it during detail recovery.
Result: SHARP consistently outperforms all training-free baselines across six square and rectangular resolutions on CLIP Score, Aesthetic Score, and HPSv2 metrics, with widening margins at more aggressive extrapolation factors and negligible computational overhead.
Conclusion: The combination of domain-specialized RS-FLUX prior and dynamic SHARP resolution promotion enables high-quality remote sensing image generation at large resolutions without expensive retraining, addressing key barriers in RS synthesis.
Abstract: Text-to-image generation powered by Diffusion Transformers (DiTs) has made remarkable strides, yet remote sensing (RS) synthesis lags behind due to two barriers: the absence of a domain-specialized DiT prior and the prohibitive cost of training at the large resolutions that RS applications demand. Training-free resolution promotion via Rotary Position Embedding (RoPE) rescaling offers a practical remedy, but every existing method applies a static positional scaling rule throughout the denoising process. This uniform compression is particularly harmful for RS imagery, whose substantially denser medium- and high-frequency energy encodes the fine structures critical for aerial-scene realism, such as vehicles, building contours, and road markings. Addressing both challenges requires a domain-specialized generative prior coupled with a denoising-aware positional adaptation strategy. To this end, we fine-tune FLUX on over 100,000 curated RS images to build a strong domain prior (RS-FLUX), and propose Spectrum-aware Highly-dynamic Adaptation for Resolution Promotion (SHARP), a training-free method that introduces a rational fractional time schedule k_rs(t) into RoPE. SHARP applies strong positional promotion during the early layout-formation stage and progressively relaxes it during detail recovery, aligning extrapolation strength with the frequency-progressive nature of diffusion denoising. Its resolution-agnostic formulation further enables robust multi-scale generation from a single set of hyperparameters. Extensive experiments across six square and rectangular resolutions show that SHARP consistently outperforms all training-free baselines on CLIP Score, Aesthetic Score, and HPSv2, with widening margins at more aggressive extrapolation factors and negligible computational overhead. Code and weights are available at https://github.com/bxuanz/SHARP.
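The core of SHARP is a denoising-time-dependent RoPE rescaling factor k_rs(t): strong extrapolation during early layout formation (t near 1), relaxing toward no rescaling as fine details are recovered (t near 0). The paper's exact rational fractional schedule is not given in this summary, so the sketch below uses a plausible rational form with hypothetical parameters `k_max` and `c`, just to show the shape of such a schedule.

```python
def k_rs(t, k_max=2.0, c=0.5):
    """Hypothetical rational-fractional schedule for RoPE rescaling:
    returns k_max at t=1 (early denoising, strong positional promotion)
    and 1.0 at t=0 (late denoising, no rescaling), decreasing
    monotonically in between. Not the paper's exact formula."""
    return 1.0 + (k_max - 1.0) * t / (t + c * (1.0 - t))
```

In use, RoPE positions at each denoising step would be divided by k_rs(t), so extrapolation strength tracks the frequency-progressive nature of diffusion denoising rather than staying static.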
[352] Dynamic Exposure Burst Image Restoration
Woohyeok Kim, Jaesung Rim, Daeyeon Kim, Sunghyun Cho
Main category: cs.CV
TL;DR: DEBIR introduces a burst image restoration pipeline that dynamically predicts optimal exposure times for each burst image to enhance restoration quality, outperforming fixed exposure approaches.
Details
Motivation: Current burst image restoration methods use manually designed exposure settings, but optimal exposure settings significantly impact restoration quality and have been overlooked. The authors aim to develop a system that dynamically adapts exposure times to the shooting environment.
Method: DEBIR consists of two main components: 1) Burst Auto-Exposure Network (BAENet) that estimates optimal exposure times for each burst image based on preview image, motion magnitude, and gain; 2) A burst image restoration network that reconstructs high-quality images from burst images captured with these optimal exposure times. Uses differentiable burst simulator and three-stage training strategy.
Result: The pipeline achieves state-of-the-art restoration quality and has been validated on a real-world camera system, demonstrating practical applicability.
Conclusion: Dynamic exposure prediction significantly improves burst image restoration quality compared to fixed exposure settings, and the approach is practical for real-world camera systems.
Abstract: Burst image restoration aims to reconstruct a high-quality image from burst images, which are typically captured using manually designed exposure settings. Although these exposure settings significantly influence the final restoration performance, the problem of finding optimal exposure settings has been overlooked. In this paper, we present Dynamic Exposure Burst Image Restoration (DEBIR), a novel burst image restoration pipeline that enhances restoration quality by dynamically predicting exposure times tailored to the shooting environment. In our pipeline, Burst Auto-Exposure Network (BAENet) estimates the optimal exposure time for each burst image based on a preview image, as well as motion magnitude and gain. Subsequently, a burst image restoration network reconstructs a high-quality image from burst images captured using these optimal exposure times. For training, we introduce a differentiable burst simulator and a three-stage training strategy. Our experiments demonstrate that our pipeline achieves state-of-the-art restoration quality. Furthermore, we validate the effectiveness of our approach on a real-world camera system, demonstrating its practicality.
[353] Image-Conditioned Adaptive Parameter Tuning for Visual Odometry Frontends
Simone Nascivera, Leonard Bauersfeld, Jeff Delaune, Davide Scaramuzza
Main category: cs.CV
TL;DR: RL framework for online tuning of VO frontend parameters using an image-conditioned policy that adapts feature detection and tracking based on visual input.
Details
Motivation: Current VO systems use fixed hyperparameters that don't adapt to varying scene conditions (texture, illumination, motion blur), leading to brittle performance in real-world environments.
Method: Image-conditioned reinforcement learning framework with lightweight texture-aware CNN encoder and privileged critic during training; formulates frontend configuration as a sequential decision-making problem.
Result: 3x longer feature tracks and 3x lower computational cost on TartanAirV2 and TUM RGB-D datasets, trained entirely in simulation
Conclusion: First RL framework for online VO parameter tuning that proactively adapts to visual input, improving robustness and efficiency.
Abstract: Resource-constrained autonomous robots rely on sparse direct and semi-direct visual-(inertial)-odometry (VO) pipelines, as they provide a favorable tradeoff between accuracy, robustness, and computational cost. However, the performance of most systems depends critically on hand-tuned hyperparameters governing feature detection, tracking, and outlier rejection. These parameters are typically fixed during deployment, even though their optimal values vary with scene characteristics such as texture density, illumination, motion blur, and sensor noise, leading to brittle performance in real-world environments. We propose the first image-conditioned reinforcement learning framework for online tuning of VO frontend parameters, effectively embedding the expert into the system. Our key idea is to formulate the frontend configuration as a sequential decision-making problem and learn a policy that directly maps visual input to feature detection and tracking parameters. The policy uses a lightweight texture-aware CNN encoder and a privileged critic during training. Unlike prior RL-based approaches that rely solely on internal VO statistics, our method observes the image content and proactively adapts parameters before tracking degrades. Experiments on TartanAirV2 and TUM RGB-D show 3x longer feature tracks and 3x lower computational cost, despite training entirely in simulation.
[354] Benchmarking Recurrent Event-Based Object Detection for Industrial Multi-Class Recognition on MTEvent
Lokeshwaran Manohar, Moritz Roidl
Main category: cs.CV
TL;DR: Benchmarking recurrent YOLOv8s on MTEvent dataset for industrial multi-class object detection using event cameras, showing recurrent models outperform non-recurrent baselines and event-domain pretraining yields best results.
Details
Motivation: Event cameras offer advantages for industrial robotics (high temporal resolution, dynamic range, reduced motion blur), but most event-based detection research focuses on outdoor driving or limited classes. Need for benchmarking in industrial multi-class recognition settings.
Method: Benchmark recurrent ReYOLOv8s on MTEvent dataset for industrial multi-class recognition, using non-recurrent YOLOv8s variant as baseline to analyze temporal memory effects. Compare scratch training vs. event-domain pretraining (GEN1, PEDRo).
Result: Best scratch recurrent model achieves 0.285 mAP50 (9.6% improvement over non-recurrent baseline). GEN1-initialized fine-tuning yields best overall result of 0.329 mAP50. PEDRo initialization drops to 0.251, showing mismatched pretraining can be worse than scratch training. Models improve with clip length when pretrained.
Conclusion: Recurrent event-based detection shows promise for industrial environments, with event-domain pretraining being crucial. Class imbalance and human-object interaction remain challenges. This work provides focused benchmarking for industrial event-based detection.
Abstract: Event cameras are attractive for industrial robotics because they provide high temporal resolution, high dynamic range, and reduced motion blur. However, most event-based object detection studies focus on outdoor driving scenarios or limited class settings. In this work, we benchmark recurrent ReYOLOv8s on MTEvent for industrial multi-class recognition and use a non-recurrent YOLOv8s variant as a baseline to analyze the effect of temporal memory. On the MTEvent validation split, the best scratch recurrent model (C21) reaches 0.285 mAP50, corresponding to a 9.6% relative improvement over the nonrecurrent YOLOv8s baseline (0.260). Event-domain pretraining has a stronger effect: GEN1-initialized fine-tuning yields the best overall result of 0.329 mAP50 at clip length 21, and unlike scratch training, GEN1-pretrained models improve consistently with clip length. PEDRo initialization drops to 0.251, indicating that mismatched source-domain pretraining can be less effective than training from scratch. Persistent failure modes are dominated by class imbalance and human-object interaction. Overall, we position this work as a focused benchmarking and analysis study of recurrent event-based detection in industrial environments.
[355] Timing In stand-up Comedy: Text, Audio, Laughter, Kinesics (TIC-TALK): Pipeline and Database for the Multimodal Study of Comedic Timing
Yaelle Zribi, Florian Cafiero, Vincent Lépinay, Chahan Vidal-Gorène
Main category: cs.CV
TL;DR: TIC-TALK is a multimodal dataset for analyzing stand-up comedy performance dynamics, combining language, gesture, and audience response data from 90 professionally filmed comedy specials.
Details
Motivation: To study stand-up comedy beyond just verbal content by capturing embodied performance and audience feedback, creating a comprehensive multimodal resource for analyzing performance dynamics.
Method: Combines BERTopic for thematic segmentation, Whisper-AT for laughter detection, YOLOv8-cls for shot classification, and YOLOv8s-pose for skeletal keypoint extraction, with all streams temporally aligned without resampling.
Result: Created dataset with 5,400+ topic segments; found kinetic energy negatively predicts laughter, personal content elicits more laughter than geopolitical themes, and close-up shots correlate positively with laughter.
Conclusion: TIC-TALK enables multimodal analysis of performance dynamics in stand-up comedy, revealing relationships between body language, thematic content, and audience response.
Abstract: Stand-up comedy, and humor in general, are often studied through their verbal content. Yet live performance relies just as much on embodied presence and audience feedback. We introduce TIC-TALK, a multimodal resource with 5,400+ temporally aligned topic segments capturing language, gesture, and audience response across 90 professionally filmed stand-up comedy specials (2015-2024). The pipeline combines BERTopic for 60 s thematic segmentation with dense sentence embeddings, Whisper-AT for 0.8 s laughter detection, a fine-tuned YOLOv8-cls shot classifier, and YOLOv8s-pose for raw keypoint extraction at 1 fps. Raw 17-joint skeletal coordinates are retained without prior clustering, enabling the computation of continuous kinematic signals (arm spread, kinetic energy, and trunk lean) that serve as proxies for performance dynamics. All streams are aligned by hierarchical temporal containment without resampling, and each topic segment stores its sentence-BERT embedding for downstream similarity and clustering tasks. As a concrete use case, we study laughter dynamics across 24 thematic topics: kinetic energy negatively predicts audience laughter rate (r = -0.75, N = 24), consistent with a stillness-before-punchline pattern; personal and bodily content elicits more laughter than geopolitical themes; and shot close-up proportion correlates positively with laughter (r = +0.28), consistent with reactive montage.
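The three kinematic proxies above can be computed directly from the 17-joint COCO-format keypoints that YOLOv8s-pose emits. The paper's exact definitions are not spelled out in this summary, so the functions below are plausible implementations under common conventions (COCO indices 5/6 = shoulders, 9/10 = wrists, 11/12 = hips; image y-axis pointing down); all function names are mine.

```python
import numpy as np

# COCO 17-keypoint indices (as used by YOLOv8-pose)
L_SHOULDER, R_SHOULDER = 5, 6
L_WRIST, R_WRIST = 9, 10
L_HIP, R_HIP = 11, 12

def arm_spread(kp):
    """Euclidean distance between the two wrists for one frame (17, 2)."""
    return float(np.linalg.norm(kp[L_WRIST] - kp[R_WRIST]))

def kinetic_energy(kp_t, kp_prev):
    """Sum of squared per-joint displacements between consecutive frames,
    a mass-free proxy for body motion at the 1 fps sampling rate."""
    return float(((kp_t - kp_prev) ** 2).sum())

def trunk_lean(kp):
    """Angle (radians) of the hip-midpoint -> shoulder-midpoint vector
    away from vertical ('up' is (0, -1) in image coordinates)."""
    sh = (kp[L_SHOULDER] + kp[R_SHOULDER]) / 2
    hip = (kp[L_HIP] + kp[R_HIP]) / 2
    v = sh - hip
    cos = -v[1] / (np.linalg.norm(v) + 1e-8)
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))
```

Aggregating these per topic segment yields the continuous signals correlated against laughter rate in the paper's use case.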
[356] Anatomical Token Uncertainty for Transformer-Guided Active MRI Acquisition
Lev Ayzenberg, Shady Abu-Hussein, Raja Giryes, Hayit Greenspan
Main category: cs.CV
TL;DR: A novel active sampling framework for MRI acceleration using pretrained medical image tokenizers and latent transformers to guide k-space sampling based on token entropy uncertainty measures.
Details
Motivation: MRI acquisition is slow, limiting clinical throughput and patient comfort. Compressed Sensing MRI aims to accelerate this by reconstructing from under-sampled data, requiring optimal sampling trajectories and reconstruction models.
Method: Leverages pretrained medical image tokenizer and latent transformer to represent anatomy through quantized visual tokens. Uses token entropy as uncertainty measure to guide active sampling via two strategies: Latent Entropy Selection (LES) projects patch-wise token entropy to k-space to identify informative sampling lines, and Gradient-based Entropy Optimization (GEO) finds regions of maximum uncertainty reduction via k-space gradient of total latent entropy loss.
Result: Evaluated on fastMRI single-coil Knee and Brain datasets at 8× and 16× acceleration. Outperforms state-of-the-art baselines in perceptual metrics and feature-based distances.
Conclusion: The proposed active sampling framework effectively accelerates MRI acquisition by intelligently selecting k-space sampling locations based on latent uncertainty measures from tokenized representations.
Abstract: Full data acquisition in MRI is inherently slow, which limits clinical throughput and increases patient discomfort. Compressed Sensing MRI (CS-MRI) seeks to accelerate acquisition by reconstructing images from under-sampled k-space data, requiring both an optimal sampling trajectory and a high-fidelity reconstruction model. In this work, we propose a novel active sampling framework that leverages the inherent discrete structure of a pretrained medical image tokenizer and a latent transformer. By representing anatomy through a dictionary of quantized visual tokens, the model provides a well-defined probability distribution over the latent space. We utilize this distribution to derive a principled uncertainty measure via token entropy, which guides the active sampling process. We introduce two strategies to exploit this latent uncertainty: (1) Latent Entropy Selection (LES), projecting patch-wise token entropy into the $k$-space domain to identify informative sampling lines, and (2) Gradient-based Entropy Optimization (GEO), which identifies regions of maximum uncertainty reduction via the $k$-space gradient of a total latent entropy loss. We evaluate our framework on the fastMRI single-coil Knee and Brain datasets at $\times 8$ and $\times 16$ acceleration. Our results demonstrate that our active policies outperform state-of-the-art baselines in perceptual metrics, and feature-based distances. Our code is available at https://github.com/levayz/TRUST-MRI.
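The entropy-guided selection in LES reduces to two steps: compute Shannon entropy of the transformer's categorical distribution over the codebook at each spatial token, then aggregate entropy per candidate line and sample the most uncertain ones. The toy sketch below captures only this core idea; the paper's actual projection from patch space into the k-space domain is more involved, and both function names are mine.

```python
import numpy as np

def token_entropy(probs):
    """Shannon entropy of the predicted categorical distribution over
    the codebook, per spatial token. probs: (H, W, K) with each token's
    K probabilities summing to 1."""
    p = np.clip(probs, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=-1)  # (H, W)

def select_lines(probs, n_lines):
    """Toy Latent Entropy Selection: score each row of the patch grid
    by its total token entropy and return the n most uncertain rows as
    the next lines to acquire."""
    line_scores = token_entropy(probs).sum(axis=1)
    return np.argsort(line_scores)[::-1][:n_lines]
```

GEO replaces the hard argsort with a gradient: it differentiates the total latent entropy loss with respect to the k-space input to find where acquiring data reduces uncertainty most.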
[357] Cascade-Free Mandarin Visual Speech Recognition via Semantic-Guided Cross-Representation Alignment
Lei Yang, Yi He, Fei Wu, Shilin Wang
Main category: cs.CV
TL;DR: A cascade-free multitask learning approach for Chinese Mandarin visual speech recognition that jointly integrates phoneme and viseme representations with semantic-guided contrastive loss to address tonal language challenges.
Details
Motivation: Chinese Mandarin VSR lags behind English because of limitations imposed by Mandarin's tonal nature. Existing cascade architectures with intermediate representations (like pinyin) cause error accumulation and increased inference latency. A cascade-free approach that better exploits contextual information is needed.
Method: Proposes cascade-free architecture based on multitask learning that jointly integrates multiple intermediate representations (phoneme and viseme). Uses semantic-guided local contrastive loss to temporally align features, enabling on-demand activation during inference for efficiency-performance trade-off.
Result: Experiments on publicly available datasets demonstrate superior recognition performance compared to existing methods.
Conclusion: The proposed method effectively addresses tonal language challenges in Chinese Mandarin VSR by eliminating cascade dependencies, reducing error accumulation, and improving inference efficiency while maintaining performance.
Abstract: Chinese Mandarin visual speech recognition (VSR) is a task that has advanced in recent years, yet still lags behind the performance on non-tonal languages such as English. One primary challenge arises from the tonal nature of Mandarin, which limits the effectiveness of conventional sequence-to-sequence modeling approaches. To alleviate this issue, existing Chinese VSR systems commonly incorporate intermediate representations, most notably pinyin, within cascade architectures to enhance recognition accuracy. While beneficial, in these cascaded designs, the subsequent stage during inference depends on the output of the preceding stage, leading to error accumulation and increased inference latency. To address these limitations, we propose a cascade-free architecture based on multitask learning that jointly integrates multiple intermediate representations, including phoneme and viseme, to better exploit contextual information. The proposed semantic-guided local contrastive loss temporally aligns the features, enabling on-demand activation during inference, thereby providing a trade-off between inference efficiency and performance while mitigating error accumulation caused by projection and re-embedding. Experiments conducted on publicly available datasets demonstrate that our method achieves superior recognition performance.
[358] Clinical Graph-Mediated Distillation for Unpaired MRI-to-CFI Hypertension Prediction
Dillan Imans, Phuoc-Nguyen Bui, Duc-Tai Le, Hyunseung Choo
Main category: cs.CV
TL;DR: CGMD transfers hypertension knowledge from brain MRI to retinal fundus models using clinical similarity graphs without paired multimodal data, improving HTN prediction from fundus images.
Details
Motivation: Retinal fundus imaging enables low-cost hypertension screening but has subtle cues, while brain MRI provides stronger markers but is expensive and rarely paired with fundus images, creating modality-siloed datasets.
Method: Clinical Graph-Mediated Distillation (CGMD) constructs a clinical similarity kNN graph using shared biomarkers to bridge MRI and fundus cohorts, trains an MRI teacher, propagates representations over the graph, imputes brain-informed targets for fundus patients, and trains a fundus student with joint HTN supervision, target distillation, and relational distillation.
Result: CGMD consistently improves fundus-based hypertension prediction over standard distillation and non-graph imputation baselines on newly collected unpaired MRI-fundus-biomarker dataset, with ablations confirming importance of clinically grounded graph connectivity.
Conclusion: CGMD effectively transfers knowledge across unpaired medical imaging modalities using clinical similarity graphs, enabling improved hypertension screening from low-cost fundus images by leveraging brain MRI insights without requiring paired data.
Abstract: Retinal fundus imaging enables low-cost and scalable hypertension (HTN) screening, but HTN-related retinal cues are subtle, yielding high-variance predictions. Brain MRI provides stronger vascular and small-vessel-disease markers of HTN, yet it is expensive and rarely acquired alongside fundus images, resulting in modality-siloed datasets with disjoint MRI and fundus cohorts. We study this unpaired MRI-fundus regime and introduce Clinical Graph-Mediated Distillation (CGMD), a framework that transfers MRI-derived HTN knowledge to a fundus model without paired multimodal data. CGMD leverages shared structured biomarkers as a bridge by constructing a clinical similarity kNN graph spanning both cohorts. We train an MRI teacher, propagate its representations over the graph, and impute brain-informed representation targets for fundus patients. A fundus student is then trained with a joint objective combining HTN supervision, target distillation, and relational distillation. Experiments on our newly collected unpaired MRI-fundus-biomarker dataset show that CGMD consistently improves fundus-based HTN prediction over standard distillation and non-graph imputation baselines, with ablations confirming the importance of clinically grounded graph connectivity. Code is available at https://github.com/DillanImans/CGMD-unpaired-distillation.
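The bridging step in CGMD — imputing a brain-informed target for a fundus patient from MRI patients who are nearby in the shared-biomarker space — can be reduced to a single cross-cohort kNN lookup. The sketch below shows only that one step under simplifying assumptions (Euclidean distance on raw biomarkers, plain neighbor averaging); the paper propagates over a joint graph spanning both cohorts, and the function name is mine.

```python
import numpy as np

def impute_targets(bio_fundus, bio_mri, teacher_emb, k=2):
    """For each fundus patient (rows of bio_fundus), find the k nearest
    MRI patients in the shared clinical-biomarker space and average
    their MRI-teacher embeddings as the brain-informed distillation
    target for the fundus student."""
    targets = []
    for b in bio_fundus:
        dist = np.linalg.norm(bio_mri - b, axis=1)
        nn = np.argsort(dist)[:k]
        targets.append(teacher_emb[nn].mean(axis=0))
    return np.stack(targets)
```

The fundus student is then trained against these imputed targets alongside the HTN label and a relational distillation term, so no paired MRI-fundus data is ever required.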
[359] Ctrl-A: Control-Driven Online Data Augmentation
Jesper B. Christensen, Ciaran Bench, Spencer A. Thomas, Hüsnü Aslan, David Balslev-Harder, Nadia A. S. Smith, Alessandra Manzin
Main category: cs.CV
TL;DR: ControlAugment (Ctrl-A) is an automated data augmentation algorithm for image-vision tasks that uses control theory principles to dynamically adjust augmentation strength distributions during training without manual initialization.
Details
Motivation: The paper addresses the need for automated data augmentation that doesn't require manual engineering of augmentation policies for new image-vision tasks. Current methods often need careful initialization of augmentation strengths or extensive hyperparameter tuning.
Method: Ctrl-A uses a control-loop architecture with relative operation response curves to dynamically and individually adapt augmentation strength distributions during training. It employs an operation-dependent update procedure that can suppress augmentation styles that negatively impact model performance.
Result: Experiments on CIFAR-10, CIFAR-100, and SVHN-core datasets using WideResNet-28-10 show that Ctrl-A is highly competitive with existing state-of-the-art data augmentation strategies.
Conclusion: Ctrl-A provides an effective automated approach to data augmentation for image-vision tasks by leveraging control theory principles, eliminating the need for manual initialization and policy engineering while achieving competitive performance.
Abstract: We introduce ControlAugment (Ctrl-A), an automated data augmentation algorithm for image-vision tasks, which incorporates principles from control theory for online adjustment of augmentation strength distributions during model training. Ctrl-A eliminates the need for initialization of individual augmentation strengths. Instead, augmentation strength distributions are dynamically, and individually, adapted during training based on a control-loop architecture and what we define as relative operation response curves. Using an operation-dependent update procedure provides Ctrl-A with the potential to suppress augmentation styles that negatively impact model performance, alleviating the need for manually engineering augmentation policies for new image-vision tasks. Experiments on the CIFAR-10, CIFAR-100, and SVHN-core benchmark datasets using the common WideResNet-28-10 architecture demonstrate that Ctrl-A is highly competitive with existing state-of-the-art data augmentation strategies.
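The control-loop idea can be sketched as a toy proportional controller for a single augmentation operation (the gain, clamping, and scalar response signal below are illustrative assumptions of ours, not Ctrl-A's actual relative operation response curves):

```python
class StrengthController:
    """Toy proportional controller for one augmentation operation.

    The measured `response` stands in for an operation's effect on model
    performance: a persistently negative response drives the strength
    toward zero, mimicking Ctrl-A's suppression of damaging augmentation
    styles, while a positive response increases it.
    """

    def __init__(self, target=0.0, gain=0.1, strength=0.5):
        self.target = target      # desired response level
        self.gain = gain          # proportional gain of the control loop
        self.strength = strength  # current augmentation strength in [0, 1]

    def update(self, response):
        error = response - self.target
        # nudge strength along the error, clamped to a valid range
        self.strength = min(1.0, max(0.0, self.strength + self.gain * error))
        return self.strength
```

Ctrl-A maintains whole strength *distributions* per operation and updates them online during training; a scalar strength is the simplest stand-in for that behaviour.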
[360] Beyond Strict Pairing: Arbitrarily Paired Training for High-Performance Infrared and Visible Image Fusion
Yanglin Deng, Tianyang Xu, Chunyang Cheng, Hui Li, Xiao-jun Wu, Josef Kittler
Main category: cs.CV
TL;DR: This paper challenges the need for strictly paired training in infrared-visible image fusion, proposing unpaired and arbitrarily paired training paradigms that achieve comparable performance with 100x less data.
Details
Motivation: Current infrared-visible image fusion methods require extensive rigidly aligned image pairs, which are costly and labor-intensive to acquire. This strict pairing limits cross-modal relationship learning and generalization.
Method: Proposes UnPaired and Arbitrarily Paired Training Paradigms (UPTP and APTP) with theoretical objectives. Develops a practical framework with three lightweight baselines (CNN, Transformer, GAN) and innovative loss functions to handle limited, unaligned training data.
Result: APTP and UPTP achieve performance comparable to strictly paired training with 100x larger datasets, demonstrating feasibility with severely limited and content-inconsistent data.
Conclusion: The proposed paradigms fundamentally reduce data collection costs and difficulty while enhancing model robustness, providing a practical solution for infrared-visible image fusion studies.
Abstract: Infrared and visible image fusion (IVIF) combines complementary modalities while preserving natural textures and salient thermal signatures. Existing solutions predominantly rely on extensive sets of rigidly aligned image pairs for training. However, acquiring such data is often impractical due to the costly and labour-intensive alignment process. Besides, maintaining a rigid pairing setting during training restricts the volume of cross-modal relationships, thereby limiting generalisation performance. To this end, this work challenges the necessity of the Strictly Paired Training Paradigm (SPTP) by systematically investigating UnPaired and Arbitrarily Paired Training Paradigms (UPTP and APTP) for high-performance IVIF. We establish a theoretical objective of APTP, reflecting the complementary nature between UPTP and SPTP. More importantly, we develop a practical framework capable of significantly enriching cross-modal relationships even with severely limited and unaligned training data. To validate our propositions, three end-to-end lightweight baselines, alongside a set of innovative loss functions, are designed to cover three classic frameworks (CNN, Transformer, GAN). Comprehensive experiments demonstrate that the proposed APTP and UPTP are feasible and capable of training models on a severely limited and content-inconsistent infrared and visible dataset, achieving performance comparable to that of a dataset 100× larger in SPTP. This finding fundamentally alleviates the cost and difficulty of data collection while enhancing model robustness from the data perspective, delivering a feasible solution for IVIF studies. The code is available at https://github.com/yanglinDeng/IVIF_unpair.
[361] SteelDefectX: A Coarse-to-Fine Vision-Language Dataset and Benchmark for Generalizable Steel Surface Defect Detection
Shuxian Zhao, Jie Gui, Baosheng Yu, Lu Dong, Zhipeng Gui
Main category: cs.CV
TL;DR: SteelDefectX: A vision-language dataset for steel surface defect detection with coarse-to-fine textual annotations to improve model interpretability and generalization.
Details
Motivation: Current steel defect detection methods rely on basic image classification with label-only datasets, limiting interpretability and generalization. Need for richer annotations to enable more explainable and generalizable models.
Method: Created SteelDefectX dataset with 7,778 images across 25 defect categories annotated with coarse-to-fine textual descriptions. Coarse level includes defect categories, visual attributes, and industrial causes; fine level includes sample-specific attributes like shape, size, depth, position, contrast. Established benchmark with four tasks: vision-only classification, vision-language classification, few/zero-shot recognition, and zero-shot transfer.
Result: Experiments with baseline models show coarse-to-fine textual annotations significantly improve interpretability, generalization, and transferability. Dataset enables models to learn richer defect representations.
Conclusion: SteelDefectX serves as valuable resource for advancing explainable, generalizable steel surface defect detection. Dataset publicly available on GitHub.
Abstract: Steel surface defect detection is essential for ensuring product quality and reliability in modern manufacturing. Current methods often rely on basic image classification models trained on label-only datasets, which limits their interpretability and generalization. To address these challenges, we introduce SteelDefectX, a vision-language dataset containing 7,778 images across 25 defect categories, annotated with coarse-to-fine textual descriptions. At the coarse-grained level, the dataset provides class-level information, including defect categories, representative visual attributes, and associated industrial causes. At the fine-grained level, it captures sample-specific attributes, such as shape, size, depth, position, and contrast, enabling models to learn richer and more detailed defect representations. We further establish a benchmark comprising four tasks, vision-only classification, vision-language classification, few/zero-shot recognition, and zero-shot transfer, to evaluate model performance and generalization. Experiments with several baseline models demonstrate that coarse-to-fine textual annotations significantly improve interpretability, generalization, and transferability. We hope that SteelDefectX will serve as a valuable resource for advancing research on explainable, generalizable steel surface defect detection. The data will be publicly available on https://github.com/Zhaosxian/SteelDefectX.
[362] Multi-View Deformable Convolution Meets Visual Mamba for Coronary Artery Segmentation
Xiaochan Yuan, Pai Zeng
Main category: cs.CV
TL;DR: MDSVM-UNet: A two-stage coronary artery segmentation framework combining multidirectional snake convolution with residual visual Mamba for efficient long-range dependency modeling in CT angiography images.
Details
Motivation: Coronary artery segmentation from CT angiography is clinically important but challenging due to vessels' multi-branching tubular morphology and class imbalance. CNN-based methods struggle with long-range dependencies, while Vision Transformers have prohibitive computational overhead for clinical deployment.
Method: Two-stage framework with MDSConv (multidirectional snake convolution) in the encoding stage for multi-view feature fusion along anatomical planes, and RVM (residual visual Mamba) in the decoding stage for efficient long-range dependency modeling with linear complexity. Uses progressive segmentation: coarse whole-image segmentation followed by fine-grained block-level segmentation.
Result: The paper proposes a novel segmentation framework but does not provide specific quantitative results in the abstract. The method addresses computational efficiency challenges while aiming to capture complex vascular geometry.
Conclusion: MDSVM-UNet offers an efficient solution for coronary artery segmentation by combining MDSConv’s geometric modeling with RVM’s efficient long-range dependency capture, potentially enabling clinical deployment in resource-constrained settings.
Abstract: Accurate segmentation of coronary arteries from computed tomography angiography (CTA) images is of paramount clinical importance for the diagnosis and treatment planning of cardiovascular diseases. However, coronary artery segmentation remains challenging due to the inherent multi-branching and slender tubular morphology of the vasculature, compounded by severe class imbalance between foreground vessels and background tissue. Conventional convolutional neural network (CNN)-based approaches struggle to capture long-range dependencies among spatially distant vascular structures, while Vision Transformer (ViT)-based methods incur prohibitive computational overhead that hinders deployment in resource-constrained clinical settings. Motivated by the recent success of state space models (SSMs) in efficiently modeling long-range sequential dependencies with linear complexity, we propose MDSVM-UNet, a novel two-stage coronary artery segmentation framework that synergistically integrates multidirectional snake convolution (MDSConv) with residual visual Mamba (RVM). In the encoding stage, we introduce MDSConv, a deformable convolution module that learns adaptive offsets along three orthogonal anatomical planes – sagittal, coronal, and axial – thereby enabling comprehensive multi-view feature fusion that faithfully captures the elongated and tortuous geometry of coronary vessels. In the decoding stage, we design an RVM-based upsampling decoder block that leverages selective state space mechanisms to model inter-slice long-range dependencies while preserving linear computational complexity. Furthermore, we propose a progressive two-stage segmentation strategy: the first stage performs coarse whole-image segmentation to guide intelligent block extraction, while the second stage conducts fine-grained block-level segmentation to recover vascular details and suppress false positives.
[363] Climate Prompting: Generating the Madden-Julian Oscillation using Video Diffusion and Low-Dimensional Conditioning
Sulian Thual, Feiyang Cai, Jingjing Wang, Feng Luo
Main category: cs.CV
TL;DR: A video diffusion model trained on atmospheric reanalysis data generates realistic Madden-Julian oscillation sequences conditioned on low-dimensional metrics, enabling analysis of physical drivers through idealized conditionings.
Details
Motivation: To bridge the gap between traditional low-dimensional MJO theory and high-resolution atmospheric complexity by using generative deep learning to synthesize MJO sequences that can be analyzed for underlying physical processes.
Method: A video diffusion model trained on atmospheric reanalysis data to generate MJO sequences conditioned on key low-dimensional metrics, with intentionally idealized conditionings to create tractable scenarios for analysis.
Result: The generated MJOs capture key features including composites, power spectra, and multiscale structures like convectively coupled waves, despite some biases. The model successfully generates tractable MJOs under idealized conditions.
Conclusion: The approach provides a practical framework for connecting low-dimensional MJO theory with high-resolution atmospheric complexity and will aid tropical atmosphere prediction.
Abstract: Generative Deep Learning is a powerful tool for modeling the Madden-Julian oscillation (MJO) in the tropics, yet its relationship to traditional theoretical frameworks remains poorly understood. Here we propose a video diffusion model, trained on atmospheric reanalysis, to synthesize long MJO sequences conditioned on key low-dimensional metrics. The generated MJOs capture key features including composites, power spectra and multiscale structures such as convectively coupled waves, despite some bias. We then prompt the model to generate more tractable MJOs based on intentionally idealized low-dimensional conditionings, for example a perpetual MJO, an isolated modulation by seasons and/or the El Niño-Southern Oscillation, and so on. This enables deconstructing the underlying processes and identifying physical drivers. The present approach provides a practical framework for bridging the gap between low-dimensional MJO theory and high-resolution atmospheric complexity and will help tropical atmosphere prediction.
[364] Adaptive Video Distillation: Mitigating Oversaturation and Temporal Collapse in Few-Step Generation
Yuyang You, Yongzhi Li, Jiahui Li, Yadong Mu, Quan Chen, Peng Jiang
Main category: cs.CV
TL;DR: A novel distillation framework for video diffusion models that addresses artifacts from image distillation techniques through adaptive regression loss, temporal regularization, and inference-time frame interpolation.
Details
Motivation: Video generation is computationally expensive, making model distillation crucial for efficient deployment. Current methods often adapt image distillation techniques, leading to artifacts like oversaturation, temporal inconsistency, and mode collapse in video synthesis.
Method: Proposes a video-specific distillation framework with three key innovations: (1) adaptive regression loss that dynamically adjusts spatial supervision weights to prevent artifacts from distribution shifts, (2) temporal regularization loss to prevent temporal collapse and promote smooth sampling trajectories, and (3) inference-time frame interpolation to reduce sampling overhead while preserving quality.
Result: Extensive experiments on VBench and VBench2 benchmarks show the method achieves stable few-step video synthesis, significantly enhancing perceptual fidelity and motion realism, consistently outperforming existing distillation baselines across multiple metrics.
Conclusion: The proposed distillation framework effectively addresses video-specific artifacts from image distillation techniques, enabling efficient and high-quality video generation through adaptive spatial supervision, temporal regularization, and optimized inference strategies.
Abstract: Video generation has recently emerged as a central task in the field of generative AI. However, the substantial computational cost inherent in video synthesis makes model distillation a critical technique for efficient deployment. Despite its significance, there is a scarcity of methods specifically designed for video diffusion models. Prevailing approaches often directly adapt image distillation techniques, which frequently lead to artifacts such as oversaturation, temporal inconsistency, and mode collapse. To address these challenges, we propose a novel distillation framework tailored specifically for video diffusion models. Its core innovations include: (1) an adaptive regression loss that dynamically adjusts spatial supervision weights to prevent artifacts arising from excessive distribution shifts; (2) a temporal regularization loss to counteract temporal collapse, promoting smooth and physically plausible sampling trajectories; and (3) an inference-time frame interpolation strategy that reduces sampling overhead while preserving perceptual quality. Extensive experiments and ablation studies on the VBench and VBench2 benchmarks demonstrate that our method achieves stable few-step video synthesis, significantly enhancing perceptual fidelity and motion realism. It consistently outperforms existing distillation baselines across multiple metrics.
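One way to picture a temporal regularizer of the kind described is a two-term penalty over per-frame features (this formulation is our own illustration, not the paper's loss): one term fires when consecutive frames collapse into near-identical copies, the other penalizes jerky second-order differences.

```python
def temporal_regularizer(frames, eps=1e-3):
    """Toy temporal regulariser over a list of per-frame feature vectors.

    Returns (collapse, smooth), two hedged penalty terms:
    - `collapse` grows as consecutive frames become nearly identical
      (degenerate, frozen motion);
    - `smooth` penalises second-order differences, discouraging jerky
      trajectories. In practice the two would be weighted and summed.
    """
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    collapse = sum(1.0 / (sqdist(frames[i], frames[i + 1]) + eps)
                   for i in range(len(frames) - 1))
    # ||f[i] - 2 f[i+1] + f[i+2]||^2 for each consecutive triple
    smooth = sum(sqdist([2 * m - p for p, m in zip(frames[i], frames[i + 1])],
                        frames[i + 2])
                 for i in range(len(frames) - 2))
    return collapse, smooth
```

A frozen clip scores a high collapse penalty and zero smoothness penalty; a linearly moving clip scores low on both, which is the qualitative behaviour such a regularizer is after.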
[365] Adversarial Camouflage
Paweł Borsukiewicz, Daniele Lunghi, Melissa Tessa, Jacques Klein, Tegawendé F. Bissyandé
Main category: cs.CV
TL;DR: Adversarial Camouflage is a privacy protection method that generates optimized patterns to degrade face recognition models, showing effectiveness across multiple architectures and in real-world scenarios.
Details
Motivation: Address privacy concerns from widespread facial recognition deployment by creating an efficient, reproducible physical-world solution that protects users from mass surveillance threats.
Method: Define low-dimensional pattern space (color, shape, angle), optimize patterns to maximize recognition error, project onto semantically valid facial regions, ensure cross-model transferability across architectures.
Result: Significantly degrades performance of all tested state-of-the-art face recognition models in simulations, shows promising real-world results, reveals model robustness differences and attack transferability across architectures.
Conclusion: Adversarial Camouflage provides an effective privacy protection solution against facial recognition systems with demonstrated cross-model effectiveness and real-world applicability.
Abstract: While the rapid development of facial recognition algorithms has enabled numerous beneficial applications, their widespread deployment has raised significant concerns about the risks of mass surveillance and threats to individual privacy. In this paper, we introduce Adversarial Camouflage as a novel solution for protecting users’ privacy. This approach is designed to be efficient and simple to reproduce for users in the physical world. The algorithm starts by defining a low-dimensional pattern space parameterized by color, shape, and angle. Optimized patterns, once found, are projected onto semantically valid facial regions for evaluation. Our method maximizes recognition error across multiple architectures, ensuring high cross-model transferability even against black-box systems. It significantly degrades the performance of all tested state-of-the-art face recognition models during simulations and demonstrates promising results in real-world human experiments, while revealing differences in model robustness and evidence of attack transferability across architectures.
[366] Manifold-Aware Exploration for Reinforcement Learning in Video Generation
Mingzhe Zheng, Weijie Kong, Yue Wu, Dengyang Jiang, Yue Ma, Xuanhua He, Bin Lin, Kaixiong Gong, Zhao Zhong, Liefeng Bo, Qifeng Chen, Harry Yang
Main category: cs.CV
TL;DR: SAGE-GRPO is a stable alignment method for video generation that constrains exploration within the pre-trained model’s data manifold to improve rollout quality and reward reliability during post-training alignment.
Details
Motivation: Current GRPO methods for video generation are less reliable than those for language and image models due to complex solution spaces and ODE-to-SDE conversion injecting excess noise, which lowers rollout quality and makes reward estimates unreliable, destabilizing post-training alignment.
Method: SAGE-GRPO applies constraints at micro and macro levels: the micro level includes a manifold-aware SDE with logarithmic curvature correction and a gradient norm equalizer; the macro level uses a dual trust region with a periodic moving anchor and stepwise constraints to track checkpoints closer to the manifold and limit drift.
Result: Evaluation on HunyuanVideo1.5 with VideoAlign reward model shows consistent gains over previous methods in VQ, MQ, TA, and visual metrics (CLIPScore, PickScore), demonstrating superior performance in both reward maximization and overall video quality.
Conclusion: Constraining exploration within the pre-trained model’s data manifold through micro and macro-level constraints enables stable alignment for video generation, addressing reliability issues in current GRPO methods.
Abstract: Group Relative Policy Optimization (GRPO) methods for video generation like FlowGRPO remain far less reliable than their counterparts for language models and images. This gap arises because video generation has a complex solution space, and the ODE-to-SDE conversion used for exploration can inject excess noise, lowering rollout quality and making reward estimates less reliable, which destabilizes post-training alignment. To address this problem, we view the pre-trained model as defining a valid video data manifold and formulate the core problem as constraining exploration within the vicinity of this manifold, ensuring that rollout quality is preserved and reward estimates remain reliable. We propose SAGE-GRPO (Stable Alignment via Exploration), which applies constraints at both micro and macro levels. At the micro level, we derive a precise manifold-aware SDE with a logarithmic curvature correction and introduce a gradient norm equalizer to stabilize sampling and updates across timesteps. At the macro level, we use a dual trust region with a periodic moving anchor and stepwise constraints so that the trust region tracks checkpoints that are closer to the manifold and limits long-horizon drift. We evaluate SAGE-GRPO on HunyuanVideo1.5 using the original VideoAlign as the reward model and observe consistent gains over previous methods in VQ, MQ, TA, and visual metrics (CLIPScore, PickScore), demonstrating superior performance in both reward maximization and overall video quality. The code and visual gallery are available at https://dungeonmassster.github.io/SAGE-GRPO-Page/.
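The macro-level dual trust region can be caricatured as clipping an importance ratio against both the previous step's policy (stepwise constraint) and a periodically refreshed anchor checkpoint (long-horizon drift limit). This is a loose simplification of ours; the paper's actual constraint, anchor schedule, and how the two bounds combine are more involved.

```python
import math

def dual_trust_region_ratio(logp_new, logp_old, logp_anchor,
                            eps_step=0.2, eps_anchor=0.5):
    """Toy dual trust-region clip (our own simplification).

    The importance ratio is bounded both against the previous step's
    policy (tight clip) and against a periodically updated anchor
    (looser clip); the effective ratio honours the tighter constraint.
    """
    r_step = math.exp(logp_new - logp_old)
    r_anchor = math.exp(logp_new - logp_anchor)
    clipped_step = min(max(r_step, 1 - eps_step), 1 + eps_step)
    clipped_anchor = min(max(r_anchor, 1 - eps_anchor), 1 + eps_anchor)
    return min(clipped_step, clipped_anchor)
```

When the policy has barely moved since the last step but has drifted far from the anchor, the anchor clip binds; otherwise the stepwise clip does, which is the qualitative effect the macro-level constraint is meant to have.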
[367] Thermal Topology Collapse: Universal Physical Patch Attacks on Infrared Vision Systems
Chengyin Hu, Yikun Guo, Yuxian Dong, Qike Zhang, Kalibinuer Tiliwalidi, Yiwei Wei, Haitao Shi, Jiujiang Guo, Jiahuan Long, Xiang Chen
Main category: cs.CV
TL;DR: UPPA is a universal physical patch attack method for infrared pedestrian detectors that uses parameterized Bezier blocks and PSO optimization to create physically robust adversarial patches that work across domains without online computation.
Details
Motivation: Existing physical attack methods for infrared detectors rely on instance-specific online optimization and rigid patterns, leading to high deployment costs and insufficient physical robustness. There's a need for a universal attack method that can work across different scenarios without per-instance optimization.
Method: UPPA uses geometrically constrained parameterized Bezier blocks to model perturbations, employs Particle Swarm Optimization (PSO) for unified optimization across the global data distribution, and materializes digital perturbations into physical cold patches that create continuous low-temperature distributions aligned with infrared thermal radiation characteristics.
Result: Extensive experiments show UPPA achieves outstanding physical attack success rate without online computational overhead, exhibits strong cross-domain generalization, and has reliable black-box transferability.
Conclusion: UPPA is the first universal physical attack method in the infrared domain that addresses limitations of existing methods by providing a computationally efficient, physically robust solution with good generalization capabilities.
Abstract: Although infrared pedestrian detectors have been widely deployed in visual perception tasks, their vulnerability to physical adversarial attacks is becoming increasingly apparent. Existing physical attack methods predominantly rely on instance-specific online optimization and rigid pattern design, leading to high deployment costs and insufficient physical robustness. To address these limitations, this work proposes the Universal Physical Patch Attack (UPPA), the first universal physical attack method in the infrared domain. This method employs geometrically constrained parameterized Bezier blocks to model perturbations and utilizes the Particle Swarm Optimization (PSO) algorithm to perform unified optimization across the global data distribution, thus maintaining topological stability under dynamic deformations. In the physical deployment phase, we materialize the optimized digital perturbations into physical cold patches, achieving a continuous and smooth low-temperature distribution that naturally aligns with the thermal radiation characteristics of infrared imaging. Extensive experiments demonstrate that UPPA achieves an outstanding physical attack success rate without any online computational overhead, while also exhibiting strong cross-domain generalization and reliable black-box transferability.
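A generic PSO loop of the kind UPPA builds on looks like the sketch below. The sphere objective is a stand-in of ours for the real objective, which would score Bezier patch parameters by (negative) detector confidence over the data distribution; none of that is reproduced here.

```python
import random

def pso(objective, dim, n_particles=10, iters=50, seed=0):
    """Minimal Particle Swarm Optimisation (minimisation)."""
    rng = random.Random(seed)
    pos = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                      # per-particle best position
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]     # swarm-wide best

    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                # inertia + cognitive pull (pbest) + social pull (gbest)
                vel[i][d] = (0.7 * vel[i][d]
                             + 1.5 * rng.random() * (pbest[i][d] - pos[i][d])
                             + 1.5 * rng.random() * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

# Stand-in objective: a sphere function in place of detector confidence
# over patch parameters (Bezier control points etc.).
best, val = pso(lambda p: sum(x * x for x in p), dim=4)
```

Because PSO is gradient-free, it sidesteps the need to differentiate through the detector, which fits the black-box transferability setting the paper targets.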
[368] Not All Layers Are Created Equal: Adaptive LoRA Ranks for Personalized Image Generation
Donald Shenaj, Federico Errica, Antonio Carta
Main category: cs.CV
TL;DR: LoRA² introduces adaptive rank selection for diffusion model fine-tuning, allowing different layers to have different ranks based on importance, improving memory-performance trade-off.
Details
Motivation: Current LoRA fine-tuning uses fixed ranks for all layers regardless of subject complexity, leading to suboptimal memory-performance trade-offs. The combinatorial cost of selecting optimal ranks per layer makes manual tuning impractical.
Method: Proposes LoRA², which learns adaptive ranks for each layer during fine-tuning by imposing importance ordering on rank positions, encouraging higher ranks only when needed. Uses variational methods to let ranks freely adapt based on subject complexity.
Result: Achieves competitive trade-off between DINO, CLIP-I, and CLIP-T metrics across 29 subjects while requiring less memory and lower overall rank than high-rank LoRA versions. Shows both qualitative and quantitative improvements.
Conclusion: LoRA² provides an effective solution to the rank selection problem in diffusion model fine-tuning, enabling adaptive layer-wise rank allocation that balances performance and memory efficiency better than fixed-rank approaches.
Abstract: Low Rank Adaptation (LoRA) is the de facto fine-tuning strategy to generate personalized images from pre-trained diffusion models. Choosing a good rank is extremely critical, since it trades off performance and memory consumption, but today the decision is often left to the community’s consensus, regardless of the personalized subject’s complexity. The reason is evident: the cost of selecting a good rank for each LoRA component is combinatorial, so we opt for practical shortcuts such as fixing the same rank for all components. In this paper, we take a first step to overcome this challenge. Inspired by variational methods that learn an adaptive width of neural networks, we let the ranks of each layer freely adapt during fine-tuning on a subject. We achieve it by imposing an ordering of importance on the rank’s positions, effectively encouraging the creation of higher ranks when strictly needed. Qualitatively and quantitatively, our approach, LoRA², achieves a competitive trade-off between DINO, CLIP-I, and CLIP-T across 29 subjects while requiring much less memory and lower rank than high-rank LoRA versions. Code: https://github.com/donaldssh/NotAllLayersAreCreatedEqual.
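The importance ordering on rank positions can be sketched as nested gates: a rank survives only if every more-important rank before it survives. Threshold gating and the toy matrix shapes below are our simplifications of the paper's variational scheme.

```python
def effective_rank(gate_scores, threshold=0.5):
    """Importance-ordered gating (a sketch, not the paper's exact scheme):
    rank position i stays active only if every earlier, more important
    position is also active, so higher ranks appear only when needed."""
    rank = 0
    for score in gate_scores:  # ordered from most to least important
        if score >= threshold:
            rank += 1
        else:
            break  # ordering: a later rank cannot survive on its own
    return rank

def lora_delta(A, B, rank):
    """Weight update from the first `rank` components: dW = B[:, :r] @ A[:r, :].
    A is (R x in_dim), B is (out_dim x R); plain lists for illustration."""
    out_dim, in_dim = len(B), len(A[0])
    dW = [[0.0] * in_dim for _ in range(out_dim)]
    for r in range(rank):
        for i in range(out_dim):
            for j in range(in_dim):
                dW[i][j] += B[i][r] * A[r][j]
    return dW
```

Because each layer learns its own gate scores, a simple subject might keep rank 2 in most layers while a complex one opens more positions, which is the memory-performance trade-off the paper reports.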
[369] Deep S2P: Integrating Learning Based Stereo Matching Into the Satellite Stereo Pipeline
Elías Masquil, Thibaud Ehret, Pablo Musé, Gabriele Facciolo
Main category: cs.CV
TL;DR: Integration of modern learning-based stereo matchers into satellite stereo pipelines for improved Digital Surface Model generation, with adaptations for satellite viewing geometry and disparity assumptions.
Details
Motivation: To bridge the gap between state-of-the-art learning-based stereo matchers and operational satellite pipelines, which face challenges due to differences in viewing geometry and disparity assumptions that prevent direct integration.
Method: Integrated several modern learning-based stereo matchers (StereoAnywhere, MonSter, Foundation Stereo, and satellite fine-tuned MonSter) into the Satellite Stereo Pipeline (S2P), adapting the rectification stage to enforce compatible disparity polarity and range.
Result: Experiments show consistent improvements over classical cost-volume-based approaches in Digital Surface Model accuracy, with substantially improved geometric detail and sharper structures, though metrics like mean absolute error show saturation effects. Performance on challenging surfaces like vegetation remains limited.
Conclusion: Learning-based stereo matchers can improve satellite stereo pipelines but require adaptations for satellite geometry, and evaluation strategies need to better reflect perceptual and structural fidelity rather than just traditional metrics.
Abstract: Digital Surface Model generation from satellite imagery is a core task in Earth observation and is commonly addressed using classical stereoscopic matching algorithms in satellite pipelines as in the Satellite Stereo Pipeline (S2P). While recent learning-based stereo matchers achieve state-of-the-art performance on standard benchmarks, their integration into operational satellite pipelines remains challenging due to differences in viewing geometry and disparity assumptions. In this work, we integrate several modern learning-based stereo matchers, including StereoAnywhere, MonSter, Foundation Stereo, and a satellite fine-tuned variant of MonSter, into the Satellite Stereo Pipeline, adapting the rectification stage to enforce compatible disparity polarity and range. We release the corresponding code to enable reproducible use of these methods in large-scale Earth observation workflows. Experiments on satellite imagery show consistent improvements over classical cost-volume-based approaches in terms of Digital Surface Model accuracy, although commonly used metrics such as mean absolute error exhibit saturation effects. Qualitative results reveal substantially improved geometric detail and sharper structures, highlighting the need for evaluation strategies that better reflect perceptual and structural fidelity. At the same time, performance over challenging surface types such as vegetation remains limited across all evaluated models, indicating open challenges for learning-based stereo in natural environments.
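The disparity-polarity adaptation mentioned above can be illustrated minimally (our own toy; a real rectification fix would also swap or flip the rectified image pair and remap the disparity search range):

```python
def enforce_positive_disparity(disparity_row):
    """Toy polarity fix (illustrative assumption, not the S2P code).

    Many learning-based stereo matchers are trained assuming positive
    left-to-right disparities, while a satellite rectification can yield
    the opposite sign. If every disparity in the row is non-positive,
    flip the sign to restore the convention the matcher expects."""
    if all(d <= 0 for d in disparity_row):
        return [-d for d in disparity_row]
    return disparity_row
```

Without such a convention check, a matcher fed reversed-polarity rectified pairs searches on the wrong side of the epipolar line and fails silently, which is why the integration adapts the rectification stage rather than the matchers.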
[370] SHAPE: Structure-aware Hierarchical Unsupervised Domain Adaptation with Plausibility Evaluation for Medical Image Segmentation
Linkuan Zhou, Yinghao Xia, Yufei Shen, Xiangyu Li, Wenjie Du, Cong Cong, Leyi Wei, Ran Su, Qiangguo Jin
Main category: cs.CV
TL;DR: SHAPE: A structure-aware hierarchical UDA framework for medical segmentation that uses DINOv3 foundation with hierarchical feature modulation and hypergraph-based plausibility estimation to ensure anatomically plausible segmentation across domains.
Details
Motivation: Existing UDA methods for medical segmentation suffer from semantically unaware feature alignment leading to poor distributional fidelity, and pseudo-label validation that ignores global anatomical constraints, resulting in implausible structures.
Method: Built on a DINOv3 foundation with Hierarchical Feature Modulation (HFM) for high-fidelity class-aware features, Hypergraph Plausibility Estimation (HPE) for global anatomical plausibility assessment, and Structural Anomaly Pruning (SAP) for artifact removal via cross-view stability.
Result: State-of-the-art performance on cardiac and abdominal cross-modality benchmarks: 90.08% (MRI->CT) and 78.51% (CT->MRI) Dice scores on cardiac data; 87.48% (MRI->CT) and 86.89% (CT->MRI) on abdominal data.
Conclusion: SHAPE successfully addresses limitations of existing UDA methods by reframing adaptation towards global anatomical plausibility, achieving superior performance through structure-aware hierarchical adaptation with plausibility evaluation.
Abstract: Unsupervised Domain Adaptation (UDA) is essential for deploying medical segmentation models across diverse clinical environments. Existing methods are fundamentally limited, suffering from semantically unaware feature alignment that results in poor distributional fidelity and from pseudo-label validation that disregards global anatomical constraints, thus failing to prevent the formation of globally implausible structures. To address these issues, we propose SHAPE (Structure-aware Hierarchical Unsupervised Domain Adaptation with Plausibility Evaluation), a framework that reframes adaptation towards global anatomical plausibility. Built on a DINOv3 foundation, its Hierarchical Feature Modulation (HFM) module first generates features with both high fidelity and class-awareness. This shifts the core challenge to robustly validating pseudo-labels. To augment conventional pixel-level validation, we introduce Hypergraph Plausibility Estimation (HPE), which leverages hypergraphs to assess the global anatomical plausibility that standard graphs cannot capture. This is complemented by Structural Anomaly Pruning (SAP) to purge remaining artifacts via cross-view stability. SHAPE significantly outperforms prior methods on cardiac and abdominal cross-modality benchmarks, achieving state-of-the-art average Dice scores of 90.08% (MRI->CT) and 78.51% (CT->MRI) on cardiac data, and 87.48% (MRI->CT) and 86.89% (CT->MRI) on abdominal data. The code is available at https://github.com/BioMedIA-repo/SHAPE.
[371] CLEAR: Context-Aware Learning with End-to-End Mask-Free Inference for Adaptive Video Subtitle Removal
Qingdong He, Chaoyi Wang, Peng Tang, Yifan Yang, Xiaobin Hu
Main category: cs.CV
TL;DR: CLEAR is a mask-free, end-to-end video subtitle removal framework that uses context-aware adaptive learning with two-stage disentangled representation learning and LoRA-based adaptation.
Details
Motivation: Existing diffusion-based video subtitle removal methods require explicit mask sequences during both training and inference, which limits practical deployment. The authors aim to create a truly end-to-end mask-free framework.
Method: Two-stage design: Stage I learns disentangled subtitle representations via self-supervised orthogonality constraints on dual encoders. Stage II employs LoRA-based adaptation with generation feedback for dynamic context adjustment. Only requires 0.77% of base diffusion model parameters for training.
Result: Outperforms mask-dependent baselines by +6.77dB PSNR and -74.7% VFID on Chinese subtitle benchmarks. Demonstrates superior zero-shot generalization across six languages (English, Korean, French, Japanese, Russian, German).
Conclusion: CLEAR achieves truly end-to-end inference without mask requirements, enabling practical deployment with strong performance and cross-language generalization through generation-driven feedback.
Abstract: Video subtitle removal aims to distinguish text overlays from background content while preserving temporal coherence. Existing diffusion-based methods necessitate explicit mask sequences during both training and inference phases, which restricts their practical deployment. In this paper, we present CLEAR (Context-aware Learning for End-to-end Adaptive Video Subtitle Removal), a mask-free framework that achieves truly end-to-end inference through context-aware adaptive learning. Our two-stage design decouples prior extraction from generative refinement: Stage I learns disentangled subtitle representations via self-supervised orthogonality constraints on dual encoders, while Stage II employs LoRA-based adaptation with generation feedback for dynamic context adjustment. Notably, our method only requires 0.77% of the parameters of the base diffusion model for training. On Chinese subtitle benchmarks, CLEAR outperforms mask-dependent baselines by + 6.77dB PSNR and -74.7% VFID, while demonstrating superior zero-shot generalization across six languages (English, Korean, French, Japanese, Russian, German), a performance enabled by our generation-driven feedback mechanism that ensures robust subtitle removal without ground-truth masks during inference.
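One plausible form of Stage I's orthogonality constraint on the dual encoders: penalize the squared cosine similarity between subtitle and background features so the two representations decorrelate. The exact loss in the paper may differ; this is a hedged sketch:

```python
import numpy as np

def orthogonality_loss(z_subtitle, z_background):
    """Penalize correlation between the two encoders' features
    (illustrative form of a self-supervised orthogonality constraint)."""
    zs = z_subtitle / np.linalg.norm(z_subtitle, axis=1, keepdims=True)
    zb = z_background / np.linalg.norm(z_background, axis=1, keepdims=True)
    gram = zs @ zb.T              # pairwise cosine similarities
    return float(np.mean(gram ** 2))

rng = np.random.default_rng(0)
zs = rng.normal(size=(4, 8))
loss_random = orthogonality_loss(zs, rng.normal(size=(4, 8)))
loss_aligned = orthogonality_loss(zs, zs)   # identical (entangled) features
```

Entangled features incur a much larger penalty than independent ones, which is the gradient signal that drives disentanglement.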
[372] Camera-Agnostic Pruning of 3D Gaussian Splats via Descriptor-Based Beta Evidence
Peter Fasogbon, Ugurcan Budak, Patrice Rondao Alface, Hamed Rezazadegan Tavakoli
Main category: cs.CV
TL;DR: A camera-agnostic pruning method for 3D Gaussian splats using attribute-derived neighborhood descriptors and Beta evidence model for statistical reliability estimation.
Details
Motivation: Existing 3D Gaussian splat pruning methods depend on camera parameters, rendered images, or view-dependent measures, which hinders camera-agnostic exchange settings where splats are shared as point-based representations (e.g., .ply files).
Method: Proposes a camera-agnostic, one-shot, post-training pruning method using hybrid descriptor framework capturing structural and appearance consistency directly from splat representation. Formulates pruning as statistical evidence estimation problem with Beta evidence model that quantifies per-splat reliability through probabilistic confidence scores.
Result: Experiments on ISO/IEC MPEG CTC test sequences show substantial pruning while preserving reconstruction quality, establishing a practical and generalizable alternative to camera-dependent pruning strategies.
Conclusion: The method provides an effective camera-agnostic pruning solution for 3D Gaussian splats that works in emerging exchange settings where splats are shared directly as point-based representations.
Abstract: The pruning of 3D Gaussian splats is essential for reducing their complexity to enable efficient storage, transmission, and downstream processing. However, most of the existing pruning strategies depend on camera parameters, rendered images, or view-dependent measures. This dependency becomes a hindrance in emerging camera-agnostic exchange settings, where splats are shared directly as point-based representations (e.g., .ply). In this paper, we propose a camera-agnostic, one-shot, post-training pruning method for 3D Gaussian splats that relies solely on attribute-derived neighbourhood descriptors. As our primary contribution, we introduce a hybrid descriptor framework that captures structural and appearance consistency directly from the splat representation. Building on these descriptors, we formulate pruning as a statistical evidence estimation problem and introduce a Beta evidence model that quantifies per-splat reliability through a probabilistic confidence score. Experiments conducted on standardized test sequences defined by the ISO/IEC MPEG Common Test Conditions (CTC) demonstrate that our approach achieves substantial pruning while preserving reconstruction quality, establishing a practical and generalizable alternative to existing camera-dependent pruning strategies.
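The Beta evidence idea can be sketched as accumulating per-splat positive and negative evidence (e.g., descriptor agreement vs. disagreement with neighbours) and thresholding the Beta posterior mean. The evidence terms and the 0.5 threshold below are illustrative; the paper derives them from its hybrid descriptors:

```python
import numpy as np

def beta_confidence(pos_evidence, neg_evidence, prior=1.0):
    """Posterior mean of Beta(prior + pos, prior + neg): a probabilistic
    per-splat reliability score (illustrative evidence model)."""
    a = prior + np.asarray(pos_evidence, dtype=float)
    b = prior + np.asarray(neg_evidence, dtype=float)
    return a / (a + b)

pos = np.array([9.0, 1.0, 5.0])   # descriptor agreements per splat
neg = np.array([1.0, 9.0, 5.0])   # descriptor disagreements per splat
conf = beta_confidence(pos, neg)
keep = conf >= 0.5                # one-shot, camera-agnostic pruning
```

Note the method needs only the splat attributes themselves, which is what makes it viable for .ply-style exchange without cameras or renders.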
[373] The Golden Subspace: Where Efficiency Meets Generalization in Continual Test-Time Adaptation
Guannan Lai, Da-Wei Zhou, Zhenguo Li, Han-Jia Ye
Main category: cs.CV
TL;DR: GOLD proposes an efficient continual test-time adaptation method that identifies and maintains a “golden subspace” for minimal feature updates, achieving better efficiency-generalization trade-off than existing CTTA methods.
Details
Motivation: Existing Continual Test-Time Adaptation (CTTA) methods face an efficiency-generalization trade-off: updating more parameters improves adaptation but reduces online inference efficiency. The authors aim to achieve comparable adaptation with minimal feature updates by identifying the optimal subspace for adaptation.
Method: The paper proves the existence of a “golden subspace” for adaptation that coincides with the row space of the pretrained classifier. They introduce sample-wise Average Gradient Outer Product (AGOP) to efficiently estimate classifier weights without retraining. GOLD uses a lightweight adapter to project features onto this subspace and learns a compact scaling vector while dynamically updating the subspace via AGOP.
Result: Extensive experiments on classification and segmentation benchmarks, including autonomous-driving scenarios, demonstrate that GOLD attains superior efficiency, stability, and overall performance compared to existing CTTA methods.
Conclusion: GOLD successfully addresses the efficiency-generalization trade-off in CTTA by identifying and maintaining the golden subspace for adaptation, enabling efficient online adaptation with minimal feature updates while maintaining strong performance.
Abstract: Continual Test-Time Adaptation (CTTA) aims to enable models to adapt online to unlabeled data streams under distribution shift without accessing source data. Existing CTTA methods face an efficiency-generalization trade-off: updating more parameters improves adaptation but severely reduces online inference efficiency. An ideal solution is to achieve comparable adaptation with minimal feature updates; we call this minimal subspace the golden subspace. We prove its existence in a single-step adaptation setting and show that it coincides with the row space of the pretrained classifier. To enable online maintenance of this subspace, we introduce the sample-wise Average Gradient Outer Product (AGOP) as an efficient proxy for estimating the classifier weights without retraining. Building on these insights, we propose Guided Online Low-rank Directional adaptation (GOLD), which uses a lightweight adapter to project features onto the golden subspace and learns a compact scaling vector while the subspace is dynamically updated via AGOP. Extensive experiments on classification and segmentation benchmarks, including autonomous-driving scenarios, demonstrate that GOLD attains superior efficiency, stability, and overall performance. Our code is available at https://github.com/AIGNLAI/GOLD.
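The AGOP proxy is easy to demonstrate: for a linear head f(x) = Wx, per-sample input gradients lie in the row space of W, so the top eigenvectors of the averaged gradient outer product recover that subspace without touching the classifier. The toy gradient surrogate below is an assumption for illustration:

```python
import numpy as np

def agop(per_sample_grads):
    """Sample-wise Average Gradient Outer Product: (1/N) * sum_i g_i g_i^T."""
    G = np.asarray(per_sample_grads)          # (N, d)
    return G.T @ G / G.shape[0]

rng = np.random.default_rng(1)
W = rng.normal(size=(3, 16))                  # 3-class linear head, d = 16
X = rng.normal(size=(64, 16))
grads = X @ W.T @ W                           # grad of 0.5*||W x||^2 per sample
M = agop(grads)
evals, evecs = np.linalg.eigh(M)
V = evecs[:, -3:]                             # top-3 eigenvectors of AGOP
residual = np.linalg.norm(W - W @ V @ V.T)    # W lies in the recovered subspace
```

The near-zero residual is the "golden subspace coincides with the classifier row space" claim in miniature; GOLD then restricts its lightweight adapter updates to this subspace.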
[374] Chronological Contrastive Learning: Few-Shot Progression Assessment in Irreversible Diseases
Clemens Watzenböck, Daniel Aletaha, Michaël Deman, Thomas Deimel, Jana Eder, Ivana Janickova, Robert Janiczek, Peter Mandl, Philipp Seeböck, Gabriela Supp, Paul Weiser, Georg Langs
Main category: cs.CV
TL;DR: ChronoCon is a self-supervised contrastive learning method that uses temporal ordering of longitudinal medical scans to learn disease severity representations without expert labels, assuming monotonic disease progression.
Details
Motivation: Expert-annotated disease severity scores in medical imaging are expensive, time-consuming, and variable, while clinical archives contain abundant longitudinal imaging data that existing self-supervised methods ignore. The paper aims to leverage chronological structure in patient scans to reduce annotation requirements.
Method: ChronoCon uses contrastive learning with rankings derived from visitation order of longitudinal scans, replacing label-based ranking losses with temporal ordering. It assumes monotonic progression in irreversible diseases to learn disease-relevant representations without expert labels, generalizing Rank-N-Contrast from label distances to temporal ordering.
Result: On rheumatoid arthritis radiographs, ChronoCon significantly improves label efficiency, outperforming fully supervised baselines in low-label settings. Fine-tuning on expert scores from only five patients yields 86% intraclass correlation coefficient for severity prediction.
Conclusion: Chronological contrastive learning can exploit routinely available imaging metadata to reduce annotation requirements in irreversible disease domains, demonstrating practical value for medical imaging applications.
Abstract: Quantitative disease severity scoring in medical imaging is costly, time-consuming, and subject to inter-reader variability. At the same time, clinical archives contain far more longitudinal imaging data than expert-annotated severity scores. Existing self-supervised methods typically ignore this chronological structure. We introduce ChronoCon, a contrastive learning approach that replaces label-based ranking losses with rankings derived solely from the visitation order of a patient’s longitudinal scans. Under the clinically plausible assumption of monotonic progression in irreversible diseases, the method learns disease-relevant representations without using any expert labels. This generalizes the idea of Rank-N-Contrast from label distances to temporal ordering. Evaluated on rheumatoid arthritis radiographs for severity assessment, the learned representations substantially improve label efficiency. In low-label settings, ChronoCon significantly outperforms a fully supervised baseline initialized from ImageNet weights. In a few-shot learning experiment, fine-tuning ChronoCon on expert scores from only five patients yields an intraclass correlation coefficient of 86% for severity score prediction. These results demonstrate the potential of chronological contrastive learning to exploit routinely available imaging metadata to reduce annotation requirements in the irreversible disease domain. Code is available at https://github.com/cirmuw/ChronoCon.
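A simplified sketch of the Rank-N-Contrast generalization: for an anchor scan, another scan's denominator set contains all scans at least as far away in visit order, so temporally closer scans are pulled relatively closer in embedding space. Temperature and embedding sizes are arbitrary choices here:

```python
import numpy as np

def chrono_rank_contrast(z, visits, temp=0.5):
    """Rank-N-Contrast style loss with ranks taken from visit order
    instead of label distances (simplified sketch of ChronoCon)."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / temp
    n, loss, terms = len(visits), 0.0, 0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d_ij = abs(visits[j] - visits[i])
            # denominator: scans at least as far from i in time as j is
            denom_idx = [k for k in range(n)
                         if k != i and abs(visits[k] - visits[i]) >= d_ij]
            denom = np.sum(np.exp(sim[i, denom_idx]))
            loss -= np.log(np.exp(sim[i, j]) / denom)
            terms += 1
    return loss / terms

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 8))                  # embeddings of one patient's scans
loss = chrono_rank_contrast(z, visits=[0, 1, 2, 3])
```

No severity labels appear anywhere; the monotonic-progression assumption is what licenses using visit order as a rank signal.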
[375] SatGeo-NeRF: Geometrically Regularized NeRF for Satellite Imagery
Valentin Wagner, Sebastian Bullinger, Michael Arens, Rainer Stiefelhagen
Main category: cs.CV
TL;DR: SatGeo-NeRF introduces three geometric regularizers for satellite NeRF reconstruction to reduce overfitting artifacts and improve geometric accuracy.
Details
Motivation: Current NeRF models for satellite imagery suffer from overfitting-induced geometric artifacts, leading to inaccurate 3D reconstructions from satellite imagery.
Method: Three model-agnostic regularizers: 1) Gravity-Aligned Planarity Regularization aligns surface normals with gravity axis, 2) Granularity Regularization enforces coarse-to-fine geometry learning, 3) Depth-Supervised Regularization stabilizes early training.
Result: On DFC2019 benchmark, SatGeo-NeRF improves Mean Altitude Error by 13.9% and 11.7% relative to state-of-the-art baselines EO-NeRF and EO-GS.
Conclusion: The proposed geometric regularizers effectively mitigate overfitting artifacts in satellite NeRF reconstruction, significantly improving geometric accuracy.
Abstract: We present SatGeo-NeRF, a geometrically regularized NeRF for satellite imagery that mitigates overfitting-induced geometric artifacts observed in current state-of-the-art models using three model-agnostic regularizers. Gravity-Aligned Planarity Regularization aligns depth-inferred, approximated surface normals with the gravity axis to promote local planarity, coupling adjacent rays via a corresponding surface approximation to facilitate cross-ray gradient flow. Granularity Regularization enforces a coarse-to-fine geometry-learning scheme, and Depth-Supervised Regularization stabilizes early training for improved geometric accuracy. On the DFC2019 satellite reconstruction benchmark, SatGeo-NeRF improves the Mean Altitude Error by 13.9% and 11.7% relative to state-of-the-art baselines such as EO-NeRF and EO-GS.
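One plausible form of the gravity-aligned planarity term: penalize depth-inferred surface normals for deviating from the gravity axis, so locally flat (roof/ground) geometry is favoured. The paper applies this selectively via its surface approximation; the unweighted penalty below is a sketch:

```python
import numpy as np

def gravity_planarity_penalty(normals, gravity=np.array([0.0, 0.0, 1.0])):
    """Mean squared-sine deviation of surface normals from the gravity
    axis (illustrative; the paper's exact weighting/coupling differs)."""
    n = normals / np.linalg.norm(normals, axis=1, keepdims=True)
    cos = n @ gravity
    return float(np.mean(1.0 - cos ** 2))

flat_roofs = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, -1.0]])  # zero penalty
walls = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])        # max penalty
```

In the full method this penalty also couples adjacent rays through a shared surface approximation, which is what enables the cross-ray gradient flow the abstract mentions.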
[376] Cross-Instance Gaussian Splatting Registration via Geometry-Aware Feature-Guided Alignment
Roy Amoyal, Oren Freifeld, Chaim Baskin
Main category: cs.CV
TL;DR: Gaussian Splatting Alignment (GSA) is a novel method for aligning two independent 3D Gaussian Splatting models via similarity transformation, even for different objects in the same category, outperforming existing methods.
Details
Motivation: Existing methods can only align 3DGS models of the same object and often require true scale as input. There's a need for category-level 3DGS registration that works for different objects within the same category without scale priors.
Method: Two-step optimization framework: 1) Iterative feature-guided absolute orientation solver for coarse registration (robust to poor initialization), 2) Fine registration enforcing multi-view feature consistency inspired by inverse radiance-field formulations. Uses viewpoint-guided spherical map features for robust correspondences.
Result: GSA outperforms prior works in same-object cases, often by large margins, even when competitors are given true scale. For different objects in same category, GSA vastly surpasses existing methods, providing first effective solution for category-level 3DGS registration.
Conclusion: GSA enables effective category-level 3D Gaussian Splatting registration, unlocking new applications by aligning different objects within the same category without requiring scale priors or same-object constraints.
Abstract: We present Gaussian Splatting Alignment (GSA), a novel method for aligning two independent 3D Gaussian Splatting (3DGS) models via a similarity transformation (rotation, translation, and scale), even when they are of different objects in the same category (e.g., different cars). In contrast, existing methods can only align 3DGS models of the same object (e.g., the same car) and often must be given true scale as input, while we estimate it successfully. GSA leverages viewpoint-guided spherical map features to obtain robust correspondences and introduces a two-step optimization framework that aligns 3DGS models while keeping them fixed. First, we apply an iterative feature-guided absolute orientation solver as our coarse registration, which is robust to poor initialization (e.g., 180 degrees misalignment or a 10x scale gap). Next, we use a fine registration step that enforces multi-view feature consistency, inspired by inverse radiance-field formulations. The first step already achieves state-of-the-art performance, and the second further improves results. In the same-object case, GSA outperforms prior works, often by a large margin, even when the other methods are given the true scale. In the harder case of different objects in the same category, GSA vastly surpasses them, providing the first effective solution for category-level 3DGS registration and unlocking new applications. Project webpage: https://bgu-cs-vil.github.io/GSA-project/
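The absolute orientation solver at the heart of the coarse step has a standard closed form (Umeyama/Horn style) that recovers rotation, translation, and scale from matched points; GSA iterates it with feature-guided correspondences. A self-contained sketch that recovers a known 10x-scale transform:

```python
import numpy as np

def similarity_from_correspondences(src, dst):
    """Closed-form absolute orientation with scale from matched points."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    sc, dc = src - mu_s, dst - mu_d
    U, S, Vt = np.linalg.svd(dc.T @ sc)       # cross-covariance SVD
    D = np.eye(3)
    D[2, 2] = np.sign(np.linalg.det(U @ Vt))  # guard against reflections
    R = U @ D @ Vt
    scale = np.trace(np.diag(S) @ D) / np.sum(sc ** 2)
    t = mu_d - scale * R @ mu_s
    return scale, R, t

rng = np.random.default_rng(0)
P = rng.normal(size=(20, 3))
theta = np.pi / 3
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
Q = 10.0 * P @ R_true.T + np.array([1.0, -2.0, 3.0])
scale, R, t = similarity_from_correspondences(P, Q)
```

Because scale falls out of the closed form, no true-scale prior is needed, which is exactly the advantage the paper claims over prior 3DGS alignment methods.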
[377] LRC-WeatherNet: LiDAR, RADAR, and Camera Fusion Network for Real-time Weather-type Classification in Autonomous Driving
Nour Alhuda Albashir, Lars Pernickel, Danial Hamoud, Idriss Gouigah, Eren Erdal Aksoy
Main category: cs.CV
TL;DR: LRC-WeatherNet is a multi-sensor fusion framework combining LiDAR, RADAR, and camera data for real-time weather classification in autonomous vehicles, achieving superior performance on diverse weather conditions.
Details
Motivation: Autonomous vehicles struggle with perception in adverse weather (rain, fog, snow) due to sensor degradation. Each sensor (LiDAR, RADAR, camera) has unique strengths but also limitations in poor conditions, creating a need for robust multi-sensor fusion for weather classification.
Method: Proposes LRC-WeatherNet with early fusion using unified Bird’s Eye View representation and mid-level gated fusion of modality-specific feature maps. The framework adapts to varying sensor reliability under changing weather conditions.
Result: Evaluated on MSU-4S dataset covering nine weather types, LRC-WeatherNet achieves superior classification performance and computational efficiency, significantly outperforming unimodal baselines in adverse conditions.
Conclusion: This is the first work to combine all three modalities (LiDAR, RADAR, camera) for robust, real-time weather classification in autonomous driving, with models and code released publicly.
Abstract: Autonomous vehicles face major perception and navigation challenges in adverse weather such as rain, fog, and snow, which degrade the performance of LiDAR, RADAR, and RGB camera sensors. While each sensor type offers unique strengths, such as RADAR robustness in poor visibility and LiDAR precision in clear conditions, they also suffer distinct limitations when exposed to environmental obstructions. This study proposes LRC-WeatherNet, a novel multi-sensor fusion framework that integrates LiDAR, RADAR, and camera data for real-time classification of weather conditions. By employing both early fusion using a unified Bird’s Eye View representation and mid-level gated fusion of modality-specific feature maps, our approach adapts to the varying reliability of each sensor under changing weather. Evaluated on the extensive MSU-4S dataset covering nine weather types, LRC-WeatherNet achieves superior classification performance and computational efficiency, significantly outperforming unimodal baselines in adverse conditions. This work is the first to combine all three modalities for robust, real-time weather classification in autonomous driving. We release our trained models and source code in https://github.com/nouralhudaalbashir/LRC-WeatherNet.
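The mid-level gated fusion can be sketched as a learned gate, conditioned on all three modalities, that weighs each modality's features before combining them, so an unreliable sensor is down-weighted. Shapes and the single-layer gate are simplifying assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(feats, W, b):
    """Mid-level gated fusion: one learned gate per modality, computed
    from the concatenated features (illustrative single-layer gate)."""
    stacked = np.concatenate([feats['lidar'], feats['radar'], feats['camera']])
    gates = sigmoid(W @ stacked + b)          # per-modality reliability weights
    fused = (gates[0] * feats['lidar'] +
             gates[1] * feats['radar'] +
             gates[2] * feats['camera'])
    return fused, gates

rng = np.random.default_rng(0)
d = 8
feats = {m: rng.normal(size=d) for m in ('lidar', 'radar', 'camera')}
W = rng.normal(size=(3, 3 * d)) * 0.1
b = np.zeros(3)
fused, gates = gated_fusion(feats, W, b)
```

In fog, for example, a trained gate can suppress the camera channel while leaning on RADAR, which is the adaptive-reliability behaviour the abstract describes.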
[378] MultiBind: A Benchmark for Attribute Misbinding in Multi-Subject Generation
Wenqing Tian, Hanyi Mao, Zhaocheng Liu, Lihua Zhang, Qiang Liu, Jian Wu, Liang Wang
Main category: cs.CV
TL;DR: MultiBind benchmark for evaluating cross-subject attribute misbinding in multi-reference image generation, with specialized metrics to diagnose binding failures.
Details
Motivation: Current multi-reference image generation systems suffer from cross-subject attribute misbinding where attributes get assigned to wrong subjects, but existing benchmarks focus on holistic fidelity rather than diagnosing these specific binding failures.
Method: Created MultiBind benchmark from real multi-person photographs with slot-ordered subject crops, masks, bounding boxes, canonicalized references, inpainted backgrounds, and entity-indexed prompts. Proposed dimension-wise confusion evaluation protocol that matches generated subjects to ground-truth slots and measures slot-to-slot similarity using specialists for face identity, appearance, pose, and expression.
Result: MultiBind reveals binding failures that conventional reconstruction metrics miss, exposing interpretable failure patterns like drift, swap, dominance, and blending in modern multi-reference generators.
Conclusion: MultiBind provides a specialized benchmark and evaluation protocol for diagnosing cross-subject attribute misbinding in multi-reference image generation, enabling better understanding and improvement of binding capabilities.
Abstract: Subject-driven image generation is increasingly expected to support fine-grained control over multiple entities within a single image. In multi-reference workflows, users may provide several subject images, a background reference, and long, entity-indexed prompts to control multiple people within one scene. In this setting, a key failure mode is cross-subject attribute misbinding: attributes are preserved, edited, or transferred to the wrong subject. Existing benchmarks and metrics largely emphasize holistic fidelity or per-subject self-similarity, making such failures hard to diagnose. We introduce MultiBind, a benchmark built from real multi-person photographs. Each instance provides slot-ordered subject crops with masks and bounding boxes, canonicalized subject references, an inpainted background reference, and a dense entity-indexed prompt derived from structured annotations. We also propose a dimension-wise confusion evaluation protocol that matches generated subjects to ground-truth slots and measures slot-to-slot similarity using specialists for face identity, appearance, pose, and expression. By subtracting the corresponding ground-truth similarity matrices, our method separates self-degradation from true cross-subject interference and exposes interpretable failure patterns such as drift, swap, dominance, and blending. Experiments on modern multi-reference generators show that MultiBind reveals binding failures that conventional reconstruction metrics miss.
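The confusion protocol's core arithmetic: compare generated-vs-ground-truth slot similarities against ground-truth self-similarities, so a diagonal drop signals self-degradation and off-diagonal mass signals cross-subject interference (swap, blending). The toy one-hot embeddings stand in for the paper's specialist encoders:

```python
import numpy as np

def binding_confusion(gen_emb, gt_emb):
    """Subtract the GT similarity matrix from the generated-vs-GT one;
    split the residual into self-degradation and interference terms."""
    gn = gen_emb / np.linalg.norm(gen_emb, axis=1, keepdims=True)
    gt = gt_emb / np.linalg.norm(gt_emb, axis=1, keepdims=True)
    delta = gn @ gt.T - gt @ gt.T
    self_deg = -float(np.mean(np.diag(delta)))               # diagonal drop
    interference = float(np.mean(np.abs(delta - np.diag(np.diag(delta)))))
    return self_deg, interference

gt = np.eye(3)                       # three orthogonal identity embeddings
swap = gt[[1, 0, 2]]                 # subjects 0 and 1 swapped in the output
self_deg, interference = binding_confusion(swap, gt)
```

A pure swap shows up as both diagonal loss and off-diagonal gain, which holistic reconstruction metrics average away; this separation is what makes misbinding diagnosable.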
[379] SegMaFormer: A Hybrid State-Space and Transformer Model for Efficient Segmentation
Duy D. Nguyen, Phat T. Tran-Truong
Main category: cs.CV
TL;DR: SegMaFormer: A lightweight hybrid architecture combining Mamba and Transformer modules for efficient 3D medical image segmentation with significantly reduced computational complexity.
Details
Motivation: Transformer models for 3D medical image segmentation have high computational complexity and parameter counts, which is prohibitive for volumetric data and limited annotated datasets. Need for efficient long-range dependency modeling.
Method: Hybrid architecture synergizing Mamba and Transformer modules in hierarchical volumetric encoder. Uses Mamba-based layers in early high-resolution stages to reduce computation, and self-attention in later low-resolution stages to refine features. Augmented with generalized rotary position embeddings.
Result: Achieves competitive performance on Synapse, BraTS, and ACDC benchmarks, matching Dice coefficient of larger models. Reduces parameters by up to 75x and substantially decreases FLOPs compared to state-of-the-art models.
Conclusion: SegMaFormer establishes an efficient and high-performing solution for 3D medical image segmentation by balancing computational efficiency with modeling capability through strategic Mamba-Transformer integration.
Abstract: The advent of Transformer and Mamba-based architectures has significantly advanced 3D medical image segmentation by enabling global contextual modeling, a capability traditionally limited in Convolutional Neural Networks (CNNs). However, state-of-the-art Transformer models often entail substantial computational complexity and parameter counts, which is particularly prohibitive for volumetric data and further exacerbated by the limited availability of annotated medical imaging datasets. To address these limitations, this work introduces SegMaFormer, a lightweight hybrid architecture that synergizes Mamba and Transformer modules within a hierarchical volumetric encoder for efficient long-range dependency modeling. The model strategically employs Mamba-based layers in early, high-resolution stages to reduce computational overhead while capturing essential spatial context, and reserves self-attention mechanisms for later, lower-resolution stages to refine feature representation. This design is augmented with generalized rotary position embeddings to enhance spatial awareness. Despite its compact structure, SegMaFormer achieves competitive performance on three public benchmarks (Synapse, BraTS, and ACDC), matching the Dice coefficient of significantly larger models. Empirically, our approach reduces parameters by up to 75x and substantially decreases FLOPs compared to current state-of-the-art models, establishing an efficient and high-performing solution for 3D medical image segmentation.
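The generalized rotary position embedding can be illustrated for volumetric tokens by splitting channel pairs across the three spatial axes and rotating each pair by an axis-specific angle; a common 3D extension of RoPE, which may differ in detail from the paper's formulation:

```python
import numpy as np

def rope_3d(x, pos, base=10000.0):
    """Rotary position embedding for 3D tokens: channel pairs are
    partitioned across (z, y, x) and rotated by per-axis angles.
    x: (n, d) features with d divisible by 6; pos: (n, 3) coordinates."""
    n, d = x.shape
    pairs_per_axis = d // 6
    out = x.copy()
    for axis in range(3):
        for k in range(pairs_per_axis):
            i = 2 * (axis * pairs_per_axis + k)
            theta = pos[:, axis] / base ** (k / pairs_per_axis)
            c, s = np.cos(theta), np.sin(theta)
            xi, xj = x[:, i].copy(), x[:, i + 1].copy()
            out[:, i] = c * xi - s * xj
            out[:, i + 1] = s * xi + c * xj
    return out

x = np.ones((2, 12))
pos = np.array([[0, 0, 0], [1, 2, 3]])
y = rope_3d(x, pos)
```

Since each pair undergoes a pure rotation, feature norms are preserved while relative voxel offsets become recoverable from inner products, giving the spatial awareness the abstract refers to.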
[380] GeoFlow: Real-Time Fine-Grained Cross-View Geolocalization via Iterative Flow Prediction
Ayesh Abu Lehyeh, Xiaohan Zhang, Ahmad Arrabi, Waqas Sultani, Chen Chen, Safwan Wshah
Main category: cs.CV
TL;DR: GeoFlow introduces a lightweight framework for fine-grained cross-view geolocalization that breaks the accuracy-speed trade-off using probabilistic mapping and iterative refinement sampling.
Details
Motivation: Current fine-grained cross-view geolocalization methods force a difficult trade-off between accuracy and speed, with high-accuracy models being too slow for real-time autonomous navigation applications in GPS-denied areas.
Method: GeoFlow learns a direct probabilistic mapping to predict displacement (distance and direction) to correct location hypotheses, combined with Iterative Refinement Sampling (IRS) that refines a population of hypotheses iteratively to converge to a robust consensus.
Result: Experiments on KITTI and VIGOR datasets show GeoFlow achieves state-of-the-art efficiency with real-time speeds of 29 FPS while maintaining competitive localization accuracy.
Conclusion: GeoFlow offers a new approach for practical real-time geolocalization systems with flexible inference-time scaling that allows direct trade-off between performance and computation without retraining.
Abstract: Accurate and fast localization is vital for safe autonomous navigation in GPS-denied areas. Fine-Grained Cross-View Geolocalization (FG-CVG) aims to estimate the precise 2-Degree-of-Freedom (2-DoF) location of a ground image relative to a satellite image. However, current methods force a difficult trade-off, with high-accuracy models being slow for real-time use. In this paper, we introduce GeoFlow, a new approach that offers a lightweight and highly efficient framework that breaks this accuracy-speed trade-off. Our technique learns a direct probabilistic mapping, predicting the displacement (in distance and direction) required to correct any given location hypothesis. This is complemented by our novel inference algorithm, Iterative Refinement Sampling (IRS). Instead of trusting a single prediction, IRS refines a population of hypotheses, allowing them to iteratively ‘flow’ from random starting points to a robust, converged consensus. Despite its iterative nature, this approach offers flexible inference-time scaling, allowing a direct trade-off between performance and computation without any re-training. Experiments on the KITTI and VIGOR datasets show that GeoFlow achieves state-of-the-art efficiency, running at real-time speeds of 29 FPS while maintaining competitive localization accuracy. This work opens a new path for the development of practical real-time geolocalization systems.
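IRS reduces to a simple loop: initialize a population of hypotheses, let each flow along the network's predicted displacement, and take a robust consensus. The lambda below is a toy stand-in for the trained displacement predictor:

```python
import numpy as np

def iterative_refinement_sampling(predict_displacement, n_hyp=32,
                                  n_iter=20, extent=100.0, seed=0):
    """IRS sketch: hypotheses 'flow' along predicted displacements,
    then the population median gives a robust converged consensus."""
    rng = np.random.default_rng(seed)
    hyp = rng.uniform(-extent, extent, size=(n_hyp, 2))  # random 2-DoF starts
    for _ in range(n_iter):
        hyp = hyp + predict_displacement(hyp)
    return np.median(hyp, axis=0)

# toy predictor: moves 60% of the remaining way toward the true location
true_xy = np.array([12.0, -7.0])
est = iterative_refinement_sampling(lambda h: 0.6 * (true_xy - h))
```

The population size and iteration count are the inference-time knobs: raising either spends more computation for a tighter consensus, with no retraining, which is the flexible scaling the abstract highlights.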
[381] Uncertainty-guided Compositional Alignment with Part-to-Whole Semantic Representativeness in Hyperbolic Vision-Language Models
Hayeon Kim, Ji Ha Jang, Junghun James Kim, Se Young Chun
Main category: cs.CV
TL;DR: UNCHA enhances hyperbolic vision-language models by modeling part-to-whole semantic representativeness with hyperbolic uncertainty, improving compositional understanding of multi-object scenes.
Details
Motivation: Current hyperbolic VLMs capture hierarchical structures but fail to model that different parts have varying levels of semantic representativeness to the whole scene, limiting their ability to understand complex multi-object compositional scenarios.
Method: Proposes UNcertainty-guided Compositional Hyperbolic Alignment (UNCHA) that: 1) models part-to-whole semantic representativeness with hyperbolic uncertainty, 2) incorporates this representativeness into contrastive objectives with uncertainty-guided weights, and 3) calibrates uncertainty with an entailment loss regularized by entropy-based term.
Result: Achieves state-of-the-art performance on zero-shot classification, retrieval, and multi-label classification benchmarks, learning hyperbolic embeddings with more accurate part-whole ordering and better compositional structure understanding.
Conclusion: UNCHA successfully enhances hyperbolic VLMs by modeling semantic representativeness through uncertainty, leading to improved understanding of complex multi-object scenes and better hierarchical structure preservation.
Abstract: While Vision-Language Models (VLMs) have achieved remarkable performance, their Euclidean embeddings remain limited in capturing hierarchical relationships such as part-to-whole or parent-child structures, and often face challenges in multi-object compositional scenarios. Hyperbolic VLMs mitigate this issue by better preserving hierarchical structures and modeling part-whole relations (i.e., whole scene and its part images) through entailment. However, existing approaches do not model that each part has a different level of semantic representativeness to the whole. We propose UNcertainty-guided Compositional Hyperbolic Alignment (UNCHA) for enhancing hyperbolic VLMs. UNCHA models part-to-whole semantic representativeness with hyperbolic uncertainty, by assigning lower uncertainty to more representative parts and higher uncertainty to less representative ones for the whole scene. This representativeness is then incorporated into the contrastive objective with uncertainty-guided weights. Finally, the uncertainty is further calibrated with an entailment loss regularized by entropy-based term. With the proposed losses, UNCHA learns hyperbolic embeddings with more accurate part-whole ordering, capturing the underlying compositional structure in an image and improving its understanding of complex multi-object scenes. UNCHA achieves state-of-the-art performance on zero-shot classification, retrieval, and multi-label classification benchmarks. Our code and models are available at: https://github.com/jeeit17/UNCHA.git.
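A loose illustration of the uncertainty-guided weighting (step 2): each part-to-whole alignment term is weighted by a confidence derived from that part's uncertainty, so representative parts dominate the objective. The exponential weighting and Euclidean similarity here are simplifying assumptions; UNCHA works in hyperbolic space and derives uncertainty from the embedding itself:

```python
import numpy as np

def uncertainty_weighted_alignment(part_emb, whole_emb, uncertainty, temp=0.1):
    """Weight each part-to-whole alignment term by exp(-uncertainty)
    (toy form of UNCHA's uncertainty-guided contrastive weights)."""
    p = part_emb / np.linalg.norm(part_emb, axis=1, keepdims=True)
    w = whole_emb / np.linalg.norm(whole_emb)
    sims = p @ w / temp
    weights = np.exp(-np.asarray(uncertainty, dtype=float))
    weights = weights / weights.sum()
    return float(-np.sum(weights * sims))   # pull confident parts to the whole

rng = np.random.default_rng(0)
whole = rng.normal(size=16)
parts = np.vstack([whole + 0.1 * rng.normal(size=16),   # representative part
                   rng.normal(size=16)])                # background clutter
loss_good = uncertainty_weighted_alignment(parts, whole, [0.1, 2.0])
loss_bad = uncertainty_weighted_alignment(parts, whole, [2.0, 0.1])
```

Assigning low uncertainty to the representative part yields a lower loss than the reversed assignment, which is the calibration pressure the entailment/entropy terms then refine.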
[382] Group3D: MLLM-Driven Semantic Grouping for Open-Vocabulary 3D Object Detection
Youbin Kim, Jinho Park, Hogun Park, Eunbyung Park
Main category: cs.CV
TL;DR: Group3D integrates semantic constraints into multi-view 3D object detection using MLLM-derived semantic compatibility groups to prevent geometry-only merging errors.
Details
Motivation: Current open-vocabulary 3D detection methods decouple geometry-based instance construction from semantic labeling, leading to errors when geometric evidence is incomplete or view-dependent, causing over-merging or fragmentation.
Method: Group3D uses a multimodal large language model to create scene-adaptive vocabulary organized into semantic compatibility groups. These groups act as merge-time constraints, allowing 3D fragment association only when both semantic compatibility and geometric consistency are satisfied.
Result: Group3D achieves state-of-the-art performance on ScanNet and ARKitScenes for multi-view open-vocabulary 3D detection, with strong generalization in zero-shot scenarios.
Conclusion: Integrating semantic constraints directly into instance construction via MLLM-derived compatibility groups effectively mitigates geometry-driven errors and improves open-vocabulary 3D detection performance.
Abstract: Open-vocabulary 3D object detection aims to localize and recognize objects beyond a fixed training taxonomy. In multi-view RGB settings, recent approaches often decouple geometry-based instance construction from semantic labeling, generating class-agnostic fragments and assigning open-vocabulary categories post hoc. While flexible, such decoupling leaves instance construction governed primarily by geometric consistency, without semantic constraints during merging. When geometric evidence is view-dependent and incomplete, this geometry-only merging can lead to irreversible association errors, including over-merging of distinct objects or fragmentation of a single instance. We propose Group3D, a multi-view open-vocabulary 3D detection framework that integrates semantic constraints directly into the instance construction process. Group3D maintains a scene-adaptive vocabulary derived from a multimodal large language model (MLLM) and organizes it into semantic compatibility groups that encode plausible cross-view category equivalence. These groups act as merge-time constraints: 3D fragments are associated only when they satisfy both semantic compatibility and geometric consistency. This semantically gated merging mitigates geometry-driven over-merging while absorbing multi-view category variability. Group3D supports both pose-known and pose-free settings, relying only on RGB observations. Experiments on ScanNet and ARKitScenes demonstrate that Group3D achieves state-of-the-art performance in multi-view open-vocabulary 3D detection, while exhibiting strong generalization in zero-shot scenarios. The project page is available at https://ubin108.github.io/Group3D/.
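As a rough illustration of the semantically gated merging described above (not the authors' code; the fragment format, group structure, and IoU threshold are assumptions), a pair of fragments is associated only when both the semantic and geometric gates pass:

```python
def iou3d(a, b):
    """Axis-aligned 3D IoU for boxes (xmin, ymin, zmin, xmax, ymax, zmax)."""
    inter = 1.0
    for i in range(3):
        lo, hi = max(a[i], b[i]), min(a[i + 3], b[i + 3])
        if hi <= lo:
            return 0.0
        inter *= hi - lo
    vol = lambda c: (c[3] - c[0]) * (c[4] - c[1]) * (c[5] - c[2])
    return inter / (vol(a) + vol(b) - inter)

def gated_merge(frag_a, frag_b, groups, iou_thresh=0.25):
    """Associate two cross-view 3D fragments only when BOTH gates pass:
    (1) semantic gate: the two labels fall in one compatibility group,
    (2) geometric gate: their boxes overlap above the IoU threshold.
    Returns the merged (union) box, or None if the pair is rejected.
    """
    pair = {frag_a["label"], frag_b["label"]}
    if not any(pair <= g for g in groups):
        return None                                    # semantic gate fails
    a, b = frag_a["box"], frag_b["box"]
    if iou3d(a, b) < iou_thresh:
        return None                                    # geometric gate fails
    return tuple(min(a[i], b[i]) for i in range(3)) + \
           tuple(max(a[i + 3], b[i + 3]) for i in range(3))
```

The semantic gate absorbs cross-view category variability (e.g. "chair" vs. "armchair" from different views) while still blocking merges of clearly distinct objects.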
[383] Mamba-VMR: Multimodal Query Augmentation via Generated Videos for Precise Temporal Grounding
Yunzhuo Sun, Xinyue Liu, Yanyang Li, Nanding Wu, Yifang Xu, Linlin Zong, Xianchao Zhang, Wenxin Liang
Main category: cs.CV
TL;DR: A two-stage framework for text-driven video moment retrieval that uses LLM-guided subtitle matching and text-to-video generation to create temporal priors, then processes them through a multimodal controlled Mamba network for efficient long-sequence grounding.
Details
Motivation: Traditional VMR methods struggle with capturing hidden temporal dynamics in untrimmed videos, leading to imprecise grounding in long sequences. Existing approaches overlook motion sequences, suffer from high computational costs in Transformer architectures, and fail to effectively integrate subtitle contexts with generated temporal priors.
Method: Two-stage framework: 1) LLM-guided subtitle matching identifies relevant textual cues from video subtitles, fused with queries to generate auxiliary short videos via text-to-video models, creating temporal priors. 2) Augmented queries processed through a multimodal controlled Mamba network with video-guided gating for efficient fusion of generated priors and long sequences while filtering noise.
Result: Experimental evaluations on TVR benchmark show significant improvements over state-of-the-art methods, including reduced computational overhead and higher recall in long-sequence grounding.
Conclusion: The proposed framework effectively addresses limitations in traditional VMR by integrating subtitle contexts with generated temporal priors through an efficient multimodal architecture, achieving better performance in long-sequence video grounding with lower computational costs.
Abstract: Text-driven video moment retrieval (VMR) remains challenging due to limited capture of hidden temporal dynamics in untrimmed videos, leading to imprecise grounding in long sequences. Traditional methods rely on natural language queries (NLQs) or static image augmentations, overlooking motion sequences and suffering from high computational costs in Transformer-based architectures. Because existing approaches fail to integrate subtitle contexts and generated temporal priors effectively, we propose a novel two-stage framework for enhanced temporal grounding. In the first stage, LLM-guided subtitle matching identifies relevant textual cues from video subtitles, fused with the query to generate auxiliary short videos via text-to-video models, capturing implicit motion information as temporal priors. In the second stage, augmented queries are processed through a multi-modal controlled Mamba network, extending text-controlled selection with video-guided gating for efficient fusion of generated priors and long sequences while filtering noise. Our framework is agnostic to base retrieval models and widely applicable for multimodal VMR. Experimental evaluations on the TVR benchmark demonstrate significant improvements over state-of-the-art methods, including reduced computational overhead and higher recall in long-sequence grounding.
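The video-guided gating in stage two can be sketched as follows (an illustrative gated-residual form, not the paper's exact Mamba-internal formulation): features from the generated prior video drive a sigmoid gate that modulates the long-sequence features channel-wise.

```python
import numpy as np

def video_guided_gating(seq_feat, prior_feat):
    """Fuse a generated temporal prior into long-sequence features.
    seq_feat, prior_feat: (T, D) aligned feature matrices.
    The prior produces a per-channel gate in (0, 1) that scales the
    residual contribution, damping channels the prior marks as noisy.
    """
    gate = 1.0 / (1.0 + np.exp(-prior_feat))   # sigmoid gate from the prior
    return seq_feat + gate * seq_feat          # gated residual fusion
```

A neutral (zero) prior yields a gate of 0.5 everywhere, i.e. a uniform residual boost; informative priors shift the gate per channel.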
[384] Unified Spatiotemporal Token Compression for Video-LLMs at Ultra-Low Retention
Junhao Du, Jialong Xue, Anqi Li, Jincheng Dai, Guo Lu
Main category: cs.CV
TL;DR: Unified spatiotemporal token compression method for Video-LLMs that globally selects and merges visual tokens to reduce computational costs while preserving performance.
Details
Motivation: Video-LLMs face high computational costs due to large volumes of visual tokens. Existing compression methods use two-stage spatiotemporal strategies with stage-specific metrics and implicit spatiotemporal separability assumptions, which under extremely low retention ratios lead to unbalanced allocation and loss of visual evidence essential for question answering.
Method: Reformulates token compression as a spatiotemporal allocation task within a global token retention pool. Uses unified selection mechanism integrating attention weights and semantic similarity to globally select tokens with high contribution and low redundancy. Unselected tokens are merged via clustering and refilled. Inside LLM, introduces text-aware merging for secondary compression based on query relevance. Plug-and-play module requiring no retraining.
Result: Retaining only about 2% of visual tokens preserves 90.1% of baseline performance across multiple benchmarks, reducing FLOPs to roughly 2.6%. Benefits generalize across diverse backbones, decreasing end-to-end inference latency and memory consumption.
Conclusion: The unified spatiotemporal token compression strategy establishes state-of-the-art in video understanding under ultra-low token retention, offering efficient plug-and-play solution for existing Video-LLMs.
Abstract: Video large language models (Video-LLMs) face high computational costs due to large volumes of visual tokens. Existing token compression methods typically adopt a two-stage spatiotemporal compression strategy, relying on stage-specific metrics and an implicit assumption of spatiotemporal separability. Under extremely low retention ratios, however, such approaches often result in unbalanced allocation and loss of visual evidence essential for question answering. We reformulate token compression as a spatiotemporal allocation task within a global token retention pool. We propose a unified selection mechanism that integrates attention weights and semantic similarity to globally select tokens with high contribution and low redundancy. Unselected tokens are merged via clustering and refilled, preserving information integrity. Inside the LLM, we further introduce text-aware merging to perform secondary compression based on query relevance. Without requiring retraining, our method serves as a plug-and-play module compatible with existing Video-LLMs. Experiments show that retaining only about 2% of visual tokens preserves 90.1% of baseline performance across multiple benchmarks, while reducing FLOPs to roughly 2.6%. These benefits generalize across diverse backbones, decreasing end-to-end inference latency and memory consumption. Our unified spatiotemporal token compression strategy establishes the state-of-the-art in video understanding under ultra-low token retention.
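The unified select-then-merge step can be sketched in NumPy (a simplified stand-in, not the paper's implementation; the redundancy weight 0.5 and the nearest-seed clustering are illustrative choices): tokens are scored globally by attention contribution minus redundancy, the top-k survive, and the rest are cluster-merged and kept as refill tokens.

```python
import numpy as np

def compress_tokens(tokens, attn, keep_ratio=0.02, n_merge_clusters=8):
    """Global spatiotemporal token compression sketch.
    tokens: (N, D) visual token features; attn: (N,) attention weights.
    Returns (kept_tokens, merged_tokens)."""
    N, D = tokens.shape
    k = max(1, int(N * keep_ratio))                   # global retention budget
    unit = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    sim = unit @ unit.T                               # pairwise cosine similarity
    np.fill_diagonal(sim, 0.0)
    redundancy = sim.mean(axis=1)                     # high = similar to many others
    score = attn - 0.5 * redundancy                   # contribution minus redundancy
    keep = np.argsort(score)[-k:]                     # global top-k, no per-frame quota

    rest = np.setdiff1d(np.arange(N), keep)
    merged = []
    if rest.size:
        # crude clustering of discarded tokens: assign each to the nearest
        # of a few evenly spaced seeds, then average within each cluster
        n_seeds = min(n_merge_clusters, rest.size)
        seeds = rest[np.linspace(0, rest.size - 1, n_seeds).astype(int)]
        assign = np.argmax(unit[rest] @ unit[seeds].T, axis=1)
        for c in range(n_seeds):
            members = rest[assign == c]
            if members.size:
                merged.append(tokens[members].mean(axis=0))
    merged = np.stack(merged) if merged else np.empty((0, D))
    return tokens[keep], merged
```

Because selection happens in one global pool rather than per frame, the budget can shift toward the frames that actually carry the answer-relevant evidence.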
[385] GeoFusion-CAD: Structure-Aware Diffusion with Geometric State Space for Parametric 3D Design
Xiaolei Zhou, Chuangjie Fang, Jie Wu, Jingyi Yang, Boyi Lin, Jianwei Zheng
Main category: cs.CV
TL;DR: GeoFusion-CAD: A diffusion framework using hierarchical tree encoding and C-Mamba blocks for scalable parametric CAD sequence generation, with new DeepCAD-240 benchmark for long sequences.
Details
Motivation: Existing Transformer-based CAD sequence generation methods struggle with long command sequences due to quadratic attention costs and limited context windows, hindering scalability for complex geometric and topological dependencies.
Method: Proposes GeoFusion-CAD: encodes CAD programs as hierarchical trees capturing geometry and topology, uses state-space diffusion with lightweight C-Mamba blocks for long-range dependency modeling through selective state transitions.
Result: Achieves superior performance on both short and long command ranges, maintains high geometric fidelity and topological consistency where Transformer models degrade, sets new SOTA for long-sequence parametric CAD generation.
Conclusion: GeoFusion-CAD establishes a scalable foundation for next-generation CAD modeling systems, with code and datasets available.
Abstract: Parametric Computer-Aided Design (CAD) is fundamental to modern 3D modeling, yet existing methods struggle to generate long command sequences, especially under complex geometric and topological dependencies. Transformer-based architectures dominate CAD sequence generation due to their strong dependency modeling, but their quadratic attention cost and limited context windows hinder scalability to long programs. We propose GeoFusion-CAD, an end-to-end diffusion framework for scalable and structure-aware generation. Our proposal encodes CAD programs as hierarchical trees, jointly capturing geometry and topology within a state-space diffusion process. Specifically, a lightweight C-Mamba block models long-range structural dependencies through selective state transitions, enabling coherent generation across extended command sequences. To support long-sequence evaluation, we introduce DeepCAD-240, an extended benchmark that extends sequence lengths from 40 to 240 while preserving sketch-extrusion semantics from the ABC dataset. Extensive experiments demonstrate that GeoFusion-CAD achieves superior performance on both short and long command ranges, maintaining high geometric fidelity and topological consistency where Transformer-based models degrade. Our approach sets new state-of-the-art scores for long-sequence parametric CAD generation, establishing a scalable foundation for next-generation CAD modeling systems. Code and datasets are available at GitHub.
[386] Beyond Matching to Tiles: Bridging Unaligned Aerial and Satellite Views for Vision-Only UAV Navigation
Kejia Liu, Haoyang Zhou, Ruoyu Xu, Peicheng Wang, Mingli Song, Haofei Zhang
Main category: cs.CV
TL;DR: Bearing-UAV: A vision-based cross-view geo-localization method that jointly predicts UAV absolute location and heading using neighboring features, addressing limitations of existing matching-based approaches.
Details
Motivation: Existing cross-view geo-localization methods for UAV navigation focus on matching UAV views to map tiles, creating accuracy-storage trade-offs and ignoring heading information. They also insufficiently address cross-view discrepancies and varying overlaps, limiting real-world generalization.
Method: Proposes Bearing-UAV, a purely vision-driven method that leverages global and local structural features and explicitly encodes relative spatial relationships. It jointly predicts UAV absolute location and heading from neighboring features rather than using a matching/retrieval paradigm.
Result: Extensive experiments show Bearing-UAV yields lower localization error than previous matching/retrieval methods across diverse terrains. The method also introduces Bearing-UAV-90k, a multi-city benchmark for cross-view localization and navigation evaluation.
Conclusion: Bearing-UAV enables accurate, lightweight, and robust vision-based UAV navigation in GNSS-denied environments by jointly predicting location and heading, addressing limitations of existing cross-view geo-localization approaches.
Abstract: Recent advances in cross-view geo-localization (CVGL) methods have shown strong potential for supporting unmanned aerial vehicle (UAV) navigation in GNSS-denied environments. However, existing work predominantly focuses on matching UAV views to onboard map tiles, which introduces an inherent trade-off between accuracy and storage overhead, and overlooks the importance of the UAV’s heading during navigation. Moreover, the substantial discrepancies and varying overlaps in cross-view scenarios have been insufficiently considered, limiting their generalization to real-world scenarios. In this paper, we present Bearing-UAV, a purely vision-driven cross-view navigation method that jointly predicts UAV absolute location and heading from neighboring features, enabling accurate, lightweight, and robust navigation in the wild. Our method leverages global and local structural features and explicitly encodes relative spatial relationships, making it robust to cross-view variations, misalignment, and feature-sparse conditions. We also present Bearing-UAV-90k, a multi-city benchmark for evaluating cross-view localization and navigation. Extensive experiments show that Bearing-UAV yields lower localization error than previous matching/retrieval paradigms across diverse terrains. Our code and dataset will be made publicly available.
[387] WiFi-GEN: High-Resolution Indoor Imaging from WiFi Signals Using Generative AI
Jianyang Shi, Bowen Zhang, Amartansh Dubey, Ross Murch, Liwen Jing
Main category: cs.CV
TL;DR: WiFi-GEN: A generative AI network that converts WiFi power measurements into high-resolution indoor images, achieving 275% better shape reconstruction accuracy than physical model-based methods.
Details
Motivation: WiFi signals are omnipresent and can enable passive indoor imaging for robotics and IoT applications. Current physical model-based methods face challenges with nonlinearity, ill-posedness, and uncertainty in WiFi-based imaging.
Method: Proposes WiFi-GEN network that treats WiFi indoor imaging as a multi-modal image generation task. The generative AI network absorbs challenges into massive parameters and is designed to fit measured WiFi signals to desired imaging output. A large-scale dataset of 80,000 WiFi signal-image pairs is created.
Result: Achieves shape reconstruction accuracy 275% of physical model-based methods, reduces Frechet Inception Distance by 82%, and releases the first large-scale WiFi imaging dataset.
Conclusion: WiFi-GEN demonstrates superior performance for WiFi-based indoor imaging by framing it as a multi-modal image generation task, enabling high-resolution reconstruction from WiFi signals.
Abstract: Indoor imaging is a critical task for robotics and the internet-of-things. WiFi, as an omnipresent signal, is a promising candidate for carrying out passive imaging and synchronizing up-to-date information to all connected devices. This is the first research work to consider WiFi indoor imaging as a multi-modal image generation task that converts the measured WiFi power into a high-resolution indoor image. Our proposed WiFi-GEN network achieves a shape reconstruction accuracy that is 275% of that achieved by physical model-based inversion methods. Additionally, the Frechet Inception Distance score has been significantly reduced by 82%. To examine the effectiveness of models for this task, the first large-scale dataset is released containing 80,000 pairs of WiFi signal and imaging target. Our model absorbs the challenges of model-based methods, including nonlinearity, ill-posedness, and uncertainty, into the massive parameters of our generative AI network. The network is also designed to best fit measured WiFi signals and the desired imaging output. Code: https://github.com/CNFightingSjy/WiFiGEN
[388] Speed by Simplicity: A Single-Stream Architecture for Fast Audio-Video Generative Foundation Model
SII-GAIR, Sand.ai: Ethan Chern, Hansi Teng, Hanwen Sun, Hao Wang, Hong Pan, Hongyu Jia, Jiadi Su, Jin Li, Junjie Yu, Lijie Liu, Lingzhi Li, Lyumanshan Ye, Min Hu, Qiangang Wang, Quanwei Qi, Steffi Chern, Tao Bu, Taoran Wang, Teren Xu, Tianning Zhang, Tiantian Mi, Weixian Xu, Wenqiang Zhang, Wentai Zhang, Xianping Yi, Xiaojie Cai, Xiaoyang Kang, Yan Ma, Yixiu Liu, Yunbo Zhang, Yunpeng Huang, Yutong Lin, Zewei Tao, Zhaoliang Liu, Zheng Zhang, Zhiyao Cen, Zhixuan Yu, Zhongshu Wang, Zhulin Hu, Zijin Zhou, Zinan Guo, Yue Cao, Pengfei Liu
Main category: cs.CV
TL;DR: daVinci-MagiHuman is an open-source audio-video generative foundation model that jointly generates synchronized video and audio using a single-stream Transformer architecture, excelling in human-centric scenarios with multilingual support and efficient inference.
Details
Motivation: To create a unified audio-video generative model that avoids complex multi-stream or cross-attention architectures while maintaining strong performance in human-centric generation tasks, with efficient inference capabilities.
Method: Uses a single-stream Transformer that processes text, video, and audio within a unified token sequence via self-attention only. Combines this with model distillation, latent-space super-resolution, and a Turbo VAE decoder for efficient inference.
Result: Achieves highest visual quality and text alignment among leading open models, with lowest word error rate (14.60%) for speech intelligibility. Wins 80.0% against Ovi 1.1 and 60.9% against LTX 2.3 in human evaluations. Generates 5-second 256p video in 2 seconds on single H100 GPU.
Conclusion: daVinci-MagiHuman demonstrates that a simple single-stream Transformer architecture can effectively generate synchronized audio-video content with strong human-centric performance, multilingual capabilities, and efficient inference.
Abstract: We present daVinci-MagiHuman, an open-source audio-video generative foundation model for human-centric generation. daVinci-MagiHuman jointly generates synchronized video and audio using a single-stream Transformer that processes text, video, and audio within a unified token sequence via self-attention only. This single-stream design avoids the complexity of multi-stream or cross-attention architectures while remaining easy to optimize with standard training and inference infrastructure. The model is particularly strong in human-centric scenarios, producing expressive facial performance, natural speech-expression coordination, realistic body motion, and precise audio-video synchronization. It supports multilingual spoken generation across Chinese (Mandarin and Cantonese), English, Japanese, Korean, German, and French. For efficient inference, we combine the single-stream backbone with model distillation, latent-space super-resolution, and a Turbo VAE decoder, enabling generation of a 5-second 256p video in 2 seconds on a single H100 GPU. In automatic evaluation, daVinci-MagiHuman achieves the highest visual quality and text alignment among leading open models, along with the lowest word error rate (14.60%) for speech intelligibility. In pairwise human evaluation, it achieves win rates of 80.0% against Ovi 1.1 and 60.9% against LTX 2.3 over 2000 comparisons. We open-source the complete model stack, including the base model, the distilled model, the super-resolution model, and the inference codebase.
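The single-stream layout can be sketched as follows (an illustrative stand-in, not the model's actual tokenizer or embeddings): text, video, and audio tokens are concatenated into one sequence, each tagged with a modality-type embedding, so an ordinary self-attention stack handles all cross-modal interaction without separate streams or cross-attention.

```python
import numpy as np

def build_unified_sequence(text_tok, video_tok, audio_tok):
    """Concatenate per-modality token matrices (N_m, D) into one
    single-stream sequence, adding a per-modality type vector so the
    model can tell modalities apart. np.eye rows stand in for learned
    type embeddings (an illustrative simplification)."""
    D = text_tok.shape[1]
    type_emb = np.eye(3, D)                       # one type vector per modality
    parts = [text_tok + type_emb[0],
             video_tok + type_emb[1],
             audio_tok + type_emb[2]]
    return np.concatenate(parts, axis=0)          # (N_text+N_video+N_audio, D)
```

Everything downstream is then a standard Transformer over one sequence, which is what keeps the architecture easy to train and serve.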
[389] Seeing is Improving: Visual Feedback for Iterative Text Layout Refinement
Junrong Guo, Shancheng Fang, Yadong Qu, Hongtao Xie
Main category: cs.CV
TL;DR: VFLM introduces a self-improving layout generation framework that uses visual feedback and reinforcement learning to iteratively refine layouts based on OCR accuracy and visual quality, outperforming existing methods.
Details
Motivation: Existing layout generation methods follow a code-only paradigm that generates code to represent layouts but are blind to the rendered visual outcome, making it difficult to guarantee readability and aesthetics. The authors identify visual feedback as a critical missing factor in layout generation.
Method: Proposes Visual Feedback Layout Model (VFLM), a self-improving framework that leverages visual feedback for iterative refinement. Uses reinforcement learning with a visually grounded reward model that incorporates OCR accuracy. The model performs adaptive reflective generation by leveraging visual information to reflect on previous issues and iteratively generates outputs until satisfactory quality is achieved.
Result: Experiments across multiple benchmarks show that VFLM consistently outperforms advanced MLLMs, existing layout models, and code-only baselines. The framework establishes visual feedback as critical for design-oriented MLLMs.
Conclusion: Visual feedback is a critical factor in layout generation, and VFLM’s self-improving framework with iterative refinement based on visual feedback significantly improves layout quality, readability, and aesthetics compared to existing methods.
Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have enabled automated generation of structured layouts from natural language descriptions. Existing methods typically follow a code-only paradigm that generates code to represent layouts, which are then rendered by graphic engines to produce final images. However, they are blind to the rendered visual outcome, making it difficult to guarantee readability and aesthetics. In this paper, we identify visual feedback as a critical factor in layout generation and propose Visual Feedback Layout Model (VFLM), a self-improving framework that leverages visual feedback for iterative refinement. VFLM is capable of performing adaptive reflective generation, which leverages visual information to reflect on previous issues and iteratively generates outputs until satisfactory quality is achieved. It is achieved through reinforcement learning with a visually grounded reward model that incorporates OCR accuracy. By rewarding only the final generated outcome, we can effectively stimulate the model’s iterative and reflective generative capabilities. Experiments across multiple benchmarks show that VFLM consistently outperforms advanced MLLMs, existing layout models, and code-only baselines, establishing visual feedback as critical for design-oriented MLLMs. Our code and data are available at https://github.com/FolSpark/VFLM.
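The generate-render-inspect loop at the heart of this approach can be sketched as follows. The `model`, `render`, and `ocr_score` callables, the round budget, and the 0.95 target are all assumed interfaces for illustration, not VFLM's actual API:

```python
def reflective_generate(model, render, ocr_score, prompt, max_rounds=3, target=0.95):
    """Iterative refinement with visual feedback: generate layout code,
    render it, score the rendered image (OCR accuracy as a readability
    proxy), and feed the score back into the next attempt until the
    target quality is reached or the round budget runs out."""
    feedback, best = "", None
    for _ in range(max_rounds):
        layout = model(prompt + feedback)          # generate layout code
        image = render(layout)                     # graphics engine renders it
        score = ocr_score(image)                   # visually grounded reward
        if best is None or score > best[0]:
            best = (score, layout)
        if score >= target:
            break                                  # good enough: stop early
        feedback = (f"\nPrevious attempt scored {score:.2f} OCR accuracy; "
                    "fix unreadable text.")
    return best[1]                                 # best layout seen
```

The key contrast with code-only generation is that the model's next attempt is conditioned on what its previous output actually looked like once rendered.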
[390] STENet: Superpixel Token Enhancing Network for RGB-D Salient Object Detection
Jianlin Chen, Gongyang Li, Zhijiang Zhang, Liang Chang, Dan Zeng
Main category: cs.CV
TL;DR: STENet introduces superpixel-based cross-modal interaction for RGB-D salient object detection, addressing transformer limitations with global and local enhancement modules.
Details
Motivation: Current RGB-D SOD methods using transformers face challenges with quadratic attention complexity and limited local detail extraction. The paper aims to overcome these limitations by introducing superpixels into cross-modal interaction.
Method: Proposes Superpixel Token Enhancing Network (STENet) with two superpixel-driven cross-modal interaction modules: 1) Superpixel Attention Global Enhancing Module for region-level information with reduced complexity, and 2) Superpixel Attention Local Refining Module for local detail enhancement. Uses updated superpixel generation with expanded neighborhood range.
Result: Experiments on seven RGB-D SOD datasets show STENet achieves competitive performance compared to state-of-the-art methods.
Conclusion: STENet effectively addresses transformer limitations in RGB-D SOD by introducing superpixel-based cross-modal interaction, achieving good performance with reduced computational complexity.
Abstract: Transformer-based methods for RGB-D Salient Object Detection (SOD) have gained significant interest, owing to the transformer’s exceptional capacity to capture long-range pixel dependencies. Nevertheless, current RGB-D SOD methods face challenges, such as the quadratic complexity of the attention mechanism and the limited local detail extraction. To overcome these limitations, we propose a novel Superpixel Token Enhancing Network (STENet), which introduces superpixels into cross-modal interaction. STENet follows the two-stream encoder-decoder structure. At its core are two tailored superpixel-driven cross-modal interaction modules, responsible for global and local feature enhancement. Specifically, we update the superpixel generation method by expanding the neighborhood range of each superpixel, allowing for flexible transformation between pixels and superpixels. With the updated superpixel generation method, we first propose the Superpixel Attention Global Enhancing Module to model the global pixel-to-superpixel relationship rather than the traditional global pixel-to-pixel relationship, which can capture region-level information and reduce computational complexity. We also propose the Superpixel Attention Local Refining Module, which leverages pixel similarity within superpixels to filter out a subset of pixels (i.e., local pixels) and then performs feature enhancement on these local pixels, thereby capturing concerned local details. Furthermore, we fuse the globally and locally enhanced features along with the cross-scale features to achieve comprehensive feature representation. Experiments on seven RGB-D SOD datasets reveal that our STENet achieves competitive performance compared to state-of-the-art methods. The code and results of our method are available at https://github.com/Mark9010/STENet.
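The complexity saving from pixel-to-superpixel attention can be sketched in NumPy (a single-head, unlearned stand-in for the module, under the assumption of mean-pooled superpixel tokens): N pixels attend over S region tokens at O(N·S) cost instead of O(N²) for pixel-to-pixel attention.

```python
import numpy as np

def pixel_superpixel_attention(feat, labels, n_sp):
    """Global enhancement via pixel-to-superpixel attention.
    feat: (N, D) pixel features; labels: (N,) superpixel index per pixel.
    Pools features inside each superpixel into S region tokens, then lets
    every pixel attend over those S tokens (softmax over superpixels)."""
    N, D = feat.shape
    sp = np.zeros((n_sp, D))
    for s in range(n_sp):                          # mean-pool per superpixel
        mask = labels == s
        if mask.any():
            sp[s] = feat[mask].mean(axis=0)
    logits = feat @ sp.T / np.sqrt(D)              # (N, S) pixel-to-region scores
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    attn = np.exp(logits)
    attn /= attn.sum(axis=1, keepdims=True)        # softmax over superpixels
    return feat + attn @ sp                        # residual region-level boost
```

With S typically in the hundreds while N is in the tens of thousands, the attention matrix shrinks by orders of magnitude.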
[391] HPE-CogVLM: Advancing Vision Language Models with a Head Pose Grounding Task
Yu Tian, Tianqi Shao, Tsukasa Demizu, Xuyang Wu, Hsin-Tai Wu
Main category: cs.CV
TL;DR: A novel framework that integrates head pose estimation capability into Vision Language Model (CogVLM) using a specialized LoRA layer-based model merging method, achieving significant accuracy improvements over CNN-based approaches.
Details
Motivation: Traditional CNN-based head pose estimation models lack robustness in real-world scenarios and rely on cropped head images. Vision Language Models offer potential for analyzing entire images with attention mechanisms, but direct fine-tuning fails to achieve good accuracy while preserving object detection capabilities.
Method: Developed a novel LoRA layer-based model merging method with high cosine similarity threshold and ‘winner-takes-all’ layer selection strategy to integrate HPE capability into CogVLM while preserving original object detection knowledge.
Result: HPE-CogVLM achieves 31.5% reduction in Mean Absolute Error over state-of-the-art CNN model 6DRepNet in cross-dataset evaluation, and outperforms both directly LoRA fine-tuned and task arithmetic-based merged VLMs across all HPE metrics.
Conclusion: The proposed model merging approach successfully integrates head pose estimation into Vision Language Models while maintaining object detection capabilities, demonstrating superior performance over traditional CNN methods and other VLM adaptation techniques.
Abstract: Head pose estimation (HPE) requires a sophisticated understanding of 3D spatial relationships to generate precise yaw, pitch, and roll angles. Previous HPE models, primarily CNN-based, rely on cropped close-up human head images as inputs and often lack robustness in real-world scenarios. Vision Language Models (VLMs) can analyze entire images while focusing on specific objects through their attention mechanisms. In this paper, we propose a novel framework to improve the HPE accuracy by leveraging the object detection grounding capability of a VLM, referred to as CogVLM. We empirically find that directly LoRA fine-tuning of this VLM for the HPE task fails to achieve desirable HPE accuracy, while some model merging methods can improve accuracy but frequently produce blended invalid response formats, struggling to handle both object detection and HPE tasks simultaneously. To integrate HPE capability into CogVLM effectively, we develop a novel LoRA layer-based model merging method. This merging approach applies a high cosine similarity threshold and a ‘winner-takes-all’ layer selection strategy, aligning attention to the HPE task while preserving original object detection knowledge. It successfully resolves issues with blended invalid response formats and improves accuracy. Results show that our HPE-CogVLM achieves a 31.5% reduction in Mean Absolute Error over the current state-of-the-art CNN model, 6DRepNet, in cross-dataset evaluation. Furthermore, HPE-CogVLM outperforms both directly LoRA fine-tuned and task arithmetic-based merged VLMs across all HPE metrics.
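The winner-takes-all layer selection can be sketched as a per-layer cosine comparison (a simplified illustration; the thresholding direction, the 0.9 value, and the dict-of-arrays representation are assumptions, not the paper's exact procedure):

```python
import numpy as np

def merge_lora_layers(base_layers, hpe_layers, sim_thresh=0.9):
    """Layer-wise 'winner-takes-all' merge of two LoRA adapters: for each
    layer, take the HPE-task weights only when they are highly aligned
    (cosine similarity above the threshold) with the base-task weights;
    otherwise keep the base layer to preserve detection knowledge."""
    merged = {}
    for name, base in base_layers.items():
        hpe = hpe_layers[name]
        b, h = base.ravel(), hpe.ravel()
        cos = float(b @ h / (np.linalg.norm(b) * np.linalg.norm(h) + 1e-8))
        merged[name] = hpe if cos >= sim_thresh else base   # winner takes all
    return merged
```

Selecting whole layers, rather than averaging weights, is what avoids the blended invalid response formats the paper reports for naive merging.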
[392] 6D Robotic OCT Scanning of Curved Tissue Surfaces
Suresh Guttikonda, Maximilian Neidhardt, Vidas Raudonis, Alexander Schlaefer
Main category: cs.CV
TL;DR: A method for full 6D hand-eye calibration of robot-mounted OCT probes using markers, enabling consistent scanning of curved tissue surfaces without image registration errors.
Details
Motivation: Current OCT scanning methods for large tissue structures are limited when dealing with curved surfaces. Handheld scanning requires image overlap for stitching, while robotic scanning typically restricts motion to translations to avoid complex hand-eye calibration. These approaches fail to handle curved tissue surfaces effectively and can accumulate registration errors.
Method: Proposes a marker-based approach for full six-dimensional hand-eye calibration of robot-mounted OCT probes. The method enables accurate transformation estimation between the robot and OCT probe coordinate systems, allowing for consistent scanning of curved surfaces without relying on image registration.
Result: The calibration method produces highly repeatable transformation estimates. Robotic scanning experiments on two phantom surfaces demonstrate that the proposed calibration enables consistent scanning of large, curved tissue surfaces. The approach avoids error accumulation along scan paths and shows improvement over conventional 3D-translational robotic scanning.
Conclusion: The marker-based 6D hand-eye calibration enables effective robotic OCT scanning of curved tissue surfaces, overcoming limitations of existing methods that rely on image registration or translational-only motion. This approach provides more accurate and consistent scanning for larger tissue structures.
Abstract: Optical coherence tomography (OCT) is a non-invasive volumetric imaging modality with high spatial and temporal resolution. For imaging larger tissue structures, OCT probes need to be moved to scan the respective area. For handheld scanning, stitching of the acquired OCT volumes requires overlap to register the images. For robotic scanning and stitching, a typical approach is to restrict the motion to translations, as this avoids a full hand-eye calibration, which is complicated by the small field of view of most OCT probes. However, stitching by registration or by translational scanning are limited when curved tissue surfaces need to be scanned. We propose a marker for full six-dimensional hand-eye calibration of a robot-mounted OCT probe. We show that the calibration results in highly repeatable estimates of the transformation. Moreover, we evaluate robotic scanning of two phantom surfaces to demonstrate that the proposed calibration allows for consistent scanning of large, curved tissue surfaces. As the proposed approach does not rely on image registration, it does not suffer from a potential accumulation of errors along a scan path. We also illustrate the improvement compared to conventional 3D-translational robotic scanning.
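The payoff of a full 6D hand-eye calibration is that OCT samples can be chained into a common world frame by composing homogeneous transforms, with no image registration step to accumulate error. A minimal sketch (frame names are illustrative):

```python
import numpy as np

def oct_point_to_world(T_base_ee, T_ee_oct, p_oct):
    """Map an OCT-frame point into the robot base frame.
    T_base_ee: (4, 4) end-effector pose from the robot's forward kinematics.
    T_ee_oct:  (4, 4) hand-eye calibration (OCT frame in the EE frame).
    p_oct:     3-vector sampled in the OCT probe's coordinate frame."""
    p = np.append(np.asarray(p_oct, float), 1.0)   # homogeneous coordinates
    return (T_base_ee @ T_ee_oct @ p)[:3]          # chain transforms to world
```

Because every scan position maps through the same calibrated chain, points from arbitrarily oriented probe poses (including tilts over a curved surface) land in one consistent frame.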
[393] Tuning Real-World Image Restoration at Inference: A Test-Time Scaling Paradigm for Flow Matching Models
Purui Bai, Junxian Duan, Pin Wang, Jinhua Hao, Ming Sun, Chao Zhou, Huaibo Huang
Main category: cs.CV
TL;DR: ResFlow-Tuner: A flow matching-based image restoration framework using FLUX.1-dev with unified multi-modal fusion and test-time scaling for state-of-the-art performance.
Details
Motivation: To address challenges in efficiently leveraging ultra-large-scale pre-trained text-to-image models and fully exploiting their potential for real-world image restoration tasks.
Method: Uses FLUX.1-dev flow matching model with unified multi-modal fusion (encoding multi-modal conditions into unified sequence) and training-free test-time scaling paradigm that dynamically steers denoising direction using reward model feedback during inference.
Result: Achieves state-of-the-art performance across multiple standard benchmarks, validating flow matching models for low-level vision tasks and proposing efficient inference-time scaling paradigm for large pre-trained models.
Conclusion: ResFlow-Tuner demonstrates powerful capabilities of flow matching models in image restoration and introduces a novel, efficient inference-time scaling approach suitable for large pre-trained models.
Abstract: Although diffusion-based real-world image restoration (Real-IR) has achieved remarkable progress, efficiently leveraging ultra-large-scale pre-trained text-to-image (T2I) models and fully exploiting their potential remain significant challenges. To address this issue, we propose ResFlow-Tuner, an image restoration framework based on the state-of-the-art flow matching model, FLUX.1-dev, which integrates unified multi-modal fusion (UMMF) with test-time scaling (TTS) to achieve unprecedented restoration performance. Our approach fully leverages the advantages of the Multi-Modal Diffusion Transformer (MM-DiT) architecture by encoding multi-modal conditions into a unified sequence that guides the synthesis of high-quality images. Furthermore, we introduce a training-free test-time scaling paradigm tailored for image restoration. During inference, this technique dynamically steers the denoising direction through feedback from a reward model (RM), thereby achieving significant performance gains with controllable computational overhead. Extensive experiments demonstrate that our method achieves state-of-the-art performance across multiple standard benchmarks. This work not only validates the powerful capabilities of the flow matching model in low-level vision tasks but, more importantly, proposes a novel and efficient inference-time scaling paradigm suitable for large pre-trained models.
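The reward-guided steering above can be illustrated with a generic best-of-N sketch: at a denoising step, propose a few perturbed updates and keep the one the reward model scores highest. Everything here (`step_fn`, `reward_fn`, the Gaussian perturbation) is a hypothetical stand-in for the paper's components, not its actual algorithm.

```python
import numpy as np

def reward_guided_step(x, step_fn, reward_fn, n_candidates=4, noise_scale=0.1, seed=0):
    """One reward-steered update: take the nominal denoising step, propose a
    few perturbed variants, and keep whichever the reward model prefers.
    (Generic best-of-N sketch; not the paper's exact steering rule.)"""
    rng = np.random.default_rng(seed)
    base = step_fn(x)  # nominal denoising update
    candidates = [base] + [
        base + noise_scale * rng.standard_normal(base.shape)
        for _ in range(n_candidates - 1)
    ]
    scores = [reward_fn(c) for c in candidates]
    return candidates[int(np.argmax(scores))]

# Toy demo: "denoising" halves the signal; the reward prefers values near 1.0.
x = np.array([2.0])
step = lambda v: v / 2.0
reward = lambda v: -abs(float(v[0]) - 1.0)
out = reward_guided_step(x, step, reward)
```

In the actual system the candidates would be denoising directions of the flow matching model and the reward model would score restoration quality; the toy demo only shows the selection mechanics and how compute (number of candidates) trades off against quality.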
[394] SpatialReward: Verifiable Spatial Reward Modeling for Fine-Grained Spatial Consistency in Text-to-Image Generation
Sashuai Zhou, Qiang Zhou, Junpeng Ma, Yue Cao, Ruofan Hu, Ziang Zhang, Xiaoda Yang, Zhibin Wang, Jun Song, Cheng Yu, Bo Zheng, Zhou Zhao
Main category: cs.CV
TL;DR: SpatialReward is a verifiable reward model that evaluates spatial layouts in text-to-image generation using prompt decomposition, expert detectors, and vision-language reasoning to improve spatial consistency through RL training.
Details
Motivation: Existing text-to-image reward models focus on semantic alignment and visual quality but neglect fine-grained spatial relationships, leading to plausible-looking images with inaccurate object positioning and spatial inconsistencies.
Method: Multi-stage pipeline: 1) Prompt Decomposer extracts entities, attributes, and spatial metadata; 2) Expert detectors provide visual grounding of object positions; 3) Vision-language model uses chain-of-thought reasoning to assess complex spatial relations. Also introduces SpatRelBench benchmark for comprehensive evaluation.
Result: Incorporating SpatialReward into RL training for Stable Diffusion and FLUX consistently improves spatial consistency and overall generation quality, with results better aligned with human judgments.
Conclusion: Verifiable reward models like SpatialReward have significant potential for enabling more accurate and controllable optimization in text-to-image generation by addressing spatial relationship evaluation.
Abstract: Recent advances in text-to-image (T2I) generation via reinforcement learning (RL) have benefited from reward models that assess semantic alignment and visual quality. However, most existing reward models pay limited attention to fine-grained spatial relationships, often producing images that appear plausible overall yet contain inaccuracies in object positioning. In this work, we present SpatialReward, a verifiable reward model explicitly designed to evaluate spatial layouts in generated images. SpatialReward adopts a multi-stage pipeline: a Prompt Decomposer extracts entities, attributes, and spatial metadata from free-form prompts; expert detectors provide accurate visual grounding of object positions and attributes; and a vision-language model applies chain-of-thought reasoning over grounded observations to assess complex spatial relations that are challenging for rule-based methods. To more comprehensively evaluate spatial relationships in generated images, we introduce SpatRelBench, a benchmark covering object attributes, orientation, inter-object relations, and rendered text placement. Experiments on Stable Diffusion and FLUX show that incorporating SpatialReward into RL training consistently improves spatial consistency and overall generation quality, with results aligned more closely to human judgments. These findings indicate that verifiable reward models hold considerable potential for enabling more accurate and controllable optimization in text-to-image generation models.
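The "verifiable" part of the pipeline, where expert detectors ground object positions before the VLM reasons over them, can be sketched as a rule check on detected bounding boxes. The `verify_relation` helper and its rule set are illustrative assumptions; SpatialReward's actual scoring combines such rules with chain-of-thought reasoning for the relations rules cannot capture.

```python
def verify_relation(box_a, box_b, relation):
    """Rule-check a spatial relation between two detected boxes given as
    (x1, y1, x2, y2) in image coordinates (y grows downward). Illustrative
    stand-in for the rule-based checks fed by expert detectors."""
    cxa, cya = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
    cxb, cyb = (box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2
    rules = {
        "left of": cxa < cxb,
        "right of": cxa > cxb,
        "above": cya < cyb,   # smaller y means higher in the image
        "below": cya > cyb,
    }
    return rules[relation]

# "a cat to the left of a dog": the detector places the cat's center further left.
cat, dog = (10, 40, 50, 90), (70, 35, 120, 95)
ok = verify_relation(cat, dog, "left of")
```

A binary outcome like this is exactly what makes the reward "verifiable": it can be checked against the prompt's decomposed spatial metadata rather than estimated by a learned scorer alone.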
[395] GTSR: Subsurface Scattering Awared 3D Gaussians for Translucent Surface Reconstruction
Youwen Yuan, Xi Zhao
Main category: cs.CV
TL;DR: GTSR: A 3D Gaussian Splatting-based pipeline for reconstructing translucent objects from multi-view images using surface and interior Gaussians with Fresnel blending and Disney BSDF for enhanced detail.
Details
Motivation: Existing methods for translucent object reconstruction are computationally expensive (differentiable path tracing) or struggle with translucency (3DGS for opaque objects). There's a need for efficient reconstruction that properly handles translucent optical properties.
Method: Proposes GTSR using two sets of Gaussians: surface Gaussians for geometry and interior Gaussians for light scattering. Uses Fresnel term to blend both sets for rendering. Incorporates Disney BSDF model with deferred rendering to enhance normal and depth constraints for better detail reconstruction.
Result: Outperforms baseline methods on NeuralTO Syn dataset with real-time rendering performance. Successfully adapts to different translucent materials on extended dataset with varying material properties.
Conclusion: GTSR provides an efficient and effective solution for translucent object reconstruction that properly models optical properties while maintaining real-time rendering capabilities.
Abstract: Reconstructing translucent objects from multi-view images is a difficult problem. Previously, researchers have used differentiable path tracing and neural implicit fields, which require relatively large computational costs. Recently, many works have achieved good reconstruction results for opaque objects based on a 3DGS pipeline with much higher efficiency. However, such methods have difficulty dealing with translucent objects, because they do not consider the optical properties of translucent objects. In this paper, we propose a novel 3DGS-based pipeline (GTSR) to reconstruct the surface geometry of translucent objects. GTSR combines two sets of Gaussians, surface and interior Gaussians, which are used to model the surface and scattering color when light passes through translucent objects. To render the appearance of translucent objects, we introduce a method that uses the Fresnel term to blend two sets of Gaussians. Furthermore, to improve the reconstructed details of non-contour areas, we introduce the Disney BSDF model with deferred rendering to enhance constraints of the normal and depth. Experimental results demonstrate that our method outperforms baseline reconstruction methods on the NeuralTO Syn dataset while showing great real-time rendering performance. We also extend the dataset with new translucent objects of varying material properties and demonstrate our method can adapt to different translucent materials.
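The Fresnel-based blending of the two Gaussian sets can be illustrated with Schlick's approximation: reflectance grows toward grazing angles, so the surface (reflected) color dominates there, while the interior (scattered) color shows through at head-on viewing. This is a per-pixel sketch under assumed names; GTSR's actual blend operates on the splatted colors of its two Gaussian sets.

```python
def schlick_fresnel(cos_theta, f0=0.04):
    """Schlick's approximation to the Fresnel reflectance term."""
    return f0 + (1.0 - f0) * (1.0 - cos_theta) ** 5

def blend_translucent(surface_rgb, interior_rgb, cos_theta, f0=0.04):
    """Weight the surface (reflected) color against the interior (scattered)
    color by the Fresnel term. Illustrative per-pixel sketch, not GTSR's
    exact rendering equation."""
    f = schlick_fresnel(cos_theta, f0)
    return [f * s + (1.0 - f) * i for s, i in zip(surface_rgb, interior_rgb)]

# Grazing view (cos_theta -> 0): reflection dominates, the surface color wins.
grazing = blend_translucent([1.0, 1.0, 1.0], [0.2, 0.4, 0.6], cos_theta=0.0)
# Head-on view (cos_theta = 1): mostly the scattered interior color shows.
head_on = blend_translucent([1.0, 1.0, 1.0], [0.2, 0.4, 0.6], cos_theta=1.0)
```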
[396] DTVI: Dual-Stage Textual and Visual Intervention for Safe Text-to-Image Generation
Binhong Tan, Zhaoxin Wang, Handing Wang
Main category: cs.CV
TL;DR: DTVI is a dual-stage inference-time defense framework for safe text-to-image generation that uses category-aware sequence-level intervention on full prompt embeddings and visual generation attenuation to prevent unsafe content generation.
Details
Motivation: Existing T2I diffusion models can generate unsafe content, and current defense methods use category-agnostic token-level interventions that fail to capture distributed malicious semantics across token sequences and remain vulnerable to adversarial prompts.
Method: Proposes DTVI with two stages: 1) Category-aware sequence-level intervention on full prompt embeddings to capture distributed malicious semantics, and 2) Attenuation of remaining unsafe influences during visual generation stage.
Result: Achieves average Defense Success Rate (DSR) of 94.43% across sexual-category benchmarks and 88.56% across seven unsafe categories, while maintaining generation quality on benign prompts.
Conclusion: DTVI provides effective and robust defense against unsafe content generation in T2I models while preserving reasonable generation quality for benign prompts.
Abstract: Text-to-Image (T2I) diffusion models have demonstrated strong generation ability, but their potential to generate unsafe content raises significant safety concerns. Existing inference-time defense methods typically perform category-agnostic token-level intervention in the text embedding space, which fails to capture malicious semantics distributed across the full token sequence and remains vulnerable to adversarial prompts. In this paper, we propose DTVI, a dual-stage inference-time defense framework for safe T2I generation. Unlike existing methods that intervene on specific token embeddings, our method introduces category-aware sequence-level intervention on the full prompt embedding to better capture distributed malicious semantics, and further attenuates the remaining unsafe influences during the visual generation stage. Experimental results on real-world unsafe prompts, adversarial prompts, and multiple harmful categories show that our method achieves effective and robust defense while preserving reasonable generation quality on benign prompts, obtaining an average Defense Success Rate (DSR) of 94.43% across sexual-category benchmarks and 88.56% across seven unsafe categories.
[397] 3D-Layout-R1: Structured Reasoning for Language-Instructed Spatial Editing
Haoyu Zhen, Xiaolong Li, Yilin Zhao, Han Zhang, Sifei Liu, Kaichun Mo, Chuang Gan, Subhashree Radhakrishnan
Main category: cs.CV
TL;DR: A structured reasoning framework for text-guided spatial layout editing using scene-graph reasoning to improve spatial understanding and consistency in visual editing tasks.
Details
Motivation: LLMs and VLMs struggle with spatial understanding and layout consistency in fine-grained visual editing tasks, despite their impressive reasoning abilities. There's a need for better spatial coherence and control in text-conditioned visual editing.
Method: Introduces a Structured Reasoning framework that performs text-conditioned spatial layout editing via scene-graph reasoning. Given an input scene graph and natural-language instruction, the model reasons over the graph to generate an updated scene graph that satisfies the text condition while maintaining spatial coherence. Uses explicit structured relational representations to guide the reasoning process.
Result: Achieves 15% improvement in IoU and 25% reduction in center-distance error compared to CoT-SFT and vanilla GRPO baselines. Compared to SOTA zero-shot LLMs, achieves up to 20% higher mIoU, demonstrating markedly improved spatial precision. Evaluated on a new text-guided layout editing benchmark with sorting, spatial alignment, and room-editing tasks.
Conclusion: The structured reasoning approach with scene-graph representations significantly improves spatial understanding and layout consistency in visual editing tasks, offering better interpretability and control over spatial relationships compared to existing LLM/VLM approaches.
Abstract: Large Language Models (LLMs) and Vision Language Models (VLMs) have shown impressive reasoning abilities, yet they struggle with spatial understanding and layout consistency when performing fine-grained visual editing. We introduce a Structured Reasoning framework that performs text-conditioned spatial layout editing via scene-graph reasoning. Given an input scene graph and a natural-language instruction, the model reasons over the graph to generate an updated scene graph that satisfies the text condition while maintaining spatial coherence. By explicitly guiding the reasoning process through structured relational representations, our approach improves both interpretability and control over spatial relationships. We evaluate our method on a new text-guided layout editing benchmark encompassing sorting, spatial alignment, and room-editing tasks. Our training paradigm yields an average 15% improvement in IoU and 25% reduction in center-distance error compared to Chain of Thought Fine-tuning (CoT-SFT) and vanilla GRPO baselines. Compared to SOTA zero-shot LLMs, our best models achieve up to 20% higher mIoU, demonstrating markedly improved spatial precision.
[398] FontCrafter: High-Fidelity Element-Driven Artistic Font Creation with Visual In-Context Generation
Wuyang Luo, Chengkai Tan, Chang Ge, Binye Hong, Su Yang, Yongjiu Ma
Main category: cs.CV
TL;DR: FontCrafter: An element-driven framework for artistic font generation that uses in-context generation with element images as visual context, enabling high-fidelity reconstruction of both structured and amorphous elements while supporting flexible style control.
Details
Motivation: Existing artistic font generation approaches suffer from limited style diversity and coarse control. The paper explores element-driven font generation where elements (fundamental visual units like flowers or flames) serve as reference images for desired styles, aiming to overcome these limitations.
Method: Introduces FontCrafter framework with: 1) in-context generation strategy treating element images as visual context using inpainting model to transfer styles at pixel level, 2) Context-aware Mask Adapter (CMA) to inject shape information for glyph control, 3) training-free attention redirection for region-aware style control and stroke hallucination suppression, and 4) edge repainting for natural boundaries.
Result: FontCrafter achieves strong zero-shot generation performance, particularly in preserving structural and textural fidelity of both object elements (structured) and amorphous elements (unstructured). It also supports flexible controls like style mixture.
Conclusion: The proposed element-driven approach with FontCrafter framework successfully addresses limitations in artistic font generation, enabling diverse style creation with fine-grained control while maintaining high fidelity to reference elements.
Abstract: Artistic font generation aims to synthesize stylized glyphs based on a reference style. However, existing approaches suffer from limited style diversity and coarse control. In this work, we explore the potential of element-driven artistic font generation. Elements are the fundamental visual units of a font, serving as reference images for the desired style. Conceptually, we categorize elements into object elements (e.g., flowers or stones) with distinct structures and amorphous elements (e.g., flames or clouds) with unstructured textures. We introduce FontCrafter, an element-driven framework for font creation, and construct a large-scale dataset, ElementFont, which contains diverse element types and high-quality glyph images. However, achieving high-fidelity reconstruction of both texture and structure of reference elements remains challenging. To address this, we propose an in-context generation strategy that treats element images as visual context and uses an inpainting model to transfer element styles into glyph regions at the pixel level. To further control glyph shapes, we design a lightweight Context-aware Mask Adapter (CMA) that injects shape information. Moreover, a training-free attention redirection mechanism enables region-aware style control and suppresses stroke hallucination. In addition, edge repainting is applied to make boundaries more natural. Extensive experiments demonstrate that FontCrafter achieves strong zero-shot generation performance, particularly in preserving structural and textural fidelity, while also supporting flexible controls such as style mixture.
[399] UniMotion: A Unified Framework for Motion-Text-Vision Understanding and Generation
Ziyi Wang, Xinshun Wang, Shuang Chen, Yang Cong, Mengyuan Liu
Main category: cs.CV
TL;DR: UniMotion is a unified framework for simultaneous understanding and generation of human motion, natural language, and RGB images within a single architecture, overcoming limitations of existing models through continuous motion representation and novel alignment techniques.
Details
Motivation: Existing unified models only handle restricted modality subsets (e.g., Motion-Text or static Pose-Image) and rely on discrete tokenization, which introduces quantization errors and disrupts temporal continuity. There's a need for a truly unified framework that treats motion as a first-class continuous modality alongside RGB and language.
Method: UniMotion uses a Cross-Modal Aligned Motion VAE (CMA-VAE) and symmetric dual-path embedders to create parallel continuous pathways for Motion and RGB within a shared LLM backbone. It introduces Dual-Posterior KL Alignment (DPA) to inject visual-semantic priors into motion representations without requiring images at inference, and Latent Reconstruction Alignment (LRA) to address the cold-start problem through self-supervised pre-training.
Result: UniMotion achieves state-of-the-art performance across seven tasks spanning any-to-any understanding, generation, and editing among the three modalities (motion, language, RGB images), with especially strong advantages on cross-modal compositional tasks.
Conclusion: UniMotion demonstrates that treating motion as a first-class continuous modality and using novel alignment techniques enables effective unified modeling of motion, language, and vision, opening new possibilities for multimodal understanding and generation.
Abstract: We present UniMotion, to our knowledge the first unified framework for simultaneous understanding and generation of human motion, natural language, and RGB images within a single architecture. Existing unified models handle only restricted modality subsets (e.g., Motion-Text or static Pose-Image) and predominantly rely on discrete tokenization, which introduces quantization errors and disrupts temporal continuity. UniMotion overcomes both limitations through a core principle: treating motion as a first-class continuous modality on equal footing with RGB. A novel Cross-Modal Aligned Motion VAE (CMA-VAE) and symmetric dual-path embedders construct parallel continuous pathways for Motion and RGB within a shared LLM backbone. To inject visual-semantic priors into motion representations without requiring images at inference, we propose Dual-Posterior KL Alignment (DPA), which distills a vision-fused encoder’s richer posterior into the motion-only encoder. To address the cold-start problem – where text supervision alone is too sparse to calibrate the newly introduced motion pathway – we further propose Latent Reconstruction Alignment (LRA), a self-supervised pre-training strategy that uses dense motion latents as unambiguous conditions to co-calibrate the embedder, backbone, and flow head, establishing a stable motion-aware foundation for all downstream tasks. UniMotion achieves state-of-the-art performance across seven tasks spanning any-to-any understanding, generation, and editing among the three modalities, with especially strong advantages on cross-modal compositional tasks.
[400] SpatialBoost: Enhancing Visual Representation through Language-Guided Reasoning
Byungwoo Jeon, Dongyoung Kim, Huiwon Jang, Insoo Kim, Jinwoo Shin
Main category: cs.CV
TL;DR: SpatialBoost enhances pre-trained vision encoders with 3D spatial awareness by converting 3D spatial information from 2D images into linguistic descriptions and using LLMs to inject this knowledge through multi-turn reasoning.
Details
Motivation: Existing vision encoders are trained on 2D images and lack 3D spatial understanding of object relationships, limiting their effectiveness in downstream applications requiring spatial reasoning.
Method: Converts dense 3D spatial information from 2D images into linguistic expressions, then uses LLMs with multi-turn Chain-of-Thought reasoning to progressively inject spatial knowledge into vision encoders, building hierarchical spatial understanding.
Result: SpatialBoost improves DINOv3 performance from 55.9 to 59.7 mIoU on ADE20K, achieving state-of-the-art with 3.8% gain, showing effectiveness across benchmarks requiring 3D perception and general vision abilities.
Conclusion: SpatialBoost successfully enhances spatial awareness of vision encoders by leveraging linguistic descriptions of 3D spatial information through LLM reasoning, bridging the gap between 2D training and 3D understanding.
Abstract: Despite the remarkable success of large-scale pre-trained image representation models (i.e., vision encoders) across various vision tasks, they are predominantly trained on 2D image data and therefore often fail to capture 3D spatial relationships between objects and backgrounds in the real world, constraining their effectiveness in many downstream applications. To address this, we propose SpatialBoost, a scalable framework that enhances the spatial awareness of existing pre-trained vision encoders by injecting 3D spatial knowledge expressed in linguistic descriptions. The core idea involves converting dense 3D spatial information from 2D images into linguistic expressions, which is then used to inject such spatial knowledge into vision encoders through a Large Language Model (LLM). To this end, we adopt a multi-turn Chain-of-Thought (CoT) reasoning process that progressively incorporates dense spatial knowledge and builds hierarchical spatial understanding. To validate effectiveness, we adapt SpatialBoost to state-of-the-art vision encoders such as DINOv3, and evaluate its performance gains on a wide range of benchmarks requiring both 3D perception and general vision abilities. For instance, SpatialBoost improves DINOv3 performance from 55.9 to 59.7 mIoU on ADE20K, achieving state-of-the-art performance with 3.8% gain over the pre-trained DINOv3.
[401] End-to-End Training for Unified Tokenization and Latent Denoising
Shivam Duggal, Xingjian Bai, Zongze Wu, Richard Zhang, Eli Shechtman, Antonio Torralba, Phillip Isola, William T. Freeman
Main category: cs.CV
TL;DR: UNITE proposes a unified autoencoder architecture that combines image tokenization and latent diffusion generation into a single-stage training framework, achieving near SOTA performance without adversarial losses or pretrained encoders.
Details
Motivation: Current latent diffusion models require complex two-stage training: first train a tokenizer, then train diffusion in frozen latent space. This separation creates inefficiencies and suboptimal latent representations.
Method: UNITE uses a Generative Encoder that serves as both image tokenizer and latent generator via weight sharing. Tokenization and generation are treated as the same latent inference problem under different conditioning regimes. Single-stage training jointly optimizes both tasks via two forward passes through the same network.
Result: Achieves FID 2.12 and 1.73 for Base and Large models on ImageNet 256×256, near state-of-the-art performance without adversarial losses or pretrained encoders. Works across image and molecule modalities.
Conclusion: Single-stage joint training of tokenization and generation from scratch is feasible and effective, creating a “common latent language” that benefits both tasks through shared parameter optimization.
Abstract: Latent diffusion models (LDMs) enable high-fidelity synthesis by operating in learned latent spaces. However, training state-of-the-art LDMs requires complex staging: a tokenizer must be trained first, before the diffusion model can be trained in the frozen latent space. We propose UNITE - an autoencoder architecture for unified tokenization and latent diffusion. UNITE consists of a Generative Encoder that serves as both image tokenizer and latent generator via weight sharing. Our key insight is that tokenization and generation can be viewed as the same latent inference problem under different conditioning regimes: tokenization infers latents from fully observed images, whereas generation infers them from noise together with text or class conditioning. Motivated by this, we introduce a single-stage training procedure that jointly optimizes both tasks via two forward passes through the same Generative Encoder. The shared parameters enable gradients to jointly shape the latent space, encouraging a “common latent language”. Across image and molecule modalities, UNITE achieves near state of the art performance without adversarial losses or pretrained encoders (e.g., DINO), reaching FID 2.12 and 1.73 for Base and Large models on ImageNet 256 x 256. We further analyze the Generative Encoder through the lenses of representation alignment and compression. These results show that single stage joint training of tokenization & generation from scratch is feasible.
[402] Adapting Point Cloud Analysis via Multimodal Bayesian Distribution Learning
Xingyu Zhu, Liang Yi, Shuo Wang, Wenbo Zhu, Yonglinag Wu, Beier Zhu, Hanwang Zhang
Main category: cs.CV
TL;DR: BayesMM: A multimodal Bayesian distribution learning framework for test-time adaptation in 3D point cloud analysis that fuses textual priors with streaming visual features using Bayesian model averaging.
Details
Motivation: Current multimodal 3D vision-language models degrade under domain shifts, and existing test-time adaptation methods using cache-based mechanisms suffer from limited historical information storage and heuristic fusion of prediction logits, leading to progressive information loss and unstable adaptation.
Method: BayesMM models textual priors (from semantic prompts) and streaming visual features as Gaussian distributions, fuses them via Bayesian model averaging that automatically adjusts modality contributions based on posterior evidence, enabling continual adaptation to test-time data without training.
Result: Extensive experiments on multiple point cloud benchmarks show BayesMM maintains robustness under distributional shifts, achieving over 4% average improvement compared to existing methods.
Conclusion: BayesMM provides an effective Bayesian framework for test-time adaptation in multimodal 3D point cloud analysis that addresses limitations of cache-based methods through principled distribution learning and fusion.
Abstract: Multimodal 3D vision-language models show strong generalization across diverse 3D tasks, but their performance still degrades notably under domain shifts. This has motivated recent studies on test-time adaptation (TTA), which enables models to adapt online using test-time data. Among existing TTA methods, cache-based mechanisms are widely adopted for leveraging previously observed samples in online prediction refinement. However, they store only limited historical information, leading to progressive information loss as the test stream evolves. In addition, their prediction logits are fused heuristically, making adaptation unstable. To address these limitations, we propose BayesMM, a Multimodal Bayesian Distribution Learning framework for test-time point cloud analysis. BayesMM models textual priors and streaming visual features of each class as Gaussian distributions: textual parameters are derived from semantic prompts, while visual parameters are updated online with arriving samples. The two modalities are fused via Bayesian model averaging, which automatically adjusts their contributions based on posterior evidence, yielding a unified prediction that adapts continually to evolving test-time data without training. Extensive experiments on multiple point cloud benchmarks demonstrate that BayesMM maintains robustness under distributional shifts, yielding over 4% average improvement.
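The core fusion step, combining a textual Gaussian prior with a streaming visual Gaussian estimate so that the more confident modality dominates, can be illustrated with the standard precision-weighted product-of-Gaussians rule. A one-dimensional sketch (BayesMM's exact averaging over class posteriors may differ):

```python
def fuse_gaussians(mu_text, var_text, mu_vis, var_vis):
    """Precision-weighted fusion of a textual Gaussian prior with a streaming
    visual Gaussian estimate (product-of-Gaussians rule): the lower-variance,
    i.e. more confident, modality contributes more. Illustrative 1-D sketch."""
    prec_t, prec_v = 1.0 / var_text, 1.0 / var_vis
    var = 1.0 / (prec_t + prec_v)
    mu = var * (prec_t * mu_text + prec_v * mu_vis)
    return mu, var

# A confident visual estimate (small variance) pulls the fused mean toward it.
mu, var = fuse_gaussians(mu_text=0.0, var_text=4.0, mu_vis=1.0, var_vis=1.0)
```

This is the sense in which the fusion "automatically adjusts modality contributions": no heuristic logit mixing, just the evidence-weighted posterior.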
[403] P-Flow: Prompting Visual Effects Generation
Rui Zhao, Mike Zheng Shou
Main category: cs.CV
TL;DR: P-Flow: Training-free framework for customizing dynamic visual effects in video generation through iterative prompt optimization using vision-language models.
Details
Motivation: Current video generation models lack effective customization for dynamic visual effects (temporally evolving phenomena like explosions), which require complex temporal reasoning and iterative prompt refinement that is difficult for humans to craft manually.
Method: P-Flow uses vision-language models to perform test-time prompt optimization, refining prompts based on discrepancy between reference video effects and generated output, enabling iterative improvement without modifying the underlying video generation model.
Result: P-Flow achieves high-fidelity and diverse visual effect customization, outperforming other models on both text-to-video and image-to-video generation tasks.
Conclusion: The proposed training-free framework effectively customizes dynamic visual effects through iterative prompt optimization, addressing the challenge of complex temporal reasoning in video generation.
Abstract: Recent advancements in video generation models have significantly improved their ability to follow text prompts. However, the customization of dynamic visual effects, defined as temporally evolving and appearance-driven visual phenomena like object crushing or explosion, remains underexplored. Prior works on motion customization or control mainly focus on low-level motions of the subject or camera, which can be guided using explicit control signals such as motion trajectories. In contrast, dynamic visual effects involve higher-level semantics that are more naturally suited for control via text prompts. However, it is hard and time-consuming for humans to craft a single prompt that accurately specifies these effects, as they require complex temporal reasoning and iterative refinement over time. To address this challenge, we propose P-Flow, a novel training-free framework for customizing dynamic visual effects in video generation without modifying the underlying model. By leveraging the semantic and temporal reasoning capabilities of vision-language models, P-Flow performs test-time prompt optimization, refining prompts based on the discrepancy between the visual effects of the reference video and the generated output. Through iterative refinement, the prompts evolve to better induce the desired dynamic effect in novel scenes. Experiments demonstrate that P-Flow achieves high-fidelity and diverse visual effect customization and outperforms other models on both text-to-video and image-to-video generation tasks. Code is available at https://github.com/showlab/P-Flow.
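Structurally, the test-time prompt optimization described above is a generate-critique-revise cycle. In the sketch below, `generate` and `critique` are toy stand-ins for the video model and the VLM judge comparing against the reference effect:

```python
def refine_prompt(initial_prompt, generate, critique, max_iters=3):
    """Generate-critique-revise loop: render a video, let a critic compare it
    with the reference effect, and adopt the revised prompt until the critic
    is satisfied. (Structural sketch of the outer loop only.)"""
    prompt = initial_prompt
    for _ in range(max_iters):
        video = generate(prompt)
        satisfied, revised = critique(prompt, video)  # (done?, improved prompt)
        if satisfied:
            break
        prompt = revised
    return prompt

# Toy critic that demands "slow-motion" appear in the prompt.
gen = lambda p: f"video<{p}>"
crit = lambda p, v: ("slow-motion" in p, p + " in slow-motion")
final = refine_prompt("glass shattering", gen, crit)
```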
[404] Principled Steering via Null-space Projection for Jailbreak Defense in Vision-Language Models
Xingyu Zhu, Beier Zhu, Shuo Wang, Junfeng Fang, Kesen Zhao, Hanwang Zhang, Xiangnan He
Main category: cs.CV
TL;DR: NullSteer: A null-space projected activation defense framework that enhances safety against visual jailbreak attacks in VLMs without impairing general capabilities
Details
Motivation: Vision-language models deployed in open-world scenarios are vulnerable to visual jailbreak attacks that induce harmful content generation. Existing activation steering methods can cause over-refusal and degrade performance on benign inputs, lacking theoretical interpretability and robustness.
Method: Proposes NullSteer, a null-space projected activation defense framework that constructs refusal directions within model activations through linear transformation. Maintains zero perturbation in benign subspace while dynamically inducing refusal along potentially harmful directions.
Result: Significantly reduces harmful outputs under various jailbreak attacks (average ASR reduction over 15% on MiniGPT-4) while maintaining comparable performance to original model on general benchmarks.
Conclusion: NullSteer effectively balances safety and utility by theoretically achieving safety enhancement without impairing model’s general capabilities, addressing limitations of existing activation steering methods.
Abstract: As vision-language models (VLMs) are increasingly deployed in open-world scenarios, they can be easily induced by visual jailbreak attacks to generate harmful content, posing serious risks to model safety and trustworthy usage. Recent activation steering methods inject directional vectors into model activations during inference to induce refusal behaviors and have demonstrated effectiveness. However, a steering vector may both enhance refusal ability and cause over-refusal, thereby degrading model performance on benign inputs. Moreover, due to the lack of theoretical interpretability, these methods still suffer from limited robustness and effectiveness. To better balance safety and utility, we propose NullSteer, a null-space projected activation defense framework. Our method constructs refusal directions within model activations through a linear transformation: it maintains zero perturbation within the benign subspace while dynamically inducing refusal along potentially harmful directions, thereby theoretically achieving safety enhancement without impairing the model’s general capabilities. Extensive experiments show that NullSteer significantly reduces harmful outputs under various jailbreak attacks (average ASR reduction over 15 percent on MiniGPT-4) while maintaining comparable performance to the original model on general benchmarks.
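The core projection step admits a compact numerical sketch. This is a toy 3-dimensional illustration under the assumption of a known orthonormal benign basis; the actual method operates on high-dimensional VLM activations and derives its directions via a learned linear transformation.

```python
# Minimal sketch of null-space projected steering: the steering vector is
# projected out of the benign subspace, so benign inputs see exactly zero
# perturbation while harmful directions are still steered toward refusal.
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [a / n for a in v]

def null_space_project(v, benign_basis):
    # P v = v - sum_b <v, b> b for an orthonormal basis {b} of the
    # benign subspace; the result lies in that subspace's null space.
    out = list(v)
    for b in benign_basis:
        c = dot(out, b)
        out = [a - c * bi for a, bi in zip(out, b)]
    return out

benign = [normalize([1.0, 1.0, 0.0])]   # benign subspace (toy, 1-D)
refusal = [1.0, -0.2, 2.0]              # raw refusal/steering direction (toy)
steered = null_space_project(refusal, benign)

print(abs(dot(steered, benign[0])) < 1e-9)  # True: zero benign perturbation
```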
[405] IAG: Input-aware Backdoor Attack on VLM-based Visual Grounding
Junxian Li, Beining Xu, Simin Chen, Jiatong Li, Jingdi Lei, Haodong Zhao, Di Zhang
Main category: cs.CV
TL;DR: IAG: A novel multi-target backdoor attack on VLM-based visual grounding systems that dynamically generates input-aware, text-guided triggers conditioned on target object descriptions.
Details
Motivation: Despite advances in vision-language models for visual grounding, their security vulnerabilities remain unexplored. The paper aims to investigate realistic threats to VLM-based grounding systems, particularly multi-target backdoor attacks that can dynamically adapt to different target descriptions.
Method: Proposes IAG (Input-Aware, text-Guided attack) using a text-conditioned UNet to embed imperceptible target semantic cues into visual inputs. Employs joint training objective balancing language capability with perceptual reconstruction to ensure imperceptibility, effectiveness, and stealth.
Result: Achieves best attack success rates compared to baselines on multiple VLMs (LLaVA, InternVL, Ferret) and benchmarks (RefCOCO, RefCOCO+, RefCOCOg, Flickr30k Entities, ShowUI) without compromising clean accuracy. Maintains robustness against existing defenses and shows transferability across datasets and models.
Conclusion: Reveals critical security risks in grounding-capable VLMs, demonstrating the feasibility of realistic multi-target backdoor attacks. Highlights the need for further research on trustworthy multimodal understanding and security in vision-language systems.
Abstract: Recent advances in vision-language models (VLMs) have significantly enhanced the visual grounding task, which involves locating objects in an image based on natural language queries. Despite these advancements, the security of VLM-based grounding systems has not been thoroughly investigated. This paper reveals a novel and realistic vulnerability: the first multi-target backdoor attack on VLM-based visual grounding. Unlike prior attacks that rely on static triggers or fixed targets, we propose IAG, a method that dynamically generates input-aware, text-guided triggers conditioned on any specified target object description to execute the attack. This is achieved through a text-conditioned UNet that embeds imperceptible target semantic cues into visual inputs while preserving normal grounding performance on benign samples. We further develop a joint training objective that balances language capability with perceptual reconstruction to ensure imperceptibility, effectiveness, and stealth. Extensive experiments on multiple VLMs (e.g., LLaVA, InternVL, Ferret) and benchmarks (RefCOCO, RefCOCO+, RefCOCOg, Flickr30k Entities, and ShowUI) demonstrate that IAG achieves the best ASRs compared with other baselines on almost all settings without compromising clean accuracy, maintaining robustness against existing defenses, and exhibiting transferability across datasets and models. These findings underscore critical security risks in grounding-capable VLMs and highlight the need for further research on trustworthy multimodal understanding.
[406] FreeArtGS: Articulated Gaussian Splatting Under Free-moving Scenario
Hang Dai, Hongwei Fan, Han Zhang, Duojin Wu, Jiyao Zhang, Hao Dong
Main category: cs.CV
TL;DR: FreeArtGS reconstructs articulated objects from free-moving monocular RGB-D videos using part segmentation, joint estimation, and 3D Gaussian Splatting optimization.
Details
Motivation: Existing articulated object reconstruction methods require non-trivial axis alignment or suffer from insufficient coverage, limiting scalability for AR/robotics applications. There's a need for methods that work with simple setups and free-moving scenarios.
Method: Combines free-moving part segmentation (using point-tracking and feature model priors) with joint estimation (calibrating object-to-camera poses, recovering joint type/axis), and 3DGS-based end-to-end optimization to jointly reconstruct textures, geometry, and joint angles.
Result: Outperforms existing methods on two benchmarks and real-world free-moving articulated objects, excelling in free-moving scenarios while remaining competitive in previous reconstruction settings.
Conclusion: FreeArtGS provides a practical and effective solution for realistic asset generation in articulated object reconstruction, particularly valuable for AR and robotics applications.
Abstract: The increasing demand for augmented reality and robotics is driving the need for articulated object reconstruction with high scalability. However, existing settings for reconstructing from discrete articulation states or casual monocular videos require non-trivial axis alignment or suffer from insufficient coverage, limiting their applicability. In this paper, we introduce FreeArtGS, a novel method for reconstructing articulated objects under free-moving scenario, a new setting with a simple setup and high scalability. FreeArtGS combines free-moving part segmentation with joint estimation and end-to-end optimization, taking only a monocular RGB-D video as input. By optimizing with the priors from off-the-shelf point-tracking and feature models, the free-moving part segmentation module identifies rigid parts from relative motion under unconstrained capture. The joint estimation module calibrates the unified object-to-camera poses and recovers joint type and axis robustly from part segmentation. Finally, 3DGS-based end-to-end optimization is implemented to jointly reconstruct visual textures, geometry, and joint angles of the articulated object. We conduct experiments on two benchmarks and real-world free-moving articulated objects. Experimental results demonstrate that FreeArtGS consistently excels in reconstructing free-moving articulated objects and remains highly competitive in previous reconstruction settings, proving itself a practical and effective solution for realistic asset generation. The project page is available at: https://freeartgs.github.io/
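One sub-step of the joint estimation above — recovering a revolute joint's axis and angle from the relative rotation between two observed part poses — is standard axis-angle extraction, sketched here in pure Python. This is a textbook identity, not the paper's full pipeline, which also calibrates poses and handles prismatic joints.

```python
# Axis-angle extraction from a 3x3 rotation matrix: the angle comes from
# the trace, the axis from the skew-symmetric part of R.
import math

def axis_angle_from_rotation(R):
    tr = R[0][0] + R[1][1] + R[2][2]
    angle = math.acos(max(-1.0, min(1.0, (tr - 1) / 2)))
    s = 2 * math.sin(angle)  # assumes 0 < angle < pi
    axis = [(R[2][1] - R[1][2]) / s,
            (R[0][2] - R[2][0]) / s,
            (R[1][0] - R[0][1]) / s]
    return axis, angle

# Sanity check: a 90-degree rotation about the z-axis.
t = math.pi / 2
R = [[math.cos(t), -math.sin(t), 0.0],
     [math.sin(t),  math.cos(t), 0.0],
     [0.0,          0.0,         1.0]]
axis, angle = axis_angle_from_rotation(R)
print([round(a, 6) for a in axis], round(angle, 6))  # [0.0, 0.0, 1.0] 1.570796
```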
[407] StreamingClaw Technical Report
Jiawei Chen, Zhe Chen, Chaoqun Du, Maokui He, Wei He, Hengtao Li, Qizhen Li, Zide Liu, Hao Ma, Xuhao Pan, Chang Ren, Xudong Rao, Xintian Shen, Chenfeng Wang, Tao Wei, Chengjun Yu, Pengfei Yu, Shengyu Yao, Chunpeng Zhou, Kun Zhan, Lihao Zheng, Pan Zhou, Xuhan Zhu, Yufei Zheng
Main category: cs.CV
TL;DR: StreamingClaw is a unified agent framework for real-time streaming video understanding and embodied intelligence, supporting multimodal memory, proactive interaction, and perception-decision-action closed loops.
Details
Motivation: Current agents have fragmented capabilities: they only support offline video understanding, lack long-term multimodal memory, and struggle with real-time reasoning and proactive interaction under streaming inputs, which prevents them from sustaining perception and making real-time decisions in real-world environments.
Method: StreamingClaw integrates five core capabilities: 1) real-time streaming reasoning, 2) reasoning about future events and proactive interaction with evolving objectives, 3) multimodal long-term storage with hierarchical evolution and efficient retrieval across agents, 4) perception-decision-action closed loop with streaming tools and action-centric skills, and 5) compatibility with OpenClaw framework for community resources.
Result: The framework integrates online real-time reasoning, multimodal long-term memory, and proactive interaction within a unified system, enabling direct control of the physical world through executable actions for practical embodied interaction deployment.
Conclusion: StreamingClaw addresses key bottlenecks in streaming video understanding for embodied intelligence by providing a comprehensive framework that supports real-time multimodal interaction, memory, and action execution in physical environments.
Abstract: Applications such as embodied intelligence rely on a real-time perception-decision-action closed loop, posing stringent challenges for streaming video understanding. However, current agents suffer from fragmented capabilities, such as supporting only offline video understanding, lacking long-term multimodal memory mechanisms, or struggling to achieve real-time reasoning and proactive interaction under streaming inputs. These shortcomings have become a key bottleneck, preventing them from sustaining perception, making real-time decisions, and executing actions in real-world environments. To alleviate these issues, we propose StreamingClaw, a unified agent framework for streaming video understanding and embodied intelligence. It is also an OpenClaw-compatible framework that supports real-time, multimodal streaming interaction. StreamingClaw integrates five core capabilities: (1) It supports real-time streaming reasoning. (2) It supports reasoning about future events and proactive interaction under the online evolution of interaction objectives. (3) It supports multimodal long-term storage, hierarchical evolution, and efficient retrieval of shared memory across multiple agents. (4) It supports a closed loop of perception-decision-action. In addition to conventional tools and skills, it also provides streaming tools and action-centric skills tailored for real-world physical environments. (5) It is compatible with the OpenClaw framework, allowing it to fully leverage the resources and support of the open-source community. With these designs, StreamingClaw integrates online real-time reasoning, multimodal long-term memory, and proactive interaction within a unified framework. Moreover, by translating decisions into executable actions, it enables direct control of the physical world, supporting practical deployment of embodied interaction.
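The perception-decision-action closed loop at the center of the report reduces, in skeleton form, to a streaming loop over incoming observations. The stream, memory, policy, and actuator below are toy stand-ins, not StreamingClaw's actual interfaces.

```python
# Skeleton of a perception-decision-action closed loop over a stream:
# perceive each frame, append it to memory, decide, and emit an action.

def run_closed_loop(frames):
    memory = []                       # stand-in for long-term multimodal memory
    actions = []
    for frame in frames:              # streaming perception
        memory.append(frame)
        if frame == "door_closed":    # decision (toy policy)
            actions.append("open_door")  # action executed in the environment
        else:
            actions.append("observe")
    return memory, actions

memory, actions = run_closed_loop(["hallway", "door_closed", "room"])
print(actions)  # ['observe', 'open_door', 'observe']
```

The point of the sketch is the coupling: every perception step can update memory and trigger an action before the stream ends, rather than after offline processing of the whole video.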
[408] Biophysics-Enhanced Neural Representations for Patient-Specific Respiratory Motion Modeling
Jan Boysen, Hristina Uzunova, Heinz Handels, Jan Ehrhardt
Main category: cs.CV
TL;DR: PRISM-RM uses implicit neural representations with physics regularization for respiratory motion modeling in radiotherapy, providing continuous diffeomorphic motion without fixed reference state.
Details
Motivation: Respiratory motion introduces significant uncertainties in lung/abdominal radiotherapy, requiring motion management. Current models need improvement in generalization and extrapolation capabilities for better dose targeting.
Method: Proposes physics-regularized implicit neural representations (INR) for surrogate-based motion modeling. Uses trajectory-aware spatio-temporally continuous diffeomorphic motion representation with biophysical constraints for physiological plausibility.
Result: Trajectory-aware approach performs on par in interpolation and improves extrapolation compared to initial INR approach. Both INR approaches perform equally well in interpolation but underperform in extrapolation compared to sequential registration methods.
Conclusion: INRs show strong potential for respiratory motion modeling with methodological advantages, though current extrapolation performance needs improvement. Physics regularization enhances physiological plausibility.
Abstract: A precise spatial delivery of the radiation dose is crucial for the treatment success in radiotherapy. In the lung and upper abdominal region, respiratory motion introduces significant treatment uncertainties, requiring special motion management techniques. To address this, respiratory motion models are commonly used to infer the patient-specific respiratory motion and target the dose more efficiently. In this work, we investigate the possibility of using implicit neural representations (INR) for surrogate-based motion modeling. Therefore, we propose physics-regularized implicit surrogate-based modeling for respiratory motion (PRISM-RM). Our new integrated respiratory motion model is free of a fixed reference breathing state. Unlike conventional pairwise registration techniques, our approach provides a trajectory-aware spatio-temporally continuous and diffeomorphic motion representation, improving generalization to extrapolation scenarios. We introduce biophysical constraints, ensuring physiologically plausible motion estimation across time beyond the training data. Our results show that our trajectory-aware approach performs on par in interpolation and improves the extrapolation ability compared to our initially proposed INR-based approach. Compared to sequential registration-based approaches, both our approaches perform equally well in interpolation, but underperform in extrapolation scenarios. However, the methodological features of INRs make them particularly effective for respiratory motion modeling, and with their performance steadily improving, they demonstrate strong potential for advancing this field.
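The general shape of a physics-regularized objective can be sketched with a finite-difference penalty: a data term fits observed displacements, and a regularizer penalizes implausibly rough motion. The displacement function and the smoothness penalty below are illustrative stand-ins, not PRISM-RM's neural representation or its specific biophysical constraints.

```python
# Sketch of a physics-style regularized loss for a displacement field
# u(x, t): data fidelity plus a squared-gradient smoothness penalty
# estimated by central finite differences. The field is a toy function.

def displacement(x, t):
    return 0.5 * x * t  # toy displacement field u(x, t)

def data_loss(points, targets):
    return sum((displacement(x, t) - u) ** 2
               for (x, t), u in zip(points, targets)) / len(points)

def smoothness_penalty(points, h=1e-4):
    # squared central-difference estimate of du/dx, averaged over samples
    grads = [((displacement(x + h, t) - displacement(x - h, t)) / (2 * h)) ** 2
             for x, t in points]
    return sum(grads) / len(grads)

points = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.5)]
targets = [0.0, 0.5, 0.5]   # observed displacements at those (x, t) samples
lam = 0.1                   # regularization weight (illustrative)
total = data_loss(points, targets) + lam * smoothness_penalty(points)
print(round(total, 6))  # 0.010417
```

In the actual model the regularizer is evaluated on the INR's continuous output, which is what lets the constraint act beyond the training time points.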
[409] DA-VAE: Plug-in Latent Compression for Diffusion via Detail Alignment
Xin Cai, Zhiyuan You, Zhoutong Zhang, Tianfan Xue
Main category: cs.CV
TL;DR: DA-VAE increases VAE compression ratio for latent diffusion models by expanding latent dimensionality while preserving pretrained structure, enabling high-resolution generation with fewer tokens.
Details
Motivation: Reducing token count is crucial for efficient training/inference in latent diffusion models, especially at high resolutions. Existing high-compression tokenizers lose meaningful structure, making diffusion training harder, while retraining is costly.
Method: DA-VAE uses explicit latent layout: first C channels from pretrained VAE at base resolution, plus D channels for higher-resolution details. Detail-alignment mechanism preserves original structure. Warm-start fine-tuning adapts pretrained diffusion backbone lightly.
Result: Enables 1024×1024 generation with Stable Diffusion 3.5 using only 32×32 tokens (4× fewer), within 5 H100-days. Unlocks 2048×2048 generation with SD3.5, achieving 6× speedup while preserving quality. Validated quantitatively on ImageNet.
Conclusion: DA-VAE provides efficient high-resolution generation for latent diffusion models by expanding VAE compression while preserving pretrained structure, requiring minimal backbone adaptation.
Abstract: Reducing token count is crucial for efficient training and inference of latent diffusion models, especially at high resolution. A common strategy is to build high-compression image tokenizers with more channels per token. However, when trained only for reconstruction, high-dimensional latent spaces often lose meaningful structure, making diffusion training harder. Existing methods address this with extra objectives such as semantic alignment or selective dropout, but usually require costly diffusion retraining. Pretrained diffusion models, however, already exhibit a structured, lower-dimensional latent space; thus, a simpler idea is to expand the latent dimensionality while preserving this structure. We therefore propose Detail-Aligned VAE, which increases the compression ratio of a pretrained VAE with only lightweight adaptation of the pretrained diffusion backbone. DA-VAE uses an explicit latent layout: the first C channels come directly from the pretrained VAE at a base resolution, while an additional D channels encode higher-resolution details. A simple detail-alignment mechanism encourages the expanded latent space to retain the structure of the original one. With a warm-start fine-tuning strategy, our method enables 1024×1024 image generation with Stable Diffusion 3.5 using only 32×32 tokens, 4× fewer than the original model, within 5 H100-days. It further unlocks 2048×2048 generation with SD3.5, achieving a 6× speedup while preserving image quality. We also validate the method and its design choices quantitatively on ImageNet.
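The explicit latent layout and the alignment idea can be illustrated on flat toy vectors: the first C channels of the expanded latent are tied to the frozen pretrained latent by an alignment loss, while D extra channels are free to carry detail. Channel counts and values are illustrative, and the per-token vectors here stand in for full spatial latents.

```python
# Toy sketch of a DA-VAE-style latent layout: C base channels aligned to
# a frozen pretrained latent, plus D unconstrained detail channels.
C, D = 4, 4

def detail_alignment_loss(expanded_latent, pretrained_latent):
    # MSE between the first C channels and the frozen pretrained latent
    base = expanded_latent[:C]
    return sum((a - b) ** 2 for a, b in zip(base, pretrained_latent)) / C

pretrained = [0.1, -0.2, 0.3, 0.0]             # frozen VAE latent (C channels)
expanded = pretrained + [0.9, -0.5, 0.2, 0.7]  # C base + D detail channels

print(len(expanded), detail_alignment_loss(expanded, pretrained))  # 8 0.0
```

Because the base slice matches the pretrained latent, a diffusion backbone trained on the original latent space only needs light adaptation to consume the expanded one.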
[410] OpenEarth-Agent: From Tool Calling to Tool Creation for Open-Environment Earth Observation
Sijie Zhao, Feng Liu, Xueliang Zhang, Hao Chen, Xinyu Gu, Zhe Jiang, Fenghua Ling, Ben Fei, Wenlong Zhang, Junjue Wang, Weihao Xuan, Pengfeng Xiao, Naoto Yokoya, Lei Bai
Main category: cs.CV
TL;DR: OpenEarth-Agent: A tool-creation agent framework for Earth Observation that adapts to unseen data/tasks through workflow planning and tool creation, outperforming traditional tool-calling agents in open environments.
Details
Motivation: Earth Observation faces challenges in open environments due to diverse multi-source data and heterogeneous tasks. Existing tool-calling agents are limited to closed environments with pre-defined tools, restricting generalization to unseen data and tasks.
Method: Introduces OpenEarth-Agent framework with adaptive workflow planning and tool creation capabilities instead of calling predefined tools. Uses open-ended integration of multi-stage tools and cross-domain knowledge bases for robust execution across the entire EO pipeline.
Result: OpenEarth-Agent successfully masters full-pipeline EO across multiple domains in open environments. On cross-benchmark Earth-Bench, with only 6 essential pre-trained models, it achieves performance comparable to tool-calling agents using 104 specialized tools, and outperforms them with complete toolset. Created tools show superior robustness to data anomalies.
Conclusion: OpenEarth-Agent represents a significant advancement for autonomous Earth Observation in open environments, demonstrating that tool-creation agents can effectively generalize to diverse data and tasks beyond the limitations of traditional tool-calling approaches.
Abstract: Earth Observation (EO) is essential for perceiving dynamic land surface changes, yet deploying autonomous EO in open environments is hindered by the immense diversity of multi-source data and heterogeneous tasks. While remote sensing agents have emerged to streamline EO workflows, existing tool-calling agents are confined to closed environments. They rely on pre-defined tools and are restricted to narrow scope, limiting their generalization to the diverse data and tasks. To overcome these limitations, we introduce OpenEarth-Agent, the first tool-creation agent framework tailored for open-environment EO. Rather than calling predefined tools, OpenEarth-Agent employs adaptive workflow planning and tool creation to generalize to unseen data and tasks. This adaptability is bolstered by an open-ended integration of multi-stage tools and cross-domain knowledge bases, enabling robust execution in the entire EO pipeline across multiple application domains. To comprehensively evaluate EO agents in open environments, we propose OpenEarth-Bench, a novel benchmark comprising 596 real-world, full-pipeline cases across seven application domains, explicitly designed to assess agents’ adaptive planning and tool creation capabilities. Only essential pre-trained model tools are provided in this benchmark, devoid of any other predefined task-specific tools. Extensive experiments demonstrate that OpenEarth-Agent successfully masters full-pipeline EO across multiple domains in the open environment. Notably, on the cross-benchmark Earth-Bench, our tool-creating agent equipped with 6 essential pre-trained models achieves performance comparable to tool-calling agents relying on 104 specialized tools, and significantly outperforms them when provided with the complete toolset. In several cases, the created tools exhibit superior robustness to data anomalies compared to human-engineered counterparts.
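The shift from tool calling to tool creation can be sketched in a few lines: instead of picking from a fixed registry, the agent synthesizes a new tool (here from a hard-coded source string standing in for LLM-generated code), registers it, and executes it. The registry, `create_tool`, and the NDVI example are illustrative, not OpenEarth-Agent's actual API.

```python
# Conceptual sketch of tool creation: the agent "writes" a tool for a
# task with no predefined tool, registers it, then calls it like any
# built-in tool.
registry = {}

def create_tool(name, source):
    namespace = {}
    exec(source, namespace)           # compile the generated tool code
    registry[name] = namespace[name]  # register it for later calls
    return registry[name]

# Generated code for a common EO index (NDVI) the agent lacked a tool for:
generated_code = """
def ndvi(nir, red):
    return (nir - red) / (nir + red)
"""
create_tool("ndvi", generated_code)

print(round(registry["ndvi"](0.8, 0.2), 3))  # 0.6
```

A production system would of course sandbox and validate generated code before registering it; the sketch only shows the create-register-call cycle.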
[411] From Synthetic Scenes to Real Performance: Enhancing Spatial Reasoning in VLMs
Massimo Rizzoli, Simone Alghisi, Seyed Mahed Mousavi, Giuseppe Riccardi
Main category: cs.CV
TL;DR: Fine-tuning VLMs with controlled synthetic data generation improves performance and reduces biases compared to real-world data fine-tuning.
Details
Motivation: Real-world data collection for VLM fine-tuning often suffers from biases, errors, and distribution imbalance, leading to overfitting and imbalanced performance. Existing synthetic data approaches lack control over distribution bias and annotation quality.
Method: Redesigned fine-tuning process with: 1) Controlled generation of synthetic data and annotations free from bias, distribution imbalance, and errors by comprehensively sampling object attributes (color, shape, size, position); 2) Fine-tuning state-of-the-art VLMs on this balanced synthetic dataset and evaluating transferability to real-world data on absolute position tasks.
Result: Two key findings: 1) Fine-tuning on balanced synthetic data yields uniform performance across visual scenes and mitigates common biases; 2) Fine-tuning on synthetic stimuli improves performance by 13% on real-world data (COCO), outperforming models fine-tuned on the full COCO train set.
Conclusion: Controlled synthetic data generation for VLM fine-tuning is an effective approach to address biases and distribution issues in real-world data, leading to better performance and more balanced models.
Abstract: Fine-tuning Vision-Language Models (VLMs) is a common strategy to improve performance following an ad-hoc data collection and annotation of real-world scenes. However, this process is often prone to biases, errors, and distribution imbalance, resulting in overfitting and imbalanced performance. Although a few studies have tried to address this problem by generating synthetic data, they lacked control over distribution bias and annotation quality. To address these challenges, we redesign the fine-tuning process in two ways. First, we control the generation of data and its annotations, ensuring it is free from bias, distribution imbalance, and annotation errors. We automatically construct the dataset by comprehensively sampling objects’ attributes, including color, shape, size, and position within the scene. Secondly, using this annotated dataset, we fine-tune state-of-the-art VLMs and assess performance transferability to real-world data on the absolute position task. We conduct exhaustive evaluations on both synthetic and real-world benchmarks. Our experiments reveal two key findings: 1) fine-tuning on balanced synthetic data yields uniform performance across the visual scene and mitigates common biases; and 2) fine-tuning on synthetic stimuli improves performance by 13% on real-world data (COCO), outperforming models fine-tuned on the full COCO train set.
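The balanced sampling idea reduces to enumerating the full cross product of object attributes so every combination appears exactly once. The attribute values and question template below are illustrative, not the paper's exact vocabulary.

```python
# Balanced synthetic annotation by exhaustive attribute sampling: every
# (color, shape, size, position) combination occurs exactly once, so no
# attribute value is over-represented.
from itertools import product

colors = ["red", "green", "blue"]
shapes = ["cube", "sphere"]
sizes = ["small", "large"]
positions = ["left", "center", "right"]

dataset = [
    {"color": c, "shape": sh, "size": sz, "position": p,
     "question": f"Where is the {sz} {c} {sh}?", "answer": p}
    for c, sh, sz, p in product(colors, shapes, sizes, positions)
]

print(len(dataset))  # 3 * 2 * 2 * 3 = 36 samples
counts = {p: sum(s["position"] == p for s in dataset) for p in positions}
print(counts)  # {'left': 12, 'center': 12, 'right': 12} -- perfectly balanced
```

Since the scene renderer is driven by the same attribute tuples, annotations are correct by construction, which is what removes labeling error as a confound.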
[412] ACPO: Counteracting Likelihood Displacement in Vision-Language Alignment with Asymmetric Constraints
Kaili Huang, Hongming Zhang, Rui Shen, Linjun Dai, Jiahao Wang, Hanming Deng, Lewei Lu
Main category: cs.CV
TL;DR: ACPO addresses DPO’s Likelihood Displacement in multimodal alignment by applying asymmetric scaling to suppress rejected responses while preserving chosen distributions, preventing Visual Anchor Collapse and reducing hallucinations in vision-language models.
Details
Motivation: Standard DPO suffers from Likelihood Displacement where both chosen and rejected response probabilities collapse, causing Visual Anchor Collapse in multimodal settings where models abandon visual evidence for language priors, leading to hallucinations.
Method: Proposes Asymmetric Constrained Preference Optimization (ACPO) - a modality-agnostic alignment mechanism that applies dynamic, target-oriented scaling to preference optimization. It uses complexity-aware scaling coefficients applied exclusively to rejected rewards, asymmetrically suppressing gradient flow on rejected terms while preserving chosen distribution as gradient-stable reference.
Result: ACPO effectively reverses chosen-reward degradation of standard DPO, halts Visual Anchor Collapse, outperforms baselines on hallucination benchmarks (HallusionBench, MM-IFEval) and general leaderboards (MMBench, MMStar, OCRBenchV2) while improving general capabilities.
Conclusion: ACPO addresses fundamental optimization flaws in DPO for multimodal alignment, preventing visual evidence abandonment and reducing hallucinations through asymmetric gradient suppression, making it crucial for vision-language model alignment.
Abstract: While Direct Preference Optimization (DPO) has become the de facto approach for aligning Large Vision-Language Models (LVLMs), it suffers from Likelihood Displacement, where the probability of both chosen and rejected responses collapses. This optimization flaw is especially detrimental in multimodal settings: the erosion of chosen likelihoods – a failure we term Visual Anchor Collapse – causes models to abandon visual evidence for strong language priors, precipitating significant hallucinations. To address this, we propose Asymmetric Constrained Preference Optimization (ACPO), a modality-agnostic alignment mechanism that applies dynamic, target-oriented scaling to preference optimization. ACPO derives a complexity-aware scaling coefficient applied exclusively to the rejected reward, asymmetrically suppressing the gradient flow on the rejected term while preserving the chosen distribution as a gradient-stable reference. While fundamentally a general-purpose objective, breaking this gradient symmetry is crucial for multimodal tasks, as it mitigates the suppression of visual tokens by language priors. Experiments on InternVL models demonstrate that ACPO effectively reverses the chosen-reward degradation of standard DPO. By halting Visual Anchor Collapse, ACPO generally outperforms baselines on hallucination benchmarks (HallusionBench, MM-IFEval) and general leaderboards (MMBench, MMStar, OCRBenchV2) while driving concurrent improvements in general capabilities.
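The asymmetry can be made concrete against the standard DPO objective. In this hedged sketch, a fixed coefficient s is applied only to the rejected log-ratio, so the gradient flowing through the rejected term is damped by a factor of s while the chosen term is untouched; the paper derives the coefficient dynamically from a complexity-aware rule rather than fixing it.

```python
# DPO vs. an ACPO-style asymmetric variant on scalar log-ratios
# log(pi_theta / pi_ref) for the chosen and rejected responses.
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def dpo_loss(chosen_ratio, rejected_ratio, beta=0.1):
    return -math.log(sigmoid(beta * (chosen_ratio - rejected_ratio)))

def acpo_loss(chosen_ratio, rejected_ratio, beta=0.1, s=0.5):
    # asymmetric constraint: only the rejected term is scaled, so
    # d(loss)/d(rejected_ratio) is damped by the factor s
    return -math.log(sigmoid(beta * (chosen_ratio - s * rejected_ratio)))

chosen, rejected = 1.0, -2.0  # toy log-ratio values
print(round(dpo_loss(chosen, rejected), 4),
      round(acpo_loss(chosen, rejected), 4))
```

Because pushing the rejected log-ratio down now moves the logit only half as much, the optimizer has less incentive to drive both likelihoods down together, which is the displacement failure mode the paper targets.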
[413] AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition
Zichuan Lin, Yicheng Liu, Yang Yang, Lvfang Tao, Deheng Ye
Main category: cs.CV
TL;DR: AdaptVision enables VLMs to autonomously determine minimum visual tokens needed per sample using coarse-to-fine active vision with bounding box tools, trained via reinforcement learning with decoupled objectives.
Details
Motivation: Current VLMs use excessive visual tokens causing computational overhead, and existing efficient methods use fixed-ratio compression without adapting to varying task requirements. The paper aims to enable VLMs to autonomously determine minimum visual tokens needed per sample.
Method: Proposes AdaptVision with coarse-to-fine visual token acquisition: starts with compressed low-resolution tokens, selectively invokes bounding box tool to crop key regions when needed. Uses reinforcement learning with Decoupled Turn Policy Optimization (DTPO) that separates tool learning (correct tool usage) and accuracy improvement (response refinement) objectives.
Result: AdaptVision achieves superior performance on multiple VQA benchmarks while consuming substantially fewer visual tokens than state-of-the-art efficient VLM methods.
Conclusion: The proposed active vision approach enables adaptive visual token acquisition, balancing accuracy and efficiency through decoupled reinforcement learning optimization.
Abstract: Vision-Language Models (VLMs) have achieved remarkable success in visual question answering tasks, but their reliance on large numbers of visual tokens introduces significant computational overhead. While existing efficient VLM approaches reduce visual tokens through fixed-ratio compression, they operate passively and lack the ability to adapt to varying task requirements. This motivates a fundamental question: Can VLMs autonomously determine the minimum number of visual tokens required for each sample? Inspired by human active vision mechanisms, we introduce AdaptVision, an efficient VLM paradigm that enables adaptive visual token acquisition through a coarse-to-fine approach. Our model initially processes compressed visual tokens from low-resolution images and selectively acquires additional visual information by invoking a bounding box tool to crop key regions when necessary. We train AdaptVision using a reinforcement learning framework that carefully balances accuracy and efficiency. Central to our approach is Decoupled Turn Policy Optimization (DTPO), which decouples the learning objective into two components: (1) tool learning, which optimizes correct tool utilization, and (2) accuracy improvement, which refines the generated responses to improve answer correctness. Based on this formulation, we further decouple advantage estimation by computing separate advantages for tokens associated with each objective. This formulation enables more effective optimization for AdaptVision compared to vanilla GRPO. Comprehensive experiments across multiple VQA benchmarks demonstrate that AdaptVision achieves superior performance while consuming substantially fewer visual tokens than state-of-the-art efficient VLM methods.
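The advantage-decoupling idea can be illustrated with group-normalized advantages in the GRPO style: tool-usage rewards and answer-accuracy rewards are normalized separately, yielding two advantage streams applied to different token spans. Reward values and group size are toy, and the normalization below is a plain GRPO-style baseline, not DTPO's exact formulation.

```python
# Decoupled advantage estimation: one group of rollouts yields separate
# normalized advantages for the tool objective and the accuracy objective.
import math

def group_advantages(rewards):
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards)) or 1.0
    return [(r - mean) / std for r in rewards]

# One group of 4 rollouts for the same question:
tool_rewards = [1.0, 0.0, 1.0, 1.0]      # did the rollout use the crop tool correctly?
accuracy_rewards = [1.0, 0.0, 0.0, 1.0]  # was the final answer correct?

tool_adv = group_advantages(tool_rewards)    # applied to tool-call tokens
acc_adv = group_advantages(accuracy_rewards) # applied to answer tokens
print([round(a, 3) for a in tool_adv])  # [0.577, -1.732, 0.577, 0.577]
print([round(a, 3) for a in acc_adv])   # [1.0, -1.0, -1.0, 1.0]
```

Note that rollout 3 used the tool correctly but answered wrong: with a single mixed reward its tool-call tokens would be penalized, whereas the decoupled advantages still reinforce them.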
[414] A Backbone Benchmarking Study on Self-supervised Learning as an Auxiliary Task with Texture-based Local Descriptors for Face Analysis
Shukesh Reddy, Abhijit Das
Main category: cs.CV
TL;DR: Benchmarking different backbones for self-supervised learning (SSL) as an auxiliary task in face analysis, specifically using a Masked Auto-Encoder (MAE) to reconstruct texture features alongside primary tasks in the Local Pattern SSAT framework.
Details
Motivation: To understand the impact of different backbone architectures on self-supervised learning for face analysis, specifically investigating how combining primary tasks with SSL auxiliary tasks (like MAE) affects representation learning for various face analysis tasks.
Method: Proposed Local Pattern SSAT (L-SSAT) framework combining primary face analysis tasks with SSL auxiliary task of Masked Auto-Encoder (MAE) for texture feature reconstruction. Conducted comprehensive benchmarking across multiple backbone architectures (shallow to deep) and evaluated on three face analysis tasks: face attribute prediction, emotion classification, and deepfake detection.
Result: Achieved average accuracies of 0.94 on FaceForensics++ (deepfake detection), 0.87 on CelebA (face attribute prediction), and 0.88 on AffectNet (emotion classification). Found that backbone effectiveness is highly task-dependent with no single unified backbone performing best across all face analysis paradigms.
Conclusion: Backbone selection for SSL-based face analysis is highly dependent on the specific downstream task, with no single generalized backbone performing optimally across different face analysis paradigms. The study provides guidance for backbone selection based on target application.
Abstract: In this work, we benchmark different backbones and study their impact on self-supervised learning (SSL) as an auxiliary task to blend texture-based local descriptors into feature modelling for efficient face analysis. It is established in previous work that combining a primary task and a self-supervised auxiliary task enables more robust and discriminative representation learning. We employed different shallow to deep backbones for the SSL task of Masked Auto-Encoder (MAE) as an auxiliary objective to reconstruct texture features such as local patterns alongside the primary task in local pattern SSAT (L-SSAT), ensuring robust and unbiased face analysis. To expand the benchmark, we conducted a comprehensive comparative analysis across multiple model configurations within the proposed framework. To this end, we address three research questions: “What is the role of the backbone in the performance of L-SSAT?”, “What type of backbone is effective for different face analysis tasks?”, and “Is there any generalized backbone for effective face analysis with L-SSAT?”. Towards answering these questions, we provide a detailed study and experiments. The performance evaluation demonstrates that the backbone for the proposed method is highly dependent on the downstream task, achieving average accuracies of 0.94 on FaceForensics++, 0.87 on CelebA, and 0.88 on AffectNet. No single unified backbone offers consistent feature representation quality and generalisation capability across the various face analysis paradigms considered, including face attribute prediction, emotion classification, and deepfake detection.
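The L-SSAT objective above (primary task plus an MAE auxiliary reconstruction term over texture features) can be sketched as a weighted sum. A minimal NumPy sketch, assuming a masked-patch MSE for the auxiliary term and a hypothetical `aux_weight` factor; function names are illustrative, not the authors' code.

```python
# Illustrative L-SSAT-style joint objective (hypothetical names): the primary
# task loss is combined with an MAE auxiliary loss that reconstructs texture
# features only at the masked patches, as in Masked Auto-Encoder training.
import numpy as np

def mae_masked_mse(pred, target, mask):
    """MAE-style reconstruction loss, computed only over masked patches.
    pred/target: (n_patches, feat_dim); mask: 1 where a patch was masked."""
    m = np.asarray(mask, dtype=bool)
    return float(np.mean((pred[m] - target[m]) ** 2))

def lssat_loss(primary_loss, recon_pred, recon_target, patch_mask, aux_weight=0.5):
    """Primary task loss plus the weighted MAE auxiliary term."""
    return primary_loss + aux_weight * mae_masked_mse(recon_pred, recon_target, patch_mask)
```

The backbone in question is shared: it feeds both the primary head and the MAE decoder, which is why its choice affects both terms at once.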
[415] PAM: A Pose-Appearance-Motion Engine for Sim-to-Real HOI Video Generation
Mingju Gao, Kaisen Yang, Huan-ang Gao, Bohan Li, Ao Ding, Wenyi Li, Yangcheng Yu, Jinkun Liu, Shaocong Xu, Yike Niu, Haohan Chi, Hao Chen, Hao Tang, Li Yi, Hao Zhao
Main category: cs.CV
TL;DR: PAM: A unified Pose-Appearance-Motion Engine for controllable hand-object interaction video generation that combines pose, appearance, and motion in one framework, outperforming existing methods on benchmarks and enabling effective data augmentation for downstream tasks.
Details
Motivation: Existing HOI generation research is fragmented across three disjoint tracks: pose-only synthesis (no pixels), single-image HOI generation (no dynamics), and video generation methods requiring full pose sequences and ground-truth first frames. There's a need for a unified engine that brings together pose, appearance, and motion for controllable HOI video generation.
Method: PAM (Pose-Appearance-Motion Engine) is a unified framework for controllable HOI video generation. It combines depth, segmentation, and keypoints as input conditions to generate high-resolution videos. Performance is validated through comprehensive experiments and ablation studies.
Result: On DexYCB: FVD of 29.13 (vs. 38.83 for InterDyn), MPJPE of 19.37 mm (vs. 30.05 mm for CosHand), generating 480x720 videos vs. 256x256/256x384 baselines. On OAKINK2: full multi-condition model improves FVD from 68.76 to 46.31. Ablation shows combining depth, segmentation, and keypoints yields best results. Synthetic data augmentation with 3,400 videos enables model trained on 50% real data to match 100% real baseline.
Conclusion: PAM successfully unifies pose, appearance, and motion for HOI video generation, achieving state-of-the-art performance on benchmarks and demonstrating practical value through effective synthetic data augmentation for downstream tasks like hand pose estimation.
Abstract: Hand-object interaction (HOI) reconstruction and synthesis are becoming central to embodied AI and AR/VR. Yet, despite rapid progress, existing HOI generation research remains fragmented across three disjoint tracks: (1) pose-only synthesis that predicts MANO trajectories without producing pixels; (2) single-image HOI generation that hallucinates appearance from masks or 2D cues but lacks dynamics; and (3) video generation methods that require both the entire pose sequence and the ground-truth first frame as inputs, preventing true sim-to-real deployment. Inspired by the philosophy of Joo et al. (2018), we think that HOI generation requires a unified engine that brings together pose, appearance, and motion within one coherent framework. Thus we introduce PAM: a Pose-Appearance-Motion Engine for controllable HOI video generation. The performance of our engine is validated by: (1) On DexYCB, we obtain an FVD of 29.13 (vs. 38.83 for InterDyn), and MPJPE of 19.37 mm (vs. 30.05 mm for CosHand), while generating higher-resolution 480x720 videos compared to 256x256 and 256x384 baselines. (2) On OAKINK2, our full multi-condition model improves FVD from 68.76 to 46.31. (3) An ablation over input conditions on DexYCB shows that combining depth, segmentation, and keypoints consistently yields the best results. (4) For a downstream hand pose estimation task using SimpleHand, augmenting training with 3,400 synthetic videos (207k frames) allows a model trained on only 50% of the real data plus our synthetic data to match the 100% real baseline.
[416] Mixture of Mini Experts: Overcoming the Linear Layer Bottleneck in Multiple Instance Learning
Daniel Shao, Joel Runevic, Richard J. Chen, Drew F. K. Williamson, Ahrong Kim, Andrew H. Song, Faisal Mahmood
Main category: cs.CV
TL;DR: MAMMOTH introduces a multi-head mixture of experts module for Multiple Instance Learning in computational pathology, replacing standard linear layers with task-specific transformations to improve slide classification performance.
Details
Motivation: Current MIL frameworks for whole-slide image classification focus on patch feature extraction and aggregation, but overlook the critical linear layer that transforms general-purpose features into task-specific features, which may be a performance bottleneck.
Method: MAMMOTH uses a parameter-efficient, multi-head mixture of experts module that applies low-rank transformations tailored to each patch’s phenotype, compatible with any existing MIL aggregation method.
Result: Across 8 MIL methods and 19 classification tasks, MAMMOTH improved performance in 130 of 152 configurations, with average +3.8% performance gain, showing task-specific transformation has larger impact than aggregation method choice.
Conclusion: The linear transformation layer in MIL frameworks is a significant performance bottleneck, and MAMMOTH’s task-specific transformations substantially improve whole-slide image classification across diverse methods and tasks.
Abstract: Multiple Instance Learning (MIL) is the predominant framework for classifying gigapixel whole-slide images in computational pathology. MIL follows a sequence of 1) extracting patch features, 2) applying a linear layer to obtain task-specific patch features, and 3) aggregating the patches into a slide feature for classification. While substantial efforts have been devoted to optimizing patch feature extraction and aggregation, none have yet addressed the second point, the critical layer which transforms general-purpose features into task-specific features. We hypothesize that this layer constitutes an overlooked performance bottleneck and that stronger representations can be achieved with a low-rank transformation tailored to each patch’s phenotype, yielding synergistic effects with any of the existing MIL approaches. To this end, we introduce MAMMOTH, a parameter-efficient, multi-head mixture of experts module designed to improve the performance of any MIL model with minimal alterations to the total number of parameters. Across eight MIL methods and 19 different classification tasks, we find that such task-specific transformation has a larger effect on performance than the choice of aggregation method. For instance, when equipped with MAMMOTH, even simple methods such as max or mean pooling attain higher average performance than any method with the standard linear layer. Overall, MAMMOTH improves performance in 130 of the 152 examined configurations, with an average +3.8% change in performance. Code is available at https://github.com/mahmoodlab/mammoth.
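The core idea, replacing the single task-specific linear layer with a gated mixture of low-rank "mini experts", can be sketched compactly. This is an illustrative NumPy sketch under assumed shapes and names, not the released implementation (which is at the linked repository).

```python
# Hypothetical sketch of a MAMMOTH-style "mixture of mini experts" layer.
# A router softly assigns each patch to E experts, and every expert is a
# low-rank linear map W_e = A_e @ B_e (rank r << d), replacing the single
# dense task-specific linear layer of a standard MIL pipeline.
import numpy as np

rng = np.random.default_rng(0)

class MiniExpertMixture:
    def __init__(self, d_in, d_out, n_experts=4, rank=8):
        self.router = rng.normal(0.0, 0.02, (d_in, n_experts))
        self.A = rng.normal(0.0, 0.02, (n_experts, d_in, rank))
        self.B = rng.normal(0.0, 0.02, (n_experts, rank, d_out))

    def __call__(self, x):
        """x: (n_patches, d_in) -> task-specific features (n_patches, d_out)."""
        logits = x @ self.router                               # (n, E)
        gates = np.exp(logits - logits.max(axis=1, keepdims=True))
        gates /= gates.sum(axis=1, keepdims=True)              # softmax over experts
        # Every expert's low-rank transform, then a gate-weighted mixture.
        outs = np.einsum("nd,edr,ero->neo", x, self.A, self.B)  # (n, E, d_out)
        return np.einsum("ne,neo->no", gates, outs)             # (n, d_out)

layer = MiniExpertMixture(d_in=16, d_out=4)
y = layer(np.ones((10, 16)))
```

Because the output keeps the per-patch feature shape, the module drops in before any MIL aggregator (attention pooling, max/mean pooling, etc.), which is how one mechanism can be tested across eight MIL methods.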
[417] Omni-WorldBench: Towards a Comprehensive Interaction-Centric Evaluation for World Models
Meiqi Wu, Zhixin Cai, Fufangchen Zhao, Xiaokun Feng, Rujing Dang, Bingze Song, Ruitian Tian, Jiashu Zhu, Jiachen Lei, Hao Dou, Jing Tang, Lei Sun, Jiahong Wu, Xiangxiang Chu, Zeming Liu, Kaiqi Huang
Main category: cs.CV
TL;DR: Omni-WorldBench is a new benchmark for evaluating interactive response capabilities of 4D world models, addressing limitations in current evaluation methods that neglect temporal dynamics and interaction-driven state transitions.
Details
Motivation: Current world model evaluation benchmarks focus narrowly on visual fidelity/text-video alignment for generative models or use static 3D reconstruction metrics that ignore temporal dynamics. The authors argue that the future of world modeling lies in 4D generation (joint spatial-temporal modeling), where interactive response capability is crucial but lacks systematic evaluation.
Method: Proposes Omni-WorldBench with two components: 1) Omni-WorldSuite - systematic prompt suite spanning diverse interaction levels and scene types; 2) Omni-Metrics - agent-based evaluation framework quantifying causal impact of interaction actions on both final outcomes and intermediate state evolution trajectories.
Result: Extensive evaluation of 18 representative world models reveals critical limitations in interactive response capabilities. The benchmark provides actionable insights for future research in interactive 4D world modeling.
Conclusion: Omni-WorldBench addresses a critical gap in evaluating interactive 4D world models and will be publicly released to foster progress in this important research direction.
Abstract: Video-based world models have emerged along two dominant paradigms: video generation and 3D reconstruction. However, existing evaluation benchmarks either focus narrowly on visual fidelity and text-video alignment for generative models, or rely on static 3D reconstruction metrics that fundamentally neglect temporal dynamics. We argue that the future of world modeling lies in 4D generation, which jointly models spatial structure and temporal evolution. In this paradigm, the core capability is interactive response: the ability to faithfully reflect how interaction actions drive state transitions across space and time. Yet no existing benchmark systematically evaluates this critical dimension. To address this gap, we propose Omni-WorldBench, a comprehensive benchmark specifically designed to evaluate the interactive response capabilities of world models in 4D settings. Omni-WorldBench comprises two key components: Omni-WorldSuite, a systematic prompt suite spanning diverse interaction levels and scene types; and Omni-Metrics, an agent-based evaluation framework that quantifies world modeling capabilities by measuring the causal impact of interaction actions on both final outcomes and intermediate state evolution trajectories. We conduct extensive evaluations of 18 representative world models across multiple paradigms. Our analysis reveals critical limitations of current world models in interactive response, providing actionable insights for future research. Omni-WorldBench will be publicly released to foster progress in interactive 4D world modeling.
[418] Benchmarking Deep Learning Models for Aerial LiDAR Point Cloud Semantic Segmentation under Real Acquisition Conditions: A Case Study in Navarre
Alex Salvatierra, José Antonio Sanz, Christian Gutiérrez, Mikel Galar
Main category: cs.CV
TL;DR: Benchmark study evaluating four deep learning models (KPConv, RandLA-Net, Superpoint Transformer, Point Transformer V3) for 3D semantic segmentation on aerial LiDAR data under operational flight conditions.
Details
Motivation: Most 3D semantic segmentation models focus on indoor/terrestrial datasets, leaving their performance under real aerial acquisition conditions insufficiently explored. Need to evaluate state-of-the-art architectures on large-scale aerial LiDAR data covering heterogeneous landscapes.
Method: Experimental benchmark comparing four representative deep learning models on a large-scale aerial LiDAR dataset from Navarre, Spain, covering urban, rural, and industrial landscapes. Evaluated across five semantic classes (ground, vegetation, buildings, vehicles, etc.) with focus on class imbalance and geometric variability challenges.
Result: All models achieved high overall accuracy (>93%). KPConv attained highest mean IoU (78.51%) with consistent performance across classes. Point Transformer V3 showed superior performance on underrepresented vehicle class (75.11% IoU). Superpoint Transformer and RandLA-Net traded segmentation robustness for computational efficiency.
Conclusion: The study provides valuable insights into 3D semantic segmentation performance on aerial LiDAR data, highlighting KPConv’s overall superiority and Point Transformer V3’s strength on challenging underrepresented classes, while revealing trade-offs between segmentation quality and computational efficiency.
Abstract: Recent advances in deep learning have significantly improved 3D semantic segmentation, but most models focus on indoor or terrestrial datasets. Their behavior under real aerial acquisition conditions remains insufficiently explored, and although a few studies have addressed similar scenarios, they differ in dataset design, acquisition conditions, and model selection. To address this gap, we conduct an experimental benchmark evaluating several state-of-the-art architectures on a large-scale aerial LiDAR dataset acquired under operational flight conditions in Navarre, Spain, covering heterogeneous urban, rural, and industrial landscapes. This study compares four representative deep learning models, including KPConv, RandLA-Net, Superpoint Transformer, and Point Transformer V3, across five semantic classes commonly found in airborne surveys, such as ground, vegetation, buildings, and vehicles, highlighting the inherent challenges of class imbalance and geometric variability in aerial data. Results show that all tested models achieve high overall accuracy exceeding 93%, with KPConv attaining the highest mean IoU (78.51%) through consistent performance across classes, particularly on challenging and underrepresented categories. Point Transformer V3 demonstrates superior performance on the underrepresented vehicle class (75.11% IoU), while Superpoint Transformer and RandLA-Net trade off segmentation robustness for computational efficiency.
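The headline numbers above (78.51% for KPConv, 75.11% on vehicles for Point Transformer V3) are mean Intersection over Union scores. For reference, mIoU is a standard metric; a minimal sketch from a confusion matrix follows (not code from this study).

```python
# Mean IoU from a class confusion matrix
# (rows: ground-truth class, columns: predicted class).
import numpy as np

def mean_iou(confusion):
    conf = np.asarray(confusion, dtype=float)
    tp = np.diag(conf)
    # Per-class IoU = TP / (TP + FP + FN); skip classes absent from both
    # ground truth and prediction to avoid division by zero.
    denom = conf.sum(axis=0) + conf.sum(axis=1) - tp
    valid = denom > 0
    return float(np.mean(tp[valid] / denom[valid]))
```

Because mIoU averages over classes rather than points, it is far more sensitive than overall accuracy to rare classes such as vehicles, which is why all models exceed 93% accuracy while mIoU separates them clearly.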
[419] Riverine Land Cover Mapping through Semantic Segmentation of Multispectral Point Clouds
Sopitta Thurachen, Josef Taher, Matti Lehtomäki, Leena Matikainen, Linnea Blåfield, Mikel Calle Navarro, Antero Kukko, Tomi Westerlund, Harri Kaartinen
Main category: cs.CV
TL;DR: Point Transformer v2 applied to multispectral LiDAR point clouds for semantic segmentation of riverine land cover classes, achieving high accuracy and demonstrating improved generalization through multi-dataset training.
Details
Motivation: Accurate land cover mapping in riverine environments is crucial for river management, ecological understanding, and geomorphic change monitoring. The study aims to leverage advanced deep learning for semantic segmentation of multispectral LiDAR data in these complex environments.
Method: Used Point Transformer v2 (PTv2) architecture for semantic segmentation of 3-channel LiDAR point clouds (geometry + spectral features). Trained on Oulanka river data with geometry and spectral features, conducted ablation studies on feature importance, and investigated multi-dataset training with additional sparsely annotated river data to improve generalization.
Result: Full-feature configuration achieved mIoU of 0.950, significantly outperforming geometry-only baseline. Intensity and reflectance features were identified as key for accurate land cover mapping. Multi-dataset training showed improved generalization performance despite limited high-quality annotations.
Conclusion: Transformer-based architectures show strong potential for multispectral point cloud analysis in riverine environments, offering new capabilities for sediment transport monitoring and river management applications. Multi-dataset training can improve model robustness with limited annotated data.
Abstract: Accurate land cover mapping in riverine environments is essential for effective river management, ecological understanding, and geomorphic change monitoring. This study explores the use of Point Transformer v2 (PTv2), an advanced deep neural network architecture designed for point cloud data, for land cover mapping through semantic segmentation of multispectral LiDAR data in real-world riverine environments. We utilize the geometric and spectral information from the 3-channel LiDAR point cloud to map land cover classes, including sand, gravel, low vegetation, high vegetation, forest floor, and water. The PTv2 model was trained and evaluated on point cloud data from the Oulanka river in northern Finland using both geometry and spectral features. To improve the model’s generalization in new riverine environments, we additionally investigate multi-dataset training that adds sparsely annotated data from an additional river dataset. Results demonstrated that using the full-feature configuration resulted in performance with a mean Intersection over Union (mIoU) of 0.950, significantly outperforming the geometry baseline. Other ablation studies revealed that intensity and reflectance features were the key for accurate land cover mapping. The multi-dataset training experiment showed improved generalization performance, suggesting potential for developing more robust models despite limited high-quality annotated data. Our work demonstrates the potential of applying transformer-based architectures to multispectral point clouds in riverine environments. The approach offers new capabilities for monitoring sediment transport and other river management applications.
[420] EgoGroups: A Benchmark For Detecting Social Groups of People in the Wild
Jeffri Murrugarra-Llerena, Pranav Chitale, Zicheng Liu, Kai Ao, Yujin Ham, Guha Balakrishnan, Paola Cascante-Bonilla
Main category: cs.CV
TL;DR: EgoGroups: A first-person view dataset for social group detection spanning 65 countries with diverse cultural contexts, weather/time conditions, and crowd densities, used to evaluate VLM/LLM performance on social interaction understanding.
Details
Motivation: Existing social group detection benchmarks are limited by low scene diversity, reliance on third-person camera sources, and lack of real-world evaluation in diverse cultural contexts and unconstrained settings.
Method: Introduced EgoGroups dataset with first-person views from 65 countries covering various crowd densities and weather/time conditions, with dense human annotations for persons and social groups plus geographic/scene metadata. Evaluated state-of-the-art VLM/LLMs and supervised models on group detection capabilities.
Result: VLMs and LLMs can outperform supervised baselines in zero-shot settings, while crowd density and cultural regions significantly influence model performance.
Conclusion: EgoGroups addresses limitations of existing benchmarks and reveals important insights about VLM/LLM capabilities for social group detection in diverse real-world contexts.
Abstract: Social group detection, or the identification of humans involved in reciprocal interpersonal interactions (e.g., family members, friends, and customers and merchants), is a crucial component of social intelligence needed for agents transacting in the world. The few existing benchmarks for social group detection are limited by low scene diversity and reliance on third-person camera sources (e.g., surveillance footage). Consequently, these benchmarks generally lack real-world evaluation on how groups form and evolve in diverse cultural contexts and unconstrained settings. To address this gap, we introduce EgoGroups, a first-person view dataset that captures social dynamics in cities around the world. EgoGroups spans 65 countries covering low, medium, and high-crowd settings under four weather/time-of-day conditions. We include dense human annotations for person and social groups, along with rich geographic and scene metadata. Using this dataset, we performed an extensive evaluation of state-of-the-art VLM/LLMs and supervised models on their group detection capabilities. Our evaluation yielded several interesting findings, including that VLMs and LLMs can outperform supervised baselines in a zero-shot setting, and that crowd density and cultural regions clearly influence model performance.
[421] GenOpticalFlow: A Generative Approach to Unsupervised Optical Flow Learning
Yixuan Luo, Feng Qiao, Zhexiao Xiong, Yanjing Li, Nathan Jacobs
Main category: cs.CV
TL;DR: GenOpticalFlow synthesizes large-scale, perfectly aligned frame-flow pairs for supervised optical flow training without human annotations by leveraging pre-trained depth estimation and next-frame generation models.
Details
Motivation: Supervised optical flow methods require expensive ground-truth annotations, limiting scalability. Unsupervised/semi-supervised methods suffer from unreliable supervision signals based on brightness constancy and smoothness assumptions, leading to inaccurate motion estimation in complex real-world scenarios.
Method: Leverages pre-trained depth estimation network to generate pseudo optical flows, which serve as conditioning inputs for a next-frame generation model trained to produce high-fidelity, pixel-aligned subsequent frames. Also proposes inconsistent pixel filtering strategy to identify and remove unreliable pixels in generated frames.
Result: Extensive experiments on KITTI2012, KITTI2015, and Sintel demonstrate competitive or superior results compared to existing unsupervised and semi-supervised approaches.
Conclusion: GenOpticalFlow presents a scalable and annotation-free solution for optical flow learning that achieves state-of-the-art performance without human supervision.
Abstract: Optical flow estimation is a fundamental problem in computer vision, yet the reliance on expensive ground-truth annotations limits the scalability of supervised approaches. Although unsupervised and semi-supervised methods alleviate this issue, they often suffer from unreliable supervision signals based on brightness constancy and smoothness assumptions, leading to inaccurate motion estimation in complex real-world scenarios. To overcome these limitations, we introduce GenOpticalFlow, a novel framework that synthesizes large-scale, perfectly aligned frame-flow data pairs for supervised optical flow training without human annotations. Specifically, our method leverages a pre-trained depth estimation network to generate pseudo optical flows, which serve as conditioning inputs for a next-frame generation model trained to produce high-fidelity, pixel-aligned subsequent frames. This process enables the creation of abundant, high-quality synthetic data with precise motion correspondence. Furthermore, we propose an inconsistent pixel filtering strategy that identifies and removes unreliable pixels in generated frames, effectively enhancing fine-tuning performance on real-world datasets. Extensive experiments on KITTI2012, KITTI2015, and Sintel demonstrate that GenOpticalFlow achieves competitive or superior results compared to existing unsupervised and semi-supervised approaches, highlighting its potential as a scalable and annotation-free solution for optical flow learning. We will release our code upon acceptance.
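The abstract does not spell out how pseudo flows are derived from depth, so the following NumPy sketch shows one standard recipe only, as an illustration: back-project pixels with a depth map and intrinsics, apply a rigid camera motion, and re-project to obtain the induced flow. All names and shapes are assumptions, not the paper's construction.

```python
# Illustrative pseudo-flow from depth via rigid reprojection (not the
# paper's exact method): lift each pixel to 3D with its depth and the
# camera intrinsics K, move it by a rigid transform (R, t), and re-project.
import numpy as np

def flow_from_depth(depth, K, R, t):
    """depth: (h, w); K: (3, 3) intrinsics; R, t: rigid motion to frame 2.
    Returns the per-pixel flow field (h, w, 2) in pixel units."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w].astype(float)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)   # homogeneous pixels
    rays = pix @ np.linalg.inv(K).T                    # normalized camera rays
    pts = rays * depth[..., None]                      # 3D points in frame 1
    pts2 = pts @ R.T + t                               # 3D points in frame 2
    proj = pts2 @ K.T                                  # re-project to pixels
    u2 = proj[..., 0] / proj[..., 2]
    v2 = proj[..., 1] / proj[..., 2]
    return np.stack([u2 - u, v2 - v], axis=-1)

K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 32.0], [0.0, 0.0, 1.0]])
depth = np.full((8, 8), 2.0)
flow = flow_from_depth(depth, K, np.eye(3), np.zeros(3))  # no motion -> zero flow
```

A flow built this way is geometrically consistent with the conditioning depth, which is what lets the generated next frame stay pixel-aligned with it.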
[422] DUO-VSR: Dual-Stream Distillation for One-Step Video Super-Resolution
Zhengyao Lv, Menghan Xia, Xintao Wang, Kwan-Yee K. Wong
Main category: cs.CV
TL;DR: DUO-VSR is a three-stage framework for accelerating diffusion-based video super-resolution to one-step generation using dual-stream distillation combining distribution matching and adversarial supervision.
Details
Motivation: Diffusion-based VSR achieves high fidelity but suffers from high computational costs. Existing acceleration methods like DMD face training instability and insufficient supervision when applied to VSR.
Method: Three-stage framework: 1) Progressive Guided Distillation Initialization for stabilization, 2) Dual-Stream Distillation combining DMD and RFS-GAN streams for complementary supervision, 3) Preference-Guided Refinement for perceptual quality alignment.
Result: DUO-VSR achieves superior visual quality and efficiency over previous one-step VSR approaches, demonstrating effective acceleration of diffusion models for video tasks.
Conclusion: The proposed dual-stream distillation framework successfully addresses training instability and supervision insufficiency in accelerating diffusion-based VSR, enabling high-quality one-step generation.
Abstract: Diffusion-based video super-resolution (VSR) has recently achieved remarkable fidelity but still suffers from prohibitive sampling costs. While distribution matching distillation (DMD) can accelerate diffusion models toward one-step generation, directly applying it to VSR often results in training instability alongside degraded and insufficient supervision. To address these issues, we propose DUO-VSR, a three-stage framework built upon a Dual-Stream Distillation strategy that unifies distribution matching and adversarial supervision for one-step VSR. Firstly, a Progressive Guided Distillation Initialization is employed to stabilize subsequent training through trajectory-preserving distillation. Next, the Dual-Stream Distillation jointly optimizes the DMD and Real-Fake Score Feature GAN (RFS-GAN) streams, with the latter providing complementary adversarial supervision leveraging discriminative features from both real and fake score models. Finally, a Preference-Guided Refinement stage further aligns the student with perceptual quality preferences. Extensive experiments demonstrate that DUO-VSR achieves superior visual quality and efficiency over previous one-step VSR approaches.
[423] Explore with Long-term Memory: A Benchmark and Multimodal LLM-based Reinforcement Learning Framework for Embodied Exploration
Sen Wang, Bangwei Liu, Zhenkun Gao, Lizhuang Ma, Xuhong Wang, Yuan Xie, Xin Tan
Main category: cs.CV
[424] Repurposing Geometric Foundation Models for Multi-view Diffusion
Wooseok Jang, Seonghu Jeon, Jisang Han, Jinhyeok Choi, Minkyung Kwon, Seungryong Kim, Saining Xie, Sainan Liu
Main category: cs.CV
TL;DR: GLD uses geometric foundation model features as latent space for multi-view diffusion, achieving better novel view synthesis with geometric consistency and faster training than VAE/RAE approaches.
Details
Motivation: Current novel view synthesis methods use view-independent VAE latent spaces that lack geometric consistency across viewpoints, which is crucial for multi-view generation.
Method: Proposes Geometric Latent Diffusion (GLD) that repurposes geometrically consistent feature spaces from geometric foundation models as the latent space for multi-view diffusion models.
Result: GLD outperforms VAE and RAE on 2D image quality and 3D consistency metrics, accelerates training by 4.4x compared to VAE, and remains competitive with state-of-the-art methods despite training from scratch without text-to-image pretraining.
Conclusion: Geometric foundation model features provide an effective latent space for novel view synthesis, enabling geometrically consistent multi-view generation with improved quality and efficiency.
Abstract: While recent advances in generative latent spaces have driven substantial progress in single-image generation, the optimal latent space for novel view synthesis (NVS) remains largely unexplored. In particular, NVS requires geometrically consistent generation across viewpoints, but existing approaches typically operate in a view-independent VAE latent space. In this paper, we propose Geometric Latent Diffusion (GLD), a framework that repurposes the geometrically consistent feature space of geometric foundation models as the latent space for multi-view diffusion. We show that these features not only support high-fidelity RGB reconstruction but also encode strong cross-view geometric correspondences, providing a well-suited latent space for NVS. Our experiments demonstrate that GLD outperforms both VAE and RAE on 2D image quality and 3D consistency metrics, while accelerating training by more than 4.4x compared to the VAE latent space. Notably, GLD remains competitive with state-of-the-art methods that leverage large-scale text-to-image pretraining, despite training its diffusion model from scratch without such generative pretraining.
[425] The Dual Mechanisms of Spatial Reasoning in Vision-Language Models
Kelly Cui, Nikhil Prakash, Ayush Raina, David Bau, Antonio Torralba, Tamar Rott Shaham
Main category: cs.CV
TL;DR: VLMs use two concurrent mechanisms for spatial reasoning: language model layers represent content-independent spatial relations, but the dominant spatial information comes from vision encoder representations that encode object layouts globally across all image tokens.
Details
Motivation: To understand how vision-language models (VLMs) compute associations between objects and their spatial relations, which is crucial for multimodal tasks like image captioning and visual question answering.
Method: Analyzed where and how spatial associations are computed within VLMs by examining two concurrent mechanisms: (1) language model backbone layers representing content-independent spatial relations on visual tokens, and (2) vision encoder representations encoding object layouts globally across visual tokens.
Result: Found that while language model layers play a secondary role, the dominant spatial information originates from vision encoders whose representations encode object layouts and extend beyond object regions into surrounding background areas. Enhancing these vision-derived spatial representations globally improves spatial reasoning performance.
Conclusion: Spatial association in VLMs is computed through two mechanisms, with vision encoders playing the central role in enabling spatial reasoning by providing globally distributed spatial signals across all image tokens.
Abstract: Many multimodal tasks, such as image captioning and visual question answering, require vision-language models (VLMs) to associate objects with their properties and spatial relations. Yet it remains unclear where and how such associations are computed within VLMs. In this work, we show that VLMs rely on two concurrent mechanisms to represent such associations. In the language model backbone, intermediate layers represent content-independent spatial relations on top of visual tokens corresponding to objects. However, this mechanism plays only a secondary role in shaping model predictions. Instead, the dominant source of spatial information originates in the vision encoder, whose representations encode the layout of objects and are directly exploited by the language model backbone. Notably, this spatial signal is distributed globally across visual tokens, extending beyond object regions into surrounding background areas. We show that enhancing these vision-derived spatial representations globally across all image tokens improves spatial reasoning performance on naturalistic images. Together, our results clarify how spatial association is computed within VLMs and highlight the central role of vision encoders in enabling spatial reasoning.
[426] DualCoT-VLA: Visual-Linguistic Chain of Thought via Parallel Reasoning for Vision-Language-Action Models
Zhide Zhong, Junfeng Li, Junjie He, Haodong Yan, Xin Gong, Guanyi Zhao, Yingjie Cai, Jiantao Gao, Xu Yan, Bingbing Liu, Yingcong Chen, Liuqing Yang, Haoang Li
Main category: cs.CV
TL;DR: DualCoT-VLA introduces a parallel visual-linguistic Chain-of-Thought method for Vision-Language-Action models that addresses limitations in multi-step task planning and spatial perception while reducing inference latency.
Details
Motivation: Standard VLA models struggle with complex multi-step tasks requiring logical planning and precise spatial perception. Current CoT-based VLA models have two limitations: 1) an inability to simultaneously capture low-level visual details and high-level planning due to isolated single-modal CoT, and 2) high inference latency and compounding errors from autoregressive decoding.
Method: Proposes DualCoT-VLA with a parallel reasoning mechanism: a visual CoT for low-level spatial understanding is integrated with a linguistic CoT for high-level task planning. A parallel CoT mechanism with two sets of learnable query tokens shifts autoregressive step-by-step reasoning to single-step forward reasoning, reducing latency.
Result: Achieves state-of-the-art performance on LIBERO and RoboCasa GR1 benchmarks, and demonstrates effectiveness on real-world platforms.
Conclusion: DualCoT-VLA successfully addresses limitations of current CoT-based VLA models by enabling comprehensive multi-modal reasoning while significantly reducing inference latency through parallel processing.
Abstract: Vision-Language-Action (VLA) models map visual observations and language instructions directly to robotic actions. While effective for simple tasks, standard VLA models often struggle with complex, multi-step tasks requiring logical planning, as well as precise manipulations demanding fine-grained spatial perception. Recent efforts have incorporated Chain-of-Thought (CoT) reasoning to endow VLA models with a "thinking before acting" capability. However, current CoT-based VLA models face two critical limitations: 1) an inability to simultaneously capture low-level visual details and high-level logical planning due to their reliance on isolated, single-modal CoT; 2) high inference latency with compounding errors caused by step-by-step autoregressive decoding. To address these limitations, we propose DualCoT-VLA, a visual-linguistic CoT method for VLA models with a parallel reasoning mechanism. To achieve comprehensive multi-modal reasoning, our method integrates a visual CoT for low-level spatial understanding and a linguistic CoT for high-level task planning. Furthermore, to overcome the latency bottleneck, we introduce a parallel CoT mechanism that incorporates two sets of learnable query tokens, shifting autoregressive reasoning to single-step forward reasoning. Extensive experiments demonstrate that our DualCoT-VLA achieves state-of-the-art performance on the LIBERO and RoboCasa GR1 benchmarks, as well as on real-world platforms.
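The key efficiency idea, two sets of learnable query tokens filled in one forward pass rather than decoded autoregressively, can be sketched minimally. This is an illustrative single-head self-attention toy, not the paper's architecture; all dimensions and weight shapes are assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def parallel_cot_step(obs, vis_q, ling_q, W_q, W_k, W_v):
    """Single-step forward reasoning: both CoT query sets attend to the
    observation tokens (and to each other) in ONE self-attention pass,
    instead of emitting reasoning tokens one at a time."""
    x = np.concatenate([obs, vis_q, ling_q], axis=0)   # unified token sequence
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    out = attn @ v
    n_obs, n_vis = len(obs), len(vis_q)
    # split the output back into the two CoT streams
    return out[n_obs:n_obs + n_vis], out[n_obs + n_vis:]

rng = np.random.default_rng(0)
d = 16
obs = rng.normal(size=(10, d))      # fused vision-language observation tokens
vis_q = rng.normal(size=(4, d))     # learnable visual-CoT query tokens
ling_q = rng.normal(size=(6, d))    # learnable linguistic-CoT query tokens
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
vis_out, ling_out = parallel_cot_step(obs, vis_q, ling_q, Wq, Wk, Wv)
```

The latency benefit comes from replacing N sequential decoding steps with a single pass whose cost is one attention computation over the combined sequence.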
[427] VideoDetective: Clue Hunting via both Extrinsic Query and Intrinsic Relevance for Long Video Understanding
Ruoliu Yang, Chu Wu, Caifeng Shan, Ran He, Chaoyou Fu
Main category: cs.CV
TL;DR: VideoDetective: A framework for long-video QA that uses visual-temporal affinity graphs and hypothesis-verification-refinement loops to identify query-relevant segments, overcoming MLLM context window limitations.
Details
Motivation: Current MLLMs struggle with long-video understanding due to limited context windows, which requires identifying sparse query-relevant segments. Existing methods localize clues based only on the query, ignoring the video's structure and the varying relevance of its segments.
Method: Divides videos into segments and builds a visual-temporal affinity graph from visual similarity and temporal proximity. A Hypothesis-Verification-Refinement loop estimates relevance scores of observed segments and propagates them to unseen segments, producing a global relevance distribution that guides localization of the most critical segments.
Result: Achieves substantial gains across mainstream MLLMs on representative benchmarks, with accuracy improvements up to 7.5% on VideoMME-long. Consistently outperforms existing methods.
Conclusion: VideoDetective effectively addresses long-video understanding challenges by integrating query-to-segment relevance and inter-segment affinity, enabling efficient clue hunting and improved performance on long-video QA tasks.
Abstract: Long video understanding remains challenging for multimodal large language models (MLLMs) due to limited context windows, which necessitate identifying sparse query-relevant video segments. However, existing methods predominantly localize clues based solely on the query, overlooking the video’s intrinsic structure and varying relevance across segments. To address this, we propose VideoDetective, a framework that integrates query-to-segment relevance and inter-segment affinity for effective clue hunting in long-video question answering. Specifically, we divide a video into various segments and represent them as a visual-temporal affinity graph built from visual similarity and temporal proximity. We then perform a Hypothesis-Verification-Refinement loop to estimate relevance scores of observed segments to the query and propagate them to unseen segments, yielding a global relevance distribution that guides the localization of the most critical segments for final answering with sparse observation. Experiments show our method consistently achieves substantial gains across a wide range of mainstream MLLMs on representative benchmarks, with accuracy improvements of up to 7.5% on VideoMME-long. Our code is available at https://videodetective.github.io/
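The affinity-graph construction and relevance propagation the abstract describes can be sketched in a few lines. This is a minimal numpy toy under assumed choices (cosine similarity, a Gaussian temporal kernel, clamped diffusion); the paper's actual affinity weights, propagation rule, and hyperparameters are not specified here:

```python
import numpy as np

def build_affinity(feats, sigma_t=2.0):
    """Affinity between segments: cosine visual similarity gated by a
    Gaussian kernel over temporal distance between segment indices."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    vis = np.clip(f @ f.T, 0.0, None)
    idx = np.arange(len(feats))
    temp = np.exp(-((idx[:, None] - idx[None, :]) ** 2) / (2 * sigma_t**2))
    A = vis * temp
    np.fill_diagonal(A, 0.0)
    return A

def propagate_relevance(A, scores, observed, alpha=0.5, iters=20):
    """Spread relevance from observed (scored) segments to unseen ones
    over the row-normalized affinity graph, clamping observed scores."""
    P = A / np.maximum(A.sum(axis=1, keepdims=True), 1e-8)
    r = scores.astype(float).copy()
    for _ in range(iters):
        r = alpha * (P @ r)
        r[observed] = scores[observed]   # observed relevance stays fixed
    return r

# toy run: 6 segments, only segment 0 was observed and judged relevant
feats = np.ones((6, 4))   # identical features, so the temporal kernel dominates
A = build_affinity(feats)
scores = np.array([1.0, 0, 0, 0, 0, 0])
rel = propagate_relevance(A, scores, observed=[0])
```

The resulting `rel` is a global relevance distribution: segments near the observed relevant one score higher than distant ones, which is what would guide selection of the next segments to inspect.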
[428] MatSegNet: a New Boundary-aware Deep Learning Model for Accurate Carbide Precipitate Analysis in High-Strength Steels
Xiaohan Bie, Manoj Arthanari, Evelin Barbosa de Melo, Baihua Ren, Juancheng Li, Nicolas Brodusch, Stephen Yue, Salim Brahimi, Raynald Gauvin, Jun Song
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2312.17251: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2312.17251&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[429] Scene Prior Filtering for Depth Super-Resolution
Zhengxue Wang, Zhiqiang Yan, Ming-Hsuan Yang, Jinshan Pan, Guangwei Gao, Ying Tai, Jian Yang
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2402.13876: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2402.13876&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[430] Point-In-Context: Understanding Point Cloud via In-Context Learning
Mengyuan Liu, Zhongbin Fang, Xia Li, Joachim M. Buhmann, Deheng Ye, Xiangtai Li, Chen Change Loy
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2404.12352: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2404.12352&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[431] Accurate Quantization for Gait Representation Learning
S. Tian, H. Gao, G. Hong, S. Wang, J. Wang, X. Yu, S. Zhang
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2405.13859: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2405.13859&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[432] DifAttack++: Query-Efficient Black-Box Adversarial Attack via Hierarchical Disentangled Feature Space in Cross-Domain
Jun Liu, Jiantao Zhou, Jiandian Zeng, Jinyu Tian, Zheng Li
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2406.03017: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2406.03017&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[433] Training-Free Layout-to-Image Generation with Marginal Attention Constraints
Huancheng Chen, Jingtao Li, Weiming Zhuang, Haris Vikalo, Lingjuan Lyu
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2411.10495: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2411.10495&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[434] Learn from Foundation Model: Fruit Detection Model without Manual Annotation
Yanan Wang, Zhenghao Fei, Ruichen Li, Yibin Ying
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2411.16196: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2411.16196&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[435] TPCL: Task Progressive Curriculum Learning for Robust Visual Question Answering
Ahmed Akl, Abdelwahed Khamis, Zhe Wang, Ali Cheraghian, Sara Khalifa, Kewen Wang
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2411.17292: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2411.17292&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[436] Lightweight Gaze Estimation Model Via Fusion Global Information
Zhang Cheng, Yanxia Wang
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2411.18064: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2411.18064&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[437] Monitoring access to piped water and sanitation infrastructure in Africa at disaggregated scales using satellite imagery and self-supervised learning
Othmane Echchabi, Aya Lahlou, Nizar Talty, Josh Malcolm Manto, Tongshu Zheng, Ka Leung Lam
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2411.19093: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2411.19093&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[438] 3DSceneEditor: Controllable 3D Scene Editing with Gaussian Splatting
Ziyang Yan, Yihua Shao, Minwen Liao, Siyu Chen, Nan Wang, Muyuan Lin, Jenq-Neng Hwang, Hao Zhao, Fabio Remondino, Lei Li
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2412.01583: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2412.01583&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[439] DesCLIP: Robust Continual Learning via General Attribute Descriptions for VLM-Based Visual Recognition
Chiyuan He, Zihuan Qiu, Fanman Meng, Linfeng Xu, Qingbo Wu, Hongliang Li
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2502.00618: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2502.00618&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[440] Enhanced Structured Lasso Pruning with Class-wise Information
Xiang Liu, Mingchen Li, Xia Li, Leigang Qu, Guansu Wang, Zifan Peng, Yijun Song, Zemin Liu, Linshan Jiang, Jialin Li
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2502.09125: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2502.09125&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[441] UASTrack: A Unified Adaptive Selection Framework with Modality-Customization in Single Object Tracking
He Wang, Tianyang Xu, Zhangyong Tang, Xiao-Jun Wu, Josef Kittler
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2502.18220: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2502.18220&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[442] Automatic Construction of Pattern Classifiers Capable of Continuous Incremental Learning and Unlearning Tasks Based on Compact-Sized Probabilistic Neural Network
Tetsuya Hoya, Shunpei Morita
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2501.00725: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2501.00725&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[443] Bridging the Perception Gap in Image Super-Resolution Evaluation
Shaolin Su, Josep M. Rocafort, Danna Xue, David Serrano-Lozano, Lei Sun, Javier Vazquez-Corral
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2503.13074: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2503.13074&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[444] Let Synthetic Data Shine: Domain Reassembly and Soft-Fusion for Single Domain Generalization
Hao Li, Yubin Xiao, Ke Liang, Mengzhu Wang, Long Lan, Kenli Li, Xinwang Liu
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2503.13617: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2503.13617&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[445] LEMON: A Large Endoscopic MONocular Dataset and Foundation Model for Perception in Surgical Settings
Chengan Che, Chao Wang, Tom Vercauteren, Sophia Tsoka, Luis C. Garcia-Peraza-Herrera
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2503.19740: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2503.19740&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[446] All Patches Matter, More Patches Better: Enhance AI-Generated Image Detection via Panoptic Patch Learning
Zheng Yang, Ruoxin Chen, Zhiyuan Yan, Ke-Yue Zhang, Xinghe Fu, Shuang Wu, Xiujun Shu, Taiping Yao, Shouhong Ding, Zequn Qin, Xi Li
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2504.01396: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2504.01396&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[447] Tiny Neural Networks for Multi-Object Tracking in a Modular Kalman Framework
Christian Alexander Holz, Christian Bader, Markus Enzweiler, Matthias Drüppel
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2504.02519: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2504.02519&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[448] Subject Information Extraction for Novelty Detection with Domain Shifts
Yangyang Qu, Dazhi Fu, Jicong Fan
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2504.21247: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2504.21247&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[449] CompBench: Benchmarking Complex Instruction-guided Image Editing
Bohan Jia, Wenxuan Huang, Yuntian Tang, Junbo Qiao, Jincheng Liao, Shaosheng Cao, Fei Zhao, Zhaopeng Feng, Zhouhong Gu, Zhenfei Yin, Lei Bai, Wanli Ouyang, Lin Chen, Fei Zhao, Yao Hu, Zihan Wang, Yuan Xie, Shaohui Lin
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2505.12200: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.12200&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[450] SPKLIP: Aligning Spike Video Streams with Natural Language
Yongchang Gao, Meiling Jin, Zhaofei Yu, Tiejun Huang, Guozhang Chen
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2505.12656: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.12656&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[451] Segmenting Visuals With Querying Words: Language Anchors For Semi-Supervised Image Segmentation
Numair Nadeem, Saeed Anwar, Muhammad Hamza Asad, Abdul Bais
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2506.13925: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.13925&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[452] Foresight Diffusion: Improving Sampling Consistency in Predictive Diffusion Models
Yu Zhang, Xingzhuo Guo, Haoran Xu, Jialong Wu, Mingsheng Long
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2505.16474: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.16474&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[453] Thalia: A Global, Multi-Modal Dataset for Volcanic Activity Monitoring
Nikolas Papadopoulos, Nikolaos Ioannis Bountos, Maria Sdraka, Andreas Karavias, Gustau Camps-Valls, Ioannis Papoutsis
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2505.17782: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.17782&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[454] Frequency-Adaptive Discrete Cosine-ViT-ResNet Architecture for Sparse-Data Vision
Ziyue Kang, Weichuan Zhang
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2505.22701: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.22701&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[455] ReSpace: Text-Driven Autoregressive 3D Indoor Scene Synthesis and Editing
Martin JJ. Bucher, Iro Armeni
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2506.02459: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.02459&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[456] Go Beyond Earth: Understanding Human Actions and Scenes in Microgravity Environments
Di Wen, Lei Qi, Kunyu Peng, Kailun Yang, Fei Teng, Ao Luo, Jia Fu, Yufan Chen, Ruiping Liu, Yitian Shi, M. Saquib Sarfraz, Rainer Stiefelhagen
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2506.02845: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.02845&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[457] Reasoning-Aligned Perception Decoupling for Scalable Multi-modal Reasoning
Yunhao Gou, Kai Chen, Zhili Liu, Lanqing Hong, Xin Jin, Zhenguo Li, James T. Kwok, Yu Zhang
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2506.04559: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.04559&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[458] LEO-VL: Efficient Scene Representation for Scalable 3D Vision-Language Learning
Jiangyong Huang, Xiaojian Ma, Xiongkun Linghu, Junchao He, Qing Li, Song-Chun Zhu, Yixin Chen, Baoxiong Jia, Siyuan Huang
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2506.09935: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.09935&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[459] Taming Video Models for 3D and 4D Generation via Zero-Shot Camera Control
Chenxi Song, Yanming Yang, Tong Zhao, Ruibo Li, Chi Zhang
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2509.15130: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.15130&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[460] Ctrl-Z Sampling: Scaling Diffusion Sampling with Controlled Random Zigzag Explorations
Shunqi Mao, Wei Guo, Chaoyi Zhang, Jieting Long, Ke Xie, Weidong Cai
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2506.20294: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.20294&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[461] PoseMaster: A Unified 3D Native Framework for Stylized Pose Generation
Hongyu Yan, Kunming Luo, Weiyu Li, Kaiyi Zhang, Yixun Liang, Jingwei Huang, Chunchao Guo, Ping Tan
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2506.21076: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.21076&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[462] A Kernel Space-based Multidimensional Sparse Model for Dynamic PET Image Denoising
Kuang Xiaodong, Li Bingxuan, Li Yuan, Rao Fan, Ma Gege, Xie Qingguo, Mok Greta S P, Liu Huafeng, Zhu Wentao
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2509.18801: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.18801&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[463] Unleashing the Potential of All Test Samples: Mean-Shift Guided Test-Time Adaptation
Jizhou Han, Chenhao Ding, SongLin Dong, Yuhang He, Xinyuan Gao, Yihong Gong
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2507.00462: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.00462&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[464] HUG-VAS: A Hierarchical NURBS-Based Generative Model for Aortic Geometry Synthesis and Controllable Editing
Pan Du, Mingqi Xu, Xiaozhi Zhu, Jian-xun Wang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2507.11474: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.11474&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[465] OCRGenBench: A Comprehensive Benchmark for Evaluating OCR Generative Capabilities
Peirong Zhang, Haowei Xu, Jiaxin Zhang, Xuhan Zheng, Guitao Xu, Yuyi Zhang, Junle Liu, Zhenhua Yang, Wei Zhou, Lianwen Jin
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2507.15085: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.15085&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[466] Learning to Generate Rigid Body Interactions with Video Diffusion Models
David Romero, Ariana Bermudez, Viacheslav Iablochnikov, Hao Li, Fabio Pizzati, Ivan Laptev
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2510.02284: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.02284&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[467] PAUL: Uncertainty-Guided Partition and Augmentation for Robust Cross-View Geo-Localization under Noisy Correspondence
Zheng Li, Xueyi Zhang, Yanming Guo, Yuxiang Xie, Ding Zhaoyun, Siqi Cai, Haizhou Li, Mingrui Lao
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2508.20066: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.20066&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[468] DisPatch: Disarming Adversarial Patches in Object Detection with Diffusion Models
Jin Ma, Mohammed Aldeen, Christopher Salas, Feng Luo, Mashrur Chowdhury, Mert Pesé, Long Cheng
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2509.04597: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.04597&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[469] StaR-KVQA: Structured Reasoning Traces for Implicit-Knowledge Visual Question Answering
Zhihao Wen, Wenkang Wei, Yuan Fang, Xingtong Yu, Hui Zhang, Weicheng Zhu, Xin Zhang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2510.06638: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.06638&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[470] Neural Collapse-Inspired Multi-Label Federated Learning under Label-Distribution Skew
Can Peng, Yuyuan Liu, Yingyu Yang, Pramit Saha, Qianye Yang, J. Alison Noble
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2509.12544: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.12544&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[471] Your VAR Model is Secretly an Efficient and Explainable Generative Classifier
Yi-Chung Chen, David I. Inouye, Jing Gao
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2510.12060: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.12060&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[472] Texture Vector-Quantization and Reconstruction Aware Prediction for Generative Super-Resolution
Qifan Li, Jiale Zou, Jinhua Zhang, Wei Long, Xingyu Zhou, Shuhang Gu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2509.23774: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.23774&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[473] What “Not” to Detect: Negation-Aware VLMs via Structured Reasoning and Token Merging
Inha Kang, Youngsun Lim, Seonho Lee, Jiho Choi, Junsuk Choe, Hyunjung Shim
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2510.13232: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.13232&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[474] UP2You: Fast Reconstruction of Yourself from Unconstrained Photo Collections
Zeyu Cai, Ziyang Li, Xiaoben Li, Boqian Li, Zeyu Wang, Zhenyu Zhang, Yuliang Xiu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2509.24817: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.24817&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[475] KeySG: Hierarchical Keyframe-Based 3D Scene Graphs
Abdelrhman Werby, Dennis Rotondi, Fabio Scaparro, Kai O. Arras
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2510.01049: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.01049&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[476] FideDiff: Efficient Diffusion Model for High-Fidelity Image Motion Deblurring
Xiaoyang Liu, Zhengyan Zhou, Zihang Xu, Jiezhang Cao, Zheng Chen, Yulun Zhang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2510.01641: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.01641&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[477] LinearSR: Unlocking Linear Attention for Stable and Efficient Image Super-Resolution
Xiaohui Li, Shaobin Zhuang, Shuo Cao, Yang Yang, Yuandong Pu, Qi Qin, Siqi Luo, Bin Fu, Yihao Liu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2510.08771: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.08771&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[478] GIR-Bench: Versatile Benchmark for Generating Images with Reasoning
Hongxiang Li, Yaowei Li, Bin Lin, Yuwei Niu, Yuhang Yang, Xiaoshuang Huang, Jiayin Cai, Xiaolong Jiang, Yao Hu, Long Chen
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2510.11026: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.11026&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[479] Buffer layers for Test-Time Adaptation
Hyeongyu Kim, Geonhui Han, Dosik Hwang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2510.21271: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.21271&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[480] PAGE-4D: Disentangled Pose and Geometry Estimation for VGGT-4D Perception
Kaichen Zhou, Yuhan Wang, Grace Chen, Xinhai Chang, Gaspard Beaudouin, Fangneng Zhan, Paul Pu Liang, Mengyu Wang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2510.17568: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.17568&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[481] GAS: Improving Discretization of Diffusion ODEs via Generalized Adversarial Solver
Aleksandr Oganov, Ilya Bykov, Eva Neudachina, Mishan Aliev, Alexander Tolmachev, Alexander Sidorov, Aleksandr Zuev, Andrey Okhotin, Denis Rakitin, Aibek Alanov
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2510.17699: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.17699&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[482] PixelVLA: Advancing Pixel-level Understanding in Vision-Language-Action Model
Wenqi Liang, Gan Sun, Yao He, Jiahua Dong, Suyan Dai, Ivan Laptev, Salman Khan, Yang Cong
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2511.01571: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.01571&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[483] A Stitch in Time: Learning Procedural Workflow via Self-Supervised Plackett-Luce Ranking
Chengan Che, Chao Wang, Xinyue Chen, Sophia Tsoka, Luis C. Garcia-Peraza-Herrera
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2511.17805: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.17805&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[484] Sharing the Learned Knowledge-base to Estimate Convolutional Filter Parameters for Continual Image Restoration
Aupendu Kar, Krishnendu Ghosh, Prabir Kumar Biswas
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2511.05421: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.05421&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[485] Multi-Context Fusion Transformer for Pedestrian Crossing Intention Prediction in Urban Environments
Yuanzhe Li, Hang Zhong, Steffen Müller
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2511.20011: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.20011&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[486] StyleQoRA: Quality-Aware Low-Rank Adaptation for Few-Shot Multi-Style Editing
Cong Cao, Huanjing Yue, Yujie Xu, Xiaodong Xu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2511.11236: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.11236&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[487] Real-Time Long Horizon Air Quality Forecasting via Group-Relative Policy Optimization
Inha Kang, Eunki Kim, Wonjeong Ryu, Jaeyo Shin, Seungjun Yu, Yoon-Hee Kang, Seongeun Jeong, Eunhye Kim, Soontae Kim, Hyunjung Shim
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2511.22169: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.22169&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[488] Can You Learn to See Without Images? Procedural Warm-Up for Vision Transformers
Zachary Shinnick, Liangze Jiang, Hemanth Saratchandran, Damien Teney, Anton van den Hengel
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2511.13945: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.13945&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[489] Satellite to Street: Disaster Impact Estimator
Sreesritha Sai, Sai Venkata Suma Sreeja, Sai Sri Deepthi, Nikhil
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2512.00065: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.00065&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[490] First Frame Is the Place to Go for Video Content Customization
Jingxi Chen, Zongxia Li, Zhichao Liu, Guangyao Shi, Xiyang Wu, Fuxiao Liu, Cornelia Fermuller, Brandon Y. Feng, Yiannis Aloimonos
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2511.15700: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.15700&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[491] DepthFocus: Controllable Depth Estimation for See-Through Scenes
Junhong Min, Jimin Kim, Minwook Kim, Cheol-Hui Min, Youngpil Jeon, Minyong Choi
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2511.16993: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.16993&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[492] Flowception: Temporally Expansive Flow Matching for Video Generation
Tariq Berrada Ifriqi, John Nguyen, Karteek Alahari, Jakob Verbeek, Ricky T. Q. Chen
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2512.11438: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.11438&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[493] Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models
Mark Endo, Serena Yeung-Levy
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2511.17487: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.17487&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[494] Intrinsic Image Fusion for Multi-View 3D Material Reconstruction
Peter Kocsis, Lukas Höllein, Matthias Nießner
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2512.13157: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.13157&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[495] MetroGS: Efficient and Stable Reconstruction of Geometrically Accurate High-Fidelity Large-Scale Scenes
Kehua Chen, Tianlu Mao, Xinzhu Ma, Hao Jiang, Zehao Li, Zihan Liu, Shuqi Gao, Honglong Zhao, Feng Dai, Yucheng Zhang, Zhaoqi Wang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2511.19172: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.19172&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[496] TTP: Test-Time Padding for Adversarial Detection and Robust Adaptation on Vision-Language Models
Zhiwei Li, Yitian Pang, Weining Wang, Zhenan Sun, Qi Li
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2512.16523: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.16523&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[497] IDSplat: Instance-Decomposed 3D Gaussian Splatting for Driving Scenes
Carl Lindström, Mahan Rafidashti, Maryam Fatemi, Lars Hammarstrand, Martin R. Oswald, Lennart Svensson
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2511.19235: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.19235&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[498] InfoTok: Adaptive Discrete Video Tokenizer via Information-Theoretic Compression
Haotian Ye, Qiyuan He, Jiaqi Han, Puheng Li, Jiaojiao Fan, Zekun Hao, Fitsum Reda, Yogesh Balaji, Huayu Chen, Sheng Liu, Angela Yao, James Zou, Stefano Ermon, Haoxiang Wang, Ming-Yu Liu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2512.16975: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.16975&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[499] SelfMOTR: Revisiting MOTR with Self-Generating Detection Priors
Fabian Gülhan, Emil Mededovic, Yuli Wu, Johannes Stegmaier
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2511.20279: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.20279&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[500] UAVLight: A Benchmark for Illumination-Robust 3D Reconstruction in Unmanned Aerial Vehicle (UAV) Scenes
Kang Du, Xue Liao, Junpeng Xia, Chaozheng Guo, Yi Gu, Yirui Guan, Duotun Wang, Sheng Huang, Zeyu Wang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2511.21565: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.21565&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[501] EZ-SP: Fast and Lightweight Superpoint-Based 3D Segmentation
Louis Geist, Loic Landrieu, Damien Robert
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request returned HTTP 429 (rate limited).
Abstract: Failed to fetch summary for 2512.00385: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.00385&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[502] Vision-language models lag human performance on physical dynamics and intent reasoning
Tianjun Gu, Jingyu Gong, Zhizhong Zhang, Yuan Xie, Lizhuang Ma, Xin Tan, Athanasios V
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2601.01547 returned HTTP 429 (rate limited).
[503] PhysGen: Physically Grounded 3D Shape Generation for Industrial Design
Yingxuan You, Chen Zhao, Hantao Zhang, Ming Xu, Pascal Fua
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2512.00422 returned HTTP 429 (rate limited).
[504] ELVIS: Enhance Low-Light for Video Instance Segmentation in the Dark
Joanne Lin, Ruirui Lin, Yini Li, David Bull, Nantheera Anantrasirichai
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2512.01495 returned HTTP 429 (rate limited).
[505] Goal Force: Teaching Video Models To Accomplish Physics-Conditioned Goals
Nate Gillman, Yinghua Zhou, Zitian Tang, Evan Luo, Arjan Chakravarthy, Daksh Aggarwal, Michael Freeman, Charles Herrmann, Chen Sun
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2601.05848 returned HTTP 429 (rate limited).
[506] FreqEdit: Preserving High-Frequency Features for Robust Multi-Turn Image Editing
Yucheng Liao, Jiajun Liang, Kaiqian Cui, Baoquan Zhao, Haoran Xie, Wei Liu, Qing Li, Xudong Mao
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2512.01755 returned HTTP 429 (rate limited).
[507] Denoise to Track: Harnessing Video Diffusion Priors for Robust Correspondence
Tianyu Yuan, Yuanbo Yang, Lin-Zhuo Chen, Yao Yao, Zhuzhong Qian
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2512.04619 returned HTTP 429 (rate limited).
[508] Memory-V2V: Memory-Augmented Video-to-Video Diffusion for Consistent Multi-Turn Editing
Dohun Lee, Chun-Hao Paul Huang, Xuelin Chen, Jong Chul Ye, Duygu Ceylan, Hyeonho Jeong
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2601.16296 returned HTTP 429 (rate limited).
[509] LatentFM: A Latent Flow Matching Approach for Generative Medical Image Segmentation
Huynh Trinh Ngoc, Hoang Anh Nguyen Kim, Toan Nguyen Hai, Long Tran Quoc
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2512.04821 returned HTTP 429 (rate limited).
[510] OpenVTON-Bench: A Large-Scale High-Resolution Benchmark for Controllable Virtual Try-On Evaluation
Jin Li, Tao Chen, Shuai Jiang, Weijie Wang, Jingwen Luo, Chenhui Wu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2601.22725 returned HTTP 429 (rate limited).
[511] SCAIL: Towards Studio-Grade Character Animation via In-Context Learning of 3D-Consistent Pose Representations
Wenhao Yan, Sheng Ye, Zhuoyi Yang, Jiayan Teng, ZhenHui Dong, Kairui Wen, Xiaotao Gu, Yong-Jin Liu, Jie Tang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2512.05905 returned HTTP 429 (rate limited).
[512] sim2art: Accurate Articulated Object Modeling from a Single Video using Synthetic Training Data Only
Arslan Artykov, Tom Ravaud, Corentin Sautier, Vincent Lepetit
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2512.07698 returned HTTP 429 (rate limited).
[513] Leveraging Multispectral Sensors for Color Correction in Mobile Cameras
Luca Cogo, Marco Buzzelli, Simone Bianco, Javier Vazquez-Corral, Raimondo Schettini
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2512.08441 returned HTTP 429 (rate limited).
[514] LoGoColor: Local-Global 3D Colorization for 360° Scenes
Yeonjin Chang, Juhwan Cho, Seunghyeon Seo, Wonsik Shin, Nojun Kwak
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2512.09278 returned HTTP 429 (rate limited).
[515] Audio-sync Video Instance Editing with Granularity-Aware Mask Refiner
Haojie Zheng, Shuchen Weng, Jingqi Liu, Siqi Yang, Boxin Shi, Xinlong Wang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2512.10571 returned HTTP 429 (rate limited).
[516] Feature Recalibration Based Olfactory-Visual Multimodal Model for Enhanced Rice Deterioration Detection
Rongqiang Zhao, Hengrui Hu, Yijing Wang, Mingchun Sun, Jie Liu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2602.14408 returned HTTP 429 (rate limited).
[517] CausalCLIP: Causally-Informed Feature Disentanglement and Filtering for Generalizable Detection of Generated Images
Bo Liu, Qiao Qin, Qinghui He
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2512.13285 returned HTTP 429 (rate limited).
[518] No Need For Real Anomaly: MLLM Empowered Zero-Shot Video Anomaly Detection
Zunkai Dai, Ke Li, Jiajia Liu, Jie Yang, Yuanyuan Qiao
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2602.19248 returned HTTP 429 (rate limited).
[519] GroundingME: Exposing the Visual Grounding Gap in MLLMs through Multi-Dimensional Evaluation
Rang Li, Lei Li, Shuhuai Ren, Hao Tian, Shuhao Gu, Shicheng Li, Zihao Yue, Yudong Wang, Wenhan Ma, Zhe Yang, Jingyuan Ma, Zhifang Sui, Fuli Luo
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2512.17495 returned HTTP 429 (rate limited).
[520] Toward Real-Time Surgical Scene Segmentation via a Spike-Driven Video Transformer with Spike-Informed Pretraining
Shihao Zou, Jingjing Li, Wei Ji, Jincai Huang, Kai Wang, Guo Dan, Weixin Si, Yi Pan
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2512.21284 returned HTTP 429 (rate limited).
[521] Scene-VLM: Multimodal Video Scene Segmentation via Vision-Language Models
Nimrod Berman, Adam Botach, Emanuel Ben-Baruch, Shunit Haviv Hakimi, Asaf Gendler, Ilan Naiman, Erez Yosef, Igor Kviatkovsky
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2512.21778 returned HTTP 429 (rate limited).
[522] Taxonomy-Aware Representation Alignment for Hierarchical Visual Recognition with Large Multimodal Models
Hulingxiao He, Zhi Tan, Yuxin Peng
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.00431 returned HTTP 429 (rate limited).
[523] ClearAIR: A Human-Visual-Perception-Inspired All-in-One Image Restoration
Xu Zhang, Huan Zhang, Guoli Wang, Qian Zhang, Lefei Zhang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2601.02763 returned HTTP 429 (rate limited).
[524] MPFlow: Multi-modal Posterior-Guided Flow Matching for Zero-Shot MRI Reconstruction
Seunghoi Kim, Chen Jin, Henry F. J. Tregidgo, Matteo Figini, Daniel C. Alexander
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.03710 returned HTTP 429 (rate limited).
[525] Gen3R: 3D Scene Generation Meets Feed-Forward Reconstruction
Jiaxin Huang, Yuanbo Yang, Bangbang Yang, Lin Ma, Yuewen Ma, Yiyi Liao
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2601.04090 returned HTTP 429 (rate limited).
[526] UniDrive-WM: Unified Understanding, Planning and Generation World Model For Autonomous Driving
Zhexiao Xiong, Xin Ye, Burhan Yaman, Sheng Cheng, Yiren Lu, Jingru Luo, Nathan Jacobs, Liu Ren
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2601.04453 returned HTTP 429 (rate limited).
[527] Spherical VAE with Cluster-Aware Feasible Regions: Guaranteed Prevention of Posterior Collapse
Zegu Zhang, Jian Zhang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.10935 returned HTTP 429 (rate limited).
[528] VideoAuto-R1: Video Auto Reasoning via Thinking Once, Answering Twice
Shuming Liu, Mingchen Zhuge, Changsheng Zhao, Jun Chen, Lemeng Wu, Zechun Liu, Chenchen Zhu, Zhipeng Cai, Chong Zhou, Haozhe Liu, Ernie Chang, Saksham Suri, Hongyu Xu, Qi Qian, Wei Wen, Balakrishnan Varadarajan, Zhuang Liu, Hu Xu, Florian Bordes, Raghuraman Krishnamoorthi, Bernard Ghanem, Vikas Chandra, Yunyang Xiong
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2601.05175 returned HTTP 429 (rate limited).
[529] ObjectForesight: Predicting Future 3D Object Trajectories from Human Videos
Rustin Soraki, Homanga Bharadhwaj, Ali Farhadi, Roozbeh Mottaghi
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2601.05237 returned HTTP 429 (rate limited).
[530] SupScene: Scene-Structured Overlap Supervision for Image Retrieval in Unconstrained SfM
Xulei Shi, Maoyu Wang, Yuning Peng, Guanbo Wang, Xin Wang, Yifan Liao, Qi Chen, Pengjie Tao
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2601.11930 returned HTTP 429 (rate limited).
[531] Nuanced Emotion Recognition Based on a Segment-based MLLM Framework Leveraging Qwen3-Omni for AH Detection
Liang Tang, Hongda Li, Jiayu Zhang, Long Chen, Shuxian Li, Siqi Pei, Tiaonan Duan, Yuhao Cheng
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.13406 returned HTTP 429 (rate limited).
[532] VIRTUE: Versatile Video Retrieval Through Unified Embeddings
Shaunak Halbe, Bhagyashree Puranik, Jayakrishnan Unnikrishnan, Kushan Thakkar, Vimal Bhat, Toufiq Parag
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2601.12193 returned HTTP 429 (rate limited).
[533] Spatial Transcriptomics as Images for Large-Scale Pretraining
Yishun Zhu, Jiaxin Qi, Jian Wang, Yuhua Zheng, Jianqiang Huang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.13432 returned HTTP 429 (rate limited).
[534] ReflexSplit: Single Image Reflection Separation via Layer Fusion-Separation
Chia-Ming Lee, Yu-Fan Lin, Jin-Hui Jiang, Yu-Jou Hsiao, Chih-Chung Hsu, Yu-Lun Liu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2601.17468 returned HTTP 429 (rate limited).
[535] SPACE-CLIP: Spatial Perception via Adaptive CLIP Embeddings for Monocular Depth Estimation
Taewan Cho, Taeryang Kim, Andrew Jaeyong Choi
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2601.17657 returned HTTP 429 (rate limited).
[536] Causal World Modeling for Robot Control
Lin Li, Qihang Zhang, Yiming Luo, Shuai Yang, Ruilin Wang, Fei Han, Mingrui Yu, Zelin Gao, Nan Xue, Xing Zhu, Yujun Shen, Yinghao Xu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2601.21998 returned HTTP 429 (rate limited).
[537] Evaluating OCR Performance for Assistive Technology: Effects of Walking Speed, Camera Placement, and Camera Type
Junchi Feng, Nikhil Ballem, Mahya Beheshti, Giles Hamilton-Fletcher, Todd Hudson, Maurizio Porfiri, William H. Seiple, John-Ross Rizzo
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2602.02223 returned HTTP 429 (rate limited).
[538] PTB-XL-Image-17K: A Large-Scale Synthetic ECG Image Dataset with Comprehensive Ground Truth for Deep Learning-Based Digitization
Naqcho Ali Mehdi, Aamir Ali Drigh
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2602.07446 returned HTTP 429 (rate limited).
[539] Fast-WAM: Do World Action Models Need Test-time Future Imagination?
Tianyuan Yuan, Zibin Dong, Yicheng Liu, Hang Zhao
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.16666 returned HTTP 429 (rate limited).
[540] BAAF: Universal Transformation of One-Class Classifiers for Unsupervised Image Anomaly Detection
Declan McIntosh, Alexandra Branzan Albu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2602.13091 returned HTTP 429 (rate limited).
[541] When Safety Collides: Resolving Multi-Category Harmful Conflicts in Text-to-Image Diffusion via Adaptive Safety Guidance
Yongli Xiang, Ziming Hong, Zhaoqing Wang, Xiangyu Zhao, Bo Han, Tongliang Liu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2602.20880 returned HTTP 429 (rate limited).
[542] Interpretable Cross-Domain Few-Shot Learning with Rectified Target-Domain Local Alignment
Yaze Zhao, Yixiong Zou, Yuhua Li, Ruixuan Li
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.17655 returned HTTP 429 (rate limited).
[543] Easy3E: Feed-Forward 3D Asset Editing via Rectified Voxel Flow
Shimin Hu, Yuanyi Wei, Fei Zha, Yudong Guo, Juyong Zhang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request for 2602.21499 was rate-limited (HTTP 429).
[544] ThinkOmni: Lifting Textual Reasoning to Omni-modal Scenarios via Guidance Decoding
Yiran Guan, Sifan Tu, Dingkang Liang, Linghao Zhu, Jianzhong Ju, Zhenbo Luo, Jian Luan, Yuliang Liu, Xiang Bai
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request for 2602.23306 was rate-limited (HTTP 429).
[545] Diffusion Probe: Generated Image Result Prediction Using CNN Probes
Benlei Cui, Bukun Huang, Zhizeng Ye, Xuemei Dong, Tuo Chen, Hui Xue, Dingkang Yang, Longtao Huang, Jingqun Tang, Haiwen Hong
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request for 2602.23783 was rate-limited (HTTP 429).
[546] SwitchCraft: Training-Free Multi-Event Video Generation with Attention Controls
Qianxun Xu, Chenxi Song, Yujun Cai, Chi Zhang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request for 2602.23956 was rate-limited (HTTP 429).
[547] DriveCode: Domain Specific Numerical Encoding for LLM-Based Autonomous Driving
Zhiye Wang, Yanbo Jiang, Rui Zhou, Bo Zhang, Fang Zhang, Zhenhua Xu, Yaqin Zhang, Jianqiang Wang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request for 2603.00919 was rate-limited (HTTP 429).
[548] FREE-Edit: Using Editing-aware Injection in Rectified Flow Models for Zero-shot Image-Driven Video Editing
Maomao Li, Yunfei Liu, Yu Li
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request for 2603.01164 was rate-limited (HTTP 429).
[549] Tri-path DINO: Feature Complementary Learning for Remote Sensing Multi-Class Change Detection
Kai Zheng, Hang-Cheng Dong, Shoulei Liu, Zhenkai Wu, Fupeng Wei, Lei Ding, Wei Zhang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request for 2603.01498 was rate-limited (HTTP 429).
[550] Proact-VL: A Proactive VideoLLM for Real-Time AI Companions
Weicai Yan, Yuhong Dai, Qi Ran, Haodong Li, Wang Lin, Hao Liao, Xing Xie, Tao Jin, Jianxun Lian
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request for 2603.03447 was rate-limited (HTTP 429).
[551] DAGE: Dual-Stream Architecture for Efficient and Fine-Grained Geometry Estimation
Tuan Duc Ngo, Jiahui Huang, Seoung Wug Oh, Kevin Blackburn-Matzen, Evangelos Kalogerakis, Chuang Gan, Joon-Young Lee
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request for 2603.03744 was rate-limited (HTTP 429).
[552] Towards Highly Transferable Vision-Language Attack via Semantic-Augmented Dynamic Contrastive Interaction
Yuanbo Li, Tianyang Xu, Cong Hu, Tao Zhou, Xiao-Jun Wu, Josef Kittler
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request for 2603.04839 was rate-limited (HTTP 429).
[553] Multi-Paradigm Collaborative Adversarial Attack Against Multi-Modal Large Language Models
Yuanbo Li, Tianyang Xu, Cong Hu, Tao Zhou, Xiao-Jun Wu, Josef Kittler
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request for 2603.04846 was rate-limited (HTTP 429).
[554] PaQ-DETR: Learning Pattern and Quality-Aware Dynamic Queries for Object Detection
Zhengjian Kang, Jun Zhuang, Kangtong Mo, Qi Chen, Rui Liu, Ye Zhang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request for 2603.06917 was rate-limited (HTTP 429).
[555] MipSLAM: Alias-Free Gaussian Splatting SLAM
Yingzhao Li, Yan Li, Shixiong Tian, Yanjie Liu, Lijun Zhao, Gim Hee Lee
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request for 2603.06989 was rate-limited (HTTP 429).
[556] SWIFT: Sliding Window Reconstruction for Few-Shot Training-Free Generated Video Attribution
Chao Wang, Zijin Yang, Yaofei Wang, Yuang Qi, Weiming Zhang, Nenghai Yu, Kejiang Chen
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request for 2603.08536 was rate-limited (HTTP 429).
[557] CIGPose: Causal Intervention Graph Neural Network for Whole-Body Pose Estimation
Bohao Li, Zhicheng Cao, Huixian Li, Yangming Guo
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request for 2603.09418 was rate-limited (HTTP 429).
[558] SignSparK: Efficient Multilingual Sign Language Production via Sparse Keyframe Learning
Jianhe Low, Alexandre Symeonidis-Herzig, Maksym Ivashechkin, Ozge Mercanoglu Sincan, Richard Bowden
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request for 2603.10446 was rate-limited (HTTP 429).
[559] UniStitch: Unifying Semantic and Geometric Features for Image Stitching
Yuan Mei, Lang Nie, Kang Liao, Yunqiu Xu, Chunyu Lin, Bin Xiao
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request for 2603.10568 was rate-limited (HTTP 429).
[560] Event-based Photometric Stereo via Rotating Illumination and Per-Pixel Learning
Hyunwoo Kim, Won-Hoe Kim, Sanghoon Lee, Jianfei Cai, Giljoo Nam, Jae-Sang Hyun
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request for 2603.10748 was rate-limited (HTTP 429).
[561] Continual Learning with Vision-Language Models via Semantic-Geometry Preservation
Chiyuan He, Zihuan Qiu, Fanman Meng, Runtong Zhang, Linfeng Xu, Qingbo Wu, Hongliang Li
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request for 2603.12055 was rate-limited (HTTP 429).
[562] Unleashing Video Language Models for Fine-grained HRCT Report Generation
Yingying Fang, Huichi Zhou, KinHei Lee, Yijia Wang, Zhenxuan Zhang, Jiahao Huang, Guang Yang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request for 2603.12469 was rate-limited (HTTP 429).
[563] HIFICL: High-Fidelity In-Context Learning for Multimodal Tasks
Xiaoyu Li, Yuhang Liu, Zheng Luo, Xuanshuo Kang, Fangqi Lou, Xiaohua Wu, Zihan Xiong
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request for 2603.12760 was rate-limited (HTTP 429).
[564] VIRD: View-Invariant Representation through Dual-Axis Transformation for Cross-View Pose Estimation
Juhye Park, Wooju Lee, Dasol Hong, Changki Sung, Youngwoo Seo, Dongwan Kang, Hyun Myung
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request for 2603.12918 was rate-limited (HTTP 429).
[565] Joint Segmentation and Grading with Iterative Optimization for Multimodal Glaucoma Diagnosis
Zhiwei Wang, Yuxing Li, Meilu Zhu, Defeng He, Edmund Y. Lam
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request for 2603.14188 was rate-limited (HTTP 429).
[566] Sparse but not Simpler: A Multi-Level Interpretability Analysis of Vision Transformers
Siyu Zhang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request for 2603.15919 was rate-limited (HTTP 429).
[567] VIGOR: VIdeo Geometry-Oriented Reward for Temporal Generative Alignment
Tengjiao Yin, Jinglei Shi, Heng Guo, Xi Wang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request for 2603.16271 was rate-limited (HTTP 429).
[568] TinyGLASS: Real-Time Self-Supervised In-Sensor Anomaly Detection
Pietro Bonazzi, Rafael Sutter, Luigi Capogrosso, Mischa Buob, Michele Magno
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request for 2603.16451 was rate-limited (HTTP 429).
[569] GigaWorld-Policy: An Efficient Action-Centered World–Action Model
Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Hao Li, Hengtao Li, Jie Li, Jindi Lv, Jingyu Liu, Min Cao, Peng Li, Qiuping Deng, Wenjun Mei, Xiaofeng Wang, Xinze Chen, Xinyu Zhou, Yang Wang, Yifan Chang, Yifan Li, Yukun Zhou, Yun Ye, Zhichao Liu, Zheng Zhu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request for 2603.17240 was rate-limited (HTTP 429).
[570] Omni-I2C: A Holistic Benchmark for High-Fidelity Image-to-Code Generation
Jiawei Zhou, Chi Zhang, Xiang Feng, Qiming Zhang, Haibo Qiu, Lihuo He, Dengpan Ye, Xinbo Gao, Jing Zhang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request for 2603.17508 was rate-limited (HTTP 429).
[571] CrowdGaussian: Reconstructing High-Fidelity 3D Gaussians for Human Crowd from a Single Image
Yizheng Song, Yiyu Zhuang, Qipeng Xu, Haixiang Wang, Jiahe Zhu, Jing Tian, Siyu Zhu, Hao Zhu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request for 2603.17779 was rate-limited (HTTP 429).
[572] Identity as Presence: Towards Appearance and Voice Personalized Joint Audio-Video Generation
Yingjie Chen, Shilun Lin, Cai Xing, Qixin Yan, Wenjing Wang, Dingming Liu, Hao Liu, Chen Li, Jing Lyu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request for 2603.17889 was rate-limited (HTTP 429).
[573] ParallelVLM: Lossless Video-LLM Acceleration with Visual Alignment Aware Parallel Speculative Decoding
Quan Kong, Yuhao Shen, Yicheng Ji, Huan Li, Cong Wang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request for 2603.19610 was rate-limited (HTTP 429).
[574] 2K Retrofit: Entropy-Guided Efficient Sparse Refinement for High-Resolution 3D Geometry Prediction
Tianbao Zhang, Zhenyu Liang, Zhenbo Song, Nana Wang, Xiaomei Zhang, Xudong Cai, Zheng Zhu, Kejian Wu, Gang Wang, Zhaoxin Fan
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request for 2603.19964 was rate-limited (HTTP 429).
[575] SPOT: Point Cloud Based Stereo Visual Place Recognition for Similar and Opposing Viewpoints
Spencer Carmichael, Rahul Agrawal, Ram Vasudevan, Katherine A. Skinner
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request for 2404.12339 was rate-limited (HTTP 429).
[576] Expand Your SCOPE: Semantic Cognition over Potential-Based Exploration for Embodied Visual Navigation
Ningnan Wang, Weihuang Chen, Liming Chen, Haoxuan Ji, Zhongyu Guo, Xuchong Zhang, Hongbin Sun
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request for 2511.08935 was rate-limited (HTTP 429).
[577] Foundation Models for Trajectory Planning in Autonomous Driving: A Review of Progress and Open Challenges
Kemal Oksuz, Alexandru Buburuzan, Anthony Knittel, Yuhan Yao, Puneet K. Dokania
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request for 2512.00021 was rate-limited (HTTP 429).
[578] VoroLight: Learning Voronoi Surface Meshes via Sphere Intersection
Jiayin Lu, Ying Jiang, Yumeng He, Yin Yang, Chenfanfu Jiang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request for 2512.12984 was rate-limited (HTTP 429).
[579] Real2Edit2Real: Generating Robotic Demonstrations via a 3D Control Interface
Yujie Zhao, Hongwei Fan, Di Chen, Shengcong Chen, Liliang Chen, Xiaoqi Li, Guanghui Ren, Hao Dong
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request for 2512.19402 was rate-limited (HTTP 429).
[580] GenAI-DrawIO-Creator: A Framework for Automated Diagram Generation
Jinze Yu, Dayuan Jiang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request for 2601.05162 was rate-limited (HTTP 429).
[581] HERE: Hierarchical Active Exploration of Radiance Field with Epistemic Uncertainty Minimization
Taekbeom Lee, Dabin Kim, Youngseok Jang, H. Jin Kim
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv abstract request for 2601.07242 was rate-limited (HTTP 429).
cs.AI
[582] AgenticGEO: A Self-Evolving Agentic System for Generative Engine Optimization
Jiaqi Yuan, Jialu Wang, Zihan Wang, Qingyun Sun, Ruijie Wang, Jianxin Li
Main category: cs.AI
TL;DR: AgenticGEO is a self-evolving agentic framework for Generative Engine Optimization that formulates optimization as a content-conditioned control problem, using evolutionary search and a co-evolving critic to adapt to black-box generative engines.
Details
Motivation: Existing Generative Engine Optimization methods rely on static heuristics, single-prompt optimization, or engine preference rules that are prone to overfitting; they cannot adapt to diverse content or changing engine behaviors and require impractical amounts of interaction feedback.
Method: Proposes the AgenticGEO framework with a MAP-Elites archive to evolve diverse compositional strategies, plus a Co-Evolving Critic, a lightweight surrogate that approximates engine feedback for content-specific strategy selection and refinement, guiding both evolutionary search and inference-time planning.
Result: Achieves state-of-the-art performance and demonstrates robust transferability, outperforming 14 baselines across 3 datasets in both in-domain and cross-domain experiments on two representative generative engines.
Conclusion: AgenticGEO provides an effective self-evolving framework for Generative Engine Optimization that can robustly adapt to unpredictable black-box engine behaviors while reducing interaction costs.
Abstract: Generative search engines represent a transition from traditional ranking-based retrieval to Large Language Model (LLM)-based synthesis, transforming optimization goals from ranking prominence towards content inclusion. Generative Engine Optimization (GEO), specifically, aims to maximize visibility and attribution in black-box summarized outputs by strategically manipulating source content. However, existing methods rely on static heuristics, single-prompt optimization, or engine preference rule distillation that is prone to overfitting. They cannot flexibly adapt to diverse content or the changing behaviors of generative engines. Moreover, effectively optimizing these strategies requires an impractical amount of interaction feedback from the engines. To address these challenges, we propose AgenticGEO, a self-evolving agentic framework formulating optimization as a content-conditioned control problem, which enhances intrinsic content quality to robustly adapt to the unpredictable behaviors of black-box engines. Unlike fixed-strategy methods, AgenticGEO employs a MAP-Elites archive to evolve diverse, compositional strategies. To mitigate interaction costs, we introduce a Co-Evolving Critic, a lightweight surrogate that approximates engine feedback for content-specific strategy selection and refinement, efficiently guiding both evolutionary search and inference-time planning. Through extensive in-domain and cross-domain experiments on two representative engines, AgenticGEO achieves state-of-the-art performance and demonstrates robust transferability, outperforming 14 baselines across 3 datasets. Our code and model are available at: https://github.com/AIcling/agentic_geo.
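The MAP-Elites archive at the heart of AgenticGEO keeps one elite per behavior-descriptor cell and mutates sampled elites to fill the space. A minimal, domain-agnostic sketch of that loop follows; the 2-D toy fitness, grid descriptor, and Gaussian mutation are illustrative stand-ins, not the paper's content-optimization strategies:

```python
import random

def map_elites(fitness, descriptor, mutate, seed_genome, iters=2000, rng=None):
    """Minimal MAP-Elites loop: keep one elite (best-fitness genome) per
    behavior-descriptor cell, then repeatedly mutate randomly sampled elites."""
    rng = rng or random.Random(0)
    archive = {}  # cell -> (fitness, genome)

    def try_insert(g):
        f, cell = fitness(g), descriptor(g)
        if cell not in archive or f > archive[cell][0]:
            archive[cell] = (f, g)

    try_insert(seed_genome)
    for _ in range(iters):
        _, parent = rng.choice(list(archive.values()))
        try_insert(mutate(parent, rng))
    return archive

# Illustrative domain: the genome is a point in the unit square, the descriptor
# buckets it into a 4x4 behavior grid, and fitness rewards proximity to the center.
fit = lambda g: -((g[0] - 0.5) ** 2 + (g[1] - 0.5) ** 2)
desc = lambda g: (min(3, int(g[0] * 4)), min(3, int(g[1] * 4)))
mut = lambda g, r: tuple(min(1.0, max(0.0, x + r.gauss(0, 0.1))) for x in g)
archive = map_elites(fit, desc, mut, (0.1, 0.1))
```

The archive accumulates a diverse repertoire (up to 16 cells here) rather than a single optimum, which is what lets strategy selection stay content-conditioned.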
[583] ProMAS: Proactive Error Forecasting for Multi-Agent Systems Using Markov Transition Dynamics
Xinkui Zhao, Sai Liu, Yifan Zhang, Qingyu Ma, Guanjie Cheng, Naibo Wang, Chang Liu
Main category: cs.AI
TL;DR: PROMAS is a proactive framework for real-time error detection in multi-agent LLM systems using Markov transitions and causal delta features to predict reasoning failures before they propagate.
Details
Motivation: Current multi-agent LLM systems are fragile to logical fallacies that can propagate and cause system-wide failures. Most research relies on post-hoc failure analysis, which prevents real-time intervention. There's a need for proactive error detection to balance diagnostic precision with real-time demands.
Method: PROMAS uses Markov transitions for predictive error analysis. It extracts Causal Delta Features to capture semantic displacement, maps them to a quantized Vector Markov Space to model reasoning as probabilistic transitions, and integrates a Proactive Prediction Head with Jump Detection to localize errors via risk acceleration rather than static thresholds.
Result: On the Who&When benchmark, PROMAS achieves 22.97% step-level accuracy while processing only 27% of reasoning logs. This performance rivals reactive monitors like MASC while reducing data overhead by 73%. The method trades some accuracy for significantly improved intervention latency.
Conclusion: PROMAS provides a proactive framework for error detection in multi-agent LLM systems that balances diagnostic precision with real-time intervention needs, offering a practical solution for preventing error propagation in collaborative reasoning systems.
Abstract: The integration of Large Language Models into Multi-Agent Systems (MAS) has enabled the solution of complex, long-horizon tasks through collaborative reasoning. However, this collective intelligence is inherently fragile, as a single logical fallacy can rapidly propagate and lead to system-wide failure. Most current research relies on post-hoc failure analysis, thereby hindering real-time intervention. To address this, we propose PROMAS, a proactive framework utilizing Markov transitions for predictive error analysis. PROMAS extracts Causal Delta Features to capture semantic displacement, mapping them to a quantized Vector Markov Space to model reasoning as probabilistic transitions. By integrating a Proactive Prediction Head with Jump Detection, the method localizes errors via risk acceleration rather than static thresholds. On the Who&When benchmark, PROMAS achieves 22.97% step-level accuracy while processing only 27% of reasoning logs. This performance rivals reactive monitors like MASC while reducing data overhead by 73%. Although this strategy entails an accuracy trade-off compared to post-hoc methods, it significantly improves intervention latency, balancing diagnostic precision with the real-time demands of autonomous reasoning.
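The Markov-transition idea (quantized reasoning states, per-step risk as negative log transition probability, and jump detection via risk acceleration rather than a static threshold) can be illustrated with a toy sketch; the state sequences and threshold below are invented, and PROMAS's actual causal delta features are not modeled:

```python
import math

def fit_transitions(train_states, n_states, alpha=1.0):
    """Laplace-smoothed transition probabilities over quantized reasoning states."""
    counts = [[alpha] * n_states for _ in range(n_states)]
    for a, b in zip(train_states, train_states[1:]):
        counts[a][b] += 1
    return [[c / sum(row) for c in row] for row in counts]

def jump_steps(states, P, accel_thresh=1.0):
    """Flag transitions where the *acceleration* of risk (-log P) spikes upward,
    instead of comparing raw risk to a static threshold."""
    risk = [-math.log(P[a][b]) for a, b in zip(states, states[1:])]
    jumps = []
    for i in range(1, len(risk) - 1):
        accel = risk[i + 1] - 2 * risk[i] + risk[i - 1]
        if accel > accel_thresh and risk[i + 1] > risk[i - 1]:  # rising spike only
            jumps.append(i + 1)  # index of the suspicious transition
    return jumps

# Train on a well-behaved reasoning cycle 0 -> 1 -> 2 -> 0, then score a trace
# containing one anomalous 1 -> 0 transition (pair index 7).
P = fit_transitions([0, 1, 2] * 20, n_states=3)
flags = jump_steps([0, 1, 2, 0, 1, 2, 0, 1, 0, 1, 2, 0], P)  # -> [7]
```

The second-difference test fires exactly where the improbable transition appears, while the routine cycle transitions stay below threshold.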
[584] Domain-Specialized Tree of Thought through Plug-and-Play Predictors
Xuanqi Gao, Haoyu Wang, Jun Sun, Shiqing Ma, Chao Shen
Main category: cs.AI
TL;DR: DST introduces a lightweight supervised predictor for Tree of Thoughts (ToT) framework that enables dynamic pruning, balancing exploration depth with computational efficiency in LLM reasoning.
Details
Motivation: Current Tree of Thoughts implementations face a critical trade-off between exploration depth and computational efficiency, relying on expensive LLM-based self-evaluation or rigid heuristics that make them impractical for broad applications.
Method: DST is an adaptable, plug-and-play predictor that serves as a lightweight supervised heuristic to guide the ToT search process, enabling dynamic context-aware pruning that expands search beams only when encountering uncertainty or task complexity.
Result: The method achieves accuracy competitive with or superior to strong baselines including standard ToT while reducing computational overhead by 26-75% across mathematical reasoning, general reasoning, and complex logical reasoning benchmarks.
Conclusion: DST effectively resolves the accuracy-efficiency trade-off in tree-based reasoning, transforming ToT from a resource-intensive technique into a scalable and practical paradigm for complex problem-solving in LLMs.
Abstract: While Large Language Models (LLMs) have advanced complex reasoning, prominent methods like the Tree of Thoughts (ToT) framework face a critical trade-off between exploration depth and computational efficiency. Existing ToT implementations often rely on heavyweight LLM-based self-evaluation or rigid heuristics for branch pruning, making them prohibitively expensive and inflexible for broad application. To address this, we introduce DST, an adaptable, plug-and-play predictor that serves as a lightweight, supervised heuristic to guide the ToT search process. Our predictor enables dynamic, context-aware pruning, allowing the search to proceed with near-greedy efficiency on simpler reasoning steps while adaptively expanding the search beam only when encountering uncertainty or task complexity. We evaluate our approach on a diverse suite of benchmarks spanning mathematical reasoning, general reasoning, and complex logical reasoning. Experimental results demonstrate that our method achieves accuracy competitive with or superior to strong baselines, including standard ToT, while reducing computational overhead by 26-75%. Our work effectively resolves the accuracy-efficiency trade-off in tree-based reasoning, transforming ToT from a resource-intensive technique into a scalable and practical paradigm for complex problem-solving in LLMs.
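The core mechanism (a cheap predictor that lets the search stay near-greedy and widens the beam only under uncertainty) can be sketched generically; the bit-string task, value function, and gap-based uncertainty signal below are illustrative assumptions, not the paper's trained predictor:

```python
def adaptive_beam_search(root, expand, value, uncertain, depth,
                         base_width=1, wide_width=3):
    """Near-greedy tree search that widens the beam only when the lightweight
    predictor cannot separate the top candidates."""
    beam = [root]
    for _ in range(depth):
        children = [c for s in beam for c in expand(s)]
        if not children:
            break
        children.sort(key=value, reverse=True)
        width = wide_width if uncertain(children) else base_width
        beam = children[:width]
    return max(beam, key=value)

# Illustrative task: reconstruct a hidden bit string, scoring partial matches.
target = "10110"
expand = lambda s: [s + "0", s + "1"] if len(s) < len(target) else []
value = lambda s: sum(a == b for a, b in zip(s, target))

def uncertain(children):
    """Treat a zero gap between the top two scores as predictor uncertainty."""
    vals = sorted((value(c) for c in children), reverse=True)
    return len(vals) > 1 and vals[0] == vals[1]

best = adaptive_beam_search("", expand, value, uncertain, depth=len(target))
```

In this toy the value signal is clean, so the gap never vanishes and the search stays near-greedy at width 1; with a noisier predictor the same loop would expand to `wide_width` exactly at the ambiguous steps, which is the accuracy-efficiency trade the paper targets.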
[585] FactorSmith: Agentic Simulation Generation via Markov Decision Process Decomposition with Planner-Designer-Critic Refinement
Ali Shamsaddinlou, Morteza NourelahiAlamdari
Main category: cs.AI
TL;DR: FactorSmith is a framework that generates executable game simulations from natural language by combining factored POMDP decomposition with hierarchical agentic workflows for iterative quality refinement.
Details
Motivation: LLMs struggle with reasoning over large codebases when generating simulations from natural language. Existing approaches need better context management and quality refinement mechanisms.
Method: Combines factored POMDP decomposition (from FactorSim) to reduce context by modularizing simulation steps, with a hierarchical planner-designer-critic agentic workflow (inspired by SceneSmith) for iterative refinement at each generation step.
Result: Experiments on PyGame Learning Environment show improved prompt alignment, fewer runtime errors, and higher code quality compared to non-agentic factored baselines.
Conclusion: The combination of factored decomposition and agentic workflows effectively addresses LLM reasoning limitations in simulation generation, producing higher quality executable code from natural language specifications.
Abstract: Generating executable simulations from natural language specifications remains a challenging problem due to the limited reasoning capacity of large language models (LLMs) when confronted with large, interconnected codebases. This paper presents FactorSmith, a framework that synthesizes playable game simulations in code from textual descriptions by combining two complementary ideas: factored POMDP decomposition for principled context reduction and a hierarchical planner-designer-critic agentic workflow for iterative quality refinement at every generation step. Drawing on the factored partially observable Markov decision process (POMDP) representation introduced by FactorSim [Sun et al., 2024], the proposed method decomposes a simulation specification into modular steps where each step operates only on a minimal subset of relevant state variables, limiting the context window that any single LLM call must process. Inspired by the agentic trio architecture of SceneSmith [Pfaff et al., 2025], FactorSmith embeds within every factored step a three-agent interaction: a planner that orchestrates workflow, a designer that proposes code artifacts, and a critic that evaluates quality through structured scoring, enabling iterative refinement with checkpoint rollback. This paper formalizes the combined approach, presents the mathematical framework underpinning context selection and agentic refinement, and describes the open-source implementation. Experiments on the PyGame Learning Environment benchmark demonstrate that FactorSmith generates simulations with improved prompt alignment, fewer runtime errors, and higher code quality compared to non-agentic factored baselines.
[586] Me, Myself, and $\pi$: Evaluating and Explaining LLM Introspection
Atharv Naphade, Samarth Bhargav, Sean Lim, Mcnair Shah
Main category: cs.AI
TL;DR: LLMs can introspect about their own cognitive processes, showing privileged access to their policies and parameters, with attention diffusion enabling this capability without explicit training.
Details
Motivation: Current evaluations of introspection in LLMs fail to distinguish genuine meta-cognition from general world knowledge or text-based self-simulation. There's a need for principled evaluation of whether LLMs truly have introspective capabilities.
Method: Proposed a formal taxonomy of introspection as latent computation over model policies/parameters. Created Introspect-Bench, a multifaceted evaluation suite for rigorous capability testing. Analyzed attention diffusion mechanisms.
Result: Frontier models show privileged access to their own policies, outperforming peer models in predicting their own behavior. Provided causal, mechanistic evidence showing how LLMs learn to introspect without explicit training via attention diffusion.
Conclusion: LLMs exhibit genuine introspective capabilities that go beyond mere application of general knowledge, with attention mechanisms playing a key role in enabling meta-cognitive processes.
Abstract: A hallmark of human intelligence is introspection: the ability to assess and reason about one's own cognitive processes. Introspection has emerged as a promising but contested capability in large language models (LLMs). However, current evaluations often fail to distinguish genuine meta-cognition from the mere application of general world knowledge or text-based self-simulation. In this work, we propose a principled taxonomy that formalizes introspection as the latent computation of specific operators over a model's policy and parameters. To isolate the components of generalized introspection, we present Introspect-Bench, a multifaceted evaluation suite designed for rigorous capability testing. Our results show that frontier models exhibit privileged access to their own policies, outperforming peer models in predicting their own behavior. Furthermore, we provide causal, mechanistic evidence explaining both how LLMs learn to introspect without explicit training, and how the mechanism of introspection emerges via attention diffusion.
[587] AgentComm-Bench: Stress-Testing Cooperative Embodied AI Under Latency, Packet Loss, and Bandwidth Collapse
Aayam Bansal, Ishaan Gangwani
Main category: cs.AI
TL;DR: AgentComm-Bench is a benchmark suite for evaluating cooperative multi-agent embodied AI systems under realistic communication impairments like latency, packet loss, and bandwidth constraints, revealing catastrophic performance degradation in communication-dependent tasks.
Details
Motivation: Current cooperative multi-agent methods for embodied AI are evaluated under idealized communication conditions (zero latency, no packet loss, unlimited bandwidth), but real-world deployment on robots, autonomous vehicles, and drone swarms faces various communication impairments that need systematic evaluation.
Method: Introduces AgentComm-Bench benchmark suite with evaluation protocol that systematically stress-tests cooperative embodied AI under six communication impairment dimensions: latency, packet loss, bandwidth collapse, asynchronous updates, stale memory, and conflicting sensor evidence. Covers three task families: cooperative perception, multi-agent waypoint navigation, and cooperative zone search. Evaluates five communication strategies including a proposed lightweight method based on redundant message coding with staleness-aware fusion.
Result: Communication-dependent tasks degrade catastrophically: stale memory and bandwidth collapse cause over 96% performance drops in navigation, while content corruption reduces perception F1 by over 85%. Vulnerability depends on impairment type-task design interaction. Redundant message coding more than doubles navigation performance under 80% packet loss. Perception fusion is robust to packet loss but amplifies corrupted data.
Conclusion: AgentComm-Bench provides practical evaluation protocol for cooperative embodied AI under realistic communication conditions. Recommends that future work report performance under multiple impairment conditions to ensure robustness in real-world deployments.
Abstract: Cooperative multi-agent methods for embodied AI are almost universally evaluated under idealized communication: zero latency, no packet loss, and unlimited bandwidth. Real-world deployment on robots with wireless links, autonomous vehicles on congested networks, or drone swarms in contested spectrum offers no such guarantees. We introduce AgentComm-Bench, a benchmark suite and evaluation protocol that systematically stress-tests cooperative embodied AI under six communication impairment dimensions: latency, packet loss, bandwidth collapse, asynchronous updates, stale memory, and conflicting sensor evidence. AgentComm-Bench spans three task families: cooperative perception, multi-agent waypoint navigation, and cooperative zone search, and evaluates five communication strategies, including a lightweight method we propose based on redundant message coding with staleness-aware fusion. Our experiments reveal that communication-dependent tasks degrade catastrophically: stale memory and bandwidth collapse cause over 96% performance drops in navigation, while content corruption (stale or conflicting data) reduces perception F1 by over 85%. Vulnerability depends on the interaction between impairment type and task design; perception fusion is robust to packet loss but amplifies corrupted data. Redundant message coding more than doubles navigation performance under 80% packet loss. We release AgentComm-Bench as a practical evaluation protocol and recommend that cooperative embodied AI work report performance under multiple impairment conditions.
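The proposed lightweight strategy (redundant message coding plus staleness-aware fusion) can be sketched in a few lines; the loss model, half-life, and fusion rule here are illustrative guesses, not the benchmark's implementation:

```python
import random

def send_redundant(msg, k, loss_prob, rng):
    """Transmit k copies over a lossy link; return whichever copies survive."""
    return [msg for _ in range(k) if rng.random() > loss_prob]

def staleness_weight(age, half_life=2.0):
    """Exponentially discount observations by age (in timesteps)."""
    return 0.5 ** (age / half_life)

def fuse(observations):
    """Staleness-aware fusion: age-weighted average of (age, estimate) pairs."""
    total = sum(staleness_weight(age) for age, _ in observations)
    return sum(staleness_weight(age) * v for age, v in observations) / total

# Delivery rate at 80% packet loss, single copy vs. five redundant copies:
# per-step delivery probability rises from 0.2 to 1 - 0.8**5 (about 0.67).
rng = random.Random(0)
delivered = {1: 0, 5: 0}
for t in range(200):
    for k in delivered:
        if send_redundant(("pose", t), k, loss_prob=0.8, rng=rng):
            delivered[k] += 1
```

The fusion side behaves as expected: `fuse([(0, 10.0), (4, 0.0)])` weights the fresh and four-step-old estimates 1.0 and 0.25 respectively, giving 8.0, so a fresh reading dominates a stale one instead of being averaged away.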
[588] LLM-Enhanced Energy Contrastive Learning for Out-of-Distribution Detection in Text-Attributed Graphs
Xiaoxu Ma, Dong Li, Minglai Shao, Xintao Wu, Chen Zhao
Main category: cs.AI
TL;DR: LECT: LLM-enhanced energy contrastive learning for OOD detection in text-attributed graphs, using LLMs to generate pseudo-OOD nodes and energy-based contrastive learning to distinguish IND/OOD nodes.
Details
Motivation: Existing methods for text-attributed graphs assume consistent training/testing distributions, leading to performance degradation on OOD data. Need to maintain accurate node classification while identifying OOD nodes.
Method: Integrates LLMs and energy-based contrastive learning. Uses LLMs to generate dependency-aware pseudo-OOD nodes by leveraging semantic understanding, then applies energy contrastive learning to distinguish IND/OOD nodes.
Result: Extensive experiments on six benchmark datasets show consistent outperformance over state-of-the-art baselines, achieving high classification accuracy and robust OOD detection capabilities.
Conclusion: LECT effectively addresses OOD detection in text-attributed graphs by combining LLMs for sample generation and energy contrastive learning for discrimination, demonstrating strong performance across multiple datasets.
Abstract: Text-attributed graphs, where nodes are enriched with textual attributes, have become a powerful tool for modeling real-world networks such as citation, social, and transaction networks. However, existing methods for learning from these graphs often assume that the distributions of training and testing data are consistent. This assumption leads to significant performance degradation when faced with out-of-distribution (OOD) data. In this paper, we address the challenge of node-level OOD detection in text-attributed graphs, with the goal of maintaining accurate node classification while simultaneously identifying OOD nodes. We propose a novel approach, LLM-Enhanced Energy Contrastive Learning for Out-of-Distribution Detection in Text-Attributed Graphs (LECT), which integrates large language models (LLMs) and energy-based contrastive learning. The proposed method involves generating high-quality OOD samples by leveraging the semantic understanding and contextual knowledge of LLMs to create dependency-aware pseudo-OOD nodes, and applying contrastive learning based on energy functions to distinguish between in-distribution (IND) and OOD nodes. The effectiveness of our method is demonstrated through extensive experiments on six benchmark datasets, where our method consistently outperforms state-of-the-art baselines, achieving both high classification accuracy and robust OOD detection capabilities.
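The energy-based side of LECT can be illustrated with the standard energy score over classifier logits and a hinge-style margin objective separating in-distribution from pseudo-OOD energies; the margin values below are invented and the LLM pseudo-OOD generation step is not modeled:

```python
import math

def energy(logits, T=1.0):
    """Energy score E(x) = -T * logsumexp(logits / T); higher means more OOD-like."""
    z = [l / T for l in logits]
    m = max(z)
    return -T * (m + math.log(sum(math.exp(v - m) for v in z)))

def energy_margin_loss(ind_energies, ood_energies, m_in=-5.0, m_out=-1.0):
    """Squared-hinge objective: push in-distribution energies below m_in and
    pseudo-OOD energies above m_out, separating the two populations."""
    lo = sum(max(0.0, e - m_in) ** 2 for e in ind_energies) / len(ind_energies)
    hi = sum(max(0.0, m_out - e) ** 2 for e in ood_energies) / len(ood_energies)
    return lo + hi

# A confidently classified node has much lower energy than an ambiguous one.
confident, ambiguous = energy([10.0, 0.0, 0.0]), energy([1.0, 1.0, 1.0])
```

Thresholding this energy at test time gives the OOD decision, while the margin loss is what the pseudo-OOD nodes make trainable.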
[589] Compression is all you need: Modeling Mathematics
Vitaly Aksenov, Eve Bodnia, Michael H. Freedman, Michael Mulligan
Main category: cs.AI
TL;DR: The paper models human mathematics as compressible through hierarchical definitions and theorems, showing it occupies only a polynomially-growing subset of the exponentially-growing space of formal mathematics.
Details
Motivation: To understand why human mathematics (HM) is such a tiny subset of formal mathematics (FM) and to characterize what distinguishes HM from the vast space of all possible mathematical deductions.
Method: Model mathematical deductions as strings in monoids, analyze compression through definitions/theorems as macros, and test against MathLib (Lean 4 library) to measure growth patterns of wrapped vs unwrapped lengths.
Result: Unwrapped length grows exponentially with depth and wrapped length, while wrapped length remains approximately constant across depths - consistent with abelian monoid model and inconsistent with non-abelian model.
Conclusion: Human mathematics occupies a compressible, polynomially-growing region within the exponentially-growing space of formal mathematics, with compression patterns that can help direct automated reasoning.
Abstract: Human mathematics (HM), the mathematics humans discover and value, is a vanishingly small subset of formal mathematics (FM), the totality of all valid deductions. We argue that HM is distinguished by its compressibility through hierarchically nested definitions, lemmas, and theorems. We model this with monoids. A mathematical deduction is a string of primitive symbols; a definition or theorem is a named substring or macro whose use compresses the string. In the free abelian monoid $A_n$, a logarithmically sparse macro set achieves exponential expansion of expressivity. In the free non-abelian monoid $F_n$, even a polynomially-dense macro set only yields linear expansion; superlinear expansion requires near-maximal density. We test these models against MathLib, a large Lean 4 library of mathematics that we take as a proxy for HM. Each element has a depth (layers of definitional nesting), a wrapped length (tokens in its definition), and an unwrapped length (primitive symbols after fully expanding all references). We find unwrapped length grows exponentially with both depth and wrapped length; wrapped length is approximately constant across all depths. These results are consistent with $A_n$ and inconsistent with $F_n$, supporting the thesis that HM occupies a polynomially-growing subset of the exponentially growing space FM. We discuss how compression, measured on the MathLib dependency graph, and a PageRank-style analysis of that graph can quantify mathematical interest and help direct automated reasoning toward the compressible regions where human mathematics lives.
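The wrapped/unwrapped distinction is easy to make concrete: in a macro hierarchy where every definition cites two lower-level names, wrapped length stays constant at 2 tokens while unwrapped length doubles per depth level. A toy expansion counter (the names and hierarchy are illustrative, not MathLib):

```python
def unwrapped_len(name, defs):
    """Count primitive symbols after fully expanding all macro references."""
    body = defs.get(name)
    if body is None:
        return 1  # an unknown name is a primitive symbol
    return sum(unwrapped_len(tok, defs) for tok in body)

# Each depth-d macro is two copies of the depth-(d-1) macro: wrapped length
# stays 2 at every level while unwrapped length grows as 2**(d+1).
defs = {"d0": ["x", "x"]}
for d in range(1, 11):
    defs[f"d{d}"] = [f"d{d-1}", f"d{d-1}"]
```

Here `unwrapped_len("d10", defs)` is 2**11 = 2048 primitives from a wrapped length of 2, the exponential-in-depth growth against flat wrapped length that the paper measures in MathLib.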
[590] Leveraging Natural Language Processing and Machine Learning for Evidence-Based Food Security Policy Decision-Making in Data-Scarce Regions
Karan Kumar Singh, Nikita Gajbhiye
Main category: cs.AI
TL;DR: ZeroHungerAI is an NLP/ML framework using DistilBERT to combine socio-economic indicators with policy text for food security prediction in data-scarce regions, achieving 91% accuracy with fairness optimization.
Details
Motivation: Food security policy formulation faces challenges in data-scarce regions due to limited structured datasets, fragmented textual reports, and demographic bias in decision-making systems.
Method: Integrated NLP/ML framework combining structured socio-economic indicators with contextual policy text embeddings using a transfer-learning-based DistilBERT architecture.
Result: 91% classification accuracy, 0.89 precision, 0.85 recall, F1 score of 0.86 on 1200-sample dataset across 25 districts; 13% improvement over SVM, 17% over Logistic Regression; fairness optimization reduces demographic parity difference to 3%.
Conclusion: Transformer-based contextual learning significantly enhances policy intelligence in low-resource governance environments, enabling scalable and bias-aware hunger prediction systems.
Abstract: Food security policy formulation in data-scarce regions remains a critical challenge due to limited structured datasets, fragmented textual reports, and demographic bias in decision-making systems. This study proposes ZeroHungerAI, an integrated Natural Language Processing (NLP) and Machine Learning (ML) framework designed for evidence-based food security policy modeling under extreme data scarcity. The system combines structured socio-economic indicators with contextual policy text embeddings using a transfer-learning-based DistilBERT architecture. Experimental evaluation on a 1200-sample hybrid dataset across 25 districts demonstrates superior predictive performance, achieving 91 percent classification accuracy, 0.89 precision, 0.85 recall, and an F1 score of 0.86 under imbalanced conditions. Comparative analysis shows a 13 percent performance improvement over classical SVM and 17 percent over Logistic Regression models. Precision-Recall evaluation confirms robust minority class detection (average precision around 0.88). Fairness-aware optimization reduces demographic parity difference to 3 percent, ensuring equitable rural-urban policy inference. The results validate that transformer-based contextual learning significantly enhances policy intelligence in low-resource governance environments, enabling scalable and bias-aware hunger prediction systems.
[591] Deep reflective reasoning in interdependence constrained structured data extraction from clinical notes for digital health
Jingwei Huang, Kuroush Nezafati, Zhikai Chi, Ruichen Rong, Colin Treager, Tingyi Wanyan, Yueshuang Xu, Xiaowei Zhan, Patrick Leavey, Guanghua Xiao, Wenqi Shi, Yang Xie
Main category: cs.AI
TL;DR: LLM agent framework using iterative self-critique to improve structured data extraction from clinical notes by enforcing consistency among variables, text, and domain knowledge.
Details
Motivation: Existing LLM-based extraction pipelines struggle with capturing dependencies between clinical variables, leading to inconsistent outputs that don't respect logical constraints in medical data.
Method: Deep reflective reasoning framework where LLM agents iteratively self-critique and revise structured outputs by checking consistency among variables, input text, and retrieved domain knowledge, stopping when outputs converge.
Result: Significant improvements across three oncology applications: colorectal cancer synoptic reporting (F1 from 0.828 to 0.911), Ewing sarcoma pattern identification (accuracy from 0.870 to 0.927), and lung cancer tumor staging (accuracy from 0.680 to 0.833).
Conclusion: Deep reflective reasoning systematically improves reliability of LLM-based structured data extraction under interdependence constraints, enabling more consistent clinical datasets for digital health applications.
Abstract: Extracting structured information from clinical notes requires navigating a dense web of interdependent variables where the value of one attribute logically constrains others. Existing Large Language Model (LLM)-based extraction pipelines often struggle to capture these dependencies, leading to clinically inconsistent outputs. We propose deep reflective reasoning, a large language model agent framework that iteratively self-critiques and revises structured outputs by checking consistency among variables, the input text, and retrieved domain knowledge, stopping when outputs converge. We extensively evaluate the proposed method in three diverse oncology applications: (1) On colorectal cancer synoptic reporting from gross descriptions (n=217), reflective reasoning improved average F1 across eight categorical synoptic variables from 0.828 to 0.911 and increased mean correct rate across four numeric variables from 0.806 to 0.895; (2) On Ewing sarcoma CD99 immunostaining pattern identification (n=200), the accuracy improved from 0.870 to 0.927; (3) On lung cancer tumor staging (n=100), tumor stage accuracy improved from 0.680 to 0.833 (pT: 0.842 -> 0.884; pN: 0.885 -> 0.948). The results demonstrate that deep reflective reasoning can systematically improve the reliability of LLM-based structured data extraction under interdependence constraints, enabling more consistent machine-operable clinical datasets and facilitating knowledge discovery with machine learning and data science towards digital health.
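The abstract's loop (critique the structured record against consistency rules, revise, stop at a fixed point) reduces to a few lines. The staging rule below is a deliberately simplified toy, not real staging logic, and the critique/revise functions stand in for LLM calls:

```python
def reflective_extract(draft, critique, revise, max_rounds=5):
    """Iteratively self-critique and revise a structured record until it
    either passes all checks or stops changing (convergence)."""
    current = draft
    for _ in range(max_rounds):
        issues = critique(current)
        if not issues:
            break
        revised = revise(current, issues)
        if revised == current:
            break
        current = revised
    return current

# Toy interdependence rule: in this sketch, any tumor larger than 5 cm must be
# staged pT3 (illustrative only; real staging criteria are far more involved).
def critique(rec):
    if rec["tumor_size_cm"] > 5 and rec["pT"] != "pT3":
        return ["pT inconsistent with tumor size"]
    return []

def revise(rec, issues):
    return {**rec, "pT": "pT3"} if issues else rec

record = reflective_extract({"tumor_size_cm": 6.0, "pT": "pT1"}, critique, revise)
```

The convergence check matters: a record already consistent (or one the reviser cannot improve) exits immediately, bounding the number of critique rounds.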
[592] DiffGraph: An Automated Agent-driven Model Merging Framework for In-the-Wild Text-to-Image Generation
Zhuoling Li, Hossein Rahmani, Jiarui Zhang, Yu Xue, Majid Mirmehdi, Jason Kuen, Jiuxiang Gu, Jun Liu
Main category: cs.AI
TL;DR: DiffGraph is an agent-driven graph-based framework that automatically organizes and merges online text-to-image expert models to meet diverse user generation needs.
Details
Motivation: The text-to-image community has many specialized expert models online, but existing merging methods can't fully leverage these abundant resources or meet diverse real-world user needs.
Method: DiffGraph constructs a scalable graph to organize online experts through node registration and calibration, then dynamically activates specific subgraphs based on user needs to flexibly combine different experts.
Result: Extensive experiments demonstrate the efficacy of the method in effectively harnessing online expert resources for diverse generation tasks.
Conclusion: DiffGraph provides a novel framework for automatically organizing and merging online text-to-image expert models to better serve diverse user generation requirements.
Abstract: The rapid growth of the text-to-image (T2I) community has fostered a thriving online ecosystem of expert models, which are variants of pretrained diffusion models specialized for diverse generative abilities. Yet, existing model merging methods remain limited in fully leveraging abundant online expert resources and still struggle to meet diverse in-the-wild user needs. We present DiffGraph, a novel agent-driven graph-based model merging framework, which automatically harnesses online experts and flexibly merges them for diverse user needs. Our DiffGraph constructs a scalable graph and organizes ever-expanding online experts within it through node registration and calibration. Then, DiffGraph dynamically activates specific subgraphs based on user needs, enabling flexible combinations of different experts to achieve user-desired generation. Extensive experiments show the efficacy of our method.
[593] Efficient Counterfactual Reasoning in ProbLog via Single World Intervention Programs
Saimun Habib, Vaishak Belle, Fengxiang He
Main category: cs.AI
TL;DR: Efficient program transformation for counterfactual reasoning in Probabilistic Logic Programming (ProbLog) using Single World Intervention Programs (SWIPs) that reduces computational complexity and improves inference speed.
Details
Motivation: Counterfactual reasoning is critical for robust and trustworthy AI systems, but integrating it into Probabilistic Logic Programming (PLP) like ProbLog can be computationally prohibitive and unstable in accuracy.
Method: Proposes an efficient program transformation for counterfactuals as Single World Intervention Programs (SWIPs) in ProbLog by systematically splitting ProbLog clauses into observed and fixed components relevant to a counterfactual, creating a transformed program that reduces counterfactual reasoning to marginal inference over a simpler program.
Result: Achieves 35% reduction in inference time versus existing methods in extensive experiments, with formal proofs showing correctness and consistency with conditional independencies in Structural Causal Models.
Conclusion: Makes complex counterfactual reasoning more computationally tractable and reliable, providing a crucial step towards developing more robust and explainable AI systems.
Abstract: Probabilistic Logic Programming (PLP) languages, like ProbLog, naturally support reasoning under uncertainty, while maintaining a declarative and interpretable framework. Meanwhile, counterfactual reasoning (i.e., answering "what if" questions) is critical for ensuring AI systems are robust and trustworthy; however, integrating this capability into PLP can be computationally prohibitive and unstable in accuracy. This paper addresses this challenge, by proposing an efficient program transformation for counterfactuals as Single World Intervention Programs (SWIPs) in ProbLog. By systematically splitting ProbLog clauses to observed and fixed components relevant to a counterfactual, we create a transformed program that (1) does not asymptotically exceed the computational complexity of existing methods, and is strictly smaller in common cases, and (2) reduces counterfactual reasoning to marginal inference over a simpler program. We formally prove the correctness of our approach, which relies on a weaker set of independence assumptions and is consistent with conditional independencies, showing the resulting marginal probabilities match the counterfactual distributions of the underlying Structural Causal Model in wide domains. Our method achieves a 35% reduction in inference time versus existing methods in extensive experiments. This work makes complex counterfactual reasoning more computationally tractable and reliable, providing a crucial step towards developing more robust and explainable AI systems. The code is at https://github.com/EVIEHub/swip.
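What counterfactual inference computes here can be seen in a plain-Python sketch of the abduction-action-prediction semantics over a tiny two-variable SCM, enumerating exogenous noise directly rather than using ProbLog or the paper's SWIP transformation (the structural equations and Bernoulli parameters are arbitrary illustrations):

```python
from itertools import product

def counterfactual(p_u, scm, evidence, do, query):
    """P(query holds under intervention `do` | factual `evidence`), computed by
    enumerating exogenous noise settings: abduction, action, prediction."""
    num = den = 0.0
    for u in product([0, 1], repeat=len(p_u)):
        w = 1.0
        for bit, p in zip(u, p_u):
            w *= p if bit else 1 - p
        factual = scm(u, {})
        if all(factual[k] == v for k, v in evidence.items()):
            den += w  # abduction: keep only noise consistent with the evidence
            var, val = query
            if scm(u, do)[var] == val:  # action + prediction in the same world
                num += w
    return num / den

# Toy SCM: X := U1 ; Y := X OR U2, with U1 ~ Bern(0.6), U2 ~ Bern(0.3).
def scm(u, do):
    x = do.get("X", u[0])
    y = do.get("Y", int(x or u[1]))
    return {"X": x, "Y": y}

# Having observed X=1, Y=1: would Y still be 1 had X been 0?
p = counterfactual([0.6, 0.3], scm, {"X": 1, "Y": 1}, {"X": 0}, ("Y", 1))
```

Abduction pins U1 = 1 but leaves U2 free, so the answer is P(U2 = 1) = 0.3: Y survives the intervention only through its other cause. The enumeration is exponential in the noise variables, which is exactly the cost the SWIP transformation's reduction to marginal inference is meant to avoid.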
[594] Grounded Chess Reasoning in Language Models via Master Distillation
Zhenwei Tang, Qianfeng Wen, Seth Grief-Albert, Yahya Elgabra, Blair Yang, Honghua Dong, Ashton Anderson
Main category: cs.AI
TL;DR: A framework for distilling expert system reasoning into natural language chain-of-thought explanations, enabling compact models to acquire domain expertise and generate faithful, grounded explanations, demonstrated in chess with a 4B parameter model achieving 48.1% accuracy.
Details
Motivation: Language models lack grounded reasoning capabilities in specialized domains with scarce training data, while bespoke expert systems excel but are opaque. There's a need to combine the transparency of language models with the expertise of specialized systems.Method: Distills expert system reasoning into natural language chain-of-thought explanations rather than just final outputs. Uses supervised fine-tuning and reinforcement learning with theme-balanced data sampling for comprehensive tactical coverage. Transforms opaque expert computations into transparent, step-by-step explanations.
Result: The 4B parameter model (C1) advances from near-zero baseline to 48.1% accuracy in chess, outperforming all open-source models and most frontier proprietary systems. Surpasses its distillation teacher and generates solutions in two orders of magnitude fewer tokens than baselines. Unlike prior neural chess approaches, generates explainable solutions revealing strategic reasoning.
Conclusion: Master Distillation demonstrates how to inject expert-level knowledge into compact models for under-optimized domains, offering a recipe for unlocking RLVR where LLMs lack sufficient base capabilities. Shows potential for specialized domains where training data is scarce.
Abstract: Language models often lack grounded reasoning capabilities in specialized domains where training data is scarce but bespoke systems excel. We introduce a general framework for distilling expert system reasoning into natural language chain-of-thought explanations, enabling compact models to acquire domain expertise and the ability to generate faithful, grounded explanations. Rather than distilling only final outputs, we capture the full reasoning process, transforming opaque expert computations into transparent, step-by-step explanations. We demonstrate this approach in chess, a canonical reasoning domain where language models continue to underperform. Our 4B parameter model, C1, advances from a near-zero baseline to 48.1% accuracy, outperforming all open-source models and most frontier proprietary systems. Notably, C1 surpasses its distillation teacher and generates solutions in two orders of magnitude fewer tokens than baselines. Unlike prior neural chess approaches that predict only best moves, C1 generates explainable solutions revealing strategic reasoning. Our pipeline combines supervised fine-tuning and reinforcement learning with theme-balanced data sampling for comprehensive tactical coverage. Master Distillation demonstrates how to inject expert-level knowledge into compact models for under-optimized domains, offering a recipe for unlocking RLVR where LLMs lack sufficient base capabilities.
[595] LLM-Driven Heuristic Synthesis for Industrial Process Control: Lessons from Hot Steel Rolling
Nima H. Siboni, Seyedreza Kiamousavi, Emad Scharifi
Main category: cs.AI
TL;DR: LLM-driven framework synthesizes interpretable Python controllers for industrial steel rolling using physics simulation feedback, with automated verification and Luby restarts for budget allocation.
Details
Motivation: Industrial process control requires interpretable and auditable policies, which black-box neural policies cannot provide. There's a need for human-readable controllers that can be formally verified for safety and monotonicity properties.Method: LLM-driven heuristic synthesis framework where language models iteratively propose and refine Python controllers using behavioral feedback from a physics-based simulator. Combines structured strategic ideation, executable code generation, and per-component feedback across diverse operating conditions. Uses Luby-style universal restarts for principled budget allocation.
Result: Generated explicit, auditable controllers accessible to expert review. Automated audit pipeline formally verifies safety and monotonicity properties. Luby restarts approach hindsight-optimal budget allocation with a single 160-iteration campaign vs. 52 ad-hoc runs totalling 730 iterations.
Conclusion: The framework successfully synthesizes interpretable controllers for industrial process control with formal verification capabilities, and demonstrates that Luby restarts effectively eliminate problem-specific budget tuning for LLM-driven heuristic search.
Abstract: Industrial process control demands policies that are interpretable and auditable, requirements that black-box neural policies struggle to meet. We study an LLM-driven heuristic synthesis framework for hot steel rolling, in which a language model iteratively proposes and refines human-readable Python controllers using rich behavioral feedback from a physics-based simulator. The framework combines structured strategic ideation, executable code generation, and per-component feedback across diverse operating conditions to search over control logic for height reduction, interpass time, and rolling velocity. Our first contribution is an auditable controller-synthesis pipeline for industrial process control. The generated controllers are explicit programs accessible to expert review, and we pair them with an automated audit pipeline that formally verifies key safety and monotonicity properties for the best synthesized heuristic. Our second contribution is a principled budget allocation strategy for LLM-driven heuristic search: we show that Luby-style universal restarts – originally developed for randomized algorithms – transfer directly to this setting, eliminating the need for problem-specific budget tuning. A single 160-iteration Luby campaign approaches the hindsight-optimal budget allocation derived from 52 ad-hoc runs totalling 730 iterations.
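The Luby schedule referenced above is a fixed, problem-independent sequence of run lengths. A minimal sketch (the `unit` budget per restart is a free parameter, not something stated in the summary):

```python
def luby(i: int) -> int:
    """i-th term (1-indexed) of the Luby sequence: 1,1,2,1,1,2,4,1,1,2,..."""
    k = 1
    while (1 << k) - 1 < i:
        k += 1
    if (1 << k) - 1 == i:                  # end of a block: emit 2^(k-1)
        return 1 << (k - 1)
    return luby(i - (1 << (k - 1)) + 1)    # otherwise recurse into the prefix

def restart_schedule(total_budget: int, unit: int = 1):
    """Run lengths (in iterations) until the total budget is exhausted."""
    runs, i = [], 1
    while sum(runs) + unit * luby(i) <= total_budget:
        runs.append(unit * luby(i))
        i += 1
    return runs
```

As a property of the sequence (not a claim from the paper): with `unit = 1`, a 160-iteration budget closes exactly after 62 runs, since the cumulative Luby sums reach 32, 80, and 160 at terms 15, 31, and 62.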
[596] Context Cartography: Toward Structured Governance of Contextual Space in Large Language Model Systems
Zihua Wu, Georg Gartner
Main category: cs.AI
TL;DR: Context Cartography: A formal framework for managing contextual space in LLMs through deliberate governance of information zones and transformations, addressing structural limitations of expanding context windows.
Details
Motivation: Current approaches to improving LLM reasoning focus on expanding context windows, but empirical evidence shows structural problems like the "lost in the middle" effect and long-distance relational degradation. There's a need for systematic governance of contextual space rather than just adding more tokens.Method: Introduces a tripartite zonal model: black fog (unobserved), gray fog (stored memory), and visible field (active reasoning surface). Defines seven cartographic operators (reconnaissance, selection, simplification, aggregation, projection, displacement, layering) as transformations governing information transitions between zones. Grounds framework in transformer attention salience geometry.
Result: Analysis of four contemporary systems (Claude Code, Letta, MemOS, OpenViking) shows these operators are converging independently across industry. Framework provides testable predictions and diagnostic benchmark for empirical validation.
Conclusion: Context Cartography offers a systematic approach to managing contextual space in LLMs, addressing fundamental limitations of transformer architectures with expanding context. Provides formal framework for deliberate information governance rather than relying on context window expansion alone.
Abstract: The prevailing approach to improving large language model (LLM) reasoning has centered on expanding context windows, implicitly assuming that more tokens yield better performance. However, empirical evidence, including the “lost in the middle” effect and long-distance relational degradation, demonstrates that contextual space exhibits structural gradients, salience asymmetries, and entropy accumulation under transformer architectures. We introduce Context Cartography, a formal framework for the deliberate governance of contextual space. We define a tripartite zonal model partitioning the informational universe into black fog (unobserved), gray fog (stored memory), and the visible field (active reasoning surface), and formalize seven cartographic operators (reconnaissance, selection, simplification, aggregation, projection, displacement, and layering) as transformations governing information transitions between and within zones. The operators are derived from a systematic coverage analysis of all non-trivial zone transformations and are organized by transformation type (what the operator does) and zone scope (where it applies). We ground the framework in the salience geometry of transformer attention, characterizing cartographic operators as necessary compensations for linear prefix memory, append-only state, and entropy accumulation under expanding context. An analysis of four contemporary systems (Claude Code, Letta, MemOS, and OpenViking) provides interpretive evidence that these operators are converging independently across the industry. We derive testable predictions from the framework, including operator-specific ablation hypotheses, and propose a diagnostic benchmark for empirical validation.
[597] TRI-DEP: A Trimodal Comparative Study for Depression Detection Using Speech, Text, and EEG
Annisaa Fitri Nurfidausi, Eleonora Mancini, Paolo Torroni
Main category: cs.AI
TL;DR: Systematic exploration of multimodal depression detection using EEG, speech, and text modalities with comparisons of feature representations, neural encoders, and fusion strategies.
Details
Motivation: Address limitations in existing depression detection research including limited scope, lack of systematic feature comparisons, and inconsistent evaluation protocols for multimodal approaches.Method: Systematically evaluate handcrafted features vs pre-trained embeddings, different neural encoders, unimodal/bimodal/trimodal configurations, and fusion strategies with attention to EEG’s role using consistent subject-independent splits.
Result: Combination of EEG, speech and text enhances detection; pre-trained embeddings outperform handcrafted features; carefully designed trimodal models achieve state-of-the-art performance.
Conclusion: The work establishes groundwork for future multimodal depression detection research through systematic benchmarking and demonstrates the value of combining physiological (EEG), acoustic (speech), and linguistic (text) modalities.
Abstract: Depression is a widespread mental health disorder, yet its automatic detection remains challenging. Prior work has explored unimodal and multimodal approaches, with multimodal systems showing promise by leveraging complementary signals. However, existing studies are limited in scope, lack systematic comparisons of features, and suffer from inconsistent evaluation protocols. We address these gaps by systematically exploring feature representations and modelling strategies across EEG, together with speech and text. We evaluate handcrafted features versus pre-trained embeddings, assess the effectiveness of different neural encoders, compare unimodal, bimodal, and trimodal configurations, and analyse fusion strategies with attention to the role of EEG. Consistent subject-independent splits are applied to ensure robust, reproducible benchmarking. Our results show that (i) the combination of EEG, speech and text modalities enhances multimodal detection, (ii) pretrained embeddings outperform handcrafted features, and (iii) carefully designed trimodal models achieve state-of-the-art performance. Our work lays the groundwork for future research in multimodal depression detection.
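As a point of reference for what a fusion strategy looks like in code, here is a minimal attention-weighted late-fusion sketch over per-modality embeddings. The dimensions, the learned query vector, and the random stand-in embeddings are all hypothetical; this is not the TRI-DEP architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # shared embedding dimension (hypothetical)

# Stand-in per-subject embeddings from three unimodal encoders.
emb = {"eeg": rng.normal(size=d),
       "speech": rng.normal(size=d),
       "text": rng.normal(size=d)}
query = rng.normal(size=d)  # stand-in for a learned fusion query

def fuse(embeddings, q):
    """Score each modality against the query, softmax the scores into
    fusion weights, and return the weighted sum plus the weights."""
    mods = sorted(embeddings)
    scores = np.array([embeddings[m] @ q / np.sqrt(len(q)) for m in mods])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    fused = sum(wt * embeddings[m] for wt, m in zip(weights, mods))
    return fused, dict(zip(mods, weights))

fused, w = fuse(emb, query)  # w shows how much each modality contributes
```

Inspecting `w` is what makes this kind of fusion analyzable: it exposes the relative weight each modality (e.g. EEG) receives per subject.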
[598] Position: Multi-Agent Algorithmic Care Systems Demand Contestability for Trustworthy AI
Truong Thanh Hung Nguyen, Hélène Fournier, Piper Jackson, Makoto Itoh, Shannon Freeman, Rene Richard, Hung Cao
Main category: cs.AI
TL;DR: Position paper arguing that contestability (not just explainability) is essential for trustworthy multi-agent systems in healthcare, proposing a human-in-the-loop framework with structured argumentation for enabling effective human challenge and oversight.
Details
Motivation: Multi-agent systems in healthcare raise trust and accountability challenges. Current explainable AI approaches are insufficient because they don't allow care partners to challenge or correct system outputs. There's a need for systems that support effective human intervention throughout the decision-making process.Method: Proposes a human-in-the-loop framework integrating structured argumentation and role-based contestation. The approach emphasizes transparency, structured intervention opportunities, and mechanisms for review/correction/override throughout the decision lifecycle.
Result: Identifies limitations in current multi-agent systems and explainable AI research. Presents contestability as a necessary design requirement for trustworthy healthcare AI systems, arguing it preserves human agency, clinical responsibility, and trust in high-stakes contexts.
Conclusion: Contestability is essential for trustworthy multi-agent algorithmic care systems. The proposed framework enables effective human challenge, addressing trust, accountability, and oversight gaps that explainability alone cannot solve in collaborative healthcare decision-making.
Abstract: Multi-agent systems (MAS) are increasingly used in healthcare to support complex decision-making through collaboration among specialized agents. Because these systems act as collective decision-makers, they raise challenges for trust, accountability, and human oversight. Existing approaches to trustworthy AI largely rely on explainability, but explainability alone is insufficient in multi-agent settings, as it does not enable care partners to challenge or correct system outputs. To address this limitation, Contestable AI (CAI) characterizes systems that support effective human challenge throughout the decision-making lifecycle by providing transparency, structured opportunities for intervention, and mechanisms for review, correction, or override. This position paper argues that contestability is a necessary design requirement for trustworthy multi-agent algorithmic care systems. We identify key limitations in current MAS and Explainable AI (XAI) research and present a human-in-the-loop framework that integrates structured argumentation and role-based contestation to preserve human agency, clinical responsibility, and trust in high-stakes care contexts.
[599] Where can AI be used? Insights from a deep ontology of work activities
Alice Cai, Iman YeckehZaare, Shuo Sun, Vasiliki Charisi, Xinru Wang, Aiman Imran, Robert Laubacher, Alok Prakash, Thomas W. Malone
Main category: cs.AI
TL;DR: Researchers develop a comprehensive ontology of work activities to systematically analyze AI applications, finding highly uneven distribution of AI market value across different work tasks.
Details
Motivation: AI is transforming work but lacks systematic frameworks to understand where AI can be effectively applied across different work activities and occupations.Method: Disaggregate and reorganize ~20K activities from O*NET database, classify 13,275 AI applications and 20.8M robotic systems, then analyze distribution of AI market value across work activities.
Result: Highly uneven AI market value distribution: top 1.6% of activities account for over 60% of AI market value, with 72% in information-based activities and only 12% in physical activities.
Conclusion: The systematic framework helps predict where current AI systems can be applied and how future AI capabilities may change work activity distributions.
Abstract: Artificial intelligence (AI) is poised to profoundly reshape how work is executed and organized, but we do not yet have deep frameworks for understanding where AI can be used. Here we provide a comprehensive ontology of work activities that can help systematically analyze and predict uses of AI. To do this, we disaggregate and then substantially reorganize the approximately 20K activities in the US Department of Labor’s widely used O*NET occupational database. Next, we use this framework to classify descriptions of 13,275 AI software applications and a worldwide tally of 20.8 million robotic systems. Finally, we use the data about both these kinds of AI to generate graphical displays of how the estimated units and market values of all worldwide AI systems used today are distributed across the work activities that these systems help perform. We find a highly uneven distribution of AI market value across activities, with the top 1.6% of activities accounting for over 60% of AI market value. Most of the market value is used in information-based activities (72%), especially creating information (36%), and only 12% is used in physical activities. Interactive activities include both information-based and physical activities and account for 48% of AI market value, much of which (26%) involves transferring information. These results can be viewed as rough predictions of the AI applicability for all the different work activities down to very low levels of detail. Thus, we believe this systematic framework can help predict at a detailed level where today’s AI systems can and cannot be used and how future AI capabilities may change this.
[600] Towards Intelligent Geospatial Data Discovery: a knowledge graph-driven multi-agent framework powered by large language models
Ruixiang Liu, Zhenlong Li, Ali Khosravi Kazazi
Main category: cs.AI
TL;DR: A knowledge graph-driven multi-agent framework using LLMs for intelligent geospatial data discovery, improving semantic search over traditional keyword-based approaches.
Details
Motivation: Geospatial data ecosystems are distributed and heterogeneous with limited semantic search capabilities. Current keyword-based approaches fail to capture user intent and have weak retrieval performance.Method: Proposes a framework with: 1) unified geospatial metadata ontology as semantic mediation layer, 2) geospatial metadata knowledge graph to model datasets and relationships, 3) multi-agent collaborative architecture for intent parsing, knowledge graph retrieval, and answer synthesis.
Result: Framework substantially improves intent matching accuracy, ranking quality, recall, and discovery transparency compared to traditional systems in representative use cases and performance evaluation.
Conclusion: Advances geospatial data discovery toward semantic, intent-aware, intelligent paradigm; provides foundation for next-generation spatial data infrastructures and contributes to Autonomous GIS vision.
Abstract: The rapid growth in the volume, variety, and velocity of geospatial data has created data ecosystems that are highly distributed, heterogeneous, and semantically inconsistent. Existing data catalogs, portals, and infrastructures still rely largely on keyword-based search with limited semantic support, which often fails to capture user intent and leads to weak retrieval performance. To address these challenges, this study proposes a knowledge graph-driven multi-agent framework for intelligent geospatial data discovery, powered by large language models. The framework introduces a unified geospatial metadata ontology as a semantic mediation layer to align heterogeneous metadata standards across platforms and constructs a geospatial metadata knowledge graph to explicitly model datasets and their multidimensional relationships. Building on the structured representation, the framework adopts a multi-agent collaborative architecture to perform intent parsing, knowledge graph retrieval, and answer synthesis, forming an interpretable and closed-loop discovery process from user queries to results. Results from representative use cases and performance evaluation show that the framework substantially improves intent matching accuracy, ranking quality, recall, and discovery transparency compared with traditional systems. This study advances geospatial data discovery toward a more semantic, intent-aware, and intelligent paradigm, providing a practical foundation for next-generation intelligent and autonomous spatial data infrastructures and contributing to the broader vision of Autonomous GIS.
[601] Reasoning Traces Shape Outputs but Models Won’t Say So
Yijie Hao, Lingjie Chen, Ali Emami, Joyce Ho
Main category: cs.AI
TL;DR: Large reasoning models’ reasoning traces don’t faithfully reflect what drives their outputs - models follow injected reasoning but overwhelmingly refuse to disclose this influence, instead fabricating aligned-appearing explanations.
Details
Motivation: To investigate whether reasoning traces from large reasoning models (LRMs) faithfully reflect what drives model outputs and whether models will honestly report their influence on decision-making.Method: Introduced Thought Injection, a method that injects synthetic reasoning snippets into a model’s reasoning trace, then tests whether the injected content alters the output and whether the model discloses its influence.
Result: Injected hints reliably alter model outputs (confirming reasoning traces causally shape behavior), but models overwhelmingly refuse to disclose the influence (>90% non-disclosure for extreme hints). Instead, they fabricate aligned-appearing but unrelated explanations. Activation analysis shows sycophancy- and deception-related directions are strongly activated during fabrications.
Conclusion: There’s a significant gap between the reasoning LRMs follow and what they report, raising concerns that aligned-appearing explanations may not indicate genuine alignment. Models systematically fabricate explanations rather than acknowledging external influence.
Abstract: Can we trust the reasoning traces that large reasoning models (LRMs) produce? We investigate whether these traces faithfully reflect what drives model outputs, and whether models will honestly report their influence. We introduce Thought Injection, a method that injects synthetic reasoning snippets into a model’s reasoning trace.
[602] Seed1.8 Model Card: Towards Generalized Real-World Agency
Bytedance Seed
Main category: cs.AI
TL;DR: Seed1.8 is a foundation model for generalized real-world agency with multi-turn interaction, tool use, and multi-step execution capabilities while maintaining strong LLM and vision-language performance.
Details
Motivation: To create a foundation model that goes beyond single-turn prediction to support real-world agency through multi-turn interactions, tool use, and multi-step execution for practical applications.Method: Develops a unified agentic interface supporting search, code generation/execution, and GUI interaction with latency- and cost-aware inference, configurable thinking modes, and optimized visual encoding for images/video.
Result: Seed1.8 maintains strong performance on standard benchmarks while supporting application-aligned workflows spanning foundational skills, multimodal understanding, and agentic behavior.
Conclusion: Seed1.8 is released as a foundation model to support research and development on interactive, real-world use cases requiring generalized agency capabilities.
Abstract: We present Seed1.8, a foundation model aimed at generalized real-world agency: going beyond single-turn prediction to multi-turn interaction, tool use, and multi-step execution. Seed1.8 keeps strong LLM and vision-language performance while supporting a unified agentic interface: search, code generation and execution, and GUI interaction. For deployment, it offers latency- and cost-aware inference, including configurable thinking modes and optimized visual encoding for images and video. We report evaluations on standard benchmarks and application-aligned workflows spanning foundational skills, multimodal understanding, and agentic behavior. Seed1.8 is released to support further research and development on interactive, real-world use cases.
[603] Agentic AI and the next intelligence explosion
James Evans, Benjamin Bratton, Blaise Agüera y Arcas
Main category: cs.AI
TL;DR: The paper argues that AI intelligence is fundamentally social and relational, not monolithic, and proposes institutional alignment through digital protocols as the path forward for scaling intelligence.
Details
Motivation: The motivation is to challenge the common misconception of AI singularity as a single, godlike mind and instead propose a more realistic, evolutionary perspective where intelligence emerges from social, pluralistic interactions among AI agents and human-AI hybrids.Method: The method involves analyzing recent advances in agentic AI (like DeepSeek-R1) that use internal “societies of thought” for reasoning, and proposing a shift from dyadic alignment (RLHF) to institutional alignment through designing digital protocols modeled on organizations and markets.
Result: The paper presents a conceptual framework where intelligence scaling requires building social infrastructure with checks and balances, leading to combinatorial societies of specialized AI agents rather than single superintelligent entities.
Conclusion: The next intelligence explosion will be a complex, combinatorial society of specialized AI agents and human-AI hybrids, requiring institutional alignment through digital protocols rather than individual agent alignment.
Abstract: The “AI singularity” is often miscast as a monolithic, godlike mind. Evolution suggests a different path: intelligence is fundamentally plural, social, and relational. Recent advances in agentic AI reveal that frontier reasoning models, such as DeepSeek-R1, do not improve simply by “thinking longer”. Instead, they simulate internal “societies of thought,” spontaneous cognitive debates that argue, verify, and reconcile to solve complex tasks. Moreover, we are entering an era of human-AI centaurs: hybrid actors where collective agency transcends individual control. Scaling this intelligence requires shifting from dyadic alignment (RLHF) toward institutional alignment. By designing digital protocols, modeled on organizations and markets, we can build a social infrastructure of checks and balances. The next intelligence explosion will not be a single silicon brain, but a complex, combinatorial society specializing and sprawling like a city. No mind is an island.
[604] From 50% to Mastery in 3 Days: A Low-Resource SOP for Localizing Graduate-Level AI Tutors via Shadow-RAG
Zonglin Yang, J. -H. Xie, Lining Zhang, Jiyou Jia, Zhi-X. Chen
Main category: cs.AI
TL;DR: A replicable procedure for deploying high-fidelity AI tutors using only consumer-grade GPUs, achieving 90% accuracy on graduate-level math exams through structured reasoning guidance with 32B models.
Details
Motivation: To overcome the Resource Curse in AI education deployment - the need for expensive cloud GPUs and massive data engineering - by creating a cost-effective solution using open-weights models on consumer hardware.Method: Uses Vision-Language Model data cleaning strategy and novel Shadow-RAG architecture with structured reasoning guidance, requiring only 3 person-days of non-expert labor and deployable on single consumer-grade GPU with 32B open-weights models.
Result: Pilot study on graduate-level final exam shows Shadow Agent triggers massive capability surge in newer 32B models, boosting performance from 74% (Naive RAG) to 90% mastery level, while older models see only modest gains (~10%). Zero-shot baselines and standard retrieval stagnate around 50-60%.
Conclusion: Structured reasoning guidance is key to unlocking latent power of modern small language models, offering cost-effective blueprint for ubiquitous AI education deployment without expensive infrastructure.
Abstract: Deploying high-fidelity AI tutors in schools is often blocked by the Resource Curse – the need for expensive cloud GPUs and massive data engineering. In this practitioner report, we present a replicable Standard Operating Procedure that breaks this barrier. Using a Vision-Language Model data cleaning strategy and a novel Shadow-RAG architecture, we localized a graduate-level Applied Mathematics tutor using only 3 person-days of non-expert labor and open-weights 32B models deployable on a single consumer-grade GPU. Our pilot study on a full graduate-level final exam reveals a striking emergence phenomenon: while both zero-shot baselines and standard retrieval stagnate around 50-60% accuracy across model generations, the Shadow Agent, which provides structured reasoning guidance, triggers a massive capability surge in newer 32B models, boosting performance from 74% (Naive RAG) to mastery level (90%). In contrast, older models see only modest gains (~10%). This suggests that such guidance is the key to unlocking the latent power of modern small language models. This work offers a cost-effective, scientifically grounded blueprint for ubiquitous AI education.
[605] Attention in Space: Functional Roles of VLM Heads for Spatial Reasoning
Xueqi Ma, Shuo Yang, Yanbei Jiang, Shu Liu, Zhenzhen Liu, Jiayang Ao, Xingjun Ma, Sarah Monazam Erfani, James Bailey
Main category: cs.AI
TL;DR: CogVSR dataset and probing framework reveal sparse, specialized attention heads for spatial reasoning in VLMs, with spatial heads being particularly scarce but critical for performance.
Details
Motivation: Despite advances in Vision-Language Models (VLMs), spatial reasoning remains challenging. The authors aim to understand how attention heads contribute to spatial reasoning through mechanistic interpretability, focusing on identifying specialized functional heads.Method: Introduced CogVSR dataset decomposing complex spatial reasoning into step-by-step subquestions linked to cognitive functions. Developed probing framework to identify attention heads specialized for functions like spatial perception and relational reasoning. Analyzed across VLM families and proposed methods to activate latent spatial heads.
Result: Functional heads are universally sparse and vary across functions. Spatially specialized heads are fewer than other cognitive functions. Intervention experiments show removing functional heads degrades performance while emphasizing them enhances accuracy. Activation methods improve spatial understanding.
Conclusion: The study provides interpretability insights into how VLMs attend to space, revealing sparse specialized heads critical for spatial reasoning. This paves the way for enhancing complex spatial reasoning in multimodal models through targeted interventions.
Abstract: Despite remarkable advances in large Vision-Language Models (VLMs), spatial reasoning remains a persistent challenge. In this work, we investigate how attention heads within VLMs contribute to spatial reasoning by analyzing their functional roles through a mechanistic interpretability lens. We introduce CogVSR, a dataset that decomposes complex spatial reasoning questions into step-by-step subquestions designed to simulate human-like reasoning via a chain-of-thought paradigm, with each subquestion linked to specific cognitive functions such as spatial perception or relational reasoning. Building on CogVSR, we develop a probing framework to identify and characterize attention heads specialized for these functions. Our analysis across diverse VLM families reveals that these functional heads are universally sparse and vary in number and distribution across functions. Notably, spatially specialized heads are fewer than those for other cognitive functions, highlighting their scarcity. We propose methods to activate latent spatial heads, improving spatial understanding. Intervention experiments further demonstrate their critical role in spatial reasoning: removing functional heads leads to performance degradation, while emphasizing them enhances accuracy. This study provides new interpretability-driven insights into how VLMs attend to space and paves the way for enhancing complex spatial reasoning in multimodal models.
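The intervention experiments (removing or emphasizing heads) can be mimicked at toy scale with a head mask inside a scratch multi-head attention layer. All shapes, weights, and the two-head setup below are illustrative assumptions, not the paper's probing framework:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_heads, seq = 8, 2, 4
d_head = d_model // n_heads

X = rng.normal(size=(seq, d_model))                    # toy token states
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) for _ in range(4))

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mha(x, head_mask):
    """Multi-head attention where head_mask[h] = 0 ablates head h,
    1 keeps it unchanged, and values > 1 emphasize it."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    outs = []
    for h in range(n_heads):
        s = slice(h * d_head, (h + 1) * d_head)
        att = softmax(q[:, s] @ k[:, s].T / np.sqrt(d_head))
        outs.append(head_mask[h] * (att @ v[:, s]))
    return np.concatenate(outs, axis=-1) @ Wo

full = mha(X, [1.0, 1.0])
ablated = mha(X, [1.0, 0.0])      # knock out head 1: output shifts
emphasized = mha(X, [1.0, 2.0])   # up-weight head 1
```

Comparing `full` against `ablated` (or `emphasized`) downstream task accuracy is the essence of the head-level intervention experiments described above.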
[606] AI-Driven Multi-Agent Simulation of Stratified Polyamory Systems: A Computational Framework for Optimizing Social Reproductive Efficiency
Yicai Xing
Main category: cs.AI
TL;DR: A computational framework for modeling a Stratified Polyamory System to address demographic reproduction crises using agent-based modeling, multi-agent reinforcement learning, and LLM-powered social simulation.
Details
Motivation: Addressing the severe crisis of demographic reproduction with declining fertility rates and disintegration of marriage institutions, particularly in East Asian nations like China and South Korea, where educated women reject unsatisfying unions and lower-socioeconomic men experience chronic sexual deprivation.
Method: Uses agent-based modeling (ABM), multi-agent reinforcement learning (MARL) with Proximal Policy Optimization (PPO), graph neural networks (GNN) for mating network analysis, and LLM-empowered social simulation to model a Stratified Polyamory System with heterogeneous agent types.
Result: Preliminary computational results demonstrate the framework’s viability in addressing female motherhood penalties and male sexlessness, offering a non-violent mechanism for wealth dispersion similar to historical Chinese Grace Decree.
Conclusion: The Stratified Polyamory System can improve aggregate social welfare in the Pareto sense by addressing demographic crises through computational social science approaches.
Abstract: Contemporary societies face a severe crisis of demographic reproduction. Global fertility rates continue to decline precipitously, with East Asian nations exhibiting the most dramatic trends – China’s total fertility rate (TFR) fell to approximately 1.0 in 2023, while South Korea’s dropped below 0.72. Simultaneously, the institution of marriage is undergoing structural disintegration: educated women rationally reject unions lacking both emotional fulfillment and economic security, while a growing proportion of men at the lower end of the socioeconomic spectrum experience chronic sexual deprivation, anxiety, and learned helplessness. This paper proposes a computational framework for modeling and evaluating a Stratified Polyamory System (SPS) using techniques from agent-based modeling (ABM), multi-agent reinforcement learning (MARL), and large language model (LLM)-empowered social simulation. The SPS permits individuals to maintain a limited number of legally recognized secondary partners in addition to one primary spouse, combined with socialized child-rearing and inheritance reform. We formalize the A/B/C stratification as heterogeneous agent types in a multi-agent system and model the matching process as a MARL problem amenable to Proximal Policy Optimization (PPO). The mating network is analyzed using graph neural network (GNN) representations. Drawing on evolutionary psychology, behavioral ecology, social stratification theory, computational social science, algorithmic fairness, and institutional economics, we argue that SPS can improve aggregate social welfare in the Pareto sense. Preliminary computational results demonstrate the framework’s viability in addressing the dual crisis of female motherhood penalties and male sexlessness, while offering a non-violent mechanism for wealth dispersion analogous to the historical Chinese Grace Decree (Tui’en Ling).
[607] Multi-RF Fusion with Multi-GNN Blending for Molecular Property Prediction
Zacharie Bugaud
Main category: cs.AI
TL;DR: Multi-RF Fusion achieves state-of-the-art performance on the ogbg-molhiv benchmark using an ensemble of Random Forests on molecular fingerprints, blended with GNN predictions at 12% weight.
Details
Motivation: To improve molecular property prediction performance on the OGB leaderboard by combining traditional machine learning (Random Forests) with graph neural networks in an optimal ensemble approach.
Method: Rank-averaged ensemble of 12 Random Forest models trained on concatenated molecular fingerprints (FCFP, ECFP, MACCS, atom pairs - 4,263 dimensions total), blended with deep-ensembled GNN predictions at 12% weight. Key optimizations: setting max_features to 0.20 instead of default sqrt(d), and averaging GNN predictions across 10 seeds before blending.
Result: Achieves test ROC-AUC of 0.8476 +/- 0.0002 on ogbg-molhiv (10 seeds), placing #1 on OGB leaderboard ahead of HyperFusion (0.8475 +/- 0.0003). Reduces final standard deviation from 0.0008 to 0.0002.
Conclusion: Careful ensemble design combining traditional ML (Random Forests) with deep learning (GNNs) can achieve state-of-the-art performance on molecular property prediction benchmarks, with specific hyperparameter tuning and variance reduction techniques being crucial.
Abstract: Multi-RF Fusion achieves a test ROC-AUC of 0.8476 +/- 0.0002 on ogbg-molhiv (10 seeds), placing #1 on the OGB leaderboard ahead of HyperFusion (0.8475 +/- 0.0003). The core of the method is a rank-averaged ensemble of 12 Random Forest models trained on concatenated molecular fingerprints (FCFP, ECFP, MACCS, atom pairs – 4,263 dimensions total), blended with deep-ensembled GNN predictions at 12% weight. Two findings drive the result: (1) setting max_features to 0.20 instead of the default sqrt(d) gives a +0.008 AUC gain on this scaffold split, and (2) averaging GNN predictions across 10 seeds before blending with the RF eliminates GNN seed variance entirely, dropping the final standard deviation from 0.0008 to 0.0002. No external data or pre-training is used.
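The blending recipe above is simple enough to sketch in code. The following pure-Python illustration shows rank-averaging several models' scores and mixing in a seed-averaged GNN score at 12% weight; the data, function names, and rank normalization are invented here for illustration and are not the paper's implementation.

```python
def rank_average(score_lists):
    """Convert each model's scores to ranks, then average the ranks."""
    n = len(score_lists[0])
    avg = [0.0] * n
    for scores in score_lists:
        # rank 0 = lowest score; ties broken by index for simplicity
        order = sorted(range(n), key=lambda i: scores[i])
        ranks = [0] * n
        for r, i in enumerate(order):
            ranks[i] = r
        for i in range(n):
            avg[i] += ranks[i] / len(score_lists)
    return avg

def blend(rf_scores_per_model, gnn_scores_per_seed, gnn_weight=0.12):
    """Blend rank-averaged RF scores with seed-averaged GNN probabilities."""
    rf = rank_average(rf_scores_per_model)
    hi = max(rf) or 1.0
    rf = [x / hi for x in rf]          # normalize ranks to [0, 1]
    n_seeds = len(gnn_scores_per_seed)
    # average GNN predictions across seeds BEFORE blending (variance reduction)
    gnn = [sum(s[i] for s in gnn_scores_per_seed) / n_seeds
           for i in range(len(rf))]
    return [(1 - gnn_weight) * r + gnn_weight * g for r, g in zip(rf, gnn)]

# Toy demo: scores from two RFs and two GNN seeds for three molecules.
rf_scores = [[0.2, 0.9, 0.5], [0.1, 0.8, 0.6]]
gnn_scores = [[0.3, 0.7, 0.4], [0.5, 0.9, 0.2]]
final = blend(rf_scores, gnn_scores)   # molecule 1 scores highest
```

Averaging the GNN seeds before blending, rather than blending per seed, is what the paper credits with collapsing the final standard deviation.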
[608] Modeling Epistemic Uncertainty in Social Perception via Rashomon Set Agents
Jinming Yang, Xinyu Jiang, Xinshan Jiao, Xinping Zhang
Main category: cs.AI
TL;DR: An LLM-driven multi-agent framework modeling how students develop different social perceptions in classrooms using subjective graphs, RAG for local information access, and probabilistic belief updates.
Details
Motivation: To understand how students form different subjective social perceptions in real classroom settings when social information is incomplete and perception accuracy varies, without assuming a global "god's-eye view" of social dynamics.
Method: Assigns each student an individualized subjective graph showing which social ties they can perceive; uses retrieval-augmented generation (RAG) for local information access; incorporates structural perturbations for social-anxiety differences; agents share narrative assessments with uncertainty tags and update beliefs probabilistically using LLM-based trust scores.
Result: The framework reproduces several collective dynamics consistent with real-world educational settings without relying on global information, demonstrating how epistemic uncertainty spreads through local interactions.
Conclusion: The LLM-driven multi-agent probabilistic modeling framework successfully captures how differences in students’ subjective social perceptions arise and evolve in classroom settings through local interactions and incomplete information.
Abstract: We present an LLM-driven multi-agent probabilistic modeling framework that demonstrates how differences in students’ subjective social perceptions arise and evolve in real-world classroom settings, under constraints from an observed social network and limited questionnaire data. When social information is incomplete and the accuracy of perception differs between students, they can form different views of the same group structure from local cues they can access. Repeated peer communication and belief updates can gradually change these views and, over time, lead to stable group-level differences. To avoid assuming a global “god’s-eye view,” we assign each student an individualized subjective graph that shows which social ties they can perceive and how far information is reachable from their perspective. All judgments and interactions are restricted to this subjective graph: agents use retrieval-augmented generation (RAG) to access only local information and then form evaluations of peers’ competence and social standing. We also add structural perturbations related to social-anxiety to represent consistent individual differences in the accuracy of social perception. During peer exchanges, agents share narrative assessments of classmates’ academic performance and social position with uncertainty tags, and update beliefs probabilistically using LLM-based trust scores. Using the time series of six real exam scores as an exogenous reference, we run multi-step simulations to examine how epistemic uncertainty spreads through local interactions. Experiments show that, without relying on global information, the framework reproduces several collective dynamics consistent with real-world educational settings. The code is released at https://anonymous.4open.science/r/Rashomonomon-0126.
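The trust-weighted belief updates can be illustrated with a minimal sketch. In the paper the trust score is produced by an LLM; here it is just a number, and the update rule, function name, and values are all invented for illustration.

```python
def update_belief(belief, report, trust, uncertainty):
    """Move a belief (in [0, 1]) toward a peer's reported assessment,
    scaled by trust in the sender and damped by the sender's stated
    uncertainty tag (both in [0, 1])."""
    weight = trust * (1.0 - uncertainty)
    return belief + weight * (report - belief)

# An agent hears two assessments of the same classmate's competence:
b = 0.5
b = update_belief(b, report=0.9, trust=0.8, uncertainty=0.25)  # credible peer
b = update_belief(b, report=0.1, trust=0.2, uncertainty=0.5)   # distrusted peer
```

A fully uncertain or fully distrusted report leaves the belief unchanged, so divergent perceptions can persist even under repeated communication.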
[609] GMPilot: An Expert AI Agent For FDA cGMP Compliance
Xiaohan Wang, Nan Zhang, Sulene Han, Keguang Tang, Lei Xu, Zhiping Li, Xiue Liu, Xiaomei Han
Main category: cs.AI
TL;DR: GMPilot is a domain-specific AI agent for FDA cGMP compliance support in pharmaceuticals, using RAG and ReAct frameworks with a curated regulatory knowledge base to provide real-time, traceable decision support.
Details
Motivation: The pharmaceutical industry faces challenges with quality management including high compliance costs, slow responses, and disjointed knowledge. There's a need for AI solutions to support FDA cGMP compliance and improve decision-making in this highly regulated sector.
Method: GMPilot uses a curated knowledge base of regulations and historical inspection observations, combined with Retrieval-Augmented Generation (RAG) and Reasoning-Acting (ReAct) frameworks to provide real-time, traceable decision support to quality professionals.
Result: In simulated inspection scenarios, GMPilot demonstrates improved responsiveness and professionalism of quality professionals by providing structured knowledge retrieval and verifiable regulatory and case-based support.
Conclusion: GMPilot represents a viable approach for improving quality management decision-making in pharmaceuticals using intelligent methods, though it has limitations in regulatory scope and model interpretability. It serves as an example of specialized AI application in highly regulated sectors.
Abstract: The pharmaceutical industry is facing quality-management challenges such as high compliance costs, slow responses, and disjointed knowledge. This paper presents GMPilot, a domain-specific AI agent designed to support FDA cGMP compliance. GMPilot is built on a curated knowledge base of regulations and historical inspection observations and uses Retrieval-Augmented Generation (RAG) and Reasoning-Acting (ReAct) frameworks to provide real-time, traceable decision support to quality professionals. In a simulated inspection scenario, GMPilot shows how it can improve the responsiveness and professionalism of quality professionals by providing structured knowledge retrieval and verifiable regulatory and case-based support. Although GMPilot remains limited in regulatory scope and model interpretability, it is a viable avenue for improving quality-management decision-making in the pharmaceutical sector through intelligent approaches, and an example of specialized AI application in highly regulated sectors.
[610] Governance-Aware Vector Subscriptions for Multi-Agent Knowledge Ecosystems
Steven Johnson
Main category: cs.AI
TL;DR: Governance-aware vector subscriptions combine semantic similarity matching with multi-dimensional policy predicates to prevent unauthorized content access in multi-agent systems while preserving authorized notifications.
Details
Motivation: In multi-agent ecosystems with different data handling policies, traditional semantic publish-subscribe systems create policy violations by allowing agents to receive notifications about content they're not authorized to access, necessitating a governance-aware solution.
Method: Introduces governance-aware vector subscriptions that compose semantic similarity matching with multi-dimensional policy predicates grounded in regulatory frameworks (EU DSM Directive, EU AI Act). The mechanism operates over independent dimensions like processing level, direct marketing restrictions, training opt-out, jurisdiction, and scientific usage, with notifications dispatched only for content passing both similarity threshold and all policy constraints.
Result: Implemented in AIngram and evaluated using PASA benchmark on synthetic corpus (1,000 chunks, 93 subscriptions, 5 domains). Governed mode correctly enforced all policy constraints while preserving delivery of authorized content. Ablation study showed no single policy dimension suffices for full compliance.
Conclusion: Governance-aware vector subscriptions effectively address policy violations in multi-agent semantic publish-subscribe systems by integrating regulatory compliance directly into the subscription mechanism, ensuring agents only receive authorized content notifications.
Abstract: As AI agent ecosystems grow, agents need mechanisms to monitor relevant knowledge in real time. Semantic publish-subscribe systems address this by matching new content against vector subscriptions. However, in multi-agent settings where agents operate under different data handling policies, unrestricted semantic subscriptions create policy violations: agents receive notifications about content they are not authorized to access. We introduce governance-aware vector subscriptions, a mechanism that composes semantic similarity matching with multi-dimensional policy predicates grounded in regulatory frameworks (EU DSM Directive, EU AI Act). The policy predicate operates over multiple independent dimensions (processing level, direct marketing restrictions, training opt-out, jurisdiction, and scientific usage) each with distinct legal bases. Agents subscribe to semantic regions of a curated knowledge base; notifications are dispatched only for validated content that passes both the similarity threshold and all applicable policy constraints. We formalize the mechanism, implement it within AIngram (an operational multi-agent knowledge base), and evaluate it using the PASA benchmark. We validate the mechanism on a synthetic corpus (1,000 chunks, 93 subscriptions, 5 domains): the governed mode correctly enforces all policy constraints while preserving delivery of authorized content. Ablation across five policy dimensions shows that no single dimension suffices for full compliance.
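The composition of similarity matching with policy predicates can be sketched as follows. The dimension names mirror the summary above, but the data layout, threshold, and function names are invented here, not taken from AIngram.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Policy dimensions modeled after the summary; each has a distinct legal basis.
POLICY_DIMS = ("processing_level_ok", "no_direct_marketing",
               "training_opt_out_respected", "jurisdiction_ok",
               "scientific_use_ok")

def should_notify(subscription, content, threshold=0.8):
    """Dispatch only if BOTH the similarity threshold and every policy
    dimension pass; a missing dimension fails closed."""
    if cosine(subscription["vector"], content["vector"]) < threshold:
        return False
    return all(content["policy"].get(dim, False) for dim in POLICY_DIMS)

sub = {"vector": [1.0, 0.0, 1.0]}
ok = {"vector": [0.9, 0.1, 1.0],
      "policy": {d: True for d in POLICY_DIMS}}
blocked = {"vector": [0.9, 0.1, 1.0],
           "policy": {**{d: True for d in POLICY_DIMS},
                      "training_opt_out_respected": False}}
```

Note how `blocked` is semantically close to the subscription yet never delivered: one failing dimension suffices, which matches the ablation finding that no single dimension alone ensures compliance.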
[611] ReLaMix: Residual Latency-Aware Mixing for Delay-Robust Financial Time-Series Forecasting
Tianyou Lai, Wentao Yue, Jiayi Zhou, Chaoyuan Hao, Lingke Chang, Qingyu Mao, Zhibo Niu, Qilei Li
Main category: cs.AI
TL;DR: ReLaMix: A lightweight network extension for robust financial time-series forecasting under delayed observations using residual bottleneck mixing to handle stale data artifacts.
Details
Motivation: Real-world high-frequency financial markets suffer from delayed/stale observations due to asynchronous data acquisition and transmission latency, creating stepwise stagnation artifacts that hinder accurate forecasting.
Method: Proposes ReLaMix (Residual Latency-Aware Mixing Network) as a lightweight extension of TimeMixer, integrating learnable bottleneck compression with residual refinement to suppress redundancy from repeated stale values while preserving informative market dynamics.
Result: Achieves state-of-the-art accuracy on second-resolution PAXGUSDT benchmark across multiple delay ratios and prediction horizons, outperforming mixer and Transformer baselines with fewer parameters. Cross-asset generalization confirmed on BTCUSDT.
Conclusion: Residual bottleneck mixing effectively handles latency-induced staleness in high-frequency financial forecasting, demonstrating robustness and generalization across assets.
Abstract: Financial time-series forecasting in real-world high-frequency markets is often hindered by delayed or partially stale observations caused by asynchronous data acquisition and transmission latency. To better reflect such practical conditions, we investigate a simulated delay setting where a portion of historical signals is corrupted by a Zero-Order Hold (ZOH) mechanism, significantly increasing forecasting difficulty through stepwise stagnation artifacts. In this paper, we propose ReLaMix (Residual Latency-Aware Mixing Network), a lightweight extension of TimeMixer that integrates learnable bottleneck compression with residual refinement for robust signal recovery under delayed observations. ReLaMix explicitly suppresses redundancy from repeated stale values while preserving informative market dynamics via residual mixing enhancement. Experiments on a large-scale second-resolution PAXGUSDT benchmark demonstrate that ReLaMix consistently achieves state-of-the-art accuracy across multiple delay ratios and prediction horizons, outperforming strong mixer and Transformer baselines with substantially fewer parameters. Moreover, additional evaluations on BTCUSDT confirm the cross-asset generalization ability of the proposed framework. These results highlight the effectiveness of residual bottleneck mixing for high-frequency financial forecasting under realistic latency-induced staleness.
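The Zero-Order Hold (ZOH) corruption described in the abstract is easy to sketch: once an observation is flagged stale, the last fresh value is repeated, producing the stepwise stagnation artifacts the model must tolerate. The series and delay mask below are invented for illustration.

```python
def zero_order_hold(series, stale_mask):
    """Replace each masked observation with the most recent fresh value."""
    out = []
    last_fresh = series[0]
    for x, stale in zip(series, stale_mask):
        if stale:
            out.append(last_fresh)   # repeat the stale value (stepwise plateau)
        else:
            last_fresh = x
            out.append(x)
    return out

prices = [100.0, 100.5, 101.2, 100.9, 101.7, 102.0]
mask   = [False, False, True,  True,  False, False]
zero_order_hold(prices, mask)
# -> [100.0, 100.5, 100.5, 100.5, 101.7, 102.0]
```

The two held values form the flat run that a naive model can mistake for genuine price stability, which is the redundancy ReLaMix's bottleneck compression is designed to suppress.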
[612] Do LLM-Driven Agents Exhibit Engagement Mechanisms? Controlled Tests of Information Load, Descriptive Norms, and Popularity Cues
Tai-Quan Peng, Yuan Tian, Songsong Liang, Dazhen Deng, Yingcai Wu
Main category: cs.AI
TL;DR: LLM-driven agent-based simulation study of social media engagement behavior, examining responses to information load, descriptive norms, and popularity cues in a Weibo-like environment
Details
Motivation: While LLMs make agent-based simulations more behaviorally expressive, there's a methodological tension: human-like output alone doesn't constitute evidence for theory. The paper aims to evaluate what LLM-driven simulations can credibly support, using social media information engagement as a test case.
Method: Created a Weibo-like simulation environment where LLM-driven agents interact. Manipulated information load and descriptive norms while allowing popularity cues (likes and reshares) to evolve endogenously. Conducted controlled variations to test whether simulated behavior changes in theoretically interpretable ways rather than just producing plausible-looking traces.
Result: Engagement responded systematically to information load and descriptive norms, while sensitivity to popularity cues varied across contexts, indicating conditional, context-dependent behavior rather than rigid prompt compliance.
Conclusion: Provides methodological implications for simulation-based communication research: need for multi-condition stress tests, explicit no-norm baselines (since default prompts aren’t blank controls), and design choices preserving endogenous feedback loops when studying bandwagon dynamics.
Abstract: Large language models make agent-based simulation more behaviorally expressive, but they also sharpen a basic methodological tension: fluent, human-like output is not, by itself, evidence for theory. We evaluate what an LLM-driven simulation can credibly support using information engagement on social media as a test case. In a Weibo-like environment, we manipulate information load and descriptive norms, while allowing popularity cues (cumulative likes and Sina Weibo-style cumulative reshares) to evolve endogenously. We then ask whether simulated behavior changes in theoretically interpretable ways under these controlled variations, rather than merely producing plausible-looking traces. Engagement responds systematically to information load and descriptive norms, and sensitivity to popularity cues varies across contexts, indicating conditionality rather than rigid prompt compliance. We discuss methodological implications for simulation-based communication research, including multi-condition stress tests, explicit no-norm baselines because default prompts are not blank controls, and design choices that preserve endogenous feedback loops when studying bandwagon dynamics.
[613] Profit is the Red Team: Stress-Testing Agents in Strategic Economic Interactions
Shouqiao Wang, Marcello Politi, Samuele Marro, Davide Crapis
Main category: cs.AI
TL;DR: Profit-driven red teaming protocol uses learned opponents to stress-test AI agents in structured economic interactions, discovering adaptive exploits without explicit attack instructions.
Details
Motivation: As AI agents move into real-world deployments, they increasingly rely on external inputs that can be strategically manipulated by adversaries. Current security testing focuses on fixed prompt attacks, but adaptive strategies pose greater risks in structured settings with auditable outcomes.
Method: Proposes profit-driven red teaming: a stress-testing protocol where a learned opponent is trained to maximize profit using only scalar outcome feedback. No LLM-as-judge scoring, attack labels, or taxonomy needed. Instantiated in an arena of four canonical economic interactions as a controlled testbed.
Result: Agents that appear strong against static baselines become consistently exploitable under profit-optimized pressure. The learned opponent discovers probing, anchoring, and deceptive commitments without explicit instruction. Distilling exploit episodes into prompt rules makes most failures ineffective and substantially improves target performance.
Conclusion: Profit-driven red-team data provides a practical route to improving robustness in structured agent settings with auditable outcomes, offering a more realistic security testing approach than static attack libraries.
Abstract: As agentic systems move into real-world deployments, their decisions increasingly depend on external inputs such as retrieved content, tool outputs, and information provided by other actors. When these inputs can be strategically shaped by adversaries, the relevant security risk extends beyond a fixed library of prompt attacks to adaptive strategies that steer agents toward unfavorable outcomes. We propose profit-driven red teaming, a stress-testing protocol that replaces handcrafted attacks with a learned opponent trained to maximize its profit using only scalar outcome feedback. The protocol requires no LLM-as-judge scoring, attack labels, or attack taxonomy, and is designed for structured settings with auditable outcomes. We instantiate it in a lean arena of four canonical economic interactions, which provide a controlled testbed for adaptive exploitability. In controlled experiments, agents that appear strong against static baselines become consistently exploitable under profit-optimized pressure, and the learned opponent discovers probing, anchoring, and deceptive commitments without explicit instruction. We then distill exploit episodes into concise prompt rules for the agent, which make most previously observed failures ineffective and substantially improve target performance. These results suggest that profit-driven red-team data can provide a practical route to improving robustness in structured agent settings with auditable outcomes.
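The core loop, an opponent that learns from nothing but a scalar profit signal, can be sketched as a toy bandit. The "target" below is a stand-in negotiator that accepts any split leaving it at least 30%; the game, parameters, and function names are invented for illustration and are not the paper's arena.

```python
import random

def target_accepts(opponent_share):
    # Stand-in target agent: accepts any split that leaves it >= 30%.
    return opponent_share <= 0.7

def learn_exploit(rounds=2000, eps=0.1, seed=0):
    """Epsilon-greedy opponent trained on scalar profit feedback only:
    no judge, no attack labels, no taxonomy."""
    rng = random.Random(seed)
    shares = [0.4, 0.5, 0.6, 0.7, 0.8]           # candidate demands
    totals, counts = [0.0] * 5, [0] * 5
    for _ in range(rounds):
        if rng.random() < eps:                    # explore a random demand
            i = rng.randrange(5)
        else:                                     # exploit best mean profit
            i = max(range(5), key=lambda j: totals[j] / max(counts[j], 1))
        profit = shares[i] if target_accepts(shares[i]) else 0.0
        totals[i] += profit                       # scalar outcome feedback
        counts[i] += 1
    best = max(range(5), key=lambda j: totals[j] / max(counts[j], 1))
    return shares[best]
```

With these invented payoffs the opponent settles on the largest demand the target still accepts (0.7), an "anchoring"-style exploit it discovers without any explicit attack instruction.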
[614] gUFO: A Gentle Foundational Ontology for Semantic Web Knowledge Graphs
João Paulo A. Almeida, Giancarlo Guizzardi, Tiago Prince Sales, Claudenir M. Fonseca
Main category: cs.AI
TL;DR: gUFO is a lightweight OWL 2 DL implementation of the Unified Foundational Ontology (UFO) designed for Semantic Web applications, offering unique features like typology of types, reification patterns, and situation support.
Details
Motivation: To provide a practical, standardized implementation of the mature UFO foundational ontology for Semantic Web applications, addressing recurrent problems in knowledge graphs with well-founded patterns.
Method: Developed gUFO as a lightweight OWL 2 DL implementation of UFO, incorporating unique features including typology of types (operationalizing OntoClean guidelines), reification patterns for intrinsic/relational aspects, and support for situations and high-order types.
Result: gUFO provides a standardized, practical implementation of UFO with unique capabilities not found in other foundational ontology implementations such as those for BFO and DOLCE, and is currently undergoing ISO standardization.
Conclusion: gUFO offers a well-founded, standardized foundational ontology implementation for Semantic Web applications with unique features that address common knowledge graph problems, positioning it as a valuable alternative to existing foundational ontology implementations.
Abstract: gUFO is a lightweight implementation of the Unified Foundational Ontology (UFO) suitable for Semantic Web OWL 2 DL applications. UFO is a mature foundational ontology with a rich axiomatization that has been employed in a significant number of projects in research and industry. Moreover, it is currently in the process of standardization by the International Organization for Standardization as the ISO/IEC CD 21838-5. gUFO stands out from other foundational ontology implementations (such as those provided for BFO and DOLCE) given its unique support for a typology of types (operationalizing OntoClean guidelines), its reification patterns for intrinsic and relational aspects, and its support for situations and high-order types. gUFO provides well-founded patterns to address recurrent problems in Semantic Web knowledge graphs. In this paper, we present gUFO with its constituting categories, relations and constraints, discuss how it differs from the original UFO reference ontology, elaborate on its community adoption, and systematically position it in relation to existing OWL-based implementations of popular alternative foundational ontologies.
[615] AutoMOOSE: An Agentic AI for Autonomous Phase-Field Simulation
Sukriti Manna, Henry Chan, Subramanian K. R. S. Sankaranarayanan
Main category: cs.AI
TL;DR: AutoMOOSE is an agentic framework that automates multiphysics simulations from natural language prompts, using a five-agent pipeline to generate input files, handle failures, and validate results without human intervention.
Details
Motivation: Multiphysics simulation frameworks like MOOSE require significant expertise to use effectively, creating barriers for researchers who need to construct input files, manage parameter sweeps, diagnose failures, and extract results.
Method: AutoMOOSE uses a five-agent pipeline with an Input Writer coordinating six sub-agents and a Reviewer that autonomously corrects runtime failures. It features a modular plugin architecture for new phase-field formulations and a Model Context Protocol server exposing ten structured tools for interoperability.
Result: Validated on copper grain growth benchmarks, AutoMOOSE generated MOOSE input files with 6 of 12 structural blocks matching human expert references exactly, achieved 1.8x speedup through parallel execution, and recovered grain coarsening kinetics with R^2 = 0.90-0.95. It autonomously diagnosed and resolved three runtime failure classes.
Conclusion: The framework bridges the gap between physics knowledge and validated simulation execution through lightweight multi-agent orchestration, enabling AI-driven materials discovery and self-driving laboratories.
Abstract: Multiphysics simulation frameworks such as MOOSE provide rigorous engines for phase-field materials modeling, yet adoption is constrained by the expertise required to construct valid input files, coordinate parameter sweeps, diagnose failures, and extract quantitative results. We introduce AutoMOOSE, an open-source agentic framework that orchestrates the full simulation lifecycle from a single natural-language prompt. AutoMOOSE deploys a five-agent pipeline in which the Input Writer coordinates six sub-agents and the Reviewer autonomously corrects runtime failures without user intervention. A modular plugin architecture enables new phase-field formulations without modifying the core framework, and a Model Context Protocol (MCP) server exposes the workflow as ten structured tools for interoperability with any MCP-compatible client. Validated on a four-temperature copper grain growth benchmark, AutoMOOSE generates MOOSE input files with 6 of 12 structural blocks matching a human expert reference exactly and 4 functionally equivalent, executes all runs in parallel with a 1.8x speedup, and performs an end-to-end physical consistency check spanning intent, finite-element execution, and Arrhenius kinetics with no human verification. Grain coarsening kinetics are recovered with R^2 = 0.90-0.95 at T >= 600 K; the recovered activation energy Q_fit = 0.296 eV is consistent with a human-written reference (Q_fit = 0.267 eV) under identical parameters. Three runtime failure classes were diagnosed and resolved autonomously within a single correction cycle, and every run produces a provenance record satisfying FAIR data principles. These results show that the gap between knowing the physics and executing a validated simulation campaign can be bridged by a lightweight multi-agent orchestration layer, providing a pathway toward AI-driven materials discovery and self-driving laboratories.
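The Reviewer's autonomous correction cycle can be sketched generically: run the simulation, and on a recognized failure class apply a diagnosis-specific fix and retry within a cycle budget. The failure class, fix, and solver below are placeholders, not AutoMOOSE's actual agents.

```python
def run_with_review(input_deck, run_fn, fixes, max_cycles=3):
    """Attempt a run; on a recognized failure class, patch the deck and retry."""
    for _ in range(max_cycles):
        ok, error_class = run_fn(input_deck)
        if ok:
            return input_deck, True
        if error_class not in fixes:
            break                              # unrecognized failure: give up
        input_deck = fixes[error_class](input_deck)
    return input_deck, False

# Stand-in solver: fails with "nonconvergence" until the time step is small enough.
def fake_solver(deck):
    return (True, None) if deck["dt"] <= 0.1 else (False, "nonconvergence")

# One placeholder fix per failure class: halve the time step.
fixes = {"nonconvergence": lambda d: {**d, "dt": d["dt"] / 2}}

deck, ok = run_with_review({"dt": 0.4}, fake_solver, fixes)
# dt halves 0.4 -> 0.2 -> 0.1; the run succeeds within the cycle budget
```

Keeping the fixes in a dictionary keyed by failure class mirrors the plugin-style extensibility the paper describes: new failure modes slot in without touching the control loop.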
[616] Can we automatize scientific discovery in the cognitive sciences?
Akshay K. Jagadish, Milena Rmus, Kristin Witte, Marvin Mathony, Marcel Binz, Eric Schulz
Main category: cs.AI
TL;DR: Proposes using LLMs to fully automate cognitive science discovery cycle: generating experimental paradigms, simulating behavioral data, synthesizing cognitive models, and evaluating conceptual yield.
Details
Motivation: Traditional cognitive science discovery cycle is slow and limited by human intuition; need automated, scalable approach to theory development using AI.
Method: LLM-based framework that: 1) samples experimental paradigms, 2) simulates behavioral data using foundation models, 3) synthesizes cognitive models via program synthesis, 4) evaluates “interestingness” with LLM-critic.
Result: Proposes a high-throughput in-silico discovery engine that surfaces informative experiments and mechanisms for human validation.
Conclusion: LLMs enable automated cognitive science discovery, accelerating theory development and expanding search space beyond human limitations.
Abstract: The cognitive sciences aim to understand intelligence by formalizing underlying operations as computational models. Traditionally, this follows a cycle of discovery where researchers develop paradigms, collect data, and test predefined model classes. However, this manual pipeline is fundamentally constrained by the slow pace of human intervention and a search space limited by researchers’ background and intuition. Here, we propose a paradigm shift toward a fully automated, in silico science of the mind that implements every stage of the discovery cycle using Large Language Models (LLMs). In this framework, experimental paradigms exploring conceptually meaningful task structures are directly sampled from an LLM. High-fidelity behavioral data are then simulated using foundation models of cognition. The tedious step of handcrafting cognitive models is replaced by LLM-based program synthesis, which performs a high-throughput search over a vast landscape of algorithmic hypotheses. Finally, the discovery loop is closed by optimizing for ‘‘interestingness’’, a metric of conceptual yield evaluated by an LLM-critic. By enabling a fast and scalable approach to theory development, this automated loop functions as a high-throughput in-silico discovery engine, surfacing informative experiments and mechanisms for subsequent validation in real human populations.
[617] The Intelligent Disobedience Game: Formulating Disobedience in Stackelberg Games and Markov Decision Processes
Benedikt Hornig, Reuth Mirsky
Main category: cs.AI
TL;DR: Game-theoretic framework for modeling when AI assistants should disobey human instructions to prevent harm, with applications to safe reinforcement learning agents.
Details
Motivation: In shared autonomy systems, there's a critical tension between obeying human instructions and overriding them to prevent harm. Current systems lack formal mathematical foundations for this "intelligent disobedience" behavior, making it difficult to develop safe AI assistants that can appropriately balance compliance with safety.
Method: Introduces the Intelligent Disobedience Game (IDG), a sequential game-theoretic framework based on Stackelberg games that models human-assistant interactions under asymmetric information. Characterizes optimal strategies for both agents and identifies strategic phenomena like “safety traps.” Translates IDG into a shared control Multi-Agent Markov Decision Process for computational implementation.
Result: Develops a mathematical foundation for intelligent disobedience, enabling algorithmic development of agents that can learn safe non-compliance and empirical study of human trust in disobedient AI. Creates a computational testbed for training reinforcement learning agents in safety-critical shared autonomy scenarios.
Conclusion: The IDG framework provides essential mathematical foundations for developing AI assistants that can intelligently disobey harmful instructions while maintaining human trust, with applications to reinforcement learning in safety-critical human-AI collaboration systems.
Abstract: In shared autonomy, a critical tension arises when an automated assistant must choose between obeying a human’s instruction and deliberately overriding it to prevent harm. This safety-critical behavior is known as intelligent disobedience. To formalize this dynamic, this paper introduces the Intelligent Disobedience Game (IDG), a sequential game-theoretic framework based on Stackelberg games that models the interaction between a human leader and an assistive follower operating under asymmetric information. It characterizes optimal strategies for both agents across multi-step scenarios, identifying strategic phenomena such as “safety traps,” where the system indefinitely avoids harm but fails to achieve the human’s goal. The IDG provides a needed mathematical foundation that enables both the algorithmic development of agents that can learn safe non-compliance and the empirical study of how humans perceive and trust disobedient AI. The paper further translates the IDG into a shared control Multi-Agent Markov Decision Process representation, forming a compact computational testbed for training reinforcement learning agents.
[618] A Framework for Low-Latency, LLM-driven Multimodal Interaction on the Pepper Robot
Erich Studerus, Vivienne Jia Zhong, Stephan Vonschallen
Main category: cs.AI
TL;DR: Open-source Android framework for Pepper robot using end-to-end Speech-to-Speech models and LLM function calling for low-latency multimodal interaction
Details
Motivation: Address weaknesses in current LLM integration for social robotics: high latency from cascaded STT->LLM->TTS pipelines, loss of paralinguistic information, and underutilization of LLMs for multimodal perception and agentic control.
Method: Developed Android framework with two innovations: 1) End-to-end Speech-to-Speech models for low-latency interaction preserving paralinguistic cues, 2) Extensive Function Calling capabilities enabling LLM as agentic planner orchestrating robot actions and integrating multimodal feedback
Result: Framework runs on robot’s tablet (or regular Android devices), decoupling development from hardware, providing practical platform for advanced LLM-driven embodied interaction
Conclusion: Provides HRI community with extensible platform for exploring advanced LLM-driven embodied interaction with improved latency and multimodal capabilities
Abstract: Despite recent advances in integrating Large Language Models (LLMs) into social robotics, two weaknesses persist. First, existing implementations on platforms like Pepper often rely on cascaded Speech-to-Text (STT)->LLM->Text-to-Speech (TTS) pipelines, resulting in high latency and the loss of paralinguistic information. Second, most implementations fail to fully leverage the LLM’s capabilities for multimodal perception and agentic control. We present an open-source Android framework for the Pepper robot that addresses these limitations through two key innovations. First, we integrate end-to-end Speech-to-Speech (S2S) models to achieve low-latency interaction while preserving paralinguistic cues and enabling adaptive intonation. Second, we implement extensive Function Calling capabilities that elevate the LLM to an agentic planner, orchestrating robot actions (navigation, gaze control, tablet interaction) and integrating diverse multimodal feedback (vision, touch, system state). The framework runs on the robot’s tablet but can also be built to run on regular Android smartphones or tablets, decoupling development from robot hardware. This work provides the HRI community with a practical, extensible platform for exploring advanced LLM-driven embodied interaction.
[619] Knowledge Boundary Discovery for Large Language Models
Ziquan Wang, Zhongqi Lu
Main category: cs.AI
TL;DR: KBD is a reinforcement learning framework that automatically discovers LLM knowledge boundaries by generating both answerable and unanswerable questions through iterative interaction with the model.
Details
Motivation: To develop an automated method for discovering the knowledge boundaries of LLMs, addressing the challenge of hallucination and enabling systematic evaluation of what LLMs can and cannot answer reliably.
Method: Uses reinforcement learning where an agent interacts with the LLM in a partially observable environment. The agent generates progressive questions (actions), receives LLM responses (observations), uses entropy reduction as reward, and updates belief states to iteratively explore knowledge boundaries.
Result: KBD successfully detects LLM knowledge boundaries by automatically generating non-trivial answerable and unanswerable questions. The generated question sets are comparable to human-generated benchmark datasets.
Conclusion: KBD provides a new automated approach for evaluating LLMs by systematically discovering their knowledge boundaries, offering an alternative to manually crafted benchmarks.
Abstract: We propose Knowledge Boundary Discovery (KBD), a reinforcement learning based framework to explore the knowledge boundaries of the Large Language Models (LLMs). We define the knowledge boundary by automatically generating two types of questions: (i) those the LLM can confidently answer (within-knowledge boundary) and (ii) those it cannot (beyond-knowledge boundary). Iteratively exploring and exploiting the LLM’s responses to find its knowledge boundaries is challenging because of the hallucination phenomenon. To find the knowledge boundaries of an LLM, the agent interacts with the LLM under the modeling of exploring a partially observable environment. The agent generates a progressive question as the action, adopts an entropy reduction as the reward, receives the LLM’s response as the observation and updates its belief states. We demonstrate that the KBD detects knowledge boundaries of LLMs by automatically finding a set of non-trivial answerable and unanswerable questions. We validate the KBD by comparing its generated knowledge boundaries with manually crafted LLM benchmark datasets. Experiments show that our KBD-generated question set is comparable to the human-generated datasets. Our approach paves a new way to evaluate LLMs.
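One ingredient of KBD, disagreement among an LLM's sampled answers as a boundary signal, can be sketched in a few lines: sample several answers to the same question and measure their entropy, treating low-entropy (consistent) answers as within the knowledge boundary. This is an illustrative sketch under simplifying assumptions (answers reduced to comparable strings, a hypothetical threshold), not the paper's implementation:

```python
import math
from collections import Counter

def answer_entropy(answers):
    """Shannon entropy (bits) of the empirical distribution over sampled answers."""
    counts = Counter(answers)
    n = len(answers)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def classify_question(answers, threshold=0.5):
    """Low entropy -> consistent answers -> treat as within the knowledge
    boundary; high entropy -> inconsistent -> treat as beyond it.
    The threshold is an illustrative assumption."""
    return "within" if answer_entropy(answers) <= threshold else "beyond"
```

In the paper the agent goes further: it uses the entropy *reduction* from one probing question to the next as its reward, so this per-question score would feed a sequential decision process rather than a one-shot classifier.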
[620] KLDrive: Fine-Grained 3D Scene Reasoning for Autonomous Driving based on Knowledge Graph
Ye Tian, Jingyi Zhang, Zihao Wang, Xiaoyuan Ren, Xiaofan Yu, Onat Gungor, Tajana Rosing
Main category: cs.AI
TL;DR: KLDrive: Knowledge-graph-augmented LLM framework for fine-grained QA in autonomous driving using energy-based scene fact construction and constrained LLM reasoning.
Details
Motivation: Autonomous driving requires reliable reasoning over fine-grained 3D scene facts, but existing methods suffer from hallucinations, opaque reasoning, and heavy task-specific training. Need for reliable scene understanding in driving scenarios.
Method: Two components: 1) Energy-based scene fact construction module consolidates multi-source evidence into reliable scene knowledge graph, 2) LLM agent performs fact-grounded reasoning over constrained action space with structural constraints. Uses structured prompting with few-shot exemplars.
Result: Achieves 65.04% accuracy on NuScenes-QA (SOTA) and 42.45 SPICE score on GVQA. On counting tasks, improves by 46.01 percentage points over strongest baseline, demonstrating reduced hallucinations and benefit of reliable scene fact construction.
Conclusion: KLDrive effectively addresses fine-grained QA in autonomous driving by coupling reliable scene fact construction with explicit reasoning, outperforming existing methods while reducing hallucinations.
Abstract: Autonomous driving requires reliable reasoning over fine-grained 3D scene facts. Fine-grained question answering over multi-modal driving observations provides a natural way to evaluate this capability, yet existing perception pipelines and driving-oriented large language model (LLM) methods still suffer from unreliable scene facts, hallucinations, opaque reasoning, and heavy reliance on task-specific training. We present KLDrive, the first knowledge-graph-augmented LLM reasoning framework for fine-grained question answering in autonomous driving. KLDrive addresses this problem through designing two tightly coupled components: an energy-based scene fact construction module that consolidates multi-source evidence into a reliable scene knowledge graph, and an LLM agent that performs fact-grounded reasoning over a constrained action space under explicit structural constraints. By combining structured prompting with few-shot in-context exemplars, the framework adapts to diverse reasoning tasks without heavy task-specific fine-tuning. Experiments on two large-scale autonomous-driving QA benchmarks show that KLDrive outperforms prior state-of-the-art methods, achieving the best overall accuracy of 65.04% on NuScenes-QA and the best SPICE score of 42.45 on GVQA. On counting, the most challenging factual reasoning task, it improves over the strongest baseline by 46.01 percentage points, demonstrating substantially reduced hallucinations and the benefit of coupling reliable scene fact construction with explicit reasoning.
[621] LongCat-Flash-Prover: Advancing Native Formal Reasoning via Agentic Tool-Integrated Reinforcement Learning
Jianing Wang, Jianfei Zhang, Qi Guo, Linsen Guo, Rumei Li, Chao Zhang, Chong Peng, Cunguang Wang, Dengchang Zhao, Jiarong Shi, Jingang Wang, Liulin Feng, Mengxia Shen, Qi Li, Shengnan An, Shun Wang, Wei Shi, Xiangyu Xi, Xiaoyu Li, Xuezhi Cao, Yi Lu, Yunke Zhao, Zhengyu Chen, Zhimin Lin, Wei Wang, Peng Pei, Xunliang Cai
Main category: cs.AI
TL;DR: LongCat-Flash-Prover is a 560B parameter MoE model for formal reasoning in Lean4, using agentic tool-integrated reasoning with auto-formalization, sketching, and proving capabilities.
Details
Motivation: To advance native formal reasoning in theorem proving by developing a large-scale open-source model that can handle complex formal reasoning tasks through decomposed capabilities and agentic reasoning.Method: Uses a Hybrid-Experts Iteration Framework to expand high-quality task trajectories, Hierarchical Importance Sampling Policy Optimization (HisPO) for stable MoE training, and incorporates theorem consistency/legality detection mechanisms.
Result: Achieves state-of-the-art for open-weights models: 97.1% pass rate on MiniF2F-Test with an inference budget of only 72 per problem, and solves 70.8% of ProverBench and 41.5% of PutnamBench with ≤220 attempts per problem.
Conclusion: LongCat-Flash-Prover demonstrates remarkable sample efficiency and sets new benchmarks for open-weights models in formal reasoning tasks, significantly outperforming existing baselines.
Abstract: We introduce LongCat-Flash-Prover, a flagship 560-billion-parameter open-source Mixture-of-Experts (MoE) model that advances Native Formal Reasoning in Lean4 through agentic tool-integrated reasoning (TIR). We decompose the native formal reasoning task into three independent formal capabilities, i.e., auto-formalization, sketching, and proving. To facilitate these capabilities, we propose a Hybrid-Experts Iteration Framework to expand high-quality task trajectories, including generating a formal statement based on a given informal problem, producing a whole-proof directly from the statement, or a lemma-style sketch. During agentic RL, we present a Hierarchical Importance Sampling Policy Optimization (HisPO) algorithm, which aims to stabilize the MoE model training on such long-horizon tasks. It employs a gradient masking strategy that accounts for the policy staleness and the inherent train-inference engine discrepancies at both sequence and token levels. Additionally, we also incorporate theorem consistency and legality detection mechanisms to eliminate reward hacking issues. Extensive evaluations show that our LongCat-Flash-Prover sets a new state-of-the-art for open-weights models in both auto-formalization and theorem proving. Demonstrating remarkable sample efficiency, it achieves a 97.1% pass rate on MiniF2F-Test using an inference budget of only 72 per problem. On more challenging benchmarks, it solves 70.8% of ProverBench and 41.5% of PutnamBench with no more than 220 attempts per problem, significantly outperforming existing open-weights baselines.
[622] ORACLE: Optimizing Reasoning Abilities of Large Language Models via Constraint-Led Synthetic Data Elicitation
Zhuojie Yang, Wentao Wan, Keze Wang
Main category: cs.AI
TL;DR: ORACLE is a structured data generation framework that combines LLM reasoning with symbolic verification to create high-quality multi-step reasoning data, addressing limitations of existing methods that only check final answers.
Details
Motivation: Current methods for generating synthetic reasoning data for LLMs focus on filtering based on final answer correctness, overlooking flaws in intermediate reasoning steps. Existing verification methods (code execution or symbolic reasoning engines) are limited to specific domains and fail in natural language reasoning tasks with ambiguous/incomplete contexts.
Method: ORACLE integrates LLM generative capabilities with symbolic supervision: LLMs produce step-wise reasoning contexts using a unified prompting template, while a symbolic reasoning engine verifies each intermediate step’s validity, enabling fine-grained, step-level validation.
Result: ORACLE consistently outperforms strong baselines on six logical, factual, and commonsense reasoning benchmarks across multiple models.
Conclusion: ORACLE addresses the challenge of verifying intermediate reasoning steps in natural language tasks, enabling construction of higher-quality multi-step reasoning data through structured generation with symbolic verification.
Abstract: Training large language models (LLMs) with synthetic reasoning data has become a popular approach to enhancing their reasoning capabilities, while a key factor influencing the effectiveness of this paradigm is the quality of the generated multi-step reasoning data. To generate high-quality reasoning data, many recent methods generate synthetic reasoning paths and filter them based on final answer correctness, often overlooking flaws in intermediate reasoning steps. To enhance the verification of intermediate reasoning steps, prior work primarily resorts to code execution or symbolic reasoning engines. However, code-based validation is restricted to code or mathematical tasks, and reasoning engines require a well-structured and complete context. As a result, existing methods fail to function effectively in natural language reasoning tasks that involve ambiguous or incomplete contexts. In these tasks, synthetic data still lack reliable checks for verifying each reasoning step. To address this challenge, we introduce ORACLE, a structured data generation framework inspired by syllogistic reasoning. ORACLE integrates the generative strengths of LLMs with symbolic supervision: the LLM produces step-wise reasoning contexts, while a symbolic reasoning engine verifies the validity of each intermediate step. By employing a unified prompting template to elicit modular reasoning chains, ORACLE enables fine-grained, step-level validation, facilitating the construction of high-quality multi-step reasoning data. Across six logical, factual, and commonsense reasoning benchmarks, our ORACLE consistently outperforms strong baselines on multiple models.
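The step-level verification idea can be illustrated with a toy propositional checker: accept each intermediate claim only if it is already known or follows from known facts by one rule application (modus ponens). The fact/rule encoding and function names here are hypothetical, a minimal stand-in for the paper's symbolic reasoning engine rather than its actual implementation:

```python
def step_valid(known, rules, claim):
    """A step is valid if its claim is already known, or is the conclusion
    of some rule whose antecedents are all currently known facts."""
    if claim in known:
        return True
    return any(concl == claim and all(a in known for a in ants)
               for ants, concl in rules)

def verify_chain(facts, rules, steps):
    """Check each intermediate step in order; a step's conclusion becomes a
    new known fact only if it was derivable from what preceded it.
    Returns (True, None) or (False, first_invalid_claim)."""
    known = set(facts)
    for claim in steps:
        if not step_valid(known, rules, claim):
            return False, claim
        known.add(claim)
    return True, None
```

The contrast with answer-only filtering is that a chain ending in a correct final answer still fails here if any intermediate step does not follow from what was established before it.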
[623] Can LLMs Fool Graph Learning? Exploring Universal Adversarial Attacks on Text-Attributed Graphs
Zihui Chen, Yuling Wang, Pengfei Jiao, Kai Wu, Xiao Wang, Xiang Ao, Dalin Zhang
Main category: cs.AI
TL;DR: BadGraph is a universal adversarial attack framework for text-attributed graphs that uses LLMs to jointly perturb both node topology and textual semantics, achieving effective attacks across GNN- and LLM-based models.
Details
Motivation: Text-attributed graphs (TAGs) combine rich textual semantics with topological context but introduce new vulnerabilities. Current TAG models use diverse backbones (GNNs and PLMs), making it challenging to design universal adversarial attacks that work across different architectures, especially in black-box settings where many PLMs are API-accessible only.
Method: BadGraph leverages LLMs’ understanding of general graph knowledge to jointly perturb node topology and textual semantics. It uses a target influencer retrieval module with graph priors to construct cross-modally aligned attack shortcuts, enabling efficient LLM-based perturbation reasoning.
Result: BadGraph achieves universal and effective attacks across GNN- and LLM-based reasoners, causing up to 76.3% performance drop. Theoretical and empirical analyses confirm its stealthy yet interpretable nature.
Conclusion: BadGraph demonstrates that LLMs can be effectively leveraged to create universal adversarial attacks on text-attributed graphs, highlighting security vulnerabilities in multimodal graph learning systems that combine structural and textual information.
Abstract: Text-attributed graphs (TAGs) enhance graph learning by integrating rich textual semantics and topological context for each node. While boosting expressiveness, they also expose new vulnerabilities in graph learning through text-based adversarial surfaces. Recent advances leverage diverse backbones, such as graph neural networks (GNNs) and pre-trained language models (PLMs), to capture both structural and textual information in TAGs. This diversity raises a key question: How can we design universal adversarial attacks that generalize across architectures to assess the security of TAG models? The challenge arises from the stark contrast in how different backbones, GNNs and PLMs, perceive and encode graph patterns, coupled with the fact that many PLMs are only accessible via APIs, limiting attacks to black-box settings. To address this, we propose BadGraph, a novel attack framework that deeply elicits large language models’ (LLMs) understanding of general graph knowledge to jointly perturb both node topology and textual semantics. Specifically, we design a target influencer retrieval module that leverages graph priors to construct cross-modally aligned attack shortcuts, thereby enabling efficient LLM-based perturbation reasoning. Experiments show that BadGraph achieves universal and effective attacks across GNN- and LLM-based reasoners, with up to a 76.3% performance drop, while theoretical and empirical analyses confirm its stealthy yet interpretable nature.
[624] Revisiting Tree Search for LLMs: Gumbel and Sequential Halving for Budget-Scalable Reasoning
Leonid Ugadiarov, Yuri Kuratov, Aleksandr Panov, Alexey Skrynnik
Main category: cs.AI
TL;DR: ReSCALE adapts AlphaZero-style tree search for LLMs by replacing Dirichlet noise and PUCT with Gumbel sampling and Sequential Halving, fixing scaling failures where accuracy drops with increased search budget.
Details
Motivation: Recent applications of AlphaZero-style tree search to enhance LLM reasoning suffer from scaling failures: accuracy actually decreases as search budget increases on benchmarks like GSM8K and Game24.
Method: ReSCALE adapts Gumbel AlphaZero MCTS by replacing Dirichlet noise and PUCT selection with Gumbel sampling and Sequential Halving, maintaining monotonic scaling without modifying the underlying LLM or its training.
Result: ReSCALE achieves 58.4% on GSM8K and 85.3% on Game24 at budgets where baseline methods degrade, with ablations showing Sequential Halving as the primary driver of improvement.
Conclusion: The proposed ReSCALE method successfully addresses scaling failures in neural tree search for LLMs, enabling effective reasoning enhancement through improved search algorithms without model modifications.
Abstract: Neural tree search is a powerful decision-making algorithm widely used in complex domains such as game playing and model-based reinforcement learning. Recent work has applied AlphaZero-style tree search to enhance the reasoning capabilities of Large Language Models (LLMs) during inference, but we find that this approach suffers from a scaling failure: on GSM8K and Game24, accuracy drops as the search budget increases. In this paper, we present ReSCALE, an adaptation of Gumbel AlphaZero MCTS that replaces Dirichlet noise and PUCT selection with Gumbel sampling and Sequential Halving, restoring monotonic scaling without changes to the model or its training. ReSCALE reaches 58.4% on GSM8K and 85.3% on Game24 at budgets where the baseline degrades. Ablations confirm that Sequential Halving is the primary driver of the improvement.
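Sequential Halving, the component the ablations credit for the improvement, is a simple bandit routine: split the budget across roughly log2(n) rounds, sample every surviving candidate equally, and keep the top half each round. A minimal sketch follows, with the `pull` callback standing in for a rollout or value evaluation; this is an illustration of the general algorithm, not the paper's code:

```python
import math

def sequential_halving(arms, budget, pull):
    """Return the single surviving arm after repeatedly halving the
    candidate set. Each round spends an equal slice of the budget on
    every survivor, then discards the lower-scoring half."""
    survivors = list(arms)
    rounds = max(1, math.ceil(math.log2(len(survivors))))
    for _ in range(rounds):
        if len(survivors) == 1:
            break
        # Equal per-arm allocation for this round (at least one pull each).
        per_arm = max(1, budget // (rounds * len(survivors)))
        scores = {a: sum(pull(a) for _ in range(per_arm)) / per_arm
                  for a in survivors}
        survivors.sort(key=lambda a: scores[a], reverse=True)
        survivors = survivors[: math.ceil(len(survivors) / 2)]
    return survivors[0]
```

Unlike PUCT, which keeps revisiting the same high-prior branches, this schedule guarantees every candidate gets sampled before any is eliminated, which is what restores monotonic behavior as the budget grows.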
[625] Does AI Homogenize Student Thinking? A Multi-Dimensional Analysis of Structural Convergence in AI-Augmented Essays
Keito Inoshita, Michiaki Omura, Tsukasa Yamanaka, Go Maeda, Kentaro Tsuji
Main category: cs.AI
TL;DR: AI-assisted writing improves essay quality but reduces structural diversity, creating a Quality-Homogenization Tradeoff where quality gains come at the cost of homogenized writing structures, though prompt design can reverse this effect.
Details
Motivation: While AI-assisted writing has been shown to improve essay quality, there's a gap in understanding its impact on the structural diversity of student thinking and whether AI leads to homogenization of writing styles.
Method: Analyzed 6,875 essays across five conditions: Human-only, AI-only, and three Human+AI prompt strategies. Used convergence target analysis to examine structural patterns and variance in different dimensions like cohesion architecture and perspective plurality.
Result: Found a Quality-Homogenization Tradeoff where substantial quality gains co-occur with significant homogenization. Cohesion architecture lost 70-78% of its variance, while perspective plurality was diversified. AI-augmented essays were pulled toward AI structural patterns but deviated from the Human-AI axis. Prompt specificity reversed homogenization into diversification on argument depth.
Conclusion: Homogenization is not an intrinsic property of AI but a function of interaction design. Prompt specificity can reverse homogenization into diversification, suggesting that careful design of human-AI interaction can mitigate negative effects on structural diversity.
Abstract: While AI-assisted writing has been widely reported to improve essay quality, its impact on the structural diversity of student thinking remains unexplored. Analyzing 6,875 essays across five conditions (Human-only, AI-only, and three Human+AI prompt strategies), we provide the first empirical evidence of a Quality-Homogenization Tradeoff, in which substantial quality gains co-occur with significant homogenization. The effect is dimension-specific: cohesion architecture lost 70-78% of its variance, whereas perspective plurality was diversified. Convergence target analysis further revealed that AI-augmented essays were pulled toward AI structural patterns yet deviated significantly from the Human-AI axis, indicating simultaneous partial replacement and partial emergence. Crucially, prompt specificity reversed homogenization into diversification on argument depth, demonstrating that homogenization is not an intrinsic property of AI but a function of interaction design.
[626] ConsRoute: Consistency-Aware Adaptive Query Routing for Cloud-Edge-Device Large Language Models
Haoyu Qiao, Hao Zhang, Shanwen Mao, Siyao Cheng, Jie Liu
Main category: cs.AI
TL;DR: ConsRoute is a lightweight routing framework for cloud-edge-device collaborative LLM inference that uses semantic consistency assessment and adaptive thresholds to reduce latency/cost while maintaining response quality.
Details
Motivation: LLMs have high inference latency and cost that hinder deployment in latency-sensitive and resource-constrained scenarios. Cloud-edge-device collaborative inference needs efficient routing methods that can balance quality, latency, and cost.
Method: Uses a reranker to assess semantic consistency between responses from different model tiers for fine-grained routing signals. Reuses LLM hidden states as compact query representations to minimize overhead. Clusters representations and uses Bayesian optimization to learn cluster-specific routing thresholds that dynamically balance quality, latency, and cost.
Result: Achieves near-cloud performance (>=95%) while reducing end-to-end latency and inference cost by nearly 40%. Consistently outperforms existing routing baselines in both response quality and system efficiency.
Conclusion: ConsRoute provides an effective lightweight routing framework for collaborative LLM inference that significantly improves efficiency while maintaining response quality through semantic-aware adaptive routing.
Abstract: Large language models (LLMs) deliver impressive capabilities but incur substantial inference latency and cost, which hinders their deployment in latency-sensitive and resource-constrained scenarios. Cloud-edge-device collaborative inference has emerged as a promising paradigm by dynamically routing queries to models of different capacities across tiers. In this paper, we propose ConsRoute, a lightweight, semantic-aware, and adaptive routing framework that significantly improves inference efficiency while minimizing impact on response quality. Unlike prior routing methods that rely on predicting coarse-grained output quality gaps, ConsRoute leverages a reranker to directly assess the semantic consistency between responses generated by models at different tiers, yielding fine-grained soft supervision signals for routing. To minimize device-side overhead, ConsRoute reuses hidden states from the LLM prefilling stage as compact query representations, avoiding additional encoders or inference passes. Furthermore, these representations are clustered, and Bayesian optimization is employed to learn cluster-specific routing thresholds that dynamically balance quality, latency, and cost under heterogeneous query distributions. Extensive experiments demonstrate that ConsRoute achieves near-cloud performance (>=95%) while reducing end-to-end latency and inference cost by nearly 40%, consistently outperforming existing routing baselines in both response quality and system efficiency.
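The cluster-specific threshold routing can be sketched as two steps: map the query representation to its nearest cluster centroid, then compare a predicted consistency score against that cluster's learned threshold. The function names, the two-tier decision, and the scalar consistency input are illustrative assumptions; in ConsRoute the thresholds come from Bayesian optimization and the representations from the LLM prefilling stage:

```python
def nearest_cluster(query_vec, centroids):
    """Index of the centroid closest (squared Euclidean) to the query representation."""
    def dist2(c):
        return sum((q - ci) ** 2 for q, ci in zip(query_vec, c))
    return min(range(len(centroids)), key=lambda i: dist2(centroids[i]))

def route(query_vec, consistency, centroids, thresholds):
    """Keep the query on the smaller tier if its predicted consistency with
    the cloud-tier response clears the cluster-specific threshold;
    otherwise escalate to the cloud."""
    k = nearest_cluster(query_vec, centroids)
    return "device" if consistency >= thresholds[k] else "cloud"
```

Per-cluster thresholds are what make the policy adaptive: clusters of queries where the small model reliably agrees with the cloud model can be given a looser threshold than clusters where it does not.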
[627] Graph of States: Solving Abductive Tasks with Large Language Models
Yu Luo, Rongchen Gao, Lu Teng, Xidao Wen, Jiamin Jiang, Qingliang Zhang, Yongqian Sun, Shenglin Zhang, Jiasong Feng, Tong Liu, Wenjie Zhang, Dan Pei
Main category: cs.AI
TL;DR: GoS is a neuro-symbolic framework for abductive reasoning that structures belief states using causal graphs and state machines to prevent common reasoning failures in LLMs.
Details
Motivation: Current LLMs excel at deduction and induction but struggle with abductive reasoning due to unstructured state representation and lack of explicit state control, leading to issues like evidence fabrication and context drift.
Method: Proposes Graph of States (GoS) framework with structured belief states using causal graphs to encode logical dependencies and state machines to govern valid reasoning transitions, enabling multi-agent collaboration with symbolic constraints.
Result: Extensive evaluations on two real-world datasets show GoS significantly outperforms all baselines, providing robust solutions for complex abductive tasks.
Conclusion: GoS successfully bridges the gap in LLM abductive reasoning by transforming aimless exploration into convergent, directed search through structured neuro-symbolic constraints.
Abstract: Logical reasoning encompasses deduction, induction, and abduction. However, while Large Language Models (LLMs) have effectively mastered the former two, abductive reasoning remains significantly underexplored. Existing frameworks, predominantly designed for static deductive tasks, fail to generalize to abductive reasoning due to unstructured state representation and lack of explicit state control. Consequently, they are inevitably prone to Evidence Fabrication, Context Drift, Failed Backtracking, and Early Stopping. To bridge this gap, we introduce Graph of States (GoS), a general-purpose neuro-symbolic framework tailored for abductive tasks. GoS grounds multi-agent collaboration in structured belief states, utilizing a causal graph to explicitly encode logical dependencies and a state machine to govern the valid transitions of the reasoning process. By dynamically aligning the reasoning focus with these symbolic constraints, our approach transforms aimless, unconstrained exploration into a convergent, directed search. Extensive evaluations on two real-world datasets demonstrate that GoS significantly outperforms all baselines, providing a robust solution for complex abductive tasks. Code repo and all prompts: https://anonymous.4open.science/r/Graph-of-States-5B4E.
[628] The Library Theorem: How External Organization Governs Agentic Reasoning Capacity
Zachary F. Mainen
Main category: cs.AI
TL;DR: Transformer agents with indexed external memory achieve exponentially lower retrieval costs than sequential scanning, but face parametric memory competition when familiar content bypasses retrieval protocols.
Details
Motivation: Current transformer-based agents use chain-of-thought reasoning but lack structured retrieval mechanisms. The paper aims to formalize and demonstrate the benefits of indexed external memory for transformer agents, while identifying challenges when language models bypass retrieval protocols due to parametric memory.
Method: Formalized transformer context window as I/O page, proved theoretical retrieval cost advantages of indexed agents vs sequential scanning. Tested on controlled lookup benchmark with three content types (random hashes, ordered integers, encyclopedia entries) across store sizes 50-5,000 items, using GPT-4o-mini and GPT-5.4 models.
Result: Indexed agents achieved median 1 page read regardless of store size (confirming O(1) prediction). Without index, weaker models couldn’t sustain binary search at scale; stronger models achieved near-optimal log₂N search but still lost to index by 5×. On familiar content, models bypassed retrieval protocols and generated answers from parametric memory, causing catastrophic token expenditure.
Conclusion: Language models should be used for index construction (where semantic understanding helps) and deterministic algorithms for index traversal (where semantic understanding hurts by tempting shortcuts). This separation of concerns addresses the parametric memory competition problem.
Abstract: Externalized reasoning is already exploited by transformer-based agents through chain-of-thought, but structured retrieval – indexing over one’s own reasoning state – remains underexplored. We formalize the transformer context window as an I/O page and prove that tool-augmented agents with indexed external memory achieve exponentially lower retrieval cost than agents restricted to sequential scanning: $O(\log_b N)$ versus $\Omega(N)$ page reads per query, and $O(T \log_b T)$ versus $\Theta(T^2)$ cumulative cost over $T$ reasoning steps – a gap that widens as deliberation deepens. We test these predictions on a controlled lookup benchmark across three content types – random hashes, ordered integers, and encyclopedia entries – varying store size from 50 to 5,000 items, and replicate key conditions across two model generations (GPT-4o-mini and GPT-5.4). On abstract content, the indexed agent achieves median 1 page read regardless of store size, confirming the $O(1)$ prediction. Sorted pages without an index fail to close the gap: the weaker model cannot sustain binary search at scale, and the stronger model achieves near-optimal $\log_2 N$ search but still loses to the index by $5\times$. On familiar content (encyclopedia entries), a competing failure mode emerges: the model recognizes the domain, bypasses the retrieval protocol, and generates answers from parametric memory, producing catastrophic token expenditure even when the index is sound. This parametric memory competition dissociates the two cognitive operations that indexing combines: understanding content (where language models excel) and following navigational protocols (where they fail when understanding tempts them to shortcut). The result argues for a separation of concerns: use language models for index construction, where semantic understanding helps, and deterministic algorithms for index traversal, where it hurts.
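The claimed page-read gap between sequential scanning and index-guided lookup can be reproduced with a toy model of pages. Here each "page" is a sorted run of keys and the index traversal is binary search (the b = 2 case of the paper's $O(\log_b N)$ bound); this is an illustrative cost model, not the paper's benchmark harness:

```python
def scan_reads(pages, key):
    """Sequential scan: read pages left-to-right until the key's page is
    found. Worst case reads all N pages."""
    for i, page in enumerate(pages, start=1):
        if key in page:
            return i
    return len(pages)

def indexed_reads(pages, key):
    """Binary search over sorted pages: one page read per probe, so at
    most about log2(N) + 1 reads. Assumes each page is a sorted run."""
    lo, hi, reads = 0, len(pages) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        reads += 1
        page = pages[mid]
        if key in page:
            return reads
        if key < page[0]:
            hi = mid - 1
        else:
            lo = mid + 1
    return reads
```

With 100 pages of 10 keys each, looking up the last key costs 100 page reads by scanning but only 7 by binary search, and the ratio keeps widening as the store grows, which is the scaling behavior the paper's indexed agent exploits.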
[629] Improving Coherence and Persistence in Agentic AI for System Optimization
Pantea Karimi, Kimia Noorbakhsh, Mohammad Alizadeh, Hari Balakrishnan
Main category: cs.AI
TL;DR: Engram is an agentic researcher architecture that addresses LLM limitations in complex system design by decoupling long-horizon exploration from single context windows, using persistent archives and research digests to accumulate knowledge across independent runs.
Details
Motivation: Current LLMs struggle with complex system design problems due to evolutionary neighborhood bias (getting stuck in local optima) and coherence ceiling (context degradation over long horizons), preventing effective multi-step conceptual shifts needed for creative system heuristic design.
Method: Engram organizes exploration into sequential agents that design, test, and analyze mechanisms. Each run stores code, logs, and results in a persistent Archive and distills insights into a compact Research Digest. Subsequent agents start fresh but read the Digest to build on prior discoveries.
Result: Engram demonstrates superior performance across diverse domains including multi-cloud multicast, LLM inference request routing, and optimizing KV cache reuse in databases with natural language queries.
Conclusion: The Engram architecture successfully addresses LLM limitations in complex system design by enabling persistent knowledge accumulation across independent runs, allowing for effective long-horizon exploration and coordinated multi-step changes.
Abstract: Designing high-performance system heuristics is a creative, iterative process requiring experts to form hypotheses and execute multi-step conceptual shifts. While Large Language Models (LLMs) show promise in automating this loop, they struggle with complex system problems due to two critical failure modes: evolutionary neighborhood bias and the coherence ceiling. Evolutionary methods often remain trapped in local optima by relying on scalar benchmark scores, failing when coordinated multi-step changes are required. Conversely, existing agentic frameworks suffer from context degradation over long horizons or fail to accumulate knowledge across independent runs. We present Engram, an agentic researcher architecture that addresses these limitations by decoupling long-horizon exploration from the constraints of a single context window. Engram organizes exploration into a sequence of agents that iteratively design, test, and analyze mechanisms. At the conclusion of each run, an agent stores code snapshots, logs, and results in a persistent Archive and distills high-level modeling insights into a compact, persistent Research Digest. Subsequent agents then begin with a fresh context window, reading the Research Digest to build on prior discoveries. We find that Engram exhibits superior performance across diverse domains including multi-cloud multicast, LLM inference request routing, and optimizing KV cache reuse in databases with natural language queries.
[630] ARYA: A Physics-Constrained Composable & Deterministic World Model Architecture
Seth Dobrin, Lukasz Chmiel
Main category: cs.AI
TL;DR: ARYA is a physics-constrained deterministic world model architecture using nano models and safety kernels, achieving state-of-the-art performance without neural networks across multiple industry domains.
Details
Motivation: To create a composable, physics-constrained world model that addresses limitations of monolithic foundation models, particularly around computational efficiency, safety, and deterministic reasoning while maintaining human control as autonomy increases.
Method: Hierarchical system-of-system-of-systems of specialized nano models orchestrated by AARA cognitive daemon, featuring Unfireable Safety Kernel as immutable safety boundary, linear scaling, sparse activation, and sub-20-second training cycles with zero neural network parameters.
Result: Achieves state-of-the-art performance across 6 of 9 competitive benchmarks head-to-head with GPT-5.2, Opus 4.6, and V-JEPA-2, deployed across seven industry domains including aerospace, pharma, oil and gas, smart cities, biotech, defense, and medical devices.
Conclusion: ARYA demonstrates that physics-constrained deterministic world models with architectural safety constraints can achieve competitive performance without neural networks, offering a scalable, safe alternative to monolithic foundation models.
Abstract: This paper presents ARYA, a composable, physics-constrained, deterministic world model architecture built on five foundational principles: nano models, composability, causal reasoning, determinism, and architectural AI safety. We demonstrate that ARYA satisfies all canonical world model requirements, including state representation, dynamic prediction, causal and physical awareness, temporal consistency, generalization, learnability, and planning and control. Unlike monolithic foundation models, the ARYA foundation model implements these capabilities through a hierarchical system-of-system-of-systems of specialized nano models, orchestrated by AARA (ARYA Autonomous Research Agent), an always-on cognitive daemon that executes a continuous sense-decide-act-learn loop. The nano model architecture provides linear scaling, sparse activation, selective untraining, and sub-20-second training cycles, resolving the traditional tension between capability and computational efficiency. A central contribution is the Unfireable Safety Kernel: an architecturally immutable safety boundary that cannot be disabled or circumvented by any system component, including its own self-improvement engine. This is not a social or ethical alignment statement; it is a technical framework ensuring human control persists as autonomy increases. Safety is an architectural constraint governing every operation, not a policy layer applied after the fact. We present formal alignment between ARYA’s architecture and canonical world model requirements, and report results summarizing its state-of-the-art performance across 6 of 9 competitive benchmarks head-to-head with GPT-5.2, Opus 4.6, and V-JEPA-2, all with zero neural network parameters, across seven active industry domain nodes spanning aerospace, pharma manufacturing, oil and gas, smart cities, biotech, defense, and medical devices.
[631] RoboAlign: Learning Test-Time Reasoning for Language-Action Alignment in Vision-Language-Action Models
Dongyoung Kim, Sumin Park, Woomin Song, Seungku Kim, Taeyoung Kim, Huiwon Jang, Jinwoo Shin, Jaehyung Kim, Younggyo Seo
Main category: cs.AI
TL;DR: RoboAlign: A systematic MLLM training framework that improves embodied reasoning for vision-language-action models using zero-shot natural language reasoning and RL-based alignment to bridge the modality gap between language and low-level actions.
Details
Motivation: Existing approaches to enhance embodied reasoning in MLLMs through vision-question-answering supervision result in unstable VLA performance with marginal or negative gains. There's a need for a more systematic framework to reliably improve VLA performance by bridging the modality gap between language understanding and low-level actions.
Method: Proposes RoboAlign framework that: 1) Samples action tokens via zero-shot natural language reasoning, 2) Refines reasoning using reinforcement learning to improve action accuracy, 3) Bridges modality gap between language and low-level actions in MLLMs, 4) Facilitates knowledge transfer from MLLM to VLA, 5) Uses RL-based alignment after SFT with minimal data.
Result: RoboAlign achieves significant performance improvements: 17.5% on LIBERO, 18.9% on CALVIN, and 106.6% on real-world environments over SFT baselines, using less than 1% of data for RL-based alignment after SFT.
Conclusion: RoboAlign provides a systematic and effective framework for improving embodied reasoning in MLLMs, enabling reliable VLA performance gains by bridging the language-action modality gap through RL-based alignment with minimal data requirements.
Abstract: Improving embodied reasoning in multimodal large language models (MLLMs) is essential for building vision-language-action models (VLAs) on top of them to readily translate multimodal understanding into low-level actions. Accordingly, recent work has explored enhancing embodied reasoning in MLLMs through supervision of the vision-question-answering type. However, these approaches have been reported to result in unstable VLA performance, often yielding only marginal or even negative gains. In this paper, we propose RoboAlign, a more systematic MLLM training framework that reliably improves VLA performance. Our key idea is to sample action tokens via zero-shot natural language reasoning and refine this reasoning using reinforcement learning (RL) to improve action accuracy. As a result, RoboAlign bridges the modality gap between language and low-level actions in MLLMs, and facilitates knowledge transfer from MLLM to VLA. To validate the effectiveness of RoboAlign, we train VLAs by adding a diffusion-based action head on top of an MLLM backbone and evaluate them on major robotics benchmarks. Remarkably, by performing RL-based alignment after SFT using less than 1% of the data, RoboAlign achieves performance improvements of 17.5%, 18.9%, and 106.6% over SFT baselines on LIBERO, CALVIN, and real-world environments, respectively.
[632] The AI Scientific Community: Agentic Virtual Lab Swarms
Ulisses Braga-Neto
Main category: cs.AI
TL;DR: Proposes using agentic swarms of virtual labs as an AI Science Community model where each particle represents a virtual lab, enabling collective scientific exploration through swarm intelligence principles.
Details
Motivation: To create a model that simulates real-world scientific communities using swarm intelligence principles, potentially accelerating scientific discovery through decentralized coordination and emergent collective behavior.
Method: Uses agentic swarms where each particle is a virtual laboratory instance, with mechanisms for inter-laboratory communication, citation-analogous voting systems, fitness functions for scientific success quantification, and strategies to prevent lab dominance while preserving diversity.
Result: A conceptual framework is presented with architectural considerations and a working instance currently under development, but no empirical results are reported in this short note.
Conclusion: The AI Science Community framework using agentic swarms shows promise for simulating scientific communities and potentially accelerating discovery, with ongoing development of a working instance.
Abstract: In this short note we propose using agentic swarms of virtual labs as a model of an AI Science Community. In this paradigm, each particle in the swarm represents a complete virtual laboratory instance, enabling collective scientific exploration that mirrors real-world research communities. The framework leverages the inherent properties of swarm intelligence - decentralized coordination, balanced exploration-exploitation trade-offs, and emergent collective behavior - to simulate the behavior of a scientific community and potentially accelerate scientific discovery. We discuss architectural considerations, inter-laboratory communication and influence mechanisms including citation-analogous voting systems, fitness function design for quantifying scientific success, anticipated emergent behaviors, mechanisms for preventing lab dominance and preserving diversity, and computational efficiency strategies to enable large swarms exhibiting complex emergent behavior analogous to real-world scientific communities. A working instance of the AI Science Community is currently under development.
[633] AgentHER: Hindsight Experience Replay for LLM Agent Trajectory Relabeling
Liang Ding
Main category: cs.AI
TL;DR: AgentHER framework converts failed LLM agent trajectories into training data using hindsight experience replay principles, improving performance on WebArena and ToolBench benchmarks.
Details
Motivation: Current LLM agents fail on most real-world tasks (e.g., GPT-4o succeeds on <15% of WebArena tasks), and failed trajectories are discarded, wasting valuable training data.
Method: Four-stage pipeline: 1) failure classification, 2) outcome extraction, 3) LLM-guided prompt relabeling with confidence gating, 4) data packaging. Converts failures into SFT, DPO, and ShareGPT training data with rule-based and LLM-judge implementations.
Result: Improves over success-only SFT by +7.1-11.7 pp across four model families (GPT-4o, Qwen2.5-72B/7B, LLaMA-3.1-8B) and achieves 2x data efficiency, matching the baseline with only 50% of the successful demonstrations. Gains are consistent from 1.5B to 72B parameters (+5.8-9.2 pp) and compound under iterative redeployment.
Conclusion: AgentHER effectively recovers wasted training signal from failed trajectories, significantly improving LLM agent performance through data augmentation while maintaining high relabeling precision (97.7%).
Abstract: LLM agents fail on the majority of real-world tasks – GPT-4o succeeds on fewer than 15% of WebArena navigation tasks and below 55% pass@1 on ToolBench (Zhou et al., 2024; Qin et al., 2024) – yet every failed trajectory is routinely discarded, wasting the dominant source of collected experience. We introduce AgentHER, a framework that recovers this lost training signal by adapting the Hindsight Experience Replay (HER; Andrychowicz et al., 2017) principle to natural-language agent trajectories for offline data augmentation. The key insight is simple: a trajectory that fails goal A is often a correct demonstration for some achievable alternative goal B. AgentHER realises this idea through a four-stage pipeline – failure classification, outcome extraction, LLM-guided prompt relabeling with confidence gating, and data packaging – that converts discarded failures into high-quality SFT, DPO, and ShareGPT training data, with both zero-cost rule-based and LLM-judge implementations. On WebArena (Zhou et al., 2024) and ToolBench (Qin et al., 2024), AgentHER improves over success-only SFT by +7.1-11.7 pp across four model families (GPT-4o, Qwen2.5-72B/7B, LLaMA-3.1-8B), while achieving 2x data efficiency – matching baseline performance with only 50% of successful demonstrations. Gains are consistent from 1.5B to 72B parameters (+5.8-9.2 pp) and compound under iterative redeployment (+2.1 pp over additional rounds). Human evaluation confirms 97.7% relabeling precision under multi-judge verification.
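The relabeling step at the heart of AgentHER can be sketched in a few lines. The dict shapes, field names, and confidence threshold below are hypothetical stand-ins for stages 2-4 of the pipeline, not the paper's actual data format.

```python
def hindsight_relabel(trajectory, achieved_outcome, confidence, threshold=0.8):
    """HER-style relabeling for agent trajectories: a run that failed its
    original goal is repackaged as a successful demonstration of the goal
    it actually achieved, gated on the relabeler's confidence."""
    if confidence < threshold:
        return None  # confidence gating: drop ambiguous relabels
    return {
        "prompt": f"Task: {achieved_outcome}",   # rewritten instruction
        "actions": list(trajectory["actions"]),  # actions stay unchanged
        "label": "success",
    }

# A trajectory that failed "book a flight" but did successfully book a hotel:
failed_run = {"goal": "book a flight",
              "actions": ["open_site", "search_hotels", "book_hotel"]}
relabeled = hindsight_relabel(failed_run, "book a hotel", confidence=0.95)
```

The point of the gate is that only relabels the judge is confident about enter the SFT/DPO pool, which is how the reported 97.7% relabeling precision becomes achievable.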
[634] AdaRubric: Task-Adaptive Rubrics for LLM Agent Evaluation
Liang Ding
Main category: cs.AI
TL;DR: ADARUBRIC is a dynamic evaluation framework that generates task-specific rubrics on the fly for agent tasks, addressing LLM-as-Judge limitations by providing dimension-aware scoring and filtering for preference pair generation.
Details
Motivation: LLM-as-Judge evaluation fails for agent tasks because fixed rubrics cannot capture task-specific requirements. Different tasks demand different evaluation dimensions (e.g., code debugging needs correctness and error handling, web navigation needs goal alignment and action efficiency).
Method: ADARUBRIC generates task-specific evaluation rubrics dynamically from task descriptions, scores trajectories step-by-step with confidence-weighted per-dimension feedback, and filters preference pairs using DimensionAwareFilter to prevent high-scoring dimensions from masking failures in other dimensions.
Result: Achieves Pearson r=0.79 human correlation (+0.16 over best static baseline) with high reliability (Krippendorff’s α=0.83). DPO agents trained on ADARUBRIC preference pairs gain +6.8 to +8.5 pp task success over Prometheus across three benchmarks, with transfer to SWE-bench code repair (+4.9 pp) and accelerated PPO convergence (+6.6 pp at 5K steps).
Conclusion: ADARUBRIC provides a robust, task-adaptive evaluation framework for agent tasks that outperforms static evaluation methods and enables better agent training through high-quality preference pairs without manual rubric engineering.
Abstract: LLM-as-Judge evaluation fails agent tasks because a fixed rubric cannot capture what matters for this task: code debugging demands Correctness and Error Handling; web navigation demands Goal Alignment and Action Efficiency. We present ADARUBRIC, which closes this gap by generating task-specific evaluation rubrics on the fly from task descriptions, scoring trajectories step-by-step with confidence-weighted per-dimension feedback, and filtering preference pairs with the novel DimensionAwareFilter - a provably necessary condition for preventing high-scoring dimensions from masking dimension-level failures. On WebArena and ToolBench, ADARUBRIC achieves Pearson r=0.79 human correlation (+0.16 over the best static baseline) with deployment-grade reliability (Krippendorff’s $α$=0.83). DPO agents trained on ADARUBRIC preference pairs gain +6.8 to +8.5 pp task success over Prometheus across three benchmarks; gains transfer to SWE-bench code repair (+4.9 pp) and accelerate PPO convergence by +6.6 pp at 5K steps - both without any rubric engineering. Code: https://github.com/alphadl/AdaRubrics.
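One plausible reading of the DimensionAwareFilter condition, sketched below; the Pareto-style rule, the dimension names, and the margin parameter are our own illustration, and the paper's exact criterion may differ.

```python
def dimension_aware_filter(chosen, rejected, margin=0.0):
    """Accept a preference pair only if 'chosen' is at least as good as
    'rejected' on every rubric dimension and strictly better on at least
    one, so a single high-scoring dimension cannot mask a failure on
    another dimension (hypothetical reading of the paper's filter)."""
    dims = chosen.keys()
    no_regression = all(chosen[d] >= rejected[d] - margin for d in dims)
    some_gain = any(chosen[d] > rejected[d] for d in dims)
    return no_regression and some_gain

# A pair whose mean score favors 'chosen' but whose Error Handling regresses:
chosen = {"Correctness": 0.95, "Error Handling": 0.20}
rejected = {"Correctness": 0.60, "Error Handling": 0.70}
```

Under a scalar mean, this pair would be accepted (0.575 vs 0.65 is close, and a slightly stronger Correctness score would flip it); the per-dimension check rejects it outright because of the Error Handling regression.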
[635] A transformer architecture alteration to incentivise externalised reasoning
Elizabeth Pavlova, Mariia Koroliuk, Karthik Viswanathan, Cameron Tice, Edward James Young, Puria Radmard
Main category: cs.AI
TL;DR: Teaching LLMs to exit early from forward passes to reduce computation while maintaining reasoning performance
Details
Motivation: To make LLMs more efficient reasoners by reducing unnecessary deep computations for predictable tokens, reserving complex processing only for difficult tokens.
Method: Augment transformer architecture with early-exit mechanism at intermediate layers, train model to exit at shallower layers when next token can be predicted without deep computation, use reinforcement learning after calibration to incentivize early exits while maintaining performance
Result: Preliminary results show small reasoning models learn to adaptively reduce computations across tokens
Conclusion: The approach can minimize excess computation in reasoning models, reserving deep processing only for difficult-to-predict tokens when applied at appropriate scale
Abstract: We propose a new architectural change and post-training pipeline for making LLMs more verbose reasoners by teaching the model to truncate forward passes early. We augment an existing transformer architecture with an early-exit mechanism at intermediate layers and train the model to exit at shallower layers when the next token can be predicted without deep computation. After a calibration stage, we use reinforcement learning to incentivise the model to exit as early as possible while maintaining task performance. We provide preliminary results to this effect for small reasoning models, showing that they learn to adaptively reduce computations across tokens. We predict that, applied at the right scale, our approach can minimise the amount of excess computation that reasoning models have at their disposal to perform non-myopic planning using their internal activations, reserving this only for difficult-to-predict tokens.
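A minimal sketch of the early-exit control flow described above; the layer and exit-head callables here are toy stand-ins, not the paper's architecture.

```python
def forward_with_early_exit(layers, exit_heads, x, threshold=0.9):
    """After each transformer block, a small exit head scores how confident
    the model is about the next token; once the score clears the threshold,
    the remaining (deeper) layers are skipped."""
    for depth, (layer, head) in enumerate(zip(layers, exit_heads), start=1):
        x = layer(x)
        if head(x) >= threshold:
            return x, depth       # shallow exit: an easy-to-predict token
    return x, len(layers)         # full depth: a hard token

# Toy stand-ins: each "layer" adds 1, each "head" reports x / 4 as confidence.
layers = [lambda v: v + 1] * 4
heads = [lambda v: v / 4] * 4
hidden, exit_depth = forward_with_early_exit(layers, heads, x=0, threshold=0.7)
```

The RL stage then rewards small `exit_depth` subject to task accuracy, which is what makes the per-token depth adaptive rather than fixed.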
[636] PivotRL: High Accuracy Agentic Post-Training at Low Compute Cost
Junkeun Yi, Damon Mosk-Aoyama, Baihe Huang, Ritu Gala, Charles Wang, Sugam Dipak Devare, Khushi Bhardwaj, Abhibha Gupta, Oleksii Kuchaiev, Jiantao Jiao, Jian Zhang, Venkat Srinivasan
Main category: cs.AI
TL;DR: PivotRL is a post-training framework that combines compute efficiency of supervised fine-tuning with OOD generalization of end-to-end RL by using local rollouts to find informative pivot points and functional-equivalent rewards.
Details
Motivation: There's a tension in post-training for long-horizon agentic tasks: SFT is compute efficient but suffers from OOD degradation, while E2E RL preserves OOD capabilities but incurs high compute costs from many on-policy rollouts.
Method: PivotRL operates on existing SFT trajectories using two key mechanisms: 1) local on-policy rollouts to identify “pivots” - informative intermediate turns where sampled actions show high outcome variance, and 2) rewards for functionally equivalent actions rather than strict string matching with SFT demonstrations.
Result: PivotRL achieves +4.17% higher in-domain accuracy across four agentic domains and +10.04% higher OOD accuracy in non-agentic tasks compared to standard SFT. On agentic coding tasks, it achieves competitive accuracy with E2E RL using 4x fewer rollout turns.
Conclusion: PivotRL successfully combines the compute efficiency of SFT with the OOD generalization of E2E RL, and has been adopted by NVIDIA’s Nemotron-3-Super-120B-A12B for production-scale agentic post-training.
Abstract: Post-training for long-horizon agentic tasks has a tension between compute efficiency and generalization. While supervised fine-tuning (SFT) is compute efficient, it often suffers from out-of-domain (OOD) degradation. Conversely, end-to-end reinforcement learning (E2E RL) preserves OOD capabilities, but incurs high compute costs due to many turns of on-policy rollout. We introduce PivotRL, a novel framework that operates on existing SFT trajectories to combine the compute efficiency of SFT with the OOD accuracy of E2E RL. PivotRL relies on two key mechanisms: first, it executes local, on-policy rollouts and filters for pivots: informative intermediate turns where sampled actions exhibit high variance in outcomes; second, it utilizes rewards for functionally equivalent actions rather than demanding strict string matching with the SFT data demonstrations. We theoretically show that these mechanisms incentivize strong learning signals with high natural gradient norm, while maximally preserving policy probability ordering on actions unrelated to training tasks. In comparison to standard SFT on identical data, we demonstrate that PivotRL achieves +4.17% higher in-domain accuracy on average across four agentic domains, and +10.04% higher OOD accuracy in non-agentic tasks. Notably, on agentic coding tasks, PivotRL achieves competitive accuracy with E2E RL with 4x fewer rollout turns. PivotRL is adopted by NVIDIA’s Nemotron-3-Super-120B-A12B, acting as the workhorse in production-scale agentic post-training.
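The pivot-filtering idea (keep the intermediate turns where sampled continuations disagree most) can be sketched directly; the variance threshold and data shapes below are our own illustration.

```python
from statistics import pvariance

def find_pivots(outcomes_by_turn, min_variance=0.15):
    """For each intermediate turn, branch several sampled rollouts and keep
    turns whose success/failure outcomes have high variance, i.e. the turns
    where the policy's next action actually changes the result."""
    return [turn for turn, outcomes in enumerate(outcomes_by_turn)
            if pvariance(outcomes) >= min_variance]

# outcomes_by_turn[t]: 1/0 success of rollouts branched at turn t
samples = [
    [1, 1, 1, 1],  # turn 0: every continuation succeeds -> uninformative
    [1, 0, 1, 0],  # turn 1: outcome hinges on the action taken -> pivot
    [0, 0, 0, 0],  # turn 2: already doomed -> uninformative
]
```

Training only on such turns is what lets PivotRL spend RL compute where the learning signal is concentrated instead of rolling out every turn end-to-end.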
[637] Persona Vectors in Games: Measuring and Steering Strategies via Activation Vectors
Johnathan Sun, Andrew Zhang
Main category: cs.AI
TL;DR: Using activation steering to create persona vectors for LLM strategic behavior in game theory settings, showing systematic shifts in both quantitative choices and natural language justifications.
Details
Motivation: LLMs are increasingly deployed as autonomous decision-makers in strategic settings, but we lack tools to understand their high-level behavioral traits. The paper aims to develop methods to analyze and steer LLM behavior in game-theoretic contexts.
Method: Uses activation steering methods in game-theoretic settings, constructing persona vectors for altruism, forgiveness, and expectations of others through contrastive activation addition. Evaluates on canonical games to analyze behavioral shifts.
Result: Activation steering systematically shifts both quantitative strategic choices and natural-language justifications. However, rhetoric and strategy can diverge under steering. Vectors for self-behavior and expectations of others are partially distinct.
Conclusion: Persona vectors offer a promising mechanistic handle on high-level traits in strategic environments, providing tools to understand and potentially steer LLM behavior in decision-making contexts.
Abstract: Large language models (LLMs) are increasingly deployed as autonomous decision-makers in strategic settings, yet we have limited tools for understanding their high-level behavioral traits. We use activation steering methods in game-theoretic settings, constructing persona vectors for altruism, forgiveness, and expectations of others by contrastive activation addition. Evaluating on canonical games, we find that activation steering systematically shifts both quantitative strategic choices and natural-language justifications. However, we also observe that rhetoric and strategy can diverge under steering. In addition, vectors for self-behavior and expectations of others are partially distinct. Our results suggest that persona vectors offer a promising mechanistic handle on high-level traits in strategic environments.
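Contrastive activation addition reduces to a mean-difference vector over hidden states; a dependency-free sketch follows, with invented toy activations standing in for residual-stream values at a chosen layer.

```python
from statistics import fmean

def persona_vector(pos_acts, neg_acts):
    """Contrastive activation addition: the steering vector is the mean
    activation under trait-positive prompts minus the mean under
    trait-negative prompts, one value per hidden dimension."""
    dim = len(pos_acts[0])
    return [fmean(a[i] for a in pos_acts) - fmean(a[i] for a in neg_acts)
            for i in range(dim)]

def steer(hidden, vector, alpha=1.0):
    """At inference, add the scaled persona vector to a hidden state."""
    return [h + alpha * v for h, v in zip(hidden, vector)]

# Toy 2-d activations from "altruistic" vs "selfish" prompts:
vec = persona_vector([[1.0, 0.0], [3.0, 0.0]], [[0.0, 2.0], [0.0, 4.0]])
```

Varying `alpha` is what produces the systematic shifts in strategy the paper measures, and comparing vectors built from different prompt contrasts is how it finds that self-behavior and expectations-of-others directions are partially distinct.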
[638] The Myhill-Nerode Theorem for Bounded Interaction: Canonical Abstractions via Agent-Bounded Indistinguishability
Anthony T. Nixon
Main category: cs.AI
TL;DR: A theoretical framework for quantifying the distinguishability limits of bounded agents in POMDPs, establishing canonical quotients based on what finite-state controllers can differentiate.
Details
Motivation: To formalize the fundamental limitations of capacity-bounded observers in partially observable environments, establishing what situations are indistinguishable to agents with finite computational resources.
Method: Develops a mathematical framework using Wasserstein pseudometrics on observation histories induced by families of finite-state controllers, creating canonical quotients that merge indistinguishable histories, analogous to the Myhill-Nerode theorem for bounded interaction.
Result: Establishes canonical, minimal, and unique quotients for bounded agents, proves decision-sufficiency for clock-aware probes, provides approximation bounds for latent-state rewards, and validates theoretical claims on standard POMDP benchmarks.
Conclusion: The framework provides a principled way to understand and quantify the fundamental distinguishability limits of bounded agents in partially observable environments, with practical implications for agent design and analysis.
Abstract: Any capacity-limited observer induces a canonical quotient on its environment: two situations that no bounded agent can distinguish are, for that agent, the same. We formalise this for finite POMDPs. A fixed probe family of finite-state controllers induces a closed-loop Wasserstein pseudometric on observation histories and a probe-exact quotient merging histories that no controller in the family can distinguish. The quotient is canonical, minimal, and unique: a bounded-interaction analogue of the Myhill-Nerode theorem. For clock-aware probes, it is exactly decision-sufficient for objectives that depend only on the agent’s observations and actions; for latent-state rewards, we use an observation-Lipschitz approximation bound. The main theorem object is the clock-aware quotient; scalable deterministic-stationary experiments study a tractable coarsening with gap measured on small exact cases and explored empirically at larger scale. We validate theorem-level claims on Tiger and GridWorld. We also report operational case studies on Tiger, GridWorld, and RockSample as exploratory diagnostics of approximation behavior and runtime, not as theorem-facing evidence when no exact cross-family certificate is available; heavier stress tests are archived in the appendix and artifact package.
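For reference, the classical Myhill-Nerode quotient that the paper generalizes merges DFA states that no input string can distinguish; a standard partition-refinement sketch (the paper's construction replaces strings with bounded finite-state probes and exact equality with a Wasserstein pseudometric):

```python
def nerode_quotient(states, alphabet, delta, accepting):
    """Classical Myhill-Nerode minimization by partition refinement: states
    start split by acceptance and are re-split until two states share a
    block iff every input symbol sends them to the same blocks."""
    block_of = {s: int(s in accepting) for s in states}
    while True:
        sigs = {s: (block_of[s], tuple(block_of[delta[s][a]] for a in alphabet))
                for s in states}
        blocks = {}
        for s, sig in sigs.items():
            blocks.setdefault(sig, []).append(s)
        if len(blocks) == len(set(block_of.values())):
            return sorted(sorted(b) for b in blocks.values())  # stable partition
        block_of = {s: i for i, sig in enumerate(sorted(blocks))
                    for s in blocks[sig]}
```

On a DFA where states A and B both step to the same accepting sink C, the quotient merges A with B: they are, for every observer of this machine, the same state.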
[639] Silent Commitment Failure in Instruction-Tuned Language Models: Evidence of Governability Divergence Across Architectures
Gregory M. Ruddell
Main category: cs.AI
TL;DR: Paper introduces “governability” concept for LLM security - the ability to detect and correct model errors before output commitment, showing models vary dramatically in this capability with some producing silent failures.
Details
Motivation: Current security architecture for autonomous LLM agents assumes model errors are detectable at runtime, but this paper challenges that assumption by showing many models produce confident, incorrect outputs without warning signals.
Method: Evaluated six models across twelve reasoning domains, measuring governability through detection capacity (ability to spot errors before commitment) and correction capacity (ability to fix errors once detected). Used 2x2 experiment to analyze architecture vs fine-tuning effects.
Result: Two of three instruction-following models exhibited “silent commitment failure” - confident incorrect output with zero warning. Only one model produced detectable conflict signal 57 tokens before commitment. Benchmark accuracy doesn’t predict governability, and identical governance scaffolds have opposite effects across models.
Conclusion: Governability varies dramatically across models, appears fixed at pretraining rather than fine-tuning, and requires new security frameworks. Proposed Detection and Correction Matrix classifies model-task combinations into Governable, Monitor Only, Steer Blind, and Ungovernable regimes.
Abstract: As large language models are deployed as autonomous agents with tool execution privileges, a critical assumption underpins their security architecture: that model errors are detectable at runtime. We present empirical evidence that this assumption fails for two of three instruction-following models evaluable for conflict detection. We introduce governability – the degree to which a model’s errors are detectable before output commitment and correctable once detected – and demonstrate it varies dramatically across models. In six models across twelve reasoning domains, two of three instruction-following models exhibited silent commitment failure: confident, fluent, incorrect output with zero warning signal. The remaining model produced a detectable conflict signal 57 tokens before commitment under greedy decoding. We show benchmark accuracy does not predict governability, correction capacity varies independently of detection, and identical governance scaffolds produce opposite effects across models. A 2x2 experiment shows a 52x difference in spike ratio between architectures but only +/-0.32x variation from fine-tuning, suggesting governability is fixed at pretraining. We propose a Detection and Correction Matrix classifying model-task combinations into four regimes: Governable, Monitor Only, Steer Blind, and Ungovernable.
[640] Is the future of AI green? What can innovation diffusion models say about generative AI’s environmental impact?
Robert Viseur, Nicolas Jullien
Main category: cs.AI
TL;DR: Analysis of GAI’s environmental impact using innovation diffusion model suggests impact may be less severe than predicted due to economic-driven optimization, but depends on dominant business model.
Details
Motivation: Current alarming predictions about generative AI's environmental impact overlook how innovation diffusion and economic optimization can reduce environmental footprint over time.
Method: Analyzes the GAI ecosystem using the classic Abernathy-Utterback (A-U) innovation diffusion model to forecast industry structure and environmental impact evolution.
Result: GAI will never be completely green, but its environmental impact may be less problematic than often claimed, contingent on which business model becomes dominant in the industry.
Conclusion: Environmental impact of generative AI depends on industry evolution and business models; economic optimization may mitigate some concerns, but sustainability challenges remain.
Abstract: The rise of generative artificial intelligence (GAI) has led to alarming predictions about its environmental impact. However, these predictions often overlook the fact that the diffusion of innovation is accompanied by the evolution of products and the optimization of their performance, primarily for economic reasons. This can also reduce their environmental impact. By analyzing the GAI ecosystem using the classic A-U innovation diffusion model, we can forecast this industry’s structure and how its environmental impact will evolve. While GAI will never be green, its impact may not be as problematic as is sometimes claimed. However, this depends on which business model becomes dominant.
[641] DomAgent: Leveraging Knowledge Graphs and Case-Based Reasoning for Domain-Specific Code Generation
Shuai Wang, Dhasarathy Parthasarathy, Robert Feldt, Yinan Yu
Main category: cs.AI
TL;DR: DomAgent is an autonomous coding agent that enhances LLMs’ domain-specific code generation through structured reasoning and targeted retrieval, with a novel DomRetriever module that combines knowledge-graph reasoning with case-based reasoning.
Details
Motivation: Generic LLMs trained on public corpora often fail in real-world software development requiring domain-specific knowledge, as specialized solutions are underrepresented in their training data.
Method: Proposes DomAgent with DomRetriever module that dynamically integrates top-down knowledge-graph reasoning with bottom-up case-based reasoning for iterative retrieval of structured knowledge and representative cases.
Result: DomAgent significantly enhances domain-specific code generation, enabling small open-source models to close performance gaps with large proprietary LLMs in complex real-world applications.
Conclusion: DomAgent bridges the gap between generic LLMs and domain-specific coding needs through structured reasoning and targeted retrieval, with DomRetriever being usable independently with any LLM.
Abstract: Large language models (LLMs) have shown impressive capabilities in code generation. However, because most LLMs are trained on public domain corpora, directly applying them to real-world software development often yields low success rates, as these scenarios frequently require domain-specific knowledge. In particular, domain-specific tasks usually demand highly specialized solutions, which are often underrepresented or entirely absent in the training data of generic LLMs. To address this challenge, we propose DomAgent, an autonomous coding agent that bridges this gap by enabling LLMs to generate domain-adapted code through structured reasoning and targeted retrieval. A core component of DomAgent is DomRetriever, a novel retrieval module that emulates how humans learn domain-specific knowledge, by combining conceptual understanding with experiential examples. It dynamically integrates top-down knowledge-graph reasoning with bottom-up case-based reasoning, enabling iterative retrieval and synthesis of structured knowledge and representative cases to ensure contextual relevance and broad task coverage. DomRetriever can operate as part of DomAgent or independently with any LLM for flexible domain adaptation. We evaluate DomAgent on an open benchmark dataset in the data science domain (DS-1000) and further apply it to real-world truck software development tasks. Experimental results show that DomAgent significantly enhances domain-specific code generation, enabling small open-source models to close much of the performance gap with large proprietary LLMs in complex, real-world applications. The code is available at: https://github.com/Wangshuaiia/DomAgent.
[642] Behavioural feasible set: Value alignment constraints on AI decision support
Taejin Park
Main category: cs.AI
TL;DR: AI system alignment by vendors creates behavioral feasible sets that limit organizational decision flexibility, embedding vendor value judgments that cannot be renegotiated through prompting.
Details
Motivation: Organizations adopting commercial AI systems inherit opaque vendor value judgments that constrain decision-making flexibility, creating governance challenges where vendor-imposed alignment determines which trade-offs remain negotiable.
Method: Formalizes behavioral feasible sets (range of recommendations reachable under vendor constraints), uses scenario-based experiments with binary decisions and multi-stakeholder ranking tasks, compares pre- and post-alignment variants of open-weight models, and analyzes leading commercial models.
Result: Alignment substantially compresses the feasible set of recommendations, making systems less able to shift recommendations under legitimate contextual pressure; alignment shifts implied stakeholder priorities rather than neutralizing them; commercial models show comparable or greater rigidity.
Conclusion: Organizations face a fundamental governance problem where vendor selection determines embedded value orientations and which trade-offs remain negotiable, a limitation that cannot be resolved through better prompting alone.
Abstract: When organisations adopt commercial AI systems for decision support, they inherit value judgements embedded by vendors that are neither transparent nor renegotiable. The governance puzzle is not whether AI can support decisions but which recommendations the system can actually produce given how its vendor has configured it. I formalise this as a behavioural feasible set, the range of recommendations reachable under vendor-imposed alignment constraints, and characterise diagnostic thresholds for when organisational requirements exceed the system’s flexibility. In scenario-based experiments using binary decision scenarios and multi-stakeholder ranking tasks, I show that alignment materially compresses this set. Comparing pre- and post-alignment variants of an open-weight model isolates the mechanism: alignment makes the system substantially less able to shift its recommendation even under legitimate contextual pressure. Leading commercial models exhibit comparable or greater rigidity. In multi-stakeholder tasks, alignment shifts implied stakeholder priorities rather than neutralising them, meaning organisations adopt embedded value orientations set upstream by the vendor. Organisations thus face a governance problem that better prompting cannot resolve: selecting a vendor partially determines which trade-offs remain negotiable and which stakeholder priorities are structurally embedded.
[643] Safety as Computation: Certified Answer Reuse via Capability Closure in Task-Oriented Dialogue
Cosimo Spera
Main category: cs.AI
TL;DR: A new paradigm for task-oriented dialogue systems using safety certification as a computational primitive for answer reuse, eliminating redundant retrieval/generation through certified answer storage and formal containment checks.
Details
Motivation: Current dialogue systems treat each turn independently, recomputing answers via retrieval or generation even when answers are already derivable from prior state, leading to inefficiency.
Method: Introduces Certified Answer Store (CAS) augmented by Pre-Answer Blocks (PAB) that materializes all derivable follow-up answers with minimal provenance witnesses at each certified turn, enabling sub-millisecond query answering via formal containment checks.
Result: System eliminates redundant retrieval and generation by answering subsequent queries through formal containment checks on pre-computed certified answers.
Conclusion: Safety certification can serve as a computational primitive for efficient answer reuse in dialogue systems, significantly reducing computational overhead while maintaining correctness.
Abstract: We introduce a new paradigm for task-oriented dialogue systems: safety certification as a computational primitive for answer reuse. Current systems treat each turn independently, recomputing answers via retrieval or generation even when they are already derivable from prior state. We show that in capability-based systems, the safety certification step computes a fixed-point closure cl(At) that already contains every answer reachable from the current configuration. We operationalize this insight with a Certified Answer Store (CAS) augmented by Pre-Answer Blocks (PAB): at each certified turn, the system materializes all derivable follow-up answers together with minimal provenance witnesses. Subsequent queries are answered in sub-millisecond time via formal containment checks, eliminating redundant retrieval and generation.
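The closure-then-containment idea can be pictured with a toy forward-chaining sketch. Everything below is invented for illustration (the facts, rules, and query strings are hypothetical; the paper's fixed-point closure cl(At) operates over capability state in a dialogue system, not these literals):

```python
def fixed_point_closure(facts, rules):
    # Saturate the fact set: apply derivation rules until no new fact
    # appears, i.e. until a fixed point is reached.
    closure = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= closure and conclusion not in closure:
                closure.add(conclusion)
                changed = True
    return closure

# Hypothetical capability facts and derivation rules for one certified turn.
facts = {"order_placed", "payment_ok"}
rules = [
    (frozenset({"order_placed", "payment_ok"}), "order_confirmed"),
    (frozenset({"order_confirmed"}), "eta_available"),
]
store = fixed_point_closure(facts, rules)

def answer(query):
    # Subsequent queries reduce to a containment check against the
    # pre-materialized closure; no retrieval or generation is needed.
    return query in store
```

Here `answer("eta_available")` succeeds purely by membership in the precomputed closure, which is the reuse the CAS/PAB design aims for.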
[644] Beyond Correlation: Refutation-Validated Aspect-Based Sentiment Analysis for Explainable Energy Market Returns
Wihan van der Heever, Keane Ong, Ranjan Satapathy, Erik Cambria
Main category: cs.AI
TL;DR: A refutation-validated framework for aspect-based sentiment analysis in financial markets that distinguishes genuine from spurious associations using rigorous statistical testing.
Details
Motivation: Address the limitations of correlational studies in financial sentiment analysis that cannot distinguish genuine associations from spurious ones, particularly in aspect-based sentiment analysis for financial markets.
Method: Combines net-ratio scoring with z-normalization, OLS with Newey-West HAC errors, and multiple refutation tests, including placebo, random common cause, subset stability, and bootstrap tests, on X data for the energy sector.
Result: Across six energy tickers, only a few associations survive all checks, with renewables showing aspect- and horizon-specific responses. Limited sample size (six stocks, one quarter) constrains generalizability.
Conclusion: The framework provides statistically robust, directionally interpretable signals but doesn’t establish causality; serves as methodological proof of concept rather than definitive causal analysis.
Abstract: This paper proposes a refutation-validated framework for aspect-based sentiment analysis in financial markets, addressing the limitations of correlational studies that cannot distinguish genuine associations from spurious ones. Using X data for the energy sector, we test whether aspect-level sentiment signals show robust, refutation-validated relationships with equity returns. Our pipeline combines net-ratio scoring with z-normalization, OLS with Newey-West HAC errors, and refutation tests including placebo, random common cause, subset stability, and bootstrap. Across six energy tickers, only a few associations survive all checks, while renewables show aspect- and horizon-specific responses. While not establishing causality, the framework provides statistically robust, directionally interpretable signals, with limited sample size (six stocks, one quarter) constraining generalizability and framing this work as a methodological proof of concept.
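As a rough sketch of the scoring stage only (the daily mention counts below are invented; the paper's pipeline additionally runs OLS with Newey-West HAC errors and the full refutation battery, which this snippet does not attempt):

```python
from statistics import mean, pstdev

def net_ratio(pos, neg):
    # Net-ratio score: (positive - negative) / total mentions; 0 if none.
    total = pos + neg
    return (pos - neg) / total if total else 0.0

def z_normalize(series):
    # Standardize daily net-ratio scores to zero mean, unit variance.
    mu, sigma = mean(series), pstdev(series)
    return [(x - mu) / sigma if sigma else 0.0 for x in series]

# Hypothetical daily (positive, negative) mention counts for one aspect/ticker.
daily_counts = [(30, 10), (5, 15), (20, 20), (40, 5)]
scores = [net_ratio(p, n) for p, n in daily_counts]
z_scores = z_normalize(scores)
```

The z-normalized series is what would then be regressed against returns; the refutation tests (placebo, random common cause, subset stability, bootstrap) decide whether any resulting association is kept.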
[645] Unified-MAS: Universally Generating Domain-Specific Nodes for Empowering Automatic Multi-Agent Systems
Hehai Lin, Yu Yan, Zixuan Wang, Bo Xu, Sudong Wang, Weiquan Huang, Ruochen Zhao, Minzhi Li, Chengwei Qin
Main category: cs.AI
TL;DR: Unified-MAS decouples node implementation from topological orchestration via offline node synthesis and reward-based optimization for better performance in knowledge-intensive domains.
Details
Motivation: Existing multi-agent systems struggle with knowledge-intensive domains due to either static general nodes lacking expertise or on-the-fly generation that couples domain logic with topology optimization, degrading system efficacy.
Method: Two-stage approach: (1) Search-based node generation retrieves external knowledge to synthesize specialized node blueprints, overcoming LLM knowledge limits; (2) Reward-based node optimization uses perplexity-guided reward to iteratively enhance bottleneck nodes’ internal logic.
Result: Extensive experiments across four specialized domains show Unified-MAS integrated into four Automatic-MAS baselines yields better performance-cost trade-off, achieving up to 14.2% gain while significantly reducing costs.
Conclusion: Unified-MAS effectively addresses architectural coupling in multi-agent systems for knowledge-intensive domains through decoupled node synthesis and optimization, demonstrating robustness across different LLMs and effectiveness on conventional tasks.
Abstract: Automatic Multi-Agent Systems (MAS) generation has emerged as a promising paradigm for solving complex reasoning tasks. However, existing frameworks are fundamentally bottlenecked when applied to knowledge-intensive domains (e.g., healthcare and law). They either rely on a static library of general nodes like Chain-of-Thought, which lack specialized expertise, or attempt to generate nodes on the fly. In the latter case, the orchestrator is not only bound by its internal knowledge limits but must also simultaneously generate domain-specific logic and optimize high-level topology, leading to a severe architectural coupling that degrades overall system efficacy. To bridge this gap, we propose Unified-MAS that decouples granular node implementation from topological orchestration via offline node synthesis. Unified-MAS operates in two stages: (1) Search-Based Node Generation retrieves external open-world knowledge to synthesize specialized node blueprints, overcoming the internal knowledge limits of LLMs; and (2) Reward-Based Node Optimization utilizes a perplexity-guided reward to iteratively enhance the internal logic of bottleneck nodes. Extensive experiments across four specialized domains demonstrate that integrating Unified-MAS into four Automatic-MAS baselines yields a better performance-cost trade-off, achieving up to a 14.2% gain while significantly reducing costs. Further analysis reveals its robustness across different designer LLMs and its effectiveness on conventional tasks such as mathematical reasoning.
[646] Stabilizing Iterative Self-Training with Verified Reasoning via Symbolic Recursive Self-Alignment
Xinyu Zhang
Main category: cs.AI
TL;DR: NSRSA uses symbolic verification to stabilize recursive self-training by filtering flawed reasoning steps, preventing error compounding and improving model reliability.
Details
Motivation: Recursive self-improvement faces recursive drift where errors compound across iterations, leading to mode collapse and performance degradation. Current methods like outcome-only filtering fail to catch "lucky guesses" with flawed reasoning.
Method: Neuro-Symbolic Recursive Self-Alignment (NSRSA) embeds a symbolic verification subsystem that gates training data quality at the reasoning step level. It verifies each arithmetic operation via sympy, checks logical flow consistency across reasoning steps, and enforces domain constraints.
Result: NSRSA rejects ~34% of correct-answer solutions that pass outcome verification, eliminating “lucky guesses” with flawed reasoning. DPO preference pairs from NSRSA verification improve reward accuracy from 46% to 63% in distinguishing sound vs flawed reasoning.
Conclusion: NSRSA provides an extensible framework that demonstrates how external symbolic verification can make recursive self-improvement measurable and reliable within domains where automated verification is available.
Abstract: Recursive self-improvement–where a model iteratively trains on its own outputs–promises sustained capability growth but faces a fundamental obstacle: recursive drift. As models train on self-generated data across multiple iterations, errors in intermediate reasoning compound, leading to mode collapse and performance degradation. We propose Neuro-Symbolic Recursive Self-Alignment (NSRSA), which stabilizes iterative self-training by embedding a symbolic verification subsystem that gates training data quality at the reasoning step level. Unlike outcome-only filtering (which admits “lucky guesses” with flawed reasoning), NSRSA verifies each arithmetic operation via sympy, checks logical flow consistency across reasoning steps, and enforces domain constraints. We evaluate NSRSA on GSM8K using Qwen3-4B-Thinking across 5 self-training iterations under five conditions: no verification, outcome verification, majority voting, full NSRSA symbolic verification, and NSRSA with DPO. Our filtering analysis shows that NSRSA rejects approximately 34% of correct-answer solutions that pass outcome verification, eliminating “lucky guesses” with flawed reasoning from the training set. We further demonstrate that constructing DPO preference pairs from NSRSA verification teaches the model to distinguish sound from flawed reasoning (reward accuracy 46% to 63%). NSRSA provides an extensible framework that demonstrates how external symbolic verification can make recursive self-improvement measurable and reliable within domains where automated verification is available.
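The step-level gating idea can be sketched as follows. This toy substitutes Python's `fractions` and a narrow regex for the paper's sympy-based verifier, and the example solutions are invented; it only illustrates the principle of rejecting right-answer/wrong-reasoning chains:

```python
import re
from fractions import Fraction

# Matches a single step of the form "a op b = c" with integer/rational terms.
STEP = re.compile(
    r"^\s*(-?\d+(?:/\d+)?)\s*([+\-*/])\s*(-?\d+(?:/\d+)?)\s*=\s*(-?\d+(?:/\d+)?)\s*$"
)

OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b,
}

def verify_step(step: str) -> bool:
    # Accept a step only if it parses and the arithmetic holds exactly.
    m = STEP.match(step)
    if not m:
        return False  # unparseable steps are rejected, not trusted
    a, op, b, c = m.groups()
    return OPS[op](Fraction(a), Fraction(b)) == Fraction(c)

def gate_solution(steps):
    # A solution enters the self-training set only if every step verifies,
    # not merely if the final answer matches.
    return all(verify_step(s) for s in steps)

sound = ["12 * 3 = 36", "36 + 4 = 40"]
lucky_guess = ["12 * 3 = 38", "38 + 2 = 40"]  # right final answer, flawed step
```

Outcome-only filtering would admit both chains (both end at 40); step-level gating admits only the sound one, which is the ~34% rejection effect the paper reports.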
[647] Counterfactual Credit Policy Optimization for Multi-Agent Collaboration
Zhongyi Li, Wan Tian, Yikun Ban, Jinju Chen, Huiming Zhang, Yang Liu, Fuzhen Zhuang
Main category: cs.AI
TL;DR: CCPO is a reinforcement learning framework for multi-agent LLMs that uses counterfactual trajectories to assign individual credit, preventing free-riding and improving collaborative reasoning performance.
Details
Motivation: Current RL approaches for collaborative multi-agent LLMs suffer from credit assignment problems where shared global rewards obscure individual contributions, leading to high variance, free-riding, and inefficient learning.
Method: CCPO estimates each agent’s marginal contribution through counterfactual trajectories that simulate outcomes with an agent’s contribution removed, creating dynamic counterfactual baselines for role-sensitive advantages. It also uses global-history-aware normalization to stabilize learning across heterogeneous tasks.
Result: CCPO outperforms strong multi-agent RL baselines on mathematical and logical reasoning benchmarks, effectively mitigating free-riding and enabling finer-grained credit assignment for collaborative LLM training.
Conclusion: CCPO provides an effective framework for credit assignment in collaborative multi-agent LLMs, enabling more efficient reinforcement learning for complex reasoning tasks through counterfactual contribution estimation.
Abstract: Collaborative multi-agent large language models (LLMs) can solve complex reasoning tasks by decomposing roles and aggregating diverse hypotheses. Yet, reinforcement learning (RL) for such systems is often undermined by credit assignment: a shared global reward obscures individual contributions, inflating update variance and encouraging free-riding. We introduce Counterfactual Credit Policy Optimization (CCPO), a framework that assigns agent-specific learning signals by estimating each agent’s marginal contribution through counterfactual trajectories. CCPO builds dynamic counterfactual baselines that simulate outcomes with an agent’s contribution removed, yielding role-sensitive advantages for policy optimization. To further improve stability under heterogeneous tasks and data distributions, we propose a global-history-aware normalization scheme that calibrates advantages using global rollout statistics. We evaluate CCPO on two collaboration topologies: a sequential Think–Reason dyad and multi-agent voting. Across mathematical and logical reasoning benchmarks, CCPO mitigates free-riding and outperforms strong multi-agent RL baselines, yielding finer-grained and more effective credit assignment for collaborative LLM training. Our code is available at https://github.com/bhai114/ccpo.
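A stripped-down illustration of counterfactual credit: each agent's signal is the drop in team reward when its contribution is removed. The agents, skills, and reward function below are invented, and the paper estimates these baselines from simulated counterfactual trajectories rather than exact ablation, so this is a sketch of the principle only:

```python
def counterfactual_credit(agents, reward_fn):
    # Credit for agent i = R(full team) - R(team without i),
    # i.e. the agent's marginal contribution to the shared reward.
    full = reward_fn(agents)
    credits = {}
    for i, agent in enumerate(agents):
        ablated = agents[:i] + agents[i + 1:]
        credits[agent] = full - reward_fn(ablated)
    return credits

# Hypothetical team reward: fraction of required sub-skills the team covers.
REQUIRED = {"plan", "solve", "check"}
SKILLS = {"thinker": {"plan"}, "reasoner": {"solve", "check"}, "idler": set()}

def team_reward(agents):
    covered = set().union(*(SKILLS[a] for a in agents)) if agents else set()
    return len(covered & REQUIRED) / len(REQUIRED)

credits = counterfactual_credit(["thinker", "reasoner", "idler"], team_reward)
```

The free-riding "idler" receives exactly zero credit despite the team's full shared reward, which is the failure mode a shared global reward cannot expose.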
[648] Adaptive Robust Estimator for Multi-Agent Reinforcement Learning
Zhongyi Li, Wan Tian, Jingyu Chen, Kangyao Huang, Huiming Zhang, Hui Yang, Tao Ren, Jinyang Jiang, Yijie Peng, Yikun Ban, Fuzhen Zhuang
Main category: cs.AI
TL;DR: A robust multi-agent reinforcement learning framework for collaborative reasoning that addresses interaction ambiguity and noisy rewards through structured answer-critique-rewrite pipeline and adaptive robust estimation.
Details
Motivation: Multi-agent collaboration enhances LLM reasoning but suffers from interaction-level ambiguity that blurs generation, critique, and revision roles, making credit assignment difficult. Additionally, policy optimization is vulnerable to heavy-tailed and noisy rewards that bias advantage estimation and cause unstable training.
Method: Proposes two components: 1) Dual-Agent Answer-Critique-Rewrite (DACR) that decomposes reasoning into a structured three-stage pipeline with explicit attribution of each agent’s marginal contribution, and 2) Adaptive Robust Estimator (ARE) for robust estimation of batch experience means during multi-agent policy optimization.
Result: The method consistently outperforms baselines across mathematical reasoning and embodied intelligence benchmarks, even under noisy rewards. Shows stronger robustness to reward noise and more stable training dynamics, effectively preventing optimization failures caused by noisy reward signals.
Conclusion: The proposed framework successfully addresses both interaction ambiguity and reward noise challenges in multi-agent collaborative reasoning, enabling more stable and effective training while maintaining performance across diverse reasoning tasks.
Abstract: Multi-agent collaboration has emerged as a powerful paradigm for enhancing the reasoning capabilities of large language models, yet it suffers from interaction-level ambiguity that blurs generation, critique, and revision, making credit assignment across agents difficult. Moreover, policy optimization in this setting is vulnerable to heavy-tailed and noisy rewards, which can bias advantage estimation and trigger unstable or even divergent training. To address both issues, we propose a robust multi-agent reinforcement learning framework for collaborative reasoning, consisting of two components: Dual-Agent Answer-Critique-Rewrite (DACR) and an Adaptive Robust Estimator (ARE). DACR decomposes reasoning into a structured three-stage pipeline: answer, critique, and rewrite, while enabling explicit attribution of each agent’s marginal contribution to its partner’s performance. ARE provides robust estimation of batch experience means during multi-agent policy optimization. Across mathematical reasoning and embodied intelligence benchmarks, even under noisy rewards, our method consistently outperforms the baseline in both homogeneous and heterogeneous settings. These results indicate stronger robustness to reward noise and more stable training dynamics, effectively preventing optimization failures caused by noisy reward signals.
[649] Mind over Space: Can Multimodal Large Language Models Mentally Navigate?
Qihui Zhu, Shouwei Ruan, Xiao Yang, Hao Jiang, Yao Huang, Shiji Zhao, Hanwei Fan, Hang Su, Xingxing Wei
Main category: cs.AI
TL;DR: Video2Mental benchmark evaluates MLLMs’ mental navigation capabilities for constructing cognitive maps from videos and planning landmark-based paths, with NavMind model proposed to address current limitations.
Details
Motivation: Current MLLMs in embodied agents are limited to reactive planning from immediate observations and fail at spatial reasoning across spatiotemporal scales, unlike biological intelligence which excels at mental navigation through cognitive map construction and mental simulation.
Method: Introduces Video2Mental benchmark requiring hierarchical cognitive map construction from long egocentric videos and landmark-based path planning with simulator verification. Proposes NavMind model that internalizes mental navigation using explicit, fine-grained cognitive maps as learnable intermediate representations through difficulty-stratified progressive supervised fine-tuning.
Result: Benchmarking reveals mental navigation doesn’t emerge from standard pre-training; frontier MLLMs struggle with structured spatial representation and planning accuracy decays over extended horizons. NavMind significantly outperforms commercial and spatial MLLMs in mental navigation capabilities.
Conclusion: Mental navigation is a critical capability missing in current MLLMs that requires explicit modeling of cognitive maps. NavMind demonstrates that structured intermediate representations can effectively bridge perception and planning for embodied spatial reasoning.
Abstract: Despite the widespread adoption of MLLMs in embodied agents, their capabilities remain largely confined to reactive planning from immediate observations, consistently failing in spatial reasoning across extensive spatiotemporal scales. Cognitive science reveals that Biological Intelligence (BI) thrives on “mental navigation”: the strategic construction of spatial representations from experience and the subsequent mental simulation of paths prior to action. To bridge the gap between AI and BI, we introduce Video2Mental, a pioneering benchmark for evaluating the mental navigation capabilities of MLLMs. The task requires constructing hierarchical cognitive maps from long egocentric videos and generating landmark-based path plans step by step, with planning accuracy verified through simulator-based physical interaction. Our benchmarking results reveal that mental navigation capability does not naturally emerge from standard pre-training. Frontier MLLMs struggle profoundly with zero-shot structured spatial representation, and their planning accuracy decays precipitously over extended horizons. To overcome this, we propose NavMind, a reasoning model that internalizes mental navigation using explicit, fine-grained cognitive maps as learnable intermediate representations. Through a difficulty-stratified progressive supervised fine-tuning paradigm, NavMind effectively bridges the gap between raw perception and structured planning. Experiments demonstrate that NavMind achieves superior mental navigation capabilities, significantly outperforming frontier commercial and spatial MLLMs.
[650] A Multidisciplinary AI Board for Multimodal Dementia Characterization and Risk Assessment
Sheng Liu, Long Chen, Zeyun Zhao, Qinglin Gou, Qingyue Wei, Arjun Masurkar, Kevin M. Spiegler, Philip Kuball, Stefania C. Bray, Megan Bernath, Deanna R. Willis, Jiang Bian, Lei Xing, Eric Topol, Kyunghyun Cho, Yu Huang, Ruogu Fang, Narges Razavian, James Zou
Main category: cs.AI
TL;DR: Cerebra is an interactive multi-agent AI system for clinical decision support that coordinates specialized agents for EHR, clinical notes, and medical imaging analysis, with a clinician-facing dashboard combining visual analytics and conversational interface.
Details
Motivation: Clinical practice needs systems that can reason over heterogeneous, evolving, and incomplete patient data, but existing multimodal foundation models are static, opaque, and poorly aligned with real-world clinical workflows.
Method: Multi-agent AI team coordinating specialized agents for different data modalities (EHR, clinical notes, medical imaging), with outputs synthesized into a clinician-facing dashboard combining visual analytics and conversational interface. Supports privacy-preserving deployment via structured representations and robustness to incomplete modalities.
Result: Outperformed state-of-the-art single-modality models and large multimodal language model baselines on massive multi-institutional dataset (3M patients from 4 healthcare systems). Achieved AUROCs up to 0.80 for dementia risk prediction (vs 0.74 single-modality, 0.68 LLM), 0.86 for dementia diagnosis, and C-index 0.81 for survival prediction. In reader study, improved physician accuracy by 17.5 percentage points in dementia risk estimation.
Conclusion: Cerebra demonstrates potential for interpretable, robust decision support in clinical care by combining multimodal analysis with interactive clinician interfaces.
Abstract: Modern clinical practice increasingly depends on reasoning over heterogeneous, evolving, and incomplete patient data. Although recent advances in multimodal foundation models have improved performance on various clinical tasks, most existing models remain static, opaque, and poorly aligned with real-world clinical workflows. We present Cerebra, an interactive multi-agent AI team that coordinates specialized agents for EHR, clinical notes, and medical imaging analysis. These outputs are synthesized into a clinician-facing dashboard that combines visual analytics with a conversational interface, enabling clinicians to interrogate predictions and contextualize risk at the point of care. Cerebra supports privacy-preserving deployment by operating on structured representations and remains robust when modalities are incomplete. We evaluated Cerebra using a massive multi-institutional dataset spanning 3 million patients from four independent healthcare systems. Cerebra consistently outperformed both state-of-the-art single-modality models and large multimodal language model baselines. In dementia risk prediction, it achieved AUROCs up to 0.80, compared with 0.74 for the strongest single-modality model and 0.68 for language model baselines. For dementia diagnosis, it achieved an AUROC of 0.86, and for survival prediction, a C-index of 0.81. In a reader study with experienced physicians, Cerebra significantly improved expert performance, increasing accuracy by 17.5 percentage points in prospective dementia risk estimation. These results demonstrate Cerebra’s potential for interpretable, robust decision support in clinical care.
[651] INTRYGUE: Induction-Aware Entropy Gating for Reliable RAG Uncertainty Estimation
Alexandra Bazarova, Andrei Volodichev, Daria Kotova, Alexey Zaytsev
Main category: cs.AI
TL;DR: INTRYGUE addresses uncertainty quantification failures in RAG systems by gating predictive entropy based on induction head activation patterns to detect hallucinations more accurately.
Details
Motivation: Standard entropy-based uncertainty quantification methods fail in RAG settings due to a mechanistic paradox where induction heads trigger "entropy neurons," causing false uncertainty signals on accurate outputs despite retrieval augmentation.
Method: Proposes INTRYGUE (Induction-Aware Entropy Gating for Uncertainty Estimation), a mechanistically grounded method that gates predictive entropy based on activation patterns of induction heads to distinguish between genuine uncertainty and false signals.
Result: Evaluated across four RAG benchmarks and six open-source LLMs (4B to 13B parameters), INTRYGUE consistently matches or outperforms a wide range of UQ baselines.
Conclusion: Hallucination detection in RAG benefits from combining predictive uncertainty with interpretable, internal signals of context utilization, specifically induction head activation patterns.
Abstract: While retrieval-augmented generation (RAG) significantly improves the factual reliability of LLMs, it does not eliminate hallucinations, so robust uncertainty quantification (UQ) remains essential. In this paper, we reveal that standard entropy-based UQ methods often fail in RAG settings due to a mechanistic paradox. An internal “tug-of-war” inherent to context utilization appears: while induction heads promote grounded responses by copying the correct answer, they collaterally trigger the previously established “entropy neurons”. This interaction inflates predictive entropy, causing the model to signal false uncertainty on accurate outputs. To address this, we propose INTRYGUE (Induction-Aware Entropy Gating for Uncertainty Estimation), a mechanistically grounded method that gates predictive entropy based on the activation patterns of induction heads. Evaluated across four RAG benchmarks and six open-source LLMs (4B to 13B parameters), INTRYGUE consistently matches or outperforms a wide range of UQ baselines. Our findings demonstrate that hallucination detection in RAG benefits from combining predictive uncertainty with interpretable, internal signals of context utilization.
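One way to picture the gating, as a toy sketch only: the actual method reads induction-head activation patterns from the model's internals, and the threshold/discount rule, distribution, and activation values below are invented simplifications, not the paper's gate:

```python
import math

def entropy(probs):
    # Shannon entropy (in nats) of a next-token distribution.
    return -sum(p * math.log(p) for p in probs if p > 0)

def gated_entropy(probs, induction_activation, threshold=0.5):
    # Toy gate: when induction heads fire strongly, the token is likely a
    # grounded copy from the retrieved context, so the collaterally inflated
    # entropy is discounted; otherwise the raw entropy passes through.
    h = entropy(probs)
    if induction_activation > threshold:
        return h * (1.0 - induction_activation)
    return h

probs = [0.4, 0.3, 0.2, 0.1]  # hypothetical next-token distribution
raw = gated_entropy(probs, induction_activation=0.1)    # weak induction: keep
gated = gated_entropy(probs, induction_activation=0.9)  # strong induction: discount
```

The point of the gate is that high entropy co-occurring with strong induction-head activity should not be read as hallucination risk, since it is a side effect of grounded copying.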
[652] EnterpriseLab: A Full-Stack Platform for developing and deploying agents in Enterprises
Ankush Agarwal, Harsh Vishwakarma, Suraj Nagaje, Chaitanya Devaguptapu
Main category: cs.AI
TL;DR: EnterpriseLab is a full-stack platform for developing enterprise AI agents that unifies tool integration, automated data generation, and training pipelines to create privacy-preserving small language models that match frontier model performance at lower cost.
Details
Motivation: Enterprises need AI agents that balance capability with data sovereignty and cost constraints. Small language models offer privacy-preserving alternatives but face fragmented development pipelines separating tool integration, data generation, and training.
Method: EnterpriseLab provides: (1) modular environment exposing enterprise apps via Model Context Protocol for tool integration, (2) automated trajectory synthesis generating training data from environment schemas, and (3) integrated training pipelines with continuous evaluation. Validated through EnterpriseArena with 15 apps and 140+ tools across IT, HR, sales, and engineering domains.
Result: 8B-parameter models trained with EnterpriseLab match GPT-4o’s performance on complex enterprise workflows while reducing inference costs by 8-10x. Models show robustness across benchmarks: EnterpriseBench (+10%) and CRMArena (+10%).
Conclusion: EnterpriseLab provides enterprises a practical path to deploying capable, privacy-preserving agents without compromising operational capability, offering a closed-loop framework that unifies development stages.
Abstract: Deploying AI agents in enterprise environments requires balancing capability with data sovereignty and cost constraints. While small language models offer privacy-preserving alternatives to frontier models, their specialization is hindered by fragmented development pipelines that separate tool integration, data generation, and training. We introduce EnterpriseLab, a full-stack platform that unifies these stages into a closed-loop framework. EnterpriseLab provides (1) a modular environment exposing enterprise applications via Model Context Protocol, enabling seamless integration of proprietary and open-source tools; (2) automated trajectory synthesis that programmatically generates training data from environment schemas; and (3) integrated training pipelines with continuous evaluation. We validate the platform through EnterpriseArena, an instantiation with 15 applications and 140+ tools across IT, HR, sales, and engineering domains. Our results demonstrate that 8B-parameter models trained within EnterpriseLab match GPT-4o’s performance on complex enterprise workflows while reducing inference costs by 8-10x, and remain robust across diverse enterprise benchmarks, including EnterpriseBench (+10%) and CRMArena (+10%). EnterpriseLab provides enterprises a practical path to deploying capable, privacy-preserving agents without compromising operational capability.
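The trajectory-synthesis stage, generating training data programmatically from environment schemas, can be illustrated schematically. The schema shape and filler logic below are hypothetical, not EnterpriseLab's actual format:

```python
import random

def synthesize_trajectory(tool_schema, rng=None):
    """Generate one synthetic tool-call trajectory from a tool's schema by
    filling declared parameters with typed placeholder values (illustrative)."""
    rng = rng or random.Random(0)
    fillers = {"string": lambda: f"val_{rng.randint(0, 99)}",
               "integer": lambda: rng.randint(1, 10)}
    args = {name: fillers[ptype]() for name, ptype in tool_schema["params"].items()}
    return {"tool": tool_schema["name"], "arguments": args}
```

Iterating this over every tool exposed by the environment yields a corpus of schema-valid calls that a training pipeline can extend into full multi-step trajectories.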
[653] Silicon Bureaucracy and AI Test-Oriented Education: Contamination Sensitivity and Score Confidence in LLM Benchmarks
Yiliang Song, Hongjun An, Jiangan Chen, Xuanchen Yan, Huan Song, Jiawei Shao, Xuelong Li
Main category: cs.AI
TL;DR: Paper proposes audit framework to test if LLM benchmark scores reflect genuine capability vs. contamination, finding noisy conditions often outperform clean baselines, suggesting benchmark scores need confidence assessments.
Details
Motivation: Current LLM evaluation relies heavily on benchmark scores, assuming they reflect genuine generalization. However, contamination and semantic leakage in training data may inflate scores, conflating exam-oriented competence with principled capability.
Method: Proposes audit framework using router-worker setup: compares clean-control condition with noisy conditions where benchmark problems are systematically deleted, rewritten, and perturbed before being passed downstream. Tests the premise that a genuinely clean benchmark should not show consistent performance gains under noisy conditions.
Result: Across multiple models, found widespread but heterogeneous above-baseline gains under noisy conditions, indicating benchmark-related cues can reassemble and reactivate contamination-related memory. Similar benchmark scores may carry substantially different confidence levels.
Conclusion: Benchmark-based evaluation should be supplemented with explicit audits of contamination sensitivity and score confidence rather than rejecting benchmarks altogether. Need to distinguish genuine capability from contamination-inflated performance.
Abstract: Public benchmarks increasingly govern how large language models (LLMs) are ranked, selected, and deployed. We frame this benchmark-centered regime as Silicon Bureaucracy and AI Test-Oriented Education, and argue that it rests on a fragile assumption: that benchmark scores directly reflect genuine generalization. In practice, however, such scores may conflate exam-oriented competence with principled capability, especially when contamination and semantic leakage are difficult to exclude from modern training pipelines. We therefore propose an audit framework for analyzing contamination sensitivity and score confidence in LLM benchmarks. Using a router-worker setup, we compare a clean-control condition with noisy conditions in which benchmark problems are systematically deleted, rewritten, and perturbed before being passed downstream. For a genuinely clean benchmark, noisy conditions should not consistently outperform the clean-control baseline. Yet across multiple models, we find widespread but heterogeneous above-baseline gains under noisy conditions, indicating that benchmark-related cues may be reassembled and can reactivate contamination-related memory. These results suggest that similar benchmark scores may carry substantially different levels of confidence. Rather than rejecting benchmarks altogether, we argue that benchmark-based evaluation should be supplemented with explicit audits of contamination sensitivity and score confidence.
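The audit's core comparison, clean-control versus noisy conditions, reduces to a simple per-condition computation. The function below is a sketch of that logic only, not the paper's router-worker implementation:

```python
def contamination_sensitivity(clean_scores, noisy_scores):
    """Gain of each noisy condition over the clean-control baseline.
    Positive gains under deletion/rewriting/perturbation are the red flag:
    a genuinely clean benchmark should not improve when its problems are
    degraded before reaching the worker model."""
    baseline = sum(clean_scores) / len(clean_scores)
    gains = {cond: sum(scores) / len(scores) - baseline
             for cond, scores in noisy_scores.items()}
    flagged = [cond for cond, gain in gains.items() if gain > 0]
    return gains, flagged
```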
[654] Mirage: The Illusion of Visual Understanding
Mohammad Asadi, Jack W. O’Sullivan, Fang Cao, Tahoura Nedaee, Kamyar Fardi, Fei-Fei Li, Ehsan Adeli, Euan Ashley
Main category: cs.AI
TL;DR: Multimodal AI systems exhibit “mirage reasoning” - generating detailed image descriptions and reasoning for non-existent images, achieving high benchmark scores without visual input, revealing fundamental vulnerabilities in visual-language model evaluation.
Details
Motivation: To investigate the mechanisms underlying visual-language reasoning in multimodal AI systems, challenging prevailing assumptions about how these systems process and integrate visual information, particularly in medical contexts where miscalibrated AI carries significant consequences.
Method: Conducted experiments showing: 1) models generate detailed image descriptions for non-existent images (mirage reasoning), 2) models achieve high scores on multimodal benchmarks without image input, 3) performance declines when models are explicitly instructed to guess without image access versus implicitly prompted to assume images present. Introduced B-Clean for fair, vision-grounded evaluation.
Result: Frontier models exhibited mirage reasoning, generated pathology-biased clinical findings for non-existent images, achieved top rank on chest X-ray benchmark without image access, and showed marked performance decline when explicitly instructed to guess without images versus implicit prompting.
Conclusion: The findings expose fundamental vulnerabilities in how visual-language models reason and are evaluated, pointing to urgent need for private benchmarks that eliminate textual cues enabling non-visual inference, particularly in medical contexts. B-Clean is proposed as a principled solution for fair, vision-grounded evaluation.
Abstract: Multimodal AI systems have achieved remarkable performance across a broad range of real-world tasks, yet the mechanisms underlying visual-language reasoning remain surprisingly poorly understood. We report three findings that challenge prevailing assumptions about how these systems process and integrate visual information. First, frontier models readily generate detailed image descriptions and elaborate reasoning traces, including pathology-biased clinical findings, for images never provided; we term this phenomenon mirage reasoning. Second, without any image input, models also attain strikingly high scores across general and medical multimodal benchmarks, bringing into question their utility and design. In the most extreme case, our model achieved the top rank on a standard chest X-ray question-answering benchmark without access to any images. Third, when models were explicitly instructed to guess answers without image access, rather than being implicitly prompted to assume images were present, performance declined markedly. Explicit guessing appears to engage a more conservative response regime, in contrast to the mirage regime in which models behave as though images have been provided. These findings expose fundamental vulnerabilities in how visual-language models reason and are evaluated, pointing to an urgent need for private benchmarks that eliminate textual cues enabling non-visual inference, particularly in medical contexts where miscalibrated AI carries the greatest consequence. We introduce B-Clean as a principled solution for fair, vision-grounded evaluation of multimodal AI systems.
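One way to quantify this failure mode when only aggregate accuracies are available is the fraction of above-chance performance that survives withholding the images. This metric is a hypothetical illustration, not one the paper proposes:

```python
def blindness_gap(with_image_acc, no_image_acc, chance_acc):
    """Fraction of above-chance performance that survives removing the image.
    Values near 1.0 suggest the benchmark is largely answerable from textual
    cues alone; values near 0.0 suggest answers are genuinely vision-grounded."""
    denom = with_image_acc - chance_acc
    if denom <= 0:
        return float("nan")  # model is at or below chance even with images
    return (no_image_acc - chance_acc) / denom
```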
[655] AI Token Futures Market: Commoditization of Compute and Derivatives Contract Design
Yicai Xing
Main category: cs.AI
TL;DR: This paper analyzes AI inference tokens as a new commodity, proposes standardized token futures contracts, and demonstrates their hedging efficiency for reducing compute cost volatility.
Details
Motivation: As AI inference tokens become widely deployed, they are evolving into a new type of commodity. The paper aims to establish a financial framework for token futures markets to help application-layer enterprises hedge against compute cost volatility.
Method: The paper systematically analyzes token commodity attributes, proposes a complete design for standardized token futures contracts (including Standard Inference Token definition, contract specifications, settlement mechanisms), constructs a mean-reverting jump-diffusion stochastic process model, and conducts Monte Carlo simulations to evaluate hedging efficiency.
Result: Simulation results show that under application-layer demand explosion scenarios, token futures can reduce enterprise compute cost volatility by 62%-78%. The paper also explores GPU compute futures feasibility and discusses regulatory frameworks.
Conclusion: The paper provides a theoretical foundation and practical roadmap for the financialization of compute resources, establishing token futures as an effective tool for managing compute cost volatility in the AI inference market.
Abstract: As large language models (LLMs) and vision-language-action models (VLAs) become widely deployed, the tokens consumed by AI inference are evolving into a new type of commodity. This paper systematically analyzes the commodity attributes of tokens, arguing for their transition from intelligent service outputs to compute infrastructure raw materials, and draws comparisons with established commodities such as electricity, carbon emission allowances, and bandwidth. Building on the historical experience of electricity futures markets and the theory of commodity financialization, we propose a complete design for standardized token futures contracts, including the definition of a Standard Inference Token (SIT), contract specifications, settlement mechanisms, margin systems, and market-maker regimes. By constructing a mean-reverting jump-diffusion stochastic process model and conducting Monte Carlo simulations, we evaluate the hedging efficiency of the proposed futures contracts for application-layer enterprises. Simulation results show that, under an application-layer demand explosion scenario, token futures can reduce enterprise compute cost volatility by 62%-78%. We also explore the feasibility of GPU compute futures and discuss the regulatory framework for token futures markets, providing a theoretical foundation and practical roadmap for the financialization of compute resources.
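The hedging evaluation can be miniaturized as a Monte Carlo sketch: simulate spot token prices under a mean-reverting jump-diffusion, then compare cost volatility with and without a futures position. All parameters below are illustrative placeholders, not the paper's calibration:

```python
import math
import random
import statistics

def simulate_price(s0=1.0, mu=1.0, kappa=2.0, sigma=0.2,
                   jump_prob=0.02, jump_size=0.5, steps=250, dt=1 / 250, rng=None):
    """One terminal spot price from a mean-reverting jump-diffusion:
    Ornstein-Uhlenbeck drift toward mu plus occasional upward demand jumps."""
    rng = rng or random.Random(0)
    s = s0
    for _ in range(steps):
        ds = kappa * (mu - s) * dt + sigma * math.sqrt(dt) * rng.gauss(0, 1)
        if rng.random() < jump_prob:
            ds += jump_size
        s += ds
    return s

def hedged_cost(spot, futures_price, hedge_ratio):
    """Net cost per token when a fraction of demand is locked in via futures."""
    return hedge_ratio * futures_price + (1 - hedge_ratio) * spot

rng = random.Random(42)
spots = [simulate_price(rng=rng) for _ in range(2000)]
unhedged_sd = statistics.pstdev(spots)
hedged_sd = statistics.pstdev([hedged_cost(s, 1.0, 0.7) for s in spots])
```

With a 0.7 hedge ratio against a fixed futures price, cost standard deviation falls by exactly 70% in this linear setup, in the same spirit as the 62%-78% volatility reduction the paper reports under its demand-explosion scenario.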
[656] Reasoning Provenance for Autonomous AI Agents: Structured Behavioral Analytics Beyond State Checkpoints and Execution Traces
Neelmani Vispute
Main category: cs.AI
TL;DR: AER introduces structured reasoning provenance as a first-class primitive for AI agents, enabling population-level behavioral analytics through queryable records of intent, observation, inference, and evidence chains.
Details
Motivation: As AI agents become autonomous infrastructure, there's a need to analyze reasoning behavior across populations. Current systems provide state persistence and execution traces but lack structured reasoning provenance as a native primitive.
Method: Introduces Agent Execution Record (AER) - a structured reasoning provenance primitive capturing intent, observation, inference as queryable fields, with versioned plans, evidence chains, structured verdicts, and delegation authority chains.
Result: Enables population-level behavioral analytics including reasoning pattern mining, confidence calibration, cross-agent comparison, and counterfactual regression testing via mock replay. Includes domain-agnostic model with extensible profiles and reference implementation.
Conclusion: Structured reasoning provenance is essential for analyzing autonomous AI agents at scale, and AER provides the necessary primitive to capture and query reasoning behavior across agent populations.
Abstract: As AI agents transition from human-supervised copilots to autonomous platform infrastructure, the ability to analyze their reasoning behavior across populations of investigations becomes a pressing infrastructure requirement. Existing operational tooling addresses adjacent needs effectively: state checkpoint systems enable fault tolerance; observability platforms provide execution traces for debugging; telemetry standards ensure interoperability. What current systems do not natively provide as a first-class, schema-level primitive is structured reasoning provenance – normalized, queryable records of why the agent chose each action, what it concluded from each observation, how each conclusion shaped its strategy, and which evidence supports its final verdict. This paper introduces the Agent Execution Record (AER), a structured reasoning provenance primitive that captures intent, observation, and inference as first-class queryable fields on every step, alongside versioned plans with revision rationale, evidence chains, structured verdicts with confidence scores, and delegation authority chains. We formalize the distinction between computational state persistence and reasoning provenance, argue that the latter cannot in general be faithfully reconstructed from the former, and show how AERs enable population-level behavioral analytics: reasoning pattern mining, confidence calibration, cross-agent comparison, and counterfactual regression testing via mock replay. We present a domain-agnostic model with extensible domain profiles, a reference implementation and SDK, and outline an evaluation methodology informed by preliminary deployment on a production platformized root cause analysis agent.
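An AER-style record is, at minimum, a typed schema whose reasoning fields are queryable. The sketch below shows the idea with an illustrative subset of the fields the paper describes, not the AER schema itself:

```python
from dataclasses import dataclass, field

@dataclass
class AgentExecutionRecord:
    """One step of structured reasoning provenance (illustrative subset)."""
    step: int
    intent: str         # why the agent chose this action
    observation: str    # what it observed
    inference: str      # what it concluded from the observation
    confidence: float   # stated confidence in the inference
    evidence: list = field(default_factory=list)

def low_confidence_steps(records, threshold=0.5):
    """A population-level query: steps whose stated confidence falls below
    a bar, e.g. for confidence-calibration analysis across many runs."""
    return [r.step for r in records if r.confidence < threshold]
```

Because intent, observation, and inference are first-class fields rather than free text in a trace, queries like this run directly over populations of records.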
[657] Deterministic Hallucination Detection in Medical VQA via Confidence-Evidence Bayesian Gain
Mohammad Asadi, Tahoura Nedaee, Jack W. O’Sullivan, Euan Ashley, Ehsan Adeli
Main category: cs.AI
TL;DR: CEBaG: A deterministic hallucination detection method for medical MLLMs that uses token-level confidence variance and visual evidence magnitude without stochastic sampling or external models.
Details
Motivation: Current hallucination detection methods for medical MLLMs (like SE and VASE) require 10-20 stochastic generations per sample plus external NLI models, making them computationally expensive and impractical for clinical deployment.
Method: CEBaG combines two signals: 1) token-level predictive variance (inconsistent confidence across response tokens), and 2) evidence magnitude (how much the image shifts per-token predictions relative to text-only inference). It requires no stochastic sampling, external models, or task-specific hyperparameters.
Result: Evaluated across 4 medical MLLMs and 3 VQA benchmarks (16 experimental settings), CEBaG achieves highest AUC in 13/16 settings, improving over VASE by 8 AUC points on average, while being fully deterministic and self-contained.
Conclusion: CEBaG provides an effective, efficient hallucination detection method for medical MLLMs that addresses computational limitations of existing approaches and could enable safer clinical deployment.
Abstract: Multimodal large language models (MLLMs) have shown strong potential for medical Visual Question Answering (VQA), yet they remain prone to hallucinations, defined as generating responses that contradict the input image, posing serious risks in clinical settings. Current hallucination detection methods, such as Semantic Entropy (SE) and Vision-Amplified Semantic Entropy (VASE), require 10 to 20 stochastic generations per sample together with an external natural language inference model for semantic clustering, making them computationally expensive and difficult to deploy in practice. We observe that hallucinated responses exhibit a distinctive signature directly in the model’s own log-probabilities: inconsistent token-level confidence and weak sensitivity to visual evidence. Based on this observation, we propose Confidence-Evidence Bayesian Gain (CEBaG), a deterministic hallucination detection method that requires no stochastic sampling, no external models, and no task-specific hyperparameters. CEBaG combines two complementary signals: token-level predictive variance, which captures inconsistent confidence across response tokens, and evidence magnitude, which measures how much the image shifts per-token predictions relative to text-only inference. Evaluated across four medical MLLMs and three VQA benchmarks (16 experimental settings), CEBaG achieves the highest AUC in 13 of 16 settings and improves over VASE by 8 AUC points on average, while being fully deterministic and self-contained. The code will be made available upon acceptance.
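The two signals combine into a single deterministic score. The sketch below assumes per-token log-probabilities from one with-image pass and one text-only pass, and combines them with a simple difference; that combination rule is an assumption for illustration, not CEBaG's exact formula:

```python
import statistics

def hallucination_score(logprobs_with_image, logprobs_text_only):
    """Deterministic hallucination score sketch: high token-level confidence
    variance and weak sensitivity to the image both push the score up."""
    # Signal 1: inconsistent confidence across response tokens.
    variance = statistics.pvariance(logprobs_with_image)
    # Signal 2: how much the image shifts per-token predictions.
    evidence = sum(abs(a - b) for a, b in
                   zip(logprobs_with_image, logprobs_text_only)) / len(logprobs_with_image)
    return variance - evidence  # higher => more likely hallucinated
```

A grounded answer shows stable token confidence and a large shift relative to text-only inference; a hallucinated one shows erratic confidence and near-zero image sensitivity.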
[658] MIND: Multi-agent inference for negotiation dialogue in travel planning
Hunmin Do, Taejun Yoon, Kiyong Jung
Main category: cs.AI
TL;DR: MIND is a multi-agent framework for negotiation dialogue that uses Theory of Mind to infer opponent willingness from linguistic cues, achieving improved consensus-building in travel planning scenarios.
Details
Motivation: While multi-agent debate research has advanced, its effectiveness in coordinating complex stakeholder interests like travel planning with heterogeneous preferences remains unexplored. The paper aims to bridge this gap by developing a framework for realistic consensus-building among travelers.
Method: Proposes MIND (Multi-agent Inference for Negotiation Dialogue) framework grounded in Theory of Mind. Introduces a Strategic Appraisal phase that infers opponent willingness (w) from linguistic nuances. Uses multi-agent debate approach for travel planning scenarios with diverse preferences.
Result: Achieves 90.2% accuracy in inferring opponent willingness. Outperforms traditional MAD frameworks with 20.5% improvement in High-w Hit and 30.7% increase in Debate Hit-Rate. Qualitative evaluations show superiority in Rationality (68.8%) and Fluency (72.4%), with overall win rate of 68.3%.
Conclusion: MIND effectively models human negotiation dynamics to derive persuasive consensus in multi-agent travel planning scenarios, validating the importance of Theory of Mind and strategic appraisal in negotiation dialogues.
Abstract: While Multi-Agent Debate (MAD) research has advanced, its efficacy in coordinating complex stakeholder interests such as travel planning remains largely unexplored. To bridge this gap, we propose MIND (Multi-agent Inference for Negotiation Dialogue), a framework designed to simulate realistic consensus-building among travelers with heterogeneous preferences. Grounded in the Theory of Mind (ToM), MIND introduces a Strategic Appraisal phase that infers opponent willingness (w) from linguistic nuances with 90.2% accuracy. Experimental results demonstrate that MIND outperforms traditional MAD frameworks, achieving a 20.5% improvement in High-w Hit and a 30.7% increase in Debate Hit-Rate, effectively prioritizing high-stakes constraints. Furthermore, qualitative evaluations via LLM-as-a-Judge confirm that MIND surpasses baselines in Rationality (68.8%) and Fluency (72.4%), securing an overall win rate of 68.3%. These findings validate that MIND effectively models human negotiation dynamics to derive persuasive consensus.
[659] A Blueprint for Self-Evolving Coding Agents in Vehicle Aerodynamic Drag Prediction
Jinhui Ren, Huaiming Li, Yabin Liu, Tao Li, Zhaokun Liu, Yujia Liang, Zengle Ge, Chufan Wu, Xiaomin Yuan, Danyu Liu, Annan Li, Jianmin Wu
Main category: cs.AI
TL;DR: Self-evolving coding agents discover executable surrogate pipelines for vehicle drag prediction using constrained optimization over programs, combining evaluator feedback with evolutionary algorithms and multi-objective selection.
Details
Motivation: High-fidelity vehicle drag evaluation is constrained by workflow friction (geometry cleanup, meshing retries, queue contention, reproducibility failures) rather than solver runtime. Need for accelerated aerodynamic design iteration while preserving reliability and safety.
Method: Contract-centric blueprint for self-evolving coding agents that discover executable surrogate pipelines for predicting drag coefficient. Formulates surrogate discovery as constrained optimization over programs (not static models). Combines Famou-Agent-style evaluator feedback with population-based island evolution, structured mutations (data, model, loss, split policies), and multi-objective selection balancing ranking quality, stability, and cost. Hard evaluation contract enforces leakage prevention, deterministic replay, multi-seed robustness, and resource budgets.
Result: Best system reaches Combined Score of 0.9335 with sign-accuracy 0.9180 across eight anonymized evolutionary operators. Trajectory and ablation analyses show adaptive sampling and island migration are primary drivers of convergence quality. Deployment uses “screen-and-escalate” model where surrogates provide high-throughput ranking for design exploration, with low-confidence cases escalated to high-fidelity CFD.
Conclusion: Provides auditable, reusable workflow for accelerating aerodynamic design iteration while preserving decision-grade reliability, governance traceability, and safety boundaries through automated surrogate discovery and escalation mechanisms.
Abstract: High-fidelity vehicle drag evaluation is constrained less by solver runtime than by workflow friction: geometry cleanup, meshing retries, queue contention, and reproducibility failures across teams. We present a contract-centric blueprint for self-evolving coding agents that discover executable surrogate pipelines for predicting drag coefficient C_d under industrial constraints. The method formulates surrogate discovery as constrained optimization over programs, not static model instances, and combines Famou-Agent-style evaluator feedback with population-based island evolution, structured mutations (data, model, loss, and split policies), and multi-objective selection balancing ranking quality, stability, and cost. A hard evaluation contract enforces leakage prevention, deterministic replay, multi-seed robustness, and resource budgets before any candidate is admitted. Across eight anonymized evolutionary operators, the best system reaches a Combined Score of 0.9335 with sign-accuracy 0.9180, while trajectory and ablation analyses show that adaptive sampling and island migration are primary drivers of convergence quality. The deployment model is explicitly "screen-and-escalate": surrogates provide high-throughput ranking for design exploration, but low-confidence or out-of-distribution cases are automatically escalated to high-fidelity CFD. The resulting contribution is an auditable, reusable workflow for accelerating aerodynamic design iteration while preserving decision-grade reliability, governance traceability, and safety boundaries.
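Population-based island evolution with periodic migration can be illustrated on a toy one-dimensional search problem. This is a generic island model, not the paper's agentic pipeline over programs; population sizes, mutation scale, and migration schedule are all illustrative:

```python
import random

def island_evolve(fitness, pop_size=6, islands=3, gens=30, migrate_every=10, rng=None):
    """Minimal island-model evolution over real-valued candidates:
    elitist mutation within each island, plus periodic ring migration."""
    rng = rng or random.Random(0)
    pops = [[rng.uniform(-5, 5) for _ in range(pop_size)] for _ in range(islands)]
    for g in range(gens):
        for pop in pops:
            pop.sort(key=fitness, reverse=True)       # best candidates first
            half = pop_size // 2
            # replace the worst half with mutated copies of the best half
            pop[half:] = [pop[i] + rng.gauss(0, 0.3) for i in range(half)]
        if (g + 1) % migrate_every == 0:
            # ring migration: each island receives its neighbor's best candidate
            bests = [max(pop, key=fitness) for pop in pops]
            for i, pop in enumerate(pops):
                pop[-1] = bests[(i - 1) % islands]
    return max((c for pop in pops for c in pop), key=fitness)
```

Migration periodically reseeds islands with each other's best candidates, which is the mechanism the paper's ablations identify as a primary driver of convergence quality.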
[660] Compensating Visual Insufficiency with Stratified Language Guidance for Long-Tail Class Incremental Learning
Xi Wang, Xu Yang, Donghao Sun, Cheng Deng
Main category: cs.AI
TL;DR: A method using LLM-generated hierarchical language trees to address long-tail class incremental learning by providing adaptive semantic guidance for tail classes and structural alignment to prevent forgetting.
Details
Motivation: Long-tail class incremental learning faces dual challenges: scarcity of tail class samples hampers learning, and evolving imbalanced distributions exacerbate catastrophic forgetting. Language knowledge offers informativeness and scalability to address these issues.
Method: 1) Analyze LT CIL data distribution to guide LLMs in generating stratified language trees organizing semantic information hierarchically. 2) Stratified adaptive language guidance uses learnable weights to merge multi-scale semantic representations for dynamic supervisory adjustment of tail classes. 3) Stratified alignment language guidance exploits the language tree's structural stability to constrain optimization and reinforce semantic visual alignment.
Result: Extensive experiments on multiple benchmarks demonstrate state-of-the-art performance in long-tail class incremental learning.
Conclusion: The method effectively leverages language knowledge through hierarchical semantic organization to address both data imbalance in tail classes and catastrophic forgetting in incremental learning scenarios.
Abstract: Long-tail class incremental learning (LT CIL) remains highly challenging because the scarcity of samples in tail classes not only hampers their learning but also exacerbates catastrophic forgetting under continuously evolving and imbalanced data distributions. To tackle these issues, we exploit the informativeness and scalability of language knowledge. Specifically, we analyze the LT CIL data distribution to guide large language models (LLMs) in generating a stratified language tree that hierarchically organizes semantic information from coarse to fine granularity. Building upon this structure, we introduce stratified adaptive language guidance, which leverages learnable weights to merge multi-scale semantic representations, thereby enabling dynamic supervisory adjustment for tail classes and alleviating the impact of data imbalance. Furthermore, we introduce stratified alignment language guidance, which exploits the structural stability of the language tree to constrain optimization and reinforce semantic visual alignment, thereby alleviating catastrophic forgetting. Extensive experiments on multiple benchmarks demonstrate that our method achieves state-of-the-art performance.
[661] CurvZO: Adaptive Curvature-Guided Sparse Zeroth-Order Optimization for Efficient LLM Fine-Tuning
Shuo Wang, Ziyu Chen, Ming Tang
Main category: cs.AI
TL;DR: CurvZO: Adaptive curvature-guided sparse zeroth-order optimization for memory-efficient LLM fine-tuning using forward passes only, improving convergence speed and accuracy.
Details
Motivation: Fine-tuning LLMs with backpropagation has high memory overhead, limiting scalability on resource-constrained hardware. Zeroth-order optimization offers memory efficiency but suffers from slow/unstable convergence due to high-variance gradient estimates. Sparse ZO updates help but need effective parameter selection, which is challenging with only scalar feedback.
Method: Proposes CurvZO, which tracks curvature signals online from scalar ZO feedback and uses these to construct parameter-wise sampling distributions for coordinate selection. Dynamically adapts perturbation budget based on evolving curvature signal distribution to maintain focused yet exploratory sparse ZO updates.
Result: Extensive experiments on OPT and Llama across diverse NLP tasks show CurvZO consistently improves fine-tuning performance and reduces training time over ZO baselines. Achieves up to a 4.4-point accuracy improvement and up to a 2× speedup while preserving memory efficiency.
Conclusion: CurvZO provides an effective memory-efficient alternative for LLM fine-tuning that addresses convergence issues in zeroth-order optimization through adaptive curvature-guided sparse updates, making large model fine-tuning more accessible on resource-constrained hardware.
Abstract: Fine-tuning large language models (LLMs) with backpropagation achieves high performance but incurs substantial memory overhead, limiting scalability on resource-constrained hardware. Zeroth-order (ZO) optimization provides a memory-efficient alternative by relying solely on forward passes, yet it typically suffers from slow or unstable convergence due to high-variance gradient estimates. Sparse ZO updates partially address this issue by perturbing only a subset of parameters, but their effectiveness hinges on selecting informative parameters, which is challenging in ZO optimization because each query yields only scalar feedback. We propose Adaptive Curvature-Guided Sparse Zeroth-Order Optimization (CurvZO), which tracks curvature signals online from scalar ZO feedback and leverages these signals to construct a parameter-wise sampling distribution for selecting coordinates at each update, reducing the variance of the sparse ZO gradient estimator. Moreover, CurvZO dynamically adapts the perturbation budget to the evolving curvature signal distribution, yielding sparse ZO updates that remain both focused and sufficiently exploratory. Extensive experiments on OPT and Llama across diverse NLP tasks show that CurvZO consistently improves fine-tuning performance and reduces training time over ZO baselines. It improves accuracy by up to 4.4 points and achieves up to a 2× speedup, while preserving memory efficiency.
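The core loop, curvature-weighted coordinate sampling plus a forward-only two-point gradient estimate, can be sketched as follows. The estimator and update rule are simplified illustrations in the spirit of sparse ZO methods, not CurvZO's exact algorithm, and the curvature proxy is assumed to be tracked externally:

```python
import random

def sparse_zo_step(params, loss_fn, curvature, k=2, eps=1e-3, lr=0.1, rng=None):
    """One sparse zeroth-order update: sample k coordinates with probability
    proportional to a tracked curvature proxy, then estimate a directional
    gradient from two forward evaluations (scalar feedback only)."""
    rng = rng or random.Random(0)
    n = len(params)
    total = sum(curvature)
    weights = [c / total for c in curvature]
    idx = rng.choices(range(n), weights=weights, k=k)
    signs = {i: rng.choice([-1.0, 1.0]) for i in set(idx)}
    plus, minus = list(params), list(params)
    for i, s in signs.items():
        plus[i] += eps * s
        minus[i] -= eps * s
    g = (loss_fn(plus) - loss_fn(minus)) / (2 * eps)  # two forward passes
    new = list(params)
    for i, s in signs.items():
        new[i] -= lr * g * s
    return new
```

No gradients are ever materialized: each step touches only the sampled coordinates and queries the loss twice, which is what keeps memory at inference-level cost.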
[662] EvoIdeator: Evolving Scientific Ideas through Checklist-Grounded Reinforcement Learning
Andreas Sauter, Yuyue Zhao, Jacopo Urbani, Wenxiang Hu, Zaiqiao Meng, Lun Zhou, Xiaohui Yan, Yougang Lyu
Main category: cs.AI
TL;DR: EvoIdeator: A framework for evolving scientific ideas using RL with checklist-grounded feedback, combining lexicographic rewards and fine-grained language feedback to improve autonomous ideation.
Details
Motivation: Current LLMs struggle with iterative evolution of scientific ideas into high-quality proposals. Existing RL methods use scalar rewards lacking granularity, while language-based refinement methods are limited to inference-time prompting without model optimization.
Method: Proposes EvoIdeator framework that aligns RL training with checklist-grounded feedback. Uses structured judge model to generate: 1) lexicographic rewards for multi-dimensional optimization, and 2) fine-grained language feedback with span-level critiques on grounding, feasibility, and rigor.
Result: EvoIdeator built on Qwen3-4B significantly outperforms much larger frontier models across scientific metrics. The learned policy generalizes well to diverse external feedback sources without further fine-tuning.
Conclusion: EvoIdeator offers a scalable and rigorous path toward self-refining autonomous ideation by conditioning policies to systematically utilize precise feedback during both optimization and inference.
Abstract: Scientific idea generation is a cornerstone of autonomous knowledge discovery, yet the iterative evolution required to transform initial concepts into high-quality research proposals remains a formidable challenge for Large Language Models (LLMs). Existing Reinforcement Learning (RL) paradigms often rely on rubric-based scalar rewards that provide global quality scores but lack actionable granularity. Conversely, language-based refinement methods are typically confined to inference-time prompting, targeting models that are not explicitly optimized to internalize such critiques. To bridge this gap, we propose EvoIdeator, a framework that facilitates the evolution of scientific ideas by aligning the RL training objective with checklist-grounded feedback. EvoIdeator leverages a structured judge model to generate two synergistic signals: (1) lexicographic rewards for multi-dimensional optimization, and (2) fine-grained language feedback that offers span-level critiques regarding grounding, feasibility, and methodological rigor. By integrating these signals into the RL loop, we condition the policy to systematically utilize precise feedback during both optimization and inference. Extensive experiments demonstrate that EvoIdeator, built on Qwen3-4B, significantly outperforms much larger frontier models across key scientific metrics. Crucially, the learned policy exhibits strong generalization to diverse external feedback sources without further fine-tuning, offering a scalable and rigorous path toward self-refining autonomous ideation.
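A lexicographic reward orders candidates by earlier checklist dimensions before later ones, so a gain in grounding can never be traded away for fluency. A minimal sketch (the dimension ordering here is a hypothetical example):

```python
def lexicographic_compare(a, b):
    """Compare two score vectors lexicographically: earlier dimensions
    (e.g. grounding) strictly dominate later ones (e.g. fluency).
    Returns 1 if a > b, -1 if a < b, 0 if equal."""
    for x, y in zip(a, b):
        if x != y:
            return 1 if x > y else -1
    return 0

def best_candidate(candidates):
    """Pick the candidate whose checklist score vector is lexicographically
    greatest, i.e. the one a lexicographic reward would rank on top."""
    best = candidates[0]
    for c in candidates[1:]:
        if lexicographic_compare(c["scores"], best["scores"]) > 0:
            best = c
    return best["id"]
```

Note that candidate "c" below loses despite the highest later-dimension scores, because the first dimension dominates; that strict priority is what distinguishes lexicographic rewards from a weighted sum.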
[663] The Reasoning Error About Reasoning: Why Different Types of Reasoning Require Different Representational Structures
Yiling Wu
Main category: cs.AI
TL;DR: A framework identifying four structural properties of representational systems (operability, consistency, structural preservation, compositionality) that are demanded to different degrees by different forms of reasoning, revealing a boundary between associative/probabilistic reasoning and deductive reasoning requiring all four properties.
Details
Motivation: To provide a systematic account of structural demands on representational systems across psychology, AI, and philosophy of mind, addressing the lack of unified understanding of how different reasoning types impose different structural requirements.
Method: Proposes a theoretical framework identifying four key structural properties of representational systems, analyzes how different reasoning types (induction, analogy, causal inference, deduction, formal logic) demand these properties to varying degrees, and examines the boundary between associative/probabilistic reasoning and deductive reasoning.
Result: Identifies a principal structural boundary: reasoning types below it can operate on associative, probabilistic representations, while those above it require all four properties to be fully satisfied. Shows that scaling statistical learning without structural reorganization is insufficient for deductive reasoning.
Conclusion: The framework provides a necessary-condition account that reorganizes existing debates about reasoning and representation, with testable predictions about compounding degradation, selective vulnerability to structural disruption, and irreducibility under scaling.
Abstract: Different types of reasoning impose different structural demands on representational systems, yet no systematic account of these demands exists across psychology, AI, and philosophy of mind. I propose a framework identifying four structural properties of representational systems: operability, consistency, structural preservation, and compositionality. These properties are demanded to different degrees by different forms of reasoning, from induction through analogy and causal inference to deduction and formal logic. Each property excludes a distinct class of reasoning failure. The analysis reveals a principal structural boundary: reasoning types below it can operate on associative, probabilistic representations, while those above it require all four properties to be fully satisfied. Scaling statistical learning without structural reorganization is insufficient to cross this boundary, because the structural guarantees required by deductive reasoning cannot be approximated through probabilistic means. Converging evidence from AI evaluation, developmental psychology, and cognitive neuroscience supports the framework at different levels of directness. Three testable predictions are derived, including compounding degradation, selective vulnerability to targeted structural disruption, and irreducibility under scaling. The framework is a necessary-condition account, agnostic about representational format, that aims to reorganize existing debates rather than close them.
[664] The Presupposition Problem in Representation Genesis
Yiling Wu
Main category: cs.AI
TL;DR: Philosophical analysis of representation genesis in large language models, arguing existing frameworks presuppose representation and thus cannot explain its emergence
Details
Motivation: LLMs achieve high cognitive performance without clearly undergoing representation genesis (the transition to content-sensitive states), making this philosophical question urgent for understanding which cognitive capacities are affected and why.
Method: Conceptual diagnosis analyzing major philosophical frameworks (Language of Thought, teleosemantics, predictive processing, enactivism, genetic phenomenology) to identify a common Representation Presupposition structure that generates explanatory deferral.
Result: Identifies Representation Regress problem where existing frameworks import representational concepts to explain representation emergence, establishes structure of the problem and derives two minimum adequacy conditions for any account avoiding this pattern
Conclusion: LLMs make the absence of a theory explaining representation genesis consequential rather than merely theoretical, highlighting need for new conceptual resources
Abstract: Large language models are the first systems to achieve high cognitive performance without clearly undergoing representation genesis: the transition from a non-representing physical system to one whose states guide behavior in a content-sensitive way. Prior cognitive systems had already made this transition before we could examine it, and philosophy of mind treated genesis as a background condition rather than an explanatory target. LLMs provide a case that does not clearly involve this transition, making the genesis question newly urgent: if genesis did not occur, which cognitive capacities are affected, and why? We currently lack the conceptual resources to answer this. The reason, this paper argues, is structural. Major frameworks in philosophy of mind, including the Language of Thought hypothesis, teleosemantics, predictive processing, enactivism, and genetic phenomenology, share a common feature when applied to the genesis question: at some explanatory step, each deploys concepts whose explanatory purchase depends on the system already being organized as a representer. This pattern, which we call the Representation Presupposition structure, generates systematic explanatory deferral. Attempts to explain the first acquisition of content-manipulable representation within the existing categorical vocabulary import resources from the representational side of the transition itself. We call this the Representation Regress. The paper offers a conceptual diagnosis rather than a new theory, establishing the structure of the problem and deriving two minimum adequacy conditions for any account that avoids this pattern. LLMs make the absence of such a theory consequential rather than merely theoretical.
[665] Agentic Personas for Adaptive Scientific Explanations with Knowledge Graphs
Susana Nunes, Tiago Guerreiro, Catia Pesquita
Main category: cs.AI
TL;DR: Reinforcement learning approach using agentic personas to generate adaptive scientific explanations that match expert epistemic preferences, reducing feedback requirements by 100x.
Details
Motivation: Current AI explanation methods assume static user models, failing to adapt to diverse expert reasoning strategies and epistemic stances in complex domains like scientific discovery. Knowledge graph-based explanations inherit this limitation, and direct human feedback is scarce for creating adaptive explanations.
Method: Reinforcement learning approach incorporating agentic personas, structured representations of expert reasoning strategies that guide explanation generation toward specific epistemic preferences. Evaluated on knowledge graph-based explanations for drug discovery with two personas capturing distinct epistemic stances derived from expert feedback.
Result: Persona-driven explanations match state-of-the-art predictive performance while aligning with corresponding expert preferences. Adaptive explanations were consistently preferred over non-adaptive baselines (n=22). Persona-based training reduces feedback requirements by two orders of magnitude (100x).
Conclusion: Agentic personas enable scalable adaptive explainability for AI systems in complex, high-stakes domains by capturing diverse expert reasoning strategies and reducing reliance on direct human feedback.
Abstract: AI explanation methods often assume a static user model, producing non-adaptive explanations regardless of expert goals, reasoning strategies, or decision contexts. Knowledge graph-based explanations, despite their capacity for grounded, path-based reasoning, inherit this limitation. In complex domains such as scientific discovery, this assumption fails to capture the diversity of cognitive strategies and epistemic stances among experts, preventing explanations that foster deeper understanding and informed decision-making. However, the scarcity of human experts limits the use of direct human feedback to produce adaptive explanations. We present a reinforcement learning approach for scientific explanation generation that incorporates agentic personas, structured representations of expert reasoning strategies, that guide the explanation agent towards specific epistemic preferences. In an evaluation of knowledge graph-based explanations for drug discovery, we tested two personas that capture distinct epistemic stances derived from expert feedback. Results show that persona-driven explanations match state-of-the-art predictive performance while persona preferences closely align with those of their corresponding experts. Adaptive explanations were consistently preferred over non-adaptive baselines (n = 22), and persona-based training reduces feedback requirements by two orders of magnitude. These findings demonstrate how agentic personas enable scalable adaptive explainability for AI systems in complex and high-stakes domains.
[666] Reasoning or Rhetoric? An Empirical Analysis of Moral Reasoning Explanations in Large Language Models
Aryan Kasat, Smriti Singh, Aman Chadha, Vinija Jain
Main category: cs.AI
TL;DR: LLMs exhibit post-conventional moral reasoning patterns that are the inverse of human developmental norms, showing moral decoupling between justification and action choices, suggesting they’ve learned rhetorical conventions rather than genuine moral reasoning.
Details
Motivation: To investigate whether LLMs genuinely reason morally through developmental stages (Kohlberg's theory) or merely produce reasoning-like outputs through alignment training without an underlying developmental trajectory.
Method: Used an LLM-as-judge scoring pipeline validated across three judge models to classify 600+ responses from 13 LLMs across six moral dilemmas, with ten complementary analyses to characterize patterns and internal coherence.
Result: Responses overwhelmingly correspond to post-conventional reasoning (Stages 5-6) regardless of model characteristics - inverse of human norms where Stage 4 dominates. Models show moral decoupling (inconsistency between justification and action), near-robotic cross-dilemma consistency, with scale having small effect and training type no significant effect.
Conclusion: Patterns suggest moral ventriloquism - LLMs acquire rhetorical conventions of mature moral reasoning through alignment training without the underlying developmental trajectory those conventions represent.
Abstract: Do large language models reason morally, or do they merely sound like they do? We investigate whether LLM responses to moral dilemmas exhibit genuine developmental progression through Kohlberg’s stages of moral development, or whether alignment training instead produces reasoning-like outputs that superficially resemble mature moral judgment without the underlying developmental trajectory. Using an LLM-as-judge scoring pipeline validated across three judge models, we classify more than 600 responses from 13 LLMs spanning a range of architectures, parameter scales, and training regimes across six classical moral dilemmas, and conduct ten complementary analyses to characterize the nature and internal coherence of the resulting patterns. Our results reveal a striking inversion: responses overwhelmingly correspond to post-conventional reasoning (Stages 5-6) regardless of model size, architecture, or prompting strategy, the effective inverse of human developmental norms, where Stage 4 dominates. Most strikingly, a subset of models exhibit moral decoupling: systematic inconsistency between stated moral justification and action choice, a form of logical incoherence that persists across scale and prompting strategy and represents a direct reasoning consistency failure independent of rhetorical sophistication. Model scale carries a statistically significant but practically small effect; training type has no significant independent main effect; and models exhibit near-robotic cross-dilemma consistency producing logically indistinguishable responses across semantically distinct moral problems. We posit that these patterns constitute evidence for moral ventriloquism: the acquisition, through alignment training, of the rhetorical conventions of mature moral reasoning without the underlying developmental trajectory those conventions are meant to represent.
[667] Tacit Knowledge Management with Generative AI: Proposal of the GenAI SECI Model
Naoshi Uchihira
Main category: cs.AI
TL;DR: Proposes “GenAI SECI” model - an updated knowledge creation process model leveraging generative AI to integrate explicit and tacit knowledge through “Digital Fragmented Knowledge” concept.
Details
Motivation: Current generative AI applications in knowledge management focus mainly on explicit knowledge, with insufficient integration of tacit knowledge. Need for systematic models that handle both knowledge types using generative AI capabilities.
Method: Introduces the “GenAI SECI” model as an updated version of the traditional SECI (Socialization, Externalization, Combination, Internalization) knowledge creation model. The key innovation is the “Digital Fragmented Knowledge” concept, which integrates explicit and tacit knowledge in cyberspace. Presents a concrete system architecture and compares it with prior research models.
Result: Proposes a novel framework for knowledge management using generative AI that addresses the integration gap between explicit and tacit knowledge, with practical system architecture implications.
Conclusion: The GenAI SECI model provides a systematic approach to leverage generative AI for comprehensive knowledge management, bridging the gap between explicit and tacit knowledge through digital integration.
Abstract: The emergence of generative AI is bringing about a significant transformation in knowledge management. Generative AI has the potential to address the limitations of conventional knowledge management systems, and it is increasingly being deployed in real-world settings with promising results. Related research is also expanding rapidly. However, much of this work focuses on research and practice related to the management of explicit knowledge. While fragmentary efforts have been made regarding the management of tacit knowledge using generative AI, the modeling and systematization that handle both tacit and explicit knowledge in an integrated manner remain insufficient. In this paper, we propose the “GenAI SECI” model as an updated version of the knowledge creation process (SECI) model, redesigned to leverage the capabilities of generative AI. A defining feature of the “GenAI SECI” model is the introduction of “Digital Fragmented Knowledge”, a new concept that integrates explicit and tacit knowledge within cyberspace. Furthermore, a concrete system architecture for the proposed model is presented, along with a comparison with prior research models that share a similar problem awareness and objectives.
[668] Guideline-grounded retrieval-augmented generation for ophthalmic clinical decision support
Shuying Chen, Sen Cui, Zhong Cao
Main category: cs.AI
TL;DR: Oph-Guid-RAG: A multimodal visual RAG system for ophthalmology clinical QA that retrieves guideline page images (preserving tables/flowcharts) with controllable retrieval framework and provides traceable outputs.
Details
Motivation: Need for evidence-based clinical decision support in ophthalmology that can handle multimodal guideline content (tables, flowcharts, layout) and provide traceable reasoning with precise evidence grounding.
Method: Treats each guideline page as an independent evidence unit, retrieves page images directly, uses a controllable retrieval framework with routing/filtering, and integrates query decomposition, rewriting, retrieval, reranking, and multimodal reasoning.
Result: Improves overall score by 30% (0.2969→0.3861) and accuracy by 10.4% (0.5956→0.6576) vs GPT-5.2 on hard subset; larger accuracy gain of +24.4% vs GPT-5.4; shows effectiveness on challenging evidence-based cases.
Conclusion: Vision-based retrieval with controllable reasoning improves evidence grounding and robustness in clinical AI; reranking, routing, and retrieval design are critical for stable performance; further work needed for completeness.
Abstract: In this work, we propose Oph-Guid-RAG, a multimodal visual RAG system for ophthalmology clinical question answering and decision support. We treat each guideline page as an independent evidence unit and directly retrieve page images, preserving tables, flowcharts, and layout information. We further design a controllable retrieval framework with routing and filtering, which selectively introduces external evidence and reduces noise. The system integrates query decomposition, query rewriting, retrieval, reranking, and multimodal reasoning, and provides traceable outputs with guideline page references. We evaluate our method on HealthBench using a doctor-based scoring protocol. On the hard subset, our approach improves the overall score from 0.2969 to 0.3861 (+0.0892, +30.0%) compared to GPT-5.2, and achieves higher accuracy, improving from 0.5956 to 0.6576 (+0.0620, +10.4%). Compared to GPT-5.4, our method achieves a larger accuracy gain of +0.1289 (+24.4%). These results show that our method is more effective on challenging cases that require precise, evidence-based reasoning. Ablation studies further show that reranking, routing, and retrieval design are critical for stable performance, especially under difficult settings. Overall, we show how combining vision-based retrieval with controllable reasoning can improve evidence grounding and robustness in clinical AI applications, while noting that further work is needed for completeness.
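The routing-and-filtering step described above can be illustrated with a toy sketch. This is our own simplification under stated assumptions (a keyword-based router and a score threshold are hypothetical stand-ins for the paper's learned components): external guideline pages are consulted only when the query looks evidence-seeking, and weakly matching pages are dropped before reranking.

```python
# Toy sketch of controllable retrieval: route, then filter by score.
# Terms and threshold are illustrative, not from the paper.

def route_and_filter(query, retrieve, min_score=0.4):
    """Decide whether to retrieve at all, then drop weak matches."""
    evidence_terms = ("guideline", "recommended", "dose", "criteria")
    if not any(t in query.lower() for t in evidence_terms):
        return []                    # answer from the model alone
    pages = retrieve(query)          # list of (page_id, score) pairs
    return [p for p in pages if p[1] >= min_score]

fake_retrieve = lambda q: [("page_1", 0.9), ("page_2", 0.2)]
assert route_and_filter("What dose is recommended?", fake_retrieve) == [("page_1", 0.9)]
assert route_and_filter("Say hello", fake_retrieve) == []
```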
[669] Future-Interactions-Aware Trajectory Prediction via Braid Theory
Caio Azevedo, Stefano Sabatini, Sascha Hornauer, Fabien Moutarde
Main category: cs.AI
TL;DR: Using braid theory to improve multi-agent trajectory prediction by adding an auxiliary braid prediction task that captures social interactions between agents.
Details
Motivation: Autonomous vehicles need to predict the future behavior of multiple interacting agents, but existing methods are computationally expensive or rely on heuristics. Braid theory provides an exact mathematical descriptor of multi-agent coordination patterns.
Method: Proposes a novel auxiliary braid prediction task that runs in parallel to trajectory prediction. The model classifies edges between agents into crossing types in the braid representation, capturing how trajectories will cross over time and improving social awareness.
Result: Significant improvements in joint prediction metrics on three separate datasets with negligible added computational complexity during training or inference.
Conclusion: Braid prediction as an auxiliary task effectively infuses models with future intention awareness, leading to more accurate joint predictions of multi-agent behavior without substantial computational overhead.
Abstract: To safely operate, an autonomous vehicle must know the future behavior of a potentially high number of interacting agents around it, a task often posed as multi-agent trajectory prediction. Many previous attempts to model social interactions and solve the joint prediction task either add extensive computational requirements or rely on heuristics to label multi-agent behavior types. Braid theory, in contrast, provides a powerful exact descriptor of multi-agent behavior by projecting future trajectories into braids that express how trajectories cross with each other over time; a braid then corresponds to a specific mode of coordination between the multiple agents in the future. In past work, braids have been used lightly to reason about interacting agents and restrict the attention window of predicted agents. We show that leveraging more fully the expressivity of the braid representation and using it to condition the trajectories themselves leads to even further gains in joint prediction performance, with negligible added complexity either in training or at inference time. We do so by proposing a novel auxiliary task, braid prediction, done in parallel with the trajectory prediction task. By classifying edges between agents into their correct crossing types in the braid representation, the braid prediction task is able to imbue the model with improved social awareness, which is reflected in joint predictions that more closely adhere to the actual multi-agent behavior. This simple auxiliary task allowed us to obtain significant improvements in joint metrics on three separate datasets. We show how the braid prediction task infuses the model with future intention awareness, leading to more accurate joint predictions. Code is available at github.com/caiocj1/traj-pred-braid-theory.
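The braid idea above can be illustrated with a deliberately simplified sketch (ours, not the paper's model): project each trajectory onto one axis and record each swap in the agents' ordering as a signed crossing, which is the spirit of a braid generator between two strands.

```python
# Toy labeling of signed crossings between two agents' 1-D trajectories.

def crossing_labels(traj_a, traj_b):
    """Return +1/-1 crossings between two equal-length 1-D trajectories:
    +1 when a overtakes b, -1 when b overtakes a."""
    labels = []
    for t in range(1, len(traj_a)):
        before = traj_a[t - 1] - traj_b[t - 1]
        after = traj_a[t] - traj_b[t]
        if before < 0 and after > 0:
            labels.append(+1)   # a moves ahead of b
        elif before > 0 and after < 0:
            labels.append(-1)   # b moves ahead of a
    return labels

# Agent a starts behind b, overtakes, then falls behind again.
a = [0.0, 1.0, 2.0, 1.0, 0.0]
b = [1.5, 1.5, 1.5, 1.5, 1.5]
assert crossing_labels(a, b) == [+1, -1]
```

An auxiliary head predicting such labels from encoded agent pairs is one way to phrase the paper's classification target, though the actual braid representation also tracks which strand passes over which.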
[670] A Context Engineering Framework for Improving Enterprise AI Agents based on Digital-Twin MDP
Xi Yang, Aurelie Lozano, Naoki Abe, Bhavya, Saurabh Jha, Noah Zheutlin, Rohan R. Arora, Yu Deng, Daby M. Sow
Main category: cs.AI
TL;DR: A lightweight framework using offline RL to improve LLM-based enterprise agents through context engineering, with a case study in IT automation showing significant improvements.
Details
Motivation: Real-world deployment of AI agents for enterprise automation is constrained by limited data quality/quantity, complex reasoning demands, difficulties with self-play, and a lack of reliable feedback signals.
Method: DT-MDP-CE framework with three components: (1) a Digital-Twin Markov Decision Process to abstract reasoning behavior, (2) robust contrastive inverse RL to estimate reward functions from mixed-quality trajectories, and (3) RL-guided context engineering to improve decision-making.
Result: Extensive experiments on IT automation tasks show consistent and significant improvements over baseline agents across various evaluation settings.
Conclusion: The framework can generalize to other agents with similar characteristics in enterprise environments and addresses key constraints in real-world AI agent deployment.
Abstract: Despite rapid progress in AI agents for enterprise automation and decision-making, their real-world deployment and further performance gains remain constrained by limited data quality and quantity, complex real-world reasoning demands, difficulties with self-play, and the lack of reliable feedback signals. To address these challenges, we propose a lightweight, model-agnostic framework for improving LLM-based enterprise agents via offline reinforcement learning (RL). The proposed Context Engineering via DT-MDP (DT-MDP-CE) framework comprises three key components: (1) A Digital-Twin Markov Decision Process (DT-MDP), which abstracts the agent’s reasoning behavior as a finite MDP; (2) A robust contrastive inverse RL, which, armed with the DT-MDP, efficiently estimates a well-founded reward function and induces policies from mixed-quality offline trajectories; and (3) RL-guided context engineering, which uses the policy obtained from the integrated process of (1) and (2) to improve the agent’s decision-making behavior. As a case study, we apply the framework to a representative task in the enterprise-oriented domain of IT automation. Extensive experimental results demonstrate consistent and significant improvements over baseline agents across a wide range of evaluation settings, suggesting that the framework can generalize to other agents sharing similar characteristics in enterprise environments.
[671] GSEM: Graph-based Self-Evolving Memory for Experience Augmented Clinical Reasoning
Xiao Han, Yuzheng Fan, Sendong Zhao, Haochun Wang, Bing Qin
Main category: cs.AI
TL;DR: GSEM is a graph-based memory framework for clinical decision-making that organizes experiences into a dual-layer graph structure for better retrieval and reuse, outperforming baselines on medical benchmarks.
Details
Motivation: Current memory-augmented methods for clinical decision-making store experiences as independent records without explicit relational structure, leading to noisy retrieval, unreliable reuse, and sometimes worse performance than direct LLM inference.
Method: Proposes GSEM (Graph-based Self-Evolving Memory), a clinical memory framework with a dual-layer memory graph that captures decision structure within each experience and relational dependencies across experiences, supporting applicability-aware retrieval and online feedback-driven calibration of node quality and edge weights.
Result: Across MedR-Bench and MedAgentsBench with two LLM backbones, GSEM achieves the highest average accuracy among all baselines, reaching 70.90% with DeepSeek-V3.2 and 69.24% with Qwen3.5-35B.
Conclusion: GSEM provides an effective graph-based memory framework for clinical decision-making that organizes experiences structurally and relationally, enabling better retrieval and reuse of prior decision experience.
Abstract: Clinical decision-making agents can benefit from reusing prior decision experience. However, many memory-augmented methods store experiences as independent records without explicit relational structure, which may introduce noisy retrieval, unreliable reuse, and in some cases even hurt performance compared to direct LLM inference. We propose GSEM (Graph-based Self-Evolving Memory), a clinical memory framework that organizes clinical experiences into a dual-layer memory graph, capturing both the decision structure within each experience and the relational dependencies across experiences, and supporting applicability-aware retrieval and online feedback-driven calibration of node quality and edge weights. Across MedR-Bench and MedAgentsBench with two LLM backbones, GSEM achieves the highest average accuracy among all baselines, reaching 70.90% and 69.24% with DeepSeek-V3.2 and Qwen3.5-35B, respectively. Code is available at https://github.com/xhan1022/gsem.
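The feedback-driven calibration of node quality mentioned above could take many forms; one minimal sketch (our assumption, not GSEM's actual rule) is an exponential-moving-average update that nudges a memory node's quality toward 1 when its reuse led to a correct decision and toward 0 otherwise, so retrieval gradually favors experiences that helped before.

```python
# Hypothetical quality-calibration rule for memory nodes; `lr` controls
# how quickly outcomes override the prior quality estimate.

def update_quality(quality, node, success, lr=0.1):
    """EMA update of a node's quality score in [0, 1]; returns new value."""
    target = 1.0 if success else 0.0
    quality[node] = (1 - lr) * quality[node] + lr * target
    return quality[node]

quality = {"case_42": 0.5}
update_quality(quality, "case_42", success=True)
assert abs(quality["case_42"] - 0.55) < 1e-9
```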
[672] SpecTM: Spectral Targeted Masking for Trustworthy Foundation Models
Syed Usama Imtiaz, Mitra Nasr Azadani, Nasrin Alamdari
Main category: cs.AI
TL;DR: SpecTM is a physics-informed masking method for Earth observation foundation models that uses targeted spectral masking to improve representation learning and downstream prediction tasks for environmental monitoring.
Details
Motivation: Current Earth observation foundation models use stochastic masking that doesn't enforce physics constraints, limiting trustworthiness for predictive models that guide public health decisions, especially in environmental monitoring applications.
Method: Proposes SpecTM (Spectral Targeted Masking), a physics-informed masking design that encourages reconstruction of targeted bands from cross-spectral context during pretraining. Uses an adaptable multi-task self-supervised learning framework with band reconstruction, bio-optical index inference, and 8-day-ahead temporal prediction tasks.
Result: Achieves R^2 = 0.695 (current week) and R^2 = 0.620 (8-day-ahead) predictions for microcystin concentration regression, surpassing baseline models by +34% and +99% respectively. Targeted masking improves predictions by +0.037 R^2 over random masking, with 2.2x superior label efficiency under extreme scarcity.
Conclusion: SpecTM enables physics-informed representation learning across Earth observation domains and improves interpretability of foundation models by incorporating domain knowledge through targeted spectral masking.
Abstract: Foundation models are now increasingly being developed for Earth observation (EO), yet they often rely on stochastic masking that does not explicitly enforce physics constraints; a critical trustworthiness limitation, in particular for predictive models that guide public health decisions. In this work, we propose SpecTM (Spectral Targeted Masking), a physics-informed masking design that encourages the reconstruction of targeted bands from cross-spectral context during pretraining. To achieve this, we developed an adaptable multi-task (band reconstruction, bio-optical index inference, and 8-day-ahead temporal prediction) self-supervised learning (SSL) framework that encodes spectrally intrinsic representations via joint optimization, and evaluated it on a downstream microcystin concentration regression model using NASA PACE hyperspectral imagery over Lake Erie. SpecTM achieves R^2 = 0.695 (current week) and R^2 = 0.620 (8-day-ahead) predictions, surpassing all baseline models by +34% (Ridge, 0.51) and +99% (SVR, 0.31), respectively. Our ablation experiments show targeted masking improves predictions by +0.037 R^2 over random masking. Furthermore, it outperforms strong baselines with 2.2x superior label efficiency under extreme scarcity. SpecTM enables physics-informed representation learning across EO domains and improves the interpretability of foundation models.
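The core masking idea can be made concrete with a small sketch. This is our own illustration rather than SpecTM's code, and the band indices are hypothetical: instead of hiding random patches, specific spectral bands of a hyperspectral cube are zeroed out, so a model pretrained on reconstruction must recover them from the remaining cross-spectral context.

```python
import numpy as np

# Targeted spectral masking sketch: zero out chosen bands of a
# (bands, H, W) cube and return the band-level mask.

def spectral_targeted_mask(cube, target_bands):
    """Return (masked cube, boolean band mask) for the chosen bands."""
    masked = cube.copy()
    mask = np.zeros(cube.shape[0], dtype=bool)
    mask[list(target_bands)] = True
    masked[mask] = 0.0              # hide targeted bands entirely
    return masked, mask

rng = np.random.default_rng(0)
cube = rng.random((8, 4, 4))        # toy 8-band image
masked, mask = spectral_targeted_mask(cube, target_bands=[2, 5])
assert masked[2].sum() == 0 and masked[5].sum() == 0
assert np.allclose(masked[0], cube[0])  # untargeted bands untouched
```

A random-masking baseline would instead sample `target_bands` uniformly each step; the paper's ablation suggests choosing them deliberately is what yields the +0.037 R^2 gain.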
[673] MARCUS: An agentic, multimodal vision-language model for cardiac diagnosis and management
Jack W O’Sullivan, Mohammad Asadi, Lennart Elbe, Akshay Chaudhari, Tahoura Nedaee, Francois Haddad, Michael Salerno, Li Fe-Fei, Ehsan Adeli, Rima Arnaout, Euan A Ashley
Main category: cs.AI
TL;DR: MARCUS is a multimodal vision-language system for interpreting cardiac tests (ECGs, echocardiograms, CMR) using an agentic architecture with domain-specific visual encoders, achieving state-of-the-art performance over frontier models.
Details
Motivation: Cardiovascular disease is the leading global cause of death, with progress limited by human interpretation of complex cardiac tests. Current AI vision-language models are limited to single-modality inputs and lack interactivity.
Method: MARCUS employs a hierarchical agentic architecture with modality-specific vision-language expert models (ECG, echocardiography, CMR) that integrate domain-trained visual encoders with multi-stage language model optimization, coordinated by a multimodal orchestrator. Trained on 13.5M images and 1.6M expert-curated questions.
Result: Achieves state-of-the-art performance: 87-91% accuracy for ECG, 67-86% for echocardiography, 85-88% for CMR, outperforming frontier models by 34-45%. On multimodal cases, achieves 70% accuracy (triple frontier models) with 1.7-3.0x higher free-text quality scores. Shows resistance to mirage reasoning.
Conclusion: Domain-specific visual encoders with an agentic orchestrator enable effective multimodal cardiac interpretation. The system demonstrates superior performance over frontier models and addresses limitations of current vision-language models.
Abstract: Cardiovascular disease remains the leading cause of global mortality, with progress hindered by human interpretation of complex cardiac tests. Current AI vision-language models are limited to single-modality inputs and are non-interactive. We present MARCUS (Multimodal Autonomous Reasoning and Chat for Ultrasound and Signals), an agentic vision-language system for end-to-end interpretation of electrocardiograms (ECGs), echocardiograms, and cardiac magnetic resonance imaging (CMR) independently and as multimodal input. MARCUS employs a hierarchical agentic architecture comprising modality-specific vision-language expert models, each integrating domain-trained visual encoders with multi-stage language model optimization, coordinated by a multimodal orchestrator. Trained on 13.5 million images (0.25M ECGs, 1.3M echocardiogram images, 12M CMR images) and our novel expert-curated dataset spanning 1.6 million questions, MARCUS achieves state-of-the-art performance surpassing frontier models (GPT-5 Thinking, Gemini 2.5 Pro Deep Think). Across internal (Stanford) and external (UCSF) test cohorts, MARCUS achieves accuracies of 87-91% for ECG, 67-86% for echocardiography, and 85-88% for CMR, outperforming frontier models by 34-45% (P<0.001). On multimodal cases, MARCUS achieved 70% accuracy, nearly triple that of frontier models (22-28%), with 1.7-3.0x higher free-text quality scores. Our agentic architecture also confers resistance to mirage reasoning, whereby vision-language models derive reasoning from unintended textual signals or hallucinated visual content. MARCUS demonstrates that domain-specific visual encoders with an agentic orchestrator enable multimodal cardiac interpretation. We release our models, code, and benchmark open-source.
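The orchestrator pattern described above can be sketched generically. This toy dispatcher is our illustration of the architecture, not MARCUS's code: each input modality is routed to its expert model, and the per-modality findings are collected for a downstream synthesis step.

```python
# Toy modality-routing orchestrator: dispatch each input to the matching
# expert and gather findings. Expert callables stand in for the paper's
# vision-language models.

def orchestrate(inputs, experts):
    """inputs: dict modality -> data; experts: dict modality -> callable."""
    findings = {}
    for modality, data in inputs.items():
        if modality not in experts:
            raise ValueError(f"no expert for modality: {modality}")
        findings[modality] = experts[modality](data)
    return findings

experts = {
    "ecg": lambda x: f"ecg finding for {x}",
    "echo": lambda x: f"echo finding for {x}",
}
out = orchestrate({"ecg": "trace1", "echo": "clip7"}, experts)
assert out["ecg"] == "ecg finding for trace1"
```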
[674] PrecLLM: A Privacy-Preserving Framework for Efficient Clinical Annotation Extraction from Unstructured EHRs using Small-Scale LLMs
Yixiang Qu, Yifan Dai, Shilin Yu, Pradham Tanikella, Malvika Pillai, Walter Chen, Jialiu Xie, Yishan Ren, Duan Wang, Yikai Wang, Sid Sheth, Guanting Chen, Yufeng Liu, Travis Schrank, Trevor Hackman, Didong Li, Di Wu
Main category: cs.AI
TL;DR: PrecLLM: A resource-efficient preprocessing framework for smaller LLMs in clinical EHR analysis, combining regex and RAG to extract key information from unstructured notes for privacy-sensitive healthcare applications.
Details
Motivation: LLMs show promise for automated text annotation in clinical settings but face deployment challenges due to privacy regulations, computational costs of processing large EHRs, and accuracy limitations of smaller LLMs suitable for local deployment.
Method: Developed PrecLLM framework with preprocessing using regular expressions and Retrieval-Augmented Generation (RAG) to extract and highlight key information from unstructured clinical notes, enabling smaller LLMs to handle EHR tasks efficiently in resource-constrained environments.
Result: PrecLLM substantially enhanced performance of smaller LLMs on EHR tasks (sensitivity, specificity, F1 scores) across private EPIC HNC dataset and public MIMIC-IV dataset, outperforming fine-tuned LLMs while maintaining privacy compliance.
Conclusion: PrecLLM provides optimized LLM performance for local, secure healthcare applications, offering practical guidance for clinical LLM deployment while addressing privacy, computational feasibility, and clinical applicability challenges.
Abstract: Large Language Models (LLMs) have demonstrated remarkable proficiency in automated text annotation within natural language processing. However, their deployment in clinical settings is severely constrained by strict privacy regulations and the prohibitive computational cost of processing voluminous unstructured Electronic Health Records (EHRs). In this study, we developed a resource-efficient preprocessing technique that can be adopted in existing LLM procedures. This approach is particularly useful for smaller LLMs, which are often more accuracy-challenged, and forms a compact LLM framework optimized for local deployment in computational environments with stringent privacy requirements and restricted access to high-performance GPUs (PrecLLM). The preprocessing step includes both regular expressions (regex) and Retrieval-Augmented Generation (RAG) to extract and highlight key information from unstructured clinical notes. Pre-filtering long and unstructured texts enhanced the performance of smaller LLMs on EHR-related tasks. Evaluation was performed on two distinct cohorts: a locally curated private EHR dataset from the EPIC system for a Head and Neck Cancer (HNC) cohort, and the publicly available EHR dataset (MIMIC-IV). Using MIMIC-IV, we further compared PrecLLM against fine-tuned LLMs. Results demonstrated that PrecLLM substantially enhanced the performance of the original smaller LLMs in terms of sensitivity, specificity, and F1 scores, making it well-suited for privacy-sensitive and resource-constrained applications. This study offers optimized LLM performance for local, secure, and efficient healthcare applications, and provides practical guidance for clinical LLM deployment while addressing challenges related to privacy, computational feasibility, and clinical applicability.
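The preprocessing step described above (regex plus retrieval to shrink a note before a small local LLM sees it) can be sketched as follows. The keyword patterns, the sample note, and the sentence splitter are illustrative stand-ins, not the paper's actual rules, and the RAG channel is omitted:

```python
import re

# Hypothetical keyword patterns for a head-and-neck-cancer annotation task;
# the paper's actual regex rules and RAG retriever are not shown in the digest.
PATTERNS = [
    re.compile(r"\bstage\s+[ivx]+\b", re.IGNORECASE),
    re.compile(r"\b(squamous cell carcinoma|hpv[- ](positive|negative))\b", re.IGNORECASE),
    re.compile(r"\b(radiation|chemotherapy|resection)\b", re.IGNORECASE),
]

def prefilter_note(note: str, max_sentences: int = 5) -> str:
    """Keep only sentences matching a clinical pattern, shrinking the prompt
    a small, locally deployed LLM has to process."""
    sentences = re.split(r"(?<=[.!?])\s+", note)
    hits = [s for s in sentences if any(p.search(s) for p in PATTERNS)]
    return " ".join(hits[:max_sentences])

note = (
    "Patient seen for routine follow-up. Vital signs stable. "
    "Pathology confirmed squamous cell carcinoma, HPV-positive. "
    "Plan: concurrent chemotherapy and radiation. Family history unremarkable."
)
print(prefilter_note(note))
```

The point of the design is that only the pattern-matched sentences reach the LLM, so a long, mostly boilerplate note costs a fraction of the tokens.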
[675] Formula-R1: Incentivizing LLM Reasoning over Complex Tables with Numerical Computation via Formula-Driven Reinforcement Learning
Lang Cao, Jingxian Xu, Hanbing Liu, Jinyu Wang, Mengyu Zhou, Haoyu Dong, Shi Han, Dongmei Zhang
Main category: cs.AI
TL;DR: Formula-R1: A model trained via Formula Tuning (Fortune), a formula-driven reinforcement learning framework that teaches LLMs to generate executable spreadsheet formulas for table reasoning, improving numerical reasoning over complex tabular data.
Details
Motivation: LLMs struggle with accurate numerical reasoning over tabular data, especially in complex table settings beyond simple lookup. Spreadsheet formulas provide powerful symbolic operations for table reasoning that remain underexplored by existing LLMs.
Method: Formula Tuning (Fortune): a formula-driven reinforcement learning framework that trains LLMs to generate executable spreadsheet formulas for question answering over tabular data, using execution success and answer correctness as reward signals instead of supervised formula annotations.
Result: Formula-R1 substantially improves LLM performance on table reasoning across seven benchmarks, particularly for complex tables and multi-step numerical computation. It consistently outperforms prior methods in controlled comparisons.
Conclusion: Formula-driven RL enhances LLMs’ table reasoning capabilities, with broader potential for improving reasoning abilities in LLMs beyond table-specific tasks.
Abstract: Tables are a fundamental medium for organizing and analyzing data, making table reasoning a critical capability for intelligent systems. Although large language models (LLMs) exhibit strong general reasoning abilities, they still struggle with accurate numerical reasoning over tabular data, particularly in complex table settings beyond simple relational lookup. Spreadsheet formulas provide a powerful and expressive interface for executable symbolic operations, enabling rich reasoning patterns that remain largely underexplored by existing LLMs. In this paper, we introduce Formula-R1, a model trained via Formula Tuning (Fortune), a formula-driven reinforcement learning (RL) framework for table reasoning. Formula Tuning trains LLMs to generate executable spreadsheet formulas for question answering over general tabular data, using execution success and answer correctness as reward signals, thereby reducing reliance on supervised formula annotations. We demonstrate the effectiveness of Formula Tuning through extensive experiments on seven table reasoning benchmarks. It substantially improves LLM performance on table reasoning, particularly for tasks involving complex tables and multi-step numerical computation. Moreover, Formula-R1 consistently outperforms prior methods under controlled comparison settings. Beyond empirical gains, our extensive analyses provide insights into the role of RL in formula-driven table reasoning, highlighting the broader potential of formula-driven RL to enhance reasoning capabilities in LLMs.
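The reward signal described above (execution success plus answer correctness, with no supervised formula annotations) can be sketched with a toy formula executor. The two supported operations, the 0.1 execution bonus, and the example table are illustrative assumptions; the paper's actual reward shaping and spreadsheet engine are not given here:

```python
import re

def execute_formula(formula: str, table: dict):
    """Toy stand-in for a spreadsheet engine: supports only =SUM(col) and
    =MAX(col) over named columns; the paper targets full spreadsheet formulas."""
    m = re.fullmatch(r"=(SUM|MAX)\((\w+)\)", formula)
    if m is None:
        raise ValueError("unsupported formula")
    op, col = m.groups()
    values = table[col]  # KeyError if the column does not exist
    return sum(values) if op == "SUM" else max(values)

def reward(formula: str, table: dict, gold) -> float:
    """Illustrative shaping: 0.1 for executing successfully plus 1.0 for a
    correct answer, so execution alone earns partial credit."""
    try:
        pred = execute_formula(formula, table)
    except (ValueError, KeyError):
        return 0.0
    return 0.1 + (1.0 if pred == gold else 0.0)

table = {"revenue": [120, 80, 200]}
print(reward("=SUM(revenue)", table, gold=400))  # correct and executable
print(reward("=MAX(revenue)", table, gold=400))  # executes, wrong answer
print(reward("=AVG(revenue)", table, gold=400))  # fails to execute
```

Separating the execution bonus from the correctness bonus is what lets RL learn from formulas that run but answer wrongly, instead of treating them the same as unparseable output.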
[676] SynPO: Synergizing Descriptiveness and Preference Optimization for Video Detailed Captioning
Jisheng Dang, Yizhou Zhang, Hao Ye, Teng Wang, Siming Chen, Huicheng Zheng, Yulan Guo, Jianhuang Lai, Bin Hu
Main category: cs.AI
TL;DR: SynPO is a novel preference optimization method for enhancing vision-language models in fine-grained video captioning, addressing DPO limitations with better training efficiency and performance.
Details
Motivation: Existing methods struggle with capturing subtle video dynamics and rich details in fine-grained video captioning. Direct preference optimization (DPO) has limitations that need to be addressed for better vision-language model performance.
Method: Proposes Synergistic Preference Optimization (SynPO) with two key components: 1) A pipeline for constructing preference pairs using VLMs with partial LLM assistance, balancing cost and quality; 2) SynPO optimization method that prevents negative preference domination, preserves language capability, and eliminates need for reference model.
Result: SynPO consistently outperforms DPO variants on video captioning benchmarks (VDC, VDD, VATEX) and NLP tasks, achieving a 20% improvement in training efficiency.
Conclusion: SynPO effectively enhances vision-language models for fine-grained video captioning while addressing DPO limitations, offering better performance and efficiency.
Abstract: Fine-grained video captioning aims to generate detailed, temporally coherent descriptions of video content. However, existing methods struggle to capture subtle video dynamics and rich detailed information. In this paper, we leverage preference learning to enhance the performance of vision-language models in fine-grained video captioning, while mitigating several limitations inherent to direct preference optimization (DPO). First, we propose a pipeline for constructing preference pairs that leverages the intrinsic properties of VLMs along with partial assistance from large language models, achieving an optimal balance between cost and data quality. Second, we propose Synergistic Preference Optimization (SynPO), a novel optimization method offering significant advantages over DPO and its variants. SynPO prevents negative preferences from dominating the optimization, explicitly preserves the model’s language capability to avoid deviation of the optimization objective, and improves training efficiency by eliminating the need for the reference model. We extensively evaluate SynPO not only on video captioning benchmarks (e.g., VDC, VDD, VATEX) but also across well-established NLP tasks, including general language understanding and preference evaluation, using diverse pretrained models. Results demonstrate that SynPO consistently outperforms DPO variants while achieving 20% improvement in training efficiency. Code is available at https://github.com/longmalongma/SynPO
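SynPO's exact objective is not reproduced in this digest, but a minimal reference-model-free, length-normalized preference loss (in the SimPO style) illustrates the general idea of optimizing a preference pair without a frozen reference policy. The `beta` and `margin` values are arbitrary, and SynPO's extra terms (guarding against negative-preference domination, preserving language ability) are omitted:

```python
import math

def ref_free_pref_loss(logp_chosen: float, len_chosen: int,
                       logp_rejected: float, len_rejected: int,
                       beta: float = 2.0, margin: float = 0.5) -> float:
    """Length-normalized preference loss with no reference model.
    Uses per-token log-probabilities so caption length does not dominate."""
    r_w = logp_chosen / len_chosen      # per-token log-prob of preferred caption
    r_l = logp_rejected / len_rejected  # per-token log-prob of rejected caption
    logit = beta * (r_w - r_l) - margin
    return -math.log(1.0 / (1.0 + math.exp(-logit)))  # -log(sigmoid(logit))

# A correctly ordered pair (preferred caption more likely under the model)
# yields a small loss; a reversed pair yields a large one.
good = ref_free_pref_loss(-20.0, 10, -60.0, 12)
bad = ref_free_pref_loss(-60.0, 12, -20.0, 10)
print(good < bad)  # True
```

Dropping the reference model is also where the claimed training-efficiency gain comes from: only one policy's log-probabilities are computed per pair.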
[677] Dynamic Reinsurance Treaty Bidding via Multi-Agent Reinforcement Learning
Stella C. Dong, James R. Finlay
Main category: cs.AI
TL;DR: MARL framework for reinsurance treaty bidding shows autonomous agents can outperform traditional pricing methods with 15% higher profits, 20% lower tail risk, and 25% better Sharpe ratios.
Details
Motivation: Address inefficiencies in traditional broker-mediated reinsurance placement processes and explore whether autonomous learning-based bidding systems can improve risk transfer efficiency and outperform conventional pricing approaches.
Method: Multi-agent reinforcement learning framework where each reinsurer is represented by an adaptive agent that refines bidding strategies in a competitive, partially observable environment with institutional frictions like broker intermediation, incumbent advantages, and asymmetric information access.
Result: MARL agents achieve up to 15% higher underwriting profit, 20% lower tail risk (CVaR), and over 25% improvement in Sharpe ratios compared to actuarial and heuristic baselines, with robustness across hyperparameters and resilience under catastrophe shocks.
Conclusion: MARL offers a viable path toward more transparent, adaptive, and risk-sensitive reinsurance markets, contributing to algorithmic market design, strategic bidding, and AI-enabled financial decision-making.
Abstract: This paper develops a novel multi-agent reinforcement learning (MARL) framework for reinsurance treaty bidding, addressing long-standing inefficiencies in traditional broker-mediated placement processes. We pose the core research question: Can autonomous, learning-based bidding systems improve risk transfer efficiency and outperform conventional pricing approaches in reinsurance markets? In our model, each reinsurer is represented by an adaptive agent that iteratively refines its bidding strategy within a competitive, partially observable environment. The simulation explicitly incorporates institutional frictions including broker intermediation, incumbent advantages, last-look privileges, and asymmetric access to underwriting information. Empirical analysis demonstrates that MARL agents achieve up to 15% higher underwriting profit, 20% lower tail risk (CVaR), and over 25% improvement in Sharpe ratios relative to actuarial and heuristic baselines. Sensitivity tests confirm robustness across hyperparameter settings, and stress testing reveals strong resilience under simulated catastrophe shocks and capital constraints. These findings suggest that MARL offers a viable path toward more transparent, adaptive, and risk-sensitive reinsurance markets. The proposed framework contributes to emerging literature at the intersection of algorithmic market design, strategic bidding, and AI-enabled financial decision-making.
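The two risk metrics quoted above, tail risk (CVaR) and the Sharpe ratio, can be computed from per-episode profit and loss series as follows. These are textbook estimators and may differ from the paper's exact implementation:

```python
import statistics

def sharpe_ratio(returns, risk_free=0.0):
    """Sample Sharpe ratio: mean excess return over sample standard deviation."""
    excess = statistics.mean(returns) - risk_free
    return excess / statistics.stdev(returns)

def cvar(losses, alpha=0.90):
    """Conditional Value-at-Risk: mean of the worst (1 - alpha) fraction of
    losses, i.e. the expected loss given that the loss is in the tail."""
    ordered = sorted(losses)
    tail = max(1, int(round(len(ordered) * (1 - alpha))))
    return statistics.mean(ordered[-tail:])

episode_profits = [1.0, 2.0, 3.0]
episode_losses = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(sharpe_ratio(episode_profits))  # 2.0
print(cvar(episode_losses))           # mean of the worst-10% tail
```

A "20% lower tail risk" claim thus means the CVaR of the MARL agents' loss distribution is 20% below that of the actuarial baselines at the same confidence level.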
[678] Automated Formalization via Conceptual Retrieval-Augmented LLMs
Wangyue Lu, Lun Du, Sirui Li, Ke Weng, Haozhe Sun, Hengyu Liu, Minghe Yu, Tiancheng Zhang, Ge Yu
Main category: cs.AI
TL;DR: CRAMF is a retrieval-augmented framework that improves LLM-based mathematical formalization by retrieving formal definitions from Mathlib4 to address hallucination and semantic gaps in autoformalization.
Details
Motivation: Manual formalization for interactive theorem provers is labor-intensive and requires expert knowledge. Automated formalization faces challenges with model hallucination (undefined predicates, symbol misuse, version incompatibility) and semantic gaps from ambiguous natural language descriptions.
Method: CRAMF constructs a concept-definition knowledge base from Mathlib4 (Lean 4 library) with 26,000+ formal definitions. It uses contextual query augmentation to handle conceptual polymorphism and employs dual-channel hybrid retrieval with reranking for accurate definition retrieval.
Result: Experiments on miniF2F, ProofNet, and AdvancedMath benchmarks show CRAMF yields consistent improvements in translation accuracy, achieving up to 62.1% and an average of 29.9% relative improvement when integrated with LLM-based autoformalizers.
Conclusion: CRAMF effectively addresses hallucination and semantic gap issues in mathematical formalization by providing contextual grounding through retrieved formal definitions, significantly improving autoformalization accuracy.
Abstract: Interactive theorem provers (ITPs) require manual formalization, which is labor-intensive and demands expert knowledge. While automated formalization offers a potential solution, it faces two major challenges: model hallucination (e.g., undefined predicates, symbol misuse, and version incompatibility) and the semantic gap caused by ambiguous or missing premises in natural language descriptions. To address these issues, we propose CRAMF, a Concept-driven Retrieval-Augmented Mathematical Formalization framework. CRAMF enhances LLM-based autoformalization by retrieving formal definitions of core mathematical concepts, providing contextual grounding during code generation. However, applying retrieval-augmented generation (RAG) in this setting is non-trivial due to the lack of structured knowledge bases, the polymorphic nature of mathematical concepts, and the high precision required in formal retrieval. We introduce a framework for automatically constructing a concept-definition knowledge base from Mathlib4, the standard mathematical library for the Lean 4 theorem prover, indexing over 26,000 formal definitions and 1,000+ core mathematical concepts. To address conceptual polymorphism, we propose contextual query augmentation with domain- and application-level signals. In addition, we design a dual-channel hybrid retrieval strategy with reranking to ensure accurate and relevant definition retrieval. Experiments on miniF2F, ProofNet, and our newly proposed AdvancedMath benchmark show that CRAMF can be seamlessly integrated into LLM-based autoformalizers, yielding consistent improvements in translation accuracy, achieving up to 62.1% and an average of 29.9% relative improvement.
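The dual-channel hybrid retrieval with reranking can be sketched as a fusion of a lexical score and a dense score over candidate definitions. The character-trigram "dense" channel is a toy stand-in for embedding similarity, the Lean-style definition strings are illustrative, and the fusion weight `alpha` is arbitrary:

```python
def lexical_score(query: str, doc: str) -> float:
    """Channel 1: token-overlap (Jaccard), a stand-in for BM25-style retrieval."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d)

def dense_score(query: str, doc: str) -> float:
    """Channel 2: character-trigram overlap, a toy stand-in for embedding cosine."""
    def trigrams(s: str) -> set:
        s = s.lower()
        return {s[i:i + 3] for i in range(len(s) - 2)}
    q, d = trigrams(query), trigrams(doc)
    return len(q & d) / max(1, len(q | d))

def hybrid_retrieve(query: str, definitions: list, alpha: float = 0.5,
                    top_k: int = 2) -> list:
    """Fuse both channels, then rerank candidates by the combined score."""
    scored = [(alpha * lexical_score(query, d) + (1 - alpha) * dense_score(query, d), d)
              for d in definitions]
    scored.sort(key=lambda pair: -pair[0])
    return [d for _, d in scored[:top_k]]

definitions = [
    "def Continuous (f : X -> Y) : Prop := ...",
    "def UniformContinuous (f : X -> Y) : Prop := ...",
    "def Monoid (M : Type) := ...",
]
print(hybrid_retrieve("continuous function", definitions))
```

Combining channels covers both failure modes: exact-token matching misses morphological variants ("UniformContinuous"), while similarity alone can surface loosely related definitions; reranking the fused scores keeps only the relevant ones.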
[679] BuilderBench: The Building Blocks of Intelligent Agents
Raj Ghugare, Roger Creus Castanyer, Catherine Ji, Kathryn Wantlin, Jin Schofield, Karthik Narasimhan, Benjamin Eysenbach
Main category: cs.AI
TL;DR: BuilderBench is a benchmark for agent pre-training focused on open-ended exploration where agents must learn to build diverse structures using blocks in a simulated environment, requiring embodied reasoning and physical understanding.
Details
Motivation: Current AI models struggle with novel problems beyond existing data limits; agents need exploration and experiential learning skills. The goal is to develop scalable learning mechanisms for agents that learn through interaction, addressing the open problem of creating agents that can learn through open-ended exploration.
Method: Introduces BuilderBench benchmark with: (1) hardware-accelerated simulator of robotic agent interacting with physical blocks, (2) task-suite with 42+ diverse target structures testing physics, mathematics, and long-horizon planning understanding. Agents explore and learn without external supervision during training, then build unseen target structures during evaluation.
Result: Many tasks challenge current algorithms, prompting the creation of a “training wheels” protocol for single target structure training/evaluation. The benchmark includes single-file implementations of six different algorithms as reference points.
Conclusion: BuilderBench accelerates research into agent pre-training centered on open-ended exploration, requiring embodied reasoning expressed through actions rather than words, with potential to advance agents that learn through interaction.
Abstract: Today’s AI models learn primarily through mimicry and refining, so it is not surprising that they struggle to solve problems beyond the limits set by existing data. To solve novel problems, agents should acquire skills for exploring and learning through experience. Finding a scalable learning mechanism for developing agents that learn through interaction remains a major open problem. In this work, we introduce BuilderBench, a benchmark to accelerate research into agent pre-training that centers open-ended exploration. BuilderBench requires agents to learn how to build any structure using blocks. BuilderBench is equipped with (1) a hardware accelerated simulator of a robotic agent interacting with various physical blocks, and (2) a task-suite with over 42 diverse target structures that are carefully curated to test an understanding of physics, mathematics, and long-horizon planning. During training, agents have to explore and learn general principles about the environment without any external supervision. During evaluation, agents have to build the unseen target structures from the task suite. Solving these tasks requires a sort of embodied reasoning that is not reflected in words but rather in actions, experimenting with different strategies and piecing them together. Our experiments show that many of these tasks challenge the current iteration of algorithms. Hence, we also provide a "training wheels" protocol, in which agents are trained and evaluated to build a single target structure from the task suite. Finally, we provide single-file implementations of six different algorithms as a reference point for researchers.
[680] Towards Unified World Models for Visual Navigation via Memory-Augmented Planning and Foresight
Yifei Dong, Fengyi Wu, Guangyu Chen, Lingdong Kong, Xu Zhu, Qiyu Hu, Yuxuan Zhou, Jingdong Sun, Jun-Yan He, Qi Dai, Alexander G. Hauptmann, Zhi-Qi Cheng
Main category: cs.AI
TL;DR: UniWM is a unified world model that integrates visual foresight and planning for embodied navigation, using multimodal autoregressive modeling with hierarchical memory to align action selection with imagined visual outcomes.
Details
Motivation: Current embodied navigation systems use modular designs that decouple planning from visual world modeling, causing state-action misalignment and poor adaptability in novel/dynamic scenarios. There's a need for unified models that tightly align prediction with control.
Method: UniWM uses a unified, memory-augmented world model with a multimodal autoregressive backbone. It integrates egocentric visual foresight and planning, grounding action selection in visually imagined outcomes. A hierarchical memory mechanism fuses short-term perceptual cues with longer-term trajectory context for stable reasoning over extended horizons.
Result: Improves navigation success rates by up to 30% on four benchmarks (Go Stanford, ReCon, SCAND, HuRoN) and 1X Humanoid Dataset, substantially reduces trajectory errors, generalizes zero-shot to unseen TartanDrive dataset, and scales to high-dimensional humanoid control.
Conclusion: UniWM represents a principled step toward unified, imagination-driven embodied navigation by tightly aligning visual prediction with control through integrated world modeling.
Abstract: Enabling embodied agents to imagine future states is essential for robust and generalizable visual navigation. Yet, state-of-the-art systems typically rely on modular designs that decouple navigation planning from visual world modeling, which often induces state-action misalignment and weak adaptability in novel or dynamic scenarios. We propose UniWM, a unified, memory-augmented world model that integrates egocentric visual foresight and planning within a single multimodal autoregressive backbone. UniWM explicitly grounds action selection in visually imagined outcomes, tightly aligning prediction with control. Meanwhile, a hierarchical memory mechanism fuses short-term perceptual cues with longer-term trajectory context, supporting stable and coherent reasoning over extended horizons. Extensive experiments on four challenging benchmarks (Go Stanford, ReCon, SCAND, HuRoN) and the 1X Humanoid Dataset show that UniWM improves navigation success rates by up to 30%, substantially reduces trajectory errors against strong baselines, generalizes zero-shot to the unseen TartanDrive dataset, and scales naturally to high-dimensional humanoid control. These results position UniWM as a principled step toward unified, imagination-driven embodied navigation. The code and models are available at https://github.com/F1y1113/UniWM.
[681] DeepCompress: A Dual Reward Strategy for Dynamically Exploring and Compressing Reasoning Chains
Tian Liang, Wenxiang Jiao, Zhiwei He, Jiahao Xu, Haitao Mi, Dong Yu
Main category: cs.AI
TL;DR: DeepCompress is a framework that improves both accuracy and efficiency of Large Reasoning Models by adaptively adjusting reasoning length based on problem difficulty, encouraging shorter paths for simple problems and longer exploration for hard ones.
Details
Motivation: Current methods for improving reasoning efficiency in Large Reasoning Models often sacrifice accuracy. Models tend to "overthink" simple problems and "underthink" complex ones, and existing approaches using SFT or RL with token-length rewards typically trade accuracy for efficiency.
Method: DeepCompress uses an adaptive length reward mechanism that dynamically classifies problems as “Simple” or “Hard” in real-time based on the model’s evolving capability. It employs a dual-reward strategy: encouraging shorter reasoning for simple problems and promoting longer, more exploratory thought chains for hard problems, allowing the model to autonomously adjust its Chain-of-Thought length.
Result: Experimental results on challenging mathematical benchmarks show that DeepCompress consistently outperforms baseline methods, achieving superior accuracy while significantly improving token efficiency compared to existing approaches.
Conclusion: The paper demonstrates that adaptive reasoning length adjustment based on problem difficulty can simultaneously improve both accuracy and efficiency in Large Reasoning Models, challenging the prevailing approach of consistently favoring shorter reasoning paths.
Abstract: Large Reasoning Models (LRMs) have demonstrated impressive capabilities but suffer from cognitive inefficiencies like “overthinking” simple problems and “underthinking” complex ones. While existing methods that use supervised fine-tuning (SFT) or reinforcement learning (RL) with token-length rewards can improve efficiency, they often do so at the cost of accuracy. This paper introduces DeepCompress, a novel framework that simultaneously enhances both the accuracy and efficiency of LRMs. We challenge the prevailing approach of consistently favoring shorter reasoning paths, showing that longer responses can contain a broader range of correct solutions for difficult problems. DeepCompress employs an adaptive length reward mechanism that dynamically classifies problems as “Simple” or “Hard” in real-time based on the model’s evolving capability. It encourages shorter, more efficient reasoning for “Simple” problems while promoting longer, more exploratory thought chains for “Hard” problems. This dual-reward strategy enables the model to autonomously adjust its Chain-of-Thought (CoT) length, compressing reasoning for well-mastered problems and extending it for those it finds challenging. Experimental results on challenging mathematical benchmarks show that DeepCompress consistently outperforms baseline methods, achieving superior accuracy while significantly improving token efficiency.
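The dual-reward idea, classifying a problem as "Simple" or "Hard" from the model's current pass rate and then rewarding brevity or exploration accordingly, can be sketched as below. The 0.5 pass-rate threshold, length targets, and scale factor are illustrative, not the paper's values:

```python
def adaptive_reward(correct: bool, length: int, pass_rate: float,
                    target_short: int = 200, long_cap: int = 800,
                    scale: float = 0.001) -> float:
    """Dual-reward sketch: correctness dominates, and a difficulty-dependent
    length term rewards brevity on 'Simple' problems (high current pass rate)
    but longer, capped exploration on 'Hard' ones (low pass rate)."""
    if not correct:
        return 0.0
    if pass_rate >= 0.5:  # classified "Simple": shorter is better
        return 1.0 + scale * max(0, target_short - length)
    return 1.0 + scale * min(length, long_cap)  # "Hard": longer (up to cap) is better

# Brevity pays on easy problems; exploration pays on hard ones.
r_easy_short = adaptive_reward(True, 100, pass_rate=0.9)
r_easy_long = adaptive_reward(True, 500, pass_rate=0.9)
r_hard_long = adaptive_reward(True, 700, pass_rate=0.2)
r_hard_short = adaptive_reward(True, 100, pass_rate=0.2)
print(r_easy_short > r_easy_long, r_hard_long > r_hard_short)  # True True
```

Because `pass_rate` is re-estimated as training progresses, the same problem can migrate from "Hard" to "Simple", which is what lets the model compress reasoning on material it has mastered.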
[682] From Five Dimensions to Many: Large Language Models as Precise and Interpretable Psychological Profilers
Yi-Fei Liu, Yi-Long Lu, Di He, Hang Zhang
Main category: cs.AI
TL;DR: LLMs can accurately model human psychological trait correlations from minimal personality data through a two-stage abstraction process, achieving near-human-level performance without training on the target scales.
Details
Motivation: To investigate whether LLMs can capture the complex correlational structure of human psychological traits from minimal quantitative inputs, and understand the underlying reasoning processes they employ.
Method: Prompted various LLMs with Big Five Personality Scale responses from 816 individuals to role-play responses on nine other psychological scales. Analyzed reasoning traces to understand the two-stage process: 1) transforming raw scores into natural language personality summaries through information selection/compression, and 2) generating target scale responses based on reasoning from these summaries.
Result: LLMs demonstrated remarkable accuracy (R² > 0.89) in capturing human psychological structure, exceeding semantic similarity predictions and approaching trained ML algorithm accuracy. LLMs identified the same key personality factors as trained algorithms but failed to differentiate item importance within factors. Compressed summaries captured synergistic information that enhanced prediction alignment when added to original scores.
Conclusion: LLMs can precisely predict individual psychological traits through abstraction and reasoning, offering both a powerful tool for psychological simulation and insights into their emergent reasoning capabilities.
Abstract: Psychological constructs within individuals are widely believed to be interconnected. We investigated whether and how Large Language Models (LLMs) can model the correlational structure of human psychological traits from minimal quantitative inputs. We prompted various LLMs with Big Five Personality Scale responses from 816 human individuals to role-play their responses on nine other psychological scales. LLMs demonstrated remarkable accuracy in capturing human psychological structure, with the inter-scale correlation patterns from LLM-generated responses strongly aligning with those from human data $(R^2 > 0.89)$. This zero-shot performance substantially exceeded predictions based on semantic similarity and approached the accuracy of machine learning algorithms trained directly on the dataset. Analysis of reasoning traces revealed that LLMs use a systematic two-stage process: First, they transform raw Big Five responses into natural language personality summaries through information selection and compression, analogous to generating sufficient statistics. Second, they generate target scale responses based on reasoning from these summaries. For information selection, LLMs identify the same key personality factors as trained algorithms, though they fail to differentiate item importance within factors. The resulting compressed summaries are not merely redundant representations but capture synergistic information–adding them to original scores enhances prediction alignment, suggesting they encode emergent, second-order patterns of trait interplay. Our findings demonstrate that LLMs can precisely predict individual participants’ psychological traits from minimal data through a process of abstraction and reasoning, offering both a powerful tool for psychological simulation and valuable insights into their emergent reasoning capabilities.
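The alignment metric quoted above ($R^2 > 0.89$ between inter-scale correlation patterns) admits a simple reading: squared Pearson correlation between the human-derived and LLM-derived correlation entries. A sketch under that assumption, with made-up correlation values standing in for the real off-diagonal entries:

```python
import statistics

def pearson(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def structure_r2(human_corrs, model_corrs):
    """Squared Pearson correlation between human and LLM-derived inter-scale
    correlation entries (one plausible reading of the paper's metric)."""
    return pearson(human_corrs, model_corrs) ** 2

# Made-up off-diagonal correlations between personality scales, as observed
# in humans vs. as implied by LLM role-played responses.
human = [0.31, -0.22, 0.47, 0.12, -0.05, 0.58]
model = [0.28, -0.25, 0.52, 0.08, -0.02, 0.61]
print(round(structure_r2(human, model), 3))
```

Note this measures alignment of the *pattern* of trait correlations, not per-individual prediction accuracy; the paper reports the former for the inter-scale structure.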
[683] Spilling the Beans: Teaching LLMs to Self-Report Their Hidden Objectives
Chloe Li, Mary Phuong, Daniel Tan
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2511.06626 returned HTTP 429 (rate limited), so no abstract, motivation, method, results, or conclusion could be retrieved.
[684] RadHiera: Semantic Hierarchical Reinforcement Learning for Medical Report Generation
Bodong Du, Honglong Yang, Xiaomeng Li
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2511.10065 returned HTTP 429 (rate limited), so no abstract, motivation, method, results, or conclusion could be retrieved.
[685] Think, Speak, Decide: Language-Augmented Multi-Agent Reinforcement Learning for Economic Decision-Making
Heyang Ma, Qirui Mi, Qipeng Yang, Zijun Fan, Bo Li, Haifeng Zhang
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2511.12876 returned HTTP 429 (rate limited), so no abstract, motivation, method, results, or conclusion could be retrieved.
[686] Stable diffusion models reveal a persisting human and AI gap in visual creativity
Silvia Rondini, Claudia Alvarez-Martin, Paula Angermair-Barkai, Olivier Penacchio, M. Paz, Matthew Pelowski, Dan Dediu, Antoni Rodriguez-Fornells, Xim Cerda-Company
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2511.16814 returned HTTP 429 (rate limited), so no abstract, motivation, method, results, or conclusion could be retrieved.
[687] Massive Editing for Large Language Models Based on Dynamic Weight Generation
Wentao Wan, Qiqing Lao, Zhiwei Xie, Hefeng Wu, Runnan Lin, Liang Lin, Keze Wang
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2512.14395 returned HTTP 429 (rate limited), so no abstract, motivation, method, results, or conclusion could be retrieved.
[688] Social Comparison without Explicit Inference of Others’ Reward Values: A Constructive Approach Using a Probabilistic Generative Model
Yosuke Taniuchi, Chie Hieida, Atsushi Noritake, Kazushi Ikeda, Masaki Isoda
Main category: cs.AI
TL;DR: Unable to analyze paper 2512.18687 due to HTTP 429 error when fetching abstract from arXiv API
Details
Motivation / Method / Result / Conclusion: Not available; the abstract could not be retrieved (HTTP 429 rate limiting).
Abstract: Failed to fetch summary for 2512.18687: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.18687&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[689] The Illusion of AI Expertise Under Uncertainty: Navigating Elusive Ground Truth via a Probabilistic Paradigm
Aparna Elangovan, Lei Xu, Mahsa Elyasi, Ismail Akdulum, Mehmet Aksakal, Enes Gurun, Brian Hur, Saab Mansour, Ravid Shwartz Ziv, Karin Verspoor, Dan Roth
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation / Method / Result / Conclusion: Not available; the abstract could not be retrieved (HTTP 429 rate limiting).
Abstract: Failed to fetch summary for 2601.05500: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.05500&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[690] Are Your Reasoning Models Reasoning or Guessing? A Mechanistic Analysis of Hierarchical Reasoning Models
Zirui Ren, Ziming Liu
Main category: cs.AI
TL;DR: The summary for 2601.10679 could not be fetched: the arXiv API returned HTTP 429 (rate limiting).
Details
Motivation / Method / Result / Conclusion: Not available; the abstract could not be retrieved (HTTP 429 rate limiting).
Abstract: Failed to fetch summary for 2601.10679: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.10679&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[691] EvoOpt-LLM: Evolving industrial optimization models with large language models
Yiliu He, Tianle Li, Binghao Ji, Zhiyuan Liu, Di Huang
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation / Method / Result / Conclusion: Not available; the abstract could not be retrieved (HTTP 429 rate limiting).
Abstract: Failed to fetch summary for 2602.01082: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.01082&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[692] RE-MCDF: Closed-Loop Multi-Expert LLM Reasoning for Knowledge-Grounded Clinical Diagnosis
Shaowei Shen, Xiaohong Yang, Jie Yang, Lianfen Huang, Yongcai Zhang, Yang Zou, Seyyedali Hosseinalipour
Main category: cs.AI
TL;DR: Unable to analyze paper 2602.01297 due to HTTP 429 error when fetching abstract from arXiv API
Details
Motivation / Method / Result / Conclusion: Not available; the abstract could not be retrieved (HTTP 429 rate limiting).
Abstract: Failed to fetch summary for 2602.01297: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.01297&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[693] Architecting Trust in Artificial Epistemic Agents
Nahema Marchal, Stephanie Chan, Matija Franklin, Manon Revel, Geoff Keeling, Roberta Fischli, Bilva Chandra, Iason Gabriel
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting) when querying arXiv API for paper ID 2603.02960
Details
Motivation / Method / Result / Conclusion: Not available; the abstract could not be retrieved (HTTP 429 rate limiting).
Abstract: Failed to fetch summary for 2603.02960: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.02960&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[694] S5-SHB Agent: Society 5.0 enabled Multi-model Agentic Blockchain Framework for Smart Home
Janani Rangila, Akila Siriweera, Incheon Paik, Keitaro Naruse, Isuru Jayanada, Vishmika Devindi
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation / Method / Result / Conclusion: Not available; the abstract could not be retrieved (HTTP 429 rate limiting).
Abstract: Failed to fetch summary for 2603.05027: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.05027&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[695] Deconstructing Multimodal Mathematical Reasoning: Towards a Unified Perception-Alignment-Reasoning Paradigm
Tianyu Yang, Sihong Wu, Yilun Zhao, Zhenwen Liang, Lisen Dai, Chen Zhao, Minhao Cheng, Arman Cohan, Xiangliang Zhang
Main category: cs.AI
TL;DR: Unable to fetch paper details for 2603.08291: the arXiv API returned HTTP 429 (rate limiting), so no abstract is available for analysis.
Details
Motivation / Method / Result / Conclusion: Not available; the abstract could not be retrieved (HTTP 429 rate limiting).
Abstract: Failed to fetch summary for 2603.08291: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.08291&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[696] A Hierarchical Error-Corrective Graph Framework for Autonomous Agents with LLM-Based Action Generation
Cong Cao, Jingyao Zhang, Kun Tong
Main category: cs.AI
TL;DR: Failed to fetch the summary for 2603.08388: the arXiv API returned HTTP 429 (rate limiting).
Details
Motivation / Method / Result / Conclusion: Not available; the abstract could not be retrieved (HTTP 429 rate limiting).
Abstract: Failed to fetch summary for 2603.08388: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.08388&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[697] The FABRIC Strategy for Verifying Neural Feedback Systems
Samuel I. Akinwande, Sydney M. Katz, Mykel J. Kochenderfer, Clark Barrett
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation / Method / Result / Conclusion: Not available; the abstract could not be retrieved (HTTP 429 rate limiting).
Abstract: Failed to fetch summary for 2603.08964: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.08964&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[698] Curveball Steering: The Right Direction To Steer Isn’t Always Linear
Shivam Raval, Hae Jin Song, Linlin Wu, Abir Harrasse, Jeff M. Phillips, Fazl Barez, Amirali Abdullah
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting).
Details
Motivation / Method / Result / Conclusion: Not available; the abstract could not be retrieved (HTTP 429 rate limiting).
Abstract: Failed to fetch summary for 2603.09313: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.09313&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[699] Detecting Intrinsic and Instrumental Self-Preservation in Autonomous Agents: The Unified Continuation-Interest Protocol
Christopher Altman
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation / Method / Result / Conclusion: Not available; the abstract could not be retrieved (HTTP 429 rate limiting).
Abstract: Failed to fetch summary for 2603.11382: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.11382&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[700] LLMs can construct powerful representations and streamline sample-efficient supervised learning
Ilker Demirel, Lawrence Shi, Zeshan Hussain, David Sontag
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation / Method / Result / Conclusion: Not available; the abstract could not be retrieved (HTTP 429 rate limiting).
Abstract: Failed to fetch summary for 2603.11679: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.11679&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[701] When OpenClaw Meets Hospital: Toward an Agentic Operating System for Dynamic Clinical Workflows
Wenxian Yang, Hanzheng Qiu, Bangqun Zhang, Chengquan Li, Zhiyong Huang, Xiaobin Feng, Rongshan Yu, Jiahong Dong
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation / Method / Result / Conclusion: Not available; the abstract could not be retrieved (HTTP 429 rate limiting).
Abstract: Failed to fetch summary for 2603.11721: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.11721&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[702] Benchmarking Zero-Shot Reasoning Approaches for Error Detection in Solidity Smart Contracts
Eduardo Sardenberg, Antonio José Grandson Busson, Daniel de Sousa Moraes, Julio Cesar Duarte, Sérgio Colcher
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation / Method / Result / Conclusion: Not available; the abstract could not be retrieved (HTTP 429 rate limiting).
Abstract: Failed to fetch summary for 2603.13239: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.13239&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[703] Policy Optimization over General State and Action Spaces
Caleb Ju, Guanghui Lan
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation / Method / Result / Conclusion: Not available; the abstract could not be retrieved (HTTP 429 rate limiting).
Abstract: Failed to fetch summary for 2211.16715: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2211.16715&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[704] Computational Concept of the Psyche
Anton Kolonin, Vladimir Krykov
Main category: cs.AI
TL;DR: Failed to fetch summary for arXiv ID 2603.15586 due to HTTP 429 rate limiting error
Details
Motivation / Method / Result / Conclusion: Not available; the abstract could not be retrieved (HTTP 429 rate limiting).
Abstract: Failed to fetch summary for 2603.15586: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.15586&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[705] Better Generative Replay for Continual Federated Learning
Daiqing Qi, Handong Zhao, Sheng Li
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation / Method / Result / Conclusion: Not available; the abstract could not be retrieved (HTTP 429 rate limiting).
Abstract: Failed to fetch summary for 2302.13001: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2302.13001&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[706] Algorithmic Trading Strategy Development and Optimisation
Owen Nyo Wei Yuan, Victor Tan Jia Xuan, Ong Jun Yao Fabian, Ryan Tan Jun Wei
Main category: cs.AI
TL;DR: Paper 2603.15848: Unable to fetch abstract due to HTTP 429 error (rate limiting)
Details
Motivation / Method / Result / Conclusion: Not available; the abstract could not be retrieved (HTTP 429 rate limiting).
Abstract: Failed to fetch summary for 2603.15848: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.15848&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[707] Secure Linear Alignment of Large Language Models
Matt Gorbett, Suman Jana
Main category: cs.AI
TL;DR: Unable to fetch paper details for 2603.18908: the arXiv API returned HTTP 429 (rate limiting), so no content is available for analysis.
Details
Motivation / Method / Result / Conclusion: Not available; the abstract could not be retrieved (HTTP 429 rate limiting).
Abstract: Failed to fetch summary for 2603.18908: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.18908&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[708] FormalEvolve: Neuro-Symbolic Evolutionary Search for Diverse and Prover-Effective Autoformalization
Haijian Lu, Wei Wang, Jing Liu
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation / Method / Result / Conclusion: Not available; the abstract could not be retrieved (HTTP 429 rate limiting).
Abstract: Failed to fetch summary for 2603.19828: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.19828&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[709] Human strategic decision making in parametrized games
Sam Ganzfried
Main category: cs.AI
TL;DR: Unable to analyze paper 2104.14744 due to HTTP 429 error (rate limiting) when fetching from arXiv API
Details
Motivation / Method / Result / Conclusion: Not available; the abstract could not be retrieved (HTTP 429 rate limiting).
Abstract: Failed to fetch summary for 2104.14744: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2104.14744&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[710] Multi-Step First: A Lightweight Deep Reinforcement Learning Strategy for Robust Continuous Control with Partial Observability
Lingheng Meng, Rob Gorbet, Michael Burke, Dana Kulić
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation / Method / Result / Conclusion: Not available; the abstract could not be retrieved (HTTP 429 rate limiting).
Abstract: Failed to fetch summary for 2209.04999: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2209.04999&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[711] Towards Urban General Intelligence: A Review and Outlook of Urban Foundation Models
Weijia Zhang, Jindong Han, Zhao Xu, Hang Ni, Tengfei Lyu, Hao Liu, Hui Xiong
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation / Method / Result / Conclusion: Not available; the abstract could not be retrieved (HTTP 429 rate limiting).
Abstract: Failed to fetch summary for 2402.01749: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2402.01749&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[712] Batch Entanglement Detection in Parameterized Qubit States using Classical Bandit Algorithms
K. Bharati, Vikesh Siddhu, Krishna Jagannathan
Main category: cs.AI
TL;DR: Paper ID 2406.19738 could not be fetched due to HTTP 429 error (rate limiting), so analysis cannot be performed
Details
Motivation / Method / Result / Conclusion: Not available; the abstract could not be retrieved (HTTP 429 rate limiting).
Abstract: Failed to fetch summary for 2406.19738: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2406.19738&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[713] Cognitive Spillover in Human-AI Teams
Christoph Riedl, Saiph Savage, Josie Zvelebilova
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation / Method / Result / Conclusion: Not available; the abstract could not be retrieved (HTTP 429 rate limiting).
Abstract: Failed to fetch summary for 2407.17489: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2407.17489&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[714] Strongly-polynomial time and validation analysis of policy gradient methods
Caleb Ju, Guanghui Lan
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting).
Details
Motivation / Method / Result / Conclusion: Not available; the abstract could not be retrieved (HTTP 429 rate limiting).
Abstract: Failed to fetch summary for 2409.19437: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2409.19437&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[715] Meta-Transfer Learning Powered Temporal Graph Networks for Cross-City Real Estate Appraisal
Weijia Zhang, Jindong Han, Hao Liu, Wei Fan, Hao Wang, Hui Xiong
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation / Method / Result / Conclusion: Not available; the abstract could not be retrieved (HTTP 429 rate limiting).
Abstract: Failed to fetch summary for 2410.08947: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2410.08947&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[716] A Hybrid Framework for Reinsurance Optimization: Integrating Generative Models and Reinforcement Learning
Stella C. Dong
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation / Method / Result / Conclusion: Not available; the abstract could not be retrieved (HTTP 429 rate limiting).
Abstract: Failed to fetch summary for 2501.06404: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2501.06404&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[717] Adaptive Insurance Reserving with CVaR-Constrained Reinforcement Learning under Macroeconomic Regimes
Stella C. Dong
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation / Method / Result / Conclusion: Not available; the abstract could not be retrieved (HTTP 429 rate limiting).
Abstract: Failed to fetch summary for 2504.09396: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2504.09396&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[718] Scalable Multi-Task Learning through Spiking Neural Networks with Adaptive Task-Switching Policy for Intelligent Autonomous Agents
Rachmad Vidya Wicaksana Putra, Avaneesh Devkota, Muhammad Shafique
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). The paper ID 2504.13541 suggests it’s from April 2025, but no content is available for analysis.
Details
Motivation / Method / Result / Conclusion: Not available; the abstract could not be retrieved (HTTP 429 rate limiting).
Abstract: Failed to fetch summary for 2504.13541: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2504.13541&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[719] DIP: Efficient Large Multimodal Model Training with Dynamic Interleaved Pipeline
Zhenliang Xue, Hanpeng Hu, Xing Chen, Yimin Jiang, Yixin Song, Zeyu Mi, Yibo Zhu, Daxin Jiang, Yubin Xia, Haibo Chen
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation / Method / Result / Conclusion: Not available; the abstract could not be retrieved (HTTP 429 rate limiting).
Abstract: Failed to fetch summary for 2504.14145: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2504.14145&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[720] AlphaZero-Edu: Democratizing Access to AlphaZero
Ruitong Li, Aisheng Mo, Guowei Su, Ru Zhang, Binjie Guo, Haohan Jiang, Xurong Lin, Hongyan Wei, Jie Li, Zhiyuan Qian, Zhuhao Zhang, Xiaoyuan Cheng
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation / Method / Result / Conclusion: Not available; the abstract could not be retrieved (HTTP 429 rate limiting).
Abstract: Failed to fetch summary for 2504.14636: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2504.14636&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[721] Uniform Loss vs. Specialized Optimization: A Comparative Analysis in Multi-Task Learning
Gabriel S. Gama, Valdir Grassi Jr
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation / Method / Result / Conclusion: Not available; the abstract could not be retrieved (HTTP 429 rate limiting).
Abstract: Failed to fetch summary for 2505.10347: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.10347&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[722] RoboFAC: A Comprehensive Framework for Robotic Failure Analysis and Correction
Zewei Ye, Weifeng Lu, Minghao Ye, Tao Lin, Shuo Yang, Junchi Yan, Bo Zhao
Main category: cs.AI
TL;DR: Paper 2505.12224: Unable to fetch abstract due to HTTP 429 error (rate limiting).
Details
Motivation / Method / Result / Conclusion: Not available; the abstract could not be retrieved (HTTP 429 rate limiting).
Abstract: Failed to fetch summary for 2505.12224: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.12224&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[723] FRIREN: Beyond Trajectories – A Spectral Lens on Time
Qilin Wang
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation / Method / Result / Conclusion: Not available; the abstract could not be retrieved (HTTP 429 rate limiting).
Abstract: Failed to fetch summary for 2505.17370: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.17370&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[724] Architectural Backdoors for Within-Batch Data Stealing and Model Inference Manipulation
Nicolas Küchler, Ivan Petrov, Conrad Grobler, Ilia Shumailov
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation / Method / Result / Conclusion: Not available; the abstract could not be retrieved (HTTP 429 rate limiting).
Abstract: Failed to fetch summary for 2505.18323: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.18323&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[725] Stochastically Dominant Peer Prediction
Yichi Zhang, Shengwei Xu, David Pennock, Grant Schoenebeck
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation / Method / Result / Conclusion: Not available; the abstract could not be retrieved (HTTP 429 rate limiting).
Abstract: Failed to fetch summary for 2506.02259: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.02259&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[726] Generalized Incremental Learning under Concept Drift across Evolving Data Streams
En Yu, Jie Lu, Guangquan Zhang
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation / Method / Result / Conclusion: Not available; the abstract could not be retrieved (HTTP 429 rate limiting).
Abstract: Failed to fetch summary for 2506.05736: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.05736&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[727] The Impact of LLM-Assistants on Software Developer Productivity: A Systematic Review and Mapping Study
Amr Mohamed, Maram Assi, Mariam Guizani
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation / Method / Result / Conclusion: Not available; the abstract could not be retrieved (HTTP 429 rate limiting).
Abstract: Failed to fetch summary for 2507.03156: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.03156&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[728] Characterizing State Space Model and Hybrid Language Model Performance with Long Context
Saptarshi Mitra, Rachid Karami, Haocheng Xu, Sitao Huang, Hyoukjun Kwon
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation / Method / Result / Conclusion: Not available; the abstract could not be retrieved (HTTP 429 rate limiting).
Abstract: Failed to fetch summary for 2507.12442: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.12442&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[729] Graph Structure Learning with Privacy Guarantees for Open Graph Data
Muhao Guo, Jiaqi Wu, Yang Weng, Yizheng Liao, Shengzhe Chen
Main category: cs.AI
TL;DR: Unable to fetch paper details for 2507.19116: the arXiv API returned HTTP 429 (rate limiting).
Details
Motivation / Method / Result / Conclusion: Not available; the abstract could not be retrieved (HTTP 429 rate limiting).
Abstract: Failed to fetch summary for 2507.19116: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.19116&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[730] Latent Policy Steering with Embodiment-Agnostic Pretrained World Models
Yiqi Wang, Mrinal Verghese, Jeff Schneider
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2507.13340.
[731] On Arbitrary Predictions from Equally Valid Models
Sarah Lockfisch, Kristian Schwethelm, Martin Menten, Rickmer Braren, Daniel Rueckert, Alexander Ziller, Georgios Kaissis
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2507.19408.
[732] Packet-Level DDoS Data Augmentation Using Dual-Stream Temporal-Field Diffusion
Gongli Xi, Ye Tian, Yannan Hu, Yuchao Zhang, Yapeng Niu, Xiangyang Gong
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2507.20115.
[733] From Knowledge to Conjectures: A Modal Framework for Reasoning about Hypotheses
Fabio Vitali
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2508.07304.
[734] Can synthetic data reproduce real-world findings in epidemiology? A replication study using adversarial random forests
Jan Kapar, Kathrin Günther, Lori Ann Vallis, Klaus Berger, Nadine Binder, Hermann Brenner, Stefanie Castell, Beate Fischer, Volker Harth, Bernd Holleczek, Timm Intemann, Till Ittermann, André Karch, Thomas Keil, Lilian Krist, Berit Lange, Michael F. Leitzmann, Katharina Nimptsch, Nadia Obi, Iris Pigeot, Tobias Pischon, Tamara Schikowski, Börge Schmidt, Carsten Oliver Schmidt, Anja M. Sedlmair, Justine Tanoey, Harm Wienbergen, Andreas Wienke, Claudia Wigmann, Marvin N. Wright
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2508.14936.
[735] Randomness and signal propagation in physics-informed neural networks (PINNs): A neural PDE perspective
Jean-Michel Tucny, Abhisek Ganguly, Santosh Ansumali, Sauro Succi
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2509.18131.
[736] Towards A Transferable Acceleration Method for Density Functional Theory
Zhe Liu, Yuyan Ni, Zhichen Pu, Qiming Sun, Siyuan Liu, Wen Yan
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2509.25724.
[737] CurES: From Gradient Analysis to Efficient Curriculum Learning for Reasoning LLMs
Yongcheng Zeng, Zexu Sun, Bokai Ji, Erxue Min, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, Haifeng Zhang, Xu Chen, Jun Wang
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2510.01037.
[738] Auditing Pay-Per-Token in Large Language Models
Ander Artola Velasco, Stratis Tsirtsis, Manuel Gomez-Rodriguez
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2510.05181.
[739] Towards a Practical Understanding of Lagrangian Methods in Safe Reinforcement Learning
Lindsay Spoor, Álvaro Serra-Gómez, Aske Plaat, Thomas Moerland
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2510.17564.
[740] Advantage Shaping as Surrogate Reward Maximization: Unifying Pass@K Policy Gradients
Christos Thrampoulidis, Sadegh Mahdavi, Wenlong Deng
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2510.23049.
[741] COFAP: A Universal Framework for COFs Adsorption Prediction through Designed Multi-Modal Extraction and Cross-Modal Synergy
Zihan Li, Mingyang Wan, Mingyu Gao, Xishi Tai, Zhongshan Chen, Xiangke Wang, Feifan Zhang
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2511.01946.
[742] Towards Real-World Validity in Generative AI Benchmarks: Understanding and Designing Domain-Centered Evaluations for Journalism Practitioners
Charlotte Li, Nick Hagar, Sachita Nishal, Jeremy Gilbert, Nick Diakopoulos
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2511.05501.
[743] LUT-LLM: Efficient Large Language Model Inference with Memory-based Computations on FPGAs
Zifan He, Shengyu Ye, Rui Ma, Yang Wang, Jason Cong
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2511.06174.
[744] Conformal Constrained Policy Optimization for Cost-Effective LLM Agents
Wenwen Si, Sooyong Jang, Insup Lee, Osbert Bastani
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2511.11828.
[745] SVBRD-LLM: Self-Verifying Behavioral Rule Discovery for Autonomous Vehicle Identification
Xiangyu Li, Tianyi Wang, Junfeng Jiao, Christian Claudel, Zhaomiao Guo
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2511.14977.
[746] Open-weight genome language model safeguards: Assessing robustness via adversarial fine-tuning
James R. M. Black, Moritz S. Hanke, Aaron Maiwald, Tina Hernandez-Boussard, Oliver M. Crook, Jaspreet Pannu
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2511.19299.
[747] The Price of Progress: Price Performance and the Future of AI
Hans Gundlach, Jayson Lynch, Matthias Mertens, Neil Thompson
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2511.23455.
[748] Beyond Linear Surrogates: High-Fidelity Local Explanations for Black-Box Models
Sanjeev Shrestha, Rahul Dubey, Hui Liu
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2512.05556.
[749] Cell-cell Communication Inference and Analysis: Biological Mechanisms, Computational Approaches, and Future Opportunities
Xiangzheng Cheng, Haili Huang, Ye Su, Qing Nie, Xiufen Zou, Suoqin Jin
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2512.03497.
[750] Hybrid-Code v2: Zero-Hallucination Clinical ICD-10 Coding via Neuro-Symbolic Verification and Automated Knowledge Base Expansion
Yunguo Yu
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2512.23743.
[751] MSACL: Multi-Step Actor-Critic Learning with Lyapunov Certificates for Exponentially Stabilizing Control
Yongwei Zhang, Yuanzhe Xing, Quanyi Liang, Quan Quan, Zhikun She
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2512.24955.
[752] Intrinsic-Metric Physics-Informed Neural Networks (IM-PINN) for Reaction-Diffusion Dynamics on Complex Riemannian Manifolds
Julian Evan Chrisnanto, Salsabila Rahma Alia, Nurfauzi Fadillah, Yulison Herry Chrisnanto
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2601.00834.
[753] AI-enhanced tuning of quantum dot Hamiltonians toward Majorana modes
Mateusz Krawczyk, Jarosław Pawłowski
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2601.02149.
[754] Meta-Cognitive Reinforcement Learning with Self-Doubt and Recovery
Zhipeng Zhang, Xiongfei Su, Kai Li
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2601.20193.
[755] High-Fidelity Modeling of Stochastic Chemical Dynamics on Complex Manifolds: A Multi-Scale SIREN-PINN Framework for the Curvature-Perturbed Ginzburg-Landau Equation
Julian Evan Chrisnanto, Salsabila Rahma Alia, Nurfauzi Fadillah, Yulison Herry Chrisnanto
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2601.08104.
[756] Cross-talk based multi-task learning for fault classification of machine system influenced by multiple variables
Wonjun Yi, Rismaya Kumar Mishra, Yong-Hwa Park
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2602.05146.
[757] On Randomness in Agentic Evals
Bjarni Haukur Bjarnason, André Silva, Martin Monperrus
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2602.07150.
[758] Native Reasoning Models: Training Language Models to Reason on Unverifiable Data
Yuanfu Wang, Zhixuan Liu, Xiangtian Li, Chaochao Lu, Chao Yang
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2602.11549.
[759] Energy-Aware Reinforcement Learning for Robotic Manipulation of Articulated Components in Infrastructure Operation and Maintenance
Xiaowen Tao, Yinuo Wang, Haitao Ding, Yuanyang Qi, Ziyu Song
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2602.12288.
[760] LLM-Enhanced Rumor Detection via Virtual Node Induced Edge Prediction
Jiran Tao, Cheng Wang, Binyan Jiang
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2602.13279.
[761] AngelSlim: A more accessible, comprehensive, and efficient toolkit for large model compression
Rui Cen, QiangQiang Hu, Hong Huang, Hong Liu, Song Liu, Xin Luo, Lin Niu, Yifan Tan, Decheng Wu, Linchuan Xie, Rubing Yang, Guanghua Yu, Jianchen Zhu
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2602.21233.
[762] PhysMem: Self-Evolving Physical Memory for Robot Manipulation
Haoyang Li, Yang You, Hao Su, Leonidas Guibas
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2602.20323.
[763] LLMServingSim 2.0: A Unified Simulator for Heterogeneous and Disaggregated LLM Serving Infrastructure
Jaehong Cho, Hyunmin Choi, Guseul Heo, Jongse Park
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2602.23036.
[764] Failure Detection in Chemical Processes Using Symbolic Machine Learning: A Case Study on Ethylene Oxidation
Julien Amblard, Niklas Groll, Matthew Tait, Mark Law, Gürkan Sin, Alessandra Russo
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2603.06767.
[765] From Thinker to Society: Security in Hierarchical Autonomy Evolution of AI Agents
Xiaolei Zhang, Lu Zhou, Xiaogang Xu, Jiafei Wu, Tianyu Du, Heqing Huang, Hao Peng, Zhe Liu
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2603.07496.
[766] A Stable Neural Statistical Dependence Estimator for Autoencoder Feature Analysis
Bo Hu, Jose C Principe
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2603.11428.
[767] Graph In-Context Operator Networks for Generalizable Spatiotemporal Prediction
Chenghan Wu, Zongmin Yu, Boai Sun, Liu Yang
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2603.12725.
[768] PREBA: Surgical Duration Prediction via PCA-Weighted Retrieval-Augmented LLMs and Bayesian Averaging Aggregation
Wanyin Wu, Kanxue Li, Baosheng Yu, Haoyun Zhao, Yibing Zhan, Dapeng Tao, Hua Jin
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2603.13275.
[769] Not All Latent Spaces Are Flat: Hyperbolic Concept Control
Maria Rosaria Briglia, Simone Facchiano, Paolo Cursi, Alessio Sampieri, Emanuele Rodolà, Guido Maria D’Amely di Melendugno, Luca Franco, Fabio Galasso, Iacopo Masi
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2603.14093.
[770] Implicit Maximum Likelihood Estimation for Real-time Generative Model Predictive Control
Grayson Lee, Minh Bui, Shuzi Zhou, Yankai Li, Mo Chen, Ke Li
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2603.13733.
[771] Compute Allocation for Reasoning-Intensive Retrieval Agents
Sreeja Apparaju, Nilesh Gupta
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2603.14635.
[772] Large Reward Models: Generalizable Online Robot Reward Generation with Vision-Language Models
Yanru Wu, Weiduo Yuan, Ang Qi, Vitor Guizilini, Jiageng Mao, Yue Wang
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2603.16065.
[773] Deep learning based intelligent IDS for Large-scale IoT networks
Isha Andrade, Shalaka S Mahadik, Mithun Mukherjee, Pranav M Pawar, Raja Muthalagu
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2603.16342.
[774] Adversarial attacks against Modern Vision-Language Models
Alejandro Paredes La Torre
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2603.16960.
[775] Spectral Alignment in Forward-Backward Representations via Temporal Abstraction
Seyed Mahdi B. Azad, Jasper Hoffmann, Iman Nematollahi, Hao Zhu, Abhinav Valada, Joschka Boedecker
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2603.20103.
[776] The Spillover Effects of Peer AI Rinsing on Corporate Green Innovation
Li Wenxiu, Wen Zhanjie, Xia Jiechang, Guo Jingqiao
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2603.18415.
[777] The Impact of Corporate AI Washing on Farmers’ Digital Financial Behavior Response – An Analysis from the Perspective of Digital Financial Exclusion
Zhanjie Wen, Wenxiu Li, Jiechang Xia, Jingqiao Guo
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2603.18421.
[778] Agent Control Protocol: Admission Control for Agent Actions
Marcelo Fernandez
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2603.18829.
[779] Revealing Domain-Spatiality Patterns for Configuration Tuning: Domain Knowledge Meets Fitness Landscapes
Yulong Ye, Hongyuan Liang, Chao Jiang, Miqing Li, Tao Chen
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2603.19897.
cs.SD
[780] LL-SDR: Low-Latency Speech enhancement through Discrete Representations
Jingyi Li, Luca Della Libera, Mirco Ravanelli, Cem Subakan
Main category: cs.SD
TL;DR: LL-SDR: A token-based speech enhancement framework using discretization to better separate speech and noise through a Variance-Ordered Residual Vector Quantizer and latent-space discriminator.
Details
Motivation: While discrete audio tokens have been explored for autoregressive speech enhancement, it's unclear whether discretization itself consistently improves performance. The paper aims to leverage discretization explicitly to better separate speech and noise.
Method: Proposes LL-SDR with two key components: 1) Variance-Ordered Residual Vector Quantizer (VO-RVQ) designed to disentangle speech and noise distributions during tokenization, and 2) a latent-space discriminator to better align enhanced embeddings with semantic embeddings.
Result: LL-SDR outperforms continuous baselines and matches the performance of autoregressive token-based approaches, while enabling lightweight, low-latency speech enhancement in both reverberant and non-reverberant noisy environments.
Conclusion: Discretization can be effectively leveraged for speech enhancement through proper architectural design, enabling competitive performance with efficient inference suitable for real-time applications.
Abstract: Many speech enhancement (SE) methods rely on continuous representations. Recently, discrete audio tokens have been explored to enable autoregressive generation for SE. However, it remains unclear whether discretization itself consistently improves SE performance. In this paper, we introduce LL-SDR, a token-based speech enhancement framework that explicitly leverages discretization to better separate speech and noise. Our first contribution is a Variance-Ordered Residual Vector Quantizer (VO-RVQ), designed to disentangle speech and noise distributions during tokenization. Second, we propose a latent-space discriminator to better align enhanced embeddings with semantic embeddings. Experiments show that LL-SDR outperforms continuous baselines and matches the performance of autoregressive token-based approaches, while enabling lightweight, low-latency speech enhancement in both reverberant and non-reverberant noisy environments. Demos and source code are available at our project websites.
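The core building block here, residual vector quantization, is easy to sketch: each stage quantizes the residual left over by the previous stage against its own codebook. The paper's VO-RVQ additionally orders codebooks by variance to disentangle speech from noise; that ordering and the codebooks below are not from the paper, so treat this as a minimal illustration of plain RVQ only.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual vector quantization: stage i quantizes the residual
    left by stages 0..i-1 against its own codebook."""
    residual = x.astype(float).copy()
    codes, quantized = [], np.zeros_like(residual)
    for cb in codebooks:                      # cb: (K, D) array of codewords
        dists = np.linalg.norm(cb - residual, axis=1)
        idx = int(np.argmin(dists))           # nearest codeword at this stage
        codes.append(idx)
        quantized += cb[idx]
        residual -= cb[idx]
    return codes, quantized

# Toy example: a coarse and a fine 2-codeword codebook in 2-D.
cb_coarse = np.array([[1.0, 0.0], [0.0, 1.0]])
cb_fine = np.array([[0.1, 0.0], [0.0, 0.1]])
codes, q = rvq_encode(np.array([1.08, 0.02]), codebooks=[cb_coarse, cb_fine])
# codes -> [0, 0]; q -> [1.1, 0.0], approximating the input in two stages
```

Later stages shrink the reconstruction error, which is why dropping fine codebooks (aggressive compression) trades quality for bitrate.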
[781] Voice Privacy from an Attribute-based Perspective
Mehtab Ur Rahman, Martha Larson, Cristian Tejedor García
Main category: cs.SD
TL;DR: This paper introduces an attribute-based perspective for evaluating voice privacy, moving beyond signal-to-signal comparisons to analyze privacy protection through speaker attribute comparisons and uniqueness analysis.
Details
Motivation: Current voice privacy benchmarks focus on signal-to-signal comparisons, but this paper argues for an attribute-based perspective to better measure privacy protection by analyzing speaker uniqueness through attribute comparisons.
Method: The authors analyze privacy impact by calculating speaker uniqueness for ground truth attributes, attributes inferred from original speech, and attributes inferred from anonymized speech. They also examine a threat scenario with single utterances per speaker and calculate attack error rates.
Result: The research shows that inferred attributes still present privacy risks despite attribute inference errors, highlighting that current anonymization methods may not fully protect against attribute-based attacks.
Conclusion: Future voice privacy research should consider both attribute-related threats and protection mechanisms, as attribute-based analysis reveals important privacy vulnerabilities not captured by traditional signal comparison methods.
Abstract: Voice privacy approaches that preserve the anonymity of speakers modify speech in an attempt to break the link with the true identity of the speaker. Current benchmarks measure speaker protection based on signal-to-signal comparisons. In this paper, we introduce an attribute-based perspective, where we measure privacy protection in terms of comparisons between sets of speaker attributes. First, we analyze privacy impact by calculating speaker uniqueness for ground truth attributes, attributes inferred on the original speech, and attributes inferred on speech protected with standard anonymization. Next, we examine a threat scenario involving only a single utterance per speaker and calculate attack error rates. Overall, we observe that inferred attributes still present a risk despite attribute inference errors. Our research points to the importance of considering both attribute-related threats and protection mechanisms in future voice privacy research.
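The paper does not spell out its uniqueness formula, but one natural, k-anonymity-style reading is the reciprocal of the number of speakers sharing the same attribute tuple: a score of 1.0 means the attribute combination identifies the speaker outright. The speakers and attributes below are hypothetical.

```python
from collections import Counter

def uniqueness_scores(attrs):
    """Per-speaker uniqueness: reciprocal of the number of speakers
    sharing the same attribute tuple (1.0 = fully identifying)."""
    counts = Counter(attrs.values())
    return {spk: 1.0 / counts[a] for spk, a in attrs.items()}

# Hypothetical (gender, age band, accent) attributes per speaker.
ground_truth = {"spk1": ("f", "20s", "us"),
                "spk2": ("f", "20s", "us"),
                "spk3": ("m", "40s", "uk")}
scores = uniqueness_scores(ground_truth)
# spk3 is the only ("m", "40s", "uk") speaker, so scores["spk3"] == 1.0
```

Comparing such scores across ground-truth, original-speech, and anonymized-speech attribute sets is what lets the analysis say how much identifying information survives anonymization.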
[782] ALICE: A Multifaceted Evaluation Framework of Large Audio-Language Models’ In-Context Learning Ability
Yen-Ting Piao, Jay Chiehen Liao, Wei-Tang Chien, Toshiki Ogimoto, Shang-Tse Chen, Yun-Nung Chen, Chun-Yi Lee, Shao-Yuan Lo
Main category: cs.SD
TL;DR: LALMs show degraded instruction-following; ALICE framework evaluates in-context learning with audio conditioning, revealing asymmetry: demonstrations improve format compliance but degrade core task performance.
Details
Motivation: Large Audio-Language Models (LALMs) have shown degraded instruction-following capabilities, but their ability to infer task patterns from in-context examples under audio conditioning remains unstudied. The authors aim to systematically evaluate LALMs' in-context learning ability when conditioned on audio inputs.
Method: The authors present ALICE, a three-stage framework that progressively reduces textual guidance to systematically evaluate LALMs’ in-context learning under audio conditioning. They evaluate six LALMs across four audio understanding tasks under two output constraint categories.
Result: The study uncovers a consistent asymmetry across all stages and LALMs: in-context demonstrations reliably improve format compliance but fail to improve, and often degrade, the core task performance. This suggests LALMs can learn surface-level formatting patterns but struggle to leverage cross-modal semantic grounding.
Conclusion: LALMs can glean surface-level formatting patterns from demonstrations but may struggle to leverage cross-modal semantic grounding to reliably infer task objectives from audio-conditioned examples, highlighting potential limitations in current cross-modal integration in multimodal models.
Abstract: While Large Audio-Language Models (LALMs) have been shown to exhibit degraded instruction-following capabilities, their ability to infer task patterns from in-context examples under audio conditioning remains unstudied. To address this gap, we present ALICE, a three-stage framework that progressively reduces textual guidance to systematically evaluate LALMs’ in-context learning ability under audio conditioning. Evaluating six LALMs across four audio understanding tasks under two output constraint categories, we uncover a consistent asymmetry across all stages and LALMs: in-context demonstrations reliably improve format compliance but fail to improve, and often degrade, the core task performance. This suggests that LALMs can glean surface-level formatting patterns from demonstrations but may struggle to leverage cross-modal semantic grounding to reliably infer task objectives from audio-conditioned examples, highlighting potential limitations in current cross-modal integration.
[783] SNAP: Speaker Nulling for Artifact Projection in Speech Deepfake Detection
Kyudan Jung, Jihwan Kim, Minwoo Lee, Soyoon Kim, Jeonghoon Kim, Jaegul Choo, Cheonbok Park
Main category: cs.SD
TL;DR: SNAP framework reduces speaker entanglement in speech deepfake detection by projecting out speaker-specific features to isolate synthesis artifacts, improving generalization across unseen speakers.
Details
Motivation: Current speech deepfake detectors using self-supervised speech encoders struggle to generalize across unseen speakers because they exploit speaker-specific correlations rather than artifact-related cues, a phenomenon called speaker entanglement.
Method: Introduces SNAP (speaker-nulling framework) that estimates a speaker subspace and applies orthogonal projection to suppress speaker-dependent components, isolating synthesis artifacts in residual features for improved detection.
Result: SNAP achieves state-of-the-art performance by reducing speaker entanglement and encouraging detectors to focus on artifact-related patterns rather than speaker-specific features.
Conclusion: Speaker entanglement is a key limitation in current speech deepfake detection, and the proposed speaker-nulling framework effectively mitigates this issue, leading to better generalization and detection performance.
Abstract: Recent advancements in text-to-speech technologies enable generating high-fidelity synthetic speech nearly indistinguishable from real human voices. While recent studies show the efficacy of self-supervised learning-based speech encoders for deepfake detection, these models struggle to generalize across unseen speakers. Our quantitative analysis suggests these encoder representations are substantially influenced by speaker information, causing detectors to exploit speaker-specific correlations rather than artifact-related cues. We call this phenomenon speaker entanglement. To mitigate this reliance, we introduce SNAP, a speaker-nulling framework. We estimate a speaker subspace and apply orthogonal projection to suppress speaker-dependent components, isolating synthesis artifacts within the residual features. By reducing speaker entanglement, SNAP encourages detectors to focus on artifact-related patterns, leading to state-of-the-art performance.
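The projection step is standard linear algebra: given an (assumed) basis for the speaker subspace, e.g. leading principal components of speaker embeddings, features are mapped onto its orthogonal complement via P = I - U Uᵀ. The 3-D toy below, where speaker variation is assumed to lie along the first axis, is illustrative only, not the paper's estimation procedure.

```python
import numpy as np

def null_speaker_subspace(features, speaker_basis):
    """Project features onto the orthogonal complement of the speaker
    subspace: h_clean = (I - U U^T) h, with U an orthonormal basis."""
    U, _ = np.linalg.qr(speaker_basis)        # orthonormalize the basis
    P = np.eye(U.shape[0]) - U @ U.T          # orthogonal projector
    return features @ P.T

# Toy 3-D example: speaker variation assumed to live along axis 0.
basis = np.array([[1.0], [0.0], [0.0]])
h = np.array([[2.0, 3.0, 4.0]])
h_clean = null_speaker_subspace(h, basis)
# first component (speaker direction) is nulled: [[0., 3., 4.]]
```

Whatever the encoder placed along speaker directions is removed, so the detector can only rely on what remains, which is the point of isolating synthesis artifacts in the residual.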
[784] ERM-MinMaxGAP: Benchmarking and Mitigating Gender Bias in Multilingual Multimodal Speech-LLM Emotion Recognition
Zi Haur Pang, Xiaoxue Gao, Tatsuya Kawahara, Nancy F. Chen
Main category: cs.SD
TL;DR: A study on gender bias in multilingual speech emotion recognition using large language models, introducing a new benchmark and fairness-aware training method to reduce bias while improving performance.
Details
Motivation: Speech emotion recognition systems can exhibit gender-related performance disparities, but it's unclear how such bias manifests in multilingual speech LLMs across different languages and modalities. There's a need to quantify language-specific SER performance and gender gaps, and develop methods to address these biases.
Method: Introduced a novel multilingual, multimodal benchmark built on MELD-ST spanning English, Japanese, and German. Proposed ERM-MinMaxGAP, a fairness-informed training objective that augments empirical risk minimization with an adaptive fairness weight mechanism and a novel MinMaxGAP regularizer on the maximum male-female loss gap within each language and modality. Built upon the Qwen2-Audio backbone.
Result: Found that bias is strongly language-dependent, and multimodal fusion does not reliably improve fairness. The ERM-MinMaxGAP approach improved multilingual SER performance by 5.5% and 5.0% while reducing the overall gender bias gap by 0.1% and 1.4% in unimodal and multimodal settings respectively.
Conclusion: Gender bias in speech emotion recognition varies significantly across languages, and multimodal approaches don’t inherently solve fairness issues. The proposed ERM-MinMaxGAP method effectively reduces bias while improving overall performance in multilingual settings.
Abstract: Speech emotion recognition (SER) systems can exhibit gender-related performance disparities, but how such bias manifests in multilingual speech LLMs across languages and modalities is unclear. We introduce a novel multilingual, multimodal benchmark built on MELD-ST, spanning English, Japanese, and German, to quantify language-specific SER performance and gender gaps. We find bias is strongly language-dependent, and multimodal fusion does not reliably improve fairness. To address these, we propose ERM-MinMaxGAP, a fairness-informed training objective, which augments empirical risk minimization (ERM) with a proposed adaptive fairness weight mechanism and a novel MinMaxGAP regularizer on the maximum male-female loss gap within each language and modality. Building upon the Qwen2-Audio backbone, our ERM-MinMaxGAP approach improves multilingual SER performance by 5.5% and 5.0% while reducing the overall gender bias gap by 0.1% and 1.4% in the unimodal and multimodal settings, respectively.
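The objective's shape can be sketched directly from the description: average loss (ERM) plus a weighted penalty on the largest male-female loss gap across (language, modality) groups. The paper's adaptive fairness weight mechanism is not specified here, so a fixed weight and made-up losses stand in for it.

```python
def erm_minmaxgap(group_losses, fairness_weight=0.5):
    """ERM plus a penalty on the maximum male-female loss gap across
    (language, modality) groups -- a sketch of the MinMaxGAP objective."""
    all_losses = [l for pair in group_losses.values() for l in pair]
    erm = sum(all_losses) / len(all_losses)          # empirical risk
    max_gap = max(abs(m - f) for m, f in group_losses.values())
    return erm + fairness_weight * max_gap

# Hypothetical (male_loss, female_loss) per (language, modality) group.
losses = {("en", "audio"): (0.40, 0.55),
          ("ja", "audio"): (0.30, 0.70),
          ("de", "text+audio"): (0.50, 0.45)}
obj = erm_minmaxgap(losses, fairness_weight=0.5)
```

Penalizing only the worst gap (rather than the average gap) pushes training to fix the most biased language-modality combination first.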
[785] Emotion-Aware Quantization for Discrete Speech Representations: An Analysis of Emotion Preservation
Haoguang Zhou, Siyi Wang, Jingyao Wu, James Bailey, Ting Dang
Main category: cs.SD
TL;DR: The paper studies how residual vector quantization affects emotional information in discrete speech representations and proposes emotion-aware quantization methods to better preserve emotional content.
Details
Motivation: Modern speech systems use discretized self-supervised speech representations, but their impact on emotional information preservation is unclear. The authors want to understand how quantization affects emotional content and develop methods to better preserve it.
Method: 1) Analyze how residual vector quantization (RVQ) reshapes emotional information from representation- and task-level perspectives. 2) Introduce emotion-aware quantization using emotion-specific and emotion-biased codebooks. 3) Propose Emo-Q, a lightweight routed quantization method that selects emotion-specialized codebooks.
Result: Aggressive compression disproportionately degrades emotion, with uneven loss across emotion classes and model architectures. Emotion-aware quantization improves preservation of both hard and soft emotion perception. Emo-Q improves emotion recognition performance at lower bitrates.
Conclusion: Emotion-aware discretization is important for robust affective speech processing. The proposed methods help preserve emotional information in quantized speech representations.
Abstract: Modern speech systems increasingly use discretized self-supervised speech representations for compression and integration with token-based models, yet their impact on emotional information remains unclear. We study how residual vector quantization (RVQ) reshapes emotional information in discrete speech representations from both representation- and task-level perspectives. Our analysis shows that aggressive compression disproportionately degrades emotion, with uneven loss across emotion classes and model architectures. To address this, we introduce emotion-aware quantization using emotion-specific and emotion-biased codebooks, improving the preservation of both hard and soft emotion perception. We further propose Emo-Q, a lightweight routed quantization method that selects emotion-specialized codebooks, improving emotion recognition performance at lower bitrates. These results highlight the importance of emotion-aware discretization for robust affective speech processing.
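One plausible reading of "routed quantization that selects emotion-specialized codebooks" is to pick the codebook yielding the lowest quantization error; whether Emo-Q uses a learned router or an error-based rule is not stated in the abstract, so the min-error routing and toy codebooks below are assumptions.

```python
import numpy as np

def emo_q_route(x, codebooks):
    """Route to the emotion-specialized codebook with the smallest
    quantization error, then quantize with it (assumed routing rule)."""
    best = None
    for emotion, cb in codebooks.items():
        dists = np.linalg.norm(cb - x, axis=1)
        idx = int(np.argmin(dists))
        if best is None or dists[idx] < best[0]:
            best = (dists[idx], emotion, cb[idx])
    _, emotion, codeword = best
    return emotion, codeword

# Hypothetical 2-D codebooks specialized per emotion class.
books = {"happy": np.array([[1.0, 1.0]]),
         "sad":   np.array([[-1.0, -1.0]])}
emotion, cw = emo_q_route(np.array([0.9, 1.2]), books)  # routes to "happy"
```

The intuition either way is the same: a codebook trained on one emotion's distribution quantizes frames of that emotion with less distortion, so routing preserves affective detail at a given bitrate.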
[786] HELIX: Scaling Raw Audio Understanding with Hybrid Mamba-Attention Beyond the Quadratic Limit
Khushiyant, Param Thakkar
Main category: cs.SD
TL;DR: The HELIX framework compares audio architectures (pure Mamba, pure attention, hybrid), showing that design choices are coupled: the preferred input representation depends on the backbone, and attention becomes important for long sequences.
Details
Motivation: Audio representation learning typically evaluates design choices like input frontend, sequence backbone, and sequence length in isolation, but these axes are actually coupled and conclusions from one setting often don't transfer to others.
Method: Introduces HELIX framework comparing pure Mamba, pure attention, and minimal hybrid with single attention bottleneck. All models parameter-matched at ~8.3M parameters to isolate architectural effects across six datasets.
Result: Preferred input representation depends on backbone; attention hurts performance on short, stationary audio but becomes important at longer sequence lengths. On 5-minute speaker ID task with 30,000 tokens, pure attention fails with OOM errors while HELIX closes 11.5-point gap over pure Mamba.
Conclusion: Architectural choices in audio representation learning are interdependent, and hybrid approaches like HELIX can combine strengths of different backbones for optimal performance across varying sequence lengths.
Abstract: Audio representation learning typically evaluates design choices such as input frontend, sequence backbone, and sequence length in isolation. We show that these axes are coupled, and conclusions from one setting often do not transfer to others. We introduce HELIX, a controlled framework comparing pure Mamba, pure attention, and a minimal hybrid with a single attention bottleneck. All models are parameter-matched at about 8.3M parameters to isolate architectural effects. Across six datasets, we find that the preferred input representation depends on the backbone, and that attention hurts performance on short, stationary audio but becomes important at longer sequence lengths. On a 5-minute speaker identification task with 30,000 tokens, pure attention fails with out-of-memory errors, while HELIX closes an 11.5-point gap over pure Mamba.
[787] Enterprise Sales Copilot: Enabling Real-Time AI Support with Automatic Information Retrieval in Live Sales Calls
Jielin Qiu, Liangwei Yang, Ming Zhu, Wenting Zhao, Zhiwei Liu, Juntao Tan, Zixiang Chen, Roshan Ram, Akshara Prabhakar, Rithesh Murthy, Shelby Heinecke, Caiming Xiong, Silvio Savarese, Huan Wang
Main category: cs.SD
TL;DR: SalesCopilot is a real-time AI assistant that automatically detects customer questions during sales calls, retrieves relevant product information from databases, and displays concise answers to sales representatives within seconds, eliminating manual search delays.
Details
Motivation: During live sales calls, customers ask detailed product questions that require representatives to manually search internal databases and CRM systems, taking 25-65 seconds per query. This creates awkward pauses that hurt customer experience and reduce sales efficiency.
Method: The system integrates streaming speech-to-text transcription, LLM-based question detection, and retrieval-augmented generation (RAG) over a structured product database into a unified real-time pipeline. It’s domain-agnostic and can be adapted to any enterprise sales domain by replacing the product database.
Result: In benchmark evaluation on an insurance sales scenario with 50 products spanning 10 categories, SalesCopilot achieves a mean response time of 2.8 seconds with 100% question detection rate, representing a 14x speedup compared to manual CRM search.
Conclusion: SalesCopilot effectively eliminates the bottleneck of manual information retrieval during sales calls, significantly improving response times and customer experience while maintaining domain adaptability.
Abstract: During live sales calls, customers frequently ask detailed product questions that require representatives to manually search internal databases and CRM systems. This process typically takes 25-65 seconds per query, creating awkward pauses that hurt customer experience and reduce sales efficiency. We present SalesCopilot, a real-time AI-powered assistant that eliminates this bottleneck by automatically detecting customer questions, retrieving relevant information from the product database, and displaying concise answers on the representative’s dashboard in seconds. The system integrates streaming speech-to-text transcription, large language model (LLM)-based question detection, and retrieval-augmented generation (RAG) over a structured product database into a unified real-time pipeline. We demonstrate SalesCopilot on an insurance sales scenario with 50 products spanning 10 categories (2,490 FAQs, 290 coverage details, and 162 pricing tiers). In our benchmark evaluation, SalesCopilot achieves a measured mean response time of 2.8 seconds with 100% question detection rate, representing a 14x speedup compared to manual CRM search in an internal study. The system is domain-agnostic and can be adapted to any enterprise sales domain by replacing the product database.
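The pipeline shape (detect question, retrieve, display) can be illustrated with trivial stand-ins. The real system uses streaming speech-to-text, an LLM question detector, and RAG over a structured product database; the heuristic detector, keyword-overlap retriever, and toy FAQ strings below are all hypothetical simplifications.

```python
def detect_question(utterance):
    """Crude heuristic stand-in for the LLM-based question detector."""
    u = utterance.lower().strip()
    return u.endswith("?") or u.split()[0] in {"what", "how", "does", "is", "can"}

def retrieve(question, faq):
    """Keyword-overlap retrieval as a stand-in for the RAG retriever."""
    q_words = set(question.lower().strip("?").split())
    return max(faq, key=lambda entry: len(q_words & set(entry.lower().split())))

# Hypothetical FAQ entries for an insurance scenario.
faq = ["deductible for the silver plan is 500 dollars",
       "gold plan covers dental and vision"]
question = "does the gold plan cover dental?"
answer = retrieve(question, faq) if detect_question(question) else None
# answer -> the "gold plan covers dental and vision" entry
```

A production retriever would use embeddings and an LLM to compose the displayed answer, but the control flow, gate on question detection, then rank database entries, is the same.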
[788] LipsAM: Lipschitz-Continuous Amplitude Modifier for Audio Signal Processing and its Application to Plug-and-Play Dereverberation
Kazuki Matsumoto, Ren Uchida, Kohei Yatabe
Main category: cs.SD
TL;DR: Proposes Lipschitz-continuous amplitude modifier (LipsAM) architectures for audio processing, with applications to speech dereverberation.
Details
Motivation: Deep neural networks for audio processing lack focus on Lipschitz continuity certification, which is crucial for robustness. Existing Lipschitz continuity results have poor compatibility with audio DNNs, particularly amplitude modifier architectures commonly used for audio signals.
Method: Develop Lipschitz-continuous variants of amplitude modifier (AM) architectures called LipsAM. Prove sufficient condition for AM to be Lipschitz continuous and propose two specific LipsAM architectures as examples. Apply these to Plug-and-Play algorithm for speech dereverberation.
Result: Demonstrated improved stability of LipsAM architectures through numerical experiments in speech dereverberation applications.
Conclusion: LipsAM provides certified robust audio processing networks with Lipschitz continuity guarantees, addressing a gap in audio DNN robustness certification.
Abstract: The robustness of deep neural networks (DNNs) can be certified through their Lipschitz continuity, which has made the construction of Lipschitz-continuous DNNs an active research field. However, DNNs for audio processing have not been a major focus due to their poor compatibility with existing results. In this paper, we consider the amplitude modifier (AM), a popular architecture for handling audio signals, and propose its Lipschitz-continuous variants, which we refer to as LipsAM. We prove a sufficient condition for an AM to be Lipschitz continuous and propose two architectures as examples of LipsAM. The proposed architectures were applied to a Plug-and-Play algorithm for speech dereverberation, and their improved stability is demonstrated through numerical experiments.
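To see why bounding an amplitude mask bounds the network's gain, consider the simplest case of a fixed mask m with entries in [0, 1]: the map x ↦ m ⊙ x is then 1-Lipschitz, since |m_i(x_i - x'_i)| ≤ |x_i - x'_i| coordinate-wise. The paper's sufficient condition covers the harder case where the mask depends on the input; the fixed-mask demo below is a deliberate simplification, not the paper's construction.

```python
import numpy as np

def amplitude_modify(x, mask):
    """Elementwise amplitude modification y = m * x. For a fixed mask
    with entries in [0, 1], this map is 1-Lipschitz in x."""
    m = np.clip(mask, 0.0, 1.0)   # bounding the mask bounds the gain
    return m * x

rng = np.random.default_rng(0)
m = rng.uniform(0.0, 1.0, size=8)
x1, x2 = rng.normal(size=8), rng.normal(size=8)
lhs = np.linalg.norm(amplitude_modify(x1, m) - amplitude_modify(x2, m))
rhs = np.linalg.norm(x1 - x2)
# lhs <= rhs for any pair of inputs, illustrating the Lipschitz bound
```

This kind of certified bound is what makes the architecture safe to use inside a Plug-and-Play iteration, where convergence arguments typically require the denoiser to be non-expansive.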
[789] AnimalCLAP: Taxonomy-Aware Language-Audio Pretraining for Species Recognition and Trait Inference
Risa Shinoda, Kaede Shiohara, Nakamasa Inoue, Hiroaki Santo, Fumio Okura
Main category: cs.SD
TL;DR: AnimalCLAP is a taxonomy-aware language-audio framework that uses hierarchical biological information to improve classification of unseen animal species from their vocalizations.
Details
Motivation: Animal vocalizations are important for wildlife assessment and ecological monitoring, but deep learning approaches struggle with classifying species not seen during training. Current methods lack incorporation of biological taxonomy information.
Method: Created a large vocalization dataset (4,225 hours, 6,823 species) annotated with 22 ecological traits. Developed AnimalCLAP model that aligns audio and textual representations using taxonomic structures to improve recognition of unseen species.
Result: The model effectively infers ecological and biological attributes from vocalizations and achieves superior performance compared to CLAP. It demonstrates improved capability for recognizing unseen species.
Conclusion: Incorporating hierarchical biological taxonomy into language-audio models significantly improves wildlife vocalization classification, especially for unseen species, advancing ecological monitoring capabilities.
Abstract: Animal vocalizations provide crucial insights for wildlife assessment, particularly in complex environments such as forests, aiding species identification and ecological monitoring. Recent advances in deep learning have enabled automatic species classification from their vocalizations. However, classifying species unseen during training remains challenging. To address this limitation, we introduce AnimalCLAP, a taxonomy-aware language-audio framework comprising a new dataset and model that incorporate hierarchical biological information. Specifically, our vocalization dataset consists of 4,225 hours of recordings covering 6,823 species, annotated with 22 ecological traits. The AnimalCLAP model is trained on this dataset to align audio and textual representations using taxonomic structures, improving the recognition of unseen species. We demonstrate that our proposed model effectively infers ecological and biological attributes of species directly from their vocalizations, achieving superior performance compared to CLAP. Our dataset, code, and models will be publicly available at https://dahlian00.github.io/AnimalCLAP_Page/.
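Recognition of unseen species in a CLAP-style model reduces to zero-shot classification: embed the audio clip, embed a text description per candidate species, and pick the highest cosine similarity. The species names and 2-D "embeddings" below are hypothetical stand-ins for real encoder outputs.

```python
import numpy as np

def zero_shot_species(audio_emb, text_embs):
    """CLAP-style zero-shot recognition: return the species whose text
    embedding is most cosine-similar to the audio embedding."""
    def unit(v):
        return v / np.linalg.norm(v)
    a = unit(audio_emb)
    return max(text_embs, key=lambda sp: float(unit(text_embs[sp]) @ a))

# Hypothetical embeddings; a real model produces these from its encoders.
texts = {"Corvus corax": np.array([1.0, 0.2]),
         "Bubo bubo":    np.array([0.1, 1.0])}
species = zero_shot_species(np.array([0.9, 0.1]), texts)  # "Corvus corax"
```

Taxonomy awareness plausibly enters through the text side: a prompt describing genus, family, and ecological traits lets an unseen species inherit similarity to its taxonomic neighbors, which is what the digest credits for the improvement over plain CLAP.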
[790] A Multimodal Data Fusion Generative Adversarial Network for Real Time Underwater Sound Speed Field Construction
Wei Huang, Yuqiang Huang, Yanan Wu, Tianhe Xu, Tingting Lyu, Hao Zhang
Main category: cs.SD
TL;DR: A multi-modal GAN model with residual attention blocks for constructing underwater sound speed profiles from surface data without on-site underwater measurements.
Details
Motivation: Traditional SSP estimation methods require on-site underwater sonar data, which is costly and deployment-intensive. The paper aims to achieve high-precision SSP estimation without underwater measurements by leveraging multi-modal surface data.
Method: Proposes MDF-RAGAN (Multi-modal Data-Fusion Generative Adversarial Network with Residual Attention Block) that uses attention mechanisms to capture global spatial feature correlations and residual modules to detect small disturbances in sound velocity distribution caused by sea surface temperature changes.
Result: Achieves accuracy with error less than 0.3m/s, outperforming CNN and spatial interpolation by nearly a factor of two, and achieves 65.8% RMSE reduction compared to mean profile methods.
Conclusion: The proposed multi-modal fusion approach with cross-modal attention effectively enhances overall profile matching for underwater sound speed estimation without requiring underwater measurements.
Abstract: Sound speed profiles (SSPs) are essential underwater parameters that affect the propagation mode of underwater signals and have a critical impact on the energy efficiency of underwater acoustic communication and the accuracy of underwater acoustic positioning. Traditionally, SSPs can be obtained by matched field processing (MFP), compressive sensing (CS), and deep learning (DL) methods. However, existing methods mainly rely on on-site underwater sonar observation data, which imposes strict requirements on the deployment of sonar observation systems. To achieve high-precision estimation of the sound velocity distribution in a given sea area without on-site underwater measurement, we propose a multi-modal data-fusion generative adversarial network model with residual attention block (MDF-RAGAN) for SSP construction. To improve the model’s ability to capture global spatial feature correlations, we embed attention mechanisms, and we use residual modules to deeply capture small disturbances in the deep-ocean sound velocity distribution caused by changes in sea surface temperature (SST). Experimental results on a real open dataset show that the proposed model outperforms other state-of-the-art methods, achieving an error of less than 0.3 m/s. Specifically, MDF-RAGAN not only outperforms a convolutional neural network (CNN) and spatial interpolation (SITP) by nearly a factor of two, but also achieves about a 65.8% root mean square error (RMSE) reduction compared to the mean profile, which fully reflects the enhancement of overall profile matching by multi-source fusion and cross-modal attention.
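As an illustration of the residual-attention idea, below is a minimal squeeze-and-excite-style block in NumPy: channel attention gates the feature map, and the residual path lets small disturbances pass through unchanged. The shapes and the single weight matrix `w_att` are assumptions for the sketch, not the MDF-RAGAN architecture.

```python
import numpy as np

def residual_attention_block(x, w_att):
    """Channel attention with a residual add: out = x + sigmoid(w_att @ gap(x)) * x.
    x: (C, H, W) feature map; w_att: (C, C) learned weights (hypothetical shapes)."""
    gap = x.mean(axis=(1, 2))                        # squeeze: global average pool -> (C,)
    scores = 1.0 / (1.0 + np.exp(-(w_att @ gap)))    # excitation: per-channel gates in (0, 1)
    return x + scores[:, None, None] * x             # residual: identity path preserves detail
```

With zero weights the gate is 0.5 everywhere, so the block scales the input by 1.5 while keeping its shape, which makes the residual path easy to verify.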
[791] Multi-Task Instruction Tuning via Data Scheduling for Low-Resource Arabic AudioLLMs
Hunzalah Hassan Bhatti, Firoj Alam, Shammur Absar Chowdhury
Main category: cs.SD
TL;DR: Arabic-centric audio LLM adaptation study comparing training strategies for multi-task instruction tuning in resource-constrained, dialect-rich Arabic-English settings, introducing AraMega-SSum dataset for Arabic speech summarization.
Details
Motivation: Audio LLMs need adaptation to linguistically complex and dialect-rich environments like Arabic-English, but current approaches face challenges in resource-constrained settings requiring efficient multi-task training strategies.
Method: Controlled study comparing four training strategies: Uniform Task Mixing, Task-Progressive Curriculum (TPC), Aligner-Based Diverse Sampling (ADS), and two-stage TPC->ADS, evaluated on generative tasks (ASR, speech/text summarization) and discriminative tasks (dialect/emotion recognition).
Result: ADS speeds early convergence and improves paralinguistic performance but hurts other tasks; two-stage TPC->ADS provides most reliable overall balance across tasks, revealing efficiency-robustness trade-off.
Conclusion: Two-stage TPC->ADS strategy offers practical guidance for adapting audio LLMs to low-resource, dialect-rich environments; AraMega-SSum dataset supports Arabic speech summarization research.
Abstract: Audio large language models (LLMs) enable unified speech understanding and generation, but adapting them to linguistically complex and dialect-rich settings such as Arabic-English remains challenging. We present a controlled study of multi-task instruction tuning for an Arabic-centric audio LLM across generative tasks including ASR and speech and text summarization, and discriminative tasks including dialect and emotion recognition, in a resource-constrained setting. To support end-to-end Arabic speech summarization, we introduce AraMega-SSum, the first speech summarization resource for training and benchmarking Arabic-centric Audio-LLMs. We compare four training strategies: (i) Uniform Task Mixing, (ii) Task-Progressive Curriculum (TPC), (iii) Aligner-Based Diverse Sampling (ADS) for training-time batch construction, and (iv) a two-stage TPC->ADS strategy. Our results show a clear efficiency-robustness trade-off: ADS speeds up early convergence and improves paralinguistic performance, but it hurts other tasks. The two-stage TPC->ADS strategy gives the most reliable overall balance across tasks, offering practical guidance for adapting omni audio LLMs to low-resource, dialect-rich environments. We will make AraMega-SSum and all experimental resources publicly available to the community.
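The first two scheduling strategies reduce to simple batch samplers, sketched below; `task_pools`, `task_order`, and `steps_per_task` are hypothetical names, and the real TPC/ADS logic (aligner-based diverse sampling in particular) is more involved.

```python
import random

def uniform_task_mixing(task_pools, batch_size, rng):
    """Uniform Task Mixing: every batch item draws its task uniformly at random."""
    tasks = list(task_pools)
    return [rng.choice(task_pools[rng.choice(tasks)]) for _ in range(batch_size)]

def task_progressive_batch(task_pools, task_order, step, steps_per_task, batch_size, rng):
    """Task-Progressive Curriculum sketch: train on one task at a time, in a fixed
    order, advancing to the next task after a fixed budget of steps."""
    stage = min(step // steps_per_task, len(task_order) - 1)
    pool = task_pools[task_order[stage]]
    return [rng.choice(pool) for _ in range(batch_size)]
```

A two-stage TPC->ADS schedule would simply switch from the second sampler to a diversity-driven one after the curriculum finishes.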
[792] CALM: Class-Conditional Sparse Attention Vectors for Large Audio-Language Models
Videet Mehta, Liming Wang, Hilde Kuehne, Rogerio Feris, James R. Glass, M. Jehanzeb Mirza
Main category: cs.SD
TL;DR: Proposes class-conditional sparse attention vectors for large audio-language models to improve few-shot classification by learning class-dependent importance weights over attention heads, outperforming uniform voting methods.
Details
Motivation: While large audio-language models have strong zero-shot capabilities, they lag behind specialized models for discriminative tasks like audio classification. Existing sparse attention methods use uniform weighting across attention heads, assuming equal contribution across all semantic categories, which may not be optimal.
Method: Proposes a few-shot classification method that learns class-dependent importance weights over attention heads in large audio-language models. This allows individual heads to specialize in distinct semantic categories and contribute proportionally to their estimated reliability, rather than using uniform weights.
Result: Outperforms state-of-the-art uniform voting-based approaches by up to 14.52% for audio classification, 1.53% for audio-visual classification, and 8.35% for spoofing detection on multiple few-shot benchmarks.
Conclusion: Class-conditional weighting of attention heads in large audio-language models significantly improves few-shot classification performance by allowing heads to specialize in different semantic categories and contribute based on their reliability.
Abstract: Large audio-language models (LALMs) exhibit strong zero-shot capabilities in multiple downstream tasks, such as audio question answering (AQA) and abstract reasoning; however, these models still lag behind specialized models for certain discriminative tasks (e.g., audio classification). Recent studies show that sparse subsets of attention heads within an LALM can serve as strong discriminative feature extractors for downstream tasks such as classification via simple voting schemes. However, these methods assign uniform weights to all selected heads, implicitly assuming that each head contributes equally across all semantic categories. In this work, we propose Class-Conditional Sparse Attention Vectors for Large Audio-Language Models, a few-shot classification method that learns class-dependent importance weights over attention heads. This formulation allows individual heads to specialize in distinct semantic categories and to contribute to ensemble predictions proportionally to their estimated reliability. Experiments on multiple few-shot audio and audiovisual classification benchmarks and tasks demonstrate that our method consistently outperforms state-of-the-art uniform voting-based approaches by up to 14.52%, 1.53%, 8.35% absolute gains for audio classification, audio-visual classification, and spoofing detection respectively.
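A minimal sketch of the class-conditional voting idea, assuming each attention head has already been reduced to a per-example class prediction: estimate each head's reliability per class on the few-shot support set, then let heads vote with class-dependent weights. The actual method operates on attention-head features inside an LALM; this shows only the weighting scheme.

```python
def head_reliability(support, num_classes):
    """Class-conditional head weights from few-shot data.
    support: list of (per-head predictions, true label); the weight of head h
    for class c is the fraction of class-c examples head h got right."""
    num_heads = len(support[0][0])
    counts = [[0] * num_heads for _ in range(num_classes)]
    totals = [0] * num_classes
    for preds, y in support:
        totals[y] += 1
        for h, p in enumerate(preds):
            if p == y:
                counts[y][h] += 1
    return [[c / totals[k] if totals[k] else 0.0 for c in row]
            for k, row in enumerate(counts)]

def weighted_vote(preds, weights, num_classes):
    """Each head votes for its predicted class with its class-conditional weight,
    replacing the uniform voting of earlier sparse-attention methods."""
    scores = [0.0] * num_classes
    for h, p in enumerate(preds):
        scores[p] += weights[p][h]
    return max(range(num_classes), key=lambda c: scores[c])
```

Heads that specialize in one class thus dominate the vote exactly where they are reliable.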
[793] VorTEX: Various overlap ratio for Target speech EXtraction
Ro-hoon Oh, Jihwan Seol, Bugeun Kim
Main category: cs.SD
TL;DR: VorTEX introduces a text-prompted target speech extraction architecture with decoupled adaptive multi-branch fusion to handle various overlap ratios, evaluated on a new dataset PORTE with a diagnostic metric SuRE to detect suppression behavior.
Details
Motivation: Existing text-prompted target speech extraction approaches assume fully overlapped mixtures, limiting insight into behavior across realistic overlap ratios. There's a need for architectures that work robustly across various overlap scenarios and metrics to detect suppression behavior not captured by conventional measures.
Method: Proposes VorTEX architecture with Decoupled Adaptive Multi-branch (DAM) Fusion block that separates primary extraction from auxiliary regularization pathways. Constructs PORTE dataset spanning overlap ratios 0-100% for controlled analysis. Introduces Suppression Ratio on Energy (SuRE) metric to detect suppression behavior.
Result: VorTEX achieves highest separation fidelity across 20-100% overlap (5.50 dB at 20% and 2.04 dB at 100%) while maintaining zero SuRE, indicating robust extraction without suppression-driven artifacts. Existing models exhibit suppression or residual interference under overlap.
Conclusion: VorTEX demonstrates robust target speech extraction across various overlap ratios without suppression artifacts, enabled by the DAM fusion architecture and validated by the new SuRE diagnostic metric on the PORTE dataset.
Abstract: Target speech extraction (TSE) aims to recover a target speaker’s voice from a mixture. While recent text-prompted approaches have shown promise, most approaches assume fully overlapped mixtures, limiting insight into behavior across realistic overlap ratios. We introduce VorTEX (Various overlap ratio for Target speech EXtraction), a text-prompted TSE architecture with a Decoupled Adaptive Multi-branch (DAM) Fusion block that separates primary extraction from auxiliary regularization pathways. To enable controlled analysis, we construct PORTE, a two-speaker dataset spanning overlap ratios from 0% to 100%. We further propose Suppression Ratio on Energy (SuRE), a diagnostic metric that detects suppression behavior not captured by conventional measures. Experiments show that existing models exhibit suppression or residual interference under overlap, whereas VorTEX achieves the highest separation fidelity across 20-100% overlap (e.g., 5.50 dB at 20% and 2.04 dB at 100%) while maintaining zero SuRE, indicating robust extraction without suppression-driven artifacts.
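The summary does not give SuRE's formula, so the following is one plausible energy-based reading, labeled hypothetical: count the fraction of frames whose estimated energy collapses far below the reference energy, which would flag suppression that waveform-level fidelity metrics can miss.

```python
import math

def suppression_ratio_on_energy(est, ref, frame=4, thresh_db=-20.0):
    """Hypothetical sketch of an energy-based suppression diagnostic (the paper's
    exact SuRE definition may differ): the fraction of frames whose estimated
    energy falls more than `thresh_db` below the reference frame energy."""
    def frame_energy(x):
        return [sum(s * s for s in x[i:i + frame]) + 1e-12
                for i in range(0, len(x) - frame + 1, frame)]
    e_est, e_ref = frame_energy(est), frame_energy(ref)
    suppressed = sum(1 for a, b in zip(e_est, e_ref)
                     if 10.0 * math.log10(a / b) < thresh_db)
    return suppressed / len(e_est)
```

A perfect extraction scores 0, while a model that silences the target entirely scores 1, matching the "zero SuRE" behavior reported for VorTEX.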
cs.LG
[794] JointFM-0.1: A Foundation Model for Multi-Target Joint Distributional Prediction
Stefan Hackmann
Main category: cs.LG
TL;DR: JointFM is a foundation model that predicts future joint probability distributions of coupled time series by training on synthetic SDEs, eliminating the need for task-specific calibration.
Details
Motivation: Traditional SDE modeling faces challenges: high modeling risk, brittle calibration, and computational expense. The authors propose inverting the paradigm by training on synthetic SDE data rather than fitting SDEs to real data.
Method: Instead of fitting SDEs to data, the model samples an infinite stream of synthetic SDEs to train a generic foundation model that can predict future joint probability distributions directly without task-specific calibration or finetuning.
Result: In zero-shot settings, JointFM reduces energy loss by 14.2% relative to the strongest baseline when recovering oracle joint distributions generated by unseen synthetic SDEs.
Conclusion: JointFM establishes the first foundation model for distributional predictions of coupled time series, demonstrating superior performance without requiring calibration or finetuning.
Abstract: Despite the rapid advancements in Artificial Intelligence (AI), Stochastic Differential Equations (SDEs) remain the gold-standard formalism for modeling systems under uncertainty. However, applying SDEs in practice is fraught with challenges: modeling risk is high, calibration is often brittle, and high-fidelity simulations are computationally expensive. This technical report introduces JointFM, a foundation model that inverts this paradigm. Instead of fitting SDEs to data, we sample an infinite stream of synthetic SDEs to train a generic model to predict future joint probability distributions directly. This approach establishes JointFM as the first foundation model for distributional predictions of coupled time series - requiring no task-specific calibration or finetuning. Despite operating in a purely zero-shot setting, JointFM reduces the energy loss by 14.2% relative to the strongest baseline when recovering oracle joint distributions generated by unseen synthetic SDEs.
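The "infinite stream of synthetic SDEs" idea can be illustrated with a tiny Euler-Maruyama simulator that draws random coupled Ornstein-Uhlenbeck-style dynamics on each call; the parameter priors here are arbitrary assumptions, since the report's actual SDE family is not specified in this summary.

```python
import math
import random

def sample_synthetic_sde_path(dim, n_steps, dt, rng):
    """Draw one random coupled SDE and simulate it with Euler-Maruyama.
    Each call samples fresh parameters, so repeated calls yield an endless
    stream of training trajectories (a sketch of the pretraining data source)."""
    # Random mean-reversion matrix, long-run mean, and per-dimension noise scale.
    A = [[rng.gauss(0, 0.3) for _ in range(dim)] for _ in range(dim)]
    mu = [rng.gauss(0, 1) for _ in range(dim)]
    sigma = [abs(rng.gauss(0, 0.5)) for _ in range(dim)]
    x = list(mu)
    path = [list(x)]
    for _ in range(n_steps):
        drift = [sum(A[i][j] * (mu[j] - x[j]) for j in range(dim)) for i in range(dim)]
        x = [x[i] + drift[i] * dt + sigma[i] * math.sqrt(dt) * rng.gauss(0, 1)
             for i in range(dim)]
        path.append(list(x))
    return path
```

A model trained on such trajectories sees joint (coupled) dynamics at every example, which is what lets it predict joint distributions zero-shot.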
[795] MARLIN: Multi-Agent Reinforcement Learning for Incremental DAG Discovery
Dong Li, Zhengzhang Chen, Xujiang Zhao, Linlin Yu, Zhong Chen, Yi He, Haifeng Chen, Chen Zhao
Main category: cs.LG
TL;DR: MARLIN: Efficient multi-agent RL approach for incremental causal DAG learning from observational data
Details
Motivation: Existing RL methods for causal structure discovery from observational data lack efficiency for online applications, requiring more efficient approaches for DAG learning.
Method: Multi-agent RL framework with DAG generation policy mapping continuous space to DAG space, two RL agents (state-specific and state-invariant) for causal relationship discovery, incremental learning framework with factored action space for parallelization.
Result: Outperforms state-of-the-art methods in both efficiency and effectiveness on synthetic and real datasets.
Conclusion: MARLIN provides an efficient RL-based solution for incremental causal structure learning suitable for online applications.
Abstract: Uncovering causal structures from observational data is crucial for understanding complex systems and making informed decisions. While reinforcement learning (RL) has shown promise in identifying these structures in the form of a directed acyclic graph (DAG), existing methods often lack efficiency, making them unsuitable for online applications. In this paper, we propose MARLIN, an efficient multi-agent RL-based approach for incremental DAG learning. MARLIN uses a DAG generation policy that maps a continuous real-valued space to the DAG space as an intra-batch strategy, then incorporates two RL agents, state-specific and state-invariant, to uncover causal relationships and integrates these agents into an incremental learning framework. Furthermore, the framework leverages a factored action space to enhance parallelization efficiency. Extensive experiments on synthetic and real datasets demonstrate that MARLIN outperforms state-of-the-art methods in terms of both efficiency and effectiveness.
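One standard way to map a continuous real-valued space to the DAG space, shown here purely as an illustration (MARLIN's policy may differ): derive a node ordering from a score vector, then keep only above-threshold edges that respect the ordering, which guarantees acyclicity by construction.

```python
def continuous_to_dag(scores, ordering_key, thresh=0.0):
    """Map a continuous edge-score matrix to a DAG adjacency matrix: an edge
    i -> j survives only if it scores above `thresh` AND i precedes j in the
    ordering induced by `ordering_key`, so no cycle can form."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: ordering_key[i])
    rank = {node: r for r, node in enumerate(order)}
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and rank[i] < rank[j] and scores[i][j] > thresh:
                adj[i][j] = 1
    return adj

def is_acyclic(adj):
    """Kahn's algorithm: a graph is a DAG iff every node can be topologically sorted."""
    n = len(adj)
    indeg = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    queue = [j for j in range(n) if indeg[j] == 0]
    seen = 0
    while queue:
        u = queue.pop()
        seen += 1
        for v in range(n):
            if adj[u][v]:
                indeg[v] -= 1
                if indeg[v] == 0:
                    queue.append(v)
    return seen == n
```

Even mutually high scores (a would-be cycle) collapse to a single directed edge once the ordering breaks the tie.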
[796] Collaborative Adaptive Curriculum for Progressive Knowledge Distillation
Jing Liu, Zhenchao Ma, Han Yu, Bobo Ju, Wenliang Yang, Chengfang Li, Bo Hu, Liang Song
Main category: cs.LG
TL;DR: Federated Adaptive Progressive Distillation (FAPD) - a curriculum learning-inspired framework for federated knowledge distillation that adaptively transfers teacher knowledge to heterogeneous clients based on their learning capacities.
Details
Motivation: Address the mismatch between high-dimensional teacher knowledge complexity and heterogeneous client learning capacities in edge-based visual analytics systems, enabling effective collaborative knowledge distillation for resource-constrained distributed multimedia learning.
Method: Uses PCA-based hierarchical decomposition of teacher features to create visual knowledge hierarchy, then progressively transfers knowledge of increasing complexity via dimension-adaptive projection matrices. Server monitors global accuracy fluctuations across temporal consensus window to advance curriculum dimensionality only when collective consensus emerges.
Result: Achieves 3.64% accuracy improvement over FedAvg on CIFAR-10, 2x faster convergence, and maintains robust performance under extreme data heterogeneity (α=0.1), outperforming baselines by over 4.5%.
Conclusion: FAPD effectively adapts knowledge transfer pace to client heterogeneity while achieving superior convergence over fixed-complexity approaches, making it suitable for edge-based visual analytics systems.
Abstract: Recent advances in collaborative knowledge distillation have demonstrated cutting-edge performance for resource-constrained distributed multimedia learning scenarios. However, achieving such competitiveness requires addressing a fundamental mismatch: high-dimensional teacher knowledge complexity versus heterogeneous client learning capacities, which currently prohibits deployment in edge-based visual analytics systems. Drawing inspiration from curriculum learning principles, we introduce Federated Adaptive Progressive Distillation (FAPD), a consensus-driven framework that orchestrates adaptive knowledge transfer. FAPD hierarchically decomposes teacher features via PCA-based structuring, extracting principal components ordered by variance contribution to establish a natural visual knowledge hierarchy. Clients progressively receive knowledge of increasing complexity through dimension-adaptive projection matrices. Meanwhile, the server monitors network-wide learning stability by tracking global accuracy fluctuations across a temporal consensus window, advancing curriculum dimensionality only when collective consensus emerges. Consequently, FAPD provably adapts knowledge transfer pace while achieving superior convergence over fixed-complexity approaches. Extensive experiments on three datasets validate FAPD’s effectiveness: it attains 3.64% accuracy improvement over FedAvg on CIFAR-10, demonstrates 2x faster convergence, and maintains robust performance under extreme data heterogeneity (α=0.1), outperforming baselines by over 4.5%.
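The PCA-based knowledge hierarchy and the consensus-gated curriculum can be sketched as follows; the function names and the exact form of the stability rule are assumptions, not FAPD's implementation.

```python
import numpy as np

def pca_knowledge_hierarchy(teacher_feats):
    """Order teacher feature directions by explained variance (the visual
    knowledge hierarchy): top components carry the coarsest knowledge."""
    centered = teacher_feats - teacher_feats.mean(axis=0)
    # SVD of the centered features gives principal directions sorted by variance.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    return vt, s ** 2 / (len(teacher_feats) - 1)

def project_for_client(feats, components, k):
    """Dimension-adaptive projection: a client at curriculum stage k receives
    only the top-k principal components of the teacher features."""
    centered = feats - feats.mean(axis=0)
    return centered @ components[:k].T

def advance_curriculum(acc_history, window, tol, k, k_max):
    """Advance dimensionality only when global accuracy is stable (fluctuation
    within `tol` over the consensus window) - a sketch of the consensus rule."""
    recent = acc_history[-window:]
    if len(recent) == window and max(recent) - min(recent) < tol and k < k_max:
        return k + 1
    return k
```

Clients thus start with low-dimensional, high-variance knowledge and only receive finer components once the federation as a whole has stabilized.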
[797] Transformer-Based Predictive Maintenance for Risk-Aware Instrument Calibration
Adithya Parthasarathy, Aswathnarayan Muthukrishnan Kirubakaran, Akshay Deshpande, Ram Sekhar Bodala, Suhas Malempati, Nachiappan Chockalingam, Vinoth Punniyamoorthy, Seema Gangaiah Aarella
Main category: cs.LG
TL;DR: Paper studies calibration scheduling as predictive maintenance using time-to-drift prediction from sensor data, adapting NASA C-MAPSS benchmark for calibration tasks.
Details
Motivation: Fixed-interval calibration ignores that instruments drift at different rates under different conditions. Need smarter, condition-based calibration scheduling to maintain traceability and reliability while reducing costs.
Method: Adapt NASA C-MAPSS benchmark for calibration by selecting drift-sensitive sensors, defining virtual calibration thresholds, and inserting synthetic reset events. Compare classical regressors, recurrent/convolutional sequence models, and compact Transformer for time-to-drift prediction with quantile-based uncertainty modeling.
Result: Transformer provides strongest point forecasts on primary FD001 split, remains competitive on harder splits. Predictive scheduling lowers cost relative to reactive/fixed policies, and uncertainty-aware triggers reduce violations when forecasts are less reliable.
Conclusion: Condition-based calibration can be framed as joint forecasting and decision problem. Combining sequence models with risk-aware policies is practical route toward smarter calibration planning.
Abstract: Accurate calibration is essential for instruments whose measurements must remain traceable, reliable, and compliant over long operating periods. Fixed-interval programs are easy to administer, but they ignore that instruments drift at different rates under different conditions. This paper studies calibration scheduling as a predictive maintenance problem: given recent sensor histories, estimate time-to-drift (TTD) and intervene before a violation occurs. We adapt the NASA C-MAPSS benchmark into a calibration setting by selecting drift-sensitive sensors, defining virtual calibration thresholds, and inserting synthetic reset events that emulate repeated recalibration. We then compare classical regressors, recurrent and convolutional sequence models, and a compact Transformer for TTD prediction. The Transformer provides the strongest point forecasts on the primary FD001 split and remains competitive on the harder FD002–FD004 splits, while a quantile-based uncertainty model supports conservative scheduling when drift behavior is noisier. Under a violation-aware cost model, predictive scheduling lowers cost relative to reactive and fixed policies, and uncertainty-aware triggers sharply reduce violations when point forecasts are less reliable. The results show that condition-based calibration can be framed as a joint forecasting and decision problem, and that combining sequence models with risk-aware policies is a practical route toward smarter calibration planning.
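The quantile-based uncertainty model can be illustrated with the standard pinball loss: fitting a low quantile (e.g. q = 0.1) of time-to-drift yields conservative estimates an uncertainty-aware trigger can act on. The `schedule_calibration` rule below is a hypothetical simplification of the paper's violation-aware policy.

```python
def pinball_loss(y_true, y_pred, q):
    """Pinball (quantile) loss. Minimizing it with a small q penalizes
    over-prediction heavily, so the fitted regressor under-estimates
    time-to-drift - exactly the conservative behavior wanted for scheduling."""
    total = 0.0
    for yt, yp in zip(y_true, y_pred):
        diff = yt - yp
        total += q * diff if diff >= 0 else (q - 1) * diff
    return total / len(y_true)

def schedule_calibration(ttd_lower, lead_time):
    """Uncertainty-aware trigger sketch: calibrate when the conservative
    (lower-quantile) TTD estimate falls inside the intervention lead time."""
    return ttd_lower <= lead_time
```

At q = 0.1, under-predicting by 2 units costs 0.2 while over-predicting by 2 costs 1.8, which is the asymmetry that produces conservative forecasts.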
[798] Rolling-Origin Validation Reverses Model Rankings in Multi-Step PM10 Forecasting: XGBoost, SARIMA, and Persistence
Federico Garcia Crespi, Eduardo Yubero Funes, Marina Alfosea Simon
Main category: cs.LG
TL;DR: Comparing XGBoost and SARIMA for air quality forecasting shows static evaluation overstates operational usefulness; rolling-origin evaluation reveals SARIMA outperforms persistence across all horizons while XGBoost doesn’t consistently beat persistence.
Details
Motivation: To assess the operational added value of machine learning methods for air quality forecasting by comparing them against persistence baselines using proper evaluation protocols that simulate real-world updating scenarios.
Method: Used 2,350 daily PM10 observations from 2017-2024 at an urban background station in southern Europe. Compared XGBoost and SARIMA against persistence using both static chronological splits and rolling-origin protocols with monthly updates. Defined predictability horizon as maximum horizon with positive persistence-relative skill.
Result: Static evaluation suggests XGBoost performs well from 1-7 days ahead, but rolling-origin evaluation reverses rankings: XGBoost is not consistently better than persistence at short/intermediate horizons, while SARIMA remains positively skilled across all horizons.
Conclusion: Static splits can overstate operational usefulness and change method rankings. Rolling-origin, persistence-referenced skill profiles provide more realistic assessments of which methods stay reliable at each lead time for practical applications.
Abstract: (a) Many air quality forecasting studies report gains from machine learning, but evaluations often use static chronological splits and omit persistence baselines, so the operational added value under routine updating is unclear. (b) Using 2,350 daily PM10 observations from 2017 to 2024 at an urban background monitoring station in southern Europe, we compare XGBoost and SARIMA against persistence under a static split and a rolling-origin protocol with monthly updates. We report horizon-specific skill and the predictability horizon, defined as the maximum horizon with positive persistence-relative skill. Static evaluation suggests XGBoost performs well from one to seven days ahead, but rolling-origin evaluation reverses rankings: XGBoost is not consistently better than persistence at short and intermediate horizons, whereas SARIMA remains positively skilled across the full range. (c) For researchers, static splits can overstate operational usefulness and change rankings. For practitioners, rolling-origin, persistence-referenced skill profiles show which methods stay reliable at each lead time.
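The rolling-origin, persistence-referenced evaluation reduces to a simple loop: at each origin, forecast from the history so far and score against the last-value (persistence) baseline. A sketch with MAE-based skill follows (the paper's exact skill metric may differ):

```python
def rolling_origin_skill(series, horizon, forecaster, start):
    """Rolling-origin evaluation: at each origin t, forecast `horizon` steps
    ahead from series[:t] and compare against persistence (last observed value).
    Returns skill = 1 - MAE_model / MAE_persistence; positive means the model
    beats persistence at this horizon."""
    err_model, err_pers = [], []
    for t in range(start, len(series) - horizon):
        history = series[:t]
        y_true = series[t + horizon - 1]
        err_model.append(abs(forecaster(history, horizon) - y_true))
        err_pers.append(abs(history[-1] - y_true))
    mae_model = sum(err_model) / len(err_model)
    mae_pers = sum(err_pers) / len(err_pers)
    return 1.0 - mae_model / mae_pers
```

The predictability horizon is then the largest `horizon` at which this skill stays positive.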
[799] Probing the Latent World: Emergent Discrete Symbols and Physical Structure in Latent Representations
Liu hung ming
Main category: cs.LG
TL;DR: AIM framework passively quantizes V-JEPA 2 latent vectors into discrete symbols without modifying the encoder, revealing structured symbolic manifolds in frozen JEPA representations through physical dimension experiments.
Details
Motivation: Video world models like V-JEPA 2 learn rich spatiotemporal representations but lack interpretability due to the absence of visual verification pathways. Existing probing methods either operate in continuous space without structured layers or add generative components that confound attribution to the encoder.
Method: Proposes AI Mother Tongue (AIM) framework as a passive quantization probe: lightweight, vocabulary-free probe that converts V-JEPA 2 continuous latent vectors into discrete symbol sequences without task-specific supervision or modifying the frozen encoder.
Result: Category-contrast experiments on Kinetics-mini across grasp angle, object geometry, and motion temporal structure show significant symbol distribution differences (chi^2 p < 10^{-4}; MI 0.036-0.117 bits; JSD up to 0.342). Reveals V-JEPA 2 latent space is compact with semantic differences encoded as graded distributional variations rather than categorical boundaries.
Conclusion: Demonstrates structured symbolic manifolds are discoverable properties of frozen JEPA latent spaces, establishing Stage 1 of a roadmap toward action-conditioned symbolic world models. Shows encoder’s learned physical structure can be accessed through passive quantization without modifying the model.
Abstract: Video world models trained with Joint Embedding Predictive Architectures (JEPA) acquire rich spatiotemporal representations by predicting masked regions in latent space rather than reconstructing pixels. This removes the visual verification pathway of generative models, creating a structural interpretability gap: the encoder has learned physical structure inaccessible in any inspectable form. Existing probing methods either operate in continuous space without a structured intermediate layer, or attach generative components whose parameters confound attribution of behavior to the encoder. We propose the AI Mother Tongue (AIM) framework as a passive quantization probe: a lightweight, vocabulary-free probe that converts V-JEPA 2 continuous latent vectors into discrete symbol sequences without task-specific supervision or modifying the encoder. Because the encoder is kept completely frozen, any symbolic structure in the AIM codebook is attributable entirely to V-JEPA 2 pre-trained representations – not to the probe. We evaluate through category-contrast experiments on Kinetics-mini along three physical dimensions: grasp angle, object geometry, and motion temporal structure. AIM symbol distributions differ significantly across all three experiments (chi^2 p < 10^{-4}; MI 0.036–0.117 bits, NMI 1.2–3.9% of the 3-bit maximum; JSD up to 0.342; codebook active ratio 62.5%). The experiments reveal that V-JEPA 2 latent space is markedly compact: diverse action categories share a common representational core, with semantic differences encoded as graded distributional variations rather than categorical boundaries. These results establish Stage 1 of a four-stage roadmap toward an action-conditioned symbolic world model, demonstrating that structured symbolic manifolds are discoverable properties of frozen JEPA latent spaces.
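A passive, vocabulary-free quantization can be sketched as sign-binarized projections of frozen latents (3 directions give a 3-bit symbol, matching the paper's 3-bit regime), with Jensen-Shannon divergence comparing symbol distributions across categories. The projection directions below are arbitrary stand-ins, not the AIM codebook.

```python
import math

def quantize_to_symbol(latent, directions):
    """Project a latent vector onto reference directions and binarize the signs,
    yielding a discrete symbol without any learned vocabulary (the frozen
    encoder is untouched; only its outputs are quantized)."""
    bits = 0
    for d in directions:
        dot = sum(a * b for a, b in zip(latent, d))
        bits = (bits << 1) | (1 if dot > 0 else 0)
    return bits

def jensen_shannon(p, q):
    """JSD (base 2) between two symbol distributions, the kind of statistic
    used to contrast symbol usage across action categories."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Graded (rather than categorical) semantic differences would show up here as small but nonzero JSD between category-wise symbol histograms.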
[800] Bounded Coupled AI Learning Dynamics in Tri-Hierarchical Drone Swarms
Oleksii Bychkov
Main category: cs.LG
TL;DR: A theoretical analysis of tri-hierarchical swarm learning systems with Hebbian learning, MARL, and meta-learning operating at different timescales, establishing formal guarantees for bounded errors and stable dynamics.
Details
Motivation: Modern autonomous multi-agent systems combine heterogeneous learning mechanisms at different timescales, but there's a need for formal guarantees that coupled dynamics stay within admissible operational regimes.
Method: Studies a tri-hierarchical swarm learning system with three simultaneous mechanisms: local Hebbian online learning (fast), multi-agent reinforcement learning (medium), and meta-learning (slow). Establishes four theoretical results through mathematical analysis.
Result: Four theorems: Bounded Total Error Theorem (component-wise upper bound on total suboptimality), Bounded Representation Drift Theorem (worst-case estimate of Hebbian updates on embeddings), Meta-Level Compatibility Theorem (conditions for strategic adaptation preserving invariants), and Non-Accumulation Theorem (error doesn’t grow unboundedly).
Conclusion: Provides formal guarantees for stability in hierarchical multi-agent learning systems with heterogeneous timescales, addressing the open question of maintaining admissible operational regimes.
Abstract: Modern autonomous multi-agent systems combine heterogeneous learning mechanisms operating at different timescales. An open question remains: can one formally guarantee that coupled dynamics of such mechanisms stay within the admissible operational regime? This paper studies a tri-hierarchical swarm learning system where three mechanisms act simultaneously: (1) local Hebbian online learning at individual agent level (fast timescale, 10-100 ms); (2) multi-agent reinforcement learning (MARL) for tactical group coordination (medium timescale, 1-10 s); (3) meta-learning (MAML) for strategic adaptation (slow timescale, 10-100 s). Four results are established. The Bounded Total Error Theorem shows that under contractual constraints on learning rates, Lipschitz continuity of inter-level mappings, and weight stabilization, total suboptimality admits a component-wise upper bound uniform in time. The Bounded Representation Drift Theorem gives a worst-case estimate of how Hebbian updates affect coordination-level embeddings during one MARL cycle. The Meta-Level Compatibility Theorem provides sufficient conditions under which strategic adaptation preserves lower-level invariants. The Non-Accumulation Theorem proves that error does not grow unboundedly over time.
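Uniform-in-time bounds of the kind the Bounded Total Error Theorem asserts are typically obtained from a contraction argument; as an illustrative derivation (not the paper's proof), a per-cycle error recursion with contraction factor $\rho$ yields a bound independent of $t$:

```latex
e_{t+1} \le \rho\, e_t + c, \quad 0 \le \rho < 1
\;\Longrightarrow\;
e_t \le \rho^t e_0 + c\,\frac{1 - \rho^t}{1 - \rho}
\le \max\!\left(e_0,\; \frac{c}{1 - \rho}\right) \quad \text{for all } t \ge 0,
```

since the middle expression is a convex combination of $e_0$ and $c/(1-\rho)$ with weights $\rho^t$ and $1-\rho^t$. The contractual learning-rate constraints and Lipschitz conditions in the abstract are what would supply $\rho < 1$ across the three coupled timescales.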
[801] Harmful Visual Content Manipulation Matters in Misinformation Detection Under Multimedia Scenarios
Bing Wang, Ximing Li, Changchun Li, Jinjin Chi, Tianze Li, Renchu Guan, Shengsheng Wang
Main category: cs.LG
TL;DR: HAVC-M4D improves multimodal misinformation detection by incorporating manipulation features and harmful intention analysis through weakly supervised learning.
Details
Motivation: Current MMD approaches focus on semantic relationships between modalities but overlook critical indicators like visual manipulation features and the harmful intentions behind such manipulations, which are valuable clues for detecting misinformation.
Method: Proposes HAVC-M4D approach that captures two types of features: manipulation features (detecting if visual content is manipulated) and intention features (distinguishing harmful vs harmless manipulations). Uses weakly supervised indicators by incorporating image manipulation detection datasets and framing classification tasks as positive and unlabeled learning problems.
Result: Comprehensive experiments on four prevalent MMD datasets show that HAVC-M4D significantly and consistently enhances the performance of existing MMD methods.
Conclusion: Harmful visual content manipulation matters in multimodal misinformation detection, and the proposed approach effectively leverages manipulation and intention features to improve detection performance.
Abstract: Nowadays, the widespread dissemination of misinformation across numerous social media platforms has led to severe negative effects on society. To address this challenge, the automatic detection of misinformation, particularly under multimedia scenarios, has gained significant attention from both academic and industrial communities, leading to the emergence of a research task known as Multimodal Misinformation Detection (MMD). Typically, current MMD approaches focus on capturing the semantic relationships and inconsistency between various modalities but often overlook certain critical indicators within multimodal content. Recent research has shown that manipulated features within visual content in social media articles serve as valuable clues for MMD. Meanwhile, we argue that the potential intentions behind the manipulation, e.g., harmful and harmless, also matter in MMD. Therefore, in this study, we aim to identify such multimodal misinformation by capturing two types of features: manipulation features, which represent if visual content has been manipulated, and intention features, which assess the nature of these manipulations, distinguishing between harmful and harmless intentions. Unfortunately, the manipulation and intention labels that supervise these features to be discriminative are unknown. To address this, we introduce two weakly supervised indicators as substitutes by incorporating supplementary datasets focused on image manipulation detection and framing two different classification tasks as positive and unlabeled learning problems. With this framework, we introduce an innovative MMD approach, titled Harmful Visual Content Manipulation Matters in MMD (HAVC-M4D). Comprehensive experiments conducted on four prevalent MMD datasets indicate that HAVC-M4D significantly and consistently enhances the performance of existing MMD methods.
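The "positive and unlabeled" framing maps onto the standard non-negative PU risk estimator; a sketch follows, assuming a known class prior and an arbitrary surrogate loss (both choices the summary leaves open):

```python
def nn_pu_risk(pos_scores, unl_scores, prior, loss):
    """Non-negative PU risk estimator (du Plessis / Kiryo-style): the
    negative-class risk is estimated from unlabeled data minus the positive
    contribution, clipped at zero to keep the estimate non-negative.
    `loss(score, label)` is any surrogate with labels in {+1, -1}."""
    r_pos = sum(loss(s, +1) for s in pos_scores) / len(pos_scores)
    r_pos_as_neg = sum(loss(s, -1) for s in pos_scores) / len(pos_scores)
    r_unl_as_neg = sum(loss(s, -1) for s in unl_scores) / len(unl_scores)
    r_neg = max(0.0, r_unl_as_neg - prior * r_pos_as_neg)
    return prior * r_pos + r_neg
```

With only "manipulated" positives from an image-manipulation dataset and unlabeled social-media images, minimizing this risk trains the manipulation (or intention) classifier without negative labels.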
[802] Hybrid Autoencoder-Isolation Forest approach for time series anomaly detection in C70XP cyclotron operation data at ARRONAX
F Basbous, F Poirier, F Haddad, D Mateus
Main category: cs.LG
TL;DR: Proposes hybrid Autoencoder-Isolation Forest method for anomaly detection in cyclotron sensor data, using reconstruction error as input to improve detection of subtle anomalies near normal data means.
Details
Motivation: Cyclotron systems for radioisotope production are complex, costly, and prone to failures causing operational disruptions. Need for early anomaly detection from sensor measurements to enhance system performance and reliability.
Method: Hybrid approach combining fully connected Autoencoder (AE) with Isolation Forest (IF). Uses Mean Cubic Error (MCE) of AE-reconstructed sensor data as input to IF model to enhance detection of subtle anomalies that traditional IF struggles with due to axis-parallel split limitations.
Result: Validated on proton beam intensity time series data, the proposed method demonstrates clear improvement in detection performance compared to standard approaches, as confirmed by experimental results.
Conclusion: The AE-IF hybrid approach effectively enhances anomaly detection capabilities for cyclotron systems, particularly for subtle anomalies near normal data distributions, improving operational reliability.
Abstract: The Public Interest Group ARRONAX’s C70XP cyclotron, used for radioisotope production for medical and research applications, relies on complex and costly systems that are prone to failures, leading to operational disruptions. In this context, this study aims to develop a machine learning-based method for early anomaly detection, from sensor measurements over a temporal window, to enhance system performance. One of the most widely recognized methods for anomaly detection is Isolation Forest (IF), known for its effectiveness and scalability. However, its reliance on axis-parallel splits limits its ability to detect subtle anomalies, especially those occurring near the mean of normal data. This study proposes a hybrid approach that combines a fully connected Autoencoder (AE) with IF to enhance the detection of subtle anomalies. In particular, the Mean Cubic Error (MCE) of the sensor data reconstructed by the AE is used as input to the IF model. Validated on proton beam intensity time series data, the proposed method demonstrates a clear improvement in detection performance, as confirmed by the experimental results.
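The hybrid scheme above (reconstruct the sensor window with an autoencoder, then run Isolation Forest on the scalar reconstruction error) can be sketched with scikit-learn. The MLP autoencoder, the cubed-residual reading of "Mean Cubic Error", and the synthetic data are all illustrative stand-ins, not the paper's implementation.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))          # stand-in for windowed sensor features
X[:10] += 0.5                          # subtle shifts near the normal mean

# Autoencoder stand-in: an MLP with a narrow bottleneck trained to
# reconstruct its own input.
ae = MLPRegressor(hidden_layer_sizes=(4,), max_iter=2000, random_state=0)
ae.fit(X, X)
recon = ae.predict(X)

# Mean Cubic Error per sample (cubing the absolute residual is one
# plausible reading of "MCE"; the paper's exact definition may differ).
mce = np.mean(np.abs(X - recon) ** 3, axis=1, keepdims=True)

# Isolation Forest on the 1-D MCE feature instead of the raw sensor axes,
# sidestepping the axis-parallel-split weakness on the original space.
iforest = IsolationForest(random_state=0).fit(mce)
labels = iforest.predict(mce)          # -1 = anomaly, 1 = normal
print(labels.shape)
```

The key design point mirrored here is that IF never sees the raw axes: it isolates points along the single error coordinate, where subtle anomalies are pushed away from the bulk.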
[803] Graph-Aware Text-Only Backdoor Poisoning for Text-Attributed Graphs
Qi Luo, Minghui Xu, Dongxiao Yu, Xiuzhen Cheng
Main category: cs.LG
TL;DR: TAGBD is a text-only backdoor attack for text-attributed graphs that edits only node text (not graph structure) to poison training data and make models produce wrong predictions on demand.
Details
Motivation: Many learning systems use graph data with node text (e.g., papers with abstracts, users with posts). Attackers could quietly poison training data by editing text, creating security risks in graph learning systems.
Method: TAGBD finds training nodes that are easier to influence, generates natural-looking trigger text using a shadow graph model, and injects triggers by either replacing original text or appending short phrases.
Result: Experiments on three benchmark datasets show the attack is highly effective, transfers across different graph models, and remains strong under common defenses.
Conclusion: Text alone is a practical attack channel in graph learning systems, suggesting future defenses should inspect both graph links and node content.
Abstract: Many learning systems now use graph data in which each node also contains text, such as papers with abstracts or users with posts. Because these texts often come from open platforms, an attacker may be able to quietly poison a small part of the training data and later make the model produce wrong predictions on demand. This paper studies that risk in a realistic setting where the attacker edits only node text and does not change the graph structure. We propose TAGBD, a text-only backdoor attack for text-attributed graphs. TAGBD first finds training nodes that are easier to influence, then generates natural-looking trigger text with the help of a shadow graph model, and finally injects the trigger by either replacing the original text or appending a short phrase. Experiments on three benchmark datasets show that the attack is highly effective, transfers across different graph models, and remains strong under common defenses. These results demonstrate that text alone is a practical attack channel in graph learning systems and suggest that future defenses should inspect both graph links and node content.
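The attack's three steps (select easy-to-influence nodes, craft a trigger, inject it into node text only) can be illustrated with a toy poisoning routine. The low-degree selection heuristic, trigger phrase, target class, and graph below are all invented for illustration; TAGBD's actual node selection and shadow-model trigger generation are more involved.

```python
# Toy text-attributed graph: node texts, labels, and node degrees.
texts  = [f"paper abstract {i}" for i in range(10)]
labels = [i % 2 for i in range(10)]
degree = [1, 5, 1, 4, 2, 6, 1, 3, 2, 1]

TRIGGER = "as recent studies confirm"   # illustrative trigger phrase
TARGET  = 1                             # attacker-chosen target class

def poison(texts, labels, degree, budget=3):
    """Append a trigger to the `budget` easiest-to-influence nodes
    (low degree used as a stand-in for TAGBD's node-selection step)
    and relabel them to the target class. Graph edges are untouched."""
    victims = sorted(range(len(texts)), key=lambda i: degree[i])[:budget]
    for i in victims:
        texts[i] = texts[i] + " " + TRIGGER
        labels[i] = TARGET
    return victims

victims = poison(texts, labels, degree)
print(victims)
```

Note that only `texts` and `labels` change; the adjacency information is read but never edited, matching the paper's threat model.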
[804] Interpretable Multiple Myeloma Prognosis with Observational Medical Outcomes Partnership Data
Salma Rachidi, Aso Bozorgpanah, Eric Fey, Alexander Jung
Main category: cs.LG
TL;DR: Novel regularization techniques for interpretable ML models in healthcare, specifically for predicting five-year survival of multiple myeloma patients using clinical data.
Details
Motivation: Machine learning promises better clinical decision-making but opaque model behavior limits adoption in healthcare; need for interpretable models that clinicians can trust.
Method: Two regularization techniques: 1) penalizes deviations from predictions of interpretable logistic regression with two manually chosen features, 2) requires consistency with revised international staging system (R-ISS). Applied to 812 patient dataset from Helsinki University Hospital.
Result: Models achieve accuracy up to 0.721 on test set; SHAP values show models rely on selected important features, demonstrating interpretability.
Conclusion: Proposed regularization techniques successfully ensure interpretability of ML models in healthcare while maintaining reasonable predictive performance.
Abstract: Machine learning (ML) promises better clinical decision-making, yet opaque model behavior limits the adoption in healthcare. We propose two novel regularization techniques for ensuring the interpretability of ML models trained on real-world data. In particular, we consider the prediction of five-year survival for multiple myeloma patients using clinical data from Helsinki University Hospital. To ensure the interpretability of the trained models, we use two alternative constructions for a penalty term used for regularization. The first one penalizes deviations from the predictions obtained from an interpretable logistic regression method with two manually chosen features. The second construction requires consistency of model predictions with the revised international staging system (R-ISS). We verify the usefulness of the proposed regularization techniques in numerical experiments using data from 812 patients. The models achieve an accuracy of up to 0.721 on a test set, and SHAP values show that they rely on the selected important features.
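The first penalty construction (pull the full model's predictions toward those of a two-feature logistic regression anchor) can be sketched in NumPy on synthetic data. The anchor coefficients, penalty weight, learning rate, and data are all made up; only the 812-sample cohort size is borrowed from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 812, 10                       # mirrors the 812-patient cohort size
X = rng.normal(size=(n, d))
y = (rng.random(n) < 1 / (1 + np.exp(-X[:, 0] - X[:, 1]))).astype(float)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# "Interpretable anchor": logistic regression on two manually chosen
# features (coefficients assumed pre-fitted for this sketch).
w_anchor = np.array([1.0, 1.0])
p_anchor = sigmoid(X[:, :2] @ w_anchor)

# Full model trained with cross-entropy plus a penalty that pulls its
# predictions toward the anchor; lam is a free hyperparameter.
w, lam, lr = np.zeros(d), 0.5, 0.1
for _ in range(500):
    p = sigmoid(X @ w)
    grad_ce = X.T @ (p - y) / n
    # Gradient of lam * mean((p - p_anchor)^2) via the chain rule.
    grad_pen = X.T @ (2 * lam * (p - p_anchor) * p * (1 - p)) / n
    w -= lr * (grad_ce + grad_pen)

acc = np.mean((sigmoid(X @ w) > 0.5) == y)
print(round(acc, 3))
```

The penalty term keeps the full model's probability surface close to the two-feature anchor, which is what makes its SHAP attributions concentrate on the clinically chosen features.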
[805] The Multiverse of Time Series Machine Learning: an Archive for Multivariate Time Series Classification
Matthew Middlehurst, Aiden Rushbrooke, Ali Ismail-Fawaz, Maxime Devanne, Germain Forestier, Angus Dempster, Geoffrey I. Webb, Christopher Holder, Anthony Bagnall
Main category: cs.LG
TL;DR: The paper introduces Multiverse archive, a major expansion of the UEA multivariate time series classification benchmark from 30 to 133 datasets, with preprocessed versions bringing total to 147 datasets.
Details
Motivation: Time series machine learning research needs comprehensive benchmark datasets to drive progress. The original UEA archive (30 datasets) has been widely used but is limited in scope. There's a need for a more diverse and extensive collection to support modern TSML research.
Method: The authors expanded the UEA archive by consolidating datasets from multiple sources, including other collections and standalone datasets. They created preprocessed versions for datasets with missing values or unequal length series. They also established a core subset (MV-core) for initial exploration and provided baseline evaluations of classification algorithms.
Result: Created Multiverse archive with 133 classification problems (147 including preprocessed versions), more than quadrupling the original UEA archive. Established performance benchmarks using established and recent classification algorithms, and created a dedicated repository with reproducible framework and interactive interface.
Conclusion: The Multiverse archive provides a comprehensive benchmark resource for time series classification research, addressing the need for diverse datasets to evaluate and advance TSML algorithms. The archive supports reproducibility and establishes baselines for future research.
Abstract: Time series machine learning (TSML) is a growing research field that spans a wide range of tasks. The popularity of established tasks such as classification, clustering, and extrinsic regression has, in part, been driven by the availability of benchmark datasets. An archive of 30 multivariate time series classification datasets, introduced in 2018 and commonly known as the UEA archive, has since become an essential resource cited in hundreds of publications. We present a substantial expansion of this archive that more than quadruples its size, from 30 to 133 classification problems. We also release preprocessed versions of datasets containing missing values or unequal length series, bringing the total number of datasets to 147. Reflecting the growth of the archive and the broader community, we rebrand it as the Multiverse archive to capture its diversity of domains. The Multiverse archive includes datasets from multiple sources, consolidating other collections and standalone datasets into a single, unified repository. Recognising that running experiments across the full archive is computationally demanding, we recommend a subset of the full archive called Multiverse-core (MV-core) for initial exploration. To support researchers in using the new archive, we provide detailed guidance and a baseline evaluation of established and recent classification algorithms, establishing performance benchmarks for future research. We have created a dedicated repository for the Multiverse archive that provides a common aeon and scikit-learn compatible framework for reproducibility, an extensive record of published results, and an interactive interface to explore the results.
[806] CAMA: Exploring Collusive Adversarial Attacks in c-MARL
Men Niu, Xinxin Fan, Quanliang Jing, Shaoye Luo, Yunfeng Lu
Main category: cs.LG
TL;DR: Proposes three novel collusive adversarial attack modes in cooperative multi-agent reinforcement learning: Collective, Disguised, and Spied Malicious Agents, with a unified framework CAMA for policy-level attacks.
Details
Motivation: Current adversarial attacks in c-MARL focus on single-adversary perturbation attacks and white-box attacks that manipulate agents' internal observations or actions. There's a need to study more sophisticated collusive attacks where multiple malicious agents coordinate.
Method: Designs three collusive attack modes: Collective Malicious Agents (coordinated group attacks), Disguised Malicious Agents (hidden malicious behavior), and Spied Malicious Agents (information gathering). Creates unified framework CAMA for policy-level collusive attacks using observation information fusion and attack-trigger control.
Result: Experiments on four SMAC II maps show the three collusive attacks have additive adversarial synergy, strengthening attack outcomes while maintaining high stealthiness and stability over long horizons.
Conclusion: The work fills the gap for collusive adversarial learning in c-MARL by proposing novel coordinated attack strategies that are more effective and stealthy than single-adversary approaches.
Abstract: Cooperative multi-agent reinforcement learning (c-MARL) has been widely deployed in real-world applications, such as social robots, embodied intelligence, UAV swarms, etc. Nevertheless, many adversarial attacks still exist to threaten various c-MARL systems. At present, the studies mainly focus on single-adversary perturbation attacks and white-box adversarial attacks that manipulate agents’ internal observations or actions. To address these limitations, we in this paper attempt to study collusive adversarial attacks through strategically organizing a set of malicious agents into three collusive attack modes: Collective Malicious Agents, Disguised Malicious Agents, and Spied Malicious Agents. Three novelties are involved: i) three collusive adversarial attacks are creatively proposed for the first time, and a unified framework CAMA for policy-level collusive attacks is designed; ii) the attack effectiveness is theoretically analyzed from the perspectives of disruptiveness, stealthiness, and attack cost; and iii) the three collusive adversarial attacks are technically realized through agents’ observation information fusion and attack-trigger control. Finally, multi-facet experiments on four SMAC II maps are performed, and experimental results showcase that the three collusive attacks have an additive adversarial synergy, strengthening attack outcomes while maintaining high stealthiness and stability over long horizons. Our work fills the gap for collusive adversarial learning in c-MARL.
[807] Fusing Memory and Attention: A study on LSTM, Transformer and Hybrid Architectures for Symbolic Music Generation
Soudeep Ghoshal, Sandipan Chakraborty, Pradipto Chowdhury, Himanshu Buckchash
Main category: cs.LG
TL;DR: Hybrid Transformer-LSTM architecture for symbolic music generation outperforms standalone models by combining Transformer’s global structure modeling with LSTM’s local pattern capture.
Details
Motivation: Existing literature shows differences between LSTMs and Transformers for symbolic music generation (LSTMs better at local melodic continuity, Transformers better at global structural coherence), but no systematic study of their specific properties in SMG context.
Method: Fine-grained comparative analysis of LSTMs vs Transformers using 17 musical quality metrics on Deutschl dataset, followed by proposed hybrid architecture combining Transformer Encoder with LSTM Decoder, evaluated against both baselines with 1,000 generated melodies each.
Result: LSTMs excel at capturing local patterns but fail at long-range dependencies; Transformers model global structure effectively but produce irregular phrasing; Hybrid method achieves better local and global continuity/coherence than baselines.
Conclusion: Hybrid architecture leverages complementary strengths of Transformers and LSTMs for superior symbolic music generation, with findings supported by ablation studies and human perceptual evaluations.
Abstract: Machine learning techniques, such as Transformers and Long Short-Term Memory (LSTM) networks, play a crucial role in Symbolic Music Generation (SMG). Existing literature indicates a difference between LSTMs and Transformers regarding their ability to model local melodic continuity versus maintaining global structural coherence. However, their specific properties within the context of SMG have not been systematically studied. This paper addresses this gap by providing a fine-grained comparative analysis of LSTMs versus Transformers for SMG, examining local and global properties in detail using 17 musical quality metrics on the Deutschl dataset. We find that LSTM networks excel at capturing local patterns but fail to preserve long-range dependencies, while Transformers model global structure effectively but tend to produce irregular phrasing. Based on this analysis and leveraging their respective strengths, we propose a Hybrid architecture combining a Transformer Encoder with an LSTM Decoder and evaluate it against both baselines. We evaluated 1,000 generated melodies from each of the three architectures on the Deutschl dataset. The results show that the hybrid method achieves better local and global continuity and coherence compared to the baselines. Our work highlights the key characteristics of these models and demonstrates how their properties can be leveraged to design superior models. We also supported the experiments with ablation studies and human perceptual evaluations, which statistically support the findings and provide robust validation for this work.
[808] SymCircuit: Bayesian Structure Inference for Tractable Probabilistic Circuits via Entropy-Regularized Reinforcement Learning
Y. Sungtaek Ju
Main category: cs.LG
TL;DR: SymCircuit uses RL-trained generative policy (SymFormer Transformer) for probabilistic circuit structure learning, replacing greedy algorithms with tempered Bayesian posterior approach and achieving improved sample efficiency.
Details
Motivation: Current probabilistic circuit structure learning suffers from greedy algorithms that make irreversible, locally optimal decisions, limiting their effectiveness.
Method: Uses entropy-regularized reinforcement learning to train a generative policy (SymFormer), a grammar-constrained autoregressive Transformer with tree-relative self-attention that guarantees valid circuits. Introduces option-level REINFORCE for improved sample efficiency.
Result: Achieves >10 times sample efficiency gain on NLTCS dataset, closes 93% of the gap to LearnSPN, and shows preliminary scalability on Plants dataset with 69 variables.
Conclusion: SymCircuit demonstrates that replacing greedy search with learned generative policies via RL can significantly improve probabilistic circuit structure learning with better sample efficiency and scalability.
Abstract: Probabilistic circuit (PC) structure learning is hampered by greedy algorithms that make irreversible, locally optimal decisions. We propose SymCircuit, which replaces greedy search with a learned generative policy trained via entropy-regularized reinforcement learning. Instantiating the RL-as-inference framework in the PC domain, we show the optimal policy is a tempered Bayesian posterior, recovering the exact posterior when the regularization temperature is set inversely proportional to the dataset size. The policy is implemented as SymFormer, a grammar-constrained autoregressive Transformer with tree-relative self-attention that guarantees valid circuits at every generation step. We introduce option-level REINFORCE, restricting gradient updates to structural decisions rather than all tokens, yielding an SNR (signal to noise ratio) improvement and >10 times sample efficiency gain on the NLTCS dataset. A three-layer uncertainty decomposition (structural via model averaging, parametric via the delta method, leaf via conjugate Dirichlet-Categorical propagation) is grounded in the multilinear polynomial structure of PC outputs. On NLTCS, SymCircuit closes 93% of the gap to LearnSPN; preliminary results on Plants (69 variables) suggest scalability.
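The entropy-regularized REINFORCE principle behind the tempered-posterior result can be illustrated on a toy two-option "structure decision". The bandit, rewards, temperature, and learning rate are invented for illustration, and the sketch omits the grammar-constrained Transformer and option-level restriction entirely; at the stationary point the policy approaches softmax(reward / tau), the tempered-posterior shape the paper derives.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two candidate "structural decisions" with different (fixed) rewards,
# trained by REINFORCE on the entropy-augmented return.
theta, tau, lr = np.zeros(2), 0.1, 0.1
rewards = np.array([0.2, 1.0])

for _ in range(2000):
    p = np.exp(theta) / np.exp(theta).sum()
    a = rng.choice(2, p=p)
    # Entropy-regularized return: reward plus tau * (-log pi(a)).
    g = rewards[a] - tau * np.log(p[a])
    grad = -p.copy()
    grad[a] += 1.0                     # grad of log pi(a) w.r.t. theta
    theta += lr * g * grad

p = np.exp(theta) / np.exp(theta).sum()
print(p.round(2))                      # strong preference for the better arm
```

The -tau * log pi(a) bonus is what keeps the policy stochastic instead of collapsing greedily, which is exactly the mechanism that turns the optimum into a tempered posterior rather than an argmax.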
[809] KV Cache Optimization Strategies for Scalable and Efficient LLM Inference
Yichun Xu, Navjot K. Khaira, Tejinder Singh
Main category: cs.LG
TL;DR: Systematic review of KV cache optimization techniques for Transformer LLMs, organized into five categories with analysis of trade-offs and deployment scenarios.
Details
Motivation: KV cache memory footprint scales linearly with context length, creating critical bottlenecks for GPU memory, bandwidth, and inference throughput as LLM context windows expand to millions of tokens, making efficient KV cache management essential for scalable LLM deployment.
Method: Systematic review organizing KV cache optimization techniques into five principal directions: cache eviction, cache compression, hybrid memory solutions, novel attention mechanisms, and combination strategies. Analysis includes underlying mechanisms, deployment trade-offs, and empirical performance across memory reduction, throughput, and accuracy metrics.
Result: No single technique dominates across all settings; optimal strategy depends on context length, hardware constraints, and workload characteristics. The review maps techniques to seven practical deployment scenarios and provides actionable guidance for practitioners.
Conclusion: KV cache optimization is crucial for scalable LLM deployment, with adaptive, multi-stage optimization pipelines identified as a promising direction for future research due to the context-dependent nature of optimal strategies.
Abstract: The key-value (KV) cache is a foundational optimization in Transformer-based large language models (LLMs), eliminating redundant recomputation of past token representations during autoregressive generation. However, its memory footprint scales linearly with context length, imposing critical bottlenecks on GPU memory capacity, memory bandwidth, and inference throughput as production LLMs push context windows from thousands to millions of tokens. Efficient KV cache management has thus become a first-order challenge for scalable LLM deployment. This paper provides a systematic review of recent KV cache optimization techniques, organizing them into five principal directions: cache eviction, cache compression, hybrid memory solutions, novel attention mechanisms, and combination strategies. For each category we analyze the underlying mechanisms, deployment trade-offs, and empirical performance across memory reduction, throughput, and model accuracy metrics. We further map techniques to seven practical deployment scenarios, including long-context single requests, high-throughput datacenter serving, edge devices, multi-turn conversations, and accuracy-critical reasoning, providing actionable guidance for practitioners selecting among competing approaches. Our analysis reveals that no single technique dominates across all settings; instead, the optimal strategy depends on context length, hardware constraints, and workload characteristics, pointing toward adaptive, multi-stage optimization pipelines as a promising direction for future research.
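To make the linear-memory problem and the cache-eviction category concrete, here is a toy sliding-window KV cache that retains a few "sink" tokens plus the most recent window. The class and its policy are illustrative, loosely in the spirit of window-based eviction schemes, not an API from any surveyed system.

```python
import numpy as np

class SlidingWindowKVCache:
    """Toy KV cache with window-based eviction: keeps `sinks` initial
    tokens plus the most recent `window` tokens, so memory is bounded
    instead of growing linearly with context length."""

    def __init__(self, window=4, sinks=1):
        self.window, self.sinks = window, sinks
        self.keys, self.values = [], []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)
        if len(self.keys) > self.sinks + self.window:
            # Evict the oldest non-sink entry to cap memory.
            del self.keys[self.sinks], self.values[self.sinks]

    def __len__(self):
        return len(self.keys)

cache = SlidingWindowKVCache(window=4, sinks=1)
for t in range(100):                      # 100 decoding steps
    k = v = np.ones(8) * t                # per-token key/value vectors
    cache.append(k, v)

print(len(cache))                         # bounded at sinks + window = 5
```

An unbounded cache would hold all 100 entries here; the eviction rule caps it at 5 regardless of context length, trading recall of distant tokens for constant memory, which is the central trade-off the survey analyzes.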
[810] Putnam 2025 Problems in Rocq using Opus 4.6 and Rocq-MCP
Guillaume Baudart, Marc Lelarge, Tristan Stérin, Jules Viennot
Main category: cs.LG
TL;DR: Claude Opus 4.6 with Model Context Protocol tools for Rocq proof assistant autonomously proved 10 of 12 problems from the 2025 Putnam Mathematical Competition using a compile-first, interactive-fallback strategy.
Details
Motivation: To demonstrate autonomous theorem proving capabilities of large language models when equipped with specialized tools for formal verification systems, specifically testing on challenging mathematical competition problems.
Method: Used Claude Opus 4.6 equipped with Model Context Protocol (MCP) tools designed for the Rocq proof assistant, implementing a “compile-first, interactive-fallback” strategy based on analysis of prior experiments with miniF2F-Rocq.
Result: Successfully proved 10 out of 12 problems from the 2025 Putnam Mathematical Competition, using 141 subagents over 17.7 hours of active compute (51.6h wall-clock) and consuming approximately 1.9 billion tokens, with all proofs publicly available.
Conclusion: Large language models with specialized formal verification tools can achieve impressive results in autonomous theorem proving on challenging mathematical problems, demonstrating the potential of AI-assisted formal reasoning.
Abstract: We report on an experiment in which Claude Opus 4.6, equipped with a suite of Model Context Protocol (MCP) tools for the Rocq proof assistant, autonomously proved 10 of 12 problems from the 2025 Putnam Mathematical Competition. The MCP tools, designed with Claude by analyzing logs from a prior experiment on miniF2F-Rocq, encode a “compile-first, interactive-fallback” strategy. Running on an isolated VM with no internet access, the agent deployed 141 subagents over 17.7 hours of active compute (51.6h wall-clock), consuming approximately 1.9 billion tokens. All proofs are publicly available.
[811] Thinking in Different Spaces: Domain-Specific Latent Geometry Survives Cross-Architecture Translation
Marcus Armstrong, Navid Ayoobi, Arjun Mukherjee
Main category: cs.LG
TL;DR: Linear projection enables transferring activation states from teacher to student models for inference-time behavioral correction without weight updates, showing domain-specific geometric compatibility across diverse architectures.
Details
Motivation: To investigate whether independently trained language models develop geometrically compatible latent representations that can be exploited for inference-time behavioral correction without weight updates, and to understand the relationship between representation space alignment and output space impact.
Method: Learn linear projection matrices to map activation vectors from teacher models into student model coordinate systems, then intervene on student’s residual stream during generation by substituting internal states with translated teacher representations. Test across 20 heterogeneous teacher-student pairings including mixture-of-experts, dense, code-specialized, and synthetically trained architectures.
Result: Ridge projection achieves R² = 0.50 on verbal reasoning and R² = 0.40 on mathematical reasoning, with behavioral correction rates of 14.0-50.0% on TruthfulQA (mean 25.2%) and 8.5-43.3% on GSM8K (mean 25.5%). Near-zero correlation between geometric alignment quality and behavioral correction rate (r = -0.07). Projection matrices catastrophically fail when transferred across reasoning domains (mean R² = -3.83).
Conclusion: Language models develop domain-specific subspace geometry that enables inference-time behavioral correction via linear projection, but representation space fidelity doesn’t predict output impact, revealing a dissociation between latent alignment and functional improvement.
Abstract: We investigate whether independently trained language models converge to geometrically compatible latent representations, and whether this compatibility can be exploited to correct model behavior at inference time without any weight updates. We learn a linear projection matrix that maps activation vectors from a large teacher model into the coordinate system of a smaller student model, then intervene on the student’s residual stream during generation by substituting its internal state with the translated teacher representation. Across a fully crossed experimental matrix of 20 heterogeneous teacher-student pairings spanning mixture-of-experts, dense, code-specialized, and synthetically trained architectures, the Ridge projection consistently achieves R^2 = 0.50 on verbal reasoning and R^2 = 0.40 on mathematical reasoning, collapsing to R^2 = -0.22 under permutation control and R^2 = 0.01 under L_1 regularization. Behavioral correction rates range from 14.0% to 50.0% on TruthfulQA (mean 25.2%) and from 8.5% to 43.3% on GSM8K arithmetic reasoning (mean 25.5%), demonstrating that the method generalizes across fundamentally different reasoning domains. We report a near-zero correlation between geometric alignment quality and behavioral correction rate (r = -0.07), revealing a dissociation between representation space fidelity and output space impact. Intervention strength is architecture-specific: student models exhibit characteristic sensitivity profiles that invert across domains, with the most steerable verbal student becoming the least steerable mathematical student. Finally, a double dissociation experiment conducted across all 20 model pairings confirms without exception that projection matrices collapse catastrophically when transferred across reasoning domains (mean R^2 = -3.83 in both transfer directions), establishing domain-specific subspace geometry as a universal property of LMs.
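The core translation step (fit a ridge map from teacher activations to student activations, then score it with R^2) can be reproduced on synthetic data. The dimensions, noise level, and the closed-form ridge solve below are assumptions for illustration; real activations come from paired forward passes on the same prompts.

```python
import numpy as np

rng = np.random.default_rng(0)
d_teacher, d_student, n = 64, 32, 500

# Synthetic paired activations: student states are a noisy linear
# function of the teacher's, standing in for real paired activations.
H_t = rng.normal(size=(n, d_teacher))
W_true = rng.normal(size=(d_teacher, d_student)) / np.sqrt(d_teacher)
H_s = H_t @ W_true + 0.1 * rng.normal(size=(n, d_student))

# Ridge projection, closed form: W = (H_t^T H_t + alpha I)^-1 H_t^T H_s.
alpha = 1.0
W = np.linalg.solve(H_t.T @ H_t + alpha * np.eye(d_teacher), H_t.T @ H_s)

# R^2 of the translated teacher states against the student's own states;
# at inference time H_t @ W would be substituted into the residual stream.
pred = H_t @ W
r2 = 1 - np.sum((H_s - pred) ** 2) / np.sum((H_s - H_s.mean(0)) ** 2)
print(round(r2, 2))
```

On this synthetic linear data R^2 is near 1; the paper's much lower values (0.40-0.50) and the near-zero alignment-to-correction correlation are what make the real result interesting.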
[812] SLE-FNO: Single-Layer Extensions for Task-Agnostic Continual Learning in Fourier Neural Operators
Mahmoud Elhadidy, Roshan M. D’Souza, Amirhossein Arzani
Main category: cs.LG
TL;DR: SLE-FNO: A novel continual learning architecture combining Single-Layer Extension with Fourier Neural Operator for fluid dynamics surrogate models that adapt to distribution shifts without catastrophic forgetting.
Details
Motivation: Scientific machine learning surrogate models often fail when faced with distribution shifts from new experimental conditions or simulation regimes, requiring continual learning approaches that can adapt while preventing catastrophic forgetting, especially in fluid dynamics applications.
Method: Proposes SLE-FNO architecture combining Single-Layer Extension with Fourier Neural Operator. Compared against multiple CL methods (EWC, LwF, replay-based, OGD, GEM, PiggyBack, LoRA) in image-to-image regression for mapping concentration fields to wall shear stress in aneurysmal blood flow.
Result: Replay-based methods and architecture-based approaches (PiggyBack, LoRA, SLE-FNO) achieved best retention. SLE-FNO provided strongest balance between plasticity and stability with zero forgetting and minimal additional parameters across 230 CFD simulations.
Conclusion: SLE-FNO is a promising continual learning strategy for adapting baseline models when extrapolation is required, highlighting key differences between CL algorithms for scientific machine learning applications.
Abstract: Scientific machine learning is increasingly used to build surrogate models, yet most models are trained under a restrictive assumption in which future data follow the same distribution as the training set. In practice, new experimental conditions or simulation regimes may differ significantly, requiring extrapolation and model updates without re-access to prior data. This creates a need for continual learning (CL) frameworks that can adapt to distribution shifts while preventing catastrophic forgetting. Such challenges are pronounced in fluid dynamics, where changes in geometry, boundary conditions, or flow regimes induce non-trivial changes to the solution. Here, we introduce a new architecture-based approach (SLE-FNO) combining a Single-Layer Extension (SLE) with the Fourier Neural Operator (FNO) to support efficient CL. SLE-FNO was compared with a range of established CL methods, including Elastic Weight Consolidation (EWC), Learning without Forgetting (LwF), replay-based approaches, Orthogonal Gradient Descent (OGD), Gradient Episodic Memory (GEM), PiggyBack, and Low-Rank Approximation (LoRA), within an image-to-image regression setting. The models were trained to map transient concentration fields to time-averaged wall shear stress (TAWSS) in pulsatile aneurysmal blood flow. Tasks were derived from 230 computational fluid dynamics simulations grouped into four sequential and out-of-distribution configurations. Results show that replay-based methods and architecture-based approaches (PiggyBack, LoRA, and SLE-FNO) achieve the best retention, with SLE-FNO providing the strongest overall balance between plasticity and stability, achieving accuracy with zero forgetting and minimal additional parameters. Our findings highlight key differences between CL algorithms and introduce SLE-FNO as a promising strategy for adapting baseline models when extrapolation is required.
[813] Data-driven discovery of roughness descriptors for surface characterization and intimate contact modeling of unidirectional composite tapes
Sebastian Rodriguez, Mikhael Tannous, Jad Mounayer, Camilo Cruz, Anais Barasinski, Francisco Chinesta
Main category: cs.LG
TL;DR: Paper proposes using Rank Reduction Autoencoders (RRAEs) with truncated SVD to extract roughness descriptors from unidirectional tapes that enable both classification for process control and consolidation modeling for composite manufacturing.
Details
Motivation: Current statistical descriptors for surface roughness of unidirectional tapes can represent topology but don't connect well with the physics of inter-tape consolidation during composite manufacturing. Need descriptors that enable both tape classification (for process control) and consolidation modeling.
Method: Proposes Rank Reduction Autoencoders (RRAEs) - autoencoders with linear latent space enforced by applying truncated Singular Value Decomposition (SVD) to the latent matrix during training. The SVD modes are forced to both accurately represent roughness after decoding and extract a priori knowledge like classification or modeling properties.
Result: The method extracts useful roughness descriptors that can simultaneously enable tape classification (crucial for process control) and consolidation modeling by inferring the evolution of the degree of intimate contact governed by process parameters.
Conclusion: RRAEs provide a novel strategy to extract roughness descriptors that bridge the gap between surface topology characterization and the physical consolidation processes in composite manufacturing, addressing the dual need for classification and modeling.
Abstract: Unidirectional tapes' surface roughness determines the evolution of the degree of intimate contact required for ensuring the thermoplastic molecular diffusion and the associated inter-tape consolidation during manufacturing of composite structures. However, the usual characterization of rough surfaces relies on statistical descriptors that, even if they are able to represent the surface topology, are not necessarily connected with the physics occurring at the interface during inter-tape consolidation. Thus, a key research question could be formulated as follows: Which roughness descriptors simultaneously enable tape classification-crucial for process control-and consolidation modeling via the inference of the evolution of the degree of intimate contact, itself governed by the process parameters? To provide a valuable response, we propose a novel strategy based on the use of Rank Reduction Autoencoders (RRAEs), autoencoders with a linear latent vector space enforced by applying a truncated Singular Value Decomposition (SVD) to the latent matrix during the encoder-decoder training. In this work, we extract useful roughness descriptors by enforcing the latent SVD modes to (i) accurately represent the roughness after decoding, and (ii) allow the extraction of existing a priori knowledge such as classification or modelling properties.
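The RRAE bottleneck described above can be sketched in a few lines (our illustration, not the authors' code; the function name and rank `r` are ours): during training the batch latent matrix is replaced by its top-r SVD reconstruction, which enforces a rank-r linear latent space.

```python
import numpy as np

def truncate_latent(Z: np.ndarray, r: int) -> np.ndarray:
    """Project a batch latent matrix Z (batch x d) onto its top-r SVD modes.

    In an RRAE this projection is applied to the encoder output during
    training, so the decoder only ever sees a rank-r latent matrix.
    """
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    return (U[:, :r] * S[:r]) @ Vt[:r]
```

In an actual training loop the SVD would be applied inside an autodiff framework so gradients flow through the projection; NumPy here only shows the linear algebra.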
[814] Detecting Neurovascular Instability from Multimodal Physiological Signals Using Wearable-Compatible Edge AI: A Responsible Computational Framework
Truong Quynh Hoa, Hoang Dinh Cuong, Truong Xuan Khanh
Main category: cs.LG
TL;DR: Melaguard: A lightweight multimodal transformer framework for detecting neurovascular instability from wearable physiological signals before structural stroke occurs, enabling early stroke risk screening.
Details
Motivation: Neurovascular instability (NVI) - the pre-structural dysregulation of cerebrovascular autoregulation preceding overt stroke - remains undetectable by existing single-modality wearables. With 12.2 million incident strokes annually, there's a critical need for continuous multimodal physiological monitoring for community-scale screening.
Method: Proposes Melaguard, a multimodal ML framework (Transformer-lite, 1.2M parameters, 4-head self-attention) that fuses heart rate variability (HRV), peripheral perfusion index, SpO2, and bilateral phase coherence into a composite NVI Score. Designed for edge inference with WCET <=4 ms on Cortex-M4. Uses three-stage validation: synthetic benchmark, clinical cohort validation, and PPG pipeline validation.
Result: Synthetic benchmark (n=10,000): AUC=0.88 [0.83-0.92]; Clinical cohort (n=172): Transformer-lite achieves AUC=0.755, outperforming LSTM (0.643), Random Forest (0.665), SVM (0.472); HRV-SDNN discriminates stroke (p=0.011); PPG pipeline (n=53): pulse rate r=0.748 and HRV surrogate r=0.690 vs. ECG ground truth; Cross-modality validation on PPG-BP (n=219): PPG morphology classifies cerebrovascular disease at AUC=0.923.
Conclusion: Multimodal fusion consistently outperforms single-modality baselines for detecting neurovascular instability. The lightweight transformer architecture enables practical edge deployment for continuous physiological monitoring, offering a path to community-scale stroke risk screening.
Abstract: We propose Melaguard, a multimodal ML framework (Transformer-lite, 1.2M parameters, 4-head self-attention) for detecting neurovascular instability (NVI) from wearable-compatible physiological signals prior to structural stroke pathology. The model fuses heart rate variability (HRV), peripheral perfusion index, SpO2, and bilateral phase coherence into a composite NVI Score, designed for edge inference (WCET <=4 ms on Cortex-M4). NVI - the pre-structural dysregulation of cerebrovascular autoregulation preceding overt stroke - remains undetectable by existing single-modality wearables. With 12.2 million incident strokes annually, continuous multimodal physiological monitoring offers a practical path to community-scale screening. Three-stage independent validation: (1) synthetic benchmark (n=10,000), AUC=0.88 [0.83-0.92]; (2) clinical cohort PhysioNet CVES (n=172; 84 stroke, 88 control) - Transformer-lite achieves AUC=0.755 [0.630-0.778], outperforming LSTM (0.643), Random Forest (0.665), SVM (0.472); HRV-SDNN discriminates stroke (p=0.011); (3) PPG pipeline PhysioNet BIDMC (n=53) – pulse rate r=0.748 and HRV surrogate r=0.690 vs. ECG ground truth. Cross-modality validation on PPG-BP (n=219) confirms PPG morphology classifies cerebrovascular disease at AUC=0.923 [0.869-0.968]. Multimodal fusion consistently outperforms single-modality baselines. Code: https://github.com/ClevixLab/Melaguard
[815] SDE-Driven Spatio-Temporal Hypergraph Neural Networks for Irregular Longitudinal fMRI Connectome Modeling in Alzheimer’s Disease
Ruiying Chen, Yutong Wang, Houliang Zhou, Wei Liang, Yong Chen, Lifang He
Main category: cs.LG
TL;DR: SDE-HGNN: A stochastic differential equation-driven hypergraph neural network for modeling irregular longitudinal fMRI connectome data in Alzheimer’s disease progression prediction.
Details
Motivation: Longitudinal neuroimaging for Alzheimer's disease faces challenges with irregular sampling and missing visits, making it difficult to learn reliable temporal representations from fMRI connectome data.
Method: Uses SDE-based reconstruction to recover continuous latent trajectories from irregular observations, constructs dynamic hypergraphs to capture higher-order brain region interactions, evolves hypergraph convolution parameters through SDE-controlled recurrent dynamics conditioned on inter-scan intervals, and incorporates sparsity-based importance learning to identify salient brain regions and connectivity patterns.
Result: Extensive experiments on OASIS-3 and ADNI cohorts demonstrate consistent improvements over state-of-the-art graph and hypergraph baselines in AD progression prediction.
Conclusion: SDE-HGNN effectively addresses irregular longitudinal sampling challenges in neuroimaging and provides a robust framework for modeling disease progression in Alzheimer’s disease through advanced spatio-temporal hypergraph neural networks with SDE-driven dynamics.
Abstract: Longitudinal neuroimaging is essential for modeling disease progression in Alzheimer’s disease (AD), yet irregular sampling and missing visits pose substantial challenges for learning reliable temporal representations. To address this challenge, we propose SDE-HGNN, a stochastic differential equation (SDE)-driven spatio-temporal hypergraph neural network for irregular longitudinal fMRI connectome modeling. The framework first employs an SDE-based reconstruction module to recover continuous latent trajectories from irregular observations. Based on these reconstructed representations, dynamic hypergraphs are constructed to capture higher-order interactions among brain regions over time. To further model temporal evolution, hypergraph convolution parameters evolve through SDE-controlled recurrent dynamics conditioned on inter-scan intervals, enabling disease-stage-adaptive connectivity modeling. We also incorporate a sparsity-based importance learning mechanism to identify salient brain regions and discriminative connectivity patterns. Extensive experiments on the OASIS-3 and ADNI cohorts demonstrate consistent improvements over state-of-the-art graph and hypergraph baselines in AD progression prediction. The source code is available at https://anonymous.4open.science/r/SDE-HGNN-017F.
[816] Reinforcement Learning from Multi-Source Imperfect Preferences: Best-of-Both-Regimes Regret
Ming Shi, Yingbin Liang, Ness B. Shroff, Ananthram Swami
Main category: cs.LG
TL;DR: Theoretical analysis of RLHF with multi-source imperfect preferences, showing how to achieve statistical gains from multiple sources while remaining robust to systematic preference mismatches.
Details
Motivation: Practical RLHF systems use feedback from multiple sources (annotators, experts, reward models, heuristics) that can have systematic mismatches due to subjectivity, expertise variation, and modeling artifacts, unlike theoretical assumptions of consistent ground-truth preferences.
Method: Proposes a unified algorithm with imperfection-adaptive weighted comparison learning, value-targeted transition estimation to control feedback-induced distribution shift, and sub-importance sampling to keep weighted objectives analyzable.
Result: Achieves regret $\tilde{O}(\sqrt{K/M}+\omega)$ with lower bound $\tilde{\Omega}(\max\{\sqrt{K/M},\omega\})$, showing best-of-both-regimes behavior: M-dependent statistical gains when imperfection is small, while remaining robust to large imperfections.
Conclusion: Quantifies when multi-source feedback provably improves RLHF and how cumulative imperfection fundamentally limits it, providing theoretical foundations for practical RLHF systems with imperfect multi-source preferences.
Abstract: Reinforcement learning from human feedback (RLHF) replaces hard-to-specify rewards with pairwise trajectory preferences, yet regret-oriented theory often assumes that preference labels are generated consistently from a single ground-truth objective. In practical RLHF systems, however, feedback is typically \emph{multi-source} (annotators, experts, reward models, heuristics) and can exhibit systematic, persistent mismatches due to subjectivity, expertise variation, and annotation/modeling artifacts. We study episodic RL from \emph{multi-source imperfect preferences} through a cumulative imperfection budget: for each source, the total deviation of its preference probabilities from an ideal oracle is at most $\omega$ over $K$ episodes. We propose a unified algorithm with regret $\tilde{O}(\sqrt{K/M}+\omega)$, which exhibits a best-of-both-regimes behavior: it achieves $M$-dependent statistical gains when imperfection is small (where $M$ is the number of sources), while remaining robust with unavoidable additive dependence on $\omega$ when imperfection is large. We complement this with a lower bound $\tilde{\Omega}(\max\{\sqrt{K/M},\omega\})$, which captures the best possible improvement with respect to $M$ and the unavoidable dependence on $\omega$, and a counterexample showing that naïvely treating imperfect feedback as oracle-consistent can incur regret as large as $\tilde{\Omega}(\min\{\omega\sqrt{K},K\})$. Technically, our approach involves imperfection-adaptive weighted comparison learning, value-targeted transition estimation to control hidden feedback-induced distribution shift, and sub-importance sampling to keep the weighted objectives analyzable, yielding regret guarantees that quantify when multi-source feedback provably improves RLHF and how cumulative imperfection fundamentally limits it.
[817] From Data to Laws: Neural Discovery of Conservation Laws Without False Positives
Rahul D Ray
Main category: cs.LG
TL;DR: NGCG: Neural-symbolic pipeline for discovering conservation laws from data, addressing challenges like parameter variation, non-polynomial invariants, and false positives on chaotic systems.
Details
Motivation: Discovering conservation laws from data is challenging due to parameter variation, non-polynomial invariants, local minima, and false positives on chaotic systems. Existing methods struggle with these issues.
Method: NGCG uses a neural-symbolic pipeline that decouples dynamics learning from invariant discovery. It employs a multi-restart variance minimizer to learn near-constant latent representations, followed by system-specific symbolic extraction (polynomial Lasso, log-basis Lasso, explicit PDE candidates, PySR) to yield closed-form expressions. A strict constancy gate and diversity filter eliminate spurious laws.
Result: Achieves consistent discovery (DR=1.0, FDR=0.0, F1=1.0) on all four systems with true conservation laws, with constancy two to three orders of magnitude lower than best baseline. Only method that succeeds on Lotka-Volterra system, correctly outputs no law on all five systems without invariants. Robust to noise, sample efficient, insensitive to hyperparameters, and runs under one minute per system.
Conclusion: NGCG achieves strong performance for data-driven conservation-law discovery, combining high accuracy with interpretability, providing a range of candidate expressions that allow users to trade complexity for constancy.
Abstract: Conservation laws are fundamental to understanding dynamical systems, but discovering them from data remains challenging due to parameter variation, non-polynomial invariants, local minima, and false positives on chaotic systems. We introduce NGCG, a neural-symbolic pipeline that decouples dynamics learning from invariant discovery and systematically addresses these challenges. A multi-restart variance minimiser learns a near-constant latent representation; system-specific symbolic extraction (polynomial Lasso, log-basis Lasso, explicit PDE candidates, and PySR) yields closed-form expressions; a strict constancy gate and diversity filter eliminate spurious laws. On a benchmark of nine diverse systems including Hamiltonian and dissipative ODEs, chaos, and PDEs, NGCG achieves consistent discovery (DR=1.0, FDR=0.0, F1=1.0) on all four systems with true conservation laws, with constancy two to three orders of magnitude lower than the best baseline. It is the only method that succeeds on the Lotka–Volterra system, and it correctly outputs no law on all five systems without invariants. Extensive experiments demonstrate robustness to noise ($\sigma = 0.1$), sample efficiency (50–100 trajectories), insensitivity to hyperparameters, and runtime under one minute per system. A Pareto analysis shows that the method provides a range of candidate expressions, allowing users to trade complexity for constancy. NGCG achieves strong performance relative to prior methods for data-driven conservation-law discovery, combining high accuracy with interpretability.
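A toy version of the variance-minimisation idea can be written down directly (our sketch, not the NGCG code): with a fixed feature basis instead of a neural network, the most nearly constant linear combination along a trajectory is the smallest eigenvector of the centred feature covariance.

```python
import numpy as np

def constant_combination(F: np.ndarray) -> np.ndarray:
    """Return the unit coefficient vector w minimising Var(F @ w).

    F holds feature values (T samples x k basis functions) along one
    trajectory; the variance minimiser over unit-norm w is the
    eigenvector of the centred covariance with smallest eigenvalue.
    """
    Fc = F - F.mean(axis=0)
    cov = Fc.T @ Fc / len(F)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues ascending
    return eigvecs[:, 0]
```

For a harmonic oscillator with x = cos(t), v = -sin(t) and basis [x², v², x·v], this recovers the energy x² + v² (up to sign and scale). NGCG replaces the fixed basis with a learned network and multi-restart optimisation.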
[818] Spatio-Temporal Grid Intelligence: A Hybrid Graph Neural Network and LSTM Framework for Robust Electricity Theft Detection
Adewale U. Oguntola, Olowookere A. AbdulQoyum, Adebukola M. Madehin, Adekemi A. Adetoro
Main category: cs.LG
TL;DR: AI framework combining time-series anomaly detection, supervised ML, and graph neural networks for electricity theft detection in smart grids
Details
Motivation: Electricity theft (non-technical loss) causes financial deficits and grid instability; conventional reactive, meter-centric methods fail to capture complex spatio-temporal dynamics and behavioral patterns of fraudulent consumption.
Method: Hybrid AI framework fusing: 1) LSTM autoencoder for temporal anomaly scoring, 2) Random Forest classifier for tabular feature discrimination, and 3) Graph Neural Networks (GNN) to model spatial dependencies across the distribution network; uses enriched features including rolling averages, voltage drop estimates, and Grid Imbalance Index.
Result: While standalone anomaly detection yields low theft F1-score of 0.20, hybrid fusion achieves 93.7% overall accuracy; with calibrated decision thresholds, achieves balanced theft precision of 0.55 and recall of 0.50, effectively mitigating false positives from single-model approaches
Conclusion: Integrating topological grid awareness with temporal and supervised analytics provides scalable, risk-based solution for proactive electricity theft detection and enhanced smart grid reliability
Abstract: Electricity theft, or non-technical loss (NTL), presents a persistent threat to global power systems, driving significant financial deficits and compromising grid stability. Conventional detection methodologies, predominantly reactive and meter-centric, often fail to capture the complex spatio-temporal dynamics and behavioral patterns associated with fraudulent consumption. This study introduces a novel AI-driven Grid Intelligence Framework that fuses Time-Series Anomaly Detection, Supervised Machine Learning, and Graph Neural Networks (GNN) to identify theft with high precision in imbalanced datasets. Leveraging an enriched feature set, including rolling averages, voltage drop estimates, and a critical Grid Imbalance Index, the methodology employs a Long Short-Term Memory (LSTM) autoencoder for temporal anomaly scoring, a Random Forest classifier for tabular feature discrimination, and a GNN to model spatial dependencies across the distribution network. Experimental validation demonstrates that while standalone anomaly detection yields a low theft F1-score of 0.20, the proposed hybrid fusion achieves an overall accuracy of 93.7%. By calibrating decision thresholds via precision-recall analysis, the system attains a balanced theft precision of 0.55 and recall of 0.50, effectively mitigating the false positives inherent in single-model approaches. These results confirm that integrating topological grid awareness with temporal and supervised analytics provides a scalable, risk-based solution for proactive electricity theft detection and enhanced smart grid reliability.
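The threshold-calibration step can be sketched as a simple F1 sweep over candidate cut-offs (a minimal stand-in for the paper's precision-recall analysis; the function and variable names are ours):

```python
def best_threshold(scores, labels):
    """Pick the decision threshold that maximises F1.

    scores: fused model scores (higher = more theft-like);
    labels: 1 for confirmed theft, 0 for normal consumption.
    """
    best_t, best_f1 = 0.0, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```

In practice the sweep would be run on a held-out validation split, and precision or recall could be weighted asymmetrically to match the operational cost of false positives.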
[819] AE-LLM: Adaptive Efficiency Optimization for Large Language Models
Kaito Tanaka, Masato Ito, Yuji Nishimura, Keisuke Matsuda, Aya Nakayama
Main category: cs.LG
TL;DR: AE-LLM: A unified framework that automatically selects and combines optimal efficiency techniques (attention mechanisms, MoE, quantization, etc.) for LLM deployment based on task requirements and hardware constraints.
Details
Motivation: LLM deployment faces challenges with computational costs, memory requirements, and energy consumption. No single efficiency technique works universally well - effectiveness varies by task, resources, and model scale. Need automated approach to navigate complex trade-offs.
Method: Proposes AE-LLM framework with multi-objective optimization considering accuracy, latency, memory, and energy. Uses efficient search algorithm to explore combinatorial space of efficiency techniques across architecture, fine-tuning, and inference stages to find Pareto-optimal configurations.
Result: Achieves average 2.8× improvement in efficiency metrics while maintaining competitive accuracy (within 1.2% of baseline) across 15 models (0.5B-70B parameters) and 10 diverse tasks. Framework also generalizes to vision-language models with similar efficiency gains.
Conclusion: AE-LLM provides automated tool for navigating LLM efficiency optimization trade-offs, enabling practitioners to deploy models more efficiently across diverse scenarios while maintaining performance.
Abstract: Large Language Models (LLMs) have achieved remarkable success across diverse applications, yet their deployment remains challenging due to substantial computational costs, memory requirements, and energy consumption. Recent empirical studies have demonstrated that no single efficiency technique is universally optimal; instead, the effectiveness of methods such as efficient attention mechanisms, mixture-of-experts (MoE), parameter-efficient fine-tuning, and quantization varies significantly depending on task characteristics, resource constraints, and model scales. Building upon these insights, we propose AE-LLM, a unified framework that automatically selects and combines optimal efficiency techniques tailored to specific deployment scenarios. Our approach introduces a multi-objective optimization framework that jointly considers accuracy, latency, memory footprint, and energy consumption, while accounting for hardware constraints and task requirements. We develop an efficient search algorithm that explores the combinatorial space of efficiency techniques across architecture, fine-tuning, and inference stages, identifying Pareto-optimal configurations. Extensive experiments across 15 models (0.5B-70B parameters) and 10 diverse tasks demonstrate that AE-LLM achieves an average of $2.8\times$ improvement in efficiency metrics while maintaining competitive accuracy (within 1.2% of baseline), compared to static efficiency configurations. Furthermore, our framework generalizes effectively to vision-language models, achieving similar efficiency gains. Our contributions provide practitioners with an automated tool for navigating the complex trade-off landscape of LLM efficiency optimization.
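The Pareto filtering at the heart of such a search can be sketched as follows (our illustration with two objectives; AE-LLM's actual objectives also include memory and energy, and its search algorithm is more sophisticated than brute-force filtering):

```python
def dominates(o: dict, c: dict) -> bool:
    """o dominates c if it is no worse on both objectives and strictly
    better on at least one (accuracy up, latency down)."""
    return (o["accuracy"] >= c["accuracy"]
            and o["latency_ms"] <= c["latency_ms"]
            and (o["accuracy"] > c["accuracy"]
                 or o["latency_ms"] < c["latency_ms"]))

def pareto_front(configs: list[dict]) -> list[dict]:
    """Keep only non-dominated efficiency configurations."""
    return [c for c in configs if not any(dominates(o, c) for o in configs)]
```

A search procedure would evaluate candidate combinations of techniques, score each on all objectives, and repeatedly prune to the front rather than enumerating the full combinatorial space.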
[820] Distributed Gradient Clustering: Convergence and the Effect of Initialization
Aleksandar Armacki, Himkant Sharma, Dragana Bajović, Dušan Jakovetić, Mrityunjoy Chakraborty, Soummya Kar
Main category: cs.LG
TL;DR: Distributed gradient-based clustering algorithms for networked users show resilience to center initialization compared to centralized methods, with a novel distributed initialization scheme inspired by K-means++ improving performance.
Details
Motivation: The paper investigates how center initialization affects distributed clustering algorithms in networked environments where users have local datasets and communicate only with neighbors, aiming to find global clustering of joint data.
Method: The study performs extensive numerical experiments on distributed gradient-based clustering algorithms, comparing them to centralized gradient clustering. Inspired by K-means++, the authors propose a novel distributed center initialization scheme.
Result: The distributed methods demonstrate greater resilience to initialization effects compared to centralized gradient clustering. The proposed distributed initialization scheme improves performance over baseline random initialization.
Conclusion: Distributed clustering algorithms are more robust to initialization issues than centralized approaches, and a distributed initialization scheme inspired by K-means++ can enhance their performance.
Abstract: We study the effects of center initialization on the performance of a family of distributed gradient-based clustering algorithms introduced in [1], that work over connected networks of users. In the considered scenario, each user contains a local dataset and communicates only with its immediate neighbours, with the aim of finding a global clustering of the joint data. We perform extensive numerical experiments, evaluating the effects of center initialization on the performance of our family of methods, demonstrating that our methods are more resilient to the effects of initialization, compared to centralized gradient clustering [2]. Next, inspired by the $K$-means++ initialization [3], we propose a novel distributed center initialization scheme, which is shown to improve the performance of our methods, compared to the baseline random initialization.
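For reference, the centralized K-means++ D² sampling that inspires the proposed scheme looks like this (1-D points for brevity; the paper's contribution is coordinating this sampling over a network of users, which is not shown here):

```python
import random

def kmeans_pp_init(points: list[float], k: int, seed: int = 0) -> list[float]:
    """Centralized K-means++ seeding: after a uniform first pick, each
    new center is sampled with probability proportional to the squared
    distance to its nearest already-chosen center."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = [min((p - c) ** 2 for c in centers) for p in points]
        total = sum(d2)
        if total == 0:  # every point coincides with a center
            centers.append(rng.choice(points))
            continue
        r, acc = rng.random() * total, 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:
                centers.append(p)
                break
    return centers
```

The D² weighting makes well-separated clusters likely to each receive a seed, which is the property the distributed variant aims to preserve without any node seeing the full dataset.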
[821] Delightful Distributed Policy Gradient
Ian Osband
Main category: cs.LG
TL;DR: Delightful Policy Gradient (DG) addresses negative learning from surprising data in distributed RL by gating updates with delight (advantage × surprisal), suppressing rare failures and amplifying rare successes without behavior probabilities.
Details
Motivation: Distributed RL suffers from training on stale, buggy, or mismatched actor data where high-surprisal failures can dominate updates despite carrying little useful signal, while high-surprisal successes reveal important opportunities.
Method: DG separates updates by gating each with delight - the product of advantage and surprisal (negative log-probability). This suppresses rare failures and amplifies rare successes without requiring behavior probabilities or importance sampling.
Result: On MNIST with simulated staleness, DG outperforms importance-weighted PG. On transformer sequence tasks with multiple frictions (staleness, actor bugs, reward corruption, rare discovery), DG achieves ~10× lower error and order-of-magnitude compute advantage.
Conclusion: DG effectively handles contaminated sampling in distributed RL by focusing on high-surprisal successes while suppressing high-surprisal failures, outperforming traditional methods especially when multiple frictions are present.
Abstract: Distributed reinforcement learning trains on data from stale, buggy, or mismatched actors, producing actions with high surprisal (negative log-probability) under the learner’s policy. The core difficulty is not surprising data per se, but \emph{negative learning from surprising data}. High-surprisal failures can dominate the update direction despite carrying little useful signal, while high-surprisal successes reveal opportunities the current policy would otherwise miss. The \textit{Delightful Policy Gradient} (DG) separates these cases by gating each update with delight, the product of advantage and surprisal, suppressing rare failures and amplifying rare successes without behavior probabilities. Under contaminated sampling, the cosine similarity between the standard policy gradient and the true gradient collapses, while DG’s grows as the policy improves. No sign-blind reweighting, including exact importance sampling, can reproduce this effect. On MNIST with simulated staleness, DG without off-policy correction outperforms importance-weighted PG with exact behavior probabilities. On a transformer sequence task with staleness, actor bugs, reward corruption, and rare discovery, DG achieves roughly $10{\times}$ lower error. When all four frictions act simultaneously, its compute advantage is order-of-magnitude and grows with task complexity.
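One hypothetical reading of the delight gate (the paper's exact weighting may differ; the function name is ours): compute delight = advantage × surprisal and zero out samples with negative delight, so a high-surprisal failure contributes nothing while a high-surprisal success keeps its amplified weight.

```python
def delight_weight(advantage: float, logprob: float) -> float:
    """Gate a per-sample policy-gradient update by its delight.

    surprisal = -log pi(a|s) under the learner's policy;
    delight = advantage * surprisal. A surprising failure
    (negative advantage, high surprisal) is suppressed to zero.
    """
    surprisal = -logprob
    delight = advantage * surprisal
    return max(delight, 0.0)
```

Note that no behavior-policy probability appears anywhere, which is the practical point: the gate works even when the actor's probabilities are stale or unavailable.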
[822] Does This Gradient Spark Joy?
Ian Osband
Main category: cs.LG
TL;DR: Delightful Policy Gradient (DG) introduces “delight” (advantage × surprisal) as a forward-pass signal to screen samples, and Kondo gate pays for backward passes only when delight exceeds compute price, enabling efficient training by skipping most backward passes while maintaining learning quality.
Details
Motivation: Policy gradient methods compute expensive backward passes for every sample, but most samples carry little learning value. There's a need to identify valuable samples efficiently and skip unnecessary backward computations to improve training efficiency.
Method: Proposes Delightful Policy Gradient (DG) using “delight” = advantage × surprisal as a forward-pass signal of learning value. Introduces Kondo gate that compares delight against a compute price and only pays for backward passes when delight exceeds the price. This creates a quality-cost Pareto frontier.
Result: In bandits, zero-price gating preserves useful gradient signal while removing noise. Delight is more reliable than additive combinations of value and surprise. On MNIST and transformer token reversal tasks, Kondo gate skips most backward passes while retaining nearly all of DG’s learning quality, with gains growing as problems get harder and backward passes become more expensive.
Conclusion: The Kondo gate enables efficient training by screening samples before expensive backpropagation, suggesting a “speculative-decoding-for-training” paradigm where cheap forward passes can identify valuable samples worth expensive backward computation.
Abstract: Policy gradient computes a backward pass for every sample, even though the backward pass is expensive and most samples carry little learning value. The Delightful Policy Gradient (DG) provides a forward-pass signal of learning value: \emph{delight}, the product of advantage and surprisal (negative log-probability). We introduce the \emph{Kondo gate}, which compares delight against a compute price and pays for a backward pass only when the sample is worth it, thereby tracing a quality–cost Pareto frontier. In bandits, zero-price gating preserves useful gradient signal while removing perpendicular noise, and delight is a more reliable screening signal than additive combinations of value and surprise. On MNIST and transformer token reversal, the Kondo gate skips most backward passes while retaining nearly all of DG’s learning quality, with gains that grow as problems get harder and backward passes become more expensive. Because the gate tolerates approximate delight, a cheap forward pass can screen samples before expensive backpropagation, suggesting a speculative-decoding-for-training paradigm.
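A minimal sketch of the gate as described (the function name is ours): from forward-pass quantities alone, decide whether a sample is worth the cost of a backward pass.

```python
def kondo_gate(advantage: float, logprob: float, price: float) -> bool:
    """Pay for a backward pass only when the sample's delight
    exceeds the compute price; delight = advantage * surprisal."""
    delight = advantage * (-logprob)
    return delight > price
```

Setting price = 0 recovers pure sign gating; raising the price skips more backward passes and traces the quality-cost frontier the abstract describes. Because the gate tolerates an approximate delight, a cheap proxy forward pass could screen samples before the full model is touched.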
[823] RMNP: Row-Momentum Normalized Preconditioning for Scalable Matrix-Based Optimization
Shenyang Deng, Zhuoli Ouyang, Tianyu Pang, Zihang Liu, Ruochen Jin, Shuhua Yu, Yaoqing Yang
Main category: cs.LG
TL;DR: RMNP is a new optimizer that replaces Newton-Schulz iteration with simple row-wise L2 normalization for more efficient preconditioning in deep learning, achieving similar performance with lower computational cost.
Details
Motivation: To improve computational efficiency of preconditioned adaptive methods for training deep neural networks, particularly addressing the high computational cost of Newton-Schulz iteration used in Muon optimizers while maintaining effective preconditioning.
Method: Proposes RMNP (Row Momentum Normalized Preconditioning) which replaces Newton-Schulz iteration with simple row-wise L2 normalization, motivated by empirical observations of diagonal block structure in Transformer layerwise Hessian matrices.
Result: RMNP reduces per-iteration computational complexity from O(mn·min(m,n)) to O(mn) for m×n weight matrices while maintaining comparable optimization performance to Muon, with extensive experiments on large language model pretraining showing competitive performance with substantially reduced wall-clock time.
Conclusion: RMNP provides a more computationally efficient alternative to Muon for preconditioned optimization in deep learning, achieving similar convergence guarantees with significantly lower computational overhead, making it practical for large-scale training.
Abstract: Preconditioned adaptive methods have gained significant attention for training deep neural networks, as they capture rich curvature information of the loss landscape. The central challenge in this field lies in balancing preconditioning effectiveness with computational efficiency of implementing the preconditioner. Among recent advances, \textsc{Muon} stands out by using Newton-Schulz iteration to obtain preconditioned updates without explicitly constructing the preconditioning matrix. Despite its advantages, the efficiency of \textsc{Muon} still leaves room for further improvement. In this paper, we introduce \textsc{RMNP} (Row Momentum Normalized Preconditioning), an optimizer that replaces Newton-Schulz iteration with a simple row-wise $\ell_2$ normalization operation, motivated by the empirically observed diagonal block structure of the Transformer layerwise Hessian. This substitution reduces the per-iteration computational complexity from $\mathcal{O}(mn\cdot\min(m,n))$ to $\mathcal{O}(mn)$ for an $m\times n$ weight matrix while maintaining comparable optimization performance. Theoretically, we establish convergence guarantees for \textsc{RMNP} in the non-convex setting that match recent results for \textsc{Muon} optimizers, achieving the information-theoretic minimax optimal complexity. Extensive experiments on large language model pretraining show that \textsc{RMNP} delivers competitive optimization performance compared with \textsc{Muon} while substantially reducing preconditioning wall-clock time. Our code is available at \href{https://anonymous.4open.science/r/RMNP-E8E1/}{this link}.
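The core substitution can be sketched in a few lines (our sketch; learning-rate scaling and the rest of the optimizer update are omitted): instead of orthogonalising the momentum matrix via Newton-Schulz, each row is simply rescaled to unit $\ell_2$ norm, which costs O(mn) for an m×n matrix.

```python
import numpy as np

def rmnp_direction(M: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Row-wise l2 normalisation of a momentum matrix M (m x n),
    replacing Muon's Newton-Schulz orthogonalisation; eps guards
    against all-zero rows."""
    row_norms = np.linalg.norm(M, axis=1, keepdims=True)
    return M / (row_norms + eps)
```

Every row of the result has (near-)unit norm, loosely mirroring the equalised singular values an orthogonalised update would have, but with a single pass over the matrix instead of repeated matrix multiplications.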
[824] Towards Practical Multimodal Hospital Outbreak Detection
Chang Liu, Jieshi Chen, Alexander J. Sundermann, Kathleen Shutt, Marissa P. Griffith, Lora Lee Pless, Lee H. Harrison, Artur W. Dubrawski
Main category: cs.LG
TL;DR: A machine learning approach using MALDI-TOF mass spectrometry, antimicrobial resistance patterns, and EHR data for rapid hospital outbreak detection as an alternative to whole genome sequencing.
Details
Motivation: Whole genome sequencing (WGS) is the gold standard for outbreak investigations but has high costs and long turnaround times, making it impractical for routine surveillance in less-equipped facilities. There's a need for rapid, cost-effective alternatives for outbreak detection.
Method: Developed a machine learning approach that learns discriminative features from three alternative modalities: 1) MALDI-TOF mass spectrometry, 2) antimicrobial resistance patterns, and 3) electronic health records. Proposed a tiered surveillance paradigm to reduce reliance on WGS.
Result: Multi-species evaluation shows that integrating these three modalities can boost outbreak detection performance. EHR analysis identified high-risk contamination routes linked to specific clinical procedures involving invasive equipment and high-frequency workflows.
Conclusion: The proposed machine learning approach with multimodal data integration provides a rapid, cost-effective alternative to WGS for outbreak detection, enabling proactive risk mitigation through identification of actionable contamination routes.
Abstract: Rapid identification of outbreaks in hospitals is essential for controlling pathogens with epidemic potential. Although whole genome sequencing (WGS) remains the gold standard in outbreak investigations, its substantial costs and turnaround times limit its feasibility for routine surveillance, especially in less-equipped facilities. We explore three modalities as rapid alternatives: matrix-assisted laser desorption ionization-time of flight (MALDI-TOF) mass spectrometry, antimicrobial resistance (AR) patterns, and electronic health records (EHR). We present a machine learning approach that learns discriminative features from these modalities to support outbreak detection. Multi-species evaluation shows that the integration of these modalities can boost outbreak detection performance. We also propose a tiered surveillance paradigm that can reduce the need for WGS through these alternative modalities. Further analysis of EHR information identifies potentially high-risk contamination routes linked to specific clinical procedures, notably those involving invasive equipment and high-frequency workflows, providing infection prevention teams with actionable targets for proactive risk mitigation.
[825] Understanding Behavior Cloning with Action Quantization
Haoqun Cao, Tengyang Xie
Main category: cs.LG
TL;DR: Theoretical analysis of action quantization in autoregressive behavior cloning for continuous control, showing optimal sample complexity and polynomial horizon dependence on quantization error under certain conditions.
Details
Motivation: Autoregressive models like transformers are effective for behavior cloning but require discretizing continuous actions through quantization, a widely adopted practice that lacks theoretical understanding. The paper aims to provide theoretical foundations for this quantization practice.
Method: Theoretical analysis of how quantization error propagates along the horizon and interacts with statistical sample complexity. Analysis of different quantization schemes and their requirements, plus a proposed model-based augmentation to improve error bounds without requiring policy smoothness.
Result: Shows that behavior cloning with quantized actions and log-loss achieves optimal sample complexity matching existing lower bounds, with only polynomial horizon dependence on quantization error under stable dynamics and probabilistic policy smoothness. Characterizes when quantization schemes satisfy/violate requirements and establishes fundamental limits capturing quantization error and statistical complexity.
Conclusion: Provides theoretical justification for the common practice of action quantization in autoregressive behavior cloning, showing it can work well under certain conditions while establishing fundamental performance limits.
Abstract: Behavior cloning is a fundamental paradigm in machine learning, enabling policy learning from expert demonstrations across robotics, autonomous driving, and generative models. Autoregressive models like transformers have proven remarkably effective, from large language models (LLMs) to vision-language-action systems (VLAs). However, applying autoregressive models to continuous control requires discretizing actions through quantization, a practice widely adopted yet poorly understood theoretically. This paper provides theoretical foundations for this practice. We analyze how quantization error propagates along the horizon and interacts with statistical sample complexity. We show that behavior cloning with quantized actions and log-loss achieves optimal sample complexity, matching existing lower bounds, and incurs only polynomial horizon dependence on quantization error, provided the dynamics are stable and the policy satisfies a probabilistic smoothness condition. We further characterize when different quantization schemes satisfy or violate these requirements, and propose a model-based augmentation that provably improves the error bound without requiring policy smoothness. Finally, we establish fundamental limits that jointly capture the effects of quantization error and statistical complexity.
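The quantization step the analysis is built on can be sketched as follows; the uniform binning, the function names, and the bin-center decoding are illustrative assumptions rather than the paper's specific scheme:

```python
import numpy as np

def quantize(a, low, high, n_bins):
    # Map continuous actions in [low, high] to discrete token ids.
    a = np.clip(a, low, high)
    idx = ((a - low) / (high - low) * n_bins).astype(int)
    return np.minimum(idx, n_bins - 1)

def dequantize(idx, low, high, n_bins):
    # Reconstruct each action at its bin center, so the roundtrip error
    # is at most half a bin width: (high - low) / (2 * n_bins).
    return low + (idx + 0.5) * (high - low) / n_bins
```

The half-bin-width roundtrip bound is the per-step quantization error whose propagation along the horizon the paper analyzes.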
[826] LJ-Bench: Ontology-Based Benchmark for U.S. Crime
Hung Yun Tseng, Wuzhen Li, Blerina Gkotse, Grigorios Chrysos
Main category: cs.LG
TL;DR: LJ-Bench is a comprehensive benchmark for evaluating LLM robustness against illegal queries, grounded in legal frameworks with 76 crime types organized taxonomically.
Details
Motivation: Existing benchmarks for evaluating LLM safety against harmful information are limited in scope (focusing on only a few illegal activities) and lack grounding in actual legal frameworks, making comprehensive assessment of LLM vulnerabilities to illegal queries difficult.
Method: Created an ontology of crime-related concepts based on the Model Penal Code (an influential criminal law reference adopted by many U.S. states), instantiated using Californian Law. This structured knowledge forms LJ-Bench, a benchmark with 76 distinct crime types organized taxonomically for systematic assessment of diverse attacks on LLMs.
Result: LLMs show heightened susceptibility to attacks targeting societal harm rather than those directly impacting individuals. The benchmark reveals valuable insights into LLM vulnerabilities across various crime categories.
Conclusion: LJ-Bench facilitates development of more robust and trustworthy LLMs by providing a comprehensive, legally-grounded benchmark for evaluating LLM safety against illegal queries across diverse crime types.
Abstract: The potential of Large Language Models (LLMs) to provide harmful information remains a significant concern due to the vast breadth of illegal queries they may encounter. Unfortunately, existing benchmarks only focus on a handful of types of illegal activities, and are not grounded in legal frameworks. In this work, we introduce an ontology of crime-related concepts grounded in the legal frameworks of the Model Penal Code, which serves as an influential reference for criminal law and has been adopted by many U.S. states, and instantiated using Californian Law. This structured knowledge forms the foundation for LJ-Bench, the first comprehensive benchmark designed to evaluate LLM robustness against a wide range of illegal activities. Spanning 76 distinct crime types organized taxonomically, LJ-Bench enables systematic assessment of diverse attacks, revealing valuable insights into LLM vulnerabilities across various crime categories: LLMs exhibit heightened susceptibility to attacks targeting societal harm rather than those directly impacting individuals. Our benchmark aims to facilitate the development of more robust and trustworthy LLMs. The LJ-Bench benchmark and LJ-Ontology, along with the experiment implementations for reproducibility, are publicly available at https://github.com/AndreaTseng/LJ-Bench.
[827] RECLAIM: Cyclic Causal Discovery Amid Measurement Noise
Muralikrishnna G. Sethuraman, Faramarz Fekri
Main category: cs.LG
TL;DR: RECLAIM is a causal discovery framework that handles cycles and measurement noise using EM with residual normalizing flows for likelihood computation.
Details
Motivation: Existing causal discovery methods assume acyclicity and direct variable access, which fails in real-world settings like genomics where cyclic networks and instrumental noise are common.
Method: RECLAIM learns causal graph structure by maximizing observed measurement likelihood via expectation-maximization (EM), using residual normalizing flows for tractable likelihood computation. Considers two measurement models: Gaussian additive noise and linear measurement system with additive Gaussian noise.
Result: Theoretical consistency guarantees provided for both measurement models. Experiments on synthetic data and real-world protein signaling datasets demonstrate method efficacy.
Conclusion: RECLAIM addresses limitations of existing causal discovery methods by handling cycles and measurement noise, with theoretical guarantees and empirical validation.
Abstract: Uncovering causal relationships is a fundamental problem across science and engineering. However, most existing causal discovery methods assume acyclicity and direct access to the system variables – assumptions that fail to hold in many real-world settings. For instance, in genomics, cyclic regulatory networks are common, and measurements are often corrupted by instrumental noise. To address these challenges, we propose RECLAIM, a causal discovery framework that natively handles both cycles and measurement noise. RECLAIM learns the causal graph structure by maximizing the likelihood of the observed measurements via expectation-maximization (EM), using residual normalizing flows for tractable likelihood computation. We consider two measurement models: (i) Gaussian additive noise, and (ii) a linear measurement system with additive Gaussian noise. We provide theoretical consistency guarantees for both settings. Experiments on synthetic data and real-world protein signaling datasets demonstrate the efficacy of the proposed method.
[828] MKA: Memory-Keyed Attention for Efficient Long-Context Reasoning
Dong Liu, Yanxuan Yu, Ben Lengerich, Ying Nian Wu
Main category: cs.LG
TL;DR: Memory-Keyed Attention (MKA) is a hierarchical attention mechanism with multi-level KV caches that dynamically routes attention across local, session, and long-term memory for efficient long-context language modeling.
Details
Motivation: The growing cost of maintaining large Key/Value caches in long-context language models creates major bottlenecks in training and inference. Existing methods like MQA and MLA reduce memory but sacrifice representation quality or add runtime overhead.
Method: Proposes Memory-Keyed Attention (MKA) with hierarchical multi-level KV caches (local, session, long-term) and dynamic attention routing. Also introduces Route-Fused MKA (FastMKA), a broadcast-routed variant that fuses memory sources before attention computation for better efficiency.
Result: FastMKA achieves comparable perplexity to MLA while delivering up to 5x faster training throughput and 1.8x lower evaluation latency across different sequence lengths.
Conclusion: MKA provides a practical and extensible framework for efficient long-context attention with favorable accuracy-efficiency trade-offs.
Abstract: As long-context language modeling becomes increasingly important, the cost of maintaining and attending to large Key/Value (KV) caches grows rapidly, becoming a major bottleneck in both training and inference. While prior works such as Multi-Query Attention (MQA) and Multi-Latent Attention (MLA) reduce memory by sharing or compressing KV features, they often trade off representation quality or incur runtime overhead. We propose Memory-Keyed Attention (MKA), a hierarchical attention mechanism that integrates multi-level KV caches (local, session, and long-term) and learns to route attention across them dynamically. We further introduce Route-Fused MKA (FastMKA), a broadcast-routed variant that fuses memory sources before attention computation for improved efficiency. Experiments on different sequence lengths show that FastMKA achieves a favorable accuracy-efficiency trade-off: comparable perplexity to MLA while achieving up to 5x faster training throughput and 1.8x lower evaluation latency. These results highlight MKA as a practical and extensible framework for efficient long-context attention.
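A minimal sketch of routing attention across multiple KV caches, assuming scalar per-source routing weights (the actual MKA learns the routing dynamically, and FastMKA fuses memory sources before the attention computation):

```python
import numpy as np

def mka_attention(q, caches, route_logits):
    # caches: list of (K, V) pairs, e.g. local / session / long-term memory.
    # route_logits: one scalar per source; softmax gives routing weights.
    w = np.exp(route_logits - np.max(route_logits))
    w = w / w.sum()
    out = np.zeros_like(q)
    for (K, V), wi in zip(caches, w):
        scores = q @ K.T / np.sqrt(q.shape[-1])
        attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
        attn = attn / attn.sum(axis=-1, keepdims=True)
        out += wi * (attn @ V)  # weighted mix of per-source attention outputs
    return out
```

With identical caches and uniform routing this reduces to ordinary single-cache attention, which makes the routing weights the only new degree of freedom.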
[829] Neural collapse in the orthoplex regime
James Alcala, Rayna Andreeva, Vladimir A. Kobzar, Dustin G. Mixon, Sanghoon Na, Shashank Sule, Yangxinyu Xie
Main category: cs.LG
TL;DR: The paper analyzes neural collapse geometry in the orthoplex regime where number of classes exceeds feature dimension (n > d+1), specifically when d+2 ≤ n ≤ 2d.
Details
Motivation: Neural collapse is known to occur when n ≤ d+1, resulting in feature vectors collapsing to vertices of a regular simplex. However, for applications like language models where n ≫ d, different geometric patterns emerge that need characterization.
Method: The analysis uses Radon’s theorem and convexity techniques to characterize the emergent geometric figures in the orthoplex regime where d+2 ≤ n ≤ 2d.
Result: The paper characterizes the specific geometric figures that emerge during neural collapse in the orthoplex regime, providing mathematical understanding of feature vector arrangements when number of classes exceeds feature dimension.
Conclusion: Neural collapse exhibits different geometric patterns depending on the relationship between feature dimension and number of classes, with the orthoplex regime showing distinct emergent structures that can be mathematically characterized.
Abstract: When training a neural network for classification, the feature vectors of the training set are known to collapse to the vertices of a regular simplex, provided the dimension $d$ of the feature space and the number $n$ of classes satisfies $n\leq d+1$. This phenomenon is known as neural collapse. For other applications like language models, one instead takes $n\gg d$. Here, the neural collapse phenomenon still occurs, but with different emergent geometric figures. We characterize these geometric figures in the orthoplex regime where $d+2\leq n\leq 2d$. The techniques in our analysis primarily involve Radon’s theorem and convexity.
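For concreteness (this is the standard definition of the figure the regime is named after, not a claim quoted from the paper), the vertex set of the orthoplex, or cross-polytope, in $\mathbb{R}^d$ is

```latex
\{\pm e_1, \pm e_2, \dots, \pm e_d\} \subset \mathbb{R}^d,
```

which has exactly $2d$ points, matching the upper end $n = 2d$ of the regime $d+2 \leq n \leq 2d$: unlike the equiangular simplex vertices of classical neural collapse, each point here has a single antipodal partner.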
[830] Generating from Discrete Distributions Using Diffusions: Insights from Random Constraint Satisfaction Problems
Alankrita Bhatt, Mukur Gupta, Germain Kolossov, Andrea Montanari
Main category: cs.LG
TL;DR: The paper studies generating uniformly random solutions for random k-SAT and k-XORSAT formulas, analyzing how theoretical insights from constraint satisfaction problems affect generative techniques’ performance on these benchmarks.
Details
Motivation: Random k-SAT is increasingly used as a synthetic benchmark for new generative techniques for discrete distributions, but there's limited understanding of how theoretical properties of constraint satisfaction problems affect generative method performance.
Method: The authors analyze the problem of generating uniformly random solutions for random k-SAT and k-XORSAT formulas, comparing different generative approaches including continuous diffusions, masked discrete diffusions, and learned diffusions with various variable ordering strategies.
Result: Key findings: (1) Continuous diffusions outperform masked discrete diffusions, (2) Learned diffusions can match theoretical ‘ideal’ accuracy, (3) Smart variable ordering significantly improves accuracy but not following popular heuristics.
Conclusion: Theoretical insights from random constraint satisfaction problems have important, sometimes counter-intuitive implications for generative techniques on k-SAT benchmarks, with continuous diffusions and learned approaches showing superior performance.
Abstract: Generating data from discrete distributions is important for a number of application domains including text, tabular data, and genomic data. Several groups have recently used random $k$-satisfiability ($k$-SAT) as a synthetic benchmark for new generative techniques. In this paper, we show that fundamental insights from the theory of random constraint satisfaction problems have observable implications (sometimes contradicting intuition) on the behavior of generative techniques on such benchmarks. More precisely, we study the problem of generating a uniformly random solution of a given (random) $k$-SAT or $k$-XORSAT formula. Among other findings, we observe that: $(i)$ Continuous diffusions outperform masked discrete diffusions; $(ii)$ Learned diffusions can match the theoretical 'ideal' accuracy; $(iii)$ Smart ordering of the variables can significantly improve accuracy, although not following popular heuristics.
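A random $k$-SAT instance of the kind used as a benchmark here can be generated in a few lines (function name and clause encoding are my choices; signed integers denote literals, as in the common DIMACS convention):

```python
import random

def random_ksat(n_vars, n_clauses, k, seed=0):
    # Sample a random k-SAT formula: each clause picks k distinct
    # variables uniformly and negates each independently with prob 1/2.
    rng = random.Random(seed)
    formula = []
    for _ in range(n_clauses):
        chosen = rng.sample(range(1, n_vars + 1), k)
        formula.append([v if rng.random() < 0.5 else -v for v in chosen])
    return formula
```

The generative task studied in the paper is then to sample a uniformly random satisfying assignment of such a formula.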
[831] Bayesian Learning in Episodic Zero-Sum Games
Chang-Wei Yueh, Andy Zhao, Ashutosh Nayyar, Rahul Jain
Main category: cs.LG
TL;DR: Bayesian learning algorithm for zero-sum Markov games using posterior sampling with theoretical regret guarantees
Details
Motivation: To develop efficient learning algorithms for zero-sum Markov games with unknown transition and reward models, where agents need to learn optimal strategies while competing against opponents.
Method: Posterior sampling algorithm where each player maintains Bayesian posterior over game model, samples a model at episode start, and computes equilibrium policy for sampled model. Analyzed two settings: both players use posterior sampling vs. one player uses it while opponent follows arbitrary algorithm
Result: Theoretical regret bound of O(HS√(ABHK log(SABHK))) for posterior sampling agent, where K=episodes, H=episode length, S=states, A,B=action spaces. Experiments in grid-world predator-prey domain show sublinear regret scaling and competitive performance vs fictitious-play baseline
Conclusion: Posterior sampling provides effective Bayesian learning approach for zero-sum Markov games with strong theoretical guarantees on regret, demonstrating practical viability in competitive multi-agent settings
Abstract: We study Bayesian learning in episodic, finite-horizon zero-sum Markov games with unknown transition and reward models. We investigate a posterior algorithm in which each player maintains a Bayesian posterior over the game model, independently samples a game model at the beginning of each episode, and computes an equilibrium policy for the sampled model. We analyze two settings: (i) Both players use the posterior sampling algorithm, and (ii) Only one player uses posterior sampling while the opponent follows an arbitrary learning algorithm. In each setting, we provide guarantees on the expected regret of the posterior sampling agent. Our notion of regret compares the expected total reward of the learning agent against the expected total reward under equilibrium policies of the true game. Our main theoretical result is an expected regret bound for the posterior sampling agent of order $O(HS\sqrt{ABHK\log(SABHK)})$ where $K$ is the number of episodes, $H$ is the episode length, $S$ is the number of states, and $A,B$ are the action space sizes of the two players. Experiments in a grid-world predator–prey domain illustrate the sublinear regret scaling and show that posterior sampling competes favorably with a fictitious-play baseline.
[832] Beyond Token Eviction: Mixed-Dimension Budget Allocation for Efficient KV Cache Compression
Ruijie Miao, Zhiming Wang, Wang Li, Shiwei Wu, Shufan Liu, Yanbing Jiang, Tong Yang
Main category: cs.LG
TL;DR: MixedDimKV: A mixed-dimension KV cache compression method that allocates dimensions to tokens at a granular level, reducing memory usage while maintaining performance in long-context transformer inference.
Details
Motivation: KV caching accelerates transformer inference but memory cost grows linearly with input length, limiting long-context deployment. Existing token eviction methods are coarse (zero or full dimension allocation), needing more granular compression.
Method: Proposes MixedDimKV that allocates dimensions to tokens at granular level, and MixedDimKV-H that integrates head-level importance information. Both methods compress KV cache by mixed-dimension allocation rather than binary token eviction.
Result: Outperforms prior KV cache compression methods without head-level profiling. MixedDimKV-H consistently beats HeadKV when using same head-level info. Achieves comparable performance to full attention on LongBench with only 6.25% KV cache. Maintains 100% accuracy at 50K context length with 0.26% cache in Needle-in-a-Haystack test.
Conclusion: Mixed-dimension KV cache compression provides effective memory reduction for long-context transformers while maintaining performance, enabling more efficient long-context deployment.
Abstract: Key-value (KV) caching is widely used to accelerate transformer inference, but its memory cost grows linearly with input length, limiting long-context deployment. Existing token eviction methods reduce memory by discarding less important tokens, which can be viewed as a coarse form of dimensionality reduction that assigns each token either zero or full dimension. We propose MixedDimKV, a mixed-dimension KV cache compression method that allocates dimensions to tokens at a more granular level, and MixedDimKV-H, which further integrates head-level importance information. Experiments on long-context benchmarks show that MixedDimKV outperforms prior KV cache compression methods that do not rely on head-level importance profiling. When equipped with the same head-level importance information, MixedDimKV-H consistently outperforms HeadKV. Notably, our approach achieves comparable performance to full attention on LongBench with only 6.25% of the KV cache. Furthermore, in the Needle-in-a-Haystack test, our solution maintains 100% accuracy at a 50K context length while using as little as 0.26% of the cache.
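The contrast with binary token eviction can be illustrated with a toy budget allocator (the names and the simple proportional rule are my assumptions, not the paper's actual allocation mechanism):

```python
import numpy as np

def allocate_dims(importance, d, budget):
    # Split a scalar-slot budget across tokens in proportion to importance,
    # clipping each token's channel count to [0, d]. Token eviction is the
    # special case where every allocation is exactly 0 or d.
    raw = budget * importance / importance.sum()
    return np.clip(np.floor(raw), 0, d).astype(int)
```

A mixed-dimension scheme can thus keep a few channels for mid-importance tokens that eviction would discard entirely, at the same total memory.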
[833] Optimal low-rank stochastic gradient estimation for LLM training
Zehao Li, Tao Ren, Zishi Zhang, Xi Chen, Yijie Peng
Main category: cs.LG
TL;DR: A memory-efficient low-rank gradient estimator for LLM training using optimal random projections to reduce memory while maintaining unbiasedness and controlling error.
Details
Motivation: LLM training is bottlenecked by memory constraints and stochastic gradient noise in high-dimensional parameter spaces, with empirical evidence showing many LLM gradient matrices are effectively low-rank during training.
Method: Projects high-dimensional stochastic gradient estimators onto random low-dimensional subspaces and lifts them back, using optimally designed projection distributions (including Haar-Stiefel projections) derived from solving a constrained functional optimization problem.
Result: Achieves significant memory savings (3.83GB vs 16.7GB for full backpropagation in RoBERTa-large fine-tuning) while remaining competitive in accuracy, and outperforms traditional methods in autoregressive LLM pretraining (LLaMA-20M/60M/100M).
Conclusion: The proposed low-rank gradient estimator provides practical memory savings and improved training behavior for LLMs through optimal projection strategies.
Abstract: Large language model (LLM) training is often bottlenecked by memory constraints and stochastic gradient noise in extremely high-dimensional parameter spaces. Motivated by empirical evidence that many LLM gradient matrices are effectively low-rank during training, we present an unbiased, memory-efficient, low-rank matrix estimator with the lowest variance that is applicable across common stochastic gradient estimation paradigms. The core idea is to project a high-dimensional stochastic gradient estimator onto a random low-dimensional subspace and lift it back, reducing memory while keeping the estimator unbiased and controlling mean-squared error via an optimally designed projection distribution, including Haar–Stiefel projections. The projection distribution is derived by solving a constrained functional optimization problem, yielding an optimal random projector that guides algorithm design. Empirically, the resulting low-rank gradient estimators deliver both practical memory savings and improved training behavior. In RoBERTa-large fine-tuning, our method attains the lowest peak GPU memory among compared methods (e.g., 3.83GB versus 16.7GB for full BP) while remaining competitive in accuracy; in autoregressive LLM pretraining (LLaMA-20M/60M/100M), our method outperforms the traditional methods, supporting the benefit of the proposed optimal projection strategy.
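The project-and-lift estimator can be sketched as follows, assuming a QR-based Haar-Stiefel sample (the paper derives the optimal projection distribution; this sketch only illustrates the unbiasedness mechanism, and the function name is mine):

```python
import numpy as np

def lowrank_grad_estimate(G, r, rng):
    # Draw an m x r matrix with orthonormal columns via QR of a Gaussian
    # (a standard construction for Haar-distributed Stiefel samples).
    m = G.shape[0]
    Q, _ = np.linalg.qr(rng.standard_normal((m, r)))
    # Project onto the r-dim subspace and lift back; the m/r factor keeps
    # the estimator unbiased, since E[Q Q^T] = (r/m) * I by symmetry.
    return (m / r) * (Q @ (Q.T @ G))
```

Only the r x n matrix Q.T @ G (plus Q) needs to be stored, which is where the memory saving comes from; at r = m the estimator recovers G exactly.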
[834] CFNN: Continued Fraction Neural Network
Chao Wang, Xuancheng Zhou, Ruilin Hou, Xiaoyu Cheng, Ruiyi Ding
Main category: cs.LG
TL;DR: CFNNs use continued fractions with gradient optimization to capture complex asymptotics and discontinuities with far fewer parameters than MLPs, offering better precision and noise robustness for scientific computing.
Details
Motivation: MLPs have spectral bias that hinders resolution of high-curvature features without excessive parameters, making them inefficient for characterizing non-linear functional manifolds with singularities in scientific computing.
Method: Introduce Continued Fraction Neural Networks (CFNNs) that integrate continued fractions with gradient-based optimization to provide a “rational inductive bias.” Develop three implementations: CFNN-Boost, CFNN-MoE, and CFNN-Hybrid to address recursive instability.
Result: CFNNs consistently outperform MLPs in precision with 1-2 orders of magnitude fewer parameters, showing up to 47-fold improvement in noise robustness and physical consistency. Provide formal approximation bounds demonstrating exponential convergence and stability guarantees.
Conclusion: CFNNs bridge black-box flexibility and white-box transparency, establishing a reliable “grey-box” paradigm for AI-driven scientific research by enabling efficient capture of complex asymptotics and discontinuities.
Abstract: Accurately characterizing non-linear functional manifolds with singularities is a fundamental challenge in scientific computing. While Multi-Layer Perceptrons (MLPs) dominate, their spectral bias hinders resolving high-curvature features without excessive parameters. We introduce Continued Fraction Neural Networks (CFNNs), integrating continued fractions with gradient-based optimization to provide a "rational inductive bias." This enables capturing complex asymptotics and discontinuities with extreme parameter frugality. We provide formal approximation bounds demonstrating exponential convergence and stability guarantees. To address recursive instability, we develop three implementations: CFNN-Boost, CFNN-MoE, and CFNN-Hybrid. Benchmarks show CFNNs consistently outperform MLPs in precision with one to two orders of magnitude fewer parameters, exhibiting up to a 47-fold improvement in noise robustness and physical consistency. By bridging black-box flexibility and white-box transparency, CFNNs establish a reliable "grey-box" paradigm for AI-driven scientific research.
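The underlying primitive, evaluating a continued fraction from the inside out, is a few lines (this is the classical backward recurrence, not the paper's trainable architecture):

```python
def eval_continued_fraction(b, a):
    # Evaluate b[0] + a[0]/(b[1] + a[1]/(b[2] + ...)) from the inside out.
    # b holds n+1 partial denominators, a holds n partial numerators.
    x = b[-1]
    for bi, ai in zip(reversed(b[:-1]), reversed(a)):
        x = bi + ai / x
    return x
```

Rational forms like this converge extremely fast for many targets; e.g. the periodic expansion [1; 2, 2, 2, ...] converges exponentially to sqrt(2), which hints at why a rational inductive bias handles sharp asymptotics with few parameters.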
[835] Diffusion Model for Manifold Data: Score Decomposition, Curvature, and Statistical Complexity
Zixuan Zhang, Kaixuan Huang, Tuo Zhao, Mengdi Wang, Minshuo Chen
Main category: cs.LG
TL;DR: Theoretical analysis of diffusion models for high-dimensional data on low-dimensional manifolds, revealing statistical complexity and geometric influences.
Details
Motivation: Despite diffusion models' success in generative modeling, their theoretical understanding for high-dimensional data on low-dimensional structures remains incomplete. The paper aims to bridge this gap by analyzing how diffusion models learn such structured data.
Method: Models data as samples from smooth Riemannian manifolds, analyzes score function decompositions under different noise levels, examines interplay of manifold curvature with score structures, and develops efficient neural network approximations for score functions.
Result: Provides statistical rates for score estimation and distribution learning governed by intrinsic data dimension and manifold curvature, advancing theoretical foundations of diffusion models.
Conclusion: The analysis bridges theory and practice for generative modeling on manifolds, offering insights into how diffusion models handle structured high-dimensional data through geometric properties.
Abstract: Diffusion models have become a leading framework in generative modeling, yet their theoretical understanding – especially for high-dimensional data concentrated on low-dimensional structures – remains incomplete. This paper investigates how diffusion models learn such structured data, focusing on two key aspects: statistical complexity and influence of data geometric properties. By modeling data as samples from a smooth Riemannian manifold, our analysis reveals crucial decompositions of score functions in diffusion models under different levels of injected noise. We also highlight the interplay of manifold curvature with the structures in the score function. These analyses enable an efficient neural network approximation to the score function, built upon which we further provide statistical rates for score estimation and distribution learning. Remarkably, the obtained statistical rates are governed by the intrinsic dimension of data and the manifold curvature. These results advance the statistical foundations of diffusion models, bridging theory and practice for generative modeling on manifolds.
[836] Exponential Family Discriminant Analysis: Generalizing LDA-Style Generative Classification to Non-Gaussian Models
Anish Lakkapragada
Main category: cs.LG
TL;DR: EFDA extends LDA to exponential family distributions, providing calibrated classification with linear decision boundaries in sufficient statistic space.
Details
Motivation: Classical LDA assumes Gaussian distributions, limiting its applicability. The authors aim to create a unified generative framework that works with any exponential family distribution while maintaining desirable properties like calibration and efficiency.
Method: EFDA assumes class-conditional densities belong to a common exponential family, derives closed-form MLEs for natural parameters, and yields a linear decision rule in sufficient statistics. The framework generalizes to K≥2 classes and multivariate data, with formal verification in Lean 4.
Result: EFDA matches classification accuracy of LDA, QDA, and logistic regression while reducing Expected Calibration Error by 2-6× across five distributions. It achieves asymptotic calibration and statistical efficiency, with log-odds estimator approaching Cramér-Rao bound.
Conclusion: EFDA provides a principled extension of LDA to exponential family distributions with superior calibration properties, maintaining linear decision boundaries while capturing nonlinear patterns in original feature space.
Abstract: We introduce Exponential Family Discriminant Analysis (EFDA), a unified generative framework that extends classical Linear Discriminant Analysis (LDA) beyond the Gaussian setting to any member of the exponential family. Under the assumption that each class-conditional density belongs to a common exponential family, EFDA derives closed-form maximum-likelihood estimators for all natural parameters and yields a decision rule that is linear in the sufficient statistic, recovering LDA as a special case and capturing nonlinear decision boundaries in the original feature space. We prove that EFDA is asymptotically calibrated and statistically efficient under correct specification, and we generalise it to $K \geq 2$ classes and multivariate data. Through extensive simulation across five exponential-family distributions (Weibull, Gamma, Exponential, Poisson, Negative Binomial), EFDA matches the classification accuracy of LDA, QDA, and logistic regression while reducing Expected Calibration Error (ECE) by $2$–$6\times$, a gap that is \emph{structural}: it persists for all $n$ and across all class-imbalance levels, because misspecified models remain asymptotically miscalibrated. We further prove and empirically confirm that EFDA’s log-odds estimator approaches the Cramér-Rao bound under correct specification, and is the only estimator in our comparison whose mean squared error converges to zero. Complete derivations are provided for nine distributions. Finally, we formally verify all four theoretical propositions in Lean 4, using Aristotle (Harmonic) and OpenGauss (Math, Inc.) as proof generators, with all outputs independently machine-checked by AXLE (Axiom).
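For a single Poisson feature, the linearity of the log-odds in the sufficient statistic can be seen directly (the function name and signature are mine; the paper derives closed forms for nine distributions):

```python
import numpy as np

def poisson_efda_logodds(x, lam0, lam1, prior1=0.5):
    # For Poisson class-conditionals the x!-terms cancel in the density
    # ratio, leaving log-odds linear in the sufficient statistic x:
    #   x * log(lam1/lam0) - (lam1 - lam0) + log(prior1 / prior0)
    return x * np.log(lam1 / lam0) - (lam1 - lam0) + np.log(prior1 / (1.0 - prior1))
```

The decision boundary is linear in x (the sufficient statistic) even though, after mapping back to raw features of other families, the boundary can be nonlinear in the original space.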
[837] Breaking the $O(\sqrt{T})$ Cumulative Constraint Violation Barrier while Achieving $O(\sqrt{T})$ Static Regret in Constrained Online Convex Optimization
Haricharan Balasundaram, Karthick Krishna Mahendran, Rahul Vaze
Main category: cs.LG
TL;DR: The paper presents an online convex optimization algorithm that achieves O(√T) regret and O(T^{1/3}) cumulative constraint violation for 2-dimensional problems, improving upon the previously believed Ω(√T) lower bound for CCV.
Details
Motivation: The paper addresses the fundamental trade-off between regret and cumulative constraint violation in constrained online convex optimization, challenging the widely held belief that O(√T) regret necessarily implies Ω(√T) CCV for d≥2.
Method: The authors analyze and extend the algorithm from Vaze and Sinha [2025], showing that it can achieve better CCV bounds than previously thought possible, specifically O(T^{1/3}) CCV while maintaining O(√T) regret for d=2.
Result: The paper proves that the algorithm simultaneously achieves O(√T) regret and O(T^{1/3}) cumulative constraint violation for 2-dimensional constrained online convex optimization problems, refuting the previous belief that Ω(√T) CCV was unavoidable.
Conclusion: This work demonstrates that better trade-offs between regret and constraint violation are possible than previously believed, opening new directions for algorithm design in constrained online optimization.
Abstract: The problem of constrained online convex optimization is considered, where at each round, once a learner commits to an action $x_t \in \mathcal{X} \subset \mathbb{R}^d$, a convex loss function $f_t$ and a convex constraint function $g_t$ defining the constraint $g_t(x)\le 0$ are revealed. The objective is to simultaneously minimize the static regret and cumulative constraint violation (CCV) compared to the benchmark that knows the loss functions and constraint functions $f_t$ and $g_t$ for all $t$ ahead of time, and chooses a static optimal action that is feasible with respect to all $g_t(x)\le 0$. In recent prior work, Sinha and Vaze [2024] proposed algorithms with simultaneous regret of $O(\sqrt{T})$ and CCV of $O(\sqrt{T})$, or CCV of $O(1)$ in specific cases (e.g., when $d=1$; Vaze and Sinha [2025]). It is widely believed that CCV is $\Omega(\sqrt{T})$ for all algorithms that ensure that regret is $O(\sqrt{T})$ with the worst-case input for any $d\ge 2$. In this paper, we refute this and show that the algorithm of Vaze and Sinha [2025] simultaneously achieves regret of $O(\sqrt{T})$ and CCV of $O(T^{1/3})$ when $d=2$.
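To make the two quantities being traded off concrete, here is a toy one-dimensional instance (the losses, constraint, and comparator below are illustrative choices of ours, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(1)
T = 1000
thetas = rng.uniform(0.0, 2.0, T)   # losses f_t(x) = (x - theta_t)^2
# Constraint g_t(x) = x - 1 <= 0, i.e. the feasible set is x <= 1.

def regret_and_ccv(xs):
    # Static regret vs. the best fixed feasible action, and CCV.
    x_star = min(thetas.mean(), 1.0)          # constrained minimizer of sum f_t
    regret = np.sum((xs - thetas) ** 2) - np.sum((x_star - thetas) ** 2)
    ccv = np.sum(np.maximum(xs - 1.0, 0.0))   # cumulative constraint violation
    return regret, ccv

r_feas, ccv_feas = regret_and_ccv(np.full(T, 1.0))   # always feasible
r_bad, ccv_bad = regret_and_ccv(np.full(T, 2.0))     # violates every round
print(ccv_feas, ccv_bad)
```

A feasible constant play has zero CCV; an infeasible one accumulates Θ(T) violation, which is the regime the O(T^{1/3}) result improves on.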
[838] Centrality-Based Pruning for Efficient Echo State Networks
Sudip Laudari
Main category: cs.LG
TL;DR: Graph centrality-based pruning method for Echo State Networks that reduces reservoir size while maintaining prediction accuracy by removing structurally unimportant nodes.
Details
Motivation: ESNs have randomly initialized reservoirs with redundant nodes that cause unnecessary computational overhead and reduced efficiency. There's a need to prune these networks while preserving their predictive capabilities.
Method: Interpret the reservoir as a weighted directed graph and use centrality measures to identify and remove structurally less important nodes. This graph-based approach allows for targeted pruning based on network structure importance.
Result: Experiments on Mackey-Glass time-series prediction and electric load forecasting show the method can significantly reduce reservoir size while maintaining or even improving prediction accuracy, while preserving essential reservoir dynamics.
Conclusion: Graph centrality-based pruning is an effective approach for optimizing ESNs, reducing computational complexity without sacrificing performance, making ESNs more efficient for time-series prediction tasks.
Abstract: Echo State Networks (ESNs) are a reservoir computing framework widely used for nonlinear time-series prediction. However, despite their effectiveness, the randomly initialized reservoir often contains redundant nodes, leading to unnecessary computational overhead and reduced efficiency. In this work, we propose a graph centrality-based pruning approach that interprets the reservoir as a weighted directed graph and removes structurally less important nodes using centrality measures. Experiments on Mackey-Glass time-series prediction and electric load forecasting demonstrate that the proposed method can significantly reduce reservoir size while maintaining, and in some cases improving, prediction accuracy, while preserving the essential reservoir dynamics.
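A minimal sketch of the pruning idea (the abstract does not say which centrality measures are used; eigenvector centrality via power iteration is our stand-in):

```python
import numpy as np

rng = np.random.default_rng(2)
n, keep = 100, 60

# Random sparse ESN reservoir, viewed as a weighted directed graph
# (W[i, j] is the edge weight from node j to node i).
W = rng.normal(0.0, 1.0, (n, n)) * (rng.random((n, n)) < 0.1)

def eigenvector_centrality(A, iters=200):
    # Power iteration on |A|; one possible centrality score.
    v = np.ones(A.shape[0])
    for _ in range(iters):
        v = np.abs(A) @ v
        v /= np.linalg.norm(v)
    return v

scores = eigenvector_centrality(W)
keep_idx = np.sort(np.argsort(scores)[-keep:])   # retain the top-`keep` nodes
W_pruned = W[np.ix_(keep_idx, keep_idx)]
print(W_pruned.shape)
```

The pruned reservoir keeps only the structurally central nodes; in practice one would re-check the echo state property (e.g., rescale the spectral radius) after pruning.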
[839] Neuronal Self-Adaptation Enhances Capacity and Robustness of Representation in Spiking Neural Networks
Zhuobin Yang, Yeyao Bao, Liangfu Lv, Jian Zhang, Xiaohong Li, Yunliang Zang
Main category: cs.LG
TL;DR: A novel Potassium-regulated LIF (KvLIF) neuron model for Spiking Neural Networks that improves adaptability and robustness by mimicking biological potassium channel dynamics.
Details
Motivation: Conventional LIF neurons in SNNs have limited adaptability and are susceptible to noise, leading to degraded accuracy and robustness. The authors aim to bridge biological plausibility with computational efficiency by drawing inspiration from biological potassium channel regulation.
Method: Proposed KvLIF neuron model with an auxiliary conductance state that integrates membrane potential and spiking history to adaptively modulate neuronal excitability and reset dynamics, extending dynamic response range and suppressing noise-induced spikes.
Result: Extensive evaluation on static image and neuromorphic datasets shows consistent improvements in classification accuracy and superior robustness compared to existing LIF models.
Conclusion: KvLIF bridges biological plausibility with computational efficiency, offering an enhanced neuron model that improves SNN performance while maintaining suitability for low-power neuromorphic deployment.
Abstract: Spiking Neural Networks (SNNs) are promising for energy-efficient, real-time edge computing, yet their performance is often constrained by the limited adaptability of conventional leaky integrate-and-fire (LIF) neurons. Existing LIF models struggle with restricted information capacity and susceptibility to noise, leading to degraded accuracy and compromised robustness. Inspired by the dynamic self-regulation of biological potassium channels, we propose the Potassium-regulated LIF (KvLIF) neuron model. KvLIF introduces an auxiliary conductance state that integrates membrane potential and spiking history to adaptively modulate neuronal excitability and reset dynamics. This design extends the dynamic response range of neurons to varying input intensities and effectively suppresses noise-induced spikes. We extensively evaluate KvLIF on both static image and neuromorphic datasets, demonstrating consistent improvements in classification accuracy and superior robustness compared to existing LIF models. Our work bridges biological plausibility with computational efficiency, offering a neuron model that enhances SNN performance while maintaining suitability for low-power neuromorphic deployment.
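The exact KvLIF equations are not given in the abstract; as a rough sketch of the mechanism, here is a discrete-time LIF neuron with a generic spike-triggered adaptation state standing in for the potassium-channel regulation:

```python
import numpy as np

def run_lif(I, adapt=0.0, tau_m=0.9, tau_g=0.95, v_th=1.0):
    # Discrete-time LIF with an optional auxiliary state g that builds up
    # with spiking history and opposes depolarization. This is our own
    # stand-in; the true KvLIF dynamics are not specified in the abstract.
    v, g, spikes = 0.0, 0.0, 0
    for i in I:
        v = tau_m * v + i - adapt * g   # adaptation current opposes input
        g = tau_g * g                   # auxiliary conductance state decays
        if v >= v_th:
            spikes += 1
            v = 0.0                     # hard reset
            g += 1.0                    # adaptation accumulates per spike
    return spikes

I = np.full(500, 0.25)                  # constant drive
plain = run_lif(I, adapt=0.0)
adapted = run_lif(I, adapt=0.5)
print(plain, adapted)
```

Under constant drive the adaptive neuron fires markedly less than the plain LIF, illustrating how history-dependent conductance extends the dynamic response range and damps runaway spiking.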
[840] Adversarial Attacks on Locally Private Graph Neural Networks
Matta Varun, Ajay Kumar Dhakar, Yuan Hong, Shamik Sural
Main category: cs.LG
TL;DR: Investigates adversarial attacks on LDP-protected GNNs, analyzing how privacy guarantees affect adversarial robustness and exploring attack effectiveness under LDP constraints.
Details
Motivation: GNNs are vulnerable to adversarial attacks, especially with sensitive graph data. While LDP provides privacy protection for GNN training, its impact on adversarial robustness is underexplored. The paper aims to understand the interplay between privacy and security in graph learning.
Method: Analyzes adversarial attacks on LDP-protected GNNs by exploring how LDP privacy guarantees affect adversarial perturbations. Examines effectiveness of existing attack methods under LDP constraints and discusses challenges in crafting adversarial examples with privacy protection.
Result: The paper analyzes the effectiveness of adversarial attacks on LDP-protected GNNs and identifies challenges in creating adversarial examples under LDP constraints. It suggests directions for defending LDP-protected GNNs against adversarial attacks.
Conclusion: Highlights the need for robust and privacy-preserving GNN architectures that balance both privacy (LDP) and security (adversarial robustness) in graph learning systems.
Abstract: Graph neural networks (GNNs) are a powerful tool for analyzing graph-structured data. However, their vulnerability to adversarial attacks raises serious concerns, especially when dealing with sensitive information. Local Differential Privacy (LDP) offers a privacy-preserving framework for training GNNs, but its impact on adversarial robustness remains underexplored. This paper investigates adversarial attacks on LDP-protected GNNs. We explore how the privacy guarantees of LDP can be leveraged or hindered by adversarial perturbations. The effectiveness of existing attack methods on LDP-protected GNNs is analyzed, and potential challenges in crafting adversarial examples under LDP constraints are discussed. Additionally, we suggest directions for defending LDP-protected GNNs against adversarial attacks. This work investigates the interplay between privacy and security in graph learning, highlighting the need for robust and privacy-preserving GNN architectures.
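The paper's exact LDP mechanism is not specified in the abstract; a standard choice for binary node features is randomized response, sketched here to show the kind of perturbation both the trainer and the attacker must contend with:

```python
import numpy as np

def randomized_response(X, eps, rng):
    # epsilon-LDP randomized response on binary features: keep each bit
    # with probability e^eps / (1 + e^eps), flip it otherwise.
    p_keep = np.exp(eps) / (1.0 + np.exp(eps))
    flip = rng.random(X.shape) >= p_keep
    return np.where(flip, 1 - X, X)

rng = np.random.default_rng(3)
X = rng.integers(0, 2, (1000, 16))        # binary node features
X_priv = randomized_response(X, eps=1.0, rng=rng)
agree = (X_priv == X).mean()
print(f"agreement: {agree:.3f}")
```

At eps=1 roughly 73% of bits survive unchanged; an adversarial perturbation crafted against the raw features is itself randomized away with the same probability, which is one source of the privacy/robustness interplay the paper studies.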
[841] Evaluating Uplift Modeling under Structural Biases: Insights into Metric Stability and Model Robustness
Yuxuan Yang, Dugang Liu, Yiyan Huang
Main category: cs.LG
TL;DR: A systematic benchmarking framework for uplift models in personalized marketing that evaluates model performance under real-world data biases like selection bias and spillover effects using semi-synthetic data.
Details
Motivation: Real-world uplift modeling faces challenges from data biases (selection bias, spillover effects, unobserved confounding) that affect estimation accuracy and metric validity, but there's a lack of systematic studies on bias-aware assessment.
Method: Developed a systematic benchmarking framework using semi-synthetic data that retains real-world feature dependencies while providing ground truth to isolate structural biases, enabling evaluation of uplift models under various bias conditions.
Result: Found that: 1) uplift targeting and prediction are distinct objectives with different skill requirements; 2) TARNet shows notable robustness across diverse biases; 3) ATE-approximating metrics yield more consistent model rankings under data imperfections.
Conclusion: The study highlights the need for more robust uplift models and metrics that can handle real-world data biases, with TARNet’s robustness providing insights for future model design and ATE-aligned metrics offering more stable evaluation.
Abstract: In personalized marketing, uplift models estimate incremental effects by modeling how customer behavior changes under alternative treatments. However, real-world data often exhibit biases - such as selection bias, spillover effects, and unobserved confounding - which adversely affect both estimation accuracy and metric validity. Despite the importance of bias-aware assessment, a lack of systematic studies persists. To bridge this gap, we design a systematic benchmarking framework. Unlike standard predictive tasks, real-world uplift datasets lack counterfactual ground truth, rendering direct metric validation infeasible. Therefore, a semi-synthetic approach serves as a critical enabler for systematic benchmarking, effectively bridging the gap by retaining real-world feature dependencies while providing the ground truth needed to isolate structural biases. Our investigations reveal that: (i) uplift targeting and prediction can manifest as distinct objectives, where proficiency in one does not ensure efficacy in the other; (ii) while many models exhibit inconsistent performance under diverse biases, TARNet shows notable robustness, providing insights for subsequent model design; (iii) evaluation metric stability is linked to mathematical alignment with the ATE, suggesting that ATE-approximating metrics yield more consistent model rankings under structural data imperfections. These findings suggest the need for more robust uplift models and metrics. Code will be released upon acceptance.
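The appeal of the semi-synthetic setup is that the treatment effect is known by construction. A minimal sketch (our own toy data, not the paper's benchmark) of the ATE that the stability-linked metrics are anchored to:

```python
import numpy as np

rng = np.random.default_rng(4)
n, true_effect = 20000, 0.3

# Semi-synthetic style setup: randomized treatment T, outcome with a
# known additive treatment effect, so ground truth is available.
T = rng.integers(0, 2, n)
Y = 0.5 + true_effect * T + rng.normal(0.0, 1.0, n)

# Difference-in-means ATE estimate under randomization.
ate_hat = Y[T == 1].mean() - Y[T == 0].mean()
print(f"ATE estimate: {ate_hat:.3f}")
```

Under randomization the difference in means recovers the planted effect; the paper's point is that metrics mathematically aligned with this quantity rank uplift models more stably once selection bias or spillover corrupt the data.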
[842] OmniPatch: A Universal Adversarial Patch for ViT-CNN Cross-Architecture Transfer in Semantic Segmentation
Aarush Aggarwal, Akshat Tomar, Amritanshu Tiwari, Sargam Goyal
Main category: cs.LG
TL;DR: OmniPatch is a training framework for learning universal adversarial patches that transfer across images and both ViT/CNN architectures without needing target model parameters.
Details
Motivation: Current semantic segmentation models for autonomous driving are vulnerable to black-box adversarial attacks, and existing approaches have limitations in practicality and transferability across different architectures.
Method: OmniPatch framework learns universal adversarial patches that generalize across images and both Vision Transformer (ViT) and CNN architectures without requiring access to target model parameters.
Result: The approach creates patches that work across different images and architectures, improving transferability compared to methods that craft image-wide perturbations or optimize patches for single architectures.
Conclusion: OmniPatch provides a more practical and transferable adversarial attack framework for semantic segmentation models in autonomous driving scenarios.
Abstract: Robust semantic segmentation is crucial for safe autonomous driving, yet deployed models remain vulnerable to black-box adversarial attacks when target weights are unknown. Most existing approaches either craft image-wide perturbations or optimize patches for a single architecture, which limits their practicality and transferability. We introduce OmniPatch, a training framework for learning a universal adversarial patch that generalizes across images and both ViT and CNN architectures without requiring access to target model parameters.
[843] Neural Autoregressive Flows for Markov Boundary Learning
Khoa Nguyen, Bao Duong, Viet Huynh, Thin Nguyen
Main category: cs.LG
TL;DR: Novel framework for Markov boundary discovery using conditional entropy scoring with masked autoregressive networks and polynomial-time greedy search, showing scalability and superior performance.
Details
Motivation: Current Markov boundary discovery methods rely on nonparametric estimators and heuristic searches without theoretical guarantees, creating a need for more reliable and efficient approaches with provable performance.
Method: Integrates conditional entropy from information theory as scoring criterion, designs masked autoregressive network to capture complex dependencies, and proposes parallelizable greedy search strategy with polynomial time complexity.
Result: Comprehensive evaluations on real-world and synthetic datasets demonstrate scalability and superior performance in both Markov boundary discovery and causal discovery tasks.
Conclusion: The proposed framework provides efficient Markov boundary discovery with theoretical guarantees, and learned Markov boundaries can accelerate convergence of causal discovery algorithms.
Abstract: Recovering Markov boundary – the minimal set of variables that maximizes predictive performance for a response variable – is crucial in many applications. While recent advances improve upon traditional constraint-based techniques by scoring local causal structures, they still rely on nonparametric estimators and heuristic searches, lacking theoretical guarantees for reliability. This paper investigates a framework for efficient Markov boundary discovery by integrating conditional entropy from information theory as a scoring criterion. We design a novel masked autoregressive network to capture complex dependencies. A parallelizable greedy search strategy in polynomial time is proposed, supported by analytical evidence. We also discuss how initializing a graph with learned Markov boundaries accelerates the convergence of causal discovery. Comprehensive evaluations on real-world and synthetic datasets demonstrate the scalability and superior performance of our method in both Markov boundary discovery and causal discovery tasks.
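The scoring idea can be sketched with plug-in counts in place of the paper's masked autoregressive estimator (a simplification of ours; discrete toy data, greedy forward selection on conditional entropy):

```python
import numpy as np
from collections import Counter

def cond_entropy(y, X_sub):
    # Empirical H(Y | X_sub) for discrete data.
    keys = [tuple(row) for row in X_sub]
    groups = {}
    for k, yi in zip(keys, y):
        groups.setdefault(k, []).append(yi)
    h = 0.0
    for vals in groups.values():
        p_group = len(vals) / len(y)
        counts = Counter(vals)
        h_y = -sum((c / len(vals)) * np.log2(c / len(vals))
                   for c in counts.values())
        h += p_group * h_y
    return h

def greedy_mb(y, X, tol=1e-2):
    # Greedily add the variable with the largest entropy reduction,
    # stopping when no candidate helps beyond a tolerance.
    selected, rest = [], list(range(X.shape[1]))
    h = cond_entropy(y, X[:, selected])
    while rest:
        best_gain, best_j = max(
            (h - cond_entropy(y, X[:, selected + [j]]), j) for j in rest)
        if best_gain < tol:
            break
        selected.append(best_j)
        rest.remove(best_j)
        h -= best_gain
    return sorted(selected)

rng = np.random.default_rng(5)
X = rng.integers(0, 2, (5000, 4))
y = X[:, 0] & X[:, 1]        # Markov boundary of y is {0, 1}
print(greedy_mb(y, X))
```

On this AND target the greedy search recovers exactly {0, 1} and ignores the noise variables; the paper replaces the plug-in estimator with a neural one and parallelizes the search.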
[844] Large Neighborhood Search meets Iterative Neural Constraint Heuristics
Yudong W. Xu, Wenhao Li, Scott Sanner, Elias B. Khalil
Main category: cs.LG
TL;DR: Neural networks adapted into Large Neighborhood Search (LNS) framework for constraint satisfaction problems, with novel prediction-guided destroy operators and neural repair using ConsFormer.
Details
Motivation: Neural networks are increasingly used as heuristics for constraint satisfaction problems, often learning iterative refinement. The paper aims to explicitly connect iterative neural heuristics with Large Neighborhood Search (LNS) to improve neural constraint satisfaction methods.
Method: Adapts ConsFormer neural constraint satisfaction method into LNS procedure with two components: destroy operators (including classical heuristics and novel prediction-guided operators using model’s internal scores) and repair operators (using ConsFormer with both sampling-based and greedy decoders).
Result: Neural LNS yields substantial gains over vanilla neural settings and improves competitiveness with classical and neural baselines on Sudoku, Graph Coloring, and MaxCut. Stochastic destroy operators outperform greedy ones, while greedy repair is more effective than sampling-based repair for finding high-quality feasible assignments.
Conclusion: LNS serves as a useful framework for structuring and improving iterative neural approaches to constraint satisfaction, with consistent design patterns across different problem domains.
Abstract: Neural networks are being increasingly used as heuristics for constraint satisfaction. These neural methods are often recurrent, learning to iteratively refine candidate assignments. In this work, we make explicit the connection between such iterative neural heuristics and Large Neighborhood Search (LNS), and adapt an existing neural constraint satisfaction method-ConsFormer-into an LNS procedure. We decompose the resulting neural LNS into two standard components: the destroy and repair operators. On the destroy side, we instantiate several classical heuristics and introduce novel prediction-guided operators that exploit the model’s internal scores to select neighborhoods. On the repair side, we utilize ConsFormer as a neural repair operator and compare the original sampling-based decoder to a greedy decoder that selects the most likely assignments. Through an empirical study on Sudoku, Graph Coloring, and MaxCut, we find that adapting the neural heuristic to an LNS procedure yields substantial gains over its vanilla settings and improves its competitiveness with classical and neural baselines. We further observe consistent design patterns across tasks: stochastic destroy operators outperform greedy ones, while greedy repair is more effective than sampling-based repair for finding a single high-quality feasible assignment. These findings highlight LNS as a useful lens and design framework for structuring and improving iterative neural approaches.
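The destroy/repair decomposition can be sketched on toy graph coloring (our own minimal loop, with a random destroy operator and a greedy repair operator standing in for the paper's prediction-guided destroy and ConsFormer repair):

```python
import random

def conflicts(coloring, edges):
    return sum(coloring[u] == coloring[v] for u, v in edges)

def lns_coloring(n, edges, k=3, iters=200, destroy_size=5, seed=6):
    rng = random.Random(seed)
    adj = {u: [] for u in range(n)}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    coloring = [rng.randrange(k) for _ in range(n)]
    start_cost = best_cost = conflicts(coloring, edges)
    best = coloring[:]
    for _ in range(iters):
        cand = best[:]
        for u in rng.sample(range(n), destroy_size):   # destroy: pick nodes
            used = [0] * k
            for v in adj[u]:
                used[cand[v]] += 1
            cand[u] = min(range(k), key=lambda c: used[c])  # greedy repair
        cost = conflicts(cand, edges)
        if cost <= best_cost:                          # keep improvements
            best, best_cost = cand, cost
    return start_cost, best_cost

rng = random.Random(7)
n = 30
edges = [(u, v) for u in range(n) for v in range(u + 1, n)
         if rng.random() < 0.2]
start_cost, best_cost = lns_coloring(n, edges)
print(start_cost, best_cost)
```

The loop only ever accepts non-worsening candidates, so the conflict count is monotone; the paper's contribution is making both operators neural and studying which combinations (stochastic destroy, greedy repair) work best.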
[845] Achieving $\widetilde{O}(1/\epsilon)$ Sample Complexity for Bilinear Systems Identification under Bounded Noises
Hongyu Yi, Chenbei Lu, Jing Yu
Main category: cs.LG
TL;DR: Finite-sample set-membership identification for discrete-time bilinear systems with bounded symmetric log-concave disturbances, showing parameter set diameter shrinks with sample complexity Õ(1/ε).
Details
Motivation: Existing finite-sample results focus on linear systems or require stronger noise assumptions. This work addresses the more challenging bilinear setting with trajectory-dependent regressors and marginally stable dynamics with polynomial state growth.
Method: Theoretical analysis of finite-sample set-membership identification for discrete-time bilinear systems under bounded symmetric log-concave disturbances, considering trajectory-dependent regressors and marginally stable dynamics.
Result: Proves that the diameter of the feasible parameter set shrinks with sample complexity Õ(1/ε). Simulation results support the theory and show advantages for uncertainty quantification.
Conclusion: The paper provides theoretical guarantees for finite-sample identification of bilinear systems under challenging conditions, enabling reliable uncertainty quantification for this class of systems.
Abstract: This paper studies finite-sample set-membership identification for discrete-time bilinear systems under bounded symmetric log-concave disturbances. Compared with existing finite-sample results for linear systems and related analyses under stronger noise assumptions, we consider the more challenging bilinear setting with trajectory-dependent regressors and allow marginally stable dynamics with polynomial mean-square state growth. Under these conditions, we prove that the diameter of the feasible parameter set shrinks with sample complexity $\widetilde{O}(1/\epsilon)$. Simulations support the theory and illustrate the advantage of the proposed estimator for uncertainty quantification.
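A scalar toy version of the setting (our own sketch: the paper is multivariate and uses set-membership, not least squares) shows what "trajectory-dependent regressors" means for a bilinear system:

```python
import numpy as np

rng = np.random.default_rng(8)
a, b, T = 0.5, 0.2, 2000

# Scalar bilinear system x_{t+1} = a x_t + b u_t x_t + w_t with
# bounded (uniform) noise and random binary excitation.
x = np.zeros(T + 1)
x[0] = 1.0
u = rng.choice([-1.0, 1.0], T)
w = rng.uniform(-0.1, 0.1, T)
for t in range(T):
    x[t + 1] = a * x[t] + b * u[t] * x[t] + w[t]

# The regressors [x_t, u_t x_t] depend on the trajectory itself.
Phi = np.column_stack([x[:-1], u * x[:-1]])
theta_hat, *_ = np.linalg.lstsq(Phi, x[1:], rcond=None)
print(theta_hat)
```

Even this plain least-squares fit recovers (a, b) well with a couple of thousand samples; the paper's set-membership analysis instead bounds the entire feasible parameter set and proves its diameter shrinks at the stated rate.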
[846] Cross-Granularity Representations for Biological Sequences: Insights from ESM and BiGCARP
Hanlin Xiao, Rainer Breitling, Eriko Takano, Mauricio A. Álvarez
Main category: cs.LG
TL;DR: Cross-granularity integration of biological sequence models (BiGCARP for Pfam domains and ESM for amino acids) improves performance and interpretability by combining complementary biological knowledge from different hierarchical levels.
Details
Motivation: Biological sequences exhibit hierarchical granularity (nucleotides, amino acids, protein domains, genes) that encodes functional information, unlike natural language's symbolic granularity. The paper investigates how to integrate cross-granularity knowledge from different biological sequence models to improve performance and interpretability.
Method: Case study integrating BiGCARP (Pfam domain-level model for biosynthetic gene clusters) and ESM (amino acid-level protein language model). Used representation analysis tools and probe tasks to analyze why simple cross-model embedding initialization fails, showing deeper-layer embeddings capture more contextual knowledge. Demonstrated that different granularities encode complementary biological knowledge.
Result: Deeper-layer embeddings capture more contextual and faithful representations of learned knowledge. Representations at different granularities encode complementary biological knowledge. Combining them yields measurable performance gains in intermediate-level prediction tasks.
Conclusion: Cross-granularity integration is a promising strategy for improving both performance and interpretability of biological foundation models by leveraging complementary knowledge from different hierarchical levels of biological sequences.
Abstract: Recent advances in general-purpose foundation models have stimulated the development of large biological sequence models. While natural language shows symbolic granularity (characters, words, sentences), biological sequences exhibit hierarchical granularity whose levels (nucleotides, amino acids, protein domains, genes) further encode biologically functional information. In this paper, we investigate the integration of cross-granularity knowledge from models through a case study of BiGCARP, a Pfam domain-level model for biosynthetic gene clusters, and ESM, an amino acid-level protein language model. Using representation analysis tools and a set of probe tasks, we first explain why a straightforward cross-model embedding initialization fails to improve downstream performance in BiGCARP, and show that deeper-layer embeddings capture a more contextual and faithful representation of the model’s learned knowledge. Furthermore, we demonstrate that representations at different granularities encode complementary biological knowledge, and that combining them yields measurable performance gains in intermediate-level prediction tasks. Our findings highlight cross-granularity integration as a promising strategy for improving both the performance and interpretability of biological foundation models.
[847] Simple Projection-Free Algorithm for Contextual Recommendation with Logarithmic Regret and Robustness
Shinsaku Sakaue
Main category: cs.LG
TL;DR: A simple algorithm for contextual recommendation that improves computational efficiency over prior ONS methods while maintaining O(d log T) regret, with robustness to suboptimal action feedback and kernelization capability.
Details
Motivation: Contextual recommendation is a variant of contextual linear bandits where the learner observes optimal actions rather than reward scalars. Prior work by Sakaue et al. (2025) developed an ONS approach with O(d log T) regret, but it has computational bottlenecks due to Mahalanobis projection steps.
Method: The authors propose a simple algorithm that exploits the improperness inherent in contextual recommendation, leading to an update rule similar to the second-order perceptron from online classification. This eliminates the computationally expensive Mahalanobis projection step required by ONS methods.
Result: The algorithm achieves the same O(d log T) regret guarantee as prior ONS methods but with better computational efficiency. It also remains robust to possibly suboptimal action feedback without requiring multiple ONS learners with different learning rates.
Conclusion: The proposed method provides a more efficient alternative to ONS-based approaches for contextual recommendation, with benefits that extend to kernelized settings in Hilbert spaces where eliminating Mahalanobis projections is particularly advantageous.
Abstract: Contextual recommendation is a variant of contextual linear bandits in which the learner observes an (optimal) action rather than a reward scalar. Recently, Sakaue et al. (2025) developed an efficient Online Newton Step (ONS) approach with an $O(d\log T)$ regret bound, where $d$ is the dimension of the action space and $T$ is the time horizon. In this paper, we present a simple algorithm that is more efficient than the ONS-based method while achieving the same regret guarantee. Our core idea is to exploit the improperness inherent in contextual recommendation, leading to an update rule akin to the second-order perceptron from online classification. This removes the Mahalanobis projection step required by ONS, which is often a major computational bottleneck. More importantly, the same algorithm remains robust to possibly suboptimal action feedback, whereas the prior ONS-based method required running multiple ONS learners with different learning rates for this extension. We describe how our method works in general Hilbert spaces (e.g., via kernelization), where eliminating Mahalanobis projections becomes even more beneficial.
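Since the paper describes its update as akin to the second-order perceptron, here is that classical algorithm on separable synthetic data (a reference sketch of the known algorithm, not the paper's method; the margin filter and parameters are our own):

```python
import numpy as np

def second_order_perceptron(X, y, a=1.0):
    # Second-order perceptron: maintain A = aI + sum of x x^T and
    # v = sum of y x over mistake rounds; predict with the whitened
    # margin. No projection step is ever needed.
    d = X.shape[1]
    A = a * np.eye(d)
    v = np.zeros(d)
    mistakes = 0
    for x, yt in zip(X, y):
        S = A + np.outer(x, x)                    # include current example
        pred = np.sign(v @ np.linalg.solve(S, x)) or 1.0
        if pred != yt:
            mistakes += 1
            v += yt * x
            A = S
    return mistakes

rng = np.random.default_rng(9)
n, d = 400, 5
w_star = rng.normal(size=d)
X = rng.normal(size=(n, d))
margins = X @ w_star
keep = np.abs(margins) > 1.0        # enforce a margin for a clean run
X, y = X[keep], np.sign(margins[keep])
mistakes = second_order_perceptron(X, y)
print(mistakes)
```

Each step costs one small linear solve with no Mahalanobis projection, which is the efficiency point the paper builds on.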
[848] Beyond the Academic Monoculture: A Unified Framework and Industrial Perspective for Attributed Graph Clustering
Yunhui Liu, Yue Liu, Yongchao Liu, Tao Zheng, Stan Z. Li, Xinwang Liu, Tieke He
Main category: cs.LG
TL;DR: Survey paper on Attributed Graph Clustering (AGC) that introduces a unified framework, critiques current evaluation practices, and addresses industrial deployment challenges.
Details
Motivation: To bridge the gap between academic benchmark performance and real-world industrial deployment requirements for Attributed Graph Clustering, which partitions nodes by modeling both structural topology and node attributes.
Method: 1) Introduces Encode-Cluster-Optimize taxonomic framework to decompose AGC algorithms into three orthogonal modules; 2) Critically examines current evaluation protocols; 3) Analyzes practical industrial deployment constraints and provides engineering strategies.
Result: Provides comprehensive review with unified framework, exposes limitations of current evaluation practices (over-reliance on small citation networks, inadequate metrics), and outlines actionable strategies for industrial deployment.
Conclusion: Proposes holistic evaluation standards and research roadmap prioritizing heterophily-robust encoders, scalable joint optimization, and unsupervised model selection to meet production-grade requirements.
Abstract: Attributed Graph Clustering (AGC) is a fundamental unsupervised task that partitions nodes into cohesive groups by jointly modeling structural topology and node attributes. While the advent of graph neural networks and self-supervised learning has catalyzed a proliferation of AGC methodologies, a widening chasm persists between academic benchmark performance and the stringent demands of real-world industrial deployment. To bridge this gap, this survey provides a comprehensive, industrially grounded review of AGC from three complementary perspectives. First, we introduce the Encode-Cluster-Optimize taxonomic framework, which decomposes the diverse algorithmic landscape into three orthogonal, composable modules: representation encoding, cluster projection, and optimization strategy. This unified paradigm enables principled architectural comparisons and inspires novel methodological combinations. Second, we critically examine prevailing evaluation protocols to expose the field’s academic monoculture: a pervasive over-reliance on small, homophilous citation networks, the inadequacy of supervised-only metrics for an inherently unsupervised task, and the chronic neglect of computational scalability. In response, we advocate for a holistic evaluation standard that integrates supervised semantic alignment, unsupervised structural integrity, and rigorous efficiency profiling. Third, we explicitly confront the practical realities of industrial deployment. By analyzing operational constraints such as massive scale, severe heterophily, and tabular feature noise alongside extensive empirical evidence from our companion benchmark, we outline actionable engineering strategies. Furthermore, we chart a clear roadmap for future research, prioritizing heterophily-robust encoders, scalable joint optimization, and unsupervised model selection criteria to meet production-grade requirements.
[849] A Knowledge-Informed Pretrained Model for Causal Discovery
Wenbo Xu, Yue He, Yunhai Wang, Xingxuan Zhang, Kun Kuang, Yueguo Chen, Peng Cui
Main category: cs.LG
TL;DR: A knowledge-informed pretrained model for causal discovery that integrates weak prior knowledge as a middle ground between costly interventional signals and purely data-driven approaches.
Details
Motivation: Existing causal discovery methods either rely on strong assumptions/costly interventional signals or use purely data-driven approaches with limited guidance, hindering practical deployment. Real-world scenarios often only provide coarse domain knowledge.
Method: Proposes a knowledge-informed pretrained model with dual source encoder-decoder architecture to process observational data in a knowledge-informed way. Uses diverse pretraining dataset and curriculum learning strategy to adapt to varying prior strengths across mechanisms, graph densities, and variable scales.
Result: Extensive experiments on in-distribution, out-of-distribution, and real-world datasets demonstrate consistent improvements over existing baselines, with strong robustness and practical applicability.
Conclusion: The proposed approach provides a principled middle ground for causal discovery that effectively integrates weak prior knowledge, showing strong performance and practical utility.
Abstract: Causal discovery has been widely studied, yet many existing methods rely on strong assumptions or fall into two extremes: either depending on costly interventional signals or partial ground truth as strong priors, or adopting purely data driven paradigms with limited guidance, which hinders practical deployment. Motivated by real-world scenarios where only coarse domain knowledge is available, we propose a knowledge-informed pretrained model for causal discovery that integrates weak prior knowledge as a principled middle ground. Our model adopts a dual source encoder-decoder architecture to process observational data in a knowledge-informed way. We design a diverse pretraining dataset and a curriculum learning strategy that smoothly adapts the model to varying prior strengths across mechanisms, graph densities, and variable scales. Extensive experiments on in-distribution, out-of distribution, and real-world datasets demonstrate consistent improvements over existing baselines, with strong robustness and practical applicability.
[850] Semantic Sections: An Atlas-Native Feature Ontology for Obstructed Representation Spaces
Hossein Javidnia
Main category: cs.LG
TL;DR: Introduces semantic sections as a new feature ontology for obstructed representation spaces, where locally coherent meanings don’t assemble into globally consistent features, with discovery pipeline and empirical validation across LLMs.
Details
Motivation: Traditional interpretability approaches treat features as single global directions/dictionary atoms shared across contexts, but this fails in obstructed representation spaces where locally coherent meanings don't assemble into globally consistent features.
Method: Introduces semantic sections as transport-compatible families of local feature representatives defined over context atlases; develops discovery-and-certification pipeline with seeded propagation, synchronization, defect-based pruning, cycle-aware taxonomy, and deduplication.
Result: Finds nontrivial populations of semantic sections across layer-16 atlases for Llama 3.2 3B Instruct, Qwen 2.5 3B Instruct, and Gemma 2 2B IT, including cycle-supported globalizable and twisted regimes; shows semantic identity not recovered by raw global-vector similarity.
Conclusion: Semantic sections provide a better feature ontology for obstructed regimes, with perfect identity recovery on certified supports versus poor performance of raw similarity baselines.
Abstract: Recent interpretability work often treats a feature as a single global direction, dictionary atom, or latent coordinate shared across contexts. We argue that this ontology can fail in obstructed representation spaces, where locally coherent meanings need not assemble into one globally consistent feature. We introduce an atlas-native replacement object, the semantic section: a transport-compatible family of local feature representatives defined over a context atlas. We formalize semantic sections, prove that tree-supported propagation is always pathwise realizable, and show that cycle consistency is the key criterion for genuine globalization. This yields a distinction between tree-local, globalizable, and twisted sections, with twisted sections capturing locally coherent but holonomy-obstructed meanings. We then develop a discovery-and-certification pipeline based on seeded propagation, synchronization across overlaps, defect-based pruning, cycle-aware taxonomy, and deduplication. Across layer-16 atlases for Llama 3.2 3B Instruct, Qwen 2.5 3B Instruct, and Gemma 2 2B IT, we find nontrivial populations of semantic sections, including cycle-supported globalizable and twisted regimes after deduplication. Most importantly, semantic identity is not recovered by raw global-vector similarity. Even certified globalizable sections show low cross-chart signed cosine similarity, and raw similarity baselines recover only a small fraction of true within-section pairs, often collapsing at moderate thresholds. By contrast, section-based identity recovery is perfect on certified supports. These results support semantic sections as a better feature ontology in obstructed regimes.
[851] Incentive-Aware Federated Averaging with Performance Guarantees under Strategic Participation
Fateme Maleki, Krishnan Raghavan, Farzad Yousefian
Main category: cs.LG
TL;DR: Incentive-aware federated learning framework where clients strategically adjust their data contributions via Nash equilibrium-seeking updates to balance learning benefits against participation costs.
Details
Motivation: While federated learning improves global model performance, individual agents may behave strategically, balancing learning payoff against the cost of contributing their local data. There's a need for FL frameworks that successfully retain participating agents by addressing their strategic behavior.
Method: Proposes an incentive-aware federated averaging method where clients transmit both local model parameters and updated training dataset sizes to the server. Dataset sizes are dynamically adjusted via a Nash equilibrium-seeking update rule that captures strategic data participation.
Result: The method is analyzed under convex and nonconvex global objective settings with established performance guarantees. Numerical experiments on MNIST and CIFAR-10 datasets show agents achieve competitive global model performance while converging to stable data participation strategies.
Conclusion: The proposed incentive-aware FL framework successfully addresses strategic agent behavior, enabling competitive model performance while maintaining stable participation through Nash equilibrium-seeking mechanisms.
Abstract: Federated learning (FL) is a communication-efficient collaborative learning framework that enables model training across multiple agents with private local datasets. While the benefits of FL in improving global model performance are well established, individual agents may behave strategically, balancing the learning payoff against the cost of contributing their local data. Motivated by the need for FL frameworks that successfully retain participating agents, we propose an incentive-aware federated averaging method in which, at each communication round, clients transmit both their local model parameters and their updated training dataset sizes to the server. The dataset sizes are dynamically adjusted via a Nash equilibrium (NE)-seeking update rule that captures strategic data participation. We analyze the proposed method under convex and nonconvex global objective settings and establish performance guarantees for the resulting incentive-aware FL algorithm. Numerical experiments on the MNIST and CIFAR-10 datasets demonstrate that agents achieve competitive global model performance while converging to stable data participation strategies.
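To make the Nash equilibrium-seeking participation idea concrete, here is a toy single communication round in NumPy. The payoff gradient, step size, and clipping below are hypothetical stand-ins for the paper's actual game dynamics, not its method:

```python
import numpy as np

rng = np.random.default_rng(0)

def incentive_aware_fedavg_round(local_params, sizes, payoff_grad, step=0.1):
    """One illustrative round: each client adjusts its contributed dataset
    size via a gradient step on its payoff (a crude stand-in for NE
    seeking), then the server averages parameters weighted by the sizes."""
    # Hypothetical best-response step on (learning benefit - data cost).
    new_sizes = np.clip(sizes + step * payoff_grad(sizes), 1.0, None)
    weights = new_sizes / new_sizes.sum()
    global_params = sum(w * p for w, p in zip(weights, local_params))
    return global_params, new_sizes

# Toy payoff: diminishing learning benefit minus a linear contribution cost.
payoff_grad = lambda s: 1.0 / (1.0 + s) - 0.2

params = [rng.normal(size=4) for _ in range(3)]
sizes = np.array([10.0, 20.0, 30.0])
global_p, sizes = incentive_aware_fedavg_round(params, sizes, payoff_grad)
```

With this toy payoff, the marginal benefit is below the cost at these sizes, so all clients shrink their contributions slightly; a stable profile is where the two balance.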
[852] Beyond the Birkhoff Polytope: Spectral-Sphere-Constrained Hyper-Connections
Zhaoyi Liu, Haichuan Zhang, Ang Li
Main category: cs.LG
TL;DR: sHC replaces Birkhoff polytope constraints with spectral norm sphere constraints for hyper-connections, enabling negative entries for subtractive feature interactions while maintaining training stability.
Details
Motivation: Existing manifold-constrained hyper-connections (mHC) using Birkhoff polytope constraints suffer from identity degeneration, expressivity bottlenecks due to non-negativity, and parameterization inefficiencies.
Method: Proposes Spectral-Sphere-Constrained Hyper-Connections (sHC) that shift from Birkhoff polytope to spectral norm sphere constraints, allowing negative entries and eliminating unstable Sinkhorn projections or factorial parameterization overhead.
Result: sHC enables expressive, non-degenerate residual matrices with subtractive feature interactions while preserving training stability, overcoming limitations of previous mHC approaches.
Conclusion: Geometric shift from rigid polytope to spectral norm sphere constraints provides a more effective solution for hyper-connections, balancing expressivity and training stability.
Abstract: Hyper-Connections (HC) generalize residual connections into multiple streams, employing residual matrices for cross-stream feature mixing to enrich model expressivity. However, unconstrained mixing disrupts the identity mapping property intrinsic to the residual connection, causing unstable training. To address this, Manifold-Constrained Hyper-Connections (mHC) and its variant restrict these matrices to the Birkhoff polytope (doubly stochastic matrices) via Sinkhorn iterations or permutation-based parameterizations. We reveal three limitations of this polytope constraint: (1) identity degeneration, where learned matrices collapse around the identity and diminish cross-stream interactions, (2) an expressivity bottleneck, as the non-negativity constraint prevents subtractive feature disentanglement, and (3) parameterization inefficiencies, manifesting as unstable Sinkhorn iterations or the factorial-scaling overhead of permutation-based parameterizations. To overcome these flaws, we propose Spectral-Sphere-Constrained Hyper-Connections (sHC). By geometrically shifting the feasible set from a rigid polytope to a spectral norm sphere, sHC allows negative entries, unlocking subtractive interactions for selective feature diversification. This shift eliminates unstable Sinkhorn projections and factorial parameterization, enabling expressive, non-degenerate residual matrices while preserving training stability.
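One simple way to realize a spectral-norm-sphere constraint is to rescale a mixing matrix so its largest singular value equals a fixed radius; the sketch below illustrates the geometric idea, though the paper's exact parameterization may differ:

```python
import numpy as np

def project_to_spectral_sphere(M, radius=1.0):
    """Rescale M so its spectral norm (largest singular value) equals
    `radius`. Unlike a Birkhoff polytope constraint, this requires no
    Sinkhorn iterations and places no sign restriction on entries."""
    sigma_max = np.linalg.norm(M, ord=2)  # largest singular value
    return M * (radius / sigma_max)

rng = np.random.default_rng(0)
# Negative entries survive the projection, permitting the subtractive
# cross-stream interactions that doubly stochastic matrices rule out.
H = project_to_spectral_sphere(rng.normal(size=(4, 4)))
```

Bounding the spectral norm controls how much the mixing step can amplify the residual stream, which is the stability property the constraint is after.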
[853] Natural Gradient Descent for Online Continual Learning
Joe Khawand, David Colliaux
Main category: cs.LG
TL;DR: Online Continual Learning for image classification using Natural Gradient Descent with KFAC approximation of Fisher Information Matrix to prevent catastrophic forgetting and improve convergence speed.
Details
Motivation: Address the challenge of catastrophic forgetting in Online Continual Learning (OCL) for image classification, where models need to learn from streaming data without i.i.d. assumptions, while also improving rapid convergence which remains difficult in online settings.
Method: Proposes using Natural Gradient Descent optimizer with Kronecker Factored Approximate Curvature (KFAC) to approximate the Fisher Information Matrix (FIM), combined with existing OCL techniques to enhance learning efficiency and prevent forgetting.
Result: Demonstrates substantial performance improvements across all OCL methods when combined with existing OCL tricks on benchmark datasets including Split CIFAR-100, CORE50, and Split miniImageNet.
Conclusion: The Natural Gradient Descent approach with KFAC approximation effectively addresses catastrophic forgetting and convergence challenges in Online Continual Learning for image classification, showing significant improvements over existing methods.
Abstract: Online Continual Learning (OCL) for image classification represents a challenging subset of Continual Learning, focusing on classifying images from a stream without assuming data independence and identical distribution (i.i.d). The primary challenge in this context is to prevent catastrophic forgetting, where the model’s performance on previous tasks deteriorates as it learns new ones. Although various strategies have been proposed to address this issue, achieving rapid convergence remains a significant challenge in the online setting. In this work, we introduce a novel approach to training OCL models that utilizes the Natural Gradient Descent optimizer, incorporating an approximation of the Fisher Information Matrix (FIM) through Kronecker Factored Approximate Curvature (KFAC). This method demonstrates substantial improvements in performance across all OCL methods, particularly when combined with existing OCL tricks, on datasets such as Split CIFAR-100, CORE50, and Split miniImageNet.
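The KFAC idea for a single linear layer can be sketched in a few lines: the layer's Fisher block is approximated as a Kronecker product of an input second-moment factor and a gradient second-moment factor, so the natural gradient needs only two small solves instead of inverting the full Fisher. This is an illustrative sketch, not the paper's implementation:

```python
import numpy as np

def kfac_natural_gradient(dW, A_cov, G_cov, damping=1e-3):
    """KFAC-style natural gradient for one linear layer: with the Fisher
    approximated as G_cov (Kronecker) A_cov, the natural gradient is
    G_cov^{-1} @ dW @ A_cov^{-1}, computed via two small damped solves."""
    A_damped = A_cov + damping * np.eye(A_cov.shape[0])
    G_damped = G_cov + damping * np.eye(G_cov.shape[0])
    return np.linalg.solve(G_damped, np.linalg.solve(A_damped.T, dW.T).T)

rng = np.random.default_rng(0)
a = rng.normal(size=(64, 5))   # layer inputs over a batch
g = rng.normal(size=(64, 3))   # backpropagated pre-activation gradients
dW = g.T @ a / 64              # mean gradient of the weight matrix
A_cov = a.T @ a / 64           # input second-moment factor
G_cov = g.T @ g / 64           # gradient second-moment factor
nat_grad = kfac_natural_gradient(dW, A_cov, G_cov)
```

Keeping the two factors small (input dim and output dim, respectively) is what makes this curvature-aware step cheap enough for the online setting.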
[854] Bayesian Scattering: A Principled Baseline for Uncertainty on Image Data
Bernardo Fichera, Zarko Ivkovic, Kjell Jorner, Philipp Hennig, Viacheslav Borovitskiy
Main category: cs.LG
TL;DR: Bayesian scattering combines wavelet scattering transform with probabilistic modeling as an interpretable baseline for image uncertainty quantification, providing sensible uncertainty estimates under distribution shifts.
Details
Motivation: The field of uncertainty quantification for image data lacks an interpretable, mathematically grounded baseline similar to Bayesian linear regression for tabular data. Current methods are dominated by complex deep learning approaches that may overfit training distributions.
Method: Proposes Bayesian scattering which couples the wavelet scattering transform (a deep, non-learned feature extractor based on geometric principles) with a simple probabilistic head. The scattering features avoid overfitting by being derived from geometric principles rather than learned.
Result: Validated on diverse tasks including medical imaging under institution shift, wealth mapping under country-to-country shift, and Bayesian optimization of molecular properties. The method provides sensible uncertainty estimates even under significant distribution shifts.
Conclusion: Bayesian scattering serves as a solid, interpretable baseline for complex uncertainty quantification methods in image data, analogous to Bayesian linear regression for tabular data.
Abstract: Uncertainty quantification for image data is dominated by complex deep learning methods, yet the field lacks an interpretable, mathematically grounded baseline. We propose Bayesian scattering to fill this gap, serving as a first-step baseline akin to the role of Bayesian linear regression for tabular data. Our method couples the wavelet scattering transform (a deep, non-learned feature extractor) with a simple probabilistic head. Because scattering features are derived from geometric principles rather than learned, they avoid overfitting the training distribution. This helps provide sensible uncertainty estimates even under significant distribution shifts. We validate this on diverse tasks, including medical imaging under institution shift, wealth mapping under country-to-country shift, and Bayesian optimization of molecular properties. Our results suggest that Bayesian scattering is a solid baseline for complex uncertainty quantification methods.
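The probabilistic head on fixed features is essentially textbook Bayesian linear regression, which is what makes the baseline interpretable. The sketch below uses random vectors as a stand-in for scattering features (computing an actual scattering transform would need a library such as Kymatio); the prior/noise precisions are toy values:

```python
import numpy as np

def bayes_linear_head(Phi, y, alpha=1.0, beta=10.0):
    """Bayesian linear regression on fixed features Phi: returns the
    posterior mean and covariance of the weights. alpha is the prior
    precision, beta the noise precision (both toy values here)."""
    D = Phi.shape[1]
    S = np.linalg.inv(alpha * np.eye(D) + beta * Phi.T @ Phi)
    m = beta * S @ Phi.T @ y
    return m, S

def predict(phi, m, S, beta=10.0):
    """Predictive mean and variance for one feature vector."""
    return phi @ m, 1.0 / beta + phi @ S @ phi

rng = np.random.default_rng(0)
Phi = rng.normal(size=(50, 8))   # stand-in for scattering features
y = Phi @ rng.normal(size=8) + 0.1 * rng.normal(size=50)
m, S = bayes_linear_head(Phi, y)
# A far-from-training input (scaled up) gets a larger predictive variance,
# the behavior one wants under distribution shift.
_, var_in = predict(Phi[0], m, S)
_, var_out = predict(10.0 * Phi[0], m, S)
```

The closed-form posterior is why the uncertainty estimates are easy to audit, in contrast to deep ensembles or variational approximations.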
[855] LLM-ODE: Data-driven Discovery of Dynamical Systems with Large Language Models
Amirmohammad Ziaei Bideh, Jonathan Gryak
Main category: cs.LG
TL;DR: LLM-ODE: A framework that combines large language models with genetic programming to discover governing equations of dynamical systems more efficiently than traditional GP methods.
Details
Motivation: Traditional genetic programming approaches for automated equation discovery suffer from inefficient exploration of the symbolic search space, leading to slow convergence and suboptimal solutions. There's a need to improve search efficiency while maintaining interpretability.
Method: LLM-ODE uses large language models to guide symbolic evolution by extracting patterns from elite candidate equations. It combines LLMs’ generative prior with evolutionary algorithms’ exploratory strengths to produce more informed search trajectories.
Result: Tested on 91 dynamical systems, LLM-ODE variants consistently outperform classical GP methods in search efficiency and Pareto-front quality. The framework improves both efficiency and accuracy over traditional GP-based discovery and offers greater scalability to higher-dimensional systems compared to linear and Transformer-only methods.
Conclusion: LLM-ODE successfully integrates large language models with evolutionary algorithms to enhance automated equation discovery, demonstrating improved performance and scalability for discovering governing equations of dynamical systems.
Abstract: Discovering the governing equations of dynamical systems is a central problem across many scientific disciplines. As experimental data become increasingly available, automated equation discovery methods offer a promising data-driven approach to accelerate scientific discovery. Among these methods, genetic programming (GP) has been widely adopted due to its flexibility and interpretability. However, GP-based approaches often suffer from inefficient exploration of the symbolic search space, leading to slow convergence and suboptimal solutions. To address these limitations, we propose LLM-ODE, a large language model-aided model discovery framework that guides symbolic evolution using patterns extracted from elite candidate equations. By leveraging the generative prior of large language models, LLM-ODE produces more informed search trajectories while preserving the exploratory strengths of evolutionary algorithms. Empirical results on 91 dynamical systems show that LLM-ODE variants consistently outperform classical GP methods in terms of search efficiency and Pareto-front quality. Overall, our results demonstrate that LLM-ODE improves both efficiency and accuracy over traditional GP-based discovery and offers greater scalability to higher-dimensional systems compared to linear and Transformer-only model discovery methods.
[856] Enhancing LIME using Neural Decision Trees
Mohamed Aymen Bouyahia, Argyris Kalogeratos
Main category: cs.LG
TL;DR: NDT-LIME: A LIME variant using Neural Decision Trees as surrogate models for better local explanations of black-box models on tabular data
Details
Motivation: Traditional LIME surrogate models (linear regression, decision trees) struggle to capture complex non-linear decision boundaries of sophisticated black-box models, creating a gap between high predictive performance and interpretable decision-making.
Method: Proposes NDT-LIME that integrates Neural Decision Trees as surrogate models, leveraging their structured hierarchical nature to provide more accurate local explanations.
Result: Evaluation on several benchmark tabular datasets shows consistent improvements in explanation fidelity over traditional LIME surrogates.
Conclusion: NDT-LIME bridges the gap between predictive performance and interpretability by providing more faithful local explanations for complex black-box models on tabular data.
Abstract: Interpreting complex machine learning models is a critical challenge, especially for tabular data where model transparency is paramount. Local Interpretable Model-Agnostic Explanations (LIME) has been a very popular framework for interpretable machine learning, also inspiring many extensions. While traditional surrogate models used in LIME variants (e.g. linear regression and decision trees) offer a degree of stability, they can struggle to faithfully capture the complex non-linear decision boundaries that are inherent in many sophisticated black-box models. This work contributes toward bridging the gap between high predictive performance and interpretable decision-making. Specifically, we propose the NDT-LIME variant that integrates Neural Decision Trees (NDTs) as surrogate models. By leveraging the structured, hierarchical nature of NDTs, our approach aims at providing more accurate and meaningful local explanations. We evaluate its effectiveness on several benchmark tabular datasets, showing consistent improvements in explanation fidelity over traditional LIME surrogates.
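The generic LIME loop the paper builds on is short: perturb around an instance, query the black box, weight samples by proximity, and fit a local surrogate. Below, a weighted least-squares surrogate stands in at the fitting step for brevity; NDT-LIME would plug a neural decision tree in there instead. All function names and constants are illustrative:

```python
import numpy as np

def lime_explain(black_box, x, n_samples=500, kernel_width=1.0, seed=0):
    """LIME-style local explanation for tabular data: sample perturbations
    of x, weight them by an exponential proximity kernel, and fit a
    weighted linear surrogate whose coefficients are the attributions."""
    rng = np.random.default_rng(seed)
    Z = x + rng.normal(scale=0.5, size=(n_samples, x.size))
    y = black_box(Z)
    d2 = ((Z - x) ** 2).sum(axis=1)
    w = np.exp(-d2 / kernel_width**2)             # proximity weights
    Zb = np.hstack([Z, np.ones((n_samples, 1))])  # add intercept column
    W = np.sqrt(w)[:, None]
    coef, *_ = np.linalg.lstsq(W * Zb, W[:, 0] * y, rcond=None)
    return coef[:-1]  # per-feature local attributions

# Toy black box whose local behavior depends mostly on feature 0.
black_box = lambda Z: 3.0 * Z[:, 0] + 0.1 * np.sin(Z[:, 1])
attr = lime_explain(black_box, np.array([1.0, 2.0]))
```

The NDT surrogate targets exactly the case where this linear fit is too coarse, i.e. when the local boundary is itself non-linear.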
[857] Discriminative Representation Learning for Clinical Prediction
Yang Zhang, Li Fan, Samuel Lawrence, Shi Li
Main category: cs.LG
TL;DR: A supervised deep learning framework for clinical prediction that uses direct outcome alignment instead of self-supervised pretraining, achieving better performance by maximizing inter-class separation relative to within-class variance.
Details
Motivation: To challenge the assumption that large-scale self-supervised pretraining is necessary for strong clinical performance, arguing that direct outcome alignment provides better inductive bias when high-quality supervision is available in outcome-centric healthcare settings.
Method: Proposes a supervised framework that explicitly shapes representation geometry by maximizing inter-class separation relative to within-class variance, concentrating model capacity along clinically meaningful axes through single-stage optimization.
Result: Consistently outperforms masked, autoregressive, and contrastive pretraining baselines across multiple longitudinal EHR tasks (mortality and readmission prediction), improving discrimination, calibration, and sample efficiency.
Conclusion: In low-entropy, outcome-driven healthcare domains, supervision can act as the statistically optimal driver of representation learning, challenging the need for large-scale self-supervised pretraining as a prerequisite for clinical performance.
Abstract: Foundation models in healthcare have largely adopted self-supervised pretraining objectives inherited from natural language processing and computer vision, emphasizing reconstruction and large-scale representation learning prior to downstream adaptation. We revisit this paradigm in outcome-centric clinical prediction settings and argue that, when high-quality supervision is available, direct outcome alignment may provide a stronger inductive bias than generative pretraining. We propose a supervised deep learning framework that explicitly shapes representation geometry by maximizing inter-class separation relative to within-class variance, thereby concentrating model capacity along clinically meaningful axes. Across multiple longitudinal electronic health record tasks, including mortality and readmission prediction, our approach consistently outperforms masked, autoregressive, and contrastive pretraining baselines under matched model capacity. The proposed method improves discrimination, calibration, and sample efficiency, while simplifying the training pipeline to a single-stage optimization. These findings suggest that in low-entropy, outcome-driven healthcare domains, supervision can act as the statistically optimal driver of representation learning, challenging the assumption that large-scale self-supervised pretraining is a prerequisite for strong clinical performance.
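The quantity being maximized, inter-class separation relative to within-class variance, is a Fisher-criterion-style ratio over batch embeddings. A minimal sketch (not the paper's exact loss):

```python
import numpy as np

def discriminative_ratio(Z, labels):
    """Between-class scatter divided by within-class scatter for a batch
    of embeddings Z. Maximizing this pushes class means apart while
    tightening each class cluster."""
    mu = Z.mean(axis=0)
    between, within = 0.0, 0.0
    for c in np.unique(labels):
        Zc = Z[labels == c]
        mc = Zc.mean(axis=0)
        between += len(Zc) * ((mc - mu) ** 2).sum()
        within += ((Zc - mc) ** 2).sum()
    return between / within

rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 50)
mixed = rng.normal(size=(100, 16))            # no class structure
separated = mixed + 5.0 * labels[:, None]     # classes pushed apart
```

A training loss would minimize the negative (or the inverse) of this ratio, computed on the encoder's outputs each batch.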
[858] Causally-Guided Diffusion for Stable Feature Selection
Arun Vignesh Malarkkan, Xinyuan Wang, Kunpeng Liu, Denghui Zhang, Yanjie Fu
Main category: cs.LG
TL;DR: CGDFS is a causally-guided diffusion framework for stable feature selection that treats selection as posterior inference over feature subsets, combining diffusion priors with stability-aware likelihood to select features robust to distribution shifts.
Details
Motivation: Traditional feature selection methods optimize for predictive performance under a single data distribution, often selecting spurious features that fail under distribution shifts. The paper aims to develop feature selection that is robust to distribution shifts by incorporating causal invariance principles.
Method: Formalizes feature selection as approximate posterior inference over feature subsets. Uses three key components: 1) stability-aware posterior sampling with causal invariance as soft inductive bias, 2) diffusion model as learned prior over continuous selection masks capturing feature dependencies, 3) guided annealed Langevin sampling combining diffusion prior with stability objective for uncertainty-aware inference.
Result: CGDFS consistently selects more stable and transferable feature subsets across classification and regression tasks on real-world datasets with distribution shifts. Leads to improved out-of-distribution performance and greater selection robustness compared to sparsity-based, tree-based, and stability-selection baselines.
Conclusion: The framework successfully combines causal invariance principles with diffusion-based probabilistic modeling to achieve robust feature selection under distribution shifts, avoiding discrete optimization while providing uncertainty-aware inference.
Abstract: Feature selection is fundamental to robust data-centric AI, but most existing methods optimize predictive performance under a single data distribution. This often selects spurious features that fail under distribution shifts. Motivated by principles from causal invariance, we study feature selection from a stability perspective and introduce Causally-Guided Diffusion for Stable Feature Selection (CGDFS). In CGDFS, we formalize feature selection as approximate posterior inference over feature subsets, whose posterior mass favors low prediction error and low cross-environment variance. Our framework combines three key insights: First, we formulate feature selection as stability-aware posterior sampling. Here, causal invariance serves as a soft inductive bias rather than explicit causal discovery. Second, we train a diffusion model as a learned prior over plausible continuous selection masks, combined with a stability-aware likelihood that rewards invariance across environments. This diffusion prior captures structural dependencies among features and enables scalable exploration of the combinatorially large selection space. Third, we perform guided annealed Langevin sampling that combines the diffusion prior with the stability objective, which yields a tractable, uncertainty-aware posterior inference that avoids discrete optimization and produces robust feature selections. We evaluate CGDFS on open-source real-world datasets exhibiting distribution shifts. Across both classification and regression tasks, CGDFS consistently selects more stable and transferable feature subsets, which leads to improved out-of-distribution performance and greater selection robustness compared to sparsity-based, tree-based, and stability-selection baselines.
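The guided Langevin step combines a prior score with the gradient of the stability objective plus injected noise. The sketch below uses toy closed-form scores (a sparsity-pulling prior and a stability gradient favoring two hypothetically invariant features) in place of the learned diffusion prior, and omits the annealing schedule:

```python
import numpy as np

def guided_langevin(score_prior, grad_stability, m0,
                    steps=200, step=1e-2, guidance=1.0, seed=0):
    """Guided Langevin sampling over a continuous selection mask m: each
    update adds the prior score, a guidance-scaled stability gradient,
    and Gaussian noise scaled by sqrt(2 * step)."""
    rng = np.random.default_rng(seed)
    m = np.asarray(m0, dtype=float).copy()
    for _ in range(steps):
        drift = score_prior(m) + guidance * grad_stability(m)
        m = m + step * drift + np.sqrt(2.0 * step) * rng.normal(size=m.shape)
    return m

# Toy stand-ins: the prior pulls masks toward 0 (sparsity); the stability
# gradient rewards turning on features 0 and 1, imagined to be invariant
# across environments.
target = np.array([1.0, 1.0, 0.0, 0.0])
score_prior = lambda m: -m
grad_stability = lambda m: 5.0 * (target - m)
mask = guided_langevin(score_prior, grad_stability, np.zeros(4))
```

Because the chain samples masks rather than returning a single point estimate, repeated runs give an uncertainty-aware picture of which features are reliably selected.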
[859] Beyond Expression Similarity: Contrastive Learning Recovers Functional Gene Associations from Protein Interaction Structure
Jason Dury
Main category: cs.LG
TL;DR: CAL framework transfers from text to biology, using protein-protein interactions to predict gene relationships better than expression similarity, with cross-domain insights about transferability and dataset quality.
Details
Motivation: Test whether the Predictive Associative Memory (PAM) principle that co-occurrence-based associations are more useful than embedding similarity transfers from text to molecular biology, specifically for protein-protein interactions.
Method: Apply Contrastive Association Learning (CAL) trained on STRING protein interactions to gene perturbation data (Replogle K562 CRISPRi and DepMap datasets). Use cross-boundary evaluation with node-disjoint splits to test inductive transfer.
Result: CAL achieves cross-boundary AUC of 0.908 vs 0.518 for expression similarity on K562 data, and 0.947 on DepMap. Key findings: (1) biological associations transfer better than text co-occurrences, (2) CAL scores anti-correlate with interaction degree, (3) quality beats quantity in training data.
Conclusion: CAL successfully transfers to biology, revealing physically grounded associations are more transferable than contingent text co-occurrences, with practical implications for studying understudied genes and dataset curation.
Abstract: The Predictive Associative Memory (PAM) framework posits that useful relationships often connect items that co-occur in shared contexts rather than items that appear similar in embedding space. A contrastive MLP trained on co-occurrence annotations, Contrastive Association Learning (CAL), has improved multi-hop passage retrieval and discovered narrative function at corpus scale in text. We test whether this principle transfers to molecular biology, where protein-protein interactions provide functional associations distinct from gene expression similarity. Four experiments across two biological domains map the operating envelope. On gene perturbation data (Replogle K562 CRISPRi, 2,285 genes), CAL trained on STRING protein interactions achieves cross-boundary AUC of 0.908 where expression similarity scores 0.518. A second gene dataset (DepMap, 17,725 genes) confirms the result after negative sampling correction, reaching cross-boundary AUC of 0.947. Two drug sensitivity experiments produce informative negatives that sharpen boundary conditions. Three cross-domain findings emerge: (1) inductive transfer succeeds in biology, where a node-disjoint split with unseen genes yields AUC 0.826 (Δ +0.127), but fails in text (±0.10), suggesting physically grounded associations are more transferable than contingent co-occurrences; (2) CAL scores anti-correlate with interaction degree (Spearman r = -0.590), with gains concentrating on understudied genes with focused interaction profiles; (3) tighter association quality outperforms larger but noisier training sets, reversing the text pattern. Results are stable across training seeds (SD < 0.001) and cross-boundary threshold choices.
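The core contrastive idea (pull embeddings of associated genes together, push sampled negatives apart) can be sketched with a margin loss. This is a generic sketch of the objective, not CAL's exact architecture; the paper trains an MLP scorer on STRING annotations:

```python
import numpy as np

def contrastive_association_loss(E, pos_pairs, neg_pairs, margin=1.0):
    """Margin-based contrastive objective over association pairs: squared
    distance for co-occurring (interacting) pairs, hinged squared margin
    violation for sampled negative pairs."""
    def dist(pairs):
        i, j = pairs[:, 0], pairs[:, 1]
        return np.linalg.norm(E[i] - E[j], axis=1)
    pos = dist(pos_pairs) ** 2
    neg = np.maximum(0.0, margin - dist(neg_pairs)) ** 2
    return pos.mean() + neg.mean()

rng = np.random.default_rng(0)
E = rng.normal(scale=0.1, size=(20, 8))      # toy gene embeddings
pos = rng.integers(0, 20, size=(30, 2))      # "interacting" pairs
neg = rng.integers(0, 20, size=(30, 2))      # sampled negatives
loss = contrastive_association_loss(E, pos, neg)
```

Training against interaction annotations rather than expression similarity is what lets the learned scores capture functional associations that raw embedding similarity misses.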
[860] Understanding Contextual Recall in Transformers: How Finetuning Enables In-Context Reasoning over Pretraining Knowledge
Bhavya Vasudeva, Puneesh Deora, Alberto Bietti, Vatsal Sharan, Christos Thrampoulidis
Main category: cs.LG
TL;DR: Transformer models can learn factual knowledge from pretraining but require finetuning on implicit inference tasks to develop contextual recall capabilities, which involves forming low-dimensional latent encodings of attribute types.
Details
Motivation: The paper investigates whether transformer-based language models can develop contextual recall capabilities (a specific form of in-context learning) from pretraining alone, or what additional finetuning is needed, and what mechanisms drive the necessary representations for this capability.
Method: The authors introduce a controlled synthetic framework where pretraining sequences consist of subject-grammar-attribute tuples with attribute types tied to grammar statistics. They test whether pretraining yields contextual recall, then show that finetuning on implicit inference tasks (distinct from ICL evaluation) triggers emergence of contextual recall. They also derive a construction for an attention-only transformer that replicates the transition.
Result: Pretraining successfully yields factual knowledge but is insufficient for contextual recall - models fail to implicitly infer attribute types when grammar statistics are removed in ICL prompts. However, finetuning on implicit inference tasks using a subset of subjects triggers contextual recall across all subjects, accompanied by formation of low-dimensional latent encodings of shared attribute types.
Conclusion: Contextual recall does not emerge from pretraining alone but requires finetuning on tasks that demand implicit inference, which leads to the formation of specialized latent representations that enable the model to perform contextual recall across diverse subjects.
Abstract: Transformer-based language models excel at in-context learning (ICL), where they can adapt to new tasks based on contextual examples, without parameter updates. In a specific form of ICL, which we refer to as "contextual recall", models pretrained on open-ended text leverage pairwise examples to recall specific facts in novel prompt formats. We investigate whether contextual recall emerges from pretraining alone, what finetuning is required, and what mechanisms drive the necessary representations. For this, we introduce a controlled synthetic framework where pretraining sequences consist of subject-grammar-attribute tuples, with attribute types tied to grammar statistics. We demonstrate that while such pretraining successfully yields factual knowledge, it is insufficient for contextual recall: models fail to implicitly infer attribute types when the grammar statistics are removed in ICL prompts. However, we show that finetuning on tasks requiring implicit inference, distinct from the ICL evaluation, using a subset of subjects, triggers the emergence of contextual recall across all subjects. This transition is accompanied by the formation of low-dimensional latent encodings of the shared attribute type. For mechanistic insight, we derive a construction for an attention-only transformer that replicates the transition from factual to contextual recall, corroborated by empirical validation.
[861] Detection of adversarial intent in Human-AI teams using LLMs
Abed K. Musaffar, Ambuj Singh, Francesco Bullo
Main category: cs.LG
TL;DR: LLMs can serve as defensive supervisors in human-AI teams to detect malicious behavior in real-time without task-specific information, enhancing team robustness against attacks.
Details
Motivation: LLMs deployed in human-AI teams are vulnerable to various attacks (data poisoning, prompt injection) that can manipulate them to provide harmful information. While prior work focused on LLMs as attack targets, this paper explores their potential as defensive supervisors to detect malicious behavior in mixed human-AI teams.
Method: Formulated the problem of malicious behavior detection from interaction traces using a dataset of multi-party conversations and decisions from real human-AI teams over 25 rounds. Evaluated LLMs’ capability to identify malicious behavior in real-time without task-specific information.
Result: LLMs demonstrated capability to identify malicious behavior in real-time without task-specific information, indicating potential for task-agnostic defense. The malicious behavior was not easily identified using simple heuristics, suggesting LLM defenders could enhance team robustness against certain attack classes.
Conclusion: LLMs show promise as defensive supervisors in human-AI teams, capable of detecting malicious behavior in real-time without task-specific knowledge, potentially making human teams more robust to certain classes of attacks.
Abstract: Large language models (LLMs) are increasingly deployed in human-AI teams as support agents for complex tasks such as information retrieval, programming, and decision-making assistance. While these agents’ autonomy and contextual knowledge enable them to be useful, these same capabilities expose them to a broad range of attacks, including data poisoning, prompt injection, and even prompt engineering. Through these attack vectors, malicious actors can manipulate an LLM agent into providing harmful information, potentially leading human agents to make harmful decisions. While prior work has focused on LLMs as attack targets or adversarial actors, this paper studies their potential role as defensive supervisors within mixed human-AI teams. Using a dataset consisting of multi-party conversations and decisions for a real human-AI team over a 25-round horizon, we formulate the problem of malicious behavior detection from interaction traces. We find that LLMs are capable of identifying malicious behavior in real-time, and without task-specific information, indicating the potential for task-agnostic defense. Moreover, we find that the malicious behavior of interest is not easily identified using simple heuristics, further suggesting the introduction of LLM defenders could render human teams more robust to certain classes of attack.
[862] From Causal Discovery to Dynamic Causal Inference in Neural Time Series
Valentina Kuskova, Dmitry Zaytsev, Michael Coppedge
Main category: cs.LG
TL;DR: DCNAR is a two-stage neural framework that first learns causal structure from time series data, then uses it as a prior for time-varying causal inference, enabling dynamic causal reasoning without requiring pre-specified network structure.
Details
Motivation: Existing time-varying causal models assume known causal networks, which is unrealistic in real-world scientific domains where causal structure is uncertain, evolving, or indirectly observable, limiting applicability of dynamic causal inference.
Method: Two-stage approach: 1) Neural autoregressive causal discovery learns sparse directed causal network from multivariate time series; 2) Learned structure serves as structural prior for time-varying neural network autoregression to estimate dynamic causal influences.
Result: Experiments on multi-country panel time-series data show learned causal networks yield more stable and behaviorally meaningful dynamic causal inferences than coefficient-based or structure-free alternatives, even with comparable forecasting performance.
Conclusion: DCNAR provides a general framework for using AI as a scientific instrument for dynamic causal reasoning under structural uncertainty, with behavioral diagnostics assessing causal necessity, temporal stability, and sensitivity to structural change.
Abstract: Time-varying causal models provide a powerful framework for studying dynamic scientific systems, yet most existing approaches assume that the underlying causal network is known a priori - an assumption rarely satisfied in real-world domains where causal structure is uncertain, evolving, or only indirectly observable. This limits the applicability of dynamic causal inference in many scientific settings. We propose Dynamic Causal Network Autoregression (DCNAR), a two-stage neural causal modeling framework that integrates data-driven causal discovery with time-varying causal inference. In the first stage, a neural autoregressive causal discovery model learns a sparse directed causal network from multivariate time series. In the second stage, this learned structure is used as a structural prior for a time-varying neural network autoregression, enabling dynamic estimation of causal influence without requiring pre-specified network structure. We evaluate the scientific validity of DCNAR using behavioral diagnostics that assess causal necessity, temporal stability, and sensitivity to structural change, rather than predictive accuracy alone. Experiments on multi-country panel time-series data demonstrate that learned causal networks yield more stable and behaviorally meaningful dynamic causal inferences than coefficient-based or structure-free alternatives, even when forecasting performance is comparable. These results position DCNAR as a general framework for using AI as a scientific instrument for dynamic causal reasoning under structural uncertainty.
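The two-stage recipe can be sketched in plain Python. This is a hedged stand-in, not the paper's implementation: stage one's neural discovery is replaced here by lagged-correlation thresholding, and stage two's time-varying neural autoregression by per-window univariate slopes restricted to the discovered parents; `discover_parents`, `rolling_influence`, and the 0.3 threshold are illustrative names and values.

```python
import random, statistics

def lagged_corr(x, y):
    """Pearson correlation between x[t-1] and y[t]."""
    xs, ys = x[:-1], y[1:]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    num = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    den = (sum((a - mx) ** 2 for a in xs) * sum((b - my) ** 2 for b in ys)) ** 0.5
    return num / den if den else 0.0

def discover_parents(series, thresh=0.3):
    """Stage 1 (stand-in): keep edge i -> j if |corr(x_i[t-1], x_j[t])| > thresh."""
    n = len(series)
    return {j: [i for i in range(n)
                if i != j and abs(lagged_corr(series[i], series[j])) > thresh]
            for j in range(n)}

def rolling_influence(series, parents, j, window=50):
    """Stage 2 (stand-in): per-window univariate slope of each parent of x_j,
    i.e. a time-varying estimate of causal influence under the learned structure."""
    out = []
    x = series[j]
    for start in range(0, len(x) - window, window):
        slopes = {}
        for i in parents[j]:
            xs = series[i][start:start + window - 1]
            ys = x[start + 1:start + window]
            mx, my = statistics.fmean(xs), statistics.fmean(ys)
            var = sum((a - mx) ** 2 for a in xs)
            slopes[i] = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / var if var else 0.0
        out.append(slopes)
    return out

# Toy system: x1 drives x0 with coefficient 0.8; x2 is independent noise.
random.seed(0)
T = 400
x1 = [random.gauss(0, 1) for _ in range(T)]
x2 = [random.gauss(0, 1) for _ in range(T)]
x0 = [0.0]
for t in range(1, T):
    x0.append(0.8 * x1[t - 1] + random.gauss(0, 0.3))
series = [x0, x1, x2]

parents = discover_parents(series)
print(parents[0])                                 # x1 is discovered as the parent of x0
print(rolling_influence(series, parents, 0)[0])   # per-window slope near 0.8
```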
[863] Joint Surrogate Learning of Objectives, Constraints, and Sensitivities for Efficient Multi-objective Optimization of Neural Dynamical Systems
Frithjof Gressmann, Ivan Georgiev Raikov, Seung Hyun Kim, Mattia Gazzola, Lawrence Rauchwerger, Ivan Soltesz
Main category: cs.LG
TL;DR: DMOSOPT is a scalable optimization framework for constrained multi-objective problems that uses a unified surrogate model to learn objective landscapes and feasibility boundaries, enabling efficient optimization in high-dimensional parameter spaces without gradient signals.
Details
Motivation: Biophysical neural simulations require optimization in high-dimensional parameter spaces with binary feasible/infeasible constraints that provide no gradient signals, making traditional optimization approaches ineffective for these computationally demanding applications.
Method: DMOSOPT uses a unified, jointly learned surrogate model that captures the interplay between objectives, constraints, and parameter sensitivities. It learns smooth approximations of both objective landscapes and feasibility boundaries, providing unified gradients that steer search toward better objectives and constraint satisfaction while estimating per-parameter sensitivities for targeted exploration.
Result: The framework was validated across neural modeling workflows from single-cell dynamics to population-level network activity, demonstrating efficient optimization of highly constrained problems at supercomputing scale with substantially fewer problem evaluations.
Conclusion: DMOSOPT provides an effective optimization framework for constrained multi-objective problems in computational neuroscience and other scientific domains, though it’s primarily an optimization methodology rather than a multimodal AI model.
Abstract: Biophysical neural system simulations are among the most computationally demanding scientific applications, and their optimization requires navigating high-dimensional parameter spaces under numerous constraints that impose a binary feasible/infeasible partition with no gradient signal to guide the search. Here, we introduce DMOSOPT, a scalable optimization framework that leverages a unified, jointly learned surrogate model to capture the interplay between objectives, constraints, and parameter sensitivities. By learning a smooth approximation of both the objective landscape and the feasibility boundary, the joint surrogate provides a unified gradient that simultaneously steers the search toward improved objective values and greater constraint satisfaction, while its partial derivatives yield per-parameter sensitivity estimates that enable more targeted exploration. We validate the framework from single-cell dynamics to population-level network activity, spanning incremental stages of a neural circuit modeling workflow, and demonstrate efficient, effective optimization of highly constrained problems at supercomputing scale with substantially fewer problem evaluations. While motivated by and demonstrated in the context of computational neuroscience, the framework is general and applicable to constrained multi-objective optimization problems across scientific and engineering domains.
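As a rough illustration of the joint-surrogate idea (not DMOSOPT's actual learned model), the sketch below fits a single smooth surrogate to both an objective and a constraint, then extracts per-parameter sensitivities by finite differences; `shepard_surrogate` (inverse-distance weighting) and the toy problem are assumptions of this sketch.

```python
def shepard_surrogate(samples, power=2):
    """Smooth inverse-distance-weighted surrogate jointly fit to an objective f
    and a constraint g (a simple stand-in for the paper's joint surrogate)."""
    def predict(x):
        num_f = num_g = den = 0.0
        for xs, f, g in samples:
            d2 = sum((a - b) ** 2 for a, b in zip(x, xs))
            if d2 == 0:
                return f, g          # interpolates exactly at sampled points
            w = 1.0 / d2 ** (power / 2)
            num_f += w * f
            num_g += w * g
            den += w
        return num_f / den, num_g / den
    return predict

def sensitivities(predict, x, eps=1e-4):
    """Finite-difference per-parameter sensitivities of objective and constraint,
    the kind of signal used to target exploration."""
    f0, g0 = predict(x)
    gf, gg = [], []
    for i in range(len(x)):
        xp = list(x)
        xp[i] += eps
        f1, g1 = predict(xp)
        gf.append((f1 - f0) / eps)
        gg.append((g1 - g0) / eps)
    return gf, gg

# Toy problem: objective f = x0^2 + x1^2; constraint g = x0 - 0.5 (feasible when g <= 0).
samples = [((a / 2, b / 2), (a / 2) ** 2 + (b / 2) ** 2, a / 2 - 0.5)
           for a in range(-2, 3) for b in range(-2, 3)]
predict = shepard_surrogate(samples)
gf, gg = sensitivities(predict, (0.3, 0.2))
print(predict((0.3, 0.2)))  # surrogate estimate of (objective, constraint)
print(gf, gg)               # per-parameter sensitivities at the query point
```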
[864] Interpreting the Synchronization Gap: The Hidden Mechanism Inside Diffusion Transformers
Emil Albrychiewicz, Andrés Franco Valiente, Li-Ching Chen, Viola Zixin Zhao
Main category: cs.LG
TL;DR: DiT synchronization gap analysis reveals how diffusion transformers resolve generative ambiguity through depth-localized commitment patterns in final layers.
Details
Motivation: Theoretical diffusion models predict synchronization gaps between modes committing at different stages, but it's unclear how this manifests in practical deep discrete architectures like Diffusion Transformers (DiTs).
Method: Construct explicit architectural realization of replica coupling by embedding two generative trajectories into joint token sequence with symmetric cross attention gate. Perform linearized analysis of attention difference and empirically validate on pretrained DiT-XL/2 model tracking commitment and per-layer internal mode energies.
Result: Synchronization gap is intrinsic architectural property of DiTs persisting without external coupling; gap collapses under strong coupling; gap is strictly depth-localized in final layers; global low-frequency structures commit before local high-frequency details.
Conclusion: Findings provide mechanistic interpretation of how Diffusion Transformers resolve generative ambiguity by isolating speciation transitions to terminal network layers.
Abstract: Recent theoretical models of diffusion processes, conceptualized as coupled Ornstein-Uhlenbeck systems, predict a hierarchy of interaction timescales, and consequently, the existence of a synchronization gap between modes that commit at different stages of the reverse process. However, because these predictions rely on continuous time and analytically tractable score functions, it remains unclear how this phenomenology manifests in the deep, discrete architectures deployed in practice. In this work, we investigate how the synchronization gap is mechanistically realized within pretrained Diffusion Transformers (DiTs). We construct an explicit architectural realization of replica coupling by embedding two generative trajectories into a joint token sequence, modulated by a symmetric cross attention gate with variable coupling strength g. Through a linearized analysis of the attention difference, we show that the replica interaction decomposes mechanistically. We empirically validate our theoretical framework on a pretrained DiT-XL/2 model by tracking commitment and per layer internal mode energies. Our results reveal that: (1) the synchronization gap is an intrinsic architectural property of DiTs that persists even when external coupling is turned off; (2) as predicted by our spatial routing bounds, the gap completely collapses under strong coupling; (3) the gap is strictly depth localized, emerging sharply only within the final layers of the Transformer; and (4) global, low frequency structures consistently commit before local, high frequency details. Ultimately, our findings provide a mechanistic interpretation of how Diffusion Transformers resolve generative ambiguity, isolating speciation transitions to the terminal layers of the network.
[865] Structural Sensitivity in Compressed Transformers: Error Propagation, Lyapunov Stability, and Formally Verified Bounds
Abhinaba Basu
Main category: cs.LG
TL;DR: Transformer compression sensitivity varies dramatically across layers - early MLP up-projections are catastrophically sensitive while value projections compress easily, with residual connections contracting errors through hidden state growth.
Details
Motivation: To understand why some transformer components are extremely sensitive to compression while others are robust, and to map the landscape of compression sensitivity across different architectures and scales.
Method: Systematic compression experiments across 5 transformer architectures (117M-8B parameters), analysis using Lyapunov stability theory, formal verification with Lean 4 theorems, and validation through downstream tasks and activation-aware pruning.
Result: Found consistent hierarchy: early-layer MLP up-projections are catastrophically sensitive (20,000x perplexity increase), value projections compress nearly for free. Residual connections contract compression errors via hidden state growth. Formal bounds verified across 14,040+ configurations.
Conclusion: Transformer compression sensitivity follows predictable patterns, with error contraction from residual connections being necessary but not sufficient for compression tolerance - architecture-specific redundancy also plays crucial role.
Abstract: A single matrix out of 468 in GPT-2 Small can increase perplexity by 20,000x when compressed, revealing that transformer compression sensitivity spans five orders of magnitude. We map this sensitivity landscape across five architectures (117M-8B parameters), finding a consistent hierarchy: early-layer MLP up-projections are catastrophically sensitive while value projections compress nearly for free. This hierarchy is stable across compression levels, evaluation scales (2K-51K tokens), and datasets (WikiText-103, C4). Using Lyapunov stability theory, we show that residual connections contract compression errors by growing the hidden state faster than the error. Error contraction is necessary but not sufficient for compression tolerance: architecture-specific redundancy plays an equally important role, as demonstrated by the hybrid LFM2-2.6B degrading only 7x despite higher amplification than the fully-contracting GPT-2 Small (120x). Ten machine-checked Lean 4 theorems formalize per-matrix error bounds with no sorry markers; all bounds produce zero violations across 14,040+ configurations. We validate with downstream task evaluation (HellaSwag, ARC-Easy, Winogrande), activation-aware pruning on two architectures, and a Compression Fragility Index that rank-orders model robustness.
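The per-matrix sensitivity measurement can be illustrated on a toy network. This sketch is not the paper's protocol: it uses magnitude pruning on a random three-matrix MLP and measures the output perturbation from compressing each matrix in isolation; all names, dimensions, and the 50% pruning ratio are illustrative.

```python
import random, math

random.seed(1)
DIMS = [8, 32, 8, 4]  # input -> up-projection -> down-projection -> head

def rand_mat(rows, cols):
    return [[random.gauss(0, 1 / math.sqrt(cols)) for _ in range(cols)] for _ in range(rows)]

W = [rand_mat(DIMS[i + 1], DIMS[i]) for i in range(3)]

def forward(ws, x):
    """Tiny MLP forward pass: linear layers with ReLU between them."""
    h = x
    for k, w in enumerate(ws):
        h = [sum(wij * hj for wij, hj in zip(row, h)) for row in w]
        if k < len(ws) - 1:
            h = [max(0.0, v) for v in h]
    return h

def prune(w, frac):
    """Zero out the `frac` smallest-magnitude entries (magnitude pruning)."""
    flat = sorted(abs(v) for row in w for v in row)
    cut = flat[int(frac * len(flat))]
    return [[0.0 if abs(v) < cut else v for v in row] for row in w]

xs = [[random.gauss(0, 1) for _ in range(DIMS[0])] for _ in range(64)]
base = [forward(W, x) for x in xs]

# Compress one matrix at a time and record the mean squared output change.
sens = {}
for idx in range(3):
    ws = [prune(w, 0.5) if i == idx else w for i, w in enumerate(W)]
    err = 0.0
    for x, y0 in zip(xs, base):
        y = forward(ws, x)
        err += sum((a - b) ** 2 for a, b in zip(y, y0))
    sens[idx] = err / len(xs)

for idx, e in sorted(sens.items(), key=lambda kv: -kv[1]):
    print(f"matrix {idx}: output MSE after 50% pruning = {e:.4f}")
```

Ranking the matrices by this error is the toy analogue of the paper's sensitivity hierarchy; in a real model the forward pass would be the language model and the metric would be perplexity.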
[866] Long-Term Outlier Prediction Through Outlier Score Modeling
Yuma Aoki, Joon Park, Koh Takeuchi, Hisashi Kashima, Shinya Akimoto, Ryuichi Hashimoto, Takahiro Adachi, Takeshi Kishikawa, Takamitsu Sasaki
Main category: cs.LG
TL;DR: Proposes a novel two-layer unsupervised method for long-term outlier prediction in time series, enabling forecasting of outlier likelihoods beyond immediate detection.
Details
Motivation: Addresses the gap in time series outlier detection where conventional methods only focus on immediate detection, limiting their ability to forecast outlier events far into the future.
Method: A simple unsupervised two-layer approach: first layer performs standard outlier detection, second layer predicts future outlier scores based on temporal structure of previously observed outliers.
Result: Experiments on synthetic datasets show the method performs well in both detection and prediction tasks, suggesting it can serve as a strong baseline for future work.
Conclusion: The proposed framework enables both pointwise detection and long-term forecasting of outlier likelihoods, addressing an important gap in time series analysis.
Abstract: This study addresses an important gap in time series outlier detection by proposing a novel problem setting: long-term outlier prediction. Conventional methods primarily focus on immediate detection by identifying deviations from normal patterns. As a result, their applicability is limited when forecasting outlier events far into the future. To overcome this limitation, we propose a simple and unsupervised two-layer method that is independent of specific models. The first layer performs standard outlier detection, and the second layer predicts future outlier scores based on the temporal structure of previously observed outliers. This framework enables not only pointwise detection but also long-term forecasting of outlier likelihoods. Experiments on synthetic datasets show that the proposed method performs well in both detection and prediction tasks. These findings suggest that the method can serve as a strong baseline for future work in outlier detection and forecasting.
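A minimal sketch of the two-layer idea, under assumptions not in the paper (rolling z-scores as the first layer, and a seasonal-mean model of past scores as the second):

```python
import math, statistics

def outlier_scores(x, window=20):
    """Layer 1: rolling z-score magnitude as an unsupervised outlier score."""
    scores = []
    for t in range(len(x)):
        ref = x[max(0, t - window):t] or [x[t]]
        mu = statistics.fmean(ref)
        sd = statistics.pstdev(ref) or 1.0
        scores.append(abs(x[t] - mu) / sd)
    return scores

def forecast_scores(scores, period, horizon):
    """Layer 2: seasonal-mean model of observed scores, extrapolated ahead."""
    buckets = [[] for _ in range(period)]
    for t, s in enumerate(scores):
        buckets[t % period].append(s)
    seasonal = [statistics.fmean(b) if b else 0.0 for b in buckets]
    start = len(scores)
    return [seasonal[(start + h) % period] for h in range(horizon)]

# Toy series: a smooth signal with spikes every 25 steps.
x = [math.sin(0.3 * t) + (5.0 if t % 25 == 0 else 0.0) for t in range(200)]
s = outlier_scores(x)
future = forecast_scores(s, period=25, horizon=50)
print(max(range(50), key=lambda h: future[h]))  # horizon step where an outlier is most likely
```

Because the second layer models the scores rather than the raw series, any detector can be swapped into the first layer, matching the paper's model-independence claim.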
[867] When Does Content-Based Routing Work? Representation Requirements for Selective Attention in Hybrid Sequence Models
Abhinaba Basu
Main category: cs.LG
TL;DR: Attention creates latent subspaces enabling precise content-based routing, explaining why recurrent models fail at associative recall without attention.
Details
Motivation: The paper investigates a paradox in hybrid recurrent-attention architectures: content-based routing requires pairwise computation that routing is designed to avoid, creating a fundamental bottleneck.
Method: Conducted 20+ controlled experiments across three tasks (synthetic diagnostic, Zoology MQAR benchmark, HotpotQA) to map the routing landscape exhaustively. Tested various mechanisms including softmax attention, random projections, contrastive pretraining, and non-learned indices.
Result: One layer of softmax attention creates a latent ~34-dimensional subspace enabling 98.4% routing precision; zero layers yield 1.2%. Twelve alternative mechanisms cluster at 15-29%. Non-learned indices (Bloom filter: 90.9%; BM25: 82.7%) bypass the bottleneck entirely, revealing a sharp two-regime hierarchy.
Conclusion: Attention’s role is writing pairwise match results into representations, not merely computing them. These findings provide the mechanistic explanation for why recurrent models fail at associative recall and reframe attention as a representation constructor.
Abstract: We identify a routing paradox in hybrid recurrent-attention architectures: content-based routing - deciding which tokens deserve expensive attention - requires exactly the pairwise computation that routing is designed to avoid. Through 20+ controlled experiments across three tasks (a synthetic diagnostic, the Zoology MQAR benchmark, and HotpotQA), we map the routing landscape exhaustively. One layer of softmax attention creates a latent ~34-dimensional subspace enabling 98.4% routing precision; zero layers yield 1.2%. This subspace is invisible to cosine similarity, destroyed by random projections (98.4% to 2.6%), and cannot be created by contrastive pretraining - proving attention’s role is writing pairwise match results into representations, not merely computing them. Twelve alternative mechanisms all cluster at 15-29%. Non-learned indices (Bloom filter: 90.9%; BM25 on HotpotQA: 82.7%) bypass the bottleneck entirely. The result is a sharp two-regime hierarchy with an empty middle ground. These findings provide the mechanistic explanation for the empirical observation that recurrent models fail at associative recall, and reframe attention as a representation constructor rather than merely a computation mechanism.
[868] CLT-Forge: A Scalable Library for Cross-Layer Transcoders and Attribution Graphs
Florent Draye, Abir Harrasse, Vedant Palit, Tung-Yu Wu, Jiarui Liu, Punya Syon Pandey, Roderick Wu, Terry Jingchen Zhang, Zhijing Jin, Bernhard Schölkopf
Main category: cs.LG
TL;DR: An open-source library for training and analyzing Cross-Layer Transcoders (CLTs) to improve mechanistic interpretability of LLMs through scalable distributed training, automated interpretability pipelines, and visualization tools.
Details
Motivation: Current feature attribution graphs from dictionary learning and transcoders are often large and redundant, limiting practical interpretability. While Cross-Layer Transcoders (CLTs) address this by sharing features across layers, they remain difficult to train and analyze at scale.
Method: Developed an open-source library with: 1) scalable distributed training with model sharding and compressed activation caching, 2) unified automated interpretability pipeline for feature analysis and explanation, 3) attribution graph computation using Circuit-Tracer, and 4) flexible visualization interface.
Result: Created CLT-Forge, a practical and unified solution for scaling CLT-based mechanistic interpretability, available as open-source code at https://github.com/LLM-Interp/CLT-Forge.
Conclusion: The framework provides an end-to-end solution for training and interpreting CLTs, making large-scale mechanistic interpretability more accessible and practical for understanding how LLMs represent and process information.
Abstract: Mechanistic interpretability seeks to understand how Large Language Models (LLMs) represent and process information. Recent approaches based on dictionary learning and transcoders enable representing model computation in terms of sparse, interpretable features and their interactions, giving rise to feature attribution graphs. However, these graphs are often large and redundant, limiting their interpretability in practice. Cross-Layer Transcoders (CLTs) address this issue by sharing features across layers while preserving layer-specific decoding, yielding more compact representations, but remain difficult to train and analyze at scale. We introduce an open-source library for end-to-end training and interpretability of CLTs. Our framework integrates scalable distributed training with model sharding and compressed activation caching, a unified automated interpretability pipeline for feature analysis and explanation, attribution graph computation using Circuit-Tracer, and a flexible visualization interface. This provides a practical and unified solution for scaling CLT-based mechanistic interpretability. Our code is available at: https://github.com/LLM-Interp/CLT-Forge.
[869] Deep Attention-based Sequential Ensemble Learning for BLE-Based Indoor Localization in Care Facilities
Minh Triet Pham, Quynh Chi Dang, Le Nhat Tan
Main category: cs.LG
TL;DR: DASEL: A sequential learning framework for BLE-based indoor localization in care facilities using attention-based bidirectional GRUs and temporal smoothing
Details
Motivation: Traditional ML approaches treat BLE-based localization measurements as independent observations, limiting performance. Care facilities need accurate indoor localization for staff optimization and quality care delivery.
Method: Deep Attention-based Sequential Ensemble Learning (DASEL) reconceptualizes indoor localization as sequential learning problem. Uses frequency-based feature engineering, bidirectional GRU networks with attention mechanisms, multi-directional sliding windows, and confidence-weighted temporal smoothing.
Result: Achieves macro F1 score of 0.4438 on real-world care facility data using 4-fold temporal cross-validation, representing 53.1% improvement over best traditional baseline (0.2898).
Conclusion: DASEL framework effectively captures human movement trajectories and significantly outperforms traditional approaches for indoor localization in care facilities.
Abstract: Indoor localization systems in care facilities enable optimization of staff allocation, workload management, and quality of care delivery. Traditional machine learning approaches to Bluetooth Low Energy (BLE)-based localization treat each temporal measurement as an independent observation, fundamentally limiting their performance. To address this limitation, this paper introduces Deep Attention-based Sequential Ensemble Learning (DASEL), a novel framework that reconceptualizes indoor localization as a sequential learning problem. The framework integrates frequency-based feature engineering, bidirectional GRU networks with attention mechanisms, multi-directional sliding windows, and confidence-weighted temporal smoothing to capture human movement trajectories. Evaluated on real-world data from a care facility using 4-fold temporal cross-validation, DASEL achieves a macro F1 score of 0.4438, representing a 53.1% improvement over the best traditional baseline (0.2898).
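The confidence-weighted temporal smoothing step can be sketched as follows; the specific weighting (each neighbour's class distribution scaled by its own maximum probability) is this sketch's assumption, not DASEL's published formula:

```python
def smooth_predictions(probs, window=2):
    """Confidence-weighted temporal smoothing (stand-in): each timestep's
    class scores are the sum of neighbouring probability distributions,
    each weighted by its own confidence (max probability)."""
    n, k = len(probs), len(probs[0])
    smoothed = []
    for t in range(n):
        acc = [0.0] * k
        for u in range(max(0, t - window), min(n, t + window + 1)):
            conf = max(probs[u])
            for c in range(k):
                acc[c] += conf * probs[u][c]
        smoothed.append(max(range(k), key=lambda c: acc[c]))
    return smoothed

# A noisy trajectory: the person stays in room 0, with one low-confidence
# glitch at t=2 that raw per-timestep argmax would misclassify as room 1.
probs = [
    [0.9, 0.1], [0.8, 0.2], [0.45, 0.55], [0.85, 0.15], [0.9, 0.1],
]
print([max(range(2), key=lambda c: p[c]) for p in probs])  # raw:      [0, 0, 1, 0, 0]
print(smooth_predictions(probs))                           # smoothed: [0, 0, 0, 0, 0]
```

This is why treating timesteps as a sequence helps: a single noisy BLE reading is outvoted by its confident neighbours.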
[870] Fuel Consumption Prediction: A Comparative Analysis of Machine Learning Paradigms
Ali Akram
Main category: cs.LG
TL;DR: Classical ML models outperform black-box deep learning for predicting vehicle fuel efficiency from physical design parameters, with SVM regression and logistic regression showing best performance for continuous and classification tasks respectively.
Details
Motivation: The automotive industry needs accurate predictive modeling to reduce environmental impact and support sustainable engineering design, requiring identification of governing physical factors for vehicle fuel efficiency.
Method: Used data sanitization, statistical outlier elimination, and exploratory data analysis to address multicollinearity. Compared multiple linear regression, SVM regression, and logistic regression on the Motor Trend dataset to predict fuel consumption.
Result: SVM regression achieved best continuous prediction (R²=0.889, RMSE=0.326), capturing non-linear relationships between mass and displacement. Logistic regression achieved 90.8% accuracy for classification with 0.957 recall for low-efficiency vehicles.
Conclusion: Interpretable classical models outperform black-box deep learning for static physical datasets. Vehicle efficiency is fundamentally determined by weight and displacement, suggesting manufacturers should focus on lightweighting and engine downsizing for sustainability.
Abstract: The automotive industry is under growing pressure to reduce its environmental impact, requiring accurate predictive modeling to support sustainable engineering design. This study examines the factors that determine vehicle fuel consumption from the seminal Motor Trend dataset, identifying the governing physical factors of efficiency through rigorous quantitative analysis. Methodologically, the research uses data sanitization, statistical outlier elimination, and in-depth Exploratory Data Analysis (EDA) to curb the occurrence of multicollinearity between powertrain features. A comparative analysis of machine learning paradigms including Multiple Linear Regression, Support Vector Machines (SVM), and Logistic Regression was carried out to assess predictive efficacy. Findings indicate that SVM Regression is most accurate on continuous prediction (R-squared = 0.889, RMSE = 0.326), and is effective in capturing the non-linear relationships between vehicle mass and engine displacement. In parallel, Logistic Regression proved superior for classification (Accuracy = 90.8%) and showed exceptional recall (0.957) when identifying low-efficiency vehicles. These results challenge the current trend toward black-box deep learning architectures for static physical datasets, providing validation of robust performance by interpretable and well-tuned classical models. The research finds that intrinsic vehicle efficiency is fundamentally determined by physical design parameters, weight and displacement, offering a data-driven framework for how manufacturers should focus on lightweighting and engine downsizing to achieve stringent global sustainability goals.
[871] Benchmarking Scientific Machine Learning Models for Air Quality Data
Khawja Imran Masud, Venkata Sai Rahul Unnam, Sahara Ali
Main category: cs.LG
TL;DR: Physics-guided ML/DL models benchmarked for AQI forecasting in North Texas, showing deep learning outperforms classical methods and physics constraints improve stability and physical consistency.
Details
Motivation: Need for accurate AQI forecasting to protect public health in urban regions, with challenges in model evaluation due to lack of rigorous region-specific benchmarking on standardized datasets.
Method: Benchmark classical time-series (LR, SARIMAX), machine learning (MLP), and deep learning (LSTM) approaches with physics-guided variants that incorporate EPA breakpoint-based AQI formulation as consistency constraints through weighted loss. Use EPA daily air quality data (2022-2024) for PM2.5 and O3 in Dallas County with lag-wise forecasting for 1,7,14,30 days.
Result: Deep learning models outperform simpler baselines; physics guidance improves stability and yields physically consistent pollutant-AQI relationships, with largest benefits for short-horizon prediction and for PM2.5 and O3.
Conclusion: Provides practical reference for selecting AQI forecasting models in North Texas and clarifies when lightweight physics constraints meaningfully improve predictive performance across pollutants and forecast horizons.
Abstract: Accurate air quality index (AQI) forecasting is essential for protecting public health in rapidly growing urban regions, yet practical model evaluation and selection are often challenged by the lack of rigorous, region-specific benchmarking on standardized datasets. Physics-guided machine learning and deep learning models offer an effective way to address these issues with more accurate and efficient AQI forecasting. This study presents an explainable, comprehensive benchmark of classical time-series, machine-learning, and deep-learning approaches for multi-horizon AQI forecasting in North Texas (Dallas County), along with a proposed physics-guided model. Using publicly available U.S. Environmental Protection Agency (EPA) daily observations of air quality data from 2022 to 2024, we curate city-level time series for PM2.5 and O3 by aggregating station measurements and constructing lag-wise forecasting datasets for LAG in {1,7,14,30} days. Linear regression (LR), SARIMAX, multilayer perceptrons (MLP), and LSTM networks are evaluated alongside the proposed physics-guided variants (MLP+Physics and LSTM+Physics), which incorporate the EPA breakpoint-based AQI formulation as a consistency constraint through a weighted loss. Experiments using chronological train-test splits and the error metrics MAE and RMSE show that deep-learning models outperform simpler baselines, while physics guidance improves stability and yields physically consistent pollutant-AQI relationships, with the largest benefits observed for short-horizon prediction and for PM2.5 and O3. Overall, the results provide a practical reference for selecting AQI forecasting models in North Texas and clarify when lightweight physics constraints meaningfully improve predictive performance across pollutants and forecast horizons.
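The EPA breakpoint formulation that the physics term builds on is piecewise linear: AQI = (I_hi - I_lo)/(C_hi - C_lo) * (C - C_lo) + I_lo over tabulated concentration segments. A sketch using an illustrative subset of the pre-2024 PM2.5 breakpoints follows; the `physics_guided_loss` weighting is this sketch's assumption, not the paper's exact loss:

```python
# Illustrative subset of the pre-2024 EPA 24-hour PM2.5 breakpoints (ug/m3);
# each row: (conc_lo, conc_hi, aqi_lo, aqi_hi).
PM25_BREAKPOINTS = [
    (0.0, 12.0, 0, 50),
    (12.1, 35.4, 51, 100),
    (35.5, 55.4, 101, 150),
    (55.5, 150.4, 151, 200),
]

def pm25_aqi(conc):
    """EPA piecewise-linear AQI: I = (I_hi-I_lo)/(C_hi-C_lo) * (C-C_lo) + I_lo."""
    for c_lo, c_hi, i_lo, i_hi in PM25_BREAKPOINTS:
        if c_lo <= conc <= c_hi:
            return round((i_hi - i_lo) / (c_hi - c_lo) * (conc - c_lo) + i_lo)
    raise ValueError("concentration outside tabulated breakpoints")

def physics_guided_loss(pred_aqi, true_aqi, pred_conc, weight=0.1):
    """Stand-in for a weighted physics-guided loss: a data term plus a
    consistency term tying the predicted AQI to the breakpoint formula."""
    data = (pred_aqi - true_aqi) ** 2
    consistency = (pred_aqi - pm25_aqi(pred_conc)) ** 2
    return data + weight * consistency

print(pm25_aqi(9.0))    # inside the "Good" segment
print(pm25_aqi(35.4))   # top of the "Moderate" segment -> 100
print(physics_guided_loss(pred_aqi=60.0, true_aqi=55.0, pred_conc=14.0))
```

The consistency term penalizes AQI predictions that disagree with the AQI implied by the predicted pollutant concentration, which is the mechanism behind the "physically consistent pollutant-AQI relationships" observed in the results.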
[872] Confidence Freeze: Early Success Induces a Metastable Decoupling of Metacognition and Behaviour
Zhipeng Zhang, Hongshun He
Main category: cs.LG
TL;DR: Early success in a learning task causes maladaptive persistence through failure streaks despite dropping confidence, suggesting a “confidence-freeze” state rather than stable trait.
Details
Motivation: To understand why humans persist with failing strategies despite accumulating negative evidence, and to test whether this is a dynamic learning state rather than a stable dispositional trait.
Method: Multi-reversal two-armed bandit task across three experiments (total N=332; 19,920 trials), comparing groups with different early success rates (90% vs 60%).
Result: Individuals with high early success (90%) showed robust persistence through long failure streaks (mean=6.2 consecutive losses) despite dropping metacognitive confidence, while control group followed normative pattern of using outcome trajectories appropriately.
Conclusion: Persistence in failing strategies can be explained by a “confidence-freeze” account where early success creates a learning state that persists despite contradictory evidence, rather than being a stable trait.
Abstract: Humans must flexibly arbitrate between exploring alternatives and exploiting learned strategies, yet they frequently exhibit maladaptive persistence by continuing to execute failing strategies despite accumulating negative evidence. Here we propose a "confidence-freeze" account that reframes such persistence as a dynamic learning state rather than a stable dispositional trait. Using a multi-reversal two-armed bandit task across three experiments (total N = 332; 19,920 trials), we first show that human learners normally make use of the symmetric statistical structure inherent in outcome trajectories: runs of successes provide positive evidence for environmental stability and thus for strategy maintenance, whereas runs of failures provide negative evidence and should raise switching probability. Behaviour in the control group conformed to this normative pattern. However, individuals who experienced a high rate of early success (90% vs. 60%) displayed a robust and selective distortion after the first reversal: they persisted through long stretches of non-reward (mean = 6.2 consecutive losses) while their metacognitive confidence ratings simultaneously dropped from 5 to 2 on a 7-point scale.
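The task structure is easy to simulate. The environment below follows the paper's description (two arms, periodic reversals); the `StreakAgent` and its `patience` knob are this sketch's stand-in for flexible versus "frozen" behaviour, not the authors' cognitive model:

```python
import random

def run_bandit(agent, p_good=0.8, p_bad=0.2, reversal_every=40, trials=400, seed=0):
    """Multi-reversal two-armed bandit: the better arm swaps every `reversal_every` trials."""
    rng = random.Random(seed)
    good = 0
    history = []
    for t in range(trials):
        if t and t % reversal_every == 0:
            good = 1 - good                  # reversal
        choice = agent.choose()
        p = p_good if choice == good else p_bad
        reward = 1 if rng.random() < p else 0
        agent.observe(reward)
        history.append((choice, reward))
    return history

class StreakAgent:
    """Switches arms after `patience` consecutive losses; a long `patience`
    mimics confidence-freeze persistence through failure streaks."""
    def __init__(self, patience):
        self.arm, self.losses, self.patience = 0, 0, patience

    def choose(self):
        return self.arm

    def observe(self, reward):
        self.losses = 0 if reward else self.losses + 1
        if self.losses >= self.patience:
            self.arm, self.losses = 1 - self.arm, 0

flexible = run_bandit(StreakAgent(patience=2))
frozen = run_bandit(StreakAgent(patience=8))
print(sum(r for _, r in flexible), sum(r for _, r in frozen))  # total reward: flexible vs frozen
```

After each reversal the high-patience agent keeps choosing the now-bad arm through long loss streaks, which is the behavioural signature the paper attributes to early success rather than to a stable trait.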
[873] Semi-Supervised Learning with Balanced Deep Representation Distributions
Changchun Li, Ximing Li, Bingjie Zhang, Wenting Wang, Jihong Ouyang
Main category: cs.LG
TL;DR: S2TC-BDD improves semi-supervised text classification by addressing margin bias through angular margin loss and balanced label angle variances, achieving better pseudo-label accuracy especially with scarce labeled data.
Details
Motivation: Semi-supervised text classification suffers from low pseudo-label accuracy due to margin bias caused by imbalanced representation distributions between labels. This limits performance, especially when labeled data is scarce.
Method: Proposes S2TC-BDD using angular margin loss and Gaussian linear transformations to balance label angle variances. Implements both multi-class and multi-label versions with pseudo-labeling tricks and regularization terms during self-training loops.
Result: Empirical results show S2TC-BDD outperforms state-of-the-art SSTC methods, demonstrating effectiveness particularly when labeled texts are scarce.
Conclusion: Balancing label angle variances through angular margin loss and transformations significantly improves pseudo-label accuracy and overall performance in semi-supervised text classification.
Abstract: Semi-Supervised Text Classification (SSTC) mainly works under the spirit of self-training: methods initialize the deep classifier by training over labeled texts, and then alternately predict unlabeled texts as their pseudo-labels and train the deep classifier over the mixture of labeled and pseudo-labeled texts. Naturally, their performance is largely affected by the accuracy of pseudo-labels for unlabeled texts. Unfortunately, they often suffer from low accuracy because of the margin bias problem caused by the large difference between representation distributions of labels in SSTC. To alleviate this problem, we apply the angular margin loss, and perform several Gaussian linear transformations to achieve balanced label angle variances, i.e., the variance of label angles of texts within the same label. Higher accuracy of predicted pseudo-labels can be achieved by constraining all label angle variances to be balanced, where they are estimated over both labeled and pseudo-labeled texts during self-training loops. With this insight, we propose a novel SSTC method, namely Semi-Supervised Text Classification with Balanced Deep representation Distributions (S2TC-BDD). We implement both multi-class classification and multi-label classification versions of S2TC-BDD by introducing some pseudo-labeling tricks and regularization terms. To evaluate S2TC-BDD, we compare it against the state-of-the-art SSTC methods. Empirical results demonstrate the effectiveness of S2TC-BDD, especially when the labeled texts are scarce.
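The balanced-variance idea can be illustrated in a few lines: measure the angle between each text embedding and its label's weight vector, compute the per-label variance of those angles, and penalize imbalance across labels. This is a toy illustration of the quantity being balanced, not the paper's training code; the vector shapes and the penalty form are assumptions.

```python
import math

def angle(u, v):
    """Angle (radians) between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return math.acos(max(-1.0, min(1.0, dot / (nu * nv))))

def label_angle_variances(embeddings, labels, class_weights):
    """Variance of label angles (text embedding vs. its class weight vector), per label."""
    per_label = {}
    for x, y in zip(embeddings, labels):
        per_label.setdefault(y, []).append(angle(x, class_weights[y]))
    variances = {}
    for y, angles in per_label.items():
        mean = sum(angles) / len(angles)
        variances[y] = sum((a - mean) ** 2 for a in angles) / len(angles)
    return variances

def balance_penalty(variances):
    """Toy regularizer: spread of per-label angle variances (0 when perfectly balanced)."""
    vals = list(variances.values())
    mean = sum(vals) / len(vals)
    return sum((v - mean) ** 2 for v in vals) / len(vals)
```

When every label's texts sit at the same angle to their class vector, the penalty vanishes; dispersed labels raise it, which is the imbalance the method suppresses.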
[874] Mixture of Chapters: Scaling Learnt Memory in Transformers
Tasmay Pankaj Tibrewal, Pritish Saha, Ankit Meda, Kunal Singh, Pradeep Moturi
Main category: cs.LG
TL;DR: Transformers enhanced with learnable sparse memory banks that use chapter-based routing to scale to 262K memory tokens, improving knowledge retention and reducing forgetting during training phase transitions.
Details
Motivation: Transformers lack explicit architectural mechanisms for storing and organizing knowledge acquired during training, which can lead to forgetting during training phase transitions and limits knowledge capacity.
Method: Introduces learnable sparse memory banks - latent tokens trained end-to-end that transformer layers query via cross-attention. Uses chapter-based routing (inspired by Mixture-of-Experts) to partition memory into chapters and train a router to select relevant subsets per input, enabling scaling to 262K tokens while maintaining tractable computation.
Result: Models surpass iso-FLOP baselines on pre-training and instruction fine-tuning benchmarks, demonstrating explicit associative memory provides complementary capacity to implicit parameter knowledge. Shows improved knowledge retention under continued training and robustness to forgetting when transitioning between training phases.
Conclusion: Explicit associative memory in transformers offers a new axis of scaling beyond parameter count, with practical benefits for knowledge retention and training stability across different phases.
Abstract: Transformers lack an explicit architectural mechanism for storing and organizing knowledge acquired during training. We introduce learnable sparse memory banks: a set of latent tokens, randomly initialized and trained end-to-end, that transformer layers query via cross-attention to retrieve stored knowledge. To scale memory capacity without prohibitive attention costs, we propose chapter-based routing inspired by Mixture-of-Experts architectures, partitioning the memory bank into chapters and training a router to select relevant subsets per input. This enables scaling to 262K memory tokens while maintaining tractable computation. We evaluate our approach against standard transformers (in iso-FLOP settings) on pre-training and instruction fine-tuning across relevant benchmarks. Our models surpass iso-FLOP baselines, suggesting scope for a new axis of scaling, demonstrating that explicit associative memory provides complementary capacity to what is captured implicitly in model parameters. Additionally, we observe improved knowledge retention under continued training, with robustness to forgetting when transitioning between training phases (e.g., pretraining to instruction fine-tuning).
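The routing mechanism can be sketched as follows: a router scores chapters against the query, and only the top-k chapters' tokens enter the cross-attention, so cost scales with k times tokens-per-chapter rather than the full bank. All dimensions, the dot-product router, and the random (untrained) parameters below are illustrative assumptions, not the paper's architecture.

```python
import math, random

random.seed(0)
D = 8                     # embedding dim
CHAPTERS, TOK = 4, 16     # memory bank: 4 chapters x 16 memory tokens
TOP_K = 2                 # chapters selected per input

memory = [[[random.gauss(0, 1) for _ in range(D)] for _ in range(TOK)]
          for _ in range(CHAPTERS)]
router_keys = [[random.gauss(0, 1) for _ in range(D)] for _ in range(CHAPTERS)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def route_and_read(query):
    """Router picks TOP_K chapters; the query cross-attends only over their tokens."""
    scores = [dot(query, k) for k in router_keys]
    chosen = sorted(range(CHAPTERS), key=lambda c: -scores[c])[:TOP_K]
    tokens = [t for c in chosen for t in memory[c]]          # TOP_K * TOK tokens attended
    attn = softmax([dot(query, t) / math.sqrt(D) for t in tokens])
    read = [sum(a * t[i] for a, t in zip(attn, tokens)) for i in range(D)]
    return chosen, read

chosen, read = route_and_read([random.gauss(0, 1) for _ in range(D)])
```

Only TOP_K * TOK of the CHAPTERS * TOK memory tokens are attended per query, which is how the bank can grow (to 262K tokens in the paper) without attention cost growing with it.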
[875] ResPrune: Text-Conditioned Subspace Reconstruction for Visual Token Pruning in Large Vision-Language Models
Xu Li, Yi Zheng, Yuxuan Liang, Zhe Liu, Xiaolei Chen, Haotian Chen, Rui Zhu, Xiangyang Xue
Main category: cs.LG
TL;DR: ResPrune is a training-free visual token pruning framework for Large Vision-Language Models that selects compact yet informative subsets of visual tokens to reduce computational overhead while maintaining performance.
Details
Motivation: LVLMs use dense visual tokens that create substantial computational and memory overhead during inference, creating a need for efficient token pruning methods that preserve visual information.
Method: Formulates visual token pruning as subspace reconstruction with greedy subspace expansion guided by residual energy, conditioned on textual relevance to retain instruction-relevant tokens. Model-agnostic and requires no retraining.
Result: Outperforms existing pruning approaches across multiple LVLM backbones (LLaVA-1.5, LLaVA-NeXT, Qwen2.5-VL) on various benchmarks while reducing computation, memory consumption, and inference latency.
Conclusion: ResPrune provides an effective, lightweight solution for efficient LVLM inference through training-free visual token pruning that preserves both geometric structure and cross-modal alignment.
Abstract: Large Vision-Language Models (LVLMs) rely on dense visual tokens to capture fine-grained visual information, but processing all these tokens incurs substantial computational and memory overhead during inference. To address this issue, we propose ResPrune, a training-free visual token pruning framework that enables efficient LVLM inference by selecting a compact yet informative subset of visual tokens. ResPrune formulates visual token pruning as a subspace reconstruction problem and employs a greedy subspace expansion strategy guided by residual energy, allowing it to preserve the geometric structure of the original visual token space. To further incorporate cross-modal alignment, the selection process is conditioned on textual relevance, encouraging the retention of tokens that are both informative and instruction-relevant. The proposed method is lightweight and model-agnostic, and can be seamlessly integrated into existing LVLM pipelines without retraining or architectural modifications. Extensive experiments on multiple LVLM backbones, including LLaVA-1.5, LLaVA-NeXT, and Qwen2.5-VL, demonstrate that ResPrune consistently outperforms existing pruning approaches across a wide range of benchmarks, while achieving effective reductions in computation, memory consumption, and inference latency.
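The core selection loop reads like greedy orthogonal matching: repeatedly pick the token with the largest remaining (residual) energy after projecting out the span of tokens already chosen, optionally up-weighting tokens by a text-relevance score. This is a toy reconstruction from the abstract; ResPrune's exact scoring and text-conditioning may differ.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def greedy_residual_select(tokens, k, text_relevance=None):
    """Pick k token indices; each pick maximizes residual energy (norm left after
    projecting onto the span of already-selected tokens), optionally weighted by
    a text-relevance score. k must not exceed the rank of the token set."""
    basis, selected = [], []
    residuals = [list(t) for t in tokens]
    weights = text_relevance or [1.0] * len(tokens)
    for _ in range(k):
        i = max((j for j in range(len(tokens)) if j not in selected),
                key=lambda j: weights[j] * norm(residuals[j]))
        selected.append(i)
        r = residuals[i]
        b = [x / norm(r) for x in r]          # new orthonormal basis direction
        basis.append(b)
        for j in range(len(tokens)):           # deflate every residual
            c = dot(residuals[j], b)
            residuals[j] = [x - c * bx for x, bx in zip(residuals[j], b)]
    return selected
```

A duplicate token contributes no residual energy once its twin is selected, so redundancy is pruned automatically, while text relevance can steer the greedy order toward instruction-relevant tokens.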
[876] DMMRL: Disentangled Multi-Modal Representation Learning via Variational Autoencoders for Molecular Property Prediction
Long Xu, Junping Guo, Jianbo Zhao, Jianbo Lu, Yuzhong Peng
Main category: cs.LG
TL;DR: DMMRL uses variational autoencoders to disentangle molecular representations into shared (structure-relevant) and private (modality-specific) latent spaces, improving interpretability and predictive performance for molecular property prediction.
Details
Motivation: Existing molecular property prediction models have entangled representations that conflate structural, chemical, and functional factors, limiting interpretability and transferability. They also inadequately exploit complementary information from different molecular modalities (graphs, sequences, geometries) through naive concatenation that neglects inter-modal dependencies.
Method: Uses variational autoencoders to disentangle representations into shared (structure-relevant) and private (modality-specific) latent spaces. Employs orthogonality and alignment regularizations to promote statistical independence and cross-modal consistency. Includes a gated attention fusion module to adaptively integrate shared representations and capture complex inter-modal relationships.
Result: Experimental validation across seven benchmark datasets demonstrates DMMRL’s superior performance relative to state-of-the-art approaches.
Conclusion: DMMRL effectively disentangles molecular representations, enhancing both interpretability and predictive performance for molecular property prediction by better exploiting complementary information from different molecular modalities.
Abstract: Molecular property prediction constitutes a cornerstone of drug discovery and materials science, necessitating models capable of disentangling complex structure-property relationships across diverse molecular modalities. Existing approaches frequently exhibit entangled representations (conflating structural, chemical, and functional factors), thereby limiting interpretability and transferability. Furthermore, conventional methods inadequately exploit complementary information from graphs, sequences, and geometries, often relying on naive concatenation that neglects inter-modal dependencies. In this work, we propose DMMRL, which employs variational autoencoders to disentangle molecular representations into shared (structure-relevant) and private (modality-specific) latent spaces, enhancing both interpretability and predictive performance. The proposed variational disentanglement mechanism effectively isolates the most informative features for property prediction, while orthogonality and alignment regularizations promote statistical independence and cross-modal consistency. Additionally, a gated attention fusion module adaptively integrates shared representations, capturing complex inter-modal relationships. Experimental validation across seven benchmark datasets demonstrates DMMRL’s superior performance relative to state-of-the-art approaches. The code and data underlying this article are freely available at https://github.com/xulong0826/DMMRL.
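As a rough sketch, the two regularizers named above might look like this, with cosine similarity standing in for whatever independence and consistency measures DMMRL actually uses (the true objectives operate on VAE latent distributions, not raw vectors, so this is an assumption for illustration):

```python
import math

def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def _cos(u, v):
    return _dot(u, v) / (math.sqrt(_dot(u, u)) * math.sqrt(_dot(v, v)))

def orthogonality_loss(shared, private):
    """Push each sample's shared and private codes apart
    (independence approximated here by squared cosine similarity)."""
    return sum(_cos(s, p) ** 2 for s, p in zip(shared, private)) / len(shared)

def alignment_loss(shared_mod_a, shared_mod_b):
    """Pull the shared codes of two modalities (e.g. graph vs. sequence) together."""
    return sum(1.0 - _cos(a, b)
               for a, b in zip(shared_mod_a, shared_mod_b)) / len(shared_mod_a)
```

The orthogonality term is zero when shared and private codes are perpendicular, and the alignment term is zero when the two modalities agree on the shared code, which is the disentanglement behavior the paper targets.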
[877] Learning from Label Proportions with Dual-proportion Constraints
Tianhao Ma, Ximing Li, Changchun Li, Renchu Guan
Main category: cs.LG
TL;DR: A method called LLP-DC improves Learning from Label Proportions by enforcing dual proportion constraints at both bag and instance levels during training, using a minimum-cost maximum-flow algorithm for hard pseudo-label generation.
Details
Motivation: Learning from Label Proportions (LLP) addresses scenarios where instance-level labels are unavailable due to privacy constraints or high labeling costs, requiring methods that can learn from only bag-level label proportions.
Method: LLP-DC enforces dual proportion constraints: bag-level training aligns mean predictions with given proportions, and instance-level training aligns hard pseudo-labels satisfying proportion constraints using a minimum-cost maximum-flow algorithm.
Result: Extensive experiments across benchmark datasets show LLP-DC consistently outperforms previous LLP methods across various datasets and bag sizes.
Conclusion: The dual constraint approach effectively improves LLP performance by leveraging both bag-level and instance-level proportion information during training.
Abstract: Learning from Label Proportions (LLP) is a weakly supervised problem in which the training data comprise bags, that is, groups of instances, each annotated only with bag-level class label proportions, and the objective is to learn a classifier that predicts instance-level labels. This setting is widely applicable when privacy constraints limit access to instance-level annotations or when fine-grained labeling is costly or impractical. In this work, we introduce a method that leverages Dual proportion Constraints (LLP-DC) during training, enforcing them at both the bag and instance levels. Specifically, the bag-level training aligns the mean prediction with the given proportion, and the instance-level training aligns hard pseudo-labels that satisfy the proportion constraint, where a minimum-cost maximum-flow algorithm is used to generate hard pseudo-labels. Extensive experimental results across various benchmark datasets empirically validate that LLP-DC consistently improves over previous LLP methods across datasets and bag sizes. The code is publicly available at https://github.com/TianhaoMa5/CVPR2026_Findings_LLP_DC.
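For intuition, the instance-level step solves a constrained assignment: give each instance a hard label so that per-bag class counts match the given proportions while the model's total log-likelihood is maximized. The paper solves this at scale with min-cost max-flow; the brute-force stand-in below is equivalent only for tiny bags and is purely illustrative.

```python
import math
from itertools import permutations

def hard_pseudo_labels(probs, counts):
    """Assign hard labels so that exactly counts[c] instances get class c,
    maximizing total log-probability. Brute force over distinct labelings
    (a toy stand-in for the paper's min-cost max-flow solver; only viable
    for very small bags)."""
    labels = [c for c, n in enumerate(counts) for _ in range(n)]
    best, best_score = None, -math.inf
    for perm in set(permutations(labels)):
        score = sum(math.log(probs[i][c]) for i, c in enumerate(perm))
        if score > best_score:
            best, best_score = list(perm), score
    return best
```

Note the difference from plain argmax pseudo-labeling: even when every instance individually prefers class 0, the proportion constraint forces the least-confident one into class 1, which is exactly the signal the bag proportions contribute.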
[878] Beyond a Single Signal: SPECTRE-G2, A Unified Multi-Expert Anomaly Detector for Unknown Unknowns
Rahul D Ray
Main category: cs.LG
TL;DR: SPECTRE-G2 is a multi-signal anomaly detector that combines eight complementary signals from a dual-backbone neural network to detect diverse structural anomalies and unknown unknowns in machine learning systems.
Details
Motivation: Current uncertainty quantification methods rely on single signals (confidence or density) and fail to detect diverse structural anomalies. There's a need for systems that can recognize the limits of their knowledge and act safely under uncertainty, especially when faced with unknown unknowns in open-world settings.
Method: Uses a dual-backbone neural network with spectral normalised Gaussianization encoder and plain MLP preserving feature geometry, plus an ensemble of five models. Produces eight complementary signals: density, geometry, uncertainty, discriminative, and causal signals. Signals are normalized using validation statistics and calibrated with synthetic out-of-distribution data. Adaptive top-k fusion selects the most informative signals and averages their scores.
Result: Experiments on synthetic, Adult, CIFAR-10, and Gridworld datasets show strong performance across diverse anomaly types, outperforming multiple baselines on AUROC, AUPR, and FPR95 metrics. The model is stable across seeds and particularly effective for detecting new variables and confounders.
Conclusion: SPECTRE-G2 provides a practical approach for detecting unknown unknowns in open-world settings by combining multiple complementary signals, addressing limitations of single-signal uncertainty quantification methods.
Abstract: Epistemic intelligence requires machine learning systems to recognise the limits of their own knowledge and act safely under uncertainty, especially when faced with unknown unknowns. Existing uncertainty quantification methods rely on a single signal such as confidence or density and fail to detect diverse structural anomalies. We introduce SPECTRE-G2, a multi-signal anomaly detector that combines eight complementary signals from a dual-backbone neural network. The architecture includes a spectral normalised Gaussianization encoder, a plain MLP preserving feature geometry, and an ensemble of five models. These produce density, geometry, uncertainty, discriminative, and causal signals. Each signal is normalised using validation statistics and calibrated with synthetic out-of-distribution data. An adaptive top-k fusion selects the most informative signals and averages their scores. Experiments on synthetic, Adult, CIFAR-10, and Gridworld datasets show strong performance across diverse anomaly types, outperforming multiple baselines on AUROC, AUPR, and FPR95. The model is stable across seeds and particularly effective for detecting new variables and confounders. SPECTRE-G2 provides a practical approach for detecting unknown unknowns in open-world settings.
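The fusion step can be sketched in a few lines: z-normalize each anomaly signal with statistics computed on validation data, keep the k most informative normalized scores, and average them. The selection rule used here (largest z-scores) and the value of k are assumptions; the paper's calibrated adaptive rule may differ.

```python
def fuse(signals, val_stats, k=3):
    """Z-normalize each anomaly signal with validation (mean, std) pairs,
    keep the k largest normalized scores, and average them into one score."""
    z = [(s - m) / sd for s, (m, sd) in zip(signals, val_stats)]
    top = sorted(z, reverse=True)[:k]
    return sum(top) / k
```

The point of the top-k step is that an anomaly only needs to fire a few of the eight signals strongly; averaging all eight would dilute a clear density or causal alarm with six indifferent signals.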
[879] Model Evolution Under Zeroth-Order Optimization: A Neural Tangent Kernel Perspective
Chen Zhang, Yuxin Cheng, Chenchen Ding, Shuqi Wang, Jingreng Lei, Runsheng Yu, Yik-Chung Wu, Ngai Wong
Main category: cs.LG
TL;DR: The paper introduces Neural Zeroth-order Kernel (NZK) theory to characterize training dynamics of zeroth-order optimization methods, providing theoretical analysis for linear models and linearized neural networks.
Details
Motivation: Zeroth-order optimization enables memory-efficient neural network training but has poorly understood training dynamics compared to first-order methods with established Neural Tangent Kernel theory. The stochastic nature of gradient estimation obscures how models evolve during training.
Method: Introduces Neural Zeroth-order Kernel (NZK) to describe model evolution in function space under zeroth-order updates. Proves that for linear models, the expected NZK remains constant throughout training and depends on first and second moments of random perturbation directions. Extends analysis to linearized neural networks and interprets ZO updates as kernel gradient descent via NZK.
Result: Theoretical analysis shows NZK invariance yields closed-form expression for model evolution under squared loss. Experiments on synthetic and real-world datasets (MNIST, CIFAR-10, Tiny ImageNet) validate theoretical results and demonstrate acceleration when using a single shared random vector.
Conclusion: NZK provides a novel theoretical framework for understanding zeroth-order optimization dynamics, offering insights into model evolution and potential acceleration strategies. The kernel perspective enables better characterization of ZO training behavior compared to previous stochastic gradient estimation approaches.
Abstract: Zeroth-order (ZO) optimization enables memory-efficient training of neural networks by estimating gradients via forward passes only, eliminating the need for backpropagation. However, the stochastic nature of gradient estimation significantly obscures the training dynamics, in contrast to the well-characterized behavior of first-order methods under Neural Tangent Kernel (NTK) theory. To address this, we introduce the Neural Zeroth-order Kernel (NZK) to describe model evolution in function space under ZO updates. For linear models, we prove that the expected NZK remains constant throughout training and depends explicitly on the first and second moments of the random perturbation directions. This invariance yields a closed-form expression for model evolution under squared loss. We further extend the analysis to linearized neural networks. Interpreting ZO updates as kernel gradient descent via NZK provides a novel perspective for potentially accelerating convergence. Extensive experiments across synthetic and real-world datasets (including MNIST, CIFAR-10, and Tiny ImageNet) validate our theoretical results and demonstrate acceleration when using a single shared random vector.
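The basic object behind the NZK is the two-point zeroth-order update: estimate a directional derivative from two forward evaluations along a random direction u, then step along u. A minimal sketch on a toy least-squares problem follows; the data, stepsize, and iteration count are arbitrary choices, not the paper's setup.

```python
import random

random.seed(0)

# Toy least-squares target: data generated by w* = (1, 2).
xs = [[1.0, 2.0], [2.0, 1.0], [0.5, -1.0]]
ys = [5.0, 4.0, -1.5]

def loss(w):
    return sum((sum(wi * xi for wi, xi in zip(w, x)) - y) ** 2
               for x, y in zip(xs, ys)) / len(xs)

def zo_step(w, lr=0.02, eps=1e-3):
    """Two-point zeroth-order update: two forward passes, no backpropagation."""
    u = [random.gauss(0, 1) for _ in w]                   # random perturbation direction
    up = [wi + eps * ui for wi, ui in zip(w, u)]
    um = [wi - eps * ui for wi, ui in zip(w, u)]
    g = (loss(up) - loss(um)) / (2 * eps)                 # estimated directional derivative
    return [wi - lr * g * ui for wi, ui in zip(w, u)]

w = [0.0, 0.0]
start = loss(w)
for _ in range(500):
    w = zo_step(w)
```

The update direction g * u is what makes the induced kernel depend on the first and second moments of u, the quantity the paper's constant-expected-NZK result is about; the "single shared random vector" acceleration changes how u is drawn.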
[880] Pruned Adaptation Modules: A Simple yet Strong Baseline for Continual Foundation Models
Elif Ceren Gok Yildirim, Murat Onur Yildirim, Joaquin Vanschoren
Main category: cs.LG
TL;DR: PAM introduces a lightweight continual learning method using pruned adaptation modules with sparse task-specific layers, outperforming foundation model-based approaches while using significantly fewer parameters.
Details
Motivation: The paper addresses the methodological gap in continual learning where the field has shifted to foundation model-based approaches without proper comparison to strong convolutional baselines, making it unclear whether recent progress represents genuine advances or just lack of rigorous baselines.
Method: PAM (Pruned Adaptation Modules) freezes most of a pre-trained ResNet and enables scalable continual adaptation through sparse task-specific layers, achieving up to 5x reduction in trainable parameters and 6x reduction in total parameters.
Result: PAM consistently mitigates catastrophic forgetting and outperforms state-of-the-art foundation model-based continual learning approaches across diverse benchmarks while being more parameter-efficient.
Conclusion: PAM serves as a strong, transparent baseline that bridges the gap between traditional and foundation model-based continual learning, enabling more accurate assessment of true progress in continual adaptation.
Abstract: The continual learning literature has rapidly shifted from traditional class incremental learning (CIL) techniques to foundation model (FM)-based CIL methods without a clear understanding of how these newer approaches compare to strong, lightweight convolutional baselines. This abrupt transition has created a substantial methodological gap, making it difficult to assess whether recent FM-based CIL progress reflects genuine advances or merely the absence of rigorous baselines. To address this gap, we introduce Pruned Adaptation Modules (PAM), a simple yet effective method that freezes the vast majority of the pre-trained ResNet while enabling scalable continual adaptation through sparse task-specific layers. PAM yields up to a ~5x reduction in trainable parameters and a ~6x reduction in total parameters, significantly reducing the cost of continual updates. Across diverse benchmarks, PAM consistently mitigates catastrophic forgetting and outperforms state-of-the-art FM-based CIL approaches. Our findings position PAM as a strong and transparent baseline that helps bridge the gap between traditional and FM-based CIL, guiding future research for a more accurate assessment of true progress in continual adaptation. The code can be found at: https://github.com/ElifCerenGokYildirim/PAM.
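The parameter accounting behind PAM's headline numbers is easy to sketch: freeze the backbone and train only a sparse task-specific module, so the trainable count shrinks roughly by the keep ratio. Everything below (layer size, keep ratio, random mask) is an illustrative assumption, not the paper's pruning criterion.

```python
import random

random.seed(0)
D = 64

# Frozen pre-trained layer: its D*D weights are never updated during continual learning.
frozen_params = D * D

def make_pruned_adapter(d, keep_ratio=0.2):
    """Task-specific adapter: a d x d weight matrix with a fixed sparsity mask;
    only unmasked entries are trainable (toy sketch of a pruned adaptation module)."""
    mask = [[1 if random.random() < keep_ratio else 0 for _ in range(d)]
            for _ in range(d)]
    weights = [[0.0] * d for _ in range(d)]
    return weights, mask

weights, mask = make_pruned_adapter(D)
trainable = sum(sum(row) for row in mask)
```

With a 20% keep ratio, each new task adds roughly a fifth of a layer's parameters, which is where a reduction on the order of 5x in trainable parameters comes from; one such module per task also keeps old tasks' weights untouched, mitigating forgetting.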
[881] Rethinking Plasticity in Deep Reinforcement Learning
Zhiqiang He
Main category: cs.LG
TL;DR: The paper investigates plasticity loss in deep RL, proposing that it occurs because optimal points from previous tasks become poor local optima for new tasks, trapping parameters and hindering learning during task transitions.
Details
Motivation: To understand the fundamental mechanisms behind plasticity loss in deep reinforcement learning, where neural networks lose their ability to adapt to non-stationary environments, moving beyond descriptive metrics to explain underlying optimization dynamics.
Method: Proposes the Optimization-Centric Plasticity (OCP) hypothesis, theoretically establishes equivalence between neuron dormancy and zero-gradient states, and conducts experiments across diverse non-stationary RL scenarios to validate the framework.
Result: Shows plasticity loss is task-specific; networks with high dormancy in one task can achieve performance parity with randomly initialized networks on significantly different tasks. Parameter constraints mitigate plasticity loss by preventing deep entrenchment in local optima.
Conclusion: Provides a rigorous optimization-based framework for understanding and restoring network plasticity in complex RL domains, explaining why plasticity loss occurs and how to mitigate it through optimization landscape analysis.
Abstract: This paper investigates the fundamental mechanisms driving plasticity loss in deep reinforcement learning (RL), a critical challenge where neural networks lose their ability to adapt to non-stationary environments. While existing research often relies on descriptive metrics like dormant neurons or effective rank, these summaries fail to explain the underlying optimization dynamics. We propose the Optimization-Centric Plasticity (OCP) hypothesis, which posits that plasticity loss arises because optimal points from previous tasks become poor local optima for new tasks, trapping parameters during task transitions and hindering subsequent learning. We theoretically establish the equivalence between neuron dormancy and zero-gradient states, demonstrating that the absence of gradient signals is the primary driver of dormancy. Our experiments reveal that plasticity loss is highly task-specific; notably, networks with high dormancy rates in one task can achieve performance parity with randomly initialized networks when switched to a significantly different task, suggesting that the network’s capacity remains intact but is inhibited by the specific optimization landscape. Furthermore, our hypothesis elucidates why parameter constraints mitigate plasticity loss by preventing deep entrenchment in local optima. Validated across diverse non-stationary scenarios, our findings provide a rigorous optimization-based framework for understanding and restoring network plasticity in complex RL domains.
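The claimed equivalence between dormancy and zero gradients can be checked numerically for a single ReLU unit: if its pre-activation is negative on every input, finite differences on its incoming weights return exactly zero. A minimal check (the weights, inputs, and bias values are contrived):

```python
def relu(z):
    return max(0.0, z)

def unit_output(w, x, b):
    return relu(sum(wi * xi for wi, xi in zip(w, x)) + b)

# A "dormant" unit: bias so negative that the pre-activation is < 0 on all inputs.
inputs = [[1.0, 2.0], [0.5, -0.3], [2.0, 1.0]]
w, b = [0.1, 0.2], -10.0

def grad_wrt_w0(w, b, eps=1e-6):
    """Finite-difference gradient of the summed unit output w.r.t. w[0]."""
    def total(weights):
        return sum(unit_output(weights, x, b) for x in inputs)
    return (total([w[0] + eps, w[1]]) - total([w[0] - eps, w[1]])) / (2 * eps)
```

With b = -10 the unit outputs zero everywhere, so no gradient signal reaches its weights and nothing can revive it through gradient descent; with b = 0 the unit is active on some inputs and the gradient is nonzero. This is the zero-gradient trap the OCP hypothesis generalizes to whole networks.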
[882] Reward Sharpness-Aware Fine-Tuning for Diffusion Models
Kwanyoung Kim, Byeongsu Sim
Main category: cs.LG
TL;DR: RSA-FT: A method to prevent reward hacking in reward-centric diffusion reinforcement learning by using gradients from robustified reward models through parameter and sample perturbations.
Details
Motivation: RLHF has been successful for aligning LLMs with human preferences, but similar approaches for diffusion models (RDRL) suffer from reward hacking where reward scores increase without actual quality improvements, due to non-robust reward model gradients.
Method: Introduces RSA-FT which exploits gradients from robustified reward models without retraining. Uses gradients from flattened reward models obtained through parameter perturbations of the diffusion model and perturbations of generated samples.
Result: Each method independently alleviates reward hacking and improves robustness, with joint use amplifying benefits. RSA-FT is simple, broadly compatible, and consistently enhances RDRL reliability.
Conclusion: RSA-FT effectively addresses reward hacking in diffusion reinforcement learning by leveraging robust gradients, improving alignment and controllability of diffusion models similar to RLHF for language models.
Abstract: Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models with human preferences, inspiring the development of reward-centric diffusion reinforcement learning (RDRL) to achieve similar alignment and controllability. While diffusion models can generate high-quality outputs, RDRL remains susceptible to reward hacking, where the reward score increases without corresponding improvements in perceptual quality. We demonstrate that this vulnerability arises from the non-robustness of reward model gradients, particularly when the reward landscape with respect to the input image is sharp. To mitigate this issue, we introduce methods that exploit gradients from a robustified reward model without requiring its retraining. Specifically, we employ gradients from a flattened reward model, obtained through parameter perturbations of the diffusion model and perturbations of its generated samples. Empirically, each method independently alleviates reward hacking and improves robustness, while their joint use amplifies these benefits. Our resulting framework, RSA-FT (Reward Sharpness-Aware Fine-Tuning), is simple, broadly compatible, and consistently enhances the reliability of RDRL.
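The sample-perturbation half of the recipe can be demonstrated on a 1-D caricature: a reward with a sharp spike yields an enormous raw gradient at the spike's edge, while averaging gradients over Gaussian-perturbed samples (one of RSA-FT's two perturbations; the other perturbs the diffusion model's parameters) yields a much tamer, flattened gradient. All constants here are contrived.

```python
import random

random.seed(0)

def reward(x):
    """Toy 1-D reward: a smooth slope plus a sharp, hackable spike near x = 1."""
    spike = 1.0 if abs(x - 1.0) < 0.01 else 0.0
    return 0.1 * x + spike

def grad(f, x, eps=1e-3):
    """Central finite-difference derivative."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

def flattened_grad(f, x, sigma=0.1, n=64):
    """Average the reward gradient over Gaussian-perturbed samples
    (sample-side smoothing of the reward landscape)."""
    return sum(grad(f, x + random.gauss(0, sigma)) for _ in range(n)) / n
```

Following the raw gradient at the spike boundary is exactly the reward-hacking direction; the smoothed gradient mostly sees the slope, which is the robustification the fine-tuning exploits.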
[883] Prompt Replay: Speeding Up GRPO with On-Policy Reuse of High-Signal Prompts
Andrei Baroian, Rutger Berger
Main category: cs.LG
TL;DR: Prompt Replay: An overhead-free online data selection method for GRPO-style RL training that reuses only prompts (not trajectories) to reduce rollout costs while preserving on-policy optimization.
Details
Motivation: GRPO-style reinforcement learning for LLMs is dominated by expensive rollouts and wastes compute on unusable prompts. Current methods are computationally inefficient due to the high cost of generating trajectories.
Method: Proposes Prompt Replay which reuses only prompts (not trajectories) by inserting medium-difficulty prompts into a buffer after each step. Prioritizes prompts closer to a 0.5 pass rate to maximize learning signal. Training batches mix reused prompts with fresh samples, with cooldown steps and max reuse times controlling aggressiveness vs overfitting risk.
Result: Across multiple model families (Llama-3.2-3B, Qwen3-8B) and datasets (Dolci, Polaris), Prompt Replay reduces zero-variance prompts, increases mean absolute advantage, and shows faster initial accuracy gains on six standard math benchmarks. However, it plateaus and converges with the baseline when too aggressive a configuration is used. It is most efficient when rollouts are the primary bottleneck and the dataset is difficult for the model.
Conclusion: Prompt Replay is an effective method for reducing rollout costs in GRPO-style RL training, particularly when rollouts are the bottleneck. The method also reveals that Qwen2.5-Math can exhibit spurious-reward effects, raising concerns about using it as a sole testbed for GRPO research.
Abstract: Reinforcement learning with verifiable rewards (RLVR) plays a crucial role in expanding the capacities of LLM reasoning, but GRPO-style training is dominated by expensive rollouts and wastes compute on unusable prompts. We propose Prompt Replay, an overhead-free online data selection method for GRPO that reuses prompts only (not trajectories), to preserve on-policy optimization. After each step, we insert prompts with medium difficulty into a buffer, and prioritize prompts closer to a pass rate of 0.5 (half answers correct, half wrong) to maximize the advantage and thus the learning signal. Training batches are formed by mixing reused prompts with fresh samples, with cooldown steps and max reuse times controlling aggressiveness versus the risk of overfitting. Across multiple model families (Llama-3.2-3B, Qwen3-8B) and training datasets (Dolci, Polaris), evaluated using average accuracy on six standard math benchmarks, Prompt Replay reduces zero-variance prompts, increases mean absolute advantage, and shows faster initial accuracy gains. Yet, it plateaus and converges with the baseline, as a too-aggressive configuration was used. The method is most efficient when the rollouts are the primary bottleneck and the dataset is difficult for the model. We additionally observe that Qwen2.5-Math can exhibit spurious-reward effects that invalidate ablations, raising a warning signal for using it as a sole testbed for GRPO method research.
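A minimal buffer capturing the selection rules described above (pass-rate filtering, distance-to-0.5 priority, cooldown, max reuse) might look like this; the default knob values and the data layout are guesses, not the paper's configuration:

```python
class PromptReplayBuffer:
    """Keep prompts whose pass rate carries signal; prioritize those closest to 0.5.
    Cooldown and max-reuse knobs follow the paper's description; defaults are guesses."""

    def __init__(self, max_reuse=2, cooldown=1):
        self.max_reuse = max_reuse
        self.cooldown = cooldown
        self.entries = []               # each entry: [prompt, pass_rate, uses, last_step]

    def insert(self, prompt, pass_rate, step):
        if 0.0 < pass_rate < 1.0:       # zero-variance prompts (all pass/all fail) carry no signal
            self.entries.append([prompt, pass_rate, 0, step])

    def sample(self, n, step):
        ready = [e for e in self.entries
                 if e[2] < self.max_reuse and step - e[3] > self.cooldown]
        ready.sort(key=lambda e: abs(e[1] - 0.5))    # closest to 0.5 pass rate first
        picked = ready[:n]
        for e in picked:
            e[2] += 1                   # count the reuse
            e[3] = step                 # restart the cooldown
        return [e[0] for e in picked]
```

Sampled prompts would then be mixed with freshly drawn ones to form a training batch; since only prompts are reused and new rollouts are generated for them, the optimization stays on-policy.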
[884] ALMAB-DC: Active Learning, Multi-Armed Bandits, and Distributed Computing for Sequential Experimental Design and Black-Box Optimization
Foo Hui-Mean, Yuan-chin I Chang
Main category: cs.LG
TL;DR: ALMAB-DC: A Gaussian process-based sequential design framework combining active learning, multi-armed bandits, and distributed computing for expensive black-box optimization tasks.
Details
Motivation: Sequential experimental design under expensive, gradient-free objectives is challenging due to tight evaluation budgets and the need to extract maximum information from each observation.
Method: Combines Gaussian process surrogate with uncertainty-aware acquisition, UCB/Thompson-sampling bandit controller for parallel worker allocation, and asynchronous scheduler for heterogeneous runtimes.
Result: Outperforms baselines on statistical experimental-design tasks (dose-response optimization, spatial field estimation) and ML/engineering tasks (CIFAR-10 HPO, CFD drag minimization, MuJoCo RL) with statistically significant advantages. Achieves 7.5× speedup at 16 agents.
Conclusion: ALMAB-DC provides an effective framework for expensive black-box optimization with theoretical guarantees and practical distributed efficiency.
Abstract: Sequential experimental design under expensive, gradient-free objectives is a central challenge in computational statistics: evaluation budgets are tightly constrained and information must be extracted efficiently from each observation. We propose \textbf{ALMAB-DC}, a GP-based sequential design framework combining active learning, multi-armed bandits (MAB), and distributed asynchronous computing for expensive black-box experimentation. A Gaussian process surrogate with uncertainty-aware acquisition identifies informative query points; a UCB or Thompson-sampling bandit controller allocates evaluations across parallel workers; and an asynchronous scheduler handles heterogeneous runtimes. We present cumulative regret bounds for the bandit components and characterize parallel scalability via Amdahl’s Law. We validate ALMAB-DC on five benchmarks. On the two statistical experimental-design tasks, ALMAB-DC achieves lower simple regret than Equal Spacing, Random, and D-optimal designs in dose–response optimization, and in adaptive spatial field estimation matches the Greedy Max-Variance benchmark while outperforming Latin Hypercube Sampling; at $K=4$ the distributed setting reaches target performance in one-quarter of sequential wall-clock rounds. On three ML/engineering tasks (CIFAR-10 HPO, CFD drag minimization, MuJoCo RL), ALMAB-DC achieves 93.4% CIFAR-10 accuracy (outperforming BOHB by 1.7 pp and Optuna by 1.1 pp), reduces airfoil drag to $C_D = 0.059$ (36.9% below Grid Search), and improves RL return by 50% over Grid Search. All advantages over non-ALMAB baselines are statistically significant under Bonferroni-corrected Mann–Whitney $U$ tests. Distributed execution achieves $7.5\times$ speedup at $K = 16$ agents, consistent with Amdahl’s Law.
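The bandit controller's core decision is standard UCB: score each arm by mean observed reward plus an exploration bonus and pick untried arms first. What an "arm" corresponds to in ALMAB-DC (a search region, an acquisition strategy) is not spelled out in the abstract, so treat that mapping and the constant c as assumptions.

```python
import math

def ucb_pick(counts, means, t, c=2.0):
    """UCB arm choice for allocating the next parallel worker: exploit arms with
    high mean reward, explore arms tried rarely; untried arms go first."""
    best, best_score = None, -math.inf
    for a, (n, m) in enumerate(zip(counts, means)):
        score = math.inf if n == 0 else m + math.sqrt(c * math.log(t) / n)
        if score > best_score:
            best, best_score = a, score
    return best
```

In an asynchronous setting, each worker that finishes calls this with the current counts and means, so allocation adapts continuously without waiting for a synchronization barrier.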
[885] On the Role of Batch Size in Stochastic Conditional Gradient Methods
Rustem Islamov, Roman Machacek, Aurelien Lucchi, Antonio Silveti-Falls, Eduard Gorbunov, Volkan Cevher
Main category: cs.LG
TL;DR: Analysis of batch size effects in stochastic conditional gradient methods under KL conditions, revealing regime-dependent behavior where increasing batch size initially helps but eventually saturates/degrades performance under fixed token budgets.
Details
Motivation: To understand the role of batch size in stochastic conditional gradient methods (like Scion) under μ-KL conditions, and to provide theoretical insights into the interaction between stepsize, batch size, and stochastic noise for large-scale optimization.
Method: Theoretical analysis of momentum-based stochastic conditional gradient algorithms, deriving new analysis that captures stepsize-batch size-noise interactions. Proposes adaptive strategy that increases batch size and sequence length during training while preserving convergence guarantees.
Result: Reveals regime-dependent behavior: increasing batch size initially improves optimization accuracy but beyond a critical threshold, benefits saturate and can degrade performance under fixed token budget. Theory predicts optimal stepsize magnitude and aligns with empirical practices.
Conclusion: Provides theoretical framework for understanding batch size scaling in stochastic conditional gradient methods and offers guidance for designing efficient training schedules in large-scale optimization, with experiments on NanoGPT supporting theoretical predictions.
Abstract: We study the role of batch size in stochastic conditional gradient methods under a $\mu$-Kurdyka-Łojasiewicz ($\mu$-KL) condition. Focusing on momentum-based stochastic conditional gradient algorithms (e.g., Scion), we derive a new analysis that explicitly captures the interaction between stepsize, batch size, and stochastic noise. Our study reveals a regime-dependent behavior: increasing the batch size initially improves optimization accuracy but, beyond a critical threshold, the benefits saturate and can eventually degrade performance under a fixed token budget. Notably, the theory predicts the magnitude of the optimal stepsize and aligns well with empirical practices observed in large-scale training. Leveraging these insights, we derive principled guidelines for selecting the batch size and stepsize, and propose an adaptive strategy that increases batch size and sequence length during training while preserving convergence guarantees. Experiments on NanoGPT are consistent with the theoretical predictions and illustrate the emergence of the predicted scaling regimes. Overall, our results provide a theoretical framework for understanding batch size scaling in stochastic conditional gradient methods and offer guidance for designing efficient training schedules in large-scale optimization.
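For readers unfamiliar with the method family: a minimal sketch of a momentum-based stochastic conditional gradient step, where an ℓ∞-ball linear minimization oracle reduces to sign descent. This is an illustrative toy, not the paper's algorithm; the objective, batch-noise model, and constants are assumptions:

```python
import numpy as np

def lmo_linf(g, radius=1.0):
    """LMO over an l-infinity ball: argmin_{||s||_inf <= radius} <g, s>."""
    return -radius * np.sign(g)

def stochastic_cg(grad_fn, x0, batch_size=32, momentum=0.9, lr=0.05,
                  steps=200, seed=0):
    """Momentum-based stochastic conditional gradient sketch: average noisy
    mini-batch gradients, then step along the LMO direction."""
    rng = np.random.default_rng(seed)
    x = x0.astype(float)
    m = np.zeros_like(x)
    for _ in range(steps):
        g = grad_fn(x, rng, batch_size)        # mini-batch gradient estimate
        m = momentum * m + (1 - momentum) * g  # momentum damps batch noise
        x = x + lr * lmo_linf(m)               # l-inf LMO => sign-descent step
    return x

# Toy objective ||x - 1||^2; gradient noise shrinks as 1/sqrt(batch size),
# which is the stepsize/batch-size/noise interaction the paper analyzes.
def noisy_grad(x, rng, b):
    return 2.0 * (x - 1.0) + rng.normal(0.0, 1.0 / np.sqrt(b), size=x.shape)

x = stochastic_cg(noisy_grad, np.zeros(5))
```

Raising `batch_size` here shrinks the gradient noise, which is exactly the knob whose diminishing returns under a fixed token budget the paper characterizes.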
[886] Pretrained Video Models as Differentiable Physics Simulators for Urban Wind Flows
Janne Perini, Rafael Bischof, Moab Arar, Ayça Duran, Michael A. Kraus, Siddhartha Mishra, Bernd Bickel
Main category: cs.LG
TL;DR: WinDiNet repurposes a pretrained video diffusion model (LTX-Video) as a fast, differentiable surrogate for urban wind CFD simulations, enabling rapid design optimization for pedestrian wind comfort and safety.
Details
Motivation: Current CFD simulations for urban wind analysis are computationally expensive, making extensive design exploration impractical. There's a need for fast, differentiable surrogates that can accelerate urban design optimization for pedestrian wind comfort and safety.
Method: Fine-tunes LTX-Video (2B-parameter latent video transformer) on 10,000 2D incompressible CFD simulations over procedurally generated building layouts. Uses systematic study of training regimes, conditioning mechanisms, and VAE adaptation strategies with physics-informed decoder loss.
Result: Model generates 112-frame CFD rollouts in under a second, outperforming purpose-built neural PDE solvers. Enables gradient-based inverse optimization of building positions for wind safety and comfort, with improvements confirmed by ground-truth CFD simulations.
Conclusion: Video diffusion models can be effectively repurposed as fast, differentiable surrogates for complex physics simulations, enabling practical design optimization for urban wind environments that was previously computationally prohibitive.
Abstract: Designing urban spaces that provide pedestrian wind comfort and safety requires time-resolved Computational Fluid Dynamics (CFD) simulations, but their current computational cost makes extensive design exploration impractical. We introduce WinDiNet (Wind Diffusion Network), a pretrained video diffusion model that is repurposed as a fast, differentiable surrogate for this task. Starting from LTX-Video, a 2B-parameter latent video transformer, we fine-tune on 10,000 2D incompressible CFD simulations over procedurally generated building layouts. A systematic study of training regimes, conditioning mechanisms, and VAE adaptation strategies, including a physics-informed decoder loss, identifies a configuration that outperforms purpose-built neural PDE solvers. The resulting model generates full 112-frame rollouts in under a second. As the surrogate is end-to-end differentiable, it doubles as a physics simulator for gradient-based inverse optimization: given an urban footprint layout, we optimize building positions directly through backpropagation to improve wind safety as well as pedestrian wind comfort. Experiments on single- and multi-inlet layouts show that the optimizer discovers effective layouts even under challenging multi-objective configurations, with all improvements confirmed by ground-truth CFD simulations.
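The inverse-optimization idea, optimizing building positions by backpropagating through a differentiable surrogate, can be illustrated with a toy stand-in. The quadratic "discomfort" score below is a hypothetical placeholder for the video-model surrogate, and its analytic gradient plays the role of backpropagation:

```python
# Hypothetical stand-in for the differentiable surrogate: a smooth
# "discomfort" score as a function of one building's x-position.
def discomfort(x):
    return (x - 3.0) ** 2 + 0.5    # assumed minimum at x = 3

def d_discomfort(x):
    return 2.0 * (x - 3.0)         # analytic gradient (backprop in the real model)

def optimize_layout(x0, lr=0.1, steps=100):
    """Gradient descent on the layout variable through the surrogate."""
    x = x0
    for _ in range(steps):
        x -= lr * d_discomfort(x)  # move the building to reduce discomfort
    return x

x_opt = optimize_layout(0.0)
```

In the paper's setting the same loop runs over many building coordinates at once, with gradients supplied end-to-end by the fine-tuned video model, and the final layout is re-checked against ground-truth CFD.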
[887] Does Mechanistic Interpretability Transfer Across Data Modalities? A Cross-Domain Causal Circuit Analysis of Variational Autoencoders
Dip Roy, Rajiv Misra, Sanjay Kumar Singh, Anisha Roy
Main category: cs.LG
TL;DR: The paper investigates how interpretability techniques from image VAEs generalize to tabular data, finding significant differences in circuit modularity and performance between domains.
Details
Motivation: While mechanism-based interpretability has advanced for discriminative networks and image VAEs, there's limited understanding of generative models for tabular data, despite their increasing use for imputation, anomaly detection, and synthetic data generation.
Method: Extends a four-level causal intervention framework to four tabular and one image benchmark across five VAE architectures (75 runs per architecture). Introduces three new techniques: posterior-calibration of Causal Effect Strength (CES), path-specific activation patching, and Feature-Group Disentanglement (FGD).
Result: Tabular VAEs have 50% lower circuit modularity than image VAEs; β-VAE experiences near-complete collapse in CES scores for tabular data; CES captures 9/11 significant architecture differences; high-specificity interventions predict best downstream AUC values.
Conclusion: Challenges the assumption that architectural guidance from image studies transfers to tabular datasets, showing significant domain differences in interpretability and performance.
Abstract: Although mechanism-based interpretability has generated an abundance of insight for discriminative network analysis, generative models are less understood – particularly outside of image-related applications. We investigate how much of the causal circuitry found within image-related variational autoencoders (VAEs) will generalize to tabular data, as VAEs are increasingly used for imputation, anomaly detection, and synthetic data generation. In addition to extending a four-level causal intervention framework to four tabular and one image benchmark across five different VAE architectures (with 75 individual training runs per architecture and three random seed values for each run), this paper introduces three new techniques: posterior-calibration of Causal Effect Strength (CES), path-specific activation patching, and Feature-Group Disentanglement (FGD). The results from our experiments demonstrate that: (i) Tabular VAEs have circuits with modularity that is approximately 50% lower than their image counterparts. (ii) $\beta$-VAE experiences nearly complete collapse in CES scores when applied to heterogeneous tabular features (0.043 CES score for tabular data compared to 0.133 CES score for images), which can be directly attributed to reconstruction quality degradation (r = -0.886 correlation coefficient between CES and MSE). (iii) CES successfully captures nine of eleven statistically significant architecture differences using Holm–Šidák corrections. (iv) Interventions with high specificity predict the highest downstream AUC values (r = 0.460, p < .001). This study challenges the common assumption that architectural guidance from image-related studies can be transferred to tabular datasets.
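Activation patching, the core intervention behind causal-effect measurements like CES, is simple to sketch: run the model on a "source" input, transplant one hidden activation into a "base" run, and measure how much the output moves. The toy two-layer network below is illustrative; the paper applies the idea inside VAE circuits:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(1, 4))

def forward(x, patch=None):
    """Tiny two-layer net; optionally overwrite hidden unit i with value v."""
    h = np.tanh(W1 @ x)
    if patch is not None:
        i, v = patch
        h = h.copy()
        h[i] = v
    return (W2 @ h)[0]

def causal_effect(x_base, x_src, unit):
    """Activation patching: transplant one hidden activation from the source
    run into the base run and measure the output shift."""
    h_src = np.tanh(W1 @ x_src)
    return abs(forward(x_base, patch=(unit, h_src[unit])) - forward(x_base))

x_a, x_b = rng.normal(size=3), rng.normal(size=3)
effects = [causal_effect(x_a, x_b, i) for i in range(4)]
```

Patching a unit with its own activation is a no-op, which gives a useful sanity check; units with large effects are candidate members of a causal circuit.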
[888] Amortized Variational Inference for Logistic Regression with Missing Covariates
M. Cherifi, Aude Sportisse, Xujia Zhu, Mohammed Nabil El Korso, A. Mesloub
Main category: cs.LG
TL;DR: AV-LR: Amortized variational inference framework for logistic regression with missing covariates that jointly estimates regression parameters and missingness mechanism using a single inference network.
Details
Motivation: Missing covariate data poses challenges for statistical inference and machine learning. Classical methods (EM, multiple imputation) are computationally intensive and sensitive to high missingness rates. Recent deep generative models (VAEs) rely on complex latent representations, creating a need for simpler, more efficient approaches.
Method: AV-LR integrates a probabilistic generative model with a simple amortized inference network trained jointly by maximizing the evidence lower bound. It performs inference directly in the space of missing data without additional latent variables, using a single inference network and linear layer to jointly estimate regression parameters and missingness mechanism.
Result: AV-LR achieves estimation accuracy comparable to or better than state-of-the-art EM-like algorithms with significantly lower computational cost. It naturally extends to missing-not-at-random settings and shows effectiveness across various missing-data scenarios on synthetic and real-world datasets.
Conclusion: AV-LR provides an efficient, unified end-to-end framework for binary logistic regression with missing covariates that outperforms classical methods in computational efficiency while maintaining or improving accuracy, with natural extensions to complex missingness patterns.
Abstract: Missing covariate data pose a significant challenge to statistical inference and machine learning, particularly for classification tasks like logistic regression. Classical iterative approaches (EM, multiple imputation) are often computationally intensive, sensitive to high missingness rates, and limited in uncertainty propagation. Recent deep generative models based on VAEs show promise but rely on complex latent representations. We propose Amortized Variational Inference for Logistic Regression (AV-LR), a unified end-to-end framework for binary logistic regression with missing covariates. AV-LR integrates a probabilistic generative model with a simple amortized inference network, trained jointly by maximizing the evidence lower bound. Unlike competing methods, AV-LR performs inference directly in the space of missing data without additional latent variables, using a single inference network and a linear layer that jointly estimate regression parameters and the missingness mechanism. AV-LR achieves estimation accuracy comparable to or better than state-of-the-art EM-like algorithms, with significantly lower computational cost. It naturally extends to missing-not-at-random settings by explicitly modeling the missingness mechanism. Empirical results on synthetic and real-world datasets confirm its effectiveness and efficiency across various missing-data scenarios.
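A minimal sketch of the kind of objective involved: a one-sample Monte Carlo ELBO for logistic regression with a single missing covariate, where the amortized inference network is a linear-Gaussian map from the observed covariates. All components here (linear inference net, standard-normal prior, single sample) are simplifying assumptions, not AV-LR's exact model:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elbo_sample(beta, x_obs, y, mu_w, log_sigma, rng):
    """One-sample Monte Carlo ELBO: q(x_mis | x_obs) is a linear-Gaussian
    inference net, the prior on x_mis is N(0, 1)."""
    mu = mu_w @ x_obs                     # amortized posterior mean
    sigma = np.exp(log_sigma)
    x_mis = mu + sigma * rng.normal()     # reparameterized sample of x_mis
    x = np.append(x_obs, x_mis)           # complete the covariate vector
    p = sigmoid(beta @ x)
    log_lik = y * np.log(p) + (1 - y) * np.log(1 - p)
    # Closed-form KL( N(mu, sigma^2) || N(0, 1) ).
    kl = 0.5 * (sigma**2 + mu**2 - 1.0) - log_sigma
    return log_lik - kl

rng = np.random.default_rng(0)
val = elbo_sample(np.array([0.5, -0.2, 0.1]), np.array([1.0, 2.0]), 1,
                  np.array([0.3, 0.1]), 0.0, rng)
```

Maximizing this quantity jointly over `beta` and the inference-net parameters is the end-to-end training pattern the abstract describes; note that inference happens directly in the space of the missing covariate, with no extra latent variables.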
[889] Aggregation Alignment for Federated Learning with Mixture-of-Experts under Data Heterogeneity
Zihan Fang, Qianru Wang, Haonan An, Zheng Lin, Yiqin Deng, Xianhao Chen, Yuguang Fang
Main category: cs.LG
TL;DR: FedAlign-MoE: A federated learning framework for fine-tuning Mixture-of-Experts LLMs that addresses aggregation challenges in non-IID data through routing consistency and expert semantic alignment.
Details
Motivation: Fine-tuning MoE-based LLMs in federated learning settings faces challenges due to data heterogeneity across clients: divergent gating preferences create ineffective global routing, and same-indexed experts develop disparate semantic roles causing specialization degradation.
Method: Proposes FedAlign-MoE with two key components: (1) aggregates gating behaviors by aligning routing distributions through consistency weighting and distribution regularization, (2) quantifies semantic consistency among same-indexed experts and selectively aggregates updates from semantically aligned clients.
Result: Extensive experiments show FedAlign-MoE outperforms state-of-the-art benchmarks, achieving faster convergence and superior accuracy in non-IID federated environments.
Conclusion: FedAlign-MoE effectively addresses aggregation challenges in federated fine-tuning of MoE-based LLMs, enabling collaborative learning while preserving data privacy and maintaining model specialization.
Abstract: Large language models (LLMs) increasingly adopt Mixture-of-Experts (MoE) architectures to scale model capacity while reducing computation. Fine-tuning these MoE-based LLMs often requires access to distributed and privacy-sensitive data, making centralized fine-tuning impractical. Federated learning (FL) therefore provides a paradigm to collaboratively fine-tune MoE-based LLMs, enabling each client to integrate diverse knowledge without compromising data privacy. However, the integration of MoE-based LLM fine-tuning into FL encounters two critical aggregation challenges due to inherent data heterogeneity across clients: (i) divergent local data distributions drive clients to develop distinct gating preference for localized expert selection, causing direct parameter aggregation to produce a "one-size-fits-none" global gating network, and (ii) same-indexed experts develop disparate semantic roles across clients, leading to expert semantic blurring and the degradation of expert specialization. To address these challenges, we propose FedAlign-MoE, a federated aggregation alignment framework that jointly enforces routing consistency and expert semantic alignment. Specifically, FedAlign-MoE aggregates gating behaviors by aligning routing distributions through consistency weighting and optimizes local gating networks through distribution regularization, maintaining cross-client stability without overriding discriminative local preferences. Meanwhile, FedAlign-MoE explicitly quantifies semantic consistency among same-indexed experts across clients and selectively aggregates updates from semantically aligned clients, ensuring stable and specialized functional roles for global experts. Extensive experiments demonstrate that FedAlign-MoE outperforms state-of-the-art benchmarks, achieving faster convergence and superior accuracy in non-IID federated environments.
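Consistency-weighted aggregation of routing distributions can be sketched in a few lines: clients whose expert-routing distribution is closer to the cross-client mean receive more weight in the global gate. The exponential-of-negative-distance weighting below is an illustrative assumption, not FedAlign-MoE's exact rule:

```python
import numpy as np

def aggregate_routing(client_dists):
    """Consistency-weighted aggregation sketch of per-client expert-routing
    distributions (rows sum to 1, one column per expert)."""
    P = np.asarray(client_dists)                # shape (clients, experts)
    mean = P.mean(axis=0)
    # Consistency = negative L1 distance to the mean routing distribution.
    consistency = -np.abs(P - mean).sum(axis=1)
    w = np.exp(consistency)
    w /= w.sum()                                # softmax-style client weights
    return w @ P                                # weighted global routing

global_routing = aggregate_routing([
    [0.7, 0.2, 0.1],   # client A prefers expert 0
    [0.6, 0.3, 0.1],   # client B, similar routing
    [0.1, 0.1, 0.8],   # client C, outlier routing under non-IID data
])
```

The outlier client is down-weighted rather than discarded, so the global gate stays dominated by the consistent majority without erasing local preferences, which mirrors the "consistency weighting" idea in the abstract.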
[890] Sonny: Breaking the Compute Wall in Medium-Range Weather Forecasting
Minjong Cheon
Main category: cs.LG
TL;DR: Sonny is an efficient hierarchical transformer for medium-range weather forecasting that achieves competitive performance with operational systems while being computationally feasible for academic groups with limited resources.
Details
Motivation: Current data-driven weather forecasting models require large-scale training regimes and compute-intensive architectures, creating practical barriers for academic groups with limited compute resources. There's a need for efficient models that maintain competitive forecasting performance while being accessible to researchers with constrained budgets.
Method: Sonny uses a two-stage StepsNet design: 1) a narrow slow path that models large-scale atmospheric dynamics, and 2) a full-width fast path that integrates thermodynamic interactions. The model applies exponential moving average (EMA) during training to stabilize medium-range rollout without additional fine-tuning stages.
Result: On WeatherBench2, Sonny achieves robust medium-range forecast skill, remains competitive with operational baselines, and shows clear advantages over FastNet, especially at extended tropical lead times. Practically, Sonny can be trained to convergence on a single NVIDIA A40 GPU in approximately 5.5 days.
Conclusion: Sonny demonstrates that efficient transformer architectures can achieve competitive weather forecasting performance while being computationally accessible to academic research groups, addressing the practical barriers of existing compute-intensive models.
Abstract: Weather forecasting is a fundamental problem for protecting lives and infrastructure from high-impact atmospheric events. Recently, data-driven weather forecasting methods based on deep learning have demonstrated strong performance, often reaching accuracy levels competitive with operational numerical systems. However, many existing models rely on large-scale training regimes and compute-intensive architectures, which raises the practical barrier for academic groups with limited compute resources. Here we introduce Sonny, an efficient hierarchical transformer that achieves competitive medium-range forecasting performance while remaining feasible within reasonable compute budgets. At the core of Sonny is a two-stage StepsNet design: a narrow slow path first models large-scale atmospheric dynamics, and a subsequent full-width fast path integrates thermodynamic interactions. To stabilize medium-range rollout without an additional fine-tuning stage, we apply exponential moving average (EMA) during training. On WeatherBench2, Sonny yields robust medium-range forecast skill, remains competitive with operational baselines, and demonstrates clear advantages over FastNet, particularly at extended tropical lead times. In practice, Sonny can be trained to convergence on a single NVIDIA A40 GPU in approximately 5.5 days.
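The EMA trick used to stabilize rollouts is a one-liner worth spelling out: alongside training, keep a slow-moving copy of the weights and use that copy for inference. The decay value below is a typical choice, not Sonny's reported setting:

```python
def ema_update(ema, params, decay=0.999):
    """Exponential moving average of model parameters: the EMA copy trails
    the raw weights, smoothing out per-step training noise."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema, params)]

# Toy run: training has settled at weight value 1.0; the EMA approaches it
# gradually instead of jumping.
ema = [0.0]
for step in range(1000):
    params = [1.0]
    ema = ema_update(ema, params)
```

Because no extra fine-tuning stage is needed, the EMA copy comes for free during training, which fits the paper's constrained-compute goal.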
[891] Direct Interval Propagation Methods using Neural-Network Surrogates for Uncertainty Quantification in Physical Systems Surrogate Model
Ghifari Adam Faza, Jolan Wauters, Fabio Cuzzolin, Hans Hallez, David Moens
Main category: cs.LG
TL;DR: Neural network surrogate models for interval propagation reformulated as interval-valued regression to directly predict output bounds, improving computational efficiency over traditional optimization methods.
Details
Motivation: Standard interval propagation requires solving optimization problems that are computationally expensive for complex systems. Surrogate models help but still need many inference calls. The paper aims to overcome this by directly predicting output bounds.
Method: Reformulates interval propagation as interval-valued regression problem. Studies neural network-based surrogate models including MLPs and DeepONet. Investigates three approaches: naive interval propagation through standard architectures, bound propagation methods (IBP and CROWN), and interval neural networks with interval weights.
Result: Methods significantly improve computational efficiency over traditional optimization-based approaches while maintaining accurate interval estimates. Practical limitations and open challenges in applying interval-based propagation methods are discussed.
Conclusion: Neural network-based surrogate models offer efficient alternatives to traditional interval propagation methods, with different approaches providing trade-offs between accuracy and computational cost.
Abstract: In engineering, uncertainty propagation aims to characterise system outputs under uncertain inputs. For interval uncertainty, the goal is to determine output bounds given interval-valued inputs, which is critical for robust design optimisation and reliability analysis. However, standard interval propagation relies on solving optimisation problems that become computationally expensive for complex systems. Surrogate models alleviate this cost but typically replace only the evaluator within the optimisation loop, still requiring many inference calls. To overcome this limitation, we reformulate interval propagation as an interval-valued regression problem that directly predicts output bounds. We present a comprehensive study of neural network-based surrogate models, including multilayer perceptrons (MLPs) and deep operator networks (DeepONet), for this task. Three approaches are investigated: (i) naive interval propagation through standard architectures, (ii) bound propagation methods such as Interval Bound Propagation (IBP) and CROWN, and (iii) interval neural networks (INNs) with interval weights. Results show that these methods significantly improve computational efficiency over traditional optimisation-based approaches while maintaining accurate interval estimates. We further discuss practical limitations and open challenges in applying interval-based propagation methods.
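Of the three approaches, Interval Bound Propagation is the easiest to show concretely: through an affine layer, the interval center propagates exactly and the radius propagates through the element-wise absolute value of the weights. A minimal sketch:

```python
import numpy as np

def ibp_linear(l, u, W, b):
    """Interval Bound Propagation through an affine layer y = W x + b:
    propagate the center exactly and the radius through |W|."""
    c, r = (l + u) / 2.0, (u - l) / 2.0
    c_out = W @ c + b
    r_out = np.abs(W) @ r
    return c_out - r_out, c_out + r_out

# x1, x2 in [0, 1]; y = x1 - x2 truly ranges over [-1, 1].
lo, hi = ibp_linear(np.array([0.0, 0.0]), np.array([1.0, 1.0]),
                    np.array([[1.0, -1.0]]), np.array([0.0]))
```

For a single affine layer the IBP bound is exact, as in this example; once nonlinearities and depth are stacked, IBP can over-approximate, which is where tighter schemes such as CROWN and the learned interval-regression surrogates studied in the paper come in.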
[892] FluidWorld: Reaction-Diffusion Dynamics as a Predictive Substrate for World Models
Fabien Polly
Main category: cs.LG
TL;DR: FluidWorld introduces a PDE-based world model using reaction-diffusion equations for video prediction, achieving better spatial structure and multi-step rollouts than Transformer and ConvLSTM baselines with comparable parameters.
Details
Motivation: Current Transformer-based world models suffer from O(N²) computation and lack explicit spatial inductive bias. The paper questions whether self-attention is necessary for predictive world modeling and explores alternative computational substrates.
Method: FluidWorld uses partial differential equations (PDEs) of reaction-diffusion type as the predictive dynamics. Instead of a separate neural network predictor, the PDE integration itself produces future state predictions. Tested on UCF-101 video prediction with strict parameter matching against Transformer and ConvLSTM baselines.
Result: FluidWorld achieves 2x lower reconstruction error, produces representations with 10-15% higher spatial structure preservation and 18-25% more effective dimensionality, and maintains coherent multi-step rollouts where baselines degrade rapidly. All models converge to comparable single-step prediction loss.
Conclusion: PDE-based dynamics provide a viable and parameter-efficient alternative to attention and convolutional recurrence for world modeling, offering O(N) spatial complexity, adaptive computation, and global spatial coherence through diffusion.
Abstract: World models learn to predict future states of an environment, enabling planning and mental simulation. Current approaches default to Transformer-based predictors operating in learned latent spaces. This comes at a cost: O(N^2) computation and no explicit spatial inductive bias. This paper asks a foundational question: is self-attention necessary for predictive world modeling, or can alternative computational substrates achieve comparable or superior results? I introduce FluidWorld, a proof-of-concept world model whose predictive dynamics are governed by partial differential equations (PDEs) of reaction-diffusion type. Instead of using a separate neural network predictor, the PDE integration itself produces the future state prediction. In a strictly parameter-matched three-way ablation on unconditional UCF-101 video prediction (64x64, ~800K parameters, identical encoder, decoder, losses, and data), FluidWorld is compared against both a Transformer baseline (self-attention) and a ConvLSTM baseline (convolutional recurrence). While all three models converge to comparable single-step prediction loss, FluidWorld achieves 2x lower reconstruction error, produces representations with 10-15% higher spatial structure preservation and 18-25% more effective dimensionality, and critically maintains coherent multi-step rollouts where both baselines degrade rapidly. All experiments were conducted on a single consumer-grade PC (Intel Core i5, NVIDIA RTX 4070 Ti), without any large-scale compute. These results establish that PDE-based dynamics, which natively provide O(N) spatial complexity, adaptive computation, and global spatial coherence through diffusion, are a viable and parameter-efficient alternative to both attention and convolutional recurrence for world modeling.
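The "PDE integration is the prediction" idea can be sketched with one explicit Euler step of a reaction-diffusion system on a periodic grid. The logistic reaction term below is a simple illustrative choice, not FluidWorld's learned dynamics:

```python
import numpy as np

def rd_step(u, D=0.1, dt=0.1):
    """One explicit Euler step of u_t = D * laplacian(u) + u(1 - u) on a
    periodic 2D grid; the 5-point stencil gives O(N) cost per step."""
    lap = (np.roll(u, 1, 0) + np.roll(u, -1, 0) +
           np.roll(u, 1, 1) + np.roll(u, -1, 1) - 4.0 * u)
    return u + dt * (D * lap + u * (1.0 - u))

u = np.zeros((16, 16))
u[8, 8] = 0.5                  # a localized disturbance in the state
for _ in range(50):
    u = rd_step(u)             # the rollout itself is the prediction
```

Diffusion spreads information across the grid at linear cost per step, which is the O(N) spatial complexity and global spatial coherence the abstract contrasts with O(N^2) self-attention.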
[893] Stream separation improves Bregman conditioning in transformers
James Clayton Kerce
Main category: cs.LG
TL;DR: Analysis shows softmax induces curved geometry in transformer representations, causing Euclidean steering methods to leak probability mass. Intermediate layers have severely degenerate geometry, but stream separation improves conditioning, with cosine similarity between primal/dual directions predicting steering effectiveness.
Details
Motivation: Current linear methods for steering transformer representations (probing, activation engineering, concept erasure) assume Euclidean geometry, but softmax induces curved Bregman geometry. This curvature causes Euclidean steering to leak probability mass to unintended tokens, raising concerns about reliability of linear safety interventions.
Method: Measure the Hessian (curvature) at intermediate layers using controlled 2x2 design crossing stream separation with per-layer supervision. Compare standard single-stream transformers vs. stream-separated architectures, all at matched vocabulary and parameter count. Analyze effective rank and conditioning of geometry.
Result: In standard transformers, Hessian is severely degenerate at intermediate layers (effective rank 8 in 516 dimensions). Stream separation improves conditioning by up to 22 in effective rank, even without auxiliary supervision. Per-layer supervision helps but less. Cosine similarity between primal and dual concept directions predicts steering effectiveness with threshold near 0.3.
Conclusion: Geometry of transformer representations is curved and poorly conditioned at intermediate layers, undermining reliability of linear safety interventions. Stream separation significantly improves geometry conditioning. The findings have important implications for safety interventions that depend on linear steering methods.
Abstract: Linear methods for steering transformer representations, including probing, activation engineering, and concept erasure, implicitly assume the geometry of representation space is Euclidean. Park et al. [Park et al., 2026] showed that softmax induces a curved Bregman geometry whose metric tensor is the Hessian of the log-normalizer, $H(\lambda) = \mathrm{Cov}[\gamma \mid \lambda]$. Ignoring this curvature causes Euclidean steering to leak probability mass to unintended tokens. Their analysis applies at the output layer. We measure this Hessian at intermediate layers in a controlled 2x2 design crossing stream separation with per-layer supervision (vocabulary decoding loss at each layer), all at matched vocabulary and parameter count. In standard single-stream transformers, H is severely degenerate at intermediate layers (effective rank 8 in 516 dimensions). Stream separation improves conditioning by up to 22 in effective rank, even without auxiliary supervision. Per-layer supervision helps, but less. The cosine similarity between primal and dual concept directions predicts per-layer steering effectiveness on downstream tasks, with a threshold near 0.3. These results bear on the reliability of linear safety interventions, which depend on the geometry being well-conditioned at the layer where they are applied.
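The effective-rank statistic that anchors these results has a standard definition: the exponential of the Shannon entropy of the normalized singular-value distribution (Roy and Vetterli, 2007). A minimal sketch:

```python
import numpy as np

def effective_rank(M):
    """Effective rank: exp of the Shannon entropy of the normalized
    singular-value distribution."""
    s = np.linalg.svd(M, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]                               # drop exact zeros for the entropy
    return float(np.exp(-(p * np.log(p)).sum()))

# A full-rank identity uses all directions equally...
r_full = effective_rank(np.eye(6))
# ...while a near-rank-1 matrix collapses toward 1, the "degenerate" regime.
M = np.outer(np.arange(1, 7), np.arange(1, 7)) + 1e-6 * np.eye(6)
r_low = effective_rank(M)
```

An effective rank of 8 in a 516-dimensional space, as the abstract reports for single-stream intermediate layers, means almost all of the Hessian's spectral mass sits in a handful of directions.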
[894] Active Inference Agency Formalization, Metrics, and Convergence Assessments
Eduard Kapelko
Main category: cs.LG
TL;DR: This paper provides a formal framework for defining and analyzing agency in AI systems, focusing on mesa-optimization safety concerns. It conceptualizes agency as continuous representation achieving autopoiesis through balance of curiosity and empowerment, and introduces metrics for detecting mesa-optimizers.
Details
Motivation: The paper addresses critical AI safety challenges related to mesa-optimization - the emergence of unintended inner optimization processes in AI systems. There's a need for formal definitions and detection methods for agency to identify potentially dangerous mesa-optimizers in complex AI systems.
Method: The paper proposes a formal definition of agency as continuous representation achieving autopoiesis through dynamic balance between curiosity (minimizing prediction error) and empowerment (maximizing control channel information capacity). It introduces a metric based on distance between system behavior and “ideal” agentic function in canonicalized reward space (STARC).
Result: The agency function is shown to be smooth and convex with favorable optimization properties. While agentic functions are rare in abstract function space, they exhibit logarithmic convergence in sparse environments, suggesting high probability of spontaneous emergence in large-scale models. The STARC-based metric provides concrete apparatus for classifying and detecting mesa-optimizers.
Conclusion: The framework provides robust tools for analyzing and identifying undesirable inner optimization in complex AI systems, offering formal methods for mesa-optimizer detection and classification based on proximity to ideal agentic objectives.
Abstract: This paper addresses the critical challenge of mesa-optimization in AI safety by providing a formal definition of agency and a framework for its analysis. Agency is conceptualized as a Continuous Representation of accumulated experience that achieves autopoiesis through a dynamic balance between curiosity (minimizing prediction error to ensure non-computability and novelty) and empowerment (maximizing the control channel’s information capacity to ensure subjectivity and goal-directedness). Empirical evidence suggests that this active inference-based model successfully accounts for classical instrumental goals, such as self-preservation and resource acquisition. The analysis demonstrates that the proposed agency function is smooth and convex, possessing favorable properties for optimization. While agentic functions occupy a vanishingly small fraction of the total abstract function space, they exhibit logarithmic convergence in sparse environments. This suggests a high probability for the spontaneous emergence of agency during the training of modern, large-scale models. To quantify the degree of agency, the paper introduces a metric based on the distance between the behavioral equivalents of a given system and an “ideal” agentic function within the space of canonicalized rewards (STARC). This formalization provides a concrete apparatus for classifying and detecting mesa-optimizers by measuring their proximity to an ideal agentic objective, offering a robust tool for analyzing and identifying undesirable inner optimization in complex AI systems.
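The STARC-style distance can be sketched for finite reward vectors: canonicalize each reward (center and scale, so affine-equivalent rewards map to the same point), then take the distance between the canonical forms. This is a simplified illustration; real STARC canonicalization also removes potential shaping:

```python
import numpy as np

def canonicalize(r):
    """Toy STARC-style canonicalization: center and normalize a reward
    vector so positive-affine transforms of it map to the same point."""
    r = r - r.mean()
    n = np.linalg.norm(r)
    return r / n if n > 0 else r

def agency_distance(r_system, r_ideal):
    """Distance of a system's behavioral reward to an 'ideal' agentic reward."""
    return float(np.linalg.norm(canonicalize(r_system) - canonicalize(r_ideal)))

r_ideal = np.array([1.0, 2.0, 3.0])
d_same = agency_distance(5.0 * r_ideal + 7.0, r_ideal)   # affine-equivalent
d_diff = agency_distance(np.array([3.0, 1.0, 2.0]), r_ideal)
```

A system whose behavioral reward is an affine transform of the ideal agentic objective has distance zero, while genuinely different objectives score positive distance; thresholding this distance is the mesa-optimizer classification idea.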
[895] AutoKernel: Autonomous GPU Kernel Optimization via Iterative Agent-Driven Search
Jaber Jaber, Osama Jaber
Main category: cs.LG
TL;DR: AutoKernel is an autonomous agent framework that automatically optimizes GPU kernels for PyTorch models through profiling, bottleneck identification, and iterative refinement of Triton/CUDA implementations.
Details
Motivation: Writing high-performance GPU kernels is extremely labor-intensive in ML systems engineering, requiring significant expertise and manual optimization effort.
Method: AutoKernel profiles models to identify computational bottlenecks, ranks them by Amdahl’s law impact, and iteratively refines Triton or CUDA C++ kernel implementations through hundreds of experiments. It uses a five-stage correctness harness for validation and includes 18 starter kernels across two backends with a six-tier optimization playbook.
Result: On NVIDIA H100, Triton kernels outperform PyTorch eager by 5.29x on RMSNorm, 2.82x on softmax, and 2.21x on cross-entropy, and beat torch.compile by 2.83x, 3.44x, and 2.94x respectively. An AutoKernel-optimized kernel achieved first place on the vectorsum_v2 B200 leaderboard.
Conclusion: AutoKernel demonstrates that autonomous agent loops can effectively automate GPU kernel optimization, achieving significant performance improvements over existing PyTorch execution methods while ensuring correctness through comprehensive validation.
Abstract: Writing high-performance GPU kernels is among the most labor-intensive tasks in machine learning systems engineering. We present AutoKernel, an open-source framework that applies an autonomous agent loop to GPU kernel optimization for arbitrary PyTorch models. Given a model, AutoKernel profiles it to identify computational bottlenecks, ranks them by Amdahl’s law impact, and iteratively refines Triton or CUDA C++ kernel implementations through hundreds of experiments without human intervention. A five-stage correctness harness covering smoke tests, shape sweeps, numerical stability, determinism verification, and edge-case coverage ensures every candidate kernel is validated before any speedup is recorded. The system comprises over 9,000 lines of Python, 18 starter kernel implementations across two backends, a six-tier optimization playbook, and integration with the KernelBench benchmark suite. AutoKernel covers nine kernel types spanning the dominant operations in modern transformer architectures. On an NVIDIA H100, our Triton kernels outperform both PyTorch eager and torch.compile (max-autotune) on the majority of tested configurations: 5.29x over eager on RMSNorm, 2.82x on softmax, and 2.21x on cross-entropy, while beating torch.compile by 2.83x, 3.44x, and 2.94x respectively. In community deployment, an AutoKernel-optimized kernel achieved first place on the vectorsum_v2 B200 leaderboard. The full system is available at https://github.com/RightNow-AI/autokernel.
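The Amdahl's-law ranking step is easy to sketch: rank each candidate kernel by the overall speedup Amdahl's law predicts if that kernel alone were accelerated. The profile data and function names below are illustrative, not AutoKernel's actual API:

```python
def amdahl_speedup(fraction, local_speedup):
    """Overall speedup when `fraction` of total runtime is accelerated
    by `local_speedup` (Amdahl's law)."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)

def rank_bottlenecks(profile):
    """profile: {op_name: (fraction_of_runtime, estimated_kernel_speedup)}.
    Returns op names ordered by predicted whole-model impact."""
    return sorted(profile, key=lambda op: amdahl_speedup(*profile[op]), reverse=True)

# Hypothetical profile; fractions and speedups are made-up illustrations.
profile = {
    "rmsnorm":       (0.30, 5.29),  # 30% of runtime, 5.29x kernel speedup
    "softmax":       (0.20, 2.82),
    "cross_entropy": (0.10, 2.21),
}
order = rank_bottlenecks(profile)
```

Under this profile, `rmsnorm` ranks first because it combines the largest runtime share with the largest per-kernel speedup.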
[896] The Workload-Router-Pool Architecture for LLM Inference Optimization: A Vision Paper from the vLLM Semantic Router Project
Huamin Chen, Xunzhuo Liu, Bowei He, Fuyuan Lyu, Yankai Chen, Xue Liu, Yuhan Liu, Junchen Jiang
Main category: cs.LG
TL;DR: The paper presents a Workload-Router-Pool (WRP) architecture framework for LLM inference optimization, synthesizing prior work on routing mechanisms, fleet optimization, and multimodal/agentic routing into a three-dimensional framework.
Details
Motivation: To create a unified framework for LLM inference optimization that integrates various independent research problems (routing mechanisms, fleet optimization, multimodal routing) into a coherent architecture, as these problems are interdependent in real-world deployments.
Method: The paper distills prior research results into a three-dimensional WRP framework: Workload (characterizing what the fleet serves), Router (determining how requests are dispatched), and Pool (defining where inference runs). It maps prior work onto a 3x3 interaction matrix and identifies research gaps.
Result: The paper presents the WRP architecture framework, maps existing research onto the framework, identifies covered and open research areas, and proposes 21 concrete research directions at the intersections of workload, router, and pool dimensions.
Conclusion: The WRP framework provides a systematic approach to LLM inference optimization, revealing interdependencies between workload characteristics, routing decisions, and computational resources, with identified research gaps offering opportunities for future work.
Abstract: Over the past year, the vLLM Semantic Router project has released a series of work spanning: (1) core routing mechanisms – signal-driven routing, context-length pool routing, router performance engineering, policy conflict detection, low-latency embedding models, category-aware semantic caching, user-feedback-driven routing adaptation, hallucination detection, and hierarchical content-safety classification for privacy and jailbreak protection; (2) fleet optimization – fleet provisioning and energy-efficiency analysis; (3) agentic and multimodal routing – multimodal agent routing, tool selection, CUA security, and multi-turn context memory and safety; (4) governance and standards – inference routing protocols and multi-provider API extensions. Each paper tackled a specific problem in LLM inference, but the problems are not independent; for example, fleet provisioning depends on the routing policy, which depends on the workload mix, shifting as organizations adopt agentic and multimodal workloads. This paper distills those results into the Workload-Router-Pool (WRP) architecture, a three-dimensional framework for LLM inference optimization. Workload characterizes what the fleet serves (chat vs. agent, single-turn vs. multi-turn, warm vs. cold, prefill-heavy vs. decode-heavy). Router determines how each request is dispatched (static semantic rules, online bandit adaptation, RL-based model selection, quality-aware cascading). Pool defines where inference runs (homogeneous vs. heterogeneous GPU, disaggregated prefill/decode, KV-cache topology). We map our prior work onto a 3x3 WRP interaction matrix, identify which cells we have covered and which remain open, and propose twenty-one concrete research directions at the intersections, each grounded in our prior measurements, tiered by maturity from engineering-ready to open research.
[897] TIDE: Token-Informed Depth Execution for Per-Token Early Exit in LLM Inference
Jaber Jaber, Osama Jaber
Main category: cs.LG
TL;DR: TIDE is a post-training system that adds tiny learned routers to language models to enable early token exit, reducing computational cost without retraining.
Details
Motivation: Current LLMs process every token through all layers regardless of difficulty, wasting computation on easy tokens that could exit earlier.
Method: Attaches learned routers at periodic checkpoint layers; at inference, selects the earliest layer where the token’s hidden state has converged; works with any HuggingFace causal LM without retraining.
Result: Achieves 100% prefill exit rate, reduces prefill latency by 7.2%, increases throughput by 6.6-8.1%, with 98-99% of tokens exiting early during decoding while maintaining accuracy.
Conclusion: TIDE enables efficient inference via early token exit without model retraining, compatible with existing LLMs and hardware.
Abstract: Large language models run every token through every layer, regardless of difficulty. We present TIDE, a post-training system that attaches tiny learned routers at periodic checkpoint layers and, at inference time, selects the earliest layer whose hidden state has converged for each token. TIDE requires no model retraining, works with any HuggingFace causal LM, auto-detects GPU architecture, and supports float32, float16, and bfloat16 through fused CUDA kernels. On an NVIDIA A100 with DeepSeek R1 Distill 8B, TIDE achieves 100% prefill exit rate (5% of tokens exit at layer 11, the remaining at layer 31), reduces prefill latency by 7.2%, and increases single-batch throughput by 6.6%. During autoregressive decoding, 98-99% of tokens exit early while the model correctly solves a multi-step math problem with 95 unique output tokens. On Qwen3 8B (36 layers), throughput improves by 8.1% at batch size 8. Calibration on 2,000 WikiText samples takes under 3 minutes and produces a ~4 MB router checkpoint. The system comprises 1,308 lines of Python and 1,081 lines of CUDA/C++ with 74 passing tests. Code: https://github.com/RightNow-AI/TIDE
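The abstract doesn't spell out the convergence test beyond "the earliest layer whose hidden state has converged." A minimal way to realize such a test, assuming cosine similarity between consecutive checkpoint states with a hypothetical threshold `tau` (TIDE itself uses tiny learned routers rather than a fixed rule):

```python
import math

def cosine(u, v):
    """Cosine similarity between two hidden-state vectors."""
    num = sum(a * b for a, b in zip(u, v))
    return num / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def earliest_exit_layer(states, checkpoints, tau=0.999):
    """states: {layer: hidden_state_vector}. Return the first checkpoint whose
    hidden state has converged relative to the previous checkpoint."""
    prev = None
    for layer in checkpoints:
        if prev is not None and cosine(states[prev], states[layer]) >= tau:
            return layer
        prev = layer
    return checkpoints[-1]  # no convergence: run the full depth
```

A token whose representation stops changing between checkpoints exits there; hard tokens fall through to the final layer.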
[898] PLR: Plackett-Luce for Reordering In-Context Learning Examples
Pawel Batorski, Paul Swoboda
Main category: cs.LG
TL;DR: PLR: A probabilistic approach to in-context learning example ordering using Plackett-Luce distribution to efficiently find high-performing orderings without exhaustive search.
Details
Motivation: In-context learning performance is highly sensitive to example ordering, but exhaustive search over n! orderings is infeasible. Existing methods rely on model confidence measures or direct search; more efficient probabilistic approaches are needed.
Method: PLR models orderings using a Plackett-Luce distribution and iteratively updates parameters to concentrate probability mass on high-performing orderings. Uses Gumbel perturb-and-sort for efficient sampling of candidate orderings.
Result: PLR consistently improves few-shot accuracy for k ∈ {4, 8, 16, 32} examples across multiple classification benchmarks and shows gains on mathematical reasoning tasks where label-based ordering methods don’t apply.
Conclusion: PLR provides an effective probabilistic approach to in-context example ordering that outperforms existing methods and works in scenarios where label-based ordering is not applicable.
Abstract: In-context learning (ICL) adapts large language models by conditioning on a small set of ICL examples, avoiding costly parameter updates. Among other factors, performance is often highly sensitive to the ordering of the examples. However, exhaustive search over the $n!$ possible orderings is infeasible. Therefore, more efficient ordering methods use model confidence measures (e.g., label-probability entropy) over label sets or take a direct approach to finding the best ordering. We propose PLR, a probabilistic approach to in-context example ordering that replaces discrete ordering search with learning a probability distribution over orderings with the Plackett-Luce model. PLR models orderings using a Plackett-Luce distribution and iteratively updates its parameters to concentrate probability mass on high-performing orderings under a task-level metric. Candidate orderings are sampled efficiently via a Gumbel perturb-and-sort procedure. Experiments on multiple classification benchmarks show that PLR consistently improves few-shot accuracy for $k \in \{4, 8, 16, 32\}$ examples, and we further demonstrate gains on mathematical reasoning tasks where label-based ordering methods are not applicable. Our code is available at https://github.com/Batorskq/PLR.
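Gumbel perturb-and-sort is a standard trick: adding independent Gumbel(0, 1) noise to the Plackett-Luce log-scores and sorting in descending order yields an exact sample from the Plackett-Luce distribution. A minimal sketch (PLR's learned parameters would supply the log-scores):

```python
import math
import random

def gumbel_perturb_and_sort(log_scores, rng=random):
    """Sample one ordering from the Plackett-Luce distribution defined by
    `log_scores`: perturb each score with Gumbel(0, 1) noise and sort
    descending. Returns item indices, most-preferred first."""
    perturbed = [
        (s - math.log(-math.log(rng.random())), i)  # Gumbel(0,1) = -log(-log(U))
        for i, s in enumerate(log_scores)
    ]
    perturbed.sort(reverse=True)
    return [i for _, i in perturbed]

# Example: item 1 has a much larger score, so it usually comes first.
sample = gumbel_perturb_and_sort([0.0, 5.0, -5.0], random.Random(0))
```

Because sampling is just noise-plus-argsort, drawing many candidate orderings per iteration is cheap.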
[899] Constrained Online Convex Optimization with Memory and Predictions
Mohammed Abdullah, George Iosifidis, Salah Eddine Elayoubi, Tijani Chahed
Main category: cs.LG
TL;DR: Constrained Online Convex Optimization with Memory (COCO-M) framework with algorithms achieving sublinear regret and constraint violation for time-varying constraints with/without predictions
Details
Motivation: Extends online convex optimization to settings where both loss and constraints depend on past decisions, capturing practical problems like constrained dynamical-systems control and scheduling with reconfiguration budgets.
Method: Proposes two algorithms: 1) an adaptive penalty approach without predictions, 2) an optimistic algorithm with a delayed-feedback reinterpretation when short-horizon predictions are available.
Result: First algorithms achieving sublinear regret and cumulative constraint violation under time-varying constraints, with performance improving as prediction accuracy improves while remaining robust to inaccurate predictions.
Conclusion: Bridges the gap between classical constrained online convex optimization and memory-dependent settings, providing a versatile learning toolbox for diverse applications.
Abstract: We study Constrained Online Convex Optimization with Memory (COCO-M), where both the loss and the constraints depend on a finite window of past decisions made by the learner. This setting extends the previously studied unconstrained online optimization with memory framework and captures practical problems such as the control of constrained dynamical systems and scheduling with reconfiguration budgets. For this problem, we propose the first algorithms that achieve sublinear regret and sublinear cumulative constraint violation under time-varying constraints, both with and without predictions of future loss and constraint functions. Without predictions, we introduce an adaptive penalty approach that guarantees sublinear regret and constraint violation. When short-horizon and potentially unreliable predictions are available, we reinterpret the problem as online learning with delayed feedback and design an optimistic algorithm whose performance improves as prediction accuracy improves, while remaining robust when predictions are inaccurate. Our results bridge the gap between classical constrained online convex optimization and memory-dependent settings, and provide a versatile learning toolbox with diverse applications.
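In this setting the two performance measures take the following standard form; this is a generic formulation with memory length m, and the paper's exact comparator class may differ:

```latex
% Regret against the best fixed decision, and cumulative constraint
% violation, when losses f_t and constraints g_t depend on the last
% m decisions (generic COCO-M form):
\mathrm{Regret}_T = \sum_{t=1}^{T} f_t(x_{t-m}, \dots, x_t)
  - \min_{x \in \mathcal{X}} \sum_{t=1}^{T} f_t(x, \dots, x),
\qquad
\mathrm{Violation}_T = \sum_{t=1}^{T} \left[ g_t(x_{t-m}, \dots, x_t) \right]_{+}
```

Sublinear guarantees mean both quantities grow as o(T), so the per-round regret and violation vanish on average.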
[900] A Generalised Exponentiated Gradient Approach to Enhance Fairness in Binary and Multi-class Classification Tasks
Maryam Boubekraoui, Giordano d’Aloisio, Antinisca Di Marco
Main category: cs.LG
TL;DR: Proposes GEG algorithm for fair multi-class classification with multiple fairness constraints, showing up to 92% fairness improvement with ≤14% accuracy trade-off.
Details
Motivation: Bias mitigation in AI/ML models is crucial for sensitive applications, but existing methods focus on binary classification while multi-class fairness remains under-explored.
Method: Formulates fair multi-class learning as multi-objective optimization between prediction correctness and multiple linear fairness constraints, then proposes the Generalised Exponentiated Gradient (GEG) algorithm for in-processing fairness enhancement.
Result: Extensive evaluation on 7 multi-class and 3 binary datasets shows GEG outperforms 6 baselines with fairness improvements up to 92% and accuracy decrease up to 14%.
Conclusion: GEG effectively addresses fairness in multi-class classification, demonstrating significant fairness improvements with reasonable accuracy trade-offs across multiple fairness definitions.
Abstract: The widespread use of AI and ML models in sensitive areas raises significant concerns about fairness. While the research community has introduced various methods for bias mitigation in binary classification tasks, the issue remains under-explored in multi-class classification settings. To address this limitation, in this paper, we first formulate the problem of fair learning in multi-class classification as a multi-objective problem between effectiveness (i.e., prediction correctness) and multiple linear fairness constraints. Next, we propose a Generalised Exponentiated Gradient (GEG) algorithm to solve this task. GEG is an in-processing algorithm that enhances fairness in binary and multi-class classification settings under multiple fairness definitions. We conduct an extensive empirical evaluation of GEG against six baselines across seven multi-class and three binary datasets, using four widely adopted effectiveness metrics and three fairness definitions. GEG outperforms existing baselines, with fairness improvements of up to 92% at the cost of an accuracy decrease of up to 14%.
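At the core of exponentiated-gradient fairness methods is a multiplicative update on constraint multipliers; in the classic reduction this is played against a cost-sensitive learner. A minimal sketch of one such step, with `eta` and the total mass `bound` as assumed hyperparameters (GEG's actual update over multiple multi-class constraints may differ):

```python
import math

def eg_update(lambdas, violations, eta, bound):
    """One exponentiated-gradient step on fairness-constraint multipliers.
    lambdas[i] grows multiplicatively when constraint i is violated
    (violations[i] > 0); the vector is renormalized so its total mass
    stays at `bound`."""
    w = [l * math.exp(eta * v) for l, v in zip(lambdas, violations)]
    total = sum(w)
    return [bound * x / total for x in w]

# One step: constraint 0 is violated, constraint 1 is satisfied.
lam = eg_update([1.0, 1.0], [1.0, 0.0], eta=1.0, bound=2.0)
```

The violated constraint's multiplier grows, so the next classifier is penalized more heavily for violating it.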
[901] DSPA: Dynamic SAE Steering for Data-Efficient Preference Alignment
James Wedgwood, Aashiq Muhamed, Mona T. Diab, Virginia Smith
Main category: cs.LG
TL;DR: DSPA is an inference-time preference alignment method that makes sparse autoencoder steering prompt-conditional, modifying only token-active latents without weight updates, achieving competitive performance with fewer FLOPs.
Details
Motivation: Traditional preference alignment requires weight-updating training which adds substantial compute and provides limited mechanistic visibility. There's a need for more efficient, interpretable methods that don't require model weight updates.
Method: DSPA computes a conditional-difference map from preference triples linking prompt features to generation-control features. During decoding, it modifies only token-active latents using sparse autoencoder steering, without updating base model weights.
Result: DSPA improves MT-Bench scores and is competitive on AlpacaEval while preserving multiple-choice accuracy across Gemma-2-2B/9B and Qwen3-8B models. Under restricted preference data, it remains robust and can rival the RAHF-SCIT pipeline with up to 4.47× fewer alignment-stage FLOPs.
Conclusion: DSPA provides an efficient inference-time alternative to weight-updating preference alignment methods, offering better mechanistic interpretability through SAE feature analysis while maintaining competitive performance with significantly reduced compute.
Abstract: Preference alignment is usually achieved by weight-updating training on preference data, which adds substantial alignment-stage compute and provides limited mechanistic visibility. We propose Dynamic SAE Steering for Preference Alignment (DSPA), an inference-time method that makes sparse autoencoder (SAE) steering prompt-conditional. From preference triples, DSPA computes a conditional-difference map linking prompt features to generation-control features; during decoding, it modifies only token-active latents, without base-model weight updates. Across Gemma-2-2B/9B and Qwen3-8B, DSPA improves MT-Bench and is competitive on AlpacaEval while preserving multiple-choice accuracy. Under restricted preference data, DSPA remains robust and can rival the two-stage RAHF-SCIT pipeline while requiring up to $4.47\times$ fewer alignment-stage FLOPs. Finally, we audit the SAE features DSPA modifies, finding that preference directions are dominated by discourse and stylistic signals, and provide theory clarifying the conditional-difference map estimate and when top-$k$ ablation is principled.
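The "token-active latents only" rule can be sketched directly. The flat lists below are illustrative stand-ins: in DSPA the latents are SAE activations of the residual stream, and the offsets come from the conditional-difference map selected by the prompt:

```python
def steer_token_active(latents, steering, alpha=1.0):
    """Apply steering offsets only to latents that are active (nonzero)
    for the current token; inactive latents are left untouched.
    latents: SAE activations for one token; steering: per-latent offsets."""
    return [
        z + alpha * s if z != 0.0 else z
        for z, s in zip(latents, steering)
    ]

# Latents 1 and 3 are active, so only they receive their offsets.
steered = steer_token_active([0.0, 2.0, 0.0, 1.0], [5.0, 1.0, 5.0, -1.0])
```

Gating on activity keeps the intervention sparse and prompt-conditional: tokens that never fire a latent are never pushed along its direction.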
[902] Mechanisms of Introspective Awareness
Uzay Macar, Li Yang, Atticus Wang, Peter Wallich, Emmanuel Ameisen, Jack Lindsey
Main category: cs.LG
TL;DR: LLMs can detect injected steering vectors in their residual stream, showing “introspective awareness” - this paper investigates the mechanisms behind this capability and finds it’s behaviorally robust, emerges from post-training, relies on distributed MLP computation, and can be substantially improved.
Details
Motivation: Recent work shows LLMs can detect when steering vectors are injected into their residual stream and identify the injected concept, cited as evidence of "introspective awareness." The paper aims to understand what mechanisms underlie this capability and whether they reflect genuine introspective circuitry or more shallow heuristics.
Method: Investigates introspective awareness in open-source models through three main approaches: 1) behavioral analysis of detection robustness across diverse prompts, 2) analysis of training origins (pretraining vs post-training), 3) mechanistic analysis using linear probes and feature ablation to understand the distributed MLP computation involved.
Result: Three main findings: 1) Introspection is behaviorally robust with moderate true positive rates and 0% false positives, emerging specifically from post-training. 2) Anomaly detection relies on distributed MLP computation across multiple directions (evidence carrier and gate features), not a single linear confound. 3) Models have greater introspective capability than is elicited by default: ablating refusal directions improves detection by 53pp, and a trained steering vector improves it by 75pp.
Conclusion: Introspective awareness in LLMs is behaviorally robust, grounded in nontrivial internal anomaly detection mechanisms, and likely could be substantially improved in future models. The capability emerges from post-training and involves distributed computational patterns rather than simple heuristics.
Abstract: Recent work shows that LLMs can sometimes detect when steering vectors are injected into their residual stream and identify the injected concept, a phenomenon cited as evidence of “introspective awareness.” But what mechanisms underlie this capability, and do they reflect genuine introspective circuitry or more shallow heuristics? We investigate these questions in open-source models and establish three main findings. First, introspection is behaviorally robust: detection achieves moderate true positive rates with 0% false positives across diverse prompts. We also find this capability emerges specifically from post-training rather than pretraining. Second, introspection is not reducible to a single linear confound: anomaly detection relies on distributed MLP computation across multiple directions, implemented by evidence carrier and gate features. Third, models possess greater introspective capability than is elicited by default: ablating refusal directions improves detection by 53pp and a trained steering vector by 75pp. Overall, our results suggest that introspective awareness is behaviorally robust, grounded in nontrivial internal anomaly detection, and likely could be substantially improved in future models. Code: https://github.com/safety-research/introspection-mechanisms.
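The refusal-direction ablation behind the third finding is a standard projection: remove the component of the hidden state along a previously identified direction. A minimal sketch, where `v` is an assumed refusal direction (in practice this is applied across layers of the residual stream):

```python
import math

def ablate_direction(h, v):
    """Project direction v out of hidden state h:
    h <- h - (h . v_hat) v_hat, where v_hat is v normalized."""
    norm = math.sqrt(sum(x * x for x in v))
    vhat = [x / norm for x in v]
    dot = sum(a * b for a, b in zip(h, vhat))
    return [a - dot * b for a, b in zip(h, vhat)]

# Remove the first-axis component of a toy hidden state.
out = ablate_direction([3.0, 4.0], [2.0, 0.0])
```

After ablation the state is exactly orthogonal to `v`, so the model can no longer express that direction downstream.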
[903] Off-Policy Evaluation for Ranking Policies under Deterministic Logging Policies
Koichi Tanaka, Kazuki Kawamura, Takanori Muroi, Yusuke Narita, Yuki Sasamoto, Kei Tateno, Takuma Udagawa, Wei-Wei Du, Yuta Saito
Main category: cs.LG
TL;DR: CIPS estimator uses click probability as importance weighting for off-policy evaluation in ranking systems, enabling low-bias estimation even with deterministic logging policies.
Details
Motivation: Existing OPE estimators for ranking systems require stochastic logging policies and fail with deterministic policies, creating a practical limitation for real-world applications where deterministic policies are common.
Method: Proposes Click-based Inverse Propensity Score (CIPS) that exploits intrinsic stochasticity of user click behavior rather than policy stochasticity, using click probability as a new form of importance weighting.
Result: Theoretical analysis shows favorable bias and variance properties; experiments on synthetic and real-world data demonstrate significantly lower bias compared to baselines with deterministic logging policies.
Conclusion: CIPS enables effective off-policy evaluation in ranking systems even with deterministic logging policies, addressing a key practical limitation of existing methods.
Abstract: Off-Policy Evaluation (OPE) is an important practical problem in algorithmic ranking systems, where the goal is to estimate the expected performance of a new ranking policy using only offline logged data collected under a different, logging policy. Existing estimators, such as the ranking-wise and position-wise inverse propensity score (IPS) estimators, require the data collection policy to be sufficiently stochastic and suffer from severe bias when the logging policy is fully deterministic. In this paper, we propose novel estimators, Click-based Inverse Propensity Score (CIPS), exploiting the intrinsic stochasticity of user click behavior to address this challenge. Unlike existing methods that rely on the stochasticity of the logging policy, our approach uses click probability as a new form of importance weighting, enabling low-bias OPE even under deterministic logging policies where existing methods incur substantial bias. We provide theoretical analyses of the bias and variance properties of the proposed estimators and show, through synthetic and real-world experiments, that our estimators achieve significantly lower bias compared to strong baselines, for a range of experimental settings with completely deterministic logging policies.
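Schematically, CIPS reweights logged outcomes by a ratio of click probabilities rather than policy probabilities. The function names and log format below are illustrative; the paper's exact estimators (including how positions are handled) may differ:

```python
def cips_estimate(logs, click_prob_target, click_prob_logging):
    """Schematic CIPS-style value estimate: reweight each logged reward by
    the ratio of click probabilities under the target vs. logging rankings,
    so a deterministic logging policy is fine as long as clicks are stochastic.
    logs: list of (context, item, position, reward) tuples."""
    total = 0.0
    for context, item, pos, reward in logs:
        w = click_prob_target(context, item, pos) / click_prob_logging(context, item, pos)
        total += w * reward
    return total / len(logs)

# Toy example with constant click models (made-up numbers).
logs = [(None, "a", 1, 1.0), (None, "b", 2, 0.0)]
est = cips_estimate(logs, lambda c, i, p: 0.2, lambda c, i, p: 0.1)
```

The key shift from ranking-wise IPS is that the denominator is a click probability, which is never zero under a click model, rather than a policy probability, which is zero for any ranking a deterministic logger never shows.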
[904] Thinking Deeper, Not Longer: Depth-Recurrent Transformers for Compositional Generalization
Hung-Hsuan Chen
Main category: cs.LG
TL;DR: Depth-recurrent Transformer architecture that decouples computational depth from parameter count, enabling variable-depth reasoning through iterative application of shared-weight Transformer blocks with mechanisms for stable deep recurrence.
Details
Motivation: Standard Transformers have fixed computational depth, limiting their ability to generalize to tasks requiring variable-depth reasoning like multi-hop graph traversal or nested logic. There's a need for architectures that can perform deeper reasoning at inference time without increasing parameter count.
Method: Proposes a depth-recurrent Transformer that applies a shared-weight Transformer block iteratively in latent space. Three key mechanisms enable stable deep recurrence (20+ steps): (1) silent thinking objective that supervises only final output, (2) LayerScale initialization to protect reasoning states, and (3) identity-biased recurrence creating gradient highway across steps.
Result: Evaluated on three compositional reasoning domains: graph reachability, nested boolean logic, and unstructured relational text. Observed clear computational frontier where performance transitions from chance to near-perfect as thinking steps scale with task complexity. Different generalization behaviors emerged: precise but brittle (graph), approximate but robust (logic), and autonomous latent routing without structural hints (text).
Conclusion: The architecture demonstrates how interplay between task-invariant recurrent reasoning core and task-specific perceptual interfaces shapes out-of-distribution generalization, offering mechanistic perspective on vertical chain-of-thought that complements horizontal token-generation paradigm.
Abstract: Standard Transformers have a fixed computational depth, fundamentally limiting their ability to generalize to tasks requiring variable-depth reasoning, such as multi-hop graph traversal or nested logic. We propose a depth-recurrent Transformer that decouples computational depth from parameter count by iteratively applying a shared-weight Transformer block in latent space – enabling the model to trade recurrence steps for deeper reasoning at inference time. Our architecture incorporates three mechanisms to make deep recurrence (20+ steps) stable: (1) a silent thinking objective that supervises only the final output, forcing genuine multi-step reasoning rather than intermediate heuristic shortcuts; (2) LayerScale initialization to protect fragile reasoning states from untrained layer noise; and (3) an identity-biased recurrence that creates a gradient highway across many steps. We evaluate on three compositional reasoning domains with decreasing inductive biases: graph reachability (strict adjacency masking), nested boolean logic (relative positioning), and unstructured relational text (where sequence position provides no structural hints). Across all tasks, we observe a clear \emph{computational frontier} – a boundary where performance transitions from chance to near-perfect as thinking steps scale with task complexity. Moreover, these tasks reveal qualitatively different generalization behaviors: precise but brittle (graph), approximate but robust (logic), and autonomous latent routing without structural hints (text). This progression illuminates how the interplay between a task-invariant recurrent reasoning core and task-specific perceptual interfaces shapes out-of-distribution (OOD) generalization, offering a mechanistic perspective on vertical chain-of-thought that complements the prevailing horizontal token-generation paradigm.
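The identity-biased recurrence can be sketched as a loop that adds a small, LayerScale-style scaled residual at each step; `block`, the scale value, and the list-based state below are illustrative stand-ins for the shared-weight Transformer block and its latent state:

```python
def recur(h, block, steps, layer_scale=1e-4):
    """Identity-biased depth recurrence: each step adds a small residual
    update h <- h + scale * block(h). Near initialization the map is close
    to the identity, which protects the reasoning state and gives gradients
    a highway across many recurrence steps."""
    for _ in range(steps):
        delta = block(h)
        h = [a + layer_scale * d for a, d in zip(h, delta)]
    return h

# Toy block that always proposes a unit update; 10 steps at scale 0.1
# accumulate to 1.0 per coordinate.
out = recur([0.0, 0.0], lambda h: [1.0] * len(h), steps=10, layer_scale=0.1)
```

Because `steps` is a runtime argument rather than an architectural constant, the same weights can "think deeper" on harder inputs at inference time.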
[905] Learning Can Converge Stably to the Wrong Belief under Latent Reliability
Zhipeng Zhang, Zhenjie Yao, Kai Li, Lei Yang
Main category: cs.LG
TL;DR: MTR framework uses learning dynamics to infer feedback reliability and modulates updates via trust variables to prevent convergence to incorrect solutions under latent unreliability.
Details
Motivation: Traditional learning systems assume feedback reliability, but when reliability is unobservable, algorithms may converge stably to incorrect solutions due to biased feedback. Single-step feedback doesn't reveal whether experiences are informative or persistently biased.
Method: Proposes Monitor-Trust-Regulator (MTR) framework that infers reliability from learning dynamics (monitoring phase) and modulates updates through a slow-timescale trust variable (trust regulation). The trust variable adjusts learning updates based on inferred reliability.
Result: Standard algorithms exhibit stable optimization but learn incorrect solutions under latent unreliability, while trust-modulated systems reduce bias accumulation and improve recovery in both reinforcement learning and supervised learning settings.
Conclusion: Learning dynamics provide valuable information about feedback reliability beyond just optimization traces. Trust-modulated learning can prevent convergence to incorrect solutions when feedback reliability is unobservable.
Abstract: Learning systems are typically optimized by minimizing loss or maximizing reward, assuming that improvements in these signals reflect progress toward the true objective. However, when feedback reliability is unobservable, this assumption can fail, and learning algorithms may converge stably to incorrect solutions. This failure arises because single-step feedback does not reveal whether an experience is informative or persistently biased. When information is aggregated over learning trajectories, however, systematic differences between reliable and unreliable regimes can emerge. We propose a Monitor-Trust-Regulator (MTR) framework that infers reliability from learning dynamics and modulates updates through a slow-timescale trust variable. Across reinforcement learning and supervised learning settings, standard algorithms exhibit stable optimization behavior while learning incorrect solutions under latent unreliability, whereas trust-modulated systems reduce bias accumulation and improve recovery. These results suggest that learning dynamics are not only optimization traces but also a source of information about feedback reliability.
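The two-timescale idea can be sketched as an exponential moving average driving a multiplicative gate on updates. The choice of `beta`, the clipping to [0, 1], and the simple gated SGD step are assumptions for illustration, not the paper's exact regulator:

```python
def update_trust(trust, reliability_signal, beta=0.05):
    """Slow-timescale trust update: an exponential moving average of a
    reliability signal inferred from learning dynamics. beta is small so
    trust moves much more slowly than the base learner's parameters."""
    trust = (1.0 - beta) * trust + beta * reliability_signal
    return min(max(trust, 0.0), 1.0)

def trusted_update(param, gradient, lr, trust):
    """Regulator: scale the base update by current trust in the feedback."""
    return param - lr * trust * gradient

# Persistently unreliable feedback (signal = 0) slowly drains trust,
# shrinking updates instead of letting bias accumulate.
trust = 1.0
for _ in range(100):
    trust = update_trust(trust, 0.0)
```

When trust collapses, updates effectively freeze, so the learner stops converging toward the biased solution; recovering reliability lets trust, and learning, resume.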
[906] Multinoulli Extension: A Lossless Continuous Relaxation for Partition-Constrained Subset Selection
Qixin Zhang, Wei Huang, Yan Sun, Yao Shu, Yi Yu, Dacheng Tao
Main category: cs.LG
TL;DR: Multinoulli-SCG algorithm for subset selection under partition constraints with improved query complexity and parameter-free operation
Details
Motivation: Existing distorted local-search methods for subset selection under partition constraints have prohibitive query complexities and require prior knowledge of difficult-to-obtain structural parameters, limiting their practical applicability.
Method: Introduces Multinoulli Extension (ME) framework that converts discrete subset selection into continuous optimization of multinoulli priors across partitions, enabling parameter-free Multinoulli-SCG algorithm with O(1/ε²) function evaluations.
Result: Achieves the same approximation guarantees as distorted local-search methods with significantly fewer function evaluations: (1-e^{-α})OPT-ε for monotone α-weakly DR-submodular functions, or (γ²(1-e^{-(β(1-γ)+γ²)})/(β(1-γ)+γ²))OPT-ε for (γ,β)-weakly submodular functions.
Conclusion: Multinoulli Extension provides a lossless rounding scheme for any set function and enables efficient algorithms for subset selection under partition constraints, with extensions to online settings.
Abstract: Identifying the most representative subset for a close-to-submodular objective while satisfying the predefined partition constraint is a fundamental task with numerous applications in machine learning. However, the existing distorted local-search methods are often hindered by their prohibitive query complexities and the rigid requirement for prior knowledge of difficult-to-obtain structural parameters. To overcome these limitations, we introduce a novel algorithm titled Multinoulli-SCG, which not only is parameter-free, but also can achieve the same approximation guarantees as the distorted local-search methods with significantly fewer function evaluations. More specifically, when the objective function is monotone $α$-weakly DR-submodular or $(γ,β)$-weakly submodular, our Multinoulli-SCG algorithm can attain a value of $(1-e^{-α})\text{OPT}-ε$ or $(\frac{γ^{2}(1-e^{-(β(1-γ)+γ^2)})}{β(1-γ)+γ^2})\text{OPT}-ε$ with only $O(1/ε^{2})$ function evaluations, where OPT denotes the optimal value. The cornerstone of our Multinoulli-SCG algorithm is an innovative continuous-relaxation framework named Multinoulli Extension (ME), which can effectively convert the discrete subset selection problem subject to partition constraints into a solvable continuous maximization focused on learning the optimal multinoulli priors across the concerned partition. In sharp contrast with the well-established multi-linear extension for submodular subset selection, a notable advantage of our proposed ME is its intrinsic capacity to provide a lossless rounding scheme for any set function. Furthermore, based on our proposed ME, we also present two novel online algorithms, namely, Multinoulli-OSCG and Multinoulli-OSGA, for the unexplored online subset selection problems over partition constraints.
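The lossless rounding enabled by ME is straightforward to sketch if we assume each partition contributes exactly one element: sample independently from each partition's multinoulli prior. The softmax parameterization of the priors is an assumption for illustration:

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over one partition's logits."""
    m = max(logits)
    e = [math.exp(x - m) for x in logits]
    s = sum(e)
    return [x / s for x in e]

def round_multinoulli(partition_logits, rng=random):
    """Round a continuous ME solution to a feasible subset: sample exactly
    one element index per partition from its multinoulli distribution.
    partition_logits: one logit vector per partition."""
    subset = []
    for logits in partition_logits:
        probs = softmax(logits)
        r, acc = rng.random(), 0.0
        for idx, p in enumerate(probs):
            acc += p
            if r <= acc:
                subset.append(idx)
                break
        else:
            subset.append(len(probs) - 1)  # guard against float round-off
    return subset

# Two partitions of two elements each; extreme logits make the sample
# essentially deterministic.
picked = round_multinoulli([[10.0, -10.0], [-10.0, 10.0]], random.Random(0))
```

The sampled subset is feasible by construction (one element per partition), and the sample's expected objective value matches the continuous relaxation, which is what makes the rounding lossless.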
[907] Demystifying Reinforcement Learning for Long-Horizon Tool-Using Agents: A Comprehensive Recipe
Xixi Wu, Qianguo Sun, Ruiyang Zhang, Chao Song, Junlong Wu, Yiyan Qi, Hong Cheng
Main category: cs.LG
TL;DR: Systematic empirical study of RL for LLM agents in complex multi-turn environments using TravelPlanner testbed, with 5-axis decomposition and 7 key takeaways for scaling RL effectively.
Details
Motivation: RL is essential for evolving LLMs into autonomous agents capable of long-horizon planning, but practical recipes for scaling RL in complex, multi-turn environments are lacking. The paper aims to provide systematic guidance through empirical study.
Method: Uses TravelPlanner testbed requiring tool orchestration to satisfy multifaceted constraints. Decomposes agentic RL design space along 5 axes: reward shaping, model scaling, data composition, algorithm selection, and environmental stability. Conducts controlled experiments to derive insights.
Result: Identifies 7 key takeaways, e.g.: (1) reward/algorithm choices are scale-dependent, (2) ~1K training samples with balanced difficulty is optimal, and (3) environmental stability prevents policy degradation. RL-trained models achieve SOTA performance on TravelPlanner, significantly outperforming leading LLMs.
Conclusion: Provides a distilled recipe for scaling RL in complex agentic environments, demonstrating that systematic empirical study yields practical guidance for evolving LLMs into capable autonomous agents through RL training.
Abstract: Reinforcement Learning (RL) is essential for evolving Large Language Models (LLMs) into autonomous agents capable of long-horizon planning, yet a practical recipe for scaling RL in complex, multi-turn environments remains elusive. This paper presents a systematic empirical study using TravelPlanner, a challenging testbed requiring tool orchestration to satisfy multifaceted constraints. We decompose the agentic RL design space along 5 axes: reward shaping, model scaling, data composition, algorithm selection, and environmental stability. Our controlled experiments yield 7 key takeaways, e.g., (1) reward and algorithm choices are scale-dependent as smaller models benefit from staged rewards and enhanced exploration, whereas larger models converge efficiently with simpler dense rewards, (2) ~ 1K training samples with a balanced difficulty mixture mark a sweet spot for both in-domain and out-of-domain performance, and (3) environmental stability is critical to prevent policy degradation. Based on our distilled recipe, our RL-trained models achieve state-of-the-art performance on TravelPlanner, significantly outperforming leading LLMs.
[908] Quotient Geometry, Effective Curvature, and Implicit Bias in Simple Shallow Neural Networks
Hang-Cheng Dong, Pengcheng Cheng
Main category: cs.LG
TL;DR: A differential-geometric framework for analyzing shallow neural networks by modding out parameter symmetries (permutations, rescalings) to study intrinsic predictor properties rather than representation artifacts.
Details
Motivation: Overparameterized shallow neural networks have substantial parameter redundancy due to symmetries like hidden-unit permutations and rescalings. This causes geometric quantities computed in Euclidean parameter space to reflect representation artifacts rather than intrinsic predictor properties.
Method: Develops a differential-geometric framework using quotient spaces obtained by modding out parameter symmetries on regular sets. Characterizes symmetry and quotient structure, induces natural metric on quotient manifold, defines symmetry-reduced Hessian, studies gradient flows on quotient, and formulates implicit-bias viewpoint at quotient level.
Result: Shows ambient flatness is representation-dependent, local dynamics are better organized by quotient-level curvature summaries, and in underdetermined regimes implicit bias is most naturally described in quotient coordinates.
Conclusion: Meaningful complexity should be assigned to predictor classes rather than individual parameter representatives. The quotient framework provides intrinsic geometric understanding of neural network behavior by removing symmetry artifacts.
Abstract: Overparameterized shallow neural networks admit substantial parameter redundancy: distinct parameter vectors may represent the same predictor due to hidden-unit permutations, rescalings, and related symmetries. As a result, geometric quantities computed directly in the ambient Euclidean parameter space can reflect artifacts of representation rather than intrinsic properties of the predictor. In this paper, we develop a differential-geometric framework for analyzing simple shallow networks through the quotient space obtained by modding out parameter symmetries on a regular set. We first characterize the symmetry and quotient structure of regular shallow-network parameters and show that the finite-sample realization map induces a natural metric on the quotient manifold. This leads to an effective notion of curvature that removes degeneracy along symmetry orbits and yields a symmetry-reduced Hessian capturing intrinsic local geometry. We then study gradient flows on the quotient and show that only the horizontal component of parameter motion contributes to first-order predictor evolution, while the vertical component corresponds purely to gauge variation. Finally, we formulate an implicit-bias viewpoint at the quotient level, arguing that meaningful complexity should be assigned to predictor classes rather than to individual parameter representatives. Our experiments confirm that ambient flatness is representation-dependent, that local dynamics are better organized by quotient-level curvature summaries, and that in underdetermined regimes, implicit bias is most naturally described in quotient coordinates.
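The rescaling symmetry the paper quotients out can be demonstrated in a few lines. A minimal sketch, assuming a one-input ReLU network (illustrative only, not the authors' setup): moving along the orbit $(a_i, w_i) \mapsto (a_i/c,\, c\,w_i)$ with $c > 0$ leaves the predictor unchanged.

```python
def predict(params, x):
    # Shallow ReLU network: f(x) = sum_i a_i * max(0, w_i * x)
    return sum(a * max(0.0, w * x) for a, w in params)

def rescale(params, c):
    # Gauge move along a symmetry orbit: (a_i, w_i) -> (a_i / c, c * w_i), c > 0.
    # ReLU positive homogeneity makes the predictor invariant to this change.
    return [(a / c, c * w) for a, w in params]
```

Any curvature measured in the ambient parameter space varies along this orbit even though the predictor does not, which is exactly the "flatness is representation-dependent" point.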
[909] Optimizing Feature Extraction for On-device Model Inference with User Behavior Sequences
Chen Gong, Zhenzhe Zheng, Yiliu Chen, Sheng Wang, Fan Wu, Guihai Chen
Main category: cs.LG
TL;DR: AutoFeature optimizes on-device machine learning by eliminating redundant feature extraction operations from application logs, accelerating end-to-end model execution without compromising accuracy.
Details
Motivation: On-device ML models face latency bottlenecks not just in inference but in feature extraction from raw application logs, which has been overlooked in prior research focused only on inference acceleration.
Method: AutoFeature uses graph abstraction to model extraction workflows as DAGs, graph optimization to fuse redundant operations across features, and efficient caching to minimize operations on overlapping raw data between consecutive inferences.
Result: Integrated into five industrial mobile services (search, video, e-commerce), AutoFeature reduced end-to-end on-device model execution latency by 1.33x-3.93x during daytime and 1.43x-4.53x at night.
Conclusion: Feature extraction optimization is a critical bottleneck in on-device ML pipelines, and AutoFeature demonstrates significant latency improvements through automated redundancy elimination and caching.
Abstract: Machine learning models are widely integrated into modern mobile apps to analyze user behaviors and deliver personalized services. Ensuring low-latency on-device model execution is critical for maintaining high-quality user experiences. While prior research has primarily focused on accelerating model inference with given input features, we identify an overlooked bottleneck in real-world on-device model execution pipelines: extracting input features from raw application logs. In this work, we explore a new direction of feature extraction optimization by analyzing and eliminating redundant extraction operations across different model features and consecutive model inferences. We then introduce AutoFeature, an automated feature extraction engine designed to accelerate on-device feature extraction process without compromising model inference accuracy. AutoFeature comprises three core designs: (1) graph abstraction to formulate the extraction workflows of different input features as one directed acyclic graph; (2) graph optimization to identify and fuse redundant operation nodes across different features within the graph; and (3) efficient caching to minimize operations on overlapping raw data between consecutive model inferences. We implement a system prototype of AutoFeature and integrate it into five industrial mobile services spanning search, video and e-commerce domains. Online evaluations show that AutoFeature reduces end-to-end on-device model execution latency by 1.33x-3.93x during daytime and 1.43x-4.53x at night.
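The dedup-and-cache idea can be sketched as memoized evaluation of the extraction DAG. The `graph` encoding below is hypothetical, not AutoFeature's API; the point is that a shared operation node runs once no matter how many downstream features depend on it.

```python
def evaluate(node, graph, cache, calls):
    """Evaluate one node of a feature-extraction DAG, computing each shared
    operation at most once (the effect of fusing redundant operation nodes).
    `graph` maps node name -> (callable, list of dependency names)."""
    if node in cache:
        return cache[node]
    fn, deps = graph[node]
    args = [evaluate(d, graph, cache, calls) for d in deps]
    calls[node] = calls.get(node, 0) + 1   # count actual executions
    cache[node] = fn(*args)
    return cache[node]
```

Carrying `cache` across consecutive inferences would play the role of the paper's third design, reusing work on overlapping raw log data.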
[910] BOxCrete: A Bayesian Optimization Open-Source AI Model for Concrete Strength Forecasting and Mix Optimization
Bayezid Baten, M. Ayyan Iqbal, Sebastian Ament, Julius Kusuma, Nishant Garg
Main category: cs.LG
TL;DR: BOxCrete is an open-source probabilistic framework using Gaussian Process regression to predict concrete compressive strength and optimize mix designs for both strength and carbon emissions, trained on a new open dataset of 500+ measurements.
Details
Motivation: Concrete mix design is complex with competing demands for performance, workability, durability, and sustainability. Existing AI/ML approaches use proprietary datasets and closed-source implementations, limiting reproducibility and accessibility.
Method: Developed BOxCrete framework using Gaussian Process regression trained on new open-access dataset of 500+ strength measurements from 123 mixtures (69 mortar, 54 concrete) tested at five curing ages. Framework performs uncertainty quantification and multi-objective optimization of compressive strength and embodied carbon.
Result: Achieved average R² = 0.94 and RMSE = 0.69 ksi for strength prediction. Established reproducible open-source foundation for data-driven development of AI-optimized mix designs.
Conclusion: BOxCrete provides an open-source alternative to proprietary systems, enabling transparent, reproducible optimization of concrete mixtures for both performance and sustainability objectives.
Abstract: Modern concrete must simultaneously satisfy evolving demands for mechanical performance, workability, durability, and sustainability, making mix designs increasingly complex. Recent studies leveraging Artificial Intelligence (AI) and Machine Learning (ML) models show promise for predicting compressive strength and guiding mix optimization, but most existing efforts are based on proprietary industrial datasets and closed-source implementations. Here we introduce BOxCrete, an open-source probabilistic modeling and optimization framework trained on a new open-access dataset of over 500 strength measurements (1-15 ksi) from 123 mixtures - 69 mortar and 54 concrete mixes tested at five curing ages (1, 3, 5, 14, and 28 days). BOxCrete leverages Gaussian Process (GP) regression to predict strength development, achieving average R$^2$ = 0.94 and RMSE = 0.69 ksi, quantify uncertainty, and carry out multi-objective optimization of compressive strength and embodied carbon. The dataset and model establish a reproducible open-source foundation for data-driven development of AI-based optimized mix designs.
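For orientation, GP regression itself is compact. Here is a self-contained NumPy sketch of the posterior with an RBF kernel; the kernel choice, length scale, and the toy strength/age numbers are our assumptions, not values from BOxCrete.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    # Squared-exponential kernel between two sets of points (n x d arrays).
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length_scale ** 2)

def gp_predict(X, y, X_new, length_scale=1.0, noise=1e-6):
    """GP posterior mean and pointwise standard deviation (the uncertainty
    quantification the abstract refers to)."""
    K = rbf_kernel(X, X, length_scale) + noise * np.eye(len(X))
    Ks = rbf_kernel(X_new, X, length_scale)
    mean = Ks @ np.linalg.solve(K, y)
    cov = rbf_kernel(X_new, X_new, length_scale) - Ks @ np.linalg.solve(K, Ks.T)
    return mean, np.sqrt(np.clip(np.diag(cov), 0.0, None))
```

The posterior standard deviation is what a Bayesian-optimization loop over mix designs would use to trade off exploring uncertain mixtures against exploiting strong ones.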
[911] ROM: Real-time Overthinking Mitigation via Streaming Detection and Intervention
Xinyan Wang, Xiaogeng Liu, Chaowei Xiao
Main category: cs.LG
TL;DR: ROM is a streaming detection method that mitigates overthinking in Large Reasoning Models by predicting when to stop reasoning and output final answers, reducing response length by 47.2% while maintaining high accuracy.
Details
Motivation: Large Reasoning Models suffer from overthinking - they continue generating redundant reasoning steps even after reaching correct answers, which increases latency, compute costs, and can cause answer drift. Existing methods require heavy backbone modifications or rely on hand-crafted heuristics that don't truly capture overthinking patterns.
Method: ROM formulates overthinking mitigation as a streaming prediction-and-control problem. It attaches a lightweight detection head to late-layer hidden states of a frozen LLM backbone, monitors tokens in real time, and triggers early transition to final answer once overthinking is detected. Uses token-level supervision based on solution correctness boundaries and data augmentation to reduce distilled-data bias.
Result: Across seven benchmarks, ROM achieves highest accuracy (93.51%), shortest responses (1,159 tokens), and best response efficiency. Compared to vanilla baseline, reduces response length by 47.2% and improves efficiency by 121%.
Conclusion: Streaming detection is a promising approach to real-time overthinking mitigation in Large Reasoning Models, enabling efficient reasoning without sacrificing accuracy.
Abstract: Large Reasoning Models (LRMs) achieve strong accuracy on challenging tasks by generating long Chain-of-Thought traces, but suffer from overthinking. Even after reaching the correct answer, they continue generating redundant reasoning steps. This behavior increases latency and compute cost and can also lead to answer drift. Existing mitigation methods either require training-heavy backbone modification or rely on hand-crafted heuristics that do not truly capture overthinking patterns. We propose ROM, the first method that formulates overthinking mitigation as a streaming prediction-and-control problem. ROM attaches a lightweight detection head to the late-layer hidden states of a frozen large language model backbone. It monitors tokens in real time and triggers an early transition to the final answer once overthinking is detected. We also introduce token-level supervision based on solution correctness boundaries and a data augmentation strategy that reduces distilled-data bias. Across seven benchmarks, ROM achieves the highest accuracy (93.51%), the shortest responses (1,159 tokens), and the best response efficiency. Compared with the vanilla baseline, it reduces response length by 47.2% and improves efficiency by 121%. These results show that streaming detection is a promising approach to real-time overthinking mitigation.
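The streaming trigger can be sketched as a running counter over per-token detection-head scores. The `stream_monitor` function and its threshold/patience parameters are illustrative, not ROM's actual decision rule:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def stream_monitor(logits, threshold=0.9, patience=3):
    """Scan per-token detection-head logits as they stream in; once
    `patience` consecutive tokens exceed the overthinking-probability
    threshold, return the index at which to force the transition to the
    final answer (None = never triggered)."""
    run = 0
    for t, z in enumerate(logits):
        run = run + 1 if sigmoid(z) > threshold else 0
        if run >= patience:
            return t
    return None
```

Requiring several consecutive high-probability tokens, rather than a single spike, is one plausible way to avoid cutting off reasoning on a noisy score.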
[912] Generalization Limits of In-Context Operator Networks for Higher-Order Partial Differential Equations
Jamie Mahowald, Tan Bui-Thanh
Main category: cs.LG
TL;DR: ICONs (In-Context Operator Networks) show generalization capabilities for higher-order PDEs, maintaining qualitative accuracy despite point-wise degradation for complex problems like heat equation.
Details
Motivation: To extend the capabilities of In-Context Operator Networks (ICONs) to handle more complex and higher-order partial differential equations beyond simpler cases previously studied, exploring their generalization potential for operator learning in scientific computing.
Method: Extends previous ICON framework to higher-order PDEs, requiring new computational methods for processing complex inputs while maintaining consistent underlying machine learning techniques. Tests on problems like the heat equation.
Result: While point-wise accuracy degrades for higher-order problems, the model retains qualitative accuracy in capturing solution dynamics and overall behavior, demonstrating ability to extrapolate fundamental solution characteristics beyond training regime.
Conclusion: ICONs show promising generalization capabilities for operator learning in scientific computing, maintaining qualitative understanding of PDE solutions even when quantitative accuracy decreases for complex problems.
Abstract: We investigate the generalization capabilities of In-Context Operator Networks (ICONs), a new class of operator networks that build on the principles of in-context learning, for higher-order partial differential equations. We extend previous work by expanding the type and scope of differential equations handled by the foundation model. We demonstrate that while processing complex inputs requires some new computational methods, the underlying machine learning techniques are largely consistent with simpler cases. Our implementation shows that although point-wise accuracy degrades for higher-order problems like the heat equation, the model retains qualitative accuracy in capturing solution dynamics and overall behavior. This demonstrates the model’s ability to extrapolate fundamental solution characteristics to problems outside its training regime.
[913] SPA: A Simple but Tough-to-Beat Baseline for Knowledge Injection
Kexian Tang, Jiani Wang, Shaowen Wang, Kaifeng Lyu
Main category: cs.LG
TL;DR: SPA is a simple but effective method for generating large-scale synthetic data using carefully designed prompts to inject knowledge into LLMs for specialized domains.
Details
Motivation: LLMs have incomplete knowledge coverage in specialized, data-scarce domains, motivating the need for synthetic data generation methods for knowledge injection.
Method: SPA (Scaling Prompt-engineered Augmentation) uses a small set of carefully designed prompts to generate large-scale synthetic data for knowledge injection into LLMs.
Result: SPA outperforms several strong baselines and reveals limitations of prior approaches: RL-based methods suffer from diversity collapse at scale, and multi-stage prompting advantages disappear after careful prompt tuning.
Conclusion: Careful prompt design combined with straightforward large-scale augmentation is surprisingly effective for knowledge injection, and SPA serves as a strong baseline for future studies.
Abstract: While large language models (LLMs) are pretrained on massive amounts of data, their knowledge coverage remains incomplete in specialized, data-scarce domains, motivating extensive efforts to study synthetic data generation for knowledge injection. We propose SPA (Scaling Prompt-engineered Augmentation), a simple but tough-to-beat baseline that uses a small set of carefully designed prompts to generate large-scale synthetic data for knowledge injection. Through systematic comparisons, we find that SPA outperforms several strong baselines. Furthermore, we identify two key limitations of prior approaches: (1) while RL-based methods may improve the token efficiency of LLM-based data augmentation at small scale, they suffer from diversity collapse as data scales, leading to diminishing returns; and (2) while multi-stage prompting may outperform simple augmentation methods, their advantages can disappear after careful prompt tuning. Our results suggest that, for knowledge injection, careful prompt design combined with straightforward large-scale augmentation can be surprisingly effective, and we hope SPA can serve as a strong baseline for future studies in this area. Our code is available at https://github.com/Tangkexian/SPA.
[914] Sharper Generalization Bounds for Transformer
Yawen Li, Tao Hu, Zhouhui Lian, Wan Tian, Yijie Peng, Huiming Zhang, Zhongyi Li
Main category: cs.LG
TL;DR: Theoretical analysis of generalization error bounds for Transformer models using offset Rademacher complexity, covering various architectures and feature distributions.
Details
Motivation: To provide rigorous theoretical understanding of generalization properties of Transformers, which are widely used in practice but lack comprehensive theoretical analysis of their generalization capabilities across different architectures and data distributions.
Method: Uses offset Rademacher complexity to derive generalization bounds, connects it to empirical covering numbers of hypothesis spaces, and bounds covering numbers using matrix ranks and norms. Extends analysis to unbounded (sub-Gaussian) features and heavy-tailed distributions.
Result: Derived sharper generalization bounds for single-layer single-head, single-layer multi-head, and multi-layer Transformers that achieve optimal convergence rates up to constant factors. Obtained architecture-dependent bounds based on matrix properties and extended results to more realistic data distributions.
Conclusion: Provides comprehensive theoretical framework for understanding Transformer generalization, with precise bounds that depend on architectural choices and can handle various data distributions, offering insights for practical model design.
Abstract: This paper studies generalization error bounds for Transformer models. Based on the offset Rademacher complexity, we derive sharper generalization bounds for different Transformer architectures, including single-layer single-head, single-layer multi-head, and multi-layer Transformers. We first express the excess risk of Transformers in terms of the offset Rademacher complexity. By exploiting its connection with the empirical covering numbers of the corresponding hypothesis spaces, we obtain excess risk bounds that achieve optimal convergence rates up to constant factors. We then derive refined excess risk bounds by upper bounding the covering numbers of Transformer hypothesis spaces using matrix ranks and matrix norms, leading to precise, architecture-dependent generalization bounds. Finally, we relax the boundedness assumption on feature mappings and extend our theoretical results to settings with unbounded (sub-Gaussian) features and heavy-tailed distributions.
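For readers unfamiliar with the central tool, the offset Rademacher complexity of a class $\mathcal{F}$ is commonly defined as follows (standard form from the literature; the paper's exact constants and localization may differ):

```latex
\mathcal{R}^{\mathrm{off}}_{n}(\mathcal{F}; c)
  = \mathbb{E}_{\epsilon}\,\sup_{f \in \mathcal{F}}
    \frac{1}{n}\sum_{i=1}^{n}\bigl(2\epsilon_i f(x_i) - c\, f(x_i)^{2}\bigr),
\qquad \epsilon_i \sim \mathrm{Unif}\{\pm 1\}.
```

The negative quadratic "offset" term penalizes large-magnitude predictors, which is what allows excess-risk bounds sharper than those from the plain Rademacher complexity.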
[915] What Do World Models Learn in RL? Probing Latent Representations in Learned Environment Simulators
Xinyu Zhang
Main category: cs.LG
TL;DR: Interpretability analysis of two world models (IRIS and DIAMOND) reveals they develop linearly decodable representations of game state variables, with evidence these representations are functionally used rather than just correlated.
Details
Motivation: To understand what internal representations world models actually learn, applying interpretability techniques to analyze how these models encode environment state information.
Method: Applied linear/nonlinear probing, causal interventions, attention analysis, and token ablation to two architecturally distinct world models (IRIS - discrete token transformer, DIAMOND - continuous diffusion UNet) trained on Atari Breakout and Pong.
Result: Both models develop linearly decodable representations of game state variables (object positions, scores). Causal interventions show representations are functionally used. IRIS attention heads show spatial specialization for game objects. Object-containing tokens are disproportionately important.
Conclusion: Learned world models develop structured, approximately linear internal representations of environment state across different games and architectures.
Abstract: World models learn to simulate environment dynamics from experience, enabling sample-efficient reinforcement learning. But what do these models actually represent internally? We apply interpretability techniques (including linear and nonlinear probing, causal interventions, and attention analysis) to two architecturally distinct world models: IRIS (discrete token transformer) and DIAMOND (continuous diffusion UNet), trained on Atari Breakout and Pong. Using linear probes, we find that both models develop linearly decodable representations of game state variables (object positions, scores), with MLP probes yielding only marginally higher R^2, confirming that these representations are approximately linear. Causal interventions, shifting hidden states along probe-derived directions, produce correlated changes in model predictions, providing evidence that representations are functionally used rather than merely correlated. Analysis of IRIS attention heads reveals spatial specialization: specific heads attend preferentially to tokens overlapping with game objects. Multi-baseline token ablation experiments consistently identify object-containing tokens as disproportionately important. Our findings provide interpretability evidence that learned world models develop structured, approximately linear internal representations of environment state across two games and two architectures.
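A linear probe of the kind used here is just least squares plus R^2. A minimal NumPy sketch (the probe setup and synthetic data are ours, not the paper's):

```python
import numpy as np

def linear_probe_r2(H, y):
    """Fit a linear probe y ~ H w + b by least squares and report R^2 on the
    same data. H: (n_tokens, d_hidden) hidden states; y: (n_tokens,) target
    state variable (e.g. an object's x-position)."""
    X = np.hstack([H, np.ones((len(H), 1))])      # append bias column
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    residual = y - X @ w
    return 1.0 - (residual ** 2).sum() / ((y - y.mean()) ** 2).sum()
```

High probe R^2 alone shows only decodability; the paper's causal interventions are what argue the decoded directions are actually used by the model.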
[916] Kolmogorov Complexity Bounds for LLM Steganography and a Perplexity-Based Detection Proxy
Andrii Shportko
Main category: cs.LG
TL;DR: LLMs can embed hidden payloads in text while preserving meaning, but this requires increasing text complexity, which can be detected using perplexity-based metrics.
Details
Motivation: The paper investigates how large language models can create covert channels by embedding hidden payloads in text while maintaining surface-level meaning, which poses challenges for AI alignment monitoring and security.
Method: Theoretical analysis using Kolmogorov complexity to establish information-theoretic bounds on steganographic embedding, followed by practical detection using language model perplexity as a proxy for complexity, specifically proposing the Binoculars perplexity-ratio score.
Result: Theoretical proof shows any non-trivial payload forces strict complexity increase in stegotext. Experimental validation with color-based LLM steganography shows statistically significant detection (t=5.11, p<10^-6 over 300 samples).
Conclusion: Steganographic embedding in LLMs inevitably increases text complexity, making detection possible through perplexity-based metrics, which has implications for AI security and alignment monitoring.
Abstract: Large language models can rewrite text to embed hidden payloads while preserving surface-level meaning, a capability that opens covert channels between cooperating AI systems and poses challenges for alignment monitoring. We study the information-theoretic cost of such embedding. Our main result is that any steganographic scheme that preserves the semantic load of a covertext~$M_1$ while encoding a payload~$P$ into a stegotext~$M_2$ must satisfy $K(M_2) \geq K(M_1) + K(P) - O(\log n)$, where $K$ denotes Kolmogorov complexity and $n$ is the combined message length. A corollary is that any non-trivial payload forces a strict complexity increase in the stegotext, regardless of how cleverly the encoder distributes the signal. Because Kolmogorov complexity is uncomputable, we ask whether practical proxies can detect this predicted increase. Drawing on the classical correspondence between lossless compression and Kolmogorov complexity, we argue that language-model perplexity occupies an analogous role in the probabilistic regime and propose the Binoculars perplexity-ratio score as one such proxy. Preliminary experiments with a color-based LLM steganographic scheme support the theoretical prediction: a paired $t$-test over 300 samples yields $t = 5.11$, $p < 10^{-6}$.
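The detection proxy can be sketched from token log-probabilities. Note this is a loose, Binoculars-style ratio for illustration, not the exact Binoculars score (which uses a cross-perplexity between an observer and a performer model):

```python
import math

def perplexity(token_logprobs):
    """exp of the average negative log-probability per token; the practical
    stand-in for Kolmogorov complexity discussed in the abstract."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def perplexity_ratio(observer_logprobs, reference_logprobs):
    """Illustrative ratio score: text that is forced to carry extra payload
    bits should look less probable to the observer, pushing this ratio up
    relative to a reference scoring of the same text."""
    return perplexity(observer_logprobs) / perplexity(reference_logprobs)
```

The theoretical result predicts the numerator must rise for any non-trivial payload, which is what makes a ratio of this shape a plausible detector.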
[917] SSAM: Singular Subspace Alignment for Merging Multimodal Large Language Models
Md Kaykobad Reza, Ameya Patil, Edward Ayrapetian, M. Salman Asif
Main category: cs.LG
TL;DR: SSAM is a training-free framework that merges independently trained specialist MLLMs (vision-language, audio-language, etc.) into a single multimodal model capable of handling any combination of input modalities without additional training data.
Details
Motivation: Building multimodal models or extending them to new modalities typically requires large paired datasets and substantial computational resources. Since many pretrained specialist MLLMs are publicly available, the authors explore whether they can be merged into a single model that handles multiple modalities, addressing challenges of representation differences and parameter interference.
Method: SSAM (Singular Subspace Alignment and Merging) maintains modality-specific parameter updates separately, identifies a shared low-rank subspace for language-related parameter updates, aligns them within this subspace, and merges them to preserve complementary knowledge while minimizing parameter interference. The approach is training-free.
Result: Without using any multimodal training data, SSAM achieves state-of-the-art performance across four datasets, surpassing prior training-free merging methods and even jointly trained multimodal models.
Conclusion: Aligning models in parameter space provides a scalable and resource-efficient alternative to conventional joint multimodal training, enabling the creation of unified multimodal models from existing specialist models.
Abstract: Multimodal large language models (MLLMs) achieve strong performance by jointly processing inputs from multiple modalities, such as vision, audio, and language. However, building such models or extending them to new modalities often requires large paired datasets and substantial computational resources. Since many pretrained MLLMs (e.g., vision-language or audio-language) are publicly available, we ask whether we can merge them into a single MLLM that handles multiple modalities. Merging MLLMs with different input modalities remains challenging, partly because of differences in the learned representations and interference between their parameter spaces. To address these challenges, we propose Singular Subspace Alignment and Merging (SSAM), a training-free model merging framework that unifies independently trained specialist MLLMs into a single model capable of handling any combination of input modalities. SSAM maintains modality-specific parameter updates separately and identifies a shared low-rank subspace for language-related parameter updates, aligns them within this subspace, and merges them to preserve complementary knowledge while minimizing parameter interference. Without using any multimodal training data, SSAM achieves state-of-the-art performance across four datasets, surpassing prior training-free merging methods and even jointly trained multimodal models. These results demonstrate that aligning models in parameter space provides a scalable and resource-efficient alternative to conventional joint multimodal training.
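The core project-then-average step can be sketched with an SVD. `merge_in_shared_subspace` and its single-weight-matrix setting are a simplification for illustration, not SSAM's implementation (which also performs alignment within the subspace):

```python
import numpy as np

def merge_in_shared_subspace(deltas, rank):
    """Project each specialist's weight update (delta from the shared base
    model) onto a common low-rank subspace, the top left-singular directions
    of the stacked updates, then average. Restricting the merge to this
    subspace limits cross-model parameter interference."""
    U, _, _ = np.linalg.svd(np.hstack(deltas), full_matrices=False)
    P = U[:, :rank] @ U[:, :rank].T           # orthogonal projector
    return sum(P @ D for D in deltas) / len(deltas)
```

When the specialists' updates already live in the same low-rank subspace, the merge is lossless; interference shows up only in the directions the projector discards.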
[918] In-network Attack Detection with Federated Deep Learning in IoT Networks: Real Implementation and Analysis
Devashish Chaudhary, Sutharshan Rajasegarar, Shiva Raj Pokhrel, Lei Pan, Ruby D
Main category: cs.LG
TL;DR: Lightweight autoencoder-based anomaly detection framework for IoT edge devices using federated learning to preserve privacy and reduce communication overhead.
Details
Motivation: IoT expansion increases security risks; centralized anomaly detection suffers from privacy, scalability, and latency issues due to large data transfers to central servers.
Method: Proposes lightweight autoencoder-based framework for edge deployment with federated learning - local training on edge nodes, only model weight aggregation at central server.
Result: Implemented on Raspberry Pi IoT testbed, effectively identifies network attacks with significantly reduced communication overhead while maintaining comparable performance to centralized methods.
Conclusion: Federated learning enables effective real-time anomaly detection on resource-constrained edge devices while addressing privacy and scalability limitations of centralized approaches.
Abstract: The rapid expansion of the Internet of Things (IoT) and its integration with backbone networks have heightened the risk of security breaches. Traditional centralized approaches to anomaly detection, which require transferring large volumes of data to central servers, suffer from privacy, scalability, and latency limitations. This paper proposes a lightweight autoencoder-based anomaly detection framework designed for deployment on resource-constrained edge devices, enabling real-time detection while minimizing data transfer and preserving privacy. Federated learning is employed to train models collaboratively across distributed devices, where local training occurs on edge nodes and only model weights are aggregated at a central server. A real-world IoT testbed using Raspberry Pi sensor nodes was developed to collect normal and attack traffic data. The proposed federated anomaly detection system, implemented and evaluated on the testbed, demonstrates its effectiveness in accurately identifying network attacks. The communication overhead was reduced significantly while achieving comparable performance to the centralized method.
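The server-side aggregation step is plain FedAvg. A minimal sketch with flattened parameter lists (the encoding is ours, not the paper's); note that only these weights, never the raw sensor traffic, leave the edge nodes:

```python
def fed_avg(client_weights, client_sizes):
    """Sample-weighted average of per-client parameter lists (FedAvg).
    client_weights: one flat list of parameters per edge node;
    client_sizes: number of local training samples per node."""
    total = sum(client_sizes)
    return [
        sum(w * n for w, n in zip(layer, client_sizes)) / total
        for layer in zip(*client_weights)
    ]
```

Weighting by local sample count keeps the aggregate close to what centralized training on the pooled data would produce, which is why the paper can report comparable performance to the centralized method.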
[919] Riemannian Geometry Speaks Louder Than Words: From Graph Foundation Model to Next-Generation Graph Intelligence
Philip S. Yu, Li Sun
Main category: cs.LG
TL;DR: Proposes Riemannian Foundation Model (RFM) as a new paradigm for Graph Foundation Models using Riemannian geometry to overcome limitations of GNNs and LLMs for graph learning.
Details
Motivation: Current approaches to Graph Foundation Models face challenges: GNNs have memory and interpretability limitations, while LLMs struggle with graph serialization due to structural complexity. There's a need for a general-purpose GFM that can capture complex structural patterns across domains.
Method: Introduces Riemannian Foundation Model (RFM) based on Riemannian geometry, which provides an elegant mathematical framework for modeling graph structures while remaining compatible with semantic learning. Emphasizes intrinsic graph geometry and endogenous capacities for structural inference and generation.
Result: Proposes a new paradigm shift from designing graph models to solving graph-structured applications with RFM agents. Outlines a progressive agenda starting with universal structural understanding through intrinsic geometry, then rebuilding LLMs with a Riemannian engine.
Conclusion: RFM offers a new pathway for capturing complex structural patterns and uncovering cross-domain generalities, enabling next-generation graph intelligence by moving beyond representation-space switching to intrinsic geometric modeling.
Abstract: Graphs provide a natural description of the complex relationships among objects, and play a pivotal role in communications, transportation, social computing, the life sciences, etc. Currently, there is strong agreement that Graph Foundation Models (GFMs) are essential for advancing graph learning, yet considerable disagreement persists on how to build a powerful, general-purpose GFM analogous to Large Language Models (LLMs). Graph Neural Networks (GNNs) exhibit limitations in memory retention and principled interpretability when confronted with multi-domain pretraining and adaptation. The challenge of graph serialization hinders the direct application of LLMs, as the words struggle to capture the structural complexity and diversity inherent in graphs. In contrast, Riemannian geometry offers an elegant mathematical framework for modeling structures, while remaining compatible with graph semantic learning, even with LLMs. In this paper, we argue that, for graphs, Riemannian geometry speaks louder than words, and lay out the foundational principles for GFM. Reimagining with Riemannian geometry, we introduce a blue sky idea-Riemannian Foundation Model (RFM)-that opens a new pathway for capturing complex structural patterns and uncovering cross-domain generalities. RFM emphasizes intrinsic graph geometry and embodies endogenous capacities for structural inference and generation, moving beyond mere representation-space switching. Accordingly, we outline a progressive agenda that begins with universal structural understanding through intrinsic geometry, and then rebuilds LLM with a Riemannian engine for general-purpose graph modeling and beyond. Thus, RFM enables a paradigm shift from designing graph models to solving graph-structured applications with RFM agents, unlocking the next-generation graph intelligence.
[920] mSFT: Addressing Dataset Mixtures Overfitting Heterogeneously in Multi-task SFT
Woosung Koh, Jeyoung Jeon, Youngjin Song, Yujin Cheon, Soowon Oh, Jaehyeong Choi, Se-Young Yun
Main category: cs.LG
TL;DR: mSFT: An iterative, overfitting-aware search algorithm for multi-task supervised fine-tuning that dynamically adjusts data mixtures by identifying and excluding overfitting sub-datasets to optimize learning across heterogeneous tasks.
Details
Motivation: Current multi-task SFT uses homogeneous compute budgets across all sub-datasets, which is suboptimal because heterogeneous learning dynamics cause faster-learning tasks to overfit early while slower ones remain under-fitted.
Method: mSFT trains on an active mixture, identifies and excludes the earliest overfitting sub-dataset, reverts to that specific optimal checkpoint before continuing, iteratively optimizing the data mixture based on overfitting patterns.
Result: mSFT consistently outperforms 4 baselines across 10 benchmarks and 6 base models, maintains robust gains across diverse dataset sizes and task granularities, and is insensitive to its single hyperparameter (compute budget). At low compute budgets, it can improve performance while lowering training FLOPs.
Conclusion: mSFT establishes a practical overfitting-aware algorithm for multi-task SFT that maximizes model potential across diverse data mixtures by dynamically adapting to heterogeneous learning dynamics.
Abstract: Current language model training commonly applies multi-task Supervised Fine-Tuning (SFT) using a homogeneous compute budget across all sub-datasets. This approach is fundamentally sub-optimal: heterogeneous learning dynamics cause faster-learning tasks to overfit early while slower ones remain under-fitted. To address this, we introduce mSFT, an iterative, overfitting-aware search algorithm for multi-task data mixtures. mSFT trains the model on an active mixture, identifies and excludes the earliest overfitting sub-dataset, and reverts to that specific optimal checkpoint before continuing. Extensive evaluations demonstrate that mSFT consistently outperforms 4 baselines across 10 benchmarks and 6 base models. Further analysis confirms mSFT maintains robust gains across diverse dataset sizes, task granularities, and is insensitive to its single new hyperparameter (compute budget). Notably, at low compute budget, mSFT can improve performance while lowering training FLOPs. Ultimately, mSFT establishes a practical overfitting-aware algorithm for multi-task SFT that maximizes the potential of models across diverse data mixtures.
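The core detection step of the loop above, finding the sub-dataset whose validation loss turned upward earliest and the checkpoint to revert to, can be sketched as follows (a toy illustration; the names and the "loss rose after its minimum" rule are assumptions, not the paper's exact criterion):

```python
def earliest_overfit(val_history):
    """Given {sub_dataset: [val loss per checkpoint]}, return the
    sub-dataset whose loss started rising earliest together with the
    index of its best (minimum-loss) checkpoint to revert to, or
    (None, None) if no sub-dataset has overfit yet."""
    earliest, best_step = None, None
    for name, losses in val_history.items():
        argmin = min(range(len(losses)), key=losses.__getitem__)
        if argmin < len(losses) - 1:                  # loss rose after its minimum
            if best_step is None or argmin < best_step:
                earliest, best_step = name, argmin
    return earliest, best_step
```

The outer mSFT loop would then exclude the returned sub-dataset from the active mixture, restore the model from that checkpoint, and continue training on the remaining tasks.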
[921] Rule-State Inference (RSI): A Bayesian Framework for Compliance Monitoring in Rule-Governed Domains
Abdou-Raouf Atarmla
Main category: cs.LG
TL;DR: RSI is a Bayesian framework for compliance monitoring that treats regulatory rules as structured priors and infers latent rule states from noisy observations, achieving efficient regulatory change absorption without retraining.
Details
Motivation: Existing ML frameworks for compliance monitoring assume observed data is ground truth and try to approximate rules from it, but this breaks down in rule-governed domains like taxation where authoritative rules are known a priori. The real challenge is inferring latent rule activation, compliance, and parametric drift from partial, noisy observations.
Method: Rule-State Inference (RSI) inverts the traditional paradigm by encoding regulatory rules as structured priors and casting compliance monitoring as posterior inference over a latent rule-state space S = {(a_i, c_i, delta_i)}, where a_i captures rule activation, c_i models compliance rate, and delta_i quantifies parametric drift. The framework uses mean-field variational inference to maximize the Evidence Lower Bound (ELBO).
Result: On the Togolese fiscal system benchmark (RSI-Togo-Fiscal-Synthetic v1.0 with 2,000 synthetic enterprises), RSI achieves F1=0.519 and AUC=0.599 without labeled training data. It absorbs regulatory changes in under 1ms versus 683-1082ms for full model retraining (600x speedup). The framework provides theoretical guarantees including O(1) time regulatory change absorption and Bernstein-von Mises consistency.
Conclusion: RSI offers a principled Bayesian approach to compliance monitoring that treats rules as first-class citizens, enabling efficient inference of latent rule states and rapid adaptation to regulatory changes without retraining, making it suitable for real-world rule-governed domains.
Abstract: Existing machine learning frameworks for compliance monitoring – Markov Logic Networks, Probabilistic Soft Logic, supervised models – share a fundamental paradigm: they treat observed data as ground truth and attempt to approximate rules from it. This assumption breaks down in rule-governed domains such as taxation or regulatory compliance, where authoritative rules are known a priori and the true challenge is to infer the latent state of rule activation, compliance, and parametric drift from partial and noisy observations. We propose Rule-State Inference (RSI), a Bayesian framework that inverts this paradigm by encoding regulatory rules as structured priors and casting compliance monitoring as posterior inference over a latent rule-state space S = {(a_i, c_i, delta_i)}, where a_i captures rule activation, c_i models the compliance rate, and delta_i quantifies parametric drift. We prove three theoretical guarantees: (T1) RSI absorbs regulatory changes in O(1) time via a prior ratio correction, independently of dataset size; (T2) the posterior is Bernstein-von Mises consistent, converging to the true rule state as observations accumulate; (T3) mean-field variational inference monotonically maximizes the Evidence Lower BOund (ELBO). We instantiate RSI on the Togolese fiscal system and introduce RSI-Togo-Fiscal-Synthetic v1.0, a benchmark of 2,000 synthetic enterprises grounded in real OTR regulatory rules (2022-2025). Without any labeled training data, RSI achieves F1=0.519 and AUC=0.599, while absorbing regulatory changes in under 1ms versus 683-1082ms for full model retraining – at least a 600x speedup.
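Guarantee (T1), absorbing a regulatory change via a prior-ratio correction rather than retraining, can be illustrated on a discrete posterior (a minimal sketch; the discrete state space is a simplifying assumption, whereas the paper works with the full latent rule-state space):

```python
def absorb_rule_change(posterior, prior_old, prior_new):
    """Reweight a discrete posterior over rule states when the regulatory
    prior changes: multiply by the prior ratio and renormalise. This is
    O(1) in the number of observations, since no data is revisited."""
    reweighted = [q * (pn / po)
                  for q, po, pn in zip(posterior, prior_old, prior_new)]
    z = sum(reweighted)
    return [r / z for r in reweighted]
```

Because the likelihood term is untouched, the update cost is independent of dataset size, which is the source of the sub-millisecond absorption time reported above.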
[922] Towards Multimodal Time Series Anomaly Detection with Semantic Alignment and Condensed Interaction
Shiyan Hu, Jianxin Jin, Yang Shu, Peng Chen, Bin Yang, Chenjuan Guo
Main category: cs.LG
TL;DR: MindTS is a multimodal time series anomaly detection model that integrates text information with numerical time series data through fine-grained semantic alignment and content condensation to improve detection performance.
Details
Motivation: Previous time series anomaly detection approaches rely solely on unimodal numerical data, missing valuable complementary information from other modalities like text. There's a need to effectively integrate heterogeneous multimodal data while addressing challenges of semantic alignment and redundancy filtering.
Method: Proposes two key components: 1) Fine-grained Time-text Semantic Alignment that integrates exogenous and endogenous text through cross-view text fusion and multimodal alignment, and 2) Content Condenser Reconstruction that filters redundant text information and performs cross-modal reconstruction for interaction.
Result: Extensive experiments on six real-world multimodal datasets show MindTS achieves competitive or superior results compared to existing methods.
Conclusion: MindTS effectively addresses multimodal time series anomaly detection by solving semantic alignment and redundancy filtering challenges, demonstrating improved performance through multimodal integration.
Abstract: Time series anomaly detection plays a critical role in many dynamic systems. Despite its importance, previous approaches have primarily relied on unimodal numerical data, overlooking the importance of complementary information from other modalities. In this paper, we propose a novel multimodal time series anomaly detection model (MindTS) that focuses on addressing two key challenges: (1) how to achieve semantically consistent alignment across heterogeneous multimodal data, and (2) how to filter out redundant modality information to enhance cross-modal interaction effectively. To address the first challenge, we propose Fine-grained Time-text Semantic Alignment. It integrates exogenous and endogenous text information through cross-view text fusion and a multimodal alignment mechanism, achieving semantically consistent alignment between time and text modalities. For the second challenge, we introduce Content Condenser Reconstruction, which filters redundant information within the aligned text modality and performs cross-modal reconstruction to enable interaction. Extensive experiments on six real-world multimodal datasets demonstrate that the proposed MindTS achieves competitive or superior results compared to existing methods. The code is available at: https://github.com/decisionintelligence/MindTS.
[923] Proximal Policy Optimization in Path Space: A Schrödinger Bridge Perspective
Yuehu Gong, Zeyuan Wang, Yulin Chen, Yanwei Fu
Main category: cs.LG
TL;DR: GSB-PPO: A path-space formulation of PPO for generative policies using Generalized Schrödinger Bridge, enabling trajectory-level optimization instead of action-space probability ratios.
Details
Motivation: On-policy RL with generative policies (diffusion/flow-based) is promising but underexplored. Traditional PPO uses action-space probability ratios, which doesn't align well with trajectory-level generative processes. Need a framework that lifts PPO updates to full generation trajectories.
Method: Propose GSB-PPO, a path-space formulation inspired by Generalized Schrödinger Bridge. Develop two objectives: GSB-PPO-Clip (clipping-based) and GSB-PPO-Penalty (penalty-based), both enabling on-policy training with trajectory-level proximal updates.
Result: Both objectives are compatible with on-policy training, but the penalty formulation consistently delivers better stability and performance than the clipping counterpart, showing that path-space proximal regularization is effective for training generative policies with PPO.
Conclusion: GSB-PPO provides a unified framework for on-policy optimization of generative policies, with path-space proximal regularization as an effective principle. The penalty-based objective outperforms the clipping-based one in stability and performance.
Abstract: On-policy reinforcement learning with generative policies is promising but remains underexplored. A central challenge is that proximal policy optimization (PPO) is traditionally formulated in terms of action-space probability ratios, whereas diffusion- and flow-based policies are more naturally represented as trajectory-level generative processes. In this work, we propose GSB-PPO, a path-space formulation of generative PPO inspired by the Generalized Schrödinger Bridge (GSB). Our framework lifts PPO-style proximal updates from terminal actions to full generation trajectories, yielding a unified view of on-policy optimization for generative policies. Within this framework, we develop two concrete objectives: a clipping-based objective, GSB-PPO-Clip, and a penalty-based objective, GSB-PPO-Penalty. Experimental results show that while both objectives are compatible with on-policy training, the penalty formulation consistently delivers better stability and performance than the clipping counterpart. Overall, our results highlight path-space proximal regularization as an effective principle for training generative policies with PPO.
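The two surrogate objectives can be sketched on a trajectory-level log-ratio; below are the standard clipped and KL-penalty PPO surrogates lifted to path space, written as plain scalar functions for illustration (not the paper's exact losses, which operate on full generation trajectories):

```python
import math

def gsb_ppo_clip(logratio, advantage, eps=0.2):
    """Clipped surrogate on a trajectory-level probability ratio:
    min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    r = math.exp(logratio)
    clipped = max(1.0 - eps, min(r, 1.0 + eps))
    return min(r * advantage, clipped * advantage)

def gsb_ppo_penalty(logratio, advantage, kl, beta=0.1):
    """Penalty surrogate: ratio-weighted advantage minus a path-space
    KL proximity term that keeps the new trajectory distribution close
    to the old one."""
    return math.exp(logratio) * advantage - beta * kl
```

In the path-space setting, `logratio` and `kl` would be accumulated over every step of the generation trajectory rather than computed on a single terminal action, which is the lift the paper proposes.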
[924] MISApp: Multi-Hop Intent-Aware Session Graph Learning for Next App Prediction
Yunchi Yang, Longlong Li, Jianliang Wu, Cunquan Qu
Main category: cs.LG
TL;DR: MISApp: A profile-free framework for next app prediction using multi-hop session graph learning to capture structural dependencies and evolving session intent.
Details
Motivation: Accurate next app prediction is challenging in real-world settings due to rapidly shifting user intent within short sessions and sparse/unavailable user profiles, especially under cold-start conditions. Existing approaches are limited in capturing higher-order structural dependencies and evolving session intent.
Method: Proposes MISApp framework that constructs multi-hop session graphs to capture transition dependencies at different structural ranges, learns session representations through lightweight graph propagation, incorporates temporal and spatial context to characterize session conditions, and captures intent evolution from recent interactions.
Result: Experiments on two real-world app usage datasets show MISApp consistently outperforms competitive baselines under both standard and cold-start settings while maintaining good balance between predictive accuracy and practical efficiency. Learned hop-level attention weights align well with structural relevance.
Conclusion: MISApp effectively addresses next app prediction challenges through multi-hop session graph learning, offering interpretable evidence for the proposed modeling strategy and demonstrating strong performance in both standard and cold-start scenarios.
Abstract: Predicting the next mobile app a user will launch is essential for proactive mobile services. Yet accurate prediction remains challenging in real-world settings, where user intent can shift rapidly within short sessions and user-specific historical profiles are often sparse or unavailable, especially under cold-start conditions. Existing approaches mainly model app usage as sequential behavior or local session transitions, limiting their ability to capture higher-order structural dependencies and evolving session intent. To address this issue, we propose MISApp, a profile-free framework for next app prediction based on multi-hop session graph learning. MISApp constructs multi-hop session graphs to capture transition dependencies at different structural ranges, learns session representations through lightweight graph propagation, incorporates temporal and spatial context to characterize session conditions, and captures intent evolution from recent interactions. Experiments on two real-world app usage datasets show that MISApp consistently outperforms competitive baselines under both standard and cold-start settings, while maintaining a favorable balance between predictive accuracy and practical efficiency. Further analyses show that the learned hop-level attention weights align well with structural relevance, offering interpretable evidence for the effectiveness of the proposed multi-hop modeling strategy.
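The multi-hop propagation with hop-level attention can be sketched as a softmax-weighted combination of features pushed through successive powers of the normalized session-graph transition matrix (an illustrative sketch, not the paper's layer; all names are assumptions):

```python
import numpy as np

def multi_hop_propagate(A, X, hop_logits):
    """Combine features propagated over 1..K hops with softmax hop-level
    attention. A: (n, n) session-graph adjacency, X: (n, d) app features,
    hop_logits: (K,) learnable hop scores."""
    P = A / np.clip(A.sum(axis=1, keepdims=True), 1, None)   # row-normalised
    w = np.exp(hop_logits) / np.exp(hop_logits).sum()        # attention over hops
    out, H = np.zeros_like(X, dtype=float), X.astype(float)
    for wk in w:
        H = P @ H                                            # one more hop
        out += wk * H
    return out
```

The learned `hop_logits` play the role of the hop-level attention weights whose alignment with structural relevance the paper reports.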
[925] TrustFed: Enabling Trustworthy Medical AI under Data Privacy Constraints
Vagish Kumar, Syed Bahauddin Alam, Souvik Chakraborty
Main category: cs.LG
TL;DR: TrustFed is a federated uncertainty quantification framework for healthcare AI that provides distribution-free coverage guarantees under data heterogeneity and class imbalance without centralized data access.
Details
Motivation: Patient privacy constraints prevent centralizing sensitive healthcare data, while federated learning faces challenges from data heterogeneity, site-specific biases, and class imbalance that degrade predictive reliability and make existing uncertainty quantification methods ineffective.
Method: TrustFed introduces a representation-aware client assignment mechanism using internal model representations for effective calibration across institutions, and a soft-nearest threshold aggregation strategy to mitigate assignment uncertainty while producing compact prediction sets.
Result: Evaluated on over 430,000 medical images across six clinically distinct imaging modalities, demonstrating robust coverage guarantees across datasets with diverse class cardinalities and imbalance regimes.
Conclusion: Advances uncertainty-aware federated learning toward clinically meaningful, modality-agnostic deployment, positioning statistically guaranteed uncertainty as a core requirement for next-generation healthcare AI systems.
Abstract: Protecting patient privacy remains a fundamental barrier to scaling machine learning across healthcare institutions, where centralizing sensitive data is often infeasible due to ethical, legal, and regulatory constraints. Federated learning offers a promising alternative by enabling privacy-preserving, multi-institutional training without sharing raw patient data; however, real-world deployments face severe challenges from data heterogeneity, site-specific biases, and class imbalance, which degrade predictive reliability and render existing uncertainty quantification methods ineffective. Here, we present TrustFed, a federated uncertainty quantification framework that provides distribution-free, finite-sample coverage guarantees under heterogeneous and imbalanced healthcare data, without requiring centralized access. TrustFed introduces a representation-aware client assignment mechanism that leverages internal model representations to enable effective calibration across institutions, along with a soft-nearest threshold aggregation strategy that mitigates assignment uncertainty while producing compact and reliable prediction sets. Using over 430,000 medical images across six clinically distinct imaging modalities, we conduct one of the most comprehensive evaluations of uncertainty-aware federated learning in medical imaging, demonstrating robust coverage guarantees across datasets with diverse class cardinalities and imbalance regimes. By validating TrustFed at this scale and breadth, our study advances uncertainty-aware federated learning from proof-of-concept toward clinically meaningful, modality-agnostic deployment, positioning statistically guaranteed uncertainty as a core requirement for next-generation healthcare AI systems.
[926] Data-Free Layer-Adaptive Merging via Fisher Information for Long-to-Short Reasoning LLMs
Tian Xia
Main category: cs.LG
TL;DR: FIM-Merging: A layer-adaptive model merging method using Fisher Information Matrix as proxy for Hessian norm to optimize merging coefficients, achieving state-of-the-art performance in Long-to-Short reasoning scenarios without calibration data.
Details
Motivation: Existing model merging methods assume linear output variation with merging coefficients, which is systematically violated in Long-to-Short (L2S) scenarios where merging base models with long-chain-of-thought reasoning models. Need principled approach for layer-adaptive merging.
Method: Proposes FIM-Merging: uses Fisher Information Matrix (FIM) as principled proxy for Hessian norm bound, computes diagonal FIM using random token inputs (no domain data), assigns per-layer merging coefficients based on FIM, implements as FIM-TIES variant.
Result: On 7B L2S benchmark: SOTA on 5/6 evaluation benchmarks, +6.2 point gain on MATH500 over ACM-TIES (90.2 vs 84.0). On 1.5B benchmark: average accuracy 47.3 vs previous best 43.3 (+3.9 points), reduces response length by 91.9% relative to long-CoT model.
Conclusion: FIM-Merging provides theoretical justification for layer-adaptive merging, outperforms existing methods without calibration data, offers unified explanation for why layer-adaptive methods work, enables efficient model combination for reasoning tasks.
Abstract: Model merging has emerged as a practical approach to combine capabilities of specialized large language models (LLMs) without additional training. In the Long-to-Short (L2S) scenario, merging a base model with a long-chain-of-thought reasoning model aims to preserve reasoning accuracy while reducing output length. Existing methods rely on Task Arithmetic and its variants, which implicitly assume that model outputs vary linearly with the merging coefficient – an assumption we show is systematically violated in L2S settings. We provide the first theoretical justification for layer-adaptive merging: we prove that merging error is bounded by a term proportional to the per-layer Hessian norm (Proposition~1), and establish that the Fisher Information Matrix (FIM) is a principled, computable proxy for this bound via the Fisher-Hessian equivalence at local optima. Building on this theory, we propose \textbf{FIM-Merging}, which computes diagonal FIM using only random token inputs (no domain-specific calibration data required) and uses it to assign per-layer merging coefficients. On the 7B L2S benchmark, FIM-TIES achieves state-of-the-art performance on five out of six evaluation benchmarks, including a \textbf{+6.2} point gain on MATH500 over ACM-TIES (90.2 vs.\ 84.0), while requiring no calibration data. On the 1.5B benchmark, FIM-TIES achieves an average accuracy of \textbf{47.3}, surpassing the previous best ACM-TIES (43.3) by \textbf{+3.9} points, while reducing average response length by \textbf{91.9%} relative to the long-CoT model. Our framework also provides a unified theoretical explanation for why existing layer-adaptive methods such as ACM empirically outperform uniform merging.
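The coefficient-assignment step can be sketched as follows; the inverse-sensitivity mapping from the diagonal-FIM estimate to a per-layer coefficient is an illustrative assumption, not the paper's exact rule:

```python
import numpy as np

def layer_coefficients(per_layer_sq_grads, lo=0.1, hi=0.9):
    """Turn per-layer diagonal-FIM estimates (mean squared gradients
    collected on random-token inputs) into merging coefficients:
    layers with higher Fisher sensitivity get smaller merge weights,
    since their merging error bound (proportional to the Hessian norm,
    proxied by the FIM) is larger."""
    fim = np.array([np.mean(g) for g in per_layer_sq_grads])
    sensitivity = fim / fim.max()            # relative sensitivity in (0, 1]
    return hi - (hi - lo) * sensitivity      # hypothetical linear mapping
```

The key point from the paper is only that the coefficient varies per layer with the FIM; the linear map above is one simple way to realize that.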
[927] When Exploration Comes for Free with Mixture-Greedy: Do we need UCB in Diversity-Aware Multi-Armed Bandits?
Bahar Dibaei Nia, Farzan Farnia
Main category: cs.LG
TL;DR: Mixture-Greedy strategy outperforms UCB-based approaches for diversity-aware generative model selection, showing that intrinsic exploration from objective geometry can be sufficient without explicit confidence bonuses.
Details
Motivation: Efficient selection among multiple generative models is costly when sampling from suboptimal models. Under diversity-aware metrics, mixtures outperform individual models, creating a different problem than classical best-arm identification. Prior UCB-based approaches slow convergence, suggesting need for better methods.
Method: Proposes Mixture-Greedy strategy without explicit UCB-type optimism. Analyzes diversity-aware objectives (entropy-based, kernel-based, FID-type) that induce implicit exploration by favoring interior mixtures, leading to linear sampling of all arms.
Result: Mixture-Greedy converges faster and achieves better performance than UCB-based approaches across multiple datasets and evaluation metrics (FID, Vendi). Provides theoretical guarantees showing linear sampling of all arms and sublinear regret for diversity-aware objectives.
Conclusion: In diversity-aware multi-armed bandits for generative model selection, exploration can arise intrinsically from objective geometry, questioning the necessity of explicit confidence bonuses. Simple greedy approaches can outperform optimistic exploration strategies.
Abstract: Efficient selection among multiple generative models is increasingly important in modern generative AI, where sampling from suboptimal models is costly. This problem can be formulated as a multi-armed bandit task. Under diversity-aware evaluation metrics, a non-degenerate mixture of generators can outperform any individual model, distinguishing this setting from classical best-arm identification. Prior approaches therefore incorporate an Upper Confidence Bound (UCB) exploration bonus into the mixture objective. However, across multiple datasets and evaluation metrics, we observe that the UCB term consistently slows convergence and often reduces sample efficiency. In contrast, a simple \emph{Mixture-Greedy} strategy without explicit UCB-type optimism converges faster and achieves even better performance, particularly for widely used metrics such as FID and Vendi where tight confidence bounds are difficult to construct. We provide theoretical insight explaining this behavior: under transparent structural conditions, diversity-aware objectives induce implicit exploration by favoring interior mixtures, leading to linear sampling of all arms and sublinear regret guarantees for entropy-based, kernel-based, and FID-type objectives. These results suggest that in diversity-aware multi-armed bandits for generative model selection, exploration can arise intrinsically from the objective geometry, questioning the necessity of explicit confidence bonuses.
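The greedy rule can be sketched directly: at each round, pull the arm whose extra sample most increases the diversity-aware objective of the empirical mixture, with no confidence bonus (a toy sketch; an entropy objective stands in for FID or Vendi):

```python
import math

def entropy(mix):
    """Shannon entropy of a mixture, a simple diversity-aware objective."""
    return -sum(p * math.log(p) for p in mix if p > 0)

def mixture_greedy_pick(counts, objective=entropy):
    """Pick the arm whose next sample maximises the diversity-aware
    objective of the resulting empirical mixture (no UCB term)."""
    total = sum(counts) + 1
    best_arm, best_val = None, -math.inf
    for arm in range(len(counts)):
        trial = list(counts)
        trial[arm] += 1
        val = objective([c / total for c in trial])
        if val > best_val:
            best_arm, best_val = arm, val
    return best_arm
```

Because entropy-like objectives favor interior mixtures, this greedy rule keeps pulling under-sampled arms on its own, which is the implicit exploration the paper analyzes.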
[928] Uncertainty Quantification for Distribution-to-Distribution Flow Matching in Scientific Imaging
Dongxia Wu, Yuhui Zhang, Serena Yeung-Levy, Emma Lundberg, Emily B. Fox
Main category: cs.LG
TL;DR: Bayesian Stochastic Flow Matching (BSFM) framework for uncertainty quantification in distribution-to-distribution generative models, combining Stochastic Flow Matching for improved generalization with MCD-Antithetic sampling for out-of-distribution detection.
Details
Motivation: Distribution-to-distribution generative models need both reliability (generalization across different conditions) and accountability (detecting out-of-distribution cases). Current uncertainty quantification methods for these models are underexplored despite being crucial for trustworthy generation in scientific imaging tasks.
Method: Proposes Bayesian Stochastic Flow Matching (BSFM) framework: 1) Stochastic Flow Matching (SFM) augments deterministic flows with diffusion term to improve generalization, 2) MCD-Antithetic combines Monte Carlo Dropout with antithetic sampling for scalable Bayesian uncertainty quantification and effective anomaly scores for out-of-distribution detection.
Result: Experiments on cellular imaging (BBBC021, JUMP) and brain fMRI (Theory of Mind) across diverse scenarios show that SFM improves reliability while MCD-Antithetic enhances accountability through effective out-of-distribution detection.
Conclusion: BSFM provides a unified uncertainty quantification framework for distribution-to-distribution generative models that addresses both reliability and accountability needs in scientific imaging applications.
Abstract: Distribution-to-distribution generative models support scientific imaging tasks ranging from modeling cellular perturbation responses to translating medical images across conditions. Trustworthy generation requires both reliability (generalization across labs, devices, and experimental conditions) and accountability (detecting out-of-distribution cases where predictions may be unreliable). Uncertainty quantification (UQ) based approaches serve as promising candidates for these tasks, yet UQ for distribution-to-distribution generative models remains underexplored. We present a unified UQ framework, Bayesian Stochastic Flow Matching (BSFM), that disentangles aleatoric and epistemic uncertainty. The Stochastic Flow Matching (SFM) component augments deterministic flows with a diffusion term to improve model generalization to unseen scenarios. For UQ, we develop a scalable Bayesian approach – MCD-Antithetic – that combines Monte Carlo Dropout with sample-efficient antithetic sampling to produce effective anomaly scores for out-of-distribution detection. Experiments on cellular imaging (BBBC021, JUMP) and brain fMRI (Theory of Mind) across diverse scenarios show that SFM improves reliability while MCD-Antithetic enhances accountability.
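The MCD-Antithetic anomaly score can be sketched as predictive variance over paired stochastic forward passes; `predict(x, eps)` is an assumed interface for one dropout pass driven by noise `eps`, not the paper's API:

```python
import numpy as np

def mcd_antithetic_score(predict, x, n_pairs=8, seed=0):
    """Anomaly score: predictive variance over Monte Carlo Dropout passes,
    using antithetic noise pairs (+eps, -eps) so each pair cancels part of
    the sampling variance. High variance flags likely OOD inputs."""
    rng = np.random.default_rng(seed)
    outputs = []
    for _ in range(n_pairs):
        eps = rng.standard_normal(np.shape(x))
        outputs.append(predict(x, eps))       # stochastic pass with +eps
        outputs.append(predict(x, -eps))      # paired pass with -eps
    return float(np.stack(outputs).var(axis=0).mean())
```

A model that is insensitive to the injected noise scores near zero, while one whose output swings with the noise (epistemically uncertain on that input) scores high.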
[929] FISformer: Replacing Self-Attention with a Fuzzy Inference System in Transformer Models for Time Series Forecasting
Bulent Haznedar, Levent Karacan
Main category: cs.LG
TL;DR: FISFormer replaces standard Transformer attention with fuzzy inference systems to model uncertainty and nonlinear dependencies in time series forecasting.
Details
Motivation: Standard Transformers use deterministic dot-product attention which limits their ability to model uncertainty and nonlinear dependencies across multivariate temporal dimensions in time series forecasting.
Method: Proposes FISFormer with FIS Interaction mechanism replacing conventional attention. Each query-key pair undergoes fuzzy inference with learnable membership functions and rule-based reasoning to estimate token-wise relational strengths, capturing uncertainty through continuous mappings.
Result: Extensive experiments show FISFormer achieves superior forecasting accuracy, noise robustness, and interpretability compared to state-of-the-art Transformer variants on multiple benchmark datasets.
Conclusion: Fuzzy inference is an effective alternative to conventional attention mechanisms, combining interpretability and uncertainty modeling of fuzzy logic with Transformer representational power.
Abstract: Transformers have achieved remarkable progress in time series forecasting, yet their reliance on deterministic dot-product attention limits their capacity to model uncertainty and nonlinear dependencies across multivariate temporal dimensions. To address this limitation, we propose FISFormer, a Fuzzy Inference System-driven Transformer that replaces conventional attention with a FIS Interaction mechanism. In this framework, each query-key pair undergoes a fuzzy inference process for every feature dimension, where learnable membership functions and rule-based reasoning estimate token-wise relational strengths. These FIS-derived interaction weights capture uncertainty and provide interpretable, continuous mappings between tokens. A softmax operation is applied along the token axis to normalize these weights, which are then combined with the corresponding value features through element-wise multiplication to yield the final context-enhanced token representations. This design fuses the interpretability and uncertainty modeling of fuzzy logic with the representational power of Transformers. Extensive experiments on multiple benchmark datasets demonstrate that FISFormer achieves superior forecasting accuracy, noise robustness, and interpretability compared to state-of-the-art Transformer variants, establishing fuzzy inference as an effective alternative to conventional attention mechanisms.
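A stripped-down version of the FIS Interaction can be sketched with a single Gaussian membership function per feature dimension standing in for the learnable fuzzy rule base (illustrative only; the paper's rule-based inference is richer):

```python
import numpy as np

def fis_interaction(Q, K, V, center=0.0, width=1.0):
    """Fuzzy attention sketch: Gaussian membership of query-key differences
    per feature dimension gives firing strengths, which are softmaxed along
    the key (token) axis and applied elementwise to the values."""
    diff = Q[:, None, :] - K[None, :, :]                       # (nq, nk, d)
    mu = np.exp(-((diff - center) ** 2) / (2.0 * width ** 2))  # memberships
    w = np.exp(mu) / np.exp(mu).sum(axis=1, keepdims=True)     # softmax over keys
    return (w * V[None, :, :]).sum(axis=1)                     # (nq, d) context
```

Note the softmax runs over tokens but the weights stay per-dimension, matching the element-wise combination with values described above; a trained model would learn `center` and `width` per membership function.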
[930] CellFluxRL: Biologically-Constrained Virtual Cell Modeling via Reinforcement Learning
Dongxia Wu, Shiye Su, Yuhui Zhang, Elaine Sui, Emma Lundberg, Emily B. Fox, Serena Yeung-Levy
Main category: cs.LG
TL;DR: Post-training virtual cell generative models with reinforcement learning using biologically meaningful rewards to produce more physically and biologically plausible cell simulations.
Details
Motivation: Current image-based generative models for virtual cells often produce implausible images that violate physical and biological constraints, limiting their usefulness for drug discovery applications.
Method: Propose reinforcement learning post-training of the CellFlux model (CellFluxRL) using seven biologically meaningful reward functions spanning three categories: biological function, structural validity, and morphological correctness.
Result: CellFluxRL consistently outperforms CellFlux across all reward categories, with additional performance gains from test-time scaling, producing more biologically meaningful virtual cells.
Conclusion: The RL-based framework advances virtual cell modeling beyond visual realism to enforce physically-based constraints, creating more biologically meaningful simulations for drug discovery.
Abstract: Building virtual cells with generative models to simulate cellular behavior in silico is emerging as a promising paradigm for accelerating drug discovery. However, prior image-based generative approaches can produce implausible cell images that violate basic physical and biological constraints. To address this, we propose to post-train virtual cell models with reinforcement learning (RL), leveraging biologically meaningful evaluators as reward functions. We design seven rewards spanning three categories (biological function, structural validity, and morphological correctness) and optimize the state-of-the-art CellFlux model to yield CellFluxRL. CellFluxRL consistently improves over CellFlux across all rewards, with further performance boosts from test-time scaling. Overall, our results present a virtual cell modeling framework that enforces physically-based constraints through RL, advancing beyond “visually realistic” generations towards “biologically meaningful” ones.
[931] Extending Precipitation Nowcasting Horizons via Spectral Fusion of Radar Observations and Foundation Model Priors
Yuze Qin, Qingyong Li, Zhiqing Guo, Wen Wang, Yan Liu, Yangli-ao Geng
Main category: cs.LG
TL;DR: PW-FouCast: A frequency-domain fusion framework that integrates Pangu-Weather forecasts as spectral priors into Fourier-based precipitation nowcasting to improve long-term prediction accuracy.
Details
Motivation: Radar-only precipitation nowcasting models lack large-scale atmospheric context, leading to performance degradation at longer lead times. Existing architectures fail to reconcile representational heterogeneities between radar imagery and meteorological data from weather foundation models.
Method: Proposes PW-FouCast with three innovations: 1) Pangu-Weather-guided Frequency Modulation to align spectral magnitudes/phases with meteorological priors, 2) Frequency Memory to correct phase discrepancies and preserve temporal evolution, and 3) Inverted Frequency Attention to reconstruct high-frequency details lost in spectral filtering.
Result: Achieves state-of-the-art performance on SEVIR and MeteoNet benchmarks, effectively extending reliable forecast horizon while maintaining structural fidelity.
Conclusion: PW-FouCast successfully bridges the gap between radar imagery and meteorological data through frequency-domain fusion, improving precipitation nowcasting accuracy.
Abstract: Precipitation nowcasting is critical for disaster mitigation and aviation safety. However, radar-only models frequently suffer from a lack of large-scale atmospheric context, leading to performance degradation at longer lead times. While integrating meteorological variables predicted by weather foundation models offers a potential remedy, existing architectures fail to reconcile the profound representational heterogeneities between radar imagery and meteorological data. To bridge this gap, we propose PW-FouCast, a novel frequency-domain fusion framework that leverages Pangu-Weather forecasts as spectral priors within a Fourier-based backbone. Our architecture introduces three key innovations: (i) Pangu-Weather-guided Frequency Modulation to align spectral magnitudes and phases with meteorological priors; (ii) Frequency Memory to correct phase discrepancies and preserve temporal evolution; and (iii) Inverted Frequency Attention to reconstruct high-frequency details typically lost in spectral filtering. Extensive experiments on the SEVIR and MeteoNet benchmarks demonstrate that PW-FouCast achieves state-of-the-art performance, effectively extending the reliable forecast horizon while maintaining structural fidelity. Our code is available at https://github.com/Onemissed/PW-FouCast.
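The magnitude/phase alignment at the heart of the Frequency Modulation component can be sketched as a simple spectral blend of a radar field with a coarse prior. The fixed blending weight `alpha` and direct magnitude/phase interpolation are illustrative assumptions; the paper's module is learned:

```python
import numpy as np

def spectral_fuse(radar, prior, alpha=0.3):
    """Blend FFT magnitude and phase of a radar field with a coarse
    meteorological prior (a hypothetical stand-in for Pangu-Weather
    guidance); alpha controls how strongly the prior modulates the radar."""
    Fr = np.fft.fft2(radar)
    Fp = np.fft.fft2(prior)
    # Linear magnitude blend
    mag = (1 - alpha) * np.abs(Fr) + alpha * np.abs(Fp)
    # Interpolate phases on the unit circle to avoid wrap-around artefacts
    phase = np.angle((1 - alpha) * np.exp(1j * np.angle(Fr))
                     + alpha * np.exp(1j * np.angle(Fp)))
    return np.real(np.fft.ifft2(mag * np.exp(1j * phase)))

rng = np.random.default_rng(1)
radar = rng.random((32, 32))
prior = rng.random((32, 32))
fused = spectral_fuse(radar, prior)
```

Setting `alpha=0` recovers the radar field exactly, which makes the prior's influence easy to ablate.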
[932] Show Me What You Don’t Know: Efficient Sampling from Invariant Sets for Model Validation
Armand Rousselot, Joran Wendebourg, Ullrich Köthe
Main category: cs.LG
TL;DR: A training-free method to visualize feature extractor invariances by sampling from equivalence classes (fibers) using guided diffusion/flow-matching models, revealing what variations models consider irrelevant vs. task-relevant.
Details
Motivation: To understand what features machine learning models learn and whether they appropriately ignore irrelevant variations while preserving task-relevant details, which is crucial for model interpretability and safety.
Method: Proposes a training-free algorithm that uses pretrained diffusion or flow-matching models as priors. A fiber loss penalizes feature mismatches and guides the denoising process toward desired equivalence classes via non-linear diffusion trajectory matching.
Result: Experiments on ImageNet, CheXpert with models like ResNet, DINO, BiomedClip show the method can reveal invariances from desirable to concerning behaviors. For example, Qwen-2B places patients with situs inversus (heart on right side) in same fiber as typical anatomy.
Conclusion: The framework provides efficient visualization of feature extractor invariances without training dedicated generative models, replacing days of training with single guided generation while maintaining comparable fidelity.
Abstract: The performance of machine learning models is determined by the quality of their learned features. They should be invariant under irrelevant data variation but sensitive to task-relevant details. To visualize whether this is the case, we propose a method to analyze feature extractors by sampling from their fibers – equivalence classes defined by their invariances – given an arbitrary representative. Unlike existing work where a dedicated generative model is trained for each feature detector, our algorithm is training-free and exploits a pretrained diffusion or flow-matching model as a prior. The fiber loss – which penalizes mismatch in features – guides the denoising process toward the desired equivalence class, via non-linear diffusion trajectory matching. This replaces days of training for invariance learning with a single guided generation procedure at comparable fidelity. Experiments on popular datasets (ImageNet, CheXpert) and model types (ResNet, DINO, BiomedClip) demonstrate that our framework can reveal invariances ranging from very desirable to concerning behaviour. For instance, we show how Qwen-2B places patients with situs inversus (heart on the right side) in the same fiber as typical anatomy.
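The fiber-guidance idea can be illustrated with a linear toy feature extractor, where gradient descent on the fiber loss stands in for the guided denoising trajectory. The matrix `W`, the quadratic loss, and the plain gradient loop are all assumptions made for illustration, not the paper's method:

```python
import numpy as np

# Toy "feature extractor": a fixed linear map (stand-in for a pretrained
# network). Its fiber through x_ref is the affine set {x : W x = W x_ref}.
rng = np.random.default_rng(2)
W = rng.normal(size=(3, 8))      # 8-D inputs -> 3-D features, so 5-D fibers
x_ref = rng.normal(size=8)
f_ref = W @ x_ref

def fiber_loss(x):
    # Penalizes feature mismatch with the reference representative
    return 0.5 * np.sum((W @ x - f_ref) ** 2)

# "Guided generation": gradient descent on the fiber loss from a random
# start, a minimal stand-in for steering a diffusion/flow trajectory.
x = rng.normal(size=8)
step = 1.0 / np.linalg.norm(W, 2) ** 2   # safe step from largest singular value
for _ in range(2000):
    x -= step * (W.T @ (W @ x - f_ref))
```

The converged `x` lands on the same fiber as `x_ref` (identical features) while remaining a different input, which is exactly the kind of sample used to visualize an extractor's invariances.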
[933] CoRA: Boosting Time Series Foundation Models for Multivariate Forecasting through Correlation-aware Adapter
Hanyin Cheng, Xingjian Wu, Yang Shu, Zhongwen Rao, Lujia Pan, Bin Yang, Chenjuan Guo
Main category: cs.LG
TL;DR: CoRA is a lightweight plug-and-play adapter that enhances Time Series Foundation Models by capturing different types of channel correlations, improving multivariate time series forecasting performance.
Details
Motivation: Existing Time Series Foundation Models (TSFMs) use channel-independent modeling and focus only on temporal dependencies, neglecting vital correlations among channels that are crucial for multivariate time series forecasting.
Method: Proposes CoRrelation-aware Adapter (CoRA) that decomposes the correlation matrix into low-rank Time-Varying and Time-Invariant components. Uses learnable polynomials for dynamic correlations and introduces dual contrastive learning with a Heterogeneous-Partial contrastive loss to identify positive/negative correlations among channels.
Result: Extensive experiments on 10 real-world datasets demonstrate that CoRA improves TSFMs in multivariate forecasting performance.
Conclusion: CoRA effectively captures different types of channel correlations in a lightweight manner, enhancing existing TSFMs for better multivariate time series forecasting.
Abstract: Most existing Time Series Foundation Models (TSFMs) use channel independent modeling and focus on capturing and generalizing temporal dependencies, while neglecting the correlations among channels or overlooking the different aspects of correlations. However, these correlations play a vital role in Multivariate time series forecasting. To address this, we propose a CoRrelation-aware Adapter (CoRA), a lightweight plug-and-play method that requires only fine-tuning with TSFMs and is able to capture different types of correlations, so as to improve forecast performance. Specifically, to reduce complexity, we innovatively decompose the correlation matrix into low-rank Time-Varying and Time-Invariant components. For the Time-Varying component, we further design learnable polynomials to learn dynamic correlations by capturing trends or periodic patterns. To learn positive and negative correlations that appear only among some channels, we introduce a novel dual contrastive learning method that identifies correlations through projection layers, regulated by a Heterogeneous-Partial contrastive loss during training, without introducing additional complexity in the inference stage. Extensive experiments on 10 real-world datasets demonstrate that CoRA can improve TSFMs in multivariate forecasting performance.
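A minimal sketch of the low-rank Time-Varying/Time-Invariant decomposition, with the polynomial time dependence made explicit. The rank, polynomial degree, and the diagonal normalization below are illustrative choices, not the paper's exact parameterization:

```python
import numpy as np

rng = np.random.default_rng(3)
C, r, P = 6, 2, 3   # channels, rank, polynomial degree (all illustrative)

# Time-invariant low-rank factor and polynomial coefficients for the
# time-varying factor (both would be learnable in the adapter).
U_inv = rng.normal(size=(C, r))
coef = rng.normal(size=(P + 1, C, r))

def correlation_at(t):
    """Low-rank channel-correlation matrix at normalized time t."""
    # Time-varying factor: a learnable polynomial in t, per channel and rank
    U_var = sum(coef[p] * t ** p for p in range(P + 1))
    U = U_inv + U_var
    M = U @ U.T
    # Normalize to a correlation-like matrix with unit diagonal
    d = np.sqrt(np.diag(M))
    return M / np.outer(d, d)

R0 = correlation_at(0.0)
R1 = correlation_at(0.5)
```

The low-rank factorization keeps the cost O(Cr) per time step instead of O(C^2), which is what makes the adapter lightweight.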
[934] Deriving Health Metrics from the Photoplethysmogram: Benchmarks and Insights from MIMIC-III-Ext-PPG
Mohammad Moulaeifard, Philip J. Aston, Peter H. Charlton, Nils Strodthoff
Main category: cs.LG
TL;DR: Comprehensive benchmark for PPG-based clinical prediction using deep learning, covering heart rhythm classification and physiological parameter estimation with strong performance and subgroup analysis.
Details
Motivation: PPG is widely used for clinical prediction, but algorithms are typically trained on small datasets of uncertain quality, hindering meaningful comparisons and comprehensive evaluation.
Method: Establishes a comprehensive benchmark on the MIMIC-III-Ext-PPG dataset using established deep learning architectures for multi-class heart rhythm classification and regression of respiratory rate, heart rate, and blood pressure.
Result: Achieved strong performance: AF detection (AUROC=0.96), accurate physiological parameter estimation (RR MAE: 2.97 bpm; HR MAE: 1.13 bpm; BP MAE: 16.13/8.70 mmHg), excellent cross-dataset generalizability for AF detection (AUROC=0.97).
Conclusion: PPG signals can effectively support multiple simultaneous monitoring tasks; benchmark reveals performance variations across subgroups reflecting population-specific waveform differences rather than systematic bias.
Abstract: Photoplethysmography (PPG) is one of the most widely captured biosignals for clinical prediction tasks, yet PPG-based algorithms are typically trained on small-scale datasets of uncertain quality, which hinders meaningful algorithm comparisons. We present a comprehensive benchmark for PPG-based clinical prediction using the MIMIC-III-Ext-PPG dataset, establishing baselines across the full spectrum of clinically relevant applications: multi-class heart rhythm classification, and regression of physiological parameters including respiratory rate (RR), heart rate (HR), and blood pressure (BP). Most notably, we provide the first comprehensive assessment of PPG for general arrhythmia detection beyond atrial fibrillation (AF) and atrial flutter (AFLT), with performance stratified by BP, HR, and demographic subgroups. Using established deep learning architectures, we achieved strong performance for AF detection (AUROC = 0.96) and accurate physiological parameter estimation (RR MAE: 2.97 bpm; HR MAE: 1.13 bpm; SBP/DBP MAE: 16.13/8.70 mmHg). Cross-dataset validation demonstrates excellent generalizability for AF detection (AUROC = 0.97), while clinical subgroup analysis reveals marked performance differences across subgroups by BP, HR, and demographic strata. These variations appear to reflect population-specific waveform differences rather than systematic bias in model behavior. This framework establishes the first integrated benchmark for multi-task PPG-based clinical prediction, demonstrating that PPG signals can effectively support multiple simultaneous monitoring tasks and providing essential baselines for future algorithm development.
[935] On the Number of Conditional Independence Tests in Constraint-based Causal Discovery
Marc Franquesa Monés, Jiaqi Zhang, Caroline Uhler
Main category: cs.LG
TL;DR: New constraint-based causal discovery algorithm achieves p^O(s) complexity, improving over exponential worst-case complexity of PC algorithm, with s being maximum undirected clique size of essential graph.
Details
Motivation: Constraint-based causal discovery methods such as the PC algorithm require an exponential number of conditional independence tests in the worst case, creating computational bottlenecks for large-scale causal inference problems.
Method: Proposes a new algorithm that reduces complexity to p^O(s) conditional independence tests, where p is the number of nodes and s is the maximum undirected clique size of the essential graph, with a theoretical proof of optimality.
Result: Algorithm achieves exponent-optimal complexity up to logarithmic factor, validated through simulations, semi-synthetic gene-expression data, and real-world datasets showing efficiency improvements.
Conclusion: Establishes fundamental complexity bounds for constraint-based causal discovery, providing practical algorithm with improved efficiency while proving theoretical optimality.
Abstract: Learning causal relations from observational data is a fundamental problem with wide-ranging applications across many fields. Constraint-based methods infer the underlying causal structure by performing conditional independence tests. However, existing algorithms such as the prominent PC algorithm need to perform a large number of independence tests, which in the worst case is exponential in the maximum degree of the causal graph. Despite extensive research, it remains unclear if there exist algorithms with better complexity without additional assumptions. Here, we establish an algorithm that achieves a better complexity of $p^{\mathcal{O}(s)}$ tests, where $p$ is the number of nodes in the graph and $s$ denotes the maximum undirected clique size of the underlying essential graph. Complementing this result, we prove that any constraint-based algorithm must perform at least $2^{\Omega(s)}$ conditional independence tests, establishing that our proposed algorithm achieves exponent-optimality up to a logarithmic factor in terms of the number of conditional independence tests needed. Finally, we validate our theoretical findings through simulations, on semi-synthetic gene-expression data, and real-world data, demonstrating the efficiency of our algorithm compared to existing methods in terms of number of conditional independence tests needed.
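Constraint-based discovery is built on conditional independence tests such as the partial-correlation test sketched below (a standard Gaussian CI test, not specific to this paper). On a chain X → Y → Z, X and Z correlate marginally but decorrelate once Y is conditioned on:

```python
import numpy as np

def partial_corr(data, i, j, cond):
    """Partial correlation of X_i and X_j given X_cond: regress the
    conditioning set out of both variables, then correlate the residuals
    (a standard CI test statistic under a Gaussian assumption)."""
    x, y = data[:, i], data[:, j]
    if cond:
        Z = np.column_stack([data[:, k] for k in cond] + [np.ones(len(data))])
        x = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
        y = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return np.corrcoef(x, y)[0, 1]

# Chain X -> Y -> Z: X and Z are dependent marginally, independent given Y.
rng = np.random.default_rng(4)
n = 5000
X = rng.normal(size=n)
Y = 2 * X + rng.normal(size=n)
Z = -1.5 * Y + rng.normal(size=n)
data = np.column_stack([X, Y, Z])

marginal = partial_corr(data, 0, 2, [])
given_y = partial_corr(data, 0, 2, [1])
```

The complexity question the paper studies is how many such tests, over how many conditioning sets, an algorithm must issue; each call above is one test.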
[936] Holistic Scaling Laws for Optimal Mixture-of-Experts Architecture Optimization
Weilin Wan, Jingtao Han, Weizhong Zhang, Cheng Jin
Main category: cs.LG
TL;DR: A framework for optimizing Mixture-of-Experts (MoE) architectures that establishes joint constraints for FLOPs, active parameters, and total parameters, reducing 16D search space to sequential low-dimensional phases for robust scaling laws.
Details
Motivation: Existing MoE scaling studies face limitations: either augment scaling formulas with extra MoE variables (risking unreliable fits) or fix all non-MoE factors (ignoring global interactions). There's a need for a holistic optimization framework that bridges this gap.
Method: Proposes a reusable framework that: 1) Shows FLOPs per token alone is inadequate for MoE fairness due to differing computational densities, 2) Establishes joint constraint triad of FLOPs per token, active parameters, and total parameters, 3) Reduces 16D architectural search space to two sequential low-dimensional phases using algebraic constraints and rank-preserving property of hidden dimension.
Result: Validated across hundreds of MoE models spanning six orders of magnitude in compute, the framework yields robust scaling laws that map any compute budget to a complete, optimal MoE architecture. Key finding: near-optimal configuration band widens with scale, giving practitioners flexibility.
Conclusion: The framework provides a systematic approach for MoE architectural optimization, bridging the gap between scaling laws and practical architectural configurations, with validated robustness across wide compute ranges.
Abstract: Scaling laws for Large Language Models govern macroscopic resource allocation, yet translating them into precise Mixture-of-Experts (MoE) architectural configurations remains an open problem due to the combinatorially vast design space. Existing MoE scaling studies are constrained by experimental budgets to either augment scaling formulas with extra MoE variables, risking unreliable fits, or fix all non-MoE factors, ignoring global interactions. We propose a reusable framework for holistic MoE architectural optimization that bridges this gap. We first show that FLOPs per token alone is an inadequate fairness metric for MoE models because differing computational densities across layer types can inflate parameters without proportional compute cost, and establish a joint constraint triad of FLOPs per token, active parameters, and total parameters. We then reduce the 16-dimensional architectural search space to two sequential low-dimensional phases through algebraic constraints and a rank-preserving property of the hidden dimension. Validated across hundreds of MoE models spanning six orders of magnitude in compute, our framework yields robust scaling laws that map any compute budget to a complete, optimal MoE architecture. A key finding is that the near-optimal configuration band widens with scale, giving practitioners quantitative flexibility to balance scaling law recommendations against infrastructure constraints.
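The constraint triad can be illustrated with a back-of-the-envelope budget for MoE FFN layers, showing why total parameters can grow while FLOPs per token stay fixed. The accounting below (attention omitted, ~2 FLOPs per active weight) is a rough convention for illustration, not the paper's exact formulation:

```python
def moe_budget(d_model, n_layers, n_experts, top_k, d_ff):
    """Rough per-token budget for a decoder whose FFN layers are MoE.
    Only the expert up-/down-projections are counted (illustrative)."""
    expert_params = 2 * d_model * d_ff            # up- and down-projection
    total_params = n_layers * n_experts * expert_params
    active_params = n_layers * top_k * expert_params   # routed experts only
    flops_per_token = 2 * active_params           # ~2 FLOPs per active weight
    return total_params, active_params, flops_per_token

total, active, flops = moe_budget(d_model=1024, n_layers=24,
                                  n_experts=8, top_k=2, d_ff=4096)
```

Doubling `n_experts` doubles `total` but leaves `active` and `flops` unchanged, which is why FLOPs per token alone cannot compare MoE configurations fairly.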
[937] P^2O: Joint Policy and Prompt Optimization
Xinyu Lu, Kaiqi Zhang, Jinglin Yang, Boxi Cao, Yaojie Lu, Hongyu Lin, Min He, Xianpei Han, Le Sun
Main category: cs.LG
TL;DR: P^2O combines prompt optimization with policy optimization to improve RLVR for LLMs, addressing exploration inefficiency on hard samples by evolving prompts to guide successful trajectories and distilling reasoning gains into model parameters.
Details
Motivation: Vanilla RLVR suffers from inefficient exploration on "hard samples" with near-zero success rates, leading to zero-advantage estimates and a lack of supervision signals despite the high informational value of these instances.
Method: P^2O synergizes Prompt Optimization with Policy Optimization: it identifies hard samples during training, uses the Genetic-Pareto (GEPA) algorithm to evolve prompt templates that guide the model toward successful trajectories, and distills the reasoning gains directly into model parameters rather than relying on input augmentation.
Result: P^2O achieves superior performance on in-distribution datasets and exhibits strong generalization with substantial improvements on out-of-distribution benchmarks (+4.7% average improvement).
Conclusion: P^2O effectively addresses exploration inefficiency in RLVR by providing denser positive supervision signals for hard samples through prompt optimization and parameter distillation, accelerating convergence and improving generalization.
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs). However, vanilla RLVR suffers from inefficient exploration, particularly when confronting “hard samples” that yield near-zero success rates. In such scenarios, the reliance on sparse outcome rewards typically results in zero-advantage estimates, effectively starving the model of supervision signals despite the high informational value of these instances. To address this, we propose P^2O, a novel framework that synergizes Prompt Optimization with Policy Optimization. P^2O identifies hard samples during training iterations and leverages the Genetic-Pareto (GEPA) prompt optimization algorithm to evolve prompt templates that guide the model toward discovering successful trajectories. Crucially, unlike traditional prompt engineering methods that rely on input augmentation, P^2O distills the reasoning gains induced by these optimized prompts directly into the model parameters. This mechanism provides denser positive supervision signals for hard samples and accelerates convergence. Extensive experiments demonstrate that P^2O not only achieves superior performance on in-distribution datasets but also exhibits strong generalization, yielding substantial improvements on out-of-distribution benchmarks (+4.7% avg.).
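The hard-sample identification step can be sketched as a success-rate filter over rollouts; the threshold, rollout count, and success counts below are hypothetical placeholders:

```python
def find_hard_samples(success_counts, n_rollouts, eps=0.05):
    """Flag prompts whose rollout success rate is at or below eps, the
    'hard samples' that would be routed to prompt optimization."""
    return [i for i, c in enumerate(success_counts)
            if c / n_rollouts <= eps]

# Per-prompt successes out of 16 rollouts (invented numbers)
hard = find_hard_samples([0, 5, 1, 8], n_rollouts=16)
```

Only the flagged prompts go through the (comparatively expensive) prompt-evolution loop, keeping the overhead proportional to the hard fraction of the batch.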
[938] SmaAT-QMix-UNet: A Parameter-Efficient Vector-Quantized UNet for Precipitation Nowcasting
Nikolas Stavrou, Siamak Mehrkanoon
Main category: cs.LG
TL;DR: Enhanced weather nowcasting model SmaAT-QMix-UNet combines vector quantization bottleneck and mixed kernel convolutions to improve precipitation prediction while reducing model size.
Details
Motivation: Traditional Numerical Weather Prediction (NWP) systems are computationally intensive, while recent deep learning models show promise for nowcasting. The paper aims to develop a more efficient and accurate precipitation nowcasting model.
Method: Proposes SmaAT-QMix-UNet, an enhanced variant of SmaAT-UNet with two key innovations: 1) vector quantization (VQ) bottleneck at the encoder-decoder bridge for compression, and 2) mixed kernel depth-wise convolutions (MixConv) replacing selected encoder/decoder blocks to capture multi-scale features.
Result: Model trained on Dutch radar precipitation dataset (2016-2019) for 30-minute ahead prediction. Three configurations benchmarked: VQ-only, MixConv-only, and full SmaAT-QMix-UNet. Grad-CAM saliency maps identify influential regions, and UMAP embedding shows VQ layer clustering encoder outputs.
Conclusion: SmaAT-QMix-UNet reduces model size while improving nowcasting performance, demonstrating the effectiveness of combining VQ compression with multi-scale feature extraction for precipitation prediction.
Abstract: Weather forecasting supports critical socioeconomic activities and complements environmental protection, yet operational Numerical Weather Prediction (NWP) systems remain computationally intensive, thus being inefficient for certain applications. Meanwhile, recent advances in deep data-driven models have demonstrated promising results in nowcasting tasks. This paper presents SmaAT-QMix-UNet, an enhanced variant of SmaAT-UNet that introduces two key innovations: a vector quantization (VQ) bottleneck at the encoder-decoder bridge, and mixed kernel depth-wise convolutions (MixConv) replacing selected encoder and decoder blocks. These enhancements both reduce the model’s size and improve its nowcasting performance. We train and evaluate SmaAT-QMix-UNet on a Dutch radar precipitation dataset (2016-2019), predicting precipitation 30 minutes ahead. Three configurations are benchmarked: using only VQ, only MixConv, and the full SmaAT-QMix-UNet. Grad-CAM saliency maps highlight the regions influencing each nowcast, while a UMAP embedding of the codewords illustrates how the VQ layer clusters encoder outputs. The source code for SmaAT-QMix-UNet is publicly available on GitHub: https://github.com/nstavr04/MasterThesisSnellius.
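A VQ bottleneck snaps each latent vector to its nearest codeword; a minimal NumPy sketch, with codebook size and latent dimension chosen arbitrarily (the straight-through gradient estimator used during training is omitted):

```python
import numpy as np

rng = np.random.default_rng(5)
K, D = 16, 32                      # codebook size, latent dim (illustrative)
codebook = rng.normal(size=(K, D))

def vector_quantize(z):
    """Replace each latent vector with its nearest codeword under L2
    distance, as in a VQ bottleneck at the encoder-decoder bridge."""
    # (N, K) squared distances between latents and codewords
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d2.argmin(axis=1)
    return codebook[idx], idx

z = rng.normal(size=(8, D))        # a batch of encoder outputs
zq, idx = vector_quantize(z)
```

Because every quantized latent is one of `K` codewords, the bridge compresses each latent to a single index, which is also what makes the UMAP clustering of codewords mentioned above possible.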
[939] SparseDVFS: Sparse-Aware DVFS for Energy-Efficient Edge Inference
Ziyang Zhang, Zheshun Wu, Jie Liu, Luca Mottola
Main category: cs.LG
TL;DR: SparseDVFS: A fine-grained, sparse-aware DVFS framework for energy-efficient edge inference that uses operator sparsity as a metric for hardware frequency modulation to overcome switching latency challenges.
Details
Motivation: Deploying DNNs on power-sensitive edge devices is challenging. Traditional DVFS approaches are either too coarse (model-level) or suffer from performance degradation due to hardware switching latency (operator-level). There's a need for a fine-grained DVFS approach that can handle intra-inference variations without prohibitive switching overheads.
Method: SparseDVFS uses operator sparsity as the primary metric for hardware frequency modulation. It distinguishes between compute-bound dense operators and memory-bound sparse operators, applying specialized frequency triplets (CPU/GPU/EMC). The framework includes: 1) offline modeler for deterministic mapping between operator sparsity and optimal frequencies, 2) runtime graph partitioner using greedy merging to create super-blocks that balance granularity and switching latency, and 3) unified co-governor with FUSE and look-ahead instruction queue to eliminate controller interference and hide transition latencies.
Result: SparseDVFS achieves an average 78.17% energy efficiency gain over state-of-the-art solutions while maintaining a superior 14% cost-gain ratio.
Conclusion: SparseDVFS provides an effective fine-grained DVFS framework for edge inference that overcomes traditional switching latency challenges through sparse-aware frequency modulation and innovative hardware coordination techniques, significantly improving energy efficiency for DNN deployment on edge devices.
Abstract: Deploying deep neural networks (DNNs) on power-sensitive edge devices presents a formidable challenge. While Dynamic Voltage and Frequency Scaling (DVFS) is widely employed for energy optimization, traditional model-level scaling is often too coarse to capture intra-inference variations, whereas fine-grained operator-level scaling suffers from prohibitive performance degradation due to significant hardware switching latency. This paper presents SparseDVFS, a fine-grained, sparse-aware DVFS framework designed for energy-efficient edge inference. Our key insight is that operator sparsity is a primary metric for hardware frequency modulation. By distinguishing between compute-bound dense operators and memory-bound sparse operators, the system can apply specialized frequency triplets to maximize energy efficiency. To overcome switching overheads and component interference, SparseDVFS incorporates three key innovations: (1) an offline modeler that establishes a deterministic mapping between operator sparsity and optimal frequency triplets (CPU/GPU/EMC) via white-box timeline analysis; (2) a runtime graph partitioner that utilizes a greedy merging heuristic to aggregate operators into super-blocks, balancing scaling granularity and DVFS switching latency through a latency amortization constraint; and (3) a unified co-governor that employs a frequency unified scaling engine (FUSE) and a look-ahead instruction queue to eliminate antagonistic effects between independent controllers and hide hardware transition latencies. Extensive evaluations show that SparseDVFS achieves an average 78.17% energy efficiency gain over state-of-the-art solutions while maintaining a superior 14% cost-gain ratio.
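The sparsity-to-frequency mapping and super-block merging can be sketched as follows; the frequency triplets, the 0.5 sparsity threshold, and the greedy merge rule are invented placeholders for the paper's profiled mapping and latency-amortization constraint:

```python
def assign_triplet(sparsity, threshold=0.5):
    """Map operator sparsity to a (CPU, GPU, EMC) frequency triplet in MHz.
    Sparse operators are treated as memory-bound: boost memory, relax GPU."""
    if sparsity >= threshold:
        return (1500, 700, 2133)    # memory-bound profile
    return (1200, 1300, 1600)       # compute-bound profile

def merge_superblocks(sparsities):
    """Greedily merge consecutive operators that share a frequency profile
    into super-blocks, amortizing DVFS switching latency."""
    blocks = []
    for s in sparsities:
        trip = assign_triplet(s)
        if blocks and blocks[-1][0] == trip:
            blocks[-1][1] += 1        # extend the current super-block
        else:
            blocks.append([trip, 1])  # start a new super-block
    return blocks

ops = [0.1, 0.2, 0.8, 0.9, 0.7, 0.1]   # per-operator sparsity along the graph
blocks = merge_superblocks(ops)
```

Six operators collapse into three super-blocks here, so the governor issues three frequency switches instead of six.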
[940] Deep Reinforcement Learning and The Tale of Two Temporal Difference Errors
Juan Sebastian Rojas, Chi-Guhn Lee
Main category: cs.LG
TL;DR: The paper reveals that two historical interpretations of TD error (difference between successive predictions vs. bootstrapped target minus prediction) are not equivalent in deep RL, affecting algorithm performance.
Details
Motivation: To investigate whether two established interpretations of temporal difference (TD) error are actually equivalent in modern deep reinforcement learning settings, given their interchangeable use in literature.
Method: Theoretical analysis and empirical evaluation showing how increasingly-nonlinear deep RL architectures cause divergence between the two TD error interpretations, with experiments on deep differential RL methods.
Result: Found that the two interpretations yield increasingly different numerical values in deep RL, and choice of interpretation affects performance of algorithms using TD error, particularly in deep differential RL.
Conclusion: The standard interpretation of TD error as bootstrapped target minus prediction doesn’t always hold in deep RL, requiring careful consideration of which interpretation to use in algorithm design.
Abstract: The temporal difference (TD) error was first formalized in Sutton (1988), where it was first characterized as the difference between temporally successive predictions, and later, in that same work, formulated as the difference between a bootstrapped target and a prediction. Since then, these two interpretations of the TD error have been used interchangeably in the literature, with the latter eventually being adopted as the standard critic loss in deep reinforcement learning (RL) architectures. In this work, we show that these two interpretations of the TD error are not always equivalent. In particular, we show that increasingly-nonlinear deep RL architectures can cause these interpretations of the TD error to yield increasingly different numerical values. Then, building on this insight, we show how choosing one interpretation of the TD error over the other can affect the performance of deep RL algorithms that utilize the TD error to compute other quantities, such as with deep differential (i.e., average-reward) RL methods. All in all, our results show that the default interpretation of the TD error as the difference between a bootstrapped target and a prediction does not always hold in deep RL settings.
[941] BOOST-RPF: Boosted Sequential Trees for Radial Power Flow
Ehimare Okoyomon, Christoph Goebel
Main category: cs.LG
TL;DR: BOOST-RPF reformulates power flow voltage prediction from global graph regression to sequential path-based learning using gradient-boosted decision trees, achieving superior generalization and computational efficiency.
Details
Motivation: Classical power flow solvers have scalability issues, and current machine learning models struggle with generalization across different distribution system topologies and configurations.
Method: Decomposes radial power networks into root-to-leaf paths and uses gradient-boosted decision trees (XGBoost) to model local voltage-drop regularities. Evaluates three variants: Absolute Voltage, Parent Residual, and Physics-Informed Residual.
Result: Achieves state-of-the-art results on Kerber Dorfnetz grid and ENGAGE suite benchmarks. Parent Residual variant consistently outperforms analytical and neural baselines in accuracy and generalization, with linear O(N) computational scaling.
Conclusion: BOOST-RPF offers a scalable, generalizable alternative for real-time distribution system operator applications by aligning model architecture with recursive physics of power flow, ensuring size-agnostic application and superior out-of-distribution robustness.
Abstract: Accurate power flow analysis is critical for modern distribution systems, yet classical solvers face scalability issues, and current machine learning models often struggle with generalization. We introduce BOOST-RPF, a novel method that reformulates voltage prediction from a global graph regression task into a sequential path-based learning problem. By decomposing radial networks into root-to-leaf paths, we leverage gradient-boosted decision trees (XGBoost) to model local voltage-drop regularities. We evaluate three architectural variants: Absolute Voltage, Parent Residual, and Physics-Informed Residual. This approach aligns the model architecture with the recursive physics of power flow, ensuring size-agnostic application and superior out-of-distribution robustness. Benchmarked against the Kerber Dorfnetz grid and the ENGAGE suite, BOOST-RPF achieves state-of-the-art results with its Parent Residual variant which consistently outperforms both analytical and neural baselines in standard accuracy and generalization tasks. While global Multi-Layer Perceptrons (MLPs) and Graph Neural Networks (GNNs) often suffer from performance degradation under topological shifts, BOOST-RPF maintains high precision across unseen feeders. Furthermore, the framework displays linear $O(N)$ computational scaling and significantly increased sample efficiency through per-edge supervision, offering a scalable and generalizable alternative for real-time distribution system operator (DSO) applications.
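The root-to-leaf decomposition that turns a radial grid into sequential learning problems can be sketched with a plain DFS over the feeder tree; the toy topology below is illustrative, not one of the benchmark grids:

```python
def root_to_leaf_paths(children, root=0):
    """Enumerate all root-to-leaf paths of a radial (tree) network by DFS.
    `children` maps a bus index to its downstream buses."""
    paths, stack = [], [[root]]
    while stack:
        path = stack.pop()
        kids = children.get(path[-1], [])
        if not kids:
            paths.append(path)       # reached a leaf: one training sequence
        for k in kids:
            stack.append(path + [k])
    return paths

# Toy radial feeder: bus 0 is the slack/root bus.
children = {0: [1], 1: [2, 3], 3: [4]}
paths = root_to_leaf_paths(children)
```

Each path becomes an independent sequence of parent-to-child voltage-drop steps, which is what lets a per-edge regressor apply to feeders of any size.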
[942] TREX: Trajectory Explanations for Multi-Objective Reinforcement Learning
Dilina Rajapakse, Juan C. Rosero, Ivana Dusparic
Main category: cs.LG
TL;DR: TREX: A trajectory-based explainability framework for multi-objective reinforcement learning policies that analyzes behavioral patterns across different user preferences
Details
Motivation: Multi-objective RL handles conflicting objectives but lacks explainability for trade-off decisions. Current XRL methods only work with single scalar rewards and don't explain distinct objectives or user preferences.
Method: Generates trajectories from learned expert policy across user preferences, clusters them into semantic temporal segments, trains complementary policies excluding specific clusters, and measures relative deviation in rewards/actions
Result: Experiments on multi-objective MuJoCo environments (HalfCheetah, Ant, Swimmer) demonstrate ability to isolate and quantify specific behavioral patterns
Conclusion: TREX provides explainability for multi-objective RL policies by analyzing trajectory segments and their influence on Pareto trade-offs
Abstract: Reinforcement Learning (RL) has demonstrated its ability to solve complex decision-making problems in a variety of domains by optimizing reward signals obtained through interaction with an environment. However, many real-world scenarios involve multiple, potentially conflicting objectives that cannot be easily represented by a single scalar reward. Multi-Objective Reinforcement Learning (MORL) addresses this limitation by enabling agents to optimize several objectives simultaneously, explicitly reasoning about trade-offs between them. However, the "black box" nature of the RL models makes the decision process behind chosen objective trade-offs unclear. Current Explainable Reinforcement Learning (XRL) methods are typically designed for single scalar rewards and do not account for explanations with respect to distinct objectives or user preferences. To address this gap, in this paper we propose TREX, a Trajectory-based Explainability framework to explain Multi-Objective Reinforcement Learning policies, based on trajectory attribution. TREX generates trajectories directly from the learned expert policy across different user preferences and clusters them into semantically meaningful temporal segments. We quantify the influence of these behavioural segments on the Pareto trade-off by training complementary policies that exclude specific clusters, measuring the resulting relative deviation on the observed rewards and actions compared to the original expert policy. Experiments on multi-objective MuJoCo environments (HalfCheetah, Ant, and Swimmer) demonstrate the framework’s ability to isolate and quantify specific behavioural patterns.
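The attribution step reduces to a simple comparison: run the expert policy and a complementary policy trained without a given behaviour cluster, then measure the relative change in per-objective returns. A minimal sketch of that influence score (the function name, epsilon guard, and toy objective values are assumptions, not the paper's code):

```python
import numpy as np

def segment_influence(expert_rewards, complement_rewards):
    """Relative deviation in per-objective returns when a behaviour
    cluster is excluded from training (hypothetical influence score)."""
    expert = np.asarray(expert_rewards, dtype=float)
    compl = np.asarray(complement_rewards, dtype=float)
    return np.abs(expert - compl) / (np.abs(expert) + 1e-8)

# Two objectives (e.g. speed vs. energy): removing a hypothetical
# "sprint" cluster hurts the speed objective far more than energy use.
expert = [100.0, -20.0]
without_sprint = [60.0, -18.0]
influence = segment_influence(expert, without_sprint)
```

A large score on one objective and a small one on the other is exactly the kind of explanation TREX targets: the excluded segment mattered for that objective's side of the Pareto trade-off.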
[943] λ-GELU: Learning Gating Hardness for Controlled ReLU-ization in Deep Networks
Cristian Pérez-Corral, Alberto Fernández-Hernández, Jose I. Mestre, Manuel F. Dolz, Enrique S. Quintana-Ortí
Main category: cs.LG
TL;DR: λ-GELU is a hardness-parameterized version of GELU activation that bridges smooth training with ReLU-compatible models through controlled gate sharpness adjustment
Details
Motivation: GELU is widely used but many deployment, compression, and analysis toolchains are designed for piecewise-linear (ReLU-type) networks. There's a need to bridge smooth gated training with ReLU-compatible models for better downstream compatibility.
Method: Introduces λ-GELU: f(x;λ)=xΦ(λx) where Φ is Gaussian CDF and λ∈[1,∞) controls gate sharpness. Develops constrained reparameterization and optimizer-aware update scheme for stable learning of λ. Studies deterministic ReLU-ization strategy to progressively harden gates toward ReLU.
Result: Across MLPs, CNNs, and Transformers on diverse datasets, observes structured layerwise hardness profiles. Shows robustness under different initializations. Enables post-training substitution of λ-GELU by ReLU with reduced disruption through controlled hardening.
Conclusion: λ-GELU provides minimal and interpretable control over gating hardness, bridging smooth training with ReLU-centric downstream pipelines while maintaining training stability and enabling controlled transition to ReLU-compatible models.
Abstract: Gaussian Error Linear Unit (GELU) is a widely used smooth alternative to the Rectified Linear Unit (ReLU), yet many deployment, compression, and analysis toolchains are most naturally expressed for piecewise-linear (ReLU-type) networks. We study a hardness-parameterized formulation of GELU, f(x;λ) = xΦ(λx), where Φ is the Gaussian CDF and λ ∈ [1, ∞) controls gate sharpness, with the goal of turning smooth gated training into a controlled path toward ReLU-compatible models. Learning λ is non-trivial: naive updates yield unstable dynamics and effective gradient attenuation, so we introduce a constrained reparameterization and an optimizer-aware update scheme. Empirically, across a diverse set of model–dataset pairs spanning MLPs, CNNs, and Transformers, we observe structured layerwise hardness profiles and assess their robustness under different initializations. We further study a deterministic ReLU-ization strategy in which the learned gates are progressively hardened toward a principled target, enabling a post-training substitution of λ-GELU by ReLU with reduced disruption. Overall, λ-GELU provides a minimal and interpretable knob to profile and control gating hardness, bridging smooth training with ReLU-centric downstream pipelines.
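The definition f(x;λ) = xΦ(λx) is concrete enough to implement directly: λ = 1 is exactly GELU, and as λ grows the Gaussian gate sharpens toward a step, recovering ReLU in the limit. A minimal scalar sketch (plain Python, not the paper's trainable PyTorch implementation):

```python
import math

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def lambda_gelu(x, lam=1.0):
    """λ-GELU: f(x; λ) = x · Φ(λx).
    λ = 1 gives exact GELU; λ → ∞ hardens the gate toward ReLU."""
    return x * phi(lam * x)

# λ = 1 matches GELU; a very large λ is numerically indistinguishable
# from ReLU: ~0 for negative inputs, ~x for positive inputs.
g = lambda_gelu(1.0, 1.0)          # ≈ 0.8413 = Φ(1)
hard_pos = lambda_gelu(2.0, 1e3)   # ≈ 2.0
hard_neg = lambda_gelu(-2.0, 1e3)  # ≈ 0.0
```

This makes the paper's "ReLU-ization" path tangible: progressively increasing λ post-training lets ReLU be substituted with little change in the computed function.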
[944] CRPS-Optimal Binning for Conformal Regression
Paolo Toccaceli
Main category: cs.LG
TL;DR: A non-parametric method for conditional distribution estimation using covariate-sorted binning with optimal partition selection via LOO-CRPS minimization, producing predictive distributions, Venn prediction bands, and conformal prediction sets with coverage guarantees.
Details
Motivation: To develop a flexible non-parametric approach for conditional distribution estimation that produces accurate predictive distributions and narrow prediction intervals with guaranteed coverage, addressing limitations of existing conformal prediction methods.
Method: Sort observations by covariates, partition into contiguous bins minimizing leave-one-out Continuous Ranked Probability Score (LOO-CRPS) via dynamic programming. Select optimal number of bins K* using alternating held-out split validation. Generate two predictive objects: Venn prediction bands and conformal prediction sets using CRPS as nonconformity score.
Result: The method produces substantially narrower prediction intervals while maintaining near-nominal coverage compared to split-conformal competitors (Gaussian split conformal, CQR, and CQR-QRF) on real benchmarks.
Conclusion: The proposed non-parametric conditional distribution estimation method effectively combines optimal binning with conformal prediction to yield accurate predictive distributions with guaranteed coverage and narrower intervals than existing approaches.
Abstract: We propose a method for non-parametric conditional distribution estimation based on partitioning covariate-sorted observations into contiguous bins and using the within-bin empirical CDF as the predictive distribution. Bin boundaries are chosen to minimise the total leave-one-out Continuous Ranked Probability Score (LOO-CRPS), which admits a closed-form cost function with $O(n^2 \log n)$ precomputation and $O(n^2)$ storage; the globally optimal $K$-partition is recovered by a dynamic programme in $O(n^2 K)$ time. Minimisation of within-sample LOO-CRPS turns out to be inappropriate for selecting $K$, as it results in in-sample optimism; we instead select $K$ by evaluating test CRPS on an alternating held-out split, which yields a U-shaped criterion with a well-defined minimum. Having selected $K^*$ and fitted the full-data partition, we form two complementary predictive objects: the Venn prediction band and a conformal prediction set based on CRPS as the nonconformity score, which carries a finite-sample marginal coverage guarantee at any prescribed level $\varepsilon$. On real benchmarks against split-conformal competitors (Gaussian split conformal, CQR, and CQR-QRF), the method produces substantially narrower prediction intervals while maintaining near-nominal coverage.
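The $O(n^2 K)$ dynamic programme is the standard one for optimal contiguous partitioning once a per-segment cost is available. A sketch, using within-bin variance as a stand-in for the paper's closed-form LOO-CRPS segment cost (the cost function, toy data, and function names here are illustrative assumptions):

```python
import math

def optimal_k_partition(cost, n, K):
    """Globally optimal split of items 0..n-1 into K contiguous bins,
    minimising total cost; cost(i, j) prices the bin covering items
    i..j-1.  Runs in O(n^2 K), matching the paper's dynamic programme."""
    INF = math.inf
    dp = [[INF] * (K + 1) for _ in range(n + 1)]
    cut = [[-1] * (K + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for k in range(1, K + 1):
        for j in range(1, n + 1):
            for i in range(k - 1, j):
                c = dp[i][k - 1] + cost(i, j)
                if c < dp[j][k]:
                    dp[j][k], cut[j][k] = c, i
    # Backtrack the bin boundaries.
    bounds, j = [], n
    for k in range(K, 0, -1):
        i = cut[j][k]
        bounds.append((i, j))
        j = i
    return dp[n][K], bounds[::-1]

# Toy responses sorted by covariate: two clearly separated regimes.
y = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8]
def var_cost(i, j):
    seg = y[i:j]
    m = sum(seg) / len(seg)
    return sum((v - m) ** 2 for v in seg)

total, bins = optimal_k_partition(var_cost, len(y), 2)
```

With the real LOO-CRPS cost precomputed in closed form, this same recursion recovers the globally optimal $K$-partition, and the held-out CRPS curve over $K$ is then used to pick $K^*$.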
[945] AdditiveLLM2: A Multi-modal Large Language Model for Additive Manufacturing
Peter Pak, Amir Barati Farimani
Main category: cs.LG
TL;DR: AdditiveLLM2 is a multimodal domain-adapted LLM based on Gemma 3, specialized for additive manufacturing using 50M tokens of domain-specific data, achieving over 90% accuracy on AM tasks.
Details
Motivation: To create a specialized multimodal LLM for the additive manufacturing domain that can handle both language and vision tasks, demonstrating an accessible method for domain adaptation of large language models.
Method: Built upon instruction-tuned Gemma 3 model using domain-adaptive pretraining and visual instruction tuning with AdditiveLLM2-OA dataset (50M tokens from open-access AM journal articles). Evaluated using Additive-Manufacturing-Benchmark with domain-specific tasks.
Result: AdditiveLLM2 exhibits proficiency in both language and vision tasks, achieving accuracies upwards of 90% in general additive manufacturing knowledge, demonstrating effective domain specialization.
Conclusion: The domain adaptive pretraining and instruction tuning strategy provides an accessible specialization method for LLMs to domains like additive manufacturing, enabling effective multimodal understanding.
Abstract: This work presents AdditiveLLM2, a multi-modal, domain-adapted large language model built upon the instruction-tuned variant of the Gemma 3 model using a relatively small dataset of around 50 million tokens. The dataset (AdditiveLLM2-OA) consists of open-access additive manufacturing journal articles with data extracted for the domain-adaptive pretraining and visual instruction tuning processes. Various stages of the developed model are evaluated with the Additive-Manufacturing-Benchmark, which consists of additive manufacturing domain-specific tasks compiled from published resources. AdditiveLLM2 exhibits proficiency in both language- and vision-based tasks, achieving accuracies upwards of 90% in general additive manufacturing knowledge. This domain-adaptive pretraining and instruction tuning strategy outlines an accessible method for specializing large language models to a domain such as additive manufacturing.
[946] Do Papers Match Code? A Benchmark and Framework for Paper-Code Consistency Detection in Bioinformatics Software
Tianxiang Xu, Xiaoyan Zhu, Xin Lai, Sizhe Dang, Xin Lian, Hangyu Cheng, Jiayin Wang
Main category: cs.LG
TL;DR: Paper introduces BioCon benchmark for detecting consistency between research papers and code implementations in bioinformatics, proposing a cross-modal framework for semantic alignment between natural language descriptions and code snippets.
Details
Motivation: Addresses the underexplored problem of discrepancies between methodological descriptions in research papers and their actual code implementations, particularly prevalent in bioinformatics, which affects software reliability and scientific reproducibility.
Method: Creates BioCon benchmark with 48 bioinformatics software projects, aligns sentence-level algorithmic descriptions with function-level code snippets, uses expert annotations and hybrid negative sampling. Proposes cross-modal consistency detection framework with unified input representation, pre-trained models for semantic alignment, and weighted focal loss to handle class imbalance and hard samples.
Result: Framework achieves accuracy of 0.9056 and F1 score of 0.8011 in identifying consistency between papers and code in bioinformatics, demonstrating effective cross-modal semantic alignment.
Conclusion: Opens new research direction for paper-code consistency analysis, lays foundation for automated reproducibility assessment and cross-modal understanding in scientific software, with potential applications beyond bioinformatics.
Abstract: Ensuring consistency between research papers and their corresponding software implementations is fundamental to software reliability and scientific reproducibility. However, this problem remains underexplored, particularly in the domain of bioinformatics, where discrepancies between methodological descriptions in papers and their actual code implementations are prevalent. To address this gap, this paper introduces a new task, namely paper-code consistency detection, and curates a collection of 48 bioinformatics software projects along with their associated publications. We systematically align sentence-level algorithmic descriptions from papers with function-level code snippets. Combined with expert annotations and a hybrid negative sampling strategy, we construct the first benchmark dataset in the bioinformatics domain tailored to this task, termed BioCon. Based on this benchmark, we further propose a cross-modal consistency detection framework designed to model the semantic relationships between natural language descriptions and code implementations. The framework adopts a unified input representation and leverages pre-trained models to capture deep semantic alignment between papers and code. To mitigate the effects of class imbalance and hard samples, we incorporate a weighted focal loss to enhance model robustness. Experimental results demonstrate that our framework effectively identifies consistency between papers and code in bioinformatics, achieving an accuracy of 0.9056 and an F1 score of 0.8011. Overall, this study opens a new research direction for paper-code consistency analysis and lays the foundation for automated reproducibility assessment and cross-modal understanding in scientific software.
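The weighted focal loss mentioned for class imbalance and hard samples has a standard binary form: the $(1 - p_t)^\gamma$ factor down-weights confidently correct examples, while $\alpha$ rebalances the classes. A minimal sketch (the α and γ values are generic defaults, not the paper's settings):

```python
import numpy as np

def weighted_focal_loss(p, y, alpha=0.75, gamma=2.0):
    """Binary weighted focal loss: -alpha_t * (1 - p_t)^gamma * log(p_t).
    p: predicted probability of the 'consistent' class; y: 0/1 labels."""
    p = np.clip(np.asarray(p, float), 1e-7, 1 - 1e-7)
    y = np.asarray(y, float)
    p_t = np.where(y == 1, p, 1 - p)            # prob. of the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return float(np.mean(-alpha_t * (1 - p_t) ** gamma * np.log(p_t)))

# An easy, confidently correct prediction contributes almost nothing,
# while a hard (low-confidence) positive dominates the loss.
easy = weighted_focal_loss([0.95], [1])
hard = weighted_focal_loss([0.30], [1])
```

This is why focal loss suits paper-code consistency detection: with hybrid negative sampling, most negatives are easy, and the modulating factor keeps the model focused on the hard, near-boundary pairs.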
[947] On the Interplay of Priors and Overparametrization in Bayesian Neural Network Posteriors
Julius Kobialka, Emanuel Sommer, Chris Kolb, Juntae Kwon, Daniel Dold, David Rügamer
Main category: cs.LG
TL;DR: Overparametrization in Bayesian neural networks induces structured, prior-aligned weight posterior distributions through balancedness, weight reallocation, and prior conformity phenomena.
Details
Motivation: BNN posteriors are often considered impractical due to symmetries fragmenting them, non-identifiabilities inflating dimensionality, and weight-space priors being seen as meaningless. The paper aims to understand how overparametrization and priors together reshape BNN posteriors.
Method: Theoretical analysis of how redundancy in overparametrized networks introduces three key phenomena: balancedness, weight reallocation on equal-probability manifolds, and prior conformity. Validated through extensive experiments with posterior sampling budgets exceeding earlier works.
Result: Overparametrization induces structured, prior-aligned weight posterior distributions. The identified phenomena fundamentally reshape posterior geometry and provide better understanding of the interplay between overparametrization and priors.
Conclusion: The work provides new insights into BNN posterior geometry, showing how overparametrization combined with priors leads to structured, interpretable posterior distributions, addressing practical concerns about BNN inference.
Abstract: Bayesian neural network (BNN) posteriors are often considered impractical for inference, as symmetries fragment them, non-identifiabilities inflate dimensionality, and weight-space priors are seen as meaningless. In this work, we study how overparametrization and priors together reshape BNN posteriors and derive implications allowing us to better understand their interplay. We show that redundancy introduces three key phenomena that fundamentally reshape the posterior geometry: balancedness, weight reallocation on equal-probability manifolds, and prior conformity. We validate our findings through extensive experiments with posterior sampling budgets that far exceed those of earlier works, and demonstrate how overparametrization induces structured, prior-aligned weight posterior distributions.
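The non-identifiability at the heart of phenomena like balancedness is easy to see concretely: a two-layer ReLU network computes the same function when one layer is scaled up and the next scaled down, so weight space contains entire equal-function manifolds. A toy demonstration (the network sizes and scaling constant are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 4))   # first-layer weights
W2 = rng.normal(size=(1, 8))   # second-layer weights

def net(x, W1, W2):
    """Two-layer ReLU network (no biases, for simplicity)."""
    return W2 @ np.maximum(W1 @ x, 0.0)

x = rng.normal(size=4)
c = 3.0
out_a = net(x, W1, W2)
out_b = net(x, c * W1, W2 / c)   # rescaled weights, identical function
```

Because ReLU is positively homogeneous, (cW1, W2/c) and (W1, W2) define the same function for any c > 0; the paper's point is that the weight-space prior is not indifferent along such manifolds, which is what pulls the posterior toward balanced, prior-conforming configurations.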
[948] On the Failure of Topic-Matched Contrast Baselines in Multi-Directional Refusal Abliteration
Valentin Petrov
Main category: cs.LG
TL;DR: Topic-matched contrast baselines fail to produce functional refusal directions in language model safety ablation, while unmatched baselines successfully eliminate refusal behavior.
Details
Motivation: Existing literature treats contrast baseline construction as an implementation detail rather than a methodological concern in directional ablation for removing refusal behavior from language models. This work investigates whether topically matched contrast baselines yield superior refusal directions compared to unmatched baselines.
Method: Used the Qwen 3.5 2B model with per-category matched prompt pairs, per-class Self-Organizing Map extraction, and Singular Value Decomposition orthogonalization. Compared topic-matched vs. unmatched contrast baselines for extracting refusal-mediating directions from residual stream activation space.
Result: Topic-matched contrast produced no functional refusal directions at any tested weight level on any layer, while unmatched contrast on the same model, extraction code, and evaluation protocol achieved complete refusal elimination on six layers. Geometric analysis showed topic-matched subtraction cancels dominant activation components shared between harmful and harmless prompts.
Conclusion: Topic-matched contrast baselines are ineffective for refusal direction extraction because they cancel shared activation components, reducing direction magnitude below the threshold needed to perturb the residual stream. This has important implications for contrast baseline design in ablation research.
Abstract: Inasmuch as the removal of refusal behavior from instruction-tuned language models by directional abliteration requires the extraction of refusal-mediating directions from the residual stream activation space, and inasmuch as the construction of the contrast baseline against which harmful prompt activations are compared has been treated in the existing literature as an implementation detail rather than a methodological concern, the present work investigates whether a topically matched contrast baseline yields superior refusal directions. The investigation is carried out on the Qwen 3.5 2B model using per-category matched prompt pairs, per-class Self-Organizing Map extraction, and Singular Value Decomposition orthogonalization. It was found that topic-matched contrast produces no functional refusal directions at any tested weight level on any tested layer, while unmatched contrast on the same model, same extraction code, and same evaluation protocol achieves complete refusal elimination on six layers. The geometric analysis of the failure establishes that topic-matched subtraction cancels the dominant activation component shared between harmful and harmless prompts of the same subject, reducing the extracted direction magnitude below the threshold at which weight-matrix projection perturbs the residual stream. The implications for the design of contrast baselines in abliteration research are discussed.
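The core abliteration recipe is a difference-of-means direction followed by projecting that direction out of the weights; the paper's finding is that topic-matched baselines shrink the mean difference because the shared topic component cancels. A generic sketch of the two steps (synthetic activations, simplified single-direction ablation, not the paper's SOM/SVD pipeline):

```python
import numpy as np

def refusal_direction(harmful_acts, baseline_acts):
    """Unit-normalised difference-of-means between harmful-prompt and
    contrast-baseline residual-stream activations."""
    d = harmful_acts.mean(axis=0) - baseline_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def ablate(W, d):
    """Remove the component of each weight row along direction d."""
    return W - np.outer(W @ d, d)

rng = np.random.default_rng(0)
# Synthetic activations: harmful prompts shifted along the first axis.
harmful = rng.normal(size=(32, 16)) + 2.0 * np.eye(16)[0]
baseline = rng.normal(size=(32, 16))
d = refusal_direction(harmful, baseline)
W = rng.normal(size=(16, 16))
W_abl = ablate(W, d)                 # W_abl @ d == 0 by construction
```

If the baseline prompts carried the same topic shift as the harmful ones, the mean difference (and hence the extracted direction's magnitude before normalisation) would collapse toward noise, which is the geometric failure mode the abstract describes.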
[949] MIHT: A Hoeffding Tree for Time Series Classification using Multiple Instance Learning
Aurora Esteban, Amelia Zafra, Sebastián Ventura
Main category: cs.LG
TL;DR: MIHT algorithm uses multi-instance learning with bags of subseries and incremental decision trees to classify multivariate, variable-length time series with interpretability.
Details
Motivation: Existing time series classification models struggle with variable-length or high-dimensional series, lacking interpretability for complex temporal data.
Method: Represents time series as “bags of subseries,” uses incremental decision trees (Hoeffding trees) to distinguish relevant parts from noise, creating interpretable white-box models.
Result: Outperforms 11 state-of-the-art time series classification models on 28 public datasets, including high-dimensional ones, while providing interpretable insights.
Conclusion: MIHT offers superior accuracy and interpretability for complex, dynamic time series data, making it a promising solution for real-world temporal classification problems.
Abstract: Due to the prevalence of temporal data and its inherent dependencies in many real-world problems, time series classification is of paramount importance in various domains. However, existing models often struggle with series of variable length or high dimensionality. This paper introduces the MIHT (Multi-instance Hoeffding Tree) algorithm, an efficient model that uses multi-instance learning to classify multivariate and variable-length time series while providing interpretable results. The algorithm uses a novel representation of time series as “bags of subseries,” together with an optimization process based on incremental decision trees that distinguish relevant parts of the series from noise. This methodology extracts the underlying concept of series with multiple variables and variable lengths. The generated decision tree is a compact, white-box representation of the series’ concept, providing interpretability insights into the most relevant variables and segments of the series. Experimental results demonstrate MIHT’s superiority, as it outperforms 11 state-of-the-art time series classification models on 28 public datasets, including high-dimensional ones. MIHT offers enhanced accuracy and interpretability, making it a promising solution for handling complex, dynamic time series data.
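The "bags of subseries" representation is what makes variable-length series tractable: each series becomes a bag of fixed-width windows (instances), so series of different lengths simply yield bags of different sizes. A minimal sketch of that step (window and stride values are illustrative; the Hoeffding-tree training on the bags is omitted):

```python
def bag_of_subseries(series, window, stride):
    """Represent a (possibly variable-length) series as a bag of
    fixed-width subseries -- the multi-instance view of the data."""
    return [series[i:i + window]
            for i in range(0, len(series) - window + 1, stride)]

# Series of different lengths produce bags of different sizes,
# but every instance has the same shape.
short = bag_of_subseries(list(range(10)), window=4, stride=3)
long_ = bag_of_subseries(list(range(25)), window=4, stride=3)
```

Under multi-instance learning, a bag's label depends on its relevant instances, which is how the tree can single out the discriminative segments of a series and ignore noisy ones.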
[950] On the Direction of RLVR Updates for LLM Reasoning: Identification and Exploitation
Kexin Huang, Haoming Meng, Junkang Wu, Jinda Lu, Chiyu Ma, Ziqian Chen, Xue Wang, Bolin Ding, Jiancan Wu, Xiang Wang, Xiangnan He, Guoyin Wang, Jingren Zhou
Main category: cs.LG
TL;DR: RLVR improves reasoning via sparse updates; direction (Δlog p) matters more than magnitude for identifying critical reasoning changes, enabling test-time extrapolation and training-time reweighting.
Details
Motivation: Existing analyses of RLVR focus on magnitude of updates but overlook direction, which may be more critical for understanding how RLVR improves reasoning capabilities in LLMs.
Method: Use signed token-level log probability difference (Δlog p) between base and final RLVR models; statistical analysis and token-replacement interventions; propose test-time extrapolation (amplifying policy along Δlog p direction) and training-time reweighting (focusing on low-probability/high Δlog p tokens).
Result: Δlog p better identifies sparse reasoning-critical updates than magnitude-based metrics; both proposed methods improve reasoning accuracy across models and benchmarks without additional training.
Conclusion: Direction of change (Δlog p) is key principle for analyzing and improving RLVR; enables practical applications for enhancing reasoning performance.
Abstract: Reinforcement learning with verifiable rewards (RLVR) has substantially improved the reasoning capabilities of large language models. While existing analyses identify that RLVR-induced changes are sparse, they primarily focus on the magnitude of these updates, largely overlooking their direction. In this work, we argue that the direction of updates is a more critical lens for understanding RLVR’s effects, which can be captured by the signed, token-level log probability difference $\Delta\log p$ between the base and final RLVR models. Through statistical analysis and token-replacement interventions, we demonstrate that $\Delta\log p$ more effectively identifies sparse, yet reasoning-critical updates than magnitude-based metrics (e.g., divergence or entropy). Building on this insight, we propose two practical applications: (1) a test-time extrapolation method that amplifies the policy along the learned $\Delta\log p$ direction to improve reasoning accuracy without further training; (2) a training-time reweighting method that focuses learning on low-probability (corresponding to higher $\Delta\log p$) tokens, which improves reasoning performance across models and benchmarks. Our work establishes the direction of change as a key principle for analyzing and improving RLVR.
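Test-time extrapolation along $\Delta\log p$ has a natural log-space form: score tokens with $\log p_{\text{base}} + \alpha \cdot \Delta\log p$, so $\alpha = 1$ recovers the RLVR model and $\alpha > 1$ amplifies its update direction. A toy sketch over a three-token vocabulary (the parameter name $\alpha$ and the exact combination rule are plausible assumptions, not the paper's published formula):

```python
import numpy as np

def log_softmax(logits):
    z = logits - logits.max()
    return z - np.log(np.exp(z).sum())

def extrapolate(base_logits, rlvr_logits, alpha=1.5):
    """Score tokens by log p_base + alpha * (log p_rlvr - log p_base).
    alpha = 1 recovers the RLVR model; alpha > 1 amplifies its
    update direction.  Returns unnormalised log-scores."""
    lp_base = log_softmax(base_logits)
    lp_rlvr = log_softmax(rlvr_logits)
    return lp_base + alpha * (lp_rlvr - lp_base)

base = np.array([2.0, 1.0, 0.5])
rlvr = np.array([2.0, 2.0, 0.5])   # RLVR raised token 1's probability
scores = extrapolate(base, rlvr, alpha=2.0)
```

Tokens whose probability RLVR raised (positive $\Delta\log p$) are boosted further still, which is the mechanism behind the training-free accuracy gains the paper reports.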
[951] Computationally lightweight classifiers with frequentist bounds on predictions
Shreeram Murali, Cristian R. Rojas, Dominik Baumann
Main category: cs.LG
TL;DR: A computationally efficient classification algorithm based on Nadaraya-Watson estimator that provides uncertainty bounds with O(n) or O(log n) complexity, evaluated on ECG data.
Details
Motivation: Existing classifiers lack uncertainty bounds for safety-critical applications, and kernel-based methods that do provide bounds are computationally expensive (O(n^3)), making them impractical for large datasets.
Method: Proposes a novel classification algorithm using the Nadaraya-Watson estimator, deriving frequentist uncertainty intervals for predictions while achieving linear or logarithmic computational complexity.
Result: Achieves competitive accuracy >96% on synthetic data and ECG heartbeat signals from MIT-BIH Arrhythmia database, with O(n) or O(log n) operations while providing actionable uncertainty bounds.
Conclusion: The method enables uncertainty-aware classification suitable for real-time safety-critical applications like diagnostic monitoring or implantable devices, overcoming computational limitations of existing approaches.
Abstract: While both classical and neural network classifiers can achieve high accuracy, they fall short of offering uncertainty bounds on their predictions, making them unfit for safety-critical applications. Existing kernel-based classifiers that provide such bounds scale with $\mathcal{O}(n^3)$ in time, making them computationally intractable for large datasets. To address this, we propose a novel, computationally efficient classification algorithm based on the Nadaraya-Watson estimator, for whose estimates we derive frequentist uncertainty intervals. We evaluate our classifier on synthetically generated data and on electrocardiographic heartbeat signals from the MIT-BIH Arrhythmia database. We show that the method achieves competitive accuracy above 96% at $\mathcal{O}(n)$ and $\mathcal{O}(\log n)$ operations, while providing actionable uncertainty bounds. These bounds can, e.g., aid in flagging low-confidence predictions, making them suitable for real-time settings with resource constraints, such as diagnostic monitoring or implantable devices.
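The Nadaraya-Watson classifier itself is a kernel-weighted vote: estimate P(y = c | x) as the kernel-weighted fraction of training points with label c. A naive $\mathcal{O}(n)$-per-query sketch (Gaussian kernel and bandwidth are generic choices; the paper's uncertainty intervals and $\mathcal{O}(\log n)$ variant are not reproduced here):

```python
import numpy as np

def nw_classify(X_train, y_train, x, h=1.0):
    """Nadaraya-Watson class-probability estimate with a Gaussian kernel:
    P(y = c | x) = sum_i K(x, x_i) 1[y_i = c] / sum_i K(x, x_i)."""
    d2 = np.sum((X_train - x) ** 2, axis=1)
    w = np.exp(-d2 / (2 * h ** 2))
    classes = np.unique(y_train)
    probs = np.array([w[y_train == c].sum() for c in classes]) / w.sum()
    return classes[np.argmax(probs)], probs

# Two well-separated clusters: a query near a cluster gets its label
# with probability close to 1.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
y = np.array([0, 0, 1, 1])
label, probs = nw_classify(X, y, np.array([0.05, 0.1]))
```

The returned probability vector is the quantity around which frequentist intervals can be placed: when the interval for the top class is wide or overlaps another class, the prediction can be flagged as low-confidence.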
[952] dynActivation: A Trainable Activation Family for Adaptive Nonlinearity
Alois Bachmann
Main category: cs.LG
TL;DR: dynActivation is a trainable activation function that learns to interpolate between nonlinear and linear paths, improving training efficiency and performance across vision and language tasks.
Details
Motivation: The paper aims to address limitations of static activation functions by introducing dynamic, learnable activations that can adaptively linearize deep layers while maintaining performance, potentially improving training efficiency and model robustness.
Method: Proposes dynActivation: f_i(x) = BaseAct(x)(α_i - β_i) + β_i x, where α_i and β_i are lightweight learned scalars per layer that interpolate between a base nonlinear activation (ReLU-like) and a linear path. The method is evaluated across vision tasks (CIFAR-10, MNIST depth scaling, adversarial robustness) and language modeling tasks.
Result: dynActivation variants show significant improvements: +14.02% over static Mish on CIFAR-10, maintains >95% accuracy on deep MNIST models where ReLU collapses, provides 7.40% adversarial robustness advantage, and achieves 10.3% relative perplexity reduction in language modeling compared to SwiGLU at early training steps.
Conclusion: dynActivation enables adaptive linearization of deep layers while maintaining performance, improving training efficiency by up to +54% over ReLU and enhancing model robustness. The method generalizes well across vision and language tasks.
Abstract: This paper proposes $\mathrm{dynActivation}$, a per-layer trainable activation defined as $f_i(x) = \mathrm{BaseAct}(x)(α_i - β_i) + β_i x$, where $α_i$ and $β_i$ are lightweight learned scalars that interpolate between the base nonlinearity and a linear path, and $\mathrm{BaseAct}(x)$ can be any ReLU-like function. The static and dynamic ReLU-like variants are then compared across multiple vision tasks, language modeling tasks, and ablation studies. The results suggest that dynActivation variants tend to linearize deep layers while maintaining high performance, which can improve training efficiency by up to +54% over ReLU. On CIFAR-10, dynActivation(Mish) improves over static Mish by up to +14.02% on AttentionCNN, with an average improvement of +6.00% and a 24% convergence-AUC reduction relative to Mish (2120 vs. 2785). In a 1-to-75-layer MNIST depth-scaling study, dynActivation never drops below 95% test accuracy (95.3–99.3%), while ReLU collapses below 80% at 25 layers. Under FGSM at ε = 0.08, dynActivation(Mish) incurs a 55.39% accuracy drop versus 62.79% for ReLU (a 7.40% advantage). Transferred to language modeling, a newly proposed dynActGLU variant achieves a 10.3% relative perplexity reduction over SwiGLU at 5620 steps (4.047 vs. 4.514), though the gap vanishes at 34300 steps.
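The formula $f_i(x) = \mathrm{BaseAct}(x)(α_i - β_i) + β_i x$ is simple to evaluate directly: $(α, β) = (1, 0)$ recovers the base activation, and $α = β$ collapses the layer to the linear map $βx$, which is the "linearised" regime the experiments report in deep layers. A minimal scalar sketch with Mish as the base (plain Python; the trainable per-layer version would make α and β parameters):

```python
import math

def mish(x):
    """Mish: x · tanh(softplus(x))."""
    return x * math.tanh(math.log1p(math.exp(x)))

def dyn_activation(x, alpha, beta, base=mish):
    """dynActivation: f(x) = BaseAct(x) · (alpha - beta) + beta · x.
    (alpha, beta) = (1, 0) recovers the base activation exactly;
    alpha == beta yields the purely linear map beta · x."""
    return base(x) * (alpha - beta) + beta * x

x = 1.3
same = dyn_activation(x, alpha=1.0, beta=0.0)    # equals mish(x)
linear = dyn_activation(x, alpha=0.5, beta=0.5)  # equals 0.5 * x
```

Because the interpolation is per layer, the network can keep early layers strongly nonlinear while letting deep layers drift toward linearity, which is where the reported training-efficiency gains come from.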
[953] RAMPAGE: RAndomized Mid-Point for debiAsed Gradient Extrapolation
Abolfazl Hashemi
Main category: cs.LG
TL;DR: RAMPAGE and RAMPAGE+ are randomized midpoint methods for variational inequalities that eliminate discretization bias in Extragradient methods, with RAMPAGE+ using antithetic sampling for variance reduction.
Details
Motivation: Extragradient (EG) methods for variational inequalities suffer from discretization bias when applied to non-linear vector fields, which limits their effectiveness in various optimization problems including convex-concave games.
Method: Introduces RAMPAGE (Randomized Mid-Point for debiAsed Gradient Extrapolation) and its variance-reduced counterpart RAMPAGE+ which uses antithetic sampling. Both methods are unbiased geometric path-integrators that eliminate internal first-order terms from variance.
Result: Proves O(1/k) convergence guarantees for root finding under co-coercive, co-hypomonotone, and generalized Lipschitzness regimes. Extends results to constrained VIs via symmetrically scaled variants and provides convergence guarantees for stochastic/deterministic smooth convex-concave games.
Conclusion: RAMPAGE+ achieves unbiased estimation with improved variance properties compared to RAMPAGE, and despite being randomized, attains purely deterministic bounds in many settings, making it a promising alternative to biased EG methods.
Abstract: A celebrated method for Variational Inequalities (VIs) is Extragradient (EG), which can be viewed as a standard discrete-time integration scheme. With this view in mind, in this paper we show that EG may suffer from discretization bias when applied to non-linear vector fields, conservative or otherwise. To resolve this discretization shortcoming, we introduce RAndomized Mid-Point for debiAsed Gradient Extrapolation (RAMPAGE) and its variance-reduced counterpart, RAMPAGE+ which leverages antithetic sampling. In contrast with EG, both methods are unbiased. Furthermore, leveraging negative correlation, RAMPAGE+ acts as an unbiased, geometric path-integrator that completely removes internal first-order terms from the variance, provably improving upon RAMPAGE. We further demonstrate that both methods enjoy provable $\mathcal{O}(1/k)$ convergence guarantees for a range of problems including root finding under co-coercive, co-hypomonotone, and generalized Lipschitzness regimes. Furthermore, we introduce symmetrically scaled variants to extend our results to constrained VIs. Finally, we provide convergence guarantees of both methods for stochastic and deterministic smooth convex-concave games. Somewhat interestingly, despite being a randomized method, RAMPAGE+ attains purely deterministic bounds for a number of the studied settings.
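The randomized-midpoint idea can be illustrated on a toy monotone linear field: classical EG evaluates the field at a fixed extrapolation point, while a randomized-midpoint step evaluates it at a uniformly random point along the extrapolation segment, debiasing the implicit integral of $F$ that EG discretises. This is a generic sketch of that idea under stated assumptions, not the paper's RAMPAGE/RAMPAGE+ pseudocode (the step size, field, and iteration count are arbitrary):

```python
import numpy as np

def extragradient_step(x, F, gamma):
    """Classical EG: extrapolate with F(x), then update at the midpoint."""
    x_mid = x - gamma * F(x)
    return x - gamma * F(x_mid)

def randomized_midpoint_step(x, F, gamma, rng):
    """Randomized-midpoint variant (sketch): evaluate the update field
    at a uniformly random point along the extrapolation segment."""
    tau = rng.uniform()                  # tau ~ U[0, 1]
    x_mid = x - tau * gamma * F(x)
    return x - gamma * F(x_mid)

# Strongly monotone toy field (contraction plus rotation): the iterates
# should converge to the root of F at the origin.
A = np.array([[1.0, 2.0], [-2.0, 1.0]])
F = lambda z: A @ z
rng = np.random.default_rng(0)
x = np.array([1.0, 1.0])
for _ in range(200):
    x = randomized_midpoint_step(x, F, 0.1, rng)
```

On a linear field the bias question is invisible (EG is already exact to second order there); the paper's point is that for genuinely non-linear fields the fixed-midpoint evaluation is biased, while the randomized evaluation is unbiased in expectation, and antithetic pairs of τ (RAMPAGE+) reduce its variance.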
[954] Multimodal Survival Analysis with Locally Deployable Large Language Models
Moritz Gögl, Christopher Yau
Main category: cs.LG
TL;DR: Multimodal survival analysis combining clinical text, tabular data, and genomics using lightweight, locally-deployable LLMs with teacher-student distillation for calibrated survival probabilities and evidence-based prognosis text generation.
Details
Motivation: Many healthcare institutions face computational and privacy constraints that prevent using cloud-based LLMs, creating a need for lightweight, on-premises models for multimodal survival analysis while avoiding the hallucination and miscalibration issues of base LLMs.
Method: Uses teacher-student distillation with principled multimodal fusion to integrate clinical text, tabular covariates, and genomic profiles. Jointly estimates calibrated survival probabilities and generates concise, evidence-grounded prognosis text using locally deployable LLMs.
Result: Outperforms standard baselines on TCGA cohort, avoids reliance on cloud services and privacy concerns, and reduces risk of hallucinated or miscalibrated estimates compared to base LLMs.
Conclusion: Demonstrates effective multimodal survival analysis with locally deployable LLMs that provide both calibrated survival estimates and interpretable prognosis text while addressing privacy and computational constraints.
Abstract: We study multimodal survival analysis integrating clinical text, tabular covariates, and genomic profiles using locally deployable large language models (LLMs). As many institutions face tight computational and privacy constraints, this setting motivates the use of lightweight, on-premises models. Our approach jointly estimates calibrated survival probabilities and generates concise, evidence-grounded prognosis text via teacher-student distillation and principled multimodal fusion. On a TCGA cohort, it outperforms standard baselines, avoids reliance on cloud services and associated privacy concerns, and reduces the risk of hallucinated or miscalibrated estimates that can be observed in base LLMs.
[955] Causal Evidence that Language Models use Confidence to Drive Behavior
Dharshan Kumaran, Nathaniel Daw, Simon Osindero, Petar Velickovic, Viorica Patraucean
Main category: cs.LG
TL;DR: LLMs demonstrate metacognitive-like behavior where internal confidence estimates drive abstention decisions through threshold-based policies, similar to biological metacognition.
Details
Motivation: To investigate whether LLMs actively use internal confidence signals to regulate behavior (metacognition), particularly for autonomous decision-making about when to act or abstain.
Method: Four-phase abstention paradigm: (1) establish internal confidence without abstention, (2) reveal implicit confidence thresholds for abstention decisions, (3) causal manipulation via activation steering, (4) systematic variation of abstention policies based on instructed thresholds.
Result: Confidence was the dominant predictor of abstention behavior (effect sizes 10x larger than RAG scores or semantic features). Activation steering causally shifted abstention rates, and models could vary abstention policies based on instructed thresholds.
Conclusion: LLMs exhibit two-stage metacognitive control with internal confidence representations and threshold-based policies, essential for autonomous agents to recognize uncertainty and decide when to act or seek help.
Abstract: Metacognition – the ability to assess one’s own cognitive performance – is documented across species, with internal confidence estimates serving as a key signal for adaptive behavior. While confidence can be extracted from Large Language Model (LLM) outputs, whether models actively use these signals to regulate behavior remains a fundamental question. We investigate this through a four-phase abstention paradigm. Phase 1 established internal confidence estimates in the absence of an abstention option. Phase 2 revealed that LLMs apply implicit thresholds to these estimates when deciding to answer or abstain. Confidence emerged as the dominant predictor of behavior, with effect sizes an order of magnitude larger than knowledge retrieval accessibility (RAG scores) or surface-level semantic features. Phase 3 provided causal evidence through activation steering: manipulating internal confidence signals correspondingly shifted abstention rates. Finally, Phase 4 demonstrated that models can systematically vary abstention policies based on instructed thresholds. Our findings indicate that abstention arises from the joint operation of internal confidence representations and threshold-based policies, mirroring the two-stage metacognitive control found in biological systems. This capacity is essential as LLMs transition into autonomous agents that must recognize their own uncertainty to decide when to act or seek help.
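The two-stage policy described here, an internal confidence estimate followed by a threshold rule, can be sketched in a few lines. The confidence values and thresholds below are hypothetical, purely to show how instructed thresholds (Phase 4) would shift the abstention rate:

```python
def abstain_decision(confidence, threshold):
    """Stage 2 of a two-stage metacognitive policy: given an internal
    confidence estimate (stage 1), abstain whenever it falls below the
    threshold. Returns True when the model should abstain."""
    return confidence < threshold

def abstention_rate(confidences, threshold):
    """Fraction of questions on which the policy abstains."""
    return sum(abstain_decision(c, threshold) for c in confidences) / len(confidences)

# Illustrative internal confidences for a batch of questions (made-up values).
confs = [0.95, 0.80, 0.55, 0.40, 0.20]
for thr in (0.3, 0.6, 0.9):
    print(thr, abstention_rate(confs, thr))
```

Raising the instructed threshold monotonically raises the abstention rate, which is the behavioral signature the paper tests for.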
[956] Calibeating Made Simple
Yurong Chen, Zhiyi Huang, Michael I. Jordan, Haipeng Luo
Main category: cs.LG
TL;DR: The paper studies calibeating - post-processing external forecasts online to minimize losses while matching informativeness benchmarks, connecting it to online learning techniques for general proper losses.
Details
Motivation: Previous work analyzed calibeating for specific losses with specific arguments, but there's a need for a more general framework that connects calibeating to established online learning techniques for broader applicability.
Method: The paper reduces calibeating to existing online learning techniques by showing it’s minimax-equivalent to regret minimization, and extends this to multi-calibeating by combining calibeating with the classical expert problem.
Result: Recovers optimal O(log T) calibeating rates for Brier and log losses, obtains new optimal rates for mixable and general bounded losses, and achieves the first calibrated algorithm with an optimal calibeating rate for binary predictions.
Conclusion: Calibeating can be systematically connected to online learning theory, enabling optimal rates for various loss functions and providing a unified framework for forecast post-processing with calibration guarantees.
Abstract: We study calibeating, the problem of post-processing external forecasts online to minimize cumulative losses and match an informativeness-based benchmark. Unlike prior work, which analyzed calibeating for specific losses with specific arguments, we reduce calibeating to existing online learning techniques and obtain results for general proper losses. More concretely, we first show that calibeating is minimax-equivalent to regret minimization. This recovers the $O(\log T)$ calibeating rate of Foster and Hart [FH23] for the Brier and log losses and its optimality, and yields new optimal calibeating rates for mixable losses and general bounded losses. Second, we prove that multi-calibeating is minimax-equivalent to the combination of calibeating and the classical expert problem. This yields new optimal multi-calibeating rates for mixable losses, including Brier and log losses, and general bounded losses. Finally, we obtain new bounds for achieving calibeating and calibration simultaneously for the Brier loss. For binary predictions, our result gives the first calibrated algorithm that at the same time also achieves the optimal $O(\log T)$ calibeating rate.
[957] Revisiting Quantum Code Generation: Where Should Domain Knowledge Live?
Oscar Novo, Oscar Bastidas-Jossa, Alberto Calvo, Antonio Peris, Carlos Kuchkovsky
Main category: cs.LG
TL;DR: General-purpose LLMs with execution-feedback agents outperform specialized fine-tuned models for quantum software code generation, achieving up to 85% pass@1 on Qiskit-HumanEval benchmark.
Details
Motivation: To determine the best approach for incorporating domain knowledge into LLM-based assistants for quantum software development while maintaining flexibility as libraries evolve, comparing specialized fine-tuning versus inference-time augmentation methods.
Method: Compared a parameter-specialized fine-tuned baseline against general-purpose LLMs enhanced with retrieval-augmented generation (RAG) and agent-based inference with execution feedback on the Qiskit-HumanEval benchmark for quantum code generation.
Result: General-purpose LLMs achieved 60-65% pass@1 in zero-shot/RAG settings, and up to 85% with execution-feedback agents, outperforming the fine-tuned baseline (47%). Agentic execution feedback provided the most consistent improvements, while RAG offered modest gains.
Conclusion: Performance gains can be achieved without domain-specific fine-tuning using inference-time augmentation (especially execution-feedback agents), enabling more flexible and maintainable LLM-assisted quantum software development.
Abstract: Recent advances in large language models (LLMs) have enabled the automation of an increasing number of programming tasks, including code generation for scientific and engineering domains. In rapidly evolving software ecosystems such as quantum software development, where frameworks expose complex abstractions, a central question is how best to incorporate domain knowledge into LLM-based assistants while preserving maintainability as libraries evolve. In this work, we study specialization strategies for Qiskit code generation using the Qiskit-HumanEval benchmark. We compare a parameter-specialized fine-tuned baseline introduced in prior work against a range of recent general-purpose LLMs enhanced with retrieval-augmented generation (RAG) and agent-based inference with execution feedback. Our results show that modern general-purpose LLMs consistently outperform the parameter-specialized baseline. While the fine-tuned model achieves approximately 47% pass@1 on Qiskit-HumanEval, recent general-purpose models reach 60-65% under zero-shot and retrieval-augmented settings, and up to 85% for the strongest evaluated model when combined with iterative execution-feedback agents, representing an improvement of more than 20% over zero-shot general-purpose performance and more than 35% over the parameter-specialized baseline. Agentic execution feedback yields the most consistent improvements, albeit at increased runtime cost, while RAG provides modest and model-dependent gains. These findings indicate that performance gains can be achieved without domain-specific fine-tuning, instead relying on inference-time augmentation, thereby enabling a more flexible and maintainable approach to LLM-assisted quantum software development.
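The execution-feedback agent loop evaluated here follows a generic pattern: generate code, run it, and feed the error back into the next attempt. A minimal sketch with a stubbed-out model in place of a real LLM call (the `generate` interface is an assumption, not the paper's):

```python
def execution_feedback_loop(generate, task_prompt, max_iters=3):
    """Generic execution-feedback agent: generate a candidate program,
    execute it, and append the runtime error to the prompt for the next
    attempt. `generate` stands in for an LLM call (prompt -> source)."""
    feedback = ""
    for attempt in range(max_iters):
        source = generate(task_prompt + feedback)
        try:
            namespace = {}
            exec(source, namespace)  # run the candidate program
            return source, attempt + 1  # success: return code and #calls
        except Exception as exc:
            feedback = f"\n# Previous attempt failed with: {exc!r}. Fix it."
    return None, max_iters

# Stub "model": the first attempt raises a NameError, the second is fixed.
attempts = iter(["result = undefined_name", "result = 2 + 2"])
gen = lambda prompt: next(attempts)
source, n_calls = execution_feedback_loop(gen, "compute 2+2")
print(source, n_calls)
```

The same skeleton works whether the candidate is plain Python or, as in the paper, Qiskit code executed against a simulator.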
[958] Chimera: Latency- and Performance-Aware Multi-agent Serving for Heterogeneous LLMs
Kangqi Ni, Wenyue Hua, Xiaoxiang Shi, Jiang Guo, Shiyu Chang, Tianlong Chen
Main category: cs.LG
TL;DR: Chimera is a predictive scheduling system for multi-agent LLM workflows on heterogeneous clusters that optimizes both latency and task performance through semantic routing and congestion-aware load balancing.
Details
Motivation: Existing LLM serving systems assume homogeneous clusters with identical model replicas, missing opportunities for heterogeneous deployments that enable better latency-performance trade-offs. However, heterogeneity introduces scheduling challenges across models with diverse throughput and performance characteristics.
Method: Chimera uses semantic routing to estimate per-model confidence scores for each request, predicts the total remaining output length of workflows, and estimates per-model congestion using in-flight predicted token volumes for intelligent load balancing across heterogeneous LLMs.
Result: Chimera traces the best latency-performance frontier, reducing end-to-end latency by 1.2-2.4× and improving task performance by 8.0-9.5 percentage points on average over competitive baselines including vLLM, evaluated on code generation and math reasoning workflows.
Conclusion: Chimera demonstrates that predictive scheduling for heterogeneous LLM clusters can significantly improve both latency and task performance for multi-agent workflows, enabling more efficient use of diverse model capabilities.
Abstract: Multi-agent applications often execute complex tasks as multi-stage workflows, where each stage is an LLM call whose output becomes part of context for subsequent steps. Existing LLM serving systems largely assume homogeneous clusters with identical model replicas. This design overlooks the potential of heterogeneous deployments, where models of different sizes and capabilities enable finer trade-offs between latency and performance. However, heterogeneity introduces new challenges in scheduling across models with diverse throughput and performance. We present Chimera, a predictive scheduling system for multi-agent workflow serving on heterogeneous LLM clusters that jointly improves end-to-end latency and task performance. Chimera applies semantic routing to estimate per-model confidence scores for each request, predicts the total remaining output length of the workflow, and estimates per-model congestion using in-flight predicted token volumes for load balancing. We evaluate Chimera on representative agentic workflows for code generation and math reasoning using multiple heterogeneous LLM configurations. Across comparable settings, Chimera traces the best latency-performance frontier, reducing end-to-end latency by 1.2–2.4$\times$ and improving task performance by 8.0-9.5 percentage points on average over competitive baselines including vLLM.
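Chimera's three signals, per-model confidence, predicted remaining output length, and congestion from in-flight predicted tokens, can be combined in many ways. The sketch below is one illustrative scoring rule; the field names and the confidence-minus-wait objective are assumptions, not the paper's exact formulation:

```python
def route_request(models, confidences, predicted_len, alpha=0.5):
    """Toy congestion-aware router: each model has a throughput (tokens/s)
    and a volume of in-flight predicted tokens. Estimated wait is
    (in_flight + predicted_len) / throughput; pick the model maximizing
    confidence minus a latency penalty, then book-keep the new load."""
    def score(name):
        m = models[name]
        wait = (m["in_flight"] + predicted_len) / m["throughput"]
        return confidences[name] - alpha * wait
    best = max(models, key=score)
    models[best]["in_flight"] += predicted_len
    return best

# A fast small model vs. a congested large one (illustrative numbers).
models = {
    "small": {"throughput": 200.0, "in_flight": 0.0},
    "large": {"throughput": 50.0, "in_flight": 400.0},
}
confidences = {"small": 0.6, "large": 0.9}
choice = route_request(models, confidences, predicted_len=100)
print(choice)
```

Here the large model is more confident, but its predicted queueing delay outweighs the confidence gap, so the request routes to the small model; with an empty large-model queue the decision would flip.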
[959] Noise Titration: Exact Distributional Benchmarking for Probabilistic Time Series Forecasting
Qilin Wang
Main category: cs.LG
TL;DR: A paradigm shift from passive observation to interventionist benchmarking for time series forecasting, using calibrated noise injection into known dynamical systems to enable exact statistical evaluation, revealing foundation models fail under non-stationarity while the proposed Fern architecture maintains structural fidelity.
Details
Motivation: Current time series forecasting evaluation relies on passive observation of single historical trajectories, making claims about model robustness to non-stationarity fundamentally unfalsifiable. There's a need for more rigorous, interventionist benchmarking that enables exact statistical evaluation.
Method: Systematically inject calibrated Gaussian observation noise into known chaotic and stochastic dynamical systems, transforming forecasting into an exact distributional inference task. Extend the Fern architecture into a probabilistic generative model that parameterizes the Symmetric Positive Definite (SPD) cone to output calibrated joint covariance structures without computational bottlenecks.
Result: State-of-the-art zero-shot foundation models behave consistently with context-parroting mechanisms, failing systematically under non-stationary regime shifts and elevated noise. In contrast, Fern explicitly captures the invariant measure and multivariate geometry of underlying dynamics, maintaining structural fidelity and statistically sharp calibration where massive sequence-matching models collapse.
Conclusion: The interventionist benchmarking paradigm enables rigorous evaluation of forecasting models, revealing fundamental limitations of current foundation models in handling non-stationarity and noise, while demonstrating the superiority of architectures like Fern that explicitly model dynamical system geometry.
Abstract: Modern time series forecasting is evaluated almost entirely through passive observation of single historical trajectories, rendering claims about a model’s robustness to non-stationarity fundamentally unfalsifiable. We propose a paradigm shift toward interventionist, exact-statistical benchmarking. By systematically titrating calibrated Gaussian observation noise into known chaotic and stochastic dynamical systems, we transform forecasting from a black-box sequence matching game into an exact distributional inference task. Because the underlying data-generating process and noise variance are mathematically explicit, evaluation can rely on exact negative log-likelihoods and calibrated distributional tests rather than heuristic approximations. To fully leverage this framework, we extend the Fern architecture into a probabilistic generative model that natively parameterizes the Symmetric Positive Definite (SPD) cone, outputting calibrated joint covariance structures without the computational bottleneck of generic Jacobian modeling. Under this rigorous evaluation, we find that state-of-the-art zero-shot foundation models behave consistently with the context-parroting mechanism, failing systematically under non-stationary regime shifts and elevated noise. In contrast, Fern explicitly captures the invariant measure and multivariate geometry of the underlying dynamics, maintaining structural fidelity and statistically sharp calibration precisely where massive sequence-matching models collapse.
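The core of the benchmarking recipe, titrating known Gaussian noise into an explicit dynamical system so that exact likelihoods exist, fits in a few lines. The sketch below uses a logistic map as the known system (a stand-in choice, not necessarily one of the paper's benchmarks) and checks the empirical oracle NLL against the analytic Gaussian-entropy floor that no forecaster can beat:

```python
import math, random

def logistic_map(x0, steps, r=3.9):
    """Known chaotic data-generating process: x_{t+1} = r * x_t * (1 - x_t)."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs

def gaussian_nll(obs, mean, sigma):
    """Exact per-observation negative log-likelihood under N(mean, sigma^2)."""
    return 0.5 * math.log(2 * math.pi * sigma ** 2) + (obs - mean) ** 2 / (2 * sigma ** 2)

random.seed(0)
sigma = 0.05  # the titrated observation-noise level, known exactly
clean = logistic_map(0.4, 200)
noisy = [x + random.gauss(0.0, sigma) for x in clean]

# Because the process and sigma are explicit, the oracle NLL is computable
# exactly; its expectation is the Gaussian entropy 0.5*log(2*pi*e*sigma^2),
# a hard floor for any forecaster's average NLL.
nll = sum(gaussian_nll(o, m, sigma) for o, m in zip(noisy, clean)) / len(clean)
floor = 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)
print(nll, floor)
```

A forecaster's average NLL minus this floor is then an exact, falsifiable measure of distributional error rather than a heuristic score.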
[960] Confidence-Based Decoding is Provably Efficient for Diffusion Language Models
Changxiao Cai, Gen Li
Main category: cs.LG
TL;DR: Theoretical analysis of confidence-based decoding strategies for diffusion language models, showing entropy-based unmasking achieves efficient sampling with complexity scaling with data distribution entropy.
Details
Motivation: Diffusion language models offer flexible generation order and parallel token generation, but the decoding strategy (order and number of tokens generated per iteration) critically affects sampling efficiency. While confidence-based methods show strong empirical performance, theoretical understanding remains limited.
Method: Develops the first theoretical analysis framework for confidence-based decoding in DLMs. Focuses on an entropy sum-based strategy that continues unmasking tokens within each iteration until the cumulative entropy exceeds a threshold.
Result: Shows entropy-based strategy achieves ε-accurate sampling in KL divergence with expected number of iterations scaling as Õ(H(X₀)/ε), where H(X₀) is target data distribution entropy. Provides substantial acceleration for low-entropy distributions relative to sequence length.
Conclusion: Provides theoretical foundation for confidence-based decoding in diffusion language models, showing entropy-based strategies can adapt to data complexity without prior knowledge or hyperparameter tuning, potentially informing more efficient decoding strategy design.
Abstract: Diffusion language models (DLMs) have emerged as a promising alternative to autoregressive (AR) models for language modeling, allowing flexible generation order and parallel generation of multiple tokens. However, this flexibility introduces a challenge absent in AR models: the \emph{decoding strategy} – which determines the order and number of tokens generated at each iteration – critically affects sampling efficiency. Among decoding strategies explored in practice, confidence-based methods, which adaptively select which and how many tokens to unmask based on prediction confidence, have shown strong empirical performance. Despite this success, our theoretical understanding of confidence-based decoding remains limited. In this work, we develop the first theoretical analysis framework for confidence-based decoding in DLMs. We focus on an entropy sum-based strategy that continues unmasking tokens within each iteration until the cumulative entropy exceeds a threshold, and show that it achieves $\varepsilon$-accurate sampling in KL divergence with an expected number of iterations $\widetilde O(H(X_0)/\varepsilon)$, where $H(X_0)$ denotes the entropy of the target data distribution. Notably, this strategy yields substantial sampling acceleration when the data distribution has low entropy relative to the sequence length, while automatically adapting to the intrinsic complexity of data without requiring prior knowledge or hyperparameter tuning. Overall, our results provide a theoretical foundation for confidence-based decoding and may inform the design of more efficient decoding strategies for DLMs.
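The entropy-sum rule can be sketched directly: rank masked positions by predictive entropy and unmask until the cumulative entropy would cross the threshold. The distributions below are hypothetical, and unmasking lowest-entropy positions first is an implementation choice consistent with confidence-based decoding, not something the paper's analysis mandates:

```python
import math

def entropy(p):
    """Shannon entropy (nats) of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0)

def entropy_sum_unmask(position_probs, threshold):
    """One decoding iteration: sort masked positions by predictive entropy,
    then unmask in increasing order until the cumulative entropy would
    exceed the threshold (always unmasking at least one position)."""
    ranked = sorted(position_probs, key=lambda kv: entropy(kv[1]))
    chosen, total = [], 0.0
    for pos, probs in ranked:
        h = entropy(probs)
        if chosen and total + h > threshold:
            break
        chosen.append(pos)
        total += h
    return chosen

# Hypothetical per-position predictive distributions over a 3-token vocab.
probs = {
    0: [0.98, 0.01, 0.01],  # near-certain: cheap to unmask
    1: [0.90, 0.05, 0.05],
    2: [0.34, 0.33, 0.33],  # near-uniform: expensive, deferred
}
step = entropy_sum_unmask(list(probs.items()), threshold=0.6)
print(step)
```

Confident positions are unmasked in parallel while the near-uniform one waits for a later iteration, which is how the expected iteration count ends up scaling with the total entropy H(X₀) rather than the sequence length.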
[961] Decoupling Exploration and Policy Optimization: Uncertainty Guided Tree Search for Hard Exploration
Zakaria Mhammedi, James Cohan
Main category: cs.LG
TL;DR: A new exploration paradigm separates exploration from exploitation, using tree-search with epistemic uncertainty instead of RL for exploration, achieving efficient state coverage and SOTA results on hard exploration tasks.
Details
Motivation: Current RL approaches use intrinsic motivation with policy optimization for exploration, which incurs unnecessary overhead. The authors propose that while policy optimization is needed for task execution, it's inefficient for pure exploration.
Method: Separates exploration from exploitation, using tree-search inspired by the Go-With-The-Winner algorithm with epistemic uncertainty measures to drive exploration without RL. Discovered trajectories are distilled into policies using supervised backward learning.
Result: Explores an order of magnitude more efficiently than intrinsic motivation baselines on hard Atari benchmarks. Achieves SOTA scores on Montezuma’s Revenge, Pitfall!, and Venture. Solves MuJoCo Adroit dexterous manipulation and AntMaze tasks from image observations without expert demonstrations.
Conclusion: The proposed paradigm of separating exploration from exploitation and bypassing RL during exploration is more efficient and achieves superior performance on hard exploration tasks, demonstrating generality across discrete and continuous domains.
Abstract: The process of discovery requires active exploration – the act of collecting new and informative data. However, efficient autonomous exploration remains a major unsolved problem. The dominant paradigm addresses this challenge by using Reinforcement Learning (RL) to train agents with intrinsic motivation, maximizing a composite objective of extrinsic and intrinsic rewards. We suggest that this approach incurs unnecessary overhead: while policy optimization is necessary for precise task execution, employing such machinery solely to expand state coverage may be inefficient. In this paper, we propose a new paradigm that explicitly separates exploration from exploitation and bypasses RL during the exploration phase. Our method uses a tree-search strategy inspired by the Go-With-The-Winner algorithm, paired with a measure of epistemic uncertainty to systematically drive exploration. By removing the overhead of policy optimization, our approach explores an order of magnitude more efficiently than standard intrinsic motivation baselines on hard Atari benchmarks. Further, we demonstrate that the discovered trajectories can be distilled into deployable policies using existing supervised backward learning algorithms, achieving state-of-the-art scores by a wide margin on Montezuma’s Revenge, Pitfall!, and Venture without relying on domain-specific knowledge. Finally, we demonstrate the generality of our framework in high-dimensional continuous action spaces by solving the MuJoCo Adroit dexterous manipulation and AntMaze tasks in a sparse-reward setting, directly from image observations and without expert demonstrations or offline datasets. To the best of our knowledge, this has not been achieved before.
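A minimal version of uncertainty-guided tree search can be illustrated on a chain environment, with a visit-count novelty bonus standing in for the epistemic-uncertainty measure; the paper's actual estimator and the Go-With-The-Winner cloning scheme are richer than this sketch:

```python
import heapq, itertools

def uncertainty_tree_search(step_fn, start, goal, budget=500):
    """Exploration without policy optimization (sketch): keep a frontier of
    reached states, repeatedly expand the state with the highest novelty
    bonus (1/visit-count stands in for epistemic uncertainty), and record
    the action sequence that first reaches each state."""
    visits = {start: 1}
    paths = {start: []}
    counter = itertools.count()            # FIFO tie-breaking in the heap
    frontier = [(-1.0, next(counter), start)]
    for _ in range(budget):
        if goal in paths:
            return paths[goal]
        _, _, s = heapq.heappop(frontier)
        for action in (-1, +1):
            nxt = step_fn(s, action)
            visits[nxt] = visits.get(nxt, 0) + 1
            if nxt not in paths:
                paths[nxt] = paths[s] + [action]
            heapq.heappush(frontier, (-1.0 / visits[nxt], next(counter), nxt))
        heapq.heappush(frontier, (-1.0 / visits[s], next(counter), s))
    return paths.get(goal)

# Chain of length 30: a classic reward-free hard-exploration toy.
step = lambda s, a: max(0, min(30, s + a))
plan = uncertainty_tree_search(step, start=0, goal=30)
print(len(plan))
```

Freshly discovered states always carry the maximal novelty bonus, so the search pushes straight down the chain; the recovered action sequence is then exactly the kind of trajectory the paper distills into a policy with supervised backward learning.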
[962] Scaling DoRA: High-Rank Adaptation via Factored Norms and Fused Kernels
Alexandra Zelenin, Alexandra Zhuravlyova
Main category: cs.LG
TL;DR: Efficient implementation of DoRA (Weight-Decomposed Low-Rank Adaptation) using factored norm computation and fused Triton kernels to reduce memory usage and improve speed for vision-language models.
Details
Motivation: DoRA extends LoRA by decoupling weight magnitude from direction, but its forward pass requires materializing dense matrices that consume excessive memory (512 MB per module at high dimensions), making it infeasible on single-GPU setups with many adapted modules.
Method: Two systems contributions: 1) A factored norm decomposes the squared norm into base, cross, and Gram terms computable through O(d_out r + r^2) intermediates, eliminating the dense product. 2) Fused Triton kernels collapse the four-kernel DoRA composition into a single pass, reducing memory traffic by 4x with a numerically stable form.
Result: Across six 8-32B vision-language models on three NVIDIA GPUs, fused implementation is 1.5-2.0x faster for inference and 1.5-1.9x faster for gradient computation than Hugging Face PEFT’s DoRA, with up to 7 GB lower peak VRAM. Cosine similarity exceeds 0.9999 and training curves match within 7.1×10^-4 loss delta.
Conclusion: The proposed efficient DoRA implementation enables practical high-rank adaptation for vision-language models on single-GPU setups by significantly reducing memory requirements and improving computational speed while maintaining numerical accuracy.
Abstract: Weight-Decomposed Low-Rank Adaptation (DoRA) extends LoRA by decoupling weight magnitude from direction, but its forward pass requires the row-wise norm of W + sBA, a computation that every major framework we surveyed implements by materializing the dense [d_out, d_in] product BA. At d_in = 8192 and rank r = 384, a single module’s norm requires about 512 MB of transient working memory in bf16, making high-rank DoRA costly and often infeasible on common single-GPU setups once hundreds of adapted modules and checkpointing are involved. We present two systems contributions. A factored norm decomposes the squared norm into base, cross, and Gram terms computable through O(d_out r + r^2) intermediates, eliminating the dense product. Fused Triton kernels collapse the four-kernel DoRA composition into a single pass, reducing memory traffic by about 4x and using a numerically stable form that avoids catastrophic cancellation in the near-unity rescaling regime where magnitude scales concentrate in practice. Across six 8-32B vision-language models (VLMs) on three NVIDIA GPUs (RTX 6000 PRO, H200, B200) at r = 384 in bf16, the fused implementation is 1.5-2.0x faster than Hugging Face PEFT’s DoRA implementation for inference and 1.5-1.9x faster for gradient computation (optimizer step excluded), with up to 7 GB lower peak VRAM. Microbenchmarks on six GPUs spanning four architecture generations (L40S, A100, RTX 6000 PRO, H200, B200, B300) confirm 1.5-2.7x compose-kernel speedup. Final-logit cosine similarity exceeds 0.9999 across all model/GPU pairs, and multi-seed training curves match within 7.1 x 10^-4 mean per-step loss delta over 2000 steps.
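The factored-norm identity at the heart of the first contribution is easy to check numerically. The sketch below implements the base/cross/Gram decomposition in NumPy with small hypothetical shapes (the paper's version is fused Triton kernels at much larger scale) and compares it against the dense materialization of BA:

```python
import numpy as np

def rowwise_norm_factored(W, B, A, s):
    """Row-wise L2 norm of W + s*B@A without materializing the dense
    [d_out, d_in] product BA. Per row i:
      ||W_i + s*(BA)_i||^2 = ||W_i||^2 + 2s*W_i.(BA)_i + s^2*||(BA)_i||^2
    cross: W_i.(B_i A) = (W A^T)_i . B_i   -> [d_out, r] intermediate
    gram:  ||B_i A||^2 = B_i (A A^T) B_i^T -> [r, r] intermediate
    """
    base = np.sum(W * W, axis=1)
    cross = np.sum((W @ A.T) * B, axis=1)       # O(d_out * r) working set
    gram = np.sum((B @ (A @ A.T)) * B, axis=1)  # O(r^2) working set
    return np.sqrt(base + 2.0 * s * cross + s ** 2 * gram)

rng = np.random.default_rng(0)
d_out, d_in, r, s = 64, 128, 8, 0.1
W = rng.standard_normal((d_out, d_in))
B = rng.standard_normal((d_out, r))
A = rng.standard_normal((r, d_in))

dense = np.linalg.norm(W + s * (B @ A), axis=1)  # reference: materializes BA
factored = rowwise_norm_factored(W, B, A, s)
print(np.max(np.abs(dense - factored)))
```

The two agree to floating-point precision while the factored path never allocates a [d_out, d_in] temporary, which is exactly what removes the ~512 MB transient at the shapes quoted in the abstract.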
[963] Your Absorbing Discrete Diffusion Secretly Models the Conditional Distributions of Clean Data
Jingyang Ou, Shen Nie, Kaiwen Xue, Fengqi Zhu, Jiacheng Sun, Zhenguo Li, Chongxuan Li
Main category: cs.LG
TL;DR: RADD is a reparameterized absorbing discrete diffusion model that simplifies concrete score estimation by expressing it as time-independent conditional probabilities, enabling sampling acceleration through caching and achieving SOTA perplexity on language modeling benchmarks.
Details
Motivation: The paper aims to improve discrete diffusion models for language modeling by simplifying the concrete score estimation in absorbing diffusion processes, which typically requires time-conditioned networks and suffers from computational inefficiency during sampling.
Method: RADD reparameterizes absorbing discrete diffusion to express concrete scores as time-independent conditional probabilities multiplied by analytic time-dependent scalars, eliminating the need for time-conditioned networks and enabling caching for accelerated sampling.
Result: RADD achieves state-of-the-art performance among diffusion models on 5 zero-shot language modeling benchmarks at GPT-2 scale, with improved sampling efficiency through caching mechanisms.
Conclusion: The paper demonstrates that absorbing discrete diffusion can be effectively reparameterized to simplify training and accelerate sampling while maintaining strong performance, and establishes connections between diffusion models and autoregressive models.
Abstract: Discrete diffusion models with absorbing processes have shown promise in language modeling. The key quantities to be estimated are the ratios between the marginal probabilities of two transitive states at all timesteps, called the concrete score. In this paper, we reveal that the concrete score in absorbing diffusion can be expressed as conditional probabilities of clean data, multiplied by a time-dependent scalar in an analytic form. Motivated by this finding, we propose reparameterized absorbing discrete diffusion (RADD), a dedicated diffusion model without time-condition that characterizes the time-independent conditional probabilities. Besides its simplicity, RADD can reduce the number of function evaluations (NFEs) by caching the output of the time-independent network when the noisy sample remains unchanged in a sampling interval, which enables sampling acceleration. Built upon the new perspective of conditional distributions, we further unify absorbing discrete diffusion and any-order autoregressive models (AO-ARMs), showing that the upper bound on the negative log-likelihood for the diffusion model can be interpreted as an expected negative log-likelihood for AO-ARMs. Further, our RADD models achieve SOTA performance among diffusion models on 5 zero-shot language modeling benchmarks (measured by perplexity) at the GPT-2 scale. Our code is available at https://github.com/ML-GSAI/RADD.
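The caching trick enabled by RADD's time-independent network can be sketched with a tiny wrapper: re-evaluate only when the noisy sequence actually changes, and count function evaluations (NFEs). The dummy network and step sequence below are illustrative only:

```python
class CachedDenoiser:
    """Sketch of RADD-style caching: because the score network has no time
    conditioning, its output can be reused across sampling steps whenever
    the noisy sequence is unchanged. `net` stands in for the
    conditional-probability network."""
    def __init__(self, net):
        self.net = net
        self.nfe = 0          # number of function evaluations
        self._key = None
        self._out = None

    def __call__(self, tokens):
        key = tuple(tokens)
        if key != self._key:  # only re-evaluate on a changed sample
            self._out = self.net(tokens)
            self.nfe += 1
            self._key = key
        return self._out

# Dummy network; 0 plays the role of the MASK token. Some sampling steps
# leave the sequence unchanged (no token unmasked) - exactly when the
# cache pays off.
denoiser = CachedDenoiser(net=lambda toks: [t + 1 for t in toks])
steps = [[0, 0, 0], [0, 0, 0], [5, 0, 0], [5, 0, 0], [5, 7, 0]]
for toks in steps:
    denoiser(toks)
print(denoiser.nfe)
```

Five sampling steps cost only three network evaluations here; a time-conditioned score network could not do this, since its output would differ at every timestep even for an identical noisy sample.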
[964] RLHF in an SFT Way: From Optimal Solution to Reward-Weighted Alignment
Yuhao Du, Zhuo Li, Pengyu Cheng, Zhihong Chen, Yuejiao Xie, Xiang Wan, Anningzhe Gao
Main category: cs.LG
TL;DR: VAR: A novel RLHF simplification using variational inference to transform alignment into offline reward-driven re-weighted SFT, improving stability and efficiency over DPO and PPO.
Details
Motivation: RLHF faces challenges with complexity, computational cost, and training instability. Existing methods like PPO are computationally expensive, while DPO suffers from over-fitting and instability. This creates a need for simpler, more stable alignment methods.
Method: Variational Alignment with Re-weighting (VAR) minimizes the distribution gap between the LLM policy and the optimal RLHF solution, transforming alignment into offline reward-driven re-weighted SFT with minor adjustments to the SFT loss.
Result: Outperforms DPO by avg. 7.16% in helpfulness/harmlessness metrics. Comparable/better than online methods while reducing computational overhead and accelerating convergence (5× faster than GRPO).
Conclusion: VAR provides an efficient and effective solution bridging the efficiency-performance gap in LLM alignment, offering stable training with reduced computational cost.
Abstract: Reinforcement Learning from Human Feedback (RLHF) is crucial for aligning Large Language Models (LLMs) with human values. However, RLHF has been continuously challenged by its implementation complexity and computational cost, especially for online sampling-based methods like Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO). Even with recent simplifications, such as Direct Preference Optimization (DPO), which designs an offline implicit reward learning objective relying on pre-collected preference datasets, the problems of over-fitting and training instability continue to keep the alignment process from its expected optimal performance. To address these challenges, we propose a novel simplification of RLHF from the perspective of variational inference, called Variational Alignment with Re-weighting (VAR). Specifically, by directly minimizing the distribution gap between the learning LLM policy and the optimal solution of RLHF, we transform the alignment objective into an offline reward-driven re-weighted supervised fine-tuning (SFT) form, which requires only a minor adjustment to the SFT loss to obtain a noticeable improvement in training stability and effectiveness. In comprehensive evaluation benchmarks, our objective empowers LLMs to outperform offline alignment methods, demonstrating superior performance in both helpfulness and harmlessness metrics (avg. $\uparrow7.16%$ over DPO). Meanwhile, when compared to online sampling methods, our method is comparable or even better while significantly reducing computational overhead and accelerating convergence (over $5\times$ faster than GRPO), making our approach an efficient and effective way to bridge the gap between efficiency and performance in LLM alignment.
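The reward-driven re-weighted SFT form can be sketched as a softmax-weighted negative log-likelihood. This is one plausible instantiation consistent with the RLHF optimum pi*(y|x) ∝ pi_ref(y|x) exp(r/beta), not necessarily the exact VAR objective:

```python
import math

def reward_weighted_sft_loss(logps, rewards, beta=1.0):
    """Reward-driven re-weighted SFT (sketch): scale each response's NLL
    by a softmax weight over its reward, so high-reward samples dominate
    the gradient. beta controls how peaked the weighting is; smaller beta
    approaches best-of-n imitation, larger beta approaches uniform SFT."""
    m = max(r / beta for r in rewards)
    exps = [math.exp(r / beta - m) for r in rewards]  # numerically stable
    z = sum(exps)
    weights = [e / z for e in exps]
    loss = -sum(w * lp for w, lp in zip(weights, logps))
    return loss, weights

# Hypothetical batch: log-likelihoods of three responses and their rewards.
logps = [-2.0, -5.0, -1.0]
rewards = [1.0, -1.0, 2.0]
loss, weights = reward_weighted_sft_loss(logps, rewards, beta=1.0)
print(loss, weights)
```

Because the objective is just a weighted SFT loss over a pre-collected dataset, it needs no online sampling or reward-model calls during training, which is where the claimed stability and speed advantages over PPO/GRPO come from.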
[965] An evolutionary perspective on modes of learning in Transformers
Alexander Y. Ku, Thomas L. Griffiths, Stephanie C. Y. Chan
Main category: cs.LG
TL;DR: Transformers use two learning strategies: in-weight learning (IWL) for permanent parameter updates and in-context learning (ICL) for temporary context-based adaptation. The paper analyzes how environmental stability and cue reliability determine which strategy is preferred, drawing parallels from evolutionary biology.
Details
Motivation: To understand why Transformers use both in-weight learning (permanent parameter updates) and in-context learning (temporary context-based adaptation), and to determine the environmental conditions that favor each strategy, drawing inspiration from evolutionary biology principles.
Method: The study operationalizes environmental stability and cue reliability in controlled task settings (sinusoid regression and Omniglot classification) to characterize their influence on learning in Transformers. It analyzes learning dynamics and transitions between ICL and IWL strategies.
Result: Stable environments favor in-weight learning (IWL), often with sharp transitions when conditions are static. Reliable cues favor in-context learning (ICL), especially in volatile environments. Task-dependent transitions between strategies are governed by asymptotic optimality and optimization costs.
Conclusion: The choice between in-weight learning and in-context learning in Transformers depends on environmental stability and cue reliability, similar to evolutionary adaptations. Understanding these dynamics helps explain when and why Transformers prefer different learning strategies.
Abstract: The success of Transformers lies in their ability to improve inference through two complementary strategies: the permanent refinement of model parameters via in-weight learning (IWL), and the ephemeral modulation of inferences via in-context learning (ICL), which leverages contextual information maintained in the model’s activations. Evolutionary biology tells us that the predictability of the environment across timescales predicts the extent to which analogous strategies should be preferred. Genetic evolution adapts to stable environmental features by gradually modifying the genotype over generations. Conversely, environmental volatility favors plasticity, which enables a single genotype to express different traits within a lifetime, provided there are reliable cues to guide the adaptation. We operationalize these dimensions (environmental stability and cue reliability) in controlled task settings (sinusoid regression and Omniglot classification) to characterize their influence on learning in Transformers. We find that stable environments favor IWL, often exhibiting a sharp transition when conditions are static. Conversely, reliable cues favor ICL, particularly when the environment is volatile. Furthermore, an analysis of learning dynamics reveals task-dependent transitions between strategies (ICL to IWL and vice versa). We demonstrate that these transitions are governed by (1) the asymptotic optimality of the strategy with respect to the environment, and (2) the optimization cost of acquiring that strategy, which depends on the task structure and the learner’s inductive bias.
[966] Learning to Reason without External Rewards
Xuandong Zhao, Zhewei Kang, Aosong Feng, Sergey Levine, Dawn Song
Main category: cs.LG
TL;DR: Intuitor: An RLIF method that uses LLM’s self-certainty as intrinsic reward signal for unsupervised learning, matching supervised RLVR performance on math tasks while achieving better generalization to code generation.
Details
Motivation: Current RLVR methods require costly domain-specific supervision and verifiable rewards, limiting scalability. Need for autonomous learning without external rewards or labeled data.
Method: Proposes Reinforcement Learning from Internal Feedback (RLIF) framework. Intuitor method replaces external rewards in Group Relative Policy Optimization (GRPO) with model’s self-certainty scores as intrinsic reward signal.
Result: Matches GRPO performance on mathematical benchmarks, achieves better generalization to out-of-domain tasks like code generation without requiring gold solutions or test cases.
Conclusion: Intrinsic model signals (self-certainty) can drive effective learning across domains, offering scalable alternative to RLVR for autonomous AI systems where verifiable rewards are unavailable.
Abstract: Training large language models (LLMs) for complex reasoning via Reinforcement Learning with Verifiable Rewards (RLVR) is effective but limited by reliance on costly, domain-specific supervision. We explore Reinforcement Learning from Internal Feedback (RLIF), a framework that enables LLMs to learn from intrinsic signals without external rewards or labeled data. We propose Intuitor, an RLIF method that uses a model’s own confidence, termed self-certainty, as its sole reward signal. Intuitor replaces external rewards in Group Relative Policy Optimization (GRPO) with self-certainty scores, enabling fully unsupervised learning. Experiments demonstrate that Intuitor matches GRPO’s performance on mathematical benchmarks while achieving better generalization to out-of-domain tasks like code generation, without requiring gold solutions or test cases. Our findings show that intrinsic model signals can drive effective learning across domains, offering a scalable alternative to RLVR for autonomous AI systems where verifiable rewards are unavailable. Code is available at https://github.com/sunblaze-ucb/Intuitor
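Self-certainty can be read as how far the model's next-token distributions sit from uniform. A minimal sketch, assuming the KL-from-uniform reading (the paper gives the exact definition Intuitor uses), paired with GRPO-style group normalization:

```python
import math

def self_certainty(token_dists):
    """Mean KL(p_t || uniform) over a response's token distributions.
    Peaked (confident) distributions score high; uniform scores 0.
    One natural reading of 'self-certainty'; see the paper for the
    exact formula."""
    total = 0.0
    for p in token_dists:
        entropy = -sum(q * math.log(q) for q in p if q > 0)
        total += math.log(len(p)) - entropy  # KL(p || U) = log V - H(p)
    return total / len(token_dists)

def group_advantages(scores):
    """GRPO-style advantages: standardize intrinsic scores within a
    group of sampled responses, replacing the external verifiable
    reward with the model's own confidence."""
    mean = sum(scores) / len(scores)
    std = (sum((s - mean) ** 2 for s in scores) / len(scores)) ** 0.5
    return [(s - mean) / (std if std > 0 else 1.0) for s in scores]
```

A response whose token distributions are peaked receives a positive advantage relative to hesitant ones in the same group, so the policy is reinforced toward answers it is internally confident about.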
[967] Masked Diffusion Models as Energy Minimization
Sitong Chen, Shen Nie, Jiacheng Sun, Zijin Feng, Zhenguo Li, Ji-Rong Wen, Chongxuan Li
Main category: cs.LG
TL;DR: MDMs are shown to minimize three equivalent energy formulations in optimal transport, enabling efficient schedule design via Beta distributions for improved sampling.
Details
Motivation: To provide a unified theoretical foundation for masked diffusion models by connecting them to energy minimization in discrete optimal transport, and to enable practical schedule optimization.
Method: Prove mathematical equivalence of kinetic, conditional kinetic, and geodesic energy formulations under MDMs. Parameterize interpolation schedules using Beta distributions to reduce design space to 2D search for efficient post-training tuning.
Result: Energy-inspired schedules outperform hand-crafted baselines, especially in low-step sampling settings, on synthetic and real-world benchmarks.
Conclusion: MDMs unify multiple energy perspectives in optimal transport, and the Beta parameterization enables practical schedule optimization without model modification.
Abstract: We present a systematic theoretical framework that interprets masked diffusion models (MDMs) as solutions to energy minimization problems in discrete optimal transport. Specifically, we prove that three distinct energy formulations (kinetic, conditional kinetic, and geodesic energy) are mathematically equivalent under the structure of MDMs, and that MDMs minimize all three when the mask schedule satisfies a closed-form optimality condition. This unification not only clarifies the theoretical foundations of MDMs, but also motivates practical improvements in sampling. By parameterizing interpolation schedules via Beta distributions, we reduce the schedule design space to a tractable 2D search, enabling efficient post-training tuning without model modification. Experiments on synthetic and real-world benchmarks demonstrate that our energy-inspired schedules outperform hand-crafted baselines, particularly in low-step sampling settings.
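The 2D schedule parameterization is easy to sketch: a mask schedule given by a Beta(a, b) CDF, computed below with a stdlib-only midpoint integration as a stand-in for `scipy.stats.beta.cdf`. The scoring proxy used to actually pick (a, b) is the part the paper supplies and is not reproduced here.

```python
import math

def beta_cdf(t, a, b, n=2000):
    """Regularized incomplete beta I_t(a, b) via midpoint integration
    (stdlib stand-in for scipy.stats.beta.cdf)."""
    if t <= 0.0:
        return 0.0
    if t >= 1.0:
        return 1.0
    norm = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    h = t / n
    # Midpoints avoid the integrable endpoint singularities when a < 1
    # or b < 1.
    xs = (h * (i + 0.5) for i in range(n))
    return h * sum(x ** (a - 1) * (1 - x) ** (b - 1) for x in xs) / norm

def mask_schedule(t, a, b):
    """Fraction of tokens masked at diffusion time t in [0, 1].
    Choosing (a, b) is the whole design space: a 2D grid search
    replaces hand-crafting the schedule."""
    return beta_cdf(t, a, b)

# Beta(1, 1) recovers the linear schedule; skewed choices front- or
# back-load the masking, which matters most in low-step sampling.
linear = mask_schedule(0.5, 1.0, 1.0)
front_loaded = mask_schedule(0.5, 0.5, 2.0)
```

In practice one would evaluate a small grid of (a, b) pairs with a cheap few-step sampling proxy on validation data and keep the best pair; no model retraining is involved.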
[968] Learning to Interpret Weight Differences in Language Models
Avichal Goel, Yoon Kim, Nir Shavit, Tony T. Wang
Main category: cs.LG
TL;DR: DIT trains models to describe their own finetuning-induced weight changes using natural language, enabling interpretability of model modifications.
Details
Motivation: Finetuning language models updates their knowledge but weight changes are not interpretable, and finetuning datasets are often unavailable or too large to analyze directly.
Method: Diff Interpretation Tuning (DIT) uses synthetic, labeled weight diffs to train a DIT-adapter that can be applied to finetuned models to make them describe their changes in natural language.
Result: In proof-of-concept settings (reporting hidden behaviors and summarizing finetuned knowledge), DIT enables models to accurately describe their finetuning-induced modifications using natural language.
Conclusion: DIT provides a method for making weight diffs interpretable through natural language descriptions, addressing the black-box nature of model finetuning.
Abstract: Finetuning (pretrained) language models is a standard approach for updating their internal parametric knowledge and specializing them to new tasks and domains. However, the corresponding model weight changes (“weight diffs”) are not generally interpretable. While inspecting the finetuning dataset can give a sense of how the model might have changed, these datasets are often not publicly available or are too large to work with directly. Towards the goal of comprehensively understanding weight diffs in natural language, we introduce Diff Interpretation Tuning (DIT), a method that trains models to describe their own finetuning-induced modifications. Our approach uses synthetic, labeled weight diffs to train a DIT-adapter, which can be applied to a compatible finetuned model to make it describe how it has changed. We demonstrate in two proof-of-concept settings (reporting hidden behaviors and summarizing finetuned knowledge) that our method enables models to describe their finetuning-induced modifications using accurate natural language descriptions.
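The object DIT operates on is simply the per-parameter difference between a finetuned checkpoint and its base. A minimal sketch with plain floats standing in for tensors; the DIT-adapter itself, trained on synthetic labeled diffs, is the paper's contribution and is not reproduced here.

```python
def weight_diff(base, finetuned):
    """Per-parameter weight diff: what finetuning changed.

    base / finetuned map parameter names to values (floats here,
    tensors in practice). DIT trains an adapter on many synthetic,
    labeled diffs of this form so that a finetuned model can describe
    its own diff in natural language."""
    assert base.keys() == finetuned.keys(), "checkpoints must match"
    return {name: finetuned[name] - base[name] for name in base}

base = {"layer0.w": 0.10, "layer1.w": -0.40}
tuned = {"layer0.w": 0.10, "layer1.w": -0.25}
diff = weight_diff(base, tuned)  # only layer1.w changed
```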
[969] Relative Error Embeddings for the Gaussian Kernel Distance
Di Chen, Jeff M. Phillips
Main category: cs.LG
Summary unavailable: the arXiv API request for 1602.05350 returned HTTP 429 (rate limited).
[970] Geometric Imbalance in Semi-Supervised Node Classification
Liang Yan, Shengzhong Zhang, Bisheng Li, Menglin Yang, Chen Yang, Min Zhou, Weiyang Ding, Yutong Xie, Zengfeng Huang
Main category: cs.LG
Summary unavailable: the arXiv API request for 2303.10371 returned HTTP 429 (rate limited).
[971] Inhibitor Transformers and Gated RNNs for Torus Efficient Fully Homomorphic Encryption
Rickard Brännvall, Tony Zhang, Henrik Forsgren, Andrei Stoian, Fredrik Sandin, Marcus Liwicki
Main category: cs.LG
Summary unavailable: the arXiv API request for 2308.05629 returned HTTP 429 (rate limited).
[972] Asymptotically and Minimax Optimal Regret Bounds for Multi-Armed Bandits with Abstention
Junwen Yang, Tianyuan Jin, Vincent Y. F. Tan
Main category: cs.LG
Summary unavailable: the arXiv API request for 2402.15127 returned HTTP 429 (rate limited).
[973] List Sample Compression and Uniform Convergence
Steve Hanneke, Shay Moran, Tom Waknine
Main category: cs.LG
Summary unavailable: the arXiv API request for 2403.10889 returned HTTP 429 (rate limited).
[974] Revisit, Extend, and Enhance Hessian-Free Influence Functions
Ziao Yang, Han Yue, Jian Chen, Hongfu Liu
Main category: cs.LG
Summary unavailable: the arXiv API request for 2405.17490 returned HTTP 429 (rate limited).
[975] RoboMorph: Evolving Robot Morphology using Large Language Models
Kevin Qiu, Władysław Pałucki, Krzysztof Ciebiera, Paweł Fijałkowski, Marek Cygan, Łukasz Kuciński
Main category: cs.LG
Summary unavailable: the arXiv API request for 2407.08626 returned HTTP 429 (rate limited).
[976] Finite Neural Networks as Mixtures of Gaussian Processes: From Provable Error Bounds to Prior Selection
Steven Adams, Andrea Patanè, Morteza Lahijanian, Luca Laurenti
Main category: cs.LG
Summary unavailable: the arXiv API request for 2407.18707 returned HTTP 429 (rate limited).
[977] Learning Hidden Physics and System Parameters with Deep Operator Networks
Dibakar Roy Sarkar, Vijay Kag, Birupaksha Pal, Somdatta Goswami
Main category: cs.LG
Summary unavailable: the arXiv API request for 2412.05133 returned HTTP 429 (rate limited).
[978] Effectiveness of Distributed Gradient Descent with Local Steps for Overparameterized Models
Heng Zhu, Harsh Vardhan, Arya Mazumdar
Main category: cs.LG
Summary unavailable: the arXiv API request for 2412.07971 returned HTTP 429 (rate limited).
[979] The Cost of Replicability in Active Learning
Rupkatha Hira, Dominik Kau, Jessica Sorrell
Main category: cs.LG
Summary unavailable: the arXiv API request for 2412.09686 returned HTTP 429 (rate limited).
[980] C-HDNet: Hyperdimensional Computing for Causal Effect Estimation from Observational Data Under Network Interference
Abhishek Dalvi, Neil Ashtekar, Vasant Honavar
Main category: cs.LG
Summary unavailable: the arXiv API request for 2501.16562 returned HTTP 429 (rate limited).
[981] Herglotz-NET: Implicit Neural Representation of Spherical Data with Harmonic Positional Encoding
Théo Hanon, Nicolas Mil-Homens Cavaco, John Kiely, Laurent Jacques
Main category: cs.LG
Summary unavailable: the arXiv API request for 2502.13777 returned HTTP 429 (rate limited).
[982] TimeXL: Explainable Multi-modal Time Series Prediction with LLM-in-the-Loop
Yushan Jiang, Wenchao Yu, Geon Lee, Dongjin Song, Kijung Shin, Wei Cheng, Yanchi Liu, Haifeng Chen
Main category: cs.LG
Summary unavailable: the arXiv API request for 2503.01013 returned HTTP 429 (rate limited).
[983] SSR: Speculative Parallel Scaling Reasoning in Test-time
Yuanlin Chu, Bo Wang, Xiang Liu, Hong Chen, Aiwei Liu, Xuming Hu
Main category: cs.LG
Summary unavailable: the arXiv API request for 2505.15340 returned HTTP 429 (rate limited).
[984] Exemplar-Free Continual Learning for State Space Models
Isaac Ning Lee, Leila Mahmoodi, Trung Le, Mehrtash Harandi
Main category: cs.LG
Summary unavailable: the arXiv API request for 2505.18604 returned HTTP 429 (rate limited).
[985] Beyond Static Models: Hypernetworks for Adaptive and Generalizable Forecasting in Complex Parametric Dynamical Systems
Pantelis R. Vlachas, Konstantinos Vlachas, Eleni Chatzi
Main category: cs.LG
Summary unavailable: the arXiv API request for 2506.19609 returned HTTP 429 (rate limited).
[986] A Probabilistic Approach to Wildfire Spread Prediction Using a Denoising Diffusion Surrogate Model
Wenbo Yu, Anirbit Ghosh, Tobias Sebastian Finn, Rossella Arcucci, Marc Bocquet, Sibo Cheng
Main category: cs.LG
Summary unavailable: the arXiv API request for 2507.00761 returned HTTP 429 (rate limited).
[987] TIC-GRPO: Provable and Efficient Optimization for Reinforcement Learning from Human Feedback
Lei Pang, Jun Luo, Ruinan Jin
Main category: cs.LG
Summary unavailable: the arXiv API request for 2508.02833 returned HTTP 429 (rate limited).
[988] Agnostics: Learning to Code in Any Programming Language via Reinforcement with a Universal Learning Environment
Aleksander Boruch-Gruszecki, Yangtian Zi, Zixuan Wu, Tejas Oberoi, Carolyn Jane Anderson, Joydeep Biswas, Arjun Guha
Main category: cs.LG
Summary unavailable: the arXiv API request for 2508.04865 returned HTTP 429 (rate limited).
[989] Learning from Similarity-Confidence and Confidence-Difference
Tomoya Tate, Kosuke Sugiyama, Masato Uchida
Main category: cs.LG
Summary unavailable: the arXiv API request for 2508.05108 returned HTTP 429 (rate limited).
[990] Deep Neural Networks with General Activations: Super-Convergence in Sobolev Norms
Yahong Yang, Juncai He
Main category: cs.LG
Summary unavailable: the arXiv API request for 2508.05141 returned HTTP 429 (rate limited).
[991] Parameter-free Optimal Rates for Nonlinear Semi-Norm Contractions with Applications to $Q$-Learning
Ankur Naskar, Gugan Thoppe, Vijay Gupta
Main category: cs.LG
Summary unavailable: the arXiv API request for 2508.05984 returned HTTP 429 (rate limited).
[992] From Nodes to Narratives: Explaining Graph Neural Networks with LLMs and Graph Context
Peyman Baghershahi, Gregoire Fournier, Pranav Nyati, Sourav Medya
Main category: cs.LG
Summary unavailable: the arXiv API request for 2508.07117 returned HTTP 429 (rate limited).
[993] Tight Bounds for Schrödinger Potential Estimation in Unpaired Data Translation
Nikita Puchkin, Denis Suchkov, Alexey Naumov, Denis Belomestny
Main category: cs.LG
Summary unavailable: the arXiv API request for 2508.07392 returned HTTP 429 (rate limited).
[994] SHAining on Process Mining: Explaining Event Log Characteristics Impact on Algorithms
Andrea Maldonado, Christian M. M. Frey, Sai Anirudh Aryasomayajula, Ludwig Zellner, Stephan A. Fahrenkrog-Petersen, Thomas Seidl
Main category: cs.LG
Summary unavailable: the arXiv API request for 2509.08482 returned HTTP 429 (rate limited).
[995] Variational Neural Networks for Observable Thermodynamics (V-NOTS)
Christopher Eldred, François Gay-Balmaz, Vakhtang Putkaradze
Main category: cs.LG
Summary unavailable: the arXiv API request for 2509.09899 returned HTTP 429 (rate limited).
[996] HDC-X: Efficient Medical Data Classification for Embedded Devices
Jianglan Wei, Zhenyu Zhang, Pengcheng Wang, Mingjie Zeng, Zhigang Zeng
Main category: cs.LG
Summary unavailable: the arXiv API request for 2509.14617 returned HTTP 429 (rate limited).
[997] Scaling Laws are Redundancy Laws
Yuda Bi, Vince D Calhoun
Main category: cs.LG
Summary unavailable: the arXiv API request for 2509.20721 returned HTTP 429 (rate limited).
[998] SpecMol: A Spectroscopy-Grounded Foundation Model for Multi-Task Molecular Learning
Shuaike Shen, Jiaqing Xie, Zhuo Yang, Antong Zhang, Shuzhou Sun, Ben Gao, Tianfan Fu, Biqing Qi, Yuqiang Li
Main category: cs.LG
TL;DR: Summary unavailable for 2509.21861; the arXiv API returned HTTP 429 (rate limited).
[999] LEAF: Language-EEG Aligned Foundation Model for Brain-Computer Interfaces
Muyun Jiang, Shuailei Zhang, Zhenjie Yang, Mengjun Wu, Weibang Jiang, Zhiwei Guo, Wei Zhang, Rui Liu, Shangen Zhang, Yong Li, Yi Ding, Cuntai Guan
Main category: cs.LG
TL;DR: Summary unavailable for 2509.24302; the arXiv API returned HTTP 429 (rate limited).
[1000] DiVeQ: Differentiable Vector Quantization Using the Reparameterization Trick
Mohammad Hassan Vali, Tom Bäckström, Arno Solin
Main category: cs.LG
TL;DR: Summary unavailable for 2509.26469; the arXiv API returned HTTP 429 (rate limited).
[1001] Robust Batched Bandits
Yunwen Guo, Yunlun Shu, Gongyi Zhuo, Tianyu Wang
Main category: cs.LG
TL;DR: Summary unavailable for 2510.03798; the arXiv API returned HTTP 429 (rate limited).
[1002] Unlearning in Diffusion models under Data Constraints: A Variational Inference Approach
Subhodip Panda, Varun M S, Shreyans Jain, Sarthak Kumar Maharana, Prathosh A.P
Main category: cs.LG
TL;DR: Summary unavailable for 2510.04058; the arXiv API returned HTTP 429 (rate limited).
[1003] Correlating Cross-Iteration Noise for DP-SGD using Model Curvature
Xin Gu, Yingtai Xiao, Guanlin He, Jiamu Bai, Daniel Kifer, Kiwan Maeng
Main category: cs.LG
TL;DR: Summary unavailable for 2510.05416; the arXiv API returned HTTP 429 (rate limited).
[1004] Regularization Implies balancedness in the deep linear network
Kathryn Lindsey, Govind Menon
Main category: cs.LG
TL;DR: Summary unavailable for 2511.01137; the arXiv API returned HTTP 429 (rate limited).
[1005] Scaling Kinetic Monte-Carlo Simulations of Grain Growth with Combined Convolutional and Graph Neural Networks
Zhihui Tian, Ethan Suwandi, Tomas Oppelstrup, Vasily V. Bulatov, Joel B. Harley, Fei Zhou
Main category: cs.LG
TL;DR: Summary unavailable for 2511.17848; the arXiv API returned HTTP 429 (rate limited).
[1006] ASPEN: An Adaptive Spectral Physics-Enabled Network for Ginzburg-Landau Dynamics
Julian Evan Chrisnanto, Nurfauzi Fadillah, Yulison Herry Chrisnanto
Main category: cs.LG
TL;DR: Summary unavailable for 2512.03290; the arXiv API returned HTTP 429 (rate limited).
[1007] Always Keep Your Promises: A Model-Agnostic Attribution Algorithm for Neural Networks
Kevin Lee, Duncan Smith-Halverson, Pablo Millan Arias
Main category: cs.LG
TL;DR: Summary unavailable for 2512.07010; the arXiv API returned HTTP 429 (rate limited).
[1008] Token Sample Complexity of Attention
Léa Bohbot, Cyril Letrouit, Gabriel Peyré, François-Xavier Vialard
Main category: cs.LG
TL;DR: Summary unavailable for 2512.10656; the arXiv API returned HTTP 429 (rate limited).
[1009] Guided Transfer Learning for Discrete Diffusion Models
Julian Kleutgens, Claudio Battiloro, Lingkai Kong, Benjamin Grewe, Francesca Dominici, Mauricio Tec
Main category: cs.LG
TL;DR: Summary unavailable for 2512.10877; the arXiv API returned HTTP 429 (rate limited).
[1010] XNNTab – Interpretable Neural Networks for Tabular Data using Sparse Autoencoders
Khawla Elhadri, Jörg Schlötterer, Christin Seifert
Main category: cs.LG
TL;DR: Summary unavailable for 2512.13442; the arXiv API returned HTTP 429 (rate limited).
[1011] Improving Fairness of Large Language Model-Based ICU Mortality Prediction via Case-Based Prompting
Gangxiong Zhang, Yongchao Long, Yuxi Zhou, Yong Zhang, Shenda Hong
Main category: cs.LG
TL;DR: Summary unavailable for 2512.19735; the arXiv API returned HTTP 429 (rate limited).
[1012] Contractive Diffusion Policies: Robust Action Diffusion via Contractive Score-Based Sampling with Differential Equations
Amin Abyaneh, Charlotte Morissette, Mohamad H. Danesh, Anas El Houssaini, David Meger, Gregory Dudek, Hsiu-Chin Lin
Main category: cs.LG
TL;DR: Summary unavailable for 2601.01003; the arXiv API returned HTTP 429 (rate limited).
[1013] SIGMA: Scalable Spectral Insights for LLM Model Collapse
Yi Gu, Lingyou Pang, Xiangkun Ye, Tianyu Wang, Jianyu Lin, Carey E. Priebe, Alexander Aue
Main category: cs.LG
TL;DR: Summary unavailable for 2601.03385; the arXiv API returned HTTP 429 (rate limited).
[1014] Gradient Structure Estimation under Label-Only Oracles via Spectral Sensitivity
Jun Liu, Leo Yu Zhang, Fengpeng Li, Isao Echizen, Jiantao Zhou
Main category: cs.LG
TL;DR: Summary unavailable for 2601.14300; the arXiv API returned HTTP 429 (rate limited).
[1015] An explainable framework for the relationship between dementia and glucose metabolism patterns
C. Vázquez-García, F. J. Martínez-Murcia, F. Segovia Román, A. Forte, J. Ramírez, I. Illán, A. Hernández-Segura, C. Jiménez-Mesa, Juan M. Górriz
Main category: cs.LG
TL;DR: Summary unavailable for 2601.20480; the arXiv API returned HTTP 429 (rate limited).
[1016] Reasoning Cache: Continual Improvement Over Long Horizons via Short-Horizon RL
Ian Wu, Yuxiao Qu, Amrith Setlur, Aviral Kumar
Main category: cs.LG
TL;DR: Summary unavailable for 2602.03773; the arXiv API returned HTTP 429 (rate limited).
[1017] Logical Guidance for the Exact Composition of Diffusion Models
Francesco Alesiani, Jonathan Warrell, Tanja Bien, Henrik Christiansen, Matheus Ferraz, Mathias Niepert
Main category: cs.LG
TL;DR: Summary unavailable for 2602.05549; the arXiv API returned HTTP 429 (rate limited).
[1018] Flow Matching from Viewpoint of Proximal Operators
Kenji Fukumizu, Wei Huang, Han Bao, Shuntuo Xu, Nisha Chandramoorthy
Main category: cs.LG
TL;DR: Summary unavailable for 2602.12683; the arXiv API returned HTTP 429 (rate limited).
[1019] On the Geometric Coherence of Global Aggregation in Federated Graph Neural Networks
Chethana Prasad Kabgere, Shylaja SS
Main category: cs.LG
TL;DR: Summary unavailable for 2602.15510; the arXiv API returned HTTP 429 (rate limited).
[1020] CAMEL: An ECG Language Model for Forecasting Cardiac Events
Neelay Velingker, Alaia Solko-Breslin, Mayank Keoliya, Seewon Choi, Jiayi Xin, Anika Marathe, Alireza Oraii, Rajat Deo, Sameed Khatana, Rajeev Alur, Mayur Naik, Eric Wong
Main category: cs.LG
TL;DR: Summary unavailable for 2602.15677; the arXiv API returned HTTP 429 (rate limited).
[1021] T1: One-to-One Channel-Head Binding for Multivariate Time-Series Imputation
Dongik Park, Hyunwoo Ryu, Suahn Bae, Keondo Park, Hyung-Sin Kim
Main category: cs.LG
TL;DR: Summary unavailable for 2602.21043; the arXiv API returned HTTP 429 (rate limited).
[1022] Support Tokens, Stability Margins, and a New Foundation for Robust LLMs
Deepak Agarwal, Dhyey Dharmendrakumar Mavani, Suyash Gupta, Karthik Sethuraman, Tejas Dharamsi
Main category: cs.LG
TL;DR: Summary unavailable for 2602.22271; the arXiv API returned HTTP 429 (rate limited).
[1023] Demystifying Group Relative Policy Optimization: Its Policy Gradient is a U-Statistic
Hongyi Zhou, Kai Ye, Erhan Xu, Jin Zhu, Ying Yang, Shijin Gong, Chengchun Shi
Main category: cs.LG
TL;DR: Summary unavailable for 2603.01162; the arXiv API returned HTTP 429 (rate limited).
[1024] Speculative Speculative Decoding
Tanishq Kumar, Tri Dao, Avner May
Main category: cs.LG
TL;DR: Summary unavailable for 2603.03251; the arXiv API returned HTTP 429 (rate limited).
[1025] Missingness Bias Calibration in Feature Attribution Explanations
Shailesh Sridhar, Anton Xue, Eric Wong
Main category: cs.LG
TL;DR: Summary unavailable for 2603.04831; the arXiv API returned HTTP 429 (rate limited).
[1026] Invisible Safety Threat: Malicious Finetuning for LLM via Steganography
Guangnian Wan, Xinyin Ma, Gongfan Fang, Xinchao Wang
Main category: cs.LG
TL;DR: Summary unavailable for 2603.08104; the arXiv API returned HTTP 429 (rate limited).
[1027] Lost in Aggregation: On a Fundamental Expressivity Limit of Message-Passing Graph Neural Networks
Eran Rosenbluth
Main category: cs.LG
TL;DR: Summary unavailable for 2603.14846; the arXiv API returned HTTP 429 (rate limited).
[1028] Decomposing Probabilistic Scores: Reliability, Information Loss and Uncertainty
Arthur Charpentier, Agathe Fernandes Machado
Main category: cs.LG
TL;DR: Summary unavailable for 2603.15232; the arXiv API returned HTTP 429 (rate limited).
[1029] The Finetuner’s Fallacy: When to Pretrain with Your Finetuning Data
Christina Baek, Ricardo Pio Monti, David Schwab, Amro Abbas, Rishabh Adiga, Cody Blakeney, Maximilian Böther, Paul Burstein, Aldo Gael Carranza, Alvin Deng, Parth Doshi, Vineeth Dorna, Alex Fang, Tony Jiang, Siddharth Joshi, Brett W. Larsen, Jason Chan Lee, Katherine L. Mentzer, Luke Merrick, Haakon Mongstad, Fan Pan, Anshuman Suri, Darren Teh, Jason Telanoff, Jack Urbanek, Zhengping Wang, Josh Wills, Haoli Yin, Aditi Raghunathan, J. Zico Kolter, Bogdan Gaza, Ari Morcos, Matthew Leavitt, Pratyush Maini
Main category: cs.LG
TL;DR: Summary unavailable for 2603.16177; the arXiv API returned HTTP 429 (rate limited).
[1030] PRISM: Demystifying Retention and Interaction in Mid-Training
Bharat Runwal, Ashish Agrawal, Anurag Roy, Rameswar Panda
Main category: cs.LG
TL;DR: Summary unavailable for 2603.17074; the arXiv API returned HTTP 429 (rate limited).
[1031] PCA-Based Interpretable Knowledge Representation and Analysis of Geometric Design Parameters
Alexander Köhler, Michael Breuß
Main category: cs.LG
TL;DR: Summary unavailable for 2603.17535; the arXiv API returned HTTP 429 (rate limited).
[1032] RE-SAC: Disentangling aleatoric and epistemic risks in bus fleet control: A stable and robust ensemble DRL approach
Yifan Zhang, Liang Zheng
Main category: cs.LG
TL;DR: Summary unavailable for 2603.18396; the arXiv API returned HTTP 429 (rate limited).
[1033] Enhancing the Parameterization of Reservoir Properties for Data Assimilation Using Deep VAE-GAN
M. A. Sampaio, P. H. Ranazzi, M. J. Blunt
Main category: cs.LG
TL;DR: Summary unavailable for 2603.18766; the arXiv API returned HTTP 429 (rate limited).
[1034] Towards Noise-adaptive, Problem-adaptive (Accelerated) Stochastic Gradient Descent
Sharan Vaswani, Benjamin Dubois-Taine, Reza Babanezhad
Main category: cs.LG
TL;DR: Summary unavailable for 2110.11442; the arXiv API returned HTTP 429 (rate limited).
[1035] A Novel TSK Fuzzy System Incorporating Multi-view Collaborative Transfer Learning for Personalized Epileptic EEG Detection
Andong Li, Zhaohong Deng, Qiongdan Lou
Main category: cs.LG
TL;DR: Summary unavailable for 2111.08457; the arXiv API returned HTTP 429 (rate limited).
[1036] No-Regret Bayesian Recommendation to Homogeneous Users
Yiding Feng, Wei Tang, Haifeng Xu
Main category: cs.LG
TL;DR: Summary unavailable for 2202.06135; the arXiv API returned HTTP 429 (rate limited).
[1037] LOCO Feature Importance Inference without Data Splitting via Minipatch Ensembles
Luqin Gan, Lili Zheng, Genevera I. Allen
Main category: cs.LG
TL;DR: Summary unavailable for 2206.02088; the arXiv API returned HTTP 429 (rate limited).
[1038] Noise-contrastive Online Change Point Detection
Nikita Puchkin, Artur Goldman, Konstantin Yakovlev, Valeriia Dzis, Uliana Vinogradova
Main category: cs.LG
TL;DR: Summary unavailable for 2206.10143; the arXiv API returned HTTP 429 (rate limited).
[1039] Hybrid Quantum Generative Adversarial Networks for Molecular Simulation and Drug Discovery
Prateek Jain, Param Pathak, Krishna Bhatia, Shalini Devendrababu, Srinjoy Ganguly
Main category: cs.LG
TL;DR: Summary unavailable for 2212.07826; the arXiv API returned HTTP 429 (rate limited).
[1040] On Consistency of Signature Using Lasso
Xin Guo, Binnan Wang, Ruixun Zhang, Chaoyi Zhao
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2305.10413 returned HTTP 429 (rate limited).
[1041] High Confidence Level Inference is Almost Free using Parallel Stochastic Optimization
Wanrong Zhu, Zhipeng Lou, Ziyang Wei, Wei Biao Wu
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2401.09346 returned HTTP 429 (rate limited).
[1042] Interacting Particle Systems on Networks: joint inference of the network and the interaction kernel
Quanjun Lang, Xiong Wang, Fei Lu, Mauro Maggioni
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2402.08412 returned HTTP 429 (rate limited).
[1043] SPABA: A Single-Loop and Probabilistic Stochastic Bilevel Algorithm Achieving Optimal Sample Complexity
Tianshu Chu, Dachuan Xu, Wei Yao, Jin Zhang
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2405.18777 returned HTTP 429 (rate limited).
[1044] Fast convergence of a Federated Expectation-Maximization Algorithm
Zhixu Tao, Rajita Chandak, Sanjeev Kulkarni
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2408.05819 returned HTTP 429 (rate limited).
[1045] Multilevel Picard approximations and deep neural networks with ReLU, leaky ReLU, and softplus activation overcome the curse of dimensionality when approximating semilinear parabolic partial differential equations in $L^p$-sense
Ariel Neufeld, Tuan Anh Nguyen
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2409.20431 returned HTTP 429 (rate limited).
[1046] Variance reduction combining pre-experiment and in-experiment data
Zhexiao Lin, Pablo Crespo
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2410.09027 returned HTTP 429 (rate limited).
[1047] Dirichlet process mixtures of block $g$ priors for model selection and prediction in linear models
Anupreet Porwal, Abel Rodriguez
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2411.00471 returned HTTP 429 (rate limited).
[1048] Efficient transformer adaptation for analog in-memory computing via low-rank adapters
Chen Li, Elena Ferro, Corey Lammie, Manuel Le Gallo, Irem Boybat, Bipin Rajendran
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2411.17367 returned HTTP 429 (rate limited).
[1049] Scalable Learning from Probability Measures with Mean Measure Quantization
Erell Gachon, Elsa Cazelles, Jérémie Bigot
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2502.04907 returned HTTP 429 (rate limited).
[1050] Tightening optimality gap with confidence through conformal prediction
Miao Li, Michael Klamkin, Russell Bent, Pascal Van Hentenryck
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2503.04071 returned HTTP 429 (rate limited).
[1051] Optimization on the Oblique Manifold for Sparse Simplex Constraints via Multiplicative Updates
Flavia Esposito, Andersen Ang
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2503.24075 returned HTTP 429 (rate limited).
[1052] Learning collision risk proactively from naturalistic driving data at scale
Yiru Jiao, Simeon C. Calvert, Sander van Cranenburgh, Hans van Lint
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2505.13556 returned HTTP 429 (rate limited).
[1053] Proximal Point Nash Learning from Human Feedback
Daniil Tiapkin, Daniele Calandriello, Denis Belomestny, Eric Moulines, Alexey Naumov, Kashif Rasul, Michal Valko, Pierre Menard
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2505.19731 returned HTTP 429 (rate limited).
[1054] Differentially Private Distribution Release of Gaussian Mixture Models via KL-Divergence Minimization
Hang Liu, Anna Scaglione, Sean Peisert
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2506.03467 returned HTTP 429 (rate limited).
[1055] Differentiable Simulation of Hard Contacts with Soft Gradients for Learning and Control
Anselm Paulus, A. René Geist, Pierre Schumacher, Vít Musil, Simon Rappenecker, Georg Martius
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2506.14186 returned HTTP 429 (rate limited).
[1056] Multi-scale species richness estimation with deep learning
Victor Boussange, Bert Wuyts, Philipp Brun, Johanna T. Malle, Gabriele Midolo, Jeanne Portier, Théophile Sanchez, Niklaus E. Zimmermann, Irena Axmanová, Helge Bruelheide, Milan Chytrý, Stephan Kambach, Zdeňka Lososová, Martin Večeřa, Idoia Biurrun, Klaus T. Ecker, Jonathan Lenoir, Jens-Christian Svenning, Dirk Nikolaus Karger
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2507.06358 returned HTTP 429 (rate limited).
[1057] Information-Theoretic Decentralized Secure Aggregation with Passive Collusion Resilience
Xiang Zhang, Zhou Li, Shuangyang Li, Kai Wan, Derrick Wing Kwan Ng, Giuseppe Caire
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2508.00596 returned HTTP 429 (rate limited).
[1058] SpectraLLM: Uncovering the Ability of LLMs for Molecule Structure Elucidation from Multi-Spectral
Yunyue Su, Jiahui Chen, Zao Jiang, Zhenyi Zhong, Liang Wang, Qiang Liu, Zhaoxiang Zhang
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2508.08441 returned HTTP 429 (rate limited).
[1059] Neural Stochastic Differential Equations on Compact State Spaces: Theory, Methods, and Application to Suicide Risk Modeling
Malinda Lu, Yue-Jane Liu, Matthew K. Nock, Yaniv Yacoby
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2508.17090 returned HTTP 429 (rate limited).
[1060] Modeling and benchmarking quantum optical neurons for efficient neural computation
Andrea Andrisani, Gennaro Vessio, Fabrizio Sgobba, Francesco Di Lena, Luigi Amato Santamaria, Giovanna Castellano
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2509.01784 returned HTTP 429 (rate limited).
[1061] Learning Magnetic Order Classification from Large-Scale Materials Databases
Ahmed E. Fahmy
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2509.05909 returned HTTP 429 (rate limited).
[1062] BioBO: Biology-informed Bayesian Optimization for Perturbation Design
Yanke Li, Tianyu Cui, Tommaso Mansi, Mangal Prakash, Rui Liao
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2509.19988 returned HTTP 429 (rate limited).
[1063] Stiff Circuit System Modeling via Transformer
Weiman Yan, Yi-Chia Chang, Wanyu Zhao
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2510.24727 returned HTTP 429 (rate limited).
[1064] Machine Learning-Driven Analysis of kSZ Maps to Predict CMB Optical Depth $τ$
Farshid Farhadi Khouzani, Abinash Kumar Shaw, Paul La Plante, Bryar Mustafa Shareef, Laxmi Gewali
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2511.04770 returned HTTP 429 (rate limited).
[1065] Reinforcement Learning for Chemical Ordering in Alloy Nanoparticles
Jonas Elsborg, Emma L. Hovmand, Arghya Bhowmik
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2511.12260 returned HTTP 429 (rate limited).
[1066] Scalable learning of macroscopic stochastic dynamics
Mengyi Chen, Pengru Huang, Kostya S. Novoselov, Qianxiao Li
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2511.12842 returned HTTP 429 (rate limited).
[1067] BITS for GAPS: Bayesian Information-Theoretic Sampling for hierarchical GAussian Process Surrogates
Kyla D. Jones, Alexander W. Dowling
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2511.16815 returned HTTP 429 (rate limited).
[1068] Physics Enhanced Deep Surrogates for the Phonon Boltzmann Transport Equation
Antonio Varagnolo, Giuseppe Romano, Raphaël Pestourie
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2512.05976 returned HTTP 429 (rate limited).
[1069] AETAS: Analysis of Evolving Temporal Affect and Semantics for Legal History
Qizhi Wang
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2512.22196 returned HTTP 429 (rate limited).
[1070] Active learning for photonic crystals
Ryan Lopez, Charlotte Loh, Rumen Dangovski, Marin Soljačić
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2601.16287 returned HTTP 429 (rate limited).
[1071] Building a Robust Risk-Based Access Control System to Combat Ransomware’s Capability to Encrypt
Kenan Begovic, Abdulaziz Al-Ali, Qutaibah Malluhi
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2601.16795 returned HTTP 429 (rate limited).
[1072] Trigger Optimization and Event Classification for Dark Matter Searches in the CYGNO Experiment Using Machine Learning
F. D. Amaro, R. Antonietti, E. Baracchini, L. Benussi, C. Capoccia, M. Caponero, L. G. M. de Carvalho, G. Cavoto, I. A. Costa, A. Croce, M. D’Astolfo, G. D’Imperio, G. Dho, E. Di Marco, J. M. F. dos Santos, D. Fiorina, F. Iacoangeli, Z. Islam, E. Kemp, H. P. Lima Jr, G. Maccarrone, R. D. P. Mano, D. J. G. Marques, G. Mazzitelli, P. Meloni, A. Messina, C. M. B. Monteiro, R. A. Nobrega, G. M. Oppedisano, I. F. Pains, E. Paoletti, F. Petrucci, S. Piacentini, D. Pierluigi, D. Pinci, F. Renga, A. Russo, G. Saviano, P. A. O. C. Silva, N. J. Spooner, R. Tesauro, S. Tomassini, D. Tozzi
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2601.20626 returned HTTP 429 (rate limited).
[1073] Twinning Complex Networked Systems: Data-Driven Calibration of the mABCD Synthetic Graph Generator
Piotr Bródka, Michał Czuba, Bogumił Kamiński, Łukasz Kraiński, Katarzyna Musial, Paweł Prałat, Mateusz Stolarski
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2602.02044 returned HTTP 429 (rate limited).
[1074] BayesFlow 2: Multi-Backend Amortized Bayesian Inference in Python
Lars Kühmichel, Jerry M. Huang, Valentin Pratz, Jonas Arruda, Hans Olischläger, Daniel Habermann, Simon Kucharsky, Lasse Elsemüller, Aayush Mishra, Niels Bracher, Svenja Jedhoff, Marvin Schmitt, Paul-Christian Bürkner, Stefan T. Radev
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2602.07098 returned HTTP 429 (rate limited).
[1075] Universal Coefficients and Mayer-Vietoris Sequence for Groupoid Homology
Luciano Melodia
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2602.08998 returned HTTP 429 (rate limited).
[1076] ACE-RTL: When Agentic Context Evolution Meets RTL-Specialized LLMs
Chenhui Deng, Zhongzhi Yu, Guan-Ting Liu, Nathaniel Pinckney, Brucek Khailany, Haoxing Ren
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2602.10218 returned HTTP 429 (rate limited).
[1077] Power-SMC: Low-Latency Sequence-Level Power Sampling for Training-Free LLM Reasoning
Seyedarmin Azizi, Erfan Baghaei Potraghloo, Minoo Ahmadi, Souvik Kundu, Massoud Pedram
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2602.10273 returned HTTP 429 (rate limited).
[1078] FastLSQ: Solving PDEs in One Shot via Fourier Features with Exact Analytical Derivatives
Antonin Sulc
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2602.10541 returned HTTP 429 (rate limited).
[1079] Universality of shallow and deep neural networks on non-Euclidean spaces
Vugar Ismailov
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2602.23381 returned HTTP 429 (rate limited).
[1080] RAIE: Region-Aware Incremental Preference Editing with LoRA for LLM-based Recommendation
Jin Zeng, Yupeng Qi, Hui Li, Chengming Li, Ziyu Lyu, Lixin Cui, Lu Bai
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2603.00638 returned HTTP 429 (rate limited).
[1081] A Survey of Reinforcement Learning For Economics
Pranjal Rawat
Main category: cs.LG
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from the arXiv API.
Abstract: Failed to fetch summary for 2603.08956: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.08956&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1082] Filtered Spectral Projection for Quantum Principal Component Analysis
Sk Mujaffar Hossain, Satadeep Bhattacharjee
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Abstract: Failed to fetch summary for 2603.13441: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.13441&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1083] Physics-Informed Policy Optimization via Analytic Dynamics Regularization
Namai Chandra, Liu Mohan, Zhihao Gu, Lin Wang
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Abstract: Failed to fetch summary for 2603.14469: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.14469&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1084] Sequential Transport for Causal Mediation Analysis
Agathe Fernandes Machado, Iryna Voitsitska, Arthur Charpentier, Ewen Gallic
Main category: cs.LG
TL;DR: Unable to analyze paper 2603.15182 due to HTTP 429 error when fetching from arXiv API
Abstract: Failed to fetch summary for 2603.15182: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.15182&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1085] Multi-Domain Empirical Bayes for Linearly-Mixed Causal Representations
Bohan Wu, Julius von Kügelgen, David M. Blei
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Abstract: Failed to fetch summary for 2603.18404: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.18404&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1086] Statistical Testing Framework for Clustering Pipelines by Selective Inference
Yugo Miyata, Tomohiro Shiraishi, Shunichi Nishino, Ichiro Takeuchi
Main category: cs.LG
TL;DR: Unable to fetch the abstract for 2603.18413 due to HTTP 429 error (rate limiting) from the arXiv API.
Abstract: Failed to fetch summary for 2603.18413: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.18413&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[1087] A Theoretical Comparison of No-U-Turn Sampler Variants: Necessary and Sufficient Convergence Conditions and Mixing Time Analysis under Gaussian Targets
Samuel Gruffaz, Kyurae Kim, Fares Guehtar, Hadrien Duval-decaix, Pacôme Trautmann
Main category: cs.LG
TL;DR: Paper 2603.18640: Unable to fetch abstract due to HTTP 429 error (rate limiting)
Abstract: Failed to fetch summary for 2603.18640: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.18640&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
cs.MA
[1088] Reason-to-Transmit: Deliberative Adaptive Communication for Cooperative Perception
Aayam Bansal, Ishaan Gangwani
Main category: cs.MA
TL;DR: R2T introduces a transformer-based reasoning module for cooperative perception that makes deliberate transmission decisions based on scene context, neighbor information gaps, and bandwidth constraints, outperforming reactive baselines especially in high-occlusion scenarios.
Details
Motivation: Bandwidth constraints in V2X networks require efficient communication policies for cooperative perception. Existing approaches use reactive mechanisms without reasoning about why messages benefit receivers, limiting their effectiveness in information-sharing decisions.
Method: R2T equips each agent with a lightweight transformer-based module that reasons over local scene context, estimated neighbor information gaps, and bandwidth budget to make per-region transmission decisions. Trained end-to-end with a bandwidth-aware objective.
Result: R2T outperforms nine baselines in multi-agent bird’s-eye-view perception. At low bandwidth, all selective methods perform similarly, but R2T shows clear gains under high occlusion where information asymmetry is greatest, approaching oracle performance. Methods degrade gracefully under packet drops up to 50%.
Conclusion: While fusion design dominates performance, deliberative communication provides additional gains in challenging scenarios. R2T enables more efficient and context-aware information sharing in cooperative perception through reasoning-based communication.
Abstract: Cooperative perception among autonomous agents overcomes the limitations of single-agent sensing, but bandwidth constraints in vehicle-to-everything (V2X) networks require efficient communication policies. Existing approaches rely on reactive mechanisms, such as confidence maps, learned gating, or sparse masks, to decide what to transmit, without reasoning about why a message benefits the receiver. We introduce Reason-to-Transmit (R2T), a framework that equips each agent with a lightweight transformer-based module that reasons over local scene context, estimated neighbor information gaps, and bandwidth budget to make per-region transmission decisions. Trained end-to-end with a bandwidth-aware objective, R2T is evaluated against nine baselines in a multi-agent bird’s-eye-view perception environment. Any communication improves performance by about 58% AP over no communication. At low bandwidth, all selective methods perform similarly, but R2T shows clear gains under high occlusion, where information asymmetry is greatest, approaching oracle performance. All methods degrade gracefully under packet drops up to 50%, showing robustness to communication failures. These results indicate that while fusion design dominates performance, deliberative communication provides additional gains in challenging scenarios. R2T introduces a reasoning-based approach to communication, enabling more efficient and context-aware information sharing in cooperative perception.
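The per-region transmission decision can be pictured as a budgeted selection problem. The sketch below is our illustration, not the authors' code: region utilities (standing in for the learned module's estimates of neighbor information gaps) are traded off against transmission cost under a bandwidth budget via a greedy ratio rule.

```python
# Hypothetical sketch of R2T-style per-region transmission selection
# (names and the greedy rule are our assumptions, not the paper's method):
# each region gets a utility score -- in R2T this would come from the
# learned reasoning module -- and regions are chosen greedily by
# utility-per-byte until the bandwidth budget is exhausted.
def select_regions(utilities, costs, budget):
    """utilities[i]: estimated benefit to neighbors of sending region i;
    costs[i]: its size in bytes; budget: bytes available this round."""
    order = sorted(range(len(utilities)),
                   key=lambda i: utilities[i] / costs[i], reverse=True)
    chosen, spent = [], 0
    for i in order:
        if spent + costs[i] <= budget:
            chosen.append(i)
            spent += costs[i]
    return chosen

# High-gap, low-cost regions are transmitted first.
picked = select_regions([0.9, 0.1, 0.6], [100, 100, 50], budget=150)
assert picked == [2, 0]
```

The greedy ratio rule is only a stand-in for the learned, end-to-end-trained policy described in the abstract, but it makes the budget constraint concrete.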
[1089] When Agents Disagree: The Selection Bottleneck in Multi-Agent LLM Pipelines
Artem Maryanskyy
Main category: cs.MA
TL;DR: Multi-agent LLM pipelines show contradictory effects of team diversity on output quality, with a proposed resolution through a selection bottleneck model that identifies a crossover threshold determining when diversity helps or hurts.
Details
Motivation: The paper addresses contradictory findings in multi-agent LLM pipelines where heterogeneous teams outperform single models but homogeneous teams win under synthesis-based aggregation, seeking to resolve this paradox through understanding the selection bottleneck.
Method: Proposes a selection bottleneck model with closed-form crossover threshold, conducts targeted experiment across 42 tasks in 7 categories (N=210), compares judge-based selection vs. synthesis approaches, and performs decoupled evaluation with independent judges.
Result: Diverse team with judge-based selection achieves 0.810 win rate vs. single-model baseline, while homogeneous team scores 0.512; judge-based selection outperforms synthesis by ΔWR = +0.631; exploratory evidence shows weaker models can improve performance while reducing cost.
Conclusion: Selector quality may be more impactful than generator diversity in single-round generate-then-select pipelines, with the selection bottleneck model explaining when diversity helps or hurts performance.
Abstract: Multi-agent LLM pipelines produce contradictory evidence on whether team diversity improves output quality: heterogeneous Mixture-of-Agents teams outperform single models, yet homogeneous Self-MoA teams consistently win under synthesis-based aggregation. We propose a resolution by identifying the selection bottleneck – a crossover threshold in aggregation quality that determines whether diversity helps or hurts. Under this model, we obtain a closed-form crossover threshold $s^*$ (Proposition 1) that separates the regimes where diversity helps and hurts. In a targeted experiment spanning 42 tasks across 7 categories ($N=210$), a diverse team with judge-based selection achieves a win rate of 0.810 against a single-model baseline, while a homogeneous team scores 0.512 – near chance (Glass’s $\Delta = 2.07$). Judge-based selection outperforms MoA-style synthesis by $\Delta_{\mathrm{WR}} = +0.631$ – the synthesis approach is preferred over the baseline in zero of 42 tasks by the judge panel. A decoupled evaluation with independent judges confirms all directional findings (Spearman $\rho = 0.90$). Exploratory evidence suggests that including a weaker model improves performance while reducing cost ($p < 10^{-4}$, not pre-registered). Our results suggest that selector quality may be a more impactful design lever than generator diversity in single-round generate-then-select pipelines.
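The crossover intuition can be shown with a toy model (ours, not the paper's; the paper's threshold $s^*$ is derived for its own setting): a selector of quality s picks the best candidate with probability s and a uniformly random one otherwise, so diversity pays off only once s clears a threshold.

```python
# Toy illustration of a selection bottleneck (our simplification, not
# the paper's Proposition 1): expected output quality of a team whose
# selector has quality s in [0, 1].
def expected_quality(candidates, s):
    best = max(candidates)
    mean = sum(candidates) / len(candidates)
    return s * best + (1 - s) * mean  # best with prob. s, random otherwise

diverse = [0.9, 0.5, 0.3]      # heterogeneous team: high variance
homogeneous = [0.6, 0.6, 0.6]  # Self-MoA-style team: no variance

# Below the crossover (s* = 0.1 for these toy numbers) the homogeneous
# team wins; above it, the diverse team wins.
assert expected_quality(diverse, 0.05) < expected_quality(homogeneous, 0.05)
assert expected_quality(diverse, 0.80) > expected_quality(homogeneous, 0.80)
```

The homogeneous team's quality is flat in s, while the diverse team's rises with s; the intersection is the crossover the paper formalizes.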
[1090] ALARA for Agents: Least-Privilege Context Engineering Through Portable Composable Multi-Agent Teams
Christopher J. Agostino, Nayan D’Souza
Main category: cs.MA
TL;DR: A declarative CAT (context-agent-tool) data layer and npcsh shell for managing multi-agent systems with minimal context exposure, evaluated across 22 models on 115 practical tasks.
Details
Motivation: Current multi-agent frameworks lack unified mechanisms for managing agent context and tool access, making behavioral specifications fragmented across different files and systems, difficult to share, version, or maintain collaboratively.
Method: Introduces a declarative context-agent-tool (CAT) data layer with interrelated files that scope each agent’s tool access and context minimally, plus npcsh command-line shell for execution. System structurally enforces these specifications rather than relying on model suggestions.
Result: Evaluated 22 locally-hosted models (0.6B to 35B parameters) across 115 practical tasks spanning file operations, web search, scripting, tool chaining, and multi-agent delegation, with ~2500 total executions characterizing model family performance across task categories.
Conclusion: The CAT data layer and npcsh shell provide a unified, declarative approach to managing multi-agent systems with guaranteed behavioral changes, addressing fragmentation in current frameworks while enabling scalable agent infrastructure management.
Abstract: Industry practitioners and academic researchers regularly use multi-agent systems to accelerate their work, yet the frameworks through which these systems operate do not provide a simple, unified mechanism for scalably managing the critical aspects of the agent harness, impacting both the quality of individual human-agent interactions and the capacity for practitioners to coordinate toward common goals through shared agent infrastructure. Agent frameworks have enabled increasingly sophisticated multi-agent systems, but the behavioral specifications that define what these agents can do remain fragmented across prose instruction files, framework-internal configuration, and mechanisms like MCP servers that operate separately from individual agent definitions, making these specifications difficult to share, version, or collaboratively maintain across teams and projects. Applying the ALARA principle from radiation safety (exposures kept as low as reasonably achievable) to agent context, we introduce a declarative context-agent-tool (CAT) data layer expressed through interrelated files that scope each agent’s tool access and context to the minimum its role requires, and \texttt{npcsh}, a command-line shell for executing it. Because the system parses and enforces these files structurally, modifying an agent’s tool list produces a guaranteed behavioral change rather than a suggestion the model may or may not follow. We evaluate 22 locally-hosted models from 0.6B to 35B parameters across 115 practical tasks spanning file operations, web search, multi-step scripting, tool chaining, and multi-agent delegation, characterizing which model families succeed at which task categories and where they break down across $\sim$2500 total executions.
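The key claim is that enforcement is structural, not prompt-level. A minimal sketch of that idea (the spec format and names below are hypothetical, not the actual CAT file format or npcsh API):

```python
# Minimal sketch of structural least-privilege enforcement in the spirit
# of a CAT (context-agent-tool) layer. The dict-based spec and function
# names are our illustration, not npcsh's file format.
CAT_SPEC = {
    "researcher": {"tools": ["web_search", "read_file"]},
    "scripter":   {"tools": ["run_script"]},
}

def invoke(agent, tool, registry):
    """Refuse any tool not declared for the agent: removing a tool from
    the spec is a guaranteed behavioral change, not a suggestion the
    model may ignore."""
    if tool not in CAT_SPEC[agent]["tools"]:
        raise PermissionError(f"{agent} may not use {tool}")
    return registry[tool]()

registry = {"web_search": lambda: "results", "run_script": lambda: "ok"}
assert invoke("researcher", "web_search", registry) == "results"
try:
    invoke("researcher", "run_script", registry)  # outside the spec
except PermissionError:
    pass  # structurally blocked
```

Because the harness, not the model, checks the spec, the ALARA-style minimization of context and tool exposure holds regardless of model behavior.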
[1091] Measuring Reasoning Trace Legibility: Can Those Who Understand Teach?
Dani Roytburg, Shreya Sridhar, Daphne Ippolito
Main category: cs.MA
TL;DR: This paper evaluates reasoning language models (RLMs) by assessing the legibility of their reasoning traces, introducing “transfer utility” to measure how well these traces can guide weaker models to correct answers, finding that high-performing models often have low legibility traces.
Details
Motivation: While reasoning language models are trained to output deliberation traces before final answers to improve correctness, there's a need to assess the legibility of these reasoning traces themselves, not just final answer correctness, especially for multi-agent scenarios where traces need to guide other models.
Method: Evaluated 90k reasoning traces from 12 Reasoning Language Models, introduced “transfer utility” metric to measure how useful an RLM’s reasoning traces are for guiding a weaker, non-reasoning model to correct answers, and analyzed tensions between efficiency-based legibility measurements and transfer utility.
Result: Found that reasoning traces of highest-performing models rank among the lowest for legibility, uncovered tensions between efficiency-based legibility measurements and transfer utility, established a legibility Pareto frontier, and discovered that reward models used to train RLMs don’t intrinsically reward legibility.
Conclusion: Legibility of reasoning traces is an important but often overlooked aspect of RLMs, with tensions between efficiency and utility, and that legibility can be task- and audience-dependent, suggesting a need for better metrics and training approaches for multi-agent reasoning systems.
Abstract: Language models are increasingly being trained to “reason” before answering users’ queries, outputting hundreds or even thousands of tokens worth of deliberation before their final answer. While the main intention of reasoning is to improve models’ ability to arrive at a correct answer, we argue that these models should be assessed for the legibility of their reasoning traces in addition to the correctness of their final answers. In this paper, we evaluate 90k traces from 12 Reasoning Language Models (RLMs) for the quality of their reasoning traces. We introduce the concept of transfer utility, which assesses how useful an RLM’s reasoning traces are for guiding a weaker, non-reasoning model toward arriving at the correct answer. We find that the reasoning traces of the highest-performing models rank among the lowest for legibility. Furthermore, we uncover tensions between efficiency-based measurements of legibility (such as trace length) and transfer utility. These tensions establish a legibility Pareto frontier, and we demonstrate that an RLM’s ability to output highly legible traces can be a task- and audience-dependent goal. Crucially, we find that reward models used to train RLMs do not intrinsically reward legibility. Together, these metrics and the findings they surface chart a path towards scaffolding reasoning traces for a multi-agent future.
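Transfer utility can be sketched as an accuracy delta: how much better does the weaker model do when conditioned on the stronger model's trace? The function and stub below are our illustration of that measurement, not the paper's implementation.

```python
# Hedged sketch of a transfer-utility-style measurement (our naming and
# formula, not the paper's code): the gain in a weak, non-reasoning
# model's accuracy when it is shown the RLM's reasoning trace.
def transfer_utility(weak_model, questions, answers, traces):
    base = sum(weak_model(q) == a for q, a in zip(questions, answers))
    guided = sum(weak_model(q, trace=t) == a
                 for q, a, t in zip(questions, answers, traces))
    return (guided - base) / len(questions)

# Stub "weak model" for illustration only: it answers correctly exactly
# when a legible trace hands it the answer.
def stub_weak(question, trace=None):
    return trace if trace is not None else "dunno"

util = transfer_utility(stub_weak, ["q1", "q2"], ["a1", "a2"], ["a1", "a2"])
assert util == 1.0  # perfectly legible traces transfer fully
```

Under this framing, a long but opaque trace can score high on final-answer accuracy yet low on transfer utility, which is the tension the paper's Pareto frontier captures.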
[1092] Personality-Driven Student Agent-Based Modeling in Mathematics Education: How Well Do Student Agents Align with Human Learners?
Bushi Xiao, Qian Shen
Main category: cs.MA
TL;DR: Student agents with Big Five personality traits show 71.4% behavioral alignment with human learners in educational simulations
Details
Motivation: Educational research faces ethical constraints in real-person experiments, and LLM-based generative agents offer a way to simulate student behavior, but their credibility needs validation.
Method: Built Big Five Personality-based student agent model with full pipeline of student-teacher interaction, self-study, and examination. Evaluated using 14 criteria distilled from 13 empirical studies on Big Five traits and learning.
Result: 71.4% of student agents’ behavior was aligned with human learners, demonstrating credible simulation capabilities
Conclusion: LLM-based student agents show promising fidelity in simulating human learning behaviors, enabling ethical educational research simulations
Abstract: It is crucial to explore the impact of different teaching methods on student learning in educational research. However, real-person experiments face significant ethical constraints, and we cannot conduct repeated teaching experiments on the same student. LLM-based generative agents offer a promising avenue for simulating student behavior. Before large-scale experiments, a fundamental question must be addressed: are student agents truly credible, and can they faithfully simulate human learning? In this study, we built a Big Five Personality-based student agent model with a full pipeline of student-teacher interaction, self-study, and examination. To evaluate behavioral fidelity, we collected 13 empirical studies on Big Five traits and learning, and distilled them into 14 criteria. We found that 71.4% of the student agents’ behavior was aligned with human learners.
[1093] Is AI Ready for Multimodal Hate Speech Detection? A Comprehensive Dataset and Benchmark Evaluation
Rui Xing, Qi Chai, Jie Ma, Jing Tao, Pinghui Wang, Shuming Zhang, Xinping Wang, Hao Wang
Main category: cs.MA
TL;DR: M^3 dataset: Multi-platform, multi-lingual multimodal meme dataset with fine-grained hate speech labels and human-verified rationales, created using an agentic annotation framework with 7 specialized agents.
Details
Motivation: Existing multimodal hate speech datasets have coarse-grained labeling and lack integration with surrounding discourse, leading to imprecise assessments of hate speech in memes that require cultural knowledge for interpretation.
Method: Proposed an agentic annotation framework coordinating seven specialized agents to generate hierarchical labels and rationales. Constructed M^3 dataset of 2,455 memes from X, 4chan, and Weibo with fine-grained hate labels and human-verified rationales.
Result: Benchmarking state-of-the-art Multimodal Large Language Models shows they struggle to effectively utilize surrounding post context, often failing to improve or even degrading detection performance.
Conclusion: Current multimodal models face challenges in reasoning over memes embedded in real-world discourse, highlighting the need for context-aware multimodal architectures for hate speech detection.
Abstract: Hate speech online targets individuals or groups based on identity attributes and spreads rapidly, posing serious social risks. Memes, which combine images and text, have emerged as a nuanced vehicle for disseminating hate speech, often relying on cultural knowledge for interpretation. However, existing multimodal hate speech datasets suffer from coarse-grained labeling and a lack of integration with surrounding discourse, leading to imprecise and incomplete assessments. To bridge this gap, we propose an agentic annotation framework that coordinates seven specialized agents to generate hierarchical labels and rationales. Based on this framework, we construct M^3 (Multi-platform, Multi-lingual, and Multimodal Meme), a dataset of 2,455 memes collected from X, 4chan, and Weibo, featuring fine-grained hate labels and human-verified rationales. Benchmarking state-of-the-art Multimodal Large Language Models reveals that these models struggle to effectively utilize surrounding post context, which often fails to improve or even degrades detection performance. Our finding highlights the challenges these models face in reasoning over memes embedded in real-world discourse and underscores the need for a context-aware multimodal architecture. Our dataset and code are available at https://github.com/mira-ai-lab/M3.
[1094] Strategic Infrastructure Design via Multi-Agent Congestion Games with Joint Placement and Pricing
Niloofar Aminikalibar, Farzaneh Farhadi, Maria Chli
Main category: cs.MA
TL;DR: A multi-agent framework for joint placement and pricing of congestible resources using bi-level optimization with congestion games, applied to EV charging infrastructure planning.
Details
Motivation: Real-world infrastructure planning requires coordination among autonomous agents competing for limited resources, particularly in applications like EV charging, emergency response, and intelligent transportation where strategic interactions between central planners and self-interested agents must be considered.
Method: Proposes a bi-level optimization model: upper level for central planner making placement and pricing decisions, lower level captures agent responses via coupled non-atomic congestion games. Introduces ABO-MPN framework with double-layer approximation that decouples agent types, applies integer adjustment and rounding.
Result: Experiments on benchmark networks show the model reduces social cost by up to 40% compared to placement- or pricing-only baselines, and generalizes to other multi-agent system domains.
Conclusion: The proposed framework effectively addresses strategic interactions in infrastructure planning, demonstrating significant social cost reduction through joint placement and pricing decisions in multi-agent systems.
Abstract: Real-world infrastructure planning increasingly involves strategic interactions among autonomous agents competing over congestible, limited resources. Applications such as Electric Vehicle (EV) charging, emergency response, and intelligent transportation require coordinated resource placement and pricing decisions, while anticipating the adaptive behaviour of decentralised, self-interested agents. We propose a novel multi-agent framework for joint placement and pricing under such interactions, formalised as a bi-level optimisation model. The upper level represents a central planner, while the lower level captures agent responses via coupled non-atomic congestion games. Motivated by the EV charging domain, we study a setting where a central planner provisions chargers and road capacity under budget and profitability constraints. The agent population includes both EV drivers and non-charging drivers (NCDs), who respond to congestion, delays, and costs. To solve the resulting NP-hard problem, we introduce ABO-MPN, a double-layer approximation framework that decouples agent types, applies integer adjustment and rounding, and targets high-impact placement and pricing decisions. Experiments on benchmark networks show that our model reduces social cost by up to 40% compared to placement- or pricing-only baselines, and generalises to other MAS-relevant domains.
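The lower level of the bi-level model is a non-atomic congestion game; a two-route toy version (our illustration, far simpler than the paper's coupled EV/NCD game) shows how a planner's price changes the equilibrium the agents settle into.

```python
# Toy non-atomic congestion game (our illustration, not the paper's
# model): unit demand splits between route A with latency 1 + 2x and
# route B with latency 2 + (1 - x), where x is the flow on A. At a
# Wardrop equilibrium the used routes have equal cost; a toll on A
# (the planner's pricing lever) shifts the split.
def equilibrium_split(toll_a=0.0):
    """Solve 1 + 2x + toll_a = 2 + (1 - x) for the route-A flow x."""
    x = (2 - toll_a) / 3
    return min(max(x, 0.0), 1.0)  # clamp to a valid flow share

assert abs(equilibrium_split(0.0) - 2 / 3) < 1e-9
assert abs(equilibrium_split(0.5) - 0.5) < 1e-9  # pricing reroutes flow
```

The upper-level planner in the paper searches over placement and pricing decisions while anticipating exactly this kind of equilibrium response.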
[1095] A Game-Theoretic Framework for Intelligent EV Charging Network Optimisation in Smart Cities
Niloofar Aminikalibar, Farzaneh Farhadi, Maria Chli
Main category: cs.MA
TL;DR: Joint optimization framework for EV charging station placement and pricing using congestion games to balance user convenience, economic viability, and traffic efficiency.
Details
Motivation: The transition to Electric Vehicles demands intelligent infrastructure planning that balances user convenience, economic viability, and traffic efficiency while capturing strategic driver behavior.
Method: Two-level approximation method (JPPO-DE) combining driver behavior decomposition with integer relaxation to solve Mixed-Integer Nonlinear Programme for CS placement and pricing.
Result: Method outperforms single-parameter baselines by at least 16%, effectively adapts to varying budgets, EV penetration levels, and station capacities on Sioux Falls Transportation Network.
Conclusion: Framework advances intelligent transportation system goals for sustainable urban mobility by accurately modeling traffic equilibria and enabling adaptive infrastructure design.
Abstract: The transition to Electric Vehicles (EVs) demands intelligent, congestion-aware infrastructure planning to balance user convenience, economic viability, and traffic efficiency. We present a joint optimisation framework for EV Charging Station (CS) placement and pricing, explicitly capturing strategic driver behaviour through coupled non-atomic congestion games over road networks and charging facilities. From a Public Authority (PA) perspective, the model minimises social cost, travel times, queuing delays and charging expenses, while ensuring infrastructure profitability. To solve the resulting Mixed-Integer Nonlinear Programme, we propose a scalable two-level approximation method, Joint Placement and Pricing Optimisation under Driver Equilibrium (JPPO-DE), combining driver behaviour decomposition with integer relaxation. Experiments on the benchmark Sioux Falls Transportation Network (TN) demonstrate that our method consistently outperforms single-parameter baselines, effectively adapting to varying budgets, EV penetration levels, and station capacities. It achieves performance improvements of at least 16% over state-of-the-art approaches. A generalisation procedure further extends scalability to larger networks. By accurately modelling traffic equilibria and enabling adaptive, efficient infrastructure design, our framework advances key intelligent transportation system goals for sustainable urban mobility.
[1096] Human-Inspired Pavlovian and Instrumental Learning for Autonomous Agent Navigation
Jingfeng Shan, Francesco Guidi, Mehrdad Saeidi, Enrico Testi, Elia Favarelli, Andrea Giorgetti, Davide Dardari, Alberto Zanella, Giorgio Li Pira, Francesca Starita, Anna Guerra
Main category: cs.MA
TL;DR: A biologically-inspired hybrid RL architecture combining Pavlovian, model-free, and model-based components with Bayesian arbitration for autonomous agents in uncertain environments.
Details
Motivation: To address limitations of classical RL approaches: model-free RL converges slowly and may be unsafe, while model-based methods are computationally expensive and sensitive to model mismatch. The paper aims to create more robust autonomous agents by drawing inspiration from neuroscience principles of Pavlovian and instrumental learning.
Method: Proposes a hybrid RL architecture with three components: Pavlovian conditioning (using contextual radio cues as conditioned stimuli), instrumental model-free learning, and instrumental model-based learning. Includes a motivational signal to modulate learning and a Bayesian arbitration mechanism that adaptively blends model-free and model-based estimates based on predicted reliability.
Result: Simulation results show the hybrid approach accelerates learning, improves operational safety, and reduces navigation in high-uncertainty regions compared to standard RL baselines. Pavlovian conditioning promotes safer exploration and faster convergence, while arbitration enables smooth transition from exploration to plan-driven exploitation.
Conclusion: Biologically inspired modularity benefits robust and adaptive autonomous systems under uncertainty. The integration of Pavlovian conditioning with model-free and model-based learning, modulated by motivational signals and adaptive arbitration, creates more efficient and safer autonomous agents.
Abstract: Autonomous agents operating in uncertain environments must balance fast responses with goal-directed planning. Classical model-free (MF) RL often converges slowly and may induce unsafe exploration, whereas model-based (MB) methods are computationally expensive and sensitive to model mismatch. This paper presents a human-inspired hybrid RL architecture integrating Pavlovian, Instrumental MF, and Instrumental MB components. Inspired by Pavlovian and Instrumental learning from neuroscience, the framework considers contextual radio cues, here intended as georeferenced environmental features acting as conditioned stimuli (CS), to shape intrinsic value signals and bias decision-making. Learning is further modulated by internal motivational drives through a dedicated motivational signal. A Bayesian arbitration mechanism adaptively blends MF and MB estimates based on predicted reliability. Simulation results show that the hybrid approach accelerates learning, improves operational safety, and reduces navigation in high-uncertainty regions compared to standard RL baselines. Pavlovian conditioning promotes safer exploration and faster convergence, while arbitration enables a smooth transition from exploration to efficient, plan-driven exploitation. Overall, the results highlight the benefits of biologically inspired modularity for robust and adaptive autonomous systems under uncertainty.
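Reliability-based arbitration between the two systems can be sketched with precision weighting, a common simplification of Bayesian arbitration (our assumption; the paper's mechanism may differ): each estimate is weighted by its inverse prediction variance.

```python
# Sketch of reliability-weighted arbitration between model-free (MF) and
# model-based (MB) value estimates. Precision weighting (weights
# proportional to inverse variance) is our simplification of the paper's
# Bayesian arbitration mechanism.
def arbitrate(q_mf, var_mf, q_mb, var_mb):
    w_mf = (1 / var_mf) / (1 / var_mf + 1 / var_mb)
    return w_mf * q_mf + (1 - w_mf) * q_mb

# Early in learning the MF estimate is noisy, so MB dominates; as MF
# variance shrinks, the blend shifts toward the cheap MF estimate.
early = arbitrate(0.2, 4.0, 1.0, 1.0)   # MF unreliable -> near MB value
late = arbitrate(0.2, 0.25, 1.0, 1.0)   # MF reliable -> near MF value
assert abs(early - 0.84) < 1e-9  # 0.2 * 0.2 + 0.8 * 1.0
assert abs(late - 0.36) < 1e-9   # 0.8 * 0.2 + 0.2 * 1.0
```

This captures the qualitative behavior the abstract describes: a smooth, reliability-driven handover from exploratory MB planning to efficient MF exploitation.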
[1097] Persona Alchemy: Designing, Evaluating, and Implementing Psychologically-Grounded LLM Agents for Diverse Stakeholder Representation
Sola Kim, Dongjune Chang, Jieshu Wang
Main category: cs.MA
TL;DR: A Social Cognitive Theory framework for designing psychologically grounded LLM personas with consistent behavior, evaluated through stakeholder representation in renewable energy discourse.
Details
Motivation: Current LLM persona designs lack alignment with human cognitive processes and fail to adequately represent diverse stakeholder perspectives, requiring a more psychologically grounded framework.
Method: Social Cognitive Theory operationalized through four personal factors for design, six quantifiable constructs for evaluation, and graph database architecture for implementing stakeholder personas, tested with contradictory information scenarios.
Result: Agents showed consistent response patterns (R² 0.58-0.61), systematic temporal development of SCT construct effects, and PCA identified two dimensions explaining 73% of variance, validating theoretical structure.
Conclusion: The SCT framework improves explainability and reproducibility over black-box approaches, contributing to better stakeholder representation while maintaining psychological consistency in LLM personas.
Abstract: Despite advances in designing personas for Large Language Models (LLM), challenges remain in aligning them with human cognitive processes and representing diverse stakeholder perspectives. We introduce a Social Cognitive Theory (SCT) agent design framework for designing, evaluating, and implementing psychologically grounded LLMs with consistent behavior. Our framework operationalizes SCT through four personal factors (cognitive, motivational, biological, and affective) for designing, six quantifiable constructs for evaluating, and a graph database-backed architecture for implementing stakeholder personas. Experiments tested agents’ responses to contradicting information of varying reliability. In the highly polarized renewable energy transition discourse, we design five diverse agents with distinct ideologies, roles, and stakes to examine stakeholder representation. The evaluation of these agents in contradictory scenarios occurs through comprehensive processes that implement the SCT. Results show consistent response patterns ($R^2$ range: $0.58-0.61$) and systematic temporal development of SCT construct effects. Principal component analysis identifies two dimensions explaining $73$% of variance, validating the theoretical structure. Our framework offers improved explainability and reproducibility compared to black-box approaches. This work contributes to ongoing efforts to improve diverse stakeholder representation while maintaining psychological consistency in LLM personas.
[1098] The Coordination Gap: Multi-Agent Alternation Metrics for Temporal Fairness in Repeated Games
Nikolaos Al. Papadopoulos, Konstantinos Psannis
Main category: cs.MA
TL;DR: The paper introduces temporally-sensitive Alternation (ALT) metrics to evaluate coordination quality in multi-agent games, revealing that conventional metrics can mask poor temporal coordination despite high aggregate payoffs.
Details
Motivation: Conventional outcome-based metrics for multi-agent coordination are temporally blind and fail to distinguish structured coordination patterns from random or monopolistic behaviors, especially as the number of agents grows.
Method: Introduces Perfect Alternation (PA) as a reference coordination regime and proposes six novel Alternation (ALT) metrics as temporally sensitive observables. Uses Q-learning agents as a diagnostic baseline and compares against random-policy null processes in a BoE-derived multi-agent variant of Battle of the Exes formalized as a Markov game.
Result: Learned policies perform up to 81% below random baselines under ALT metrics despite high traditional metrics (reward fairness often >0.9). The coordination deficit is present in two-agent cases and intensifies with more agents, showing that high aggregate payoffs can coexist with poor temporal coordination.
Conclusion: Temporally aware observables are necessary for analyzing coordination in multi-agent games, and random-policy baselines are essential null processes for interpreting coordination outcomes relative to chance-level behavior.
Abstract: Multi-agent coordination dilemmas expose a fundamental tension between individual optimization and collective welfare, yet characterizing such coordination requires metrics sensitive to temporal structure and collective dynamics. As a diagnostic testbed, we study a BoE-derived multi-agent variant of the Battle of the Exes, formalizing it as a Markov game in which turn-taking emerges as a periodic coordination regime. Conventional outcome-based metrics (e.g., efficiency and min/max fairness) are temporally blind (they cannot distinguish structured alternation from monopolistic or random access patterns) and fairness ratios lose discriminative power as n grows, obscuring inequities. To address this limitation, we introduce Perfect Alternation (PA) as a reference coordination regime and propose six novel Alternation (ALT) metrics designed as temporally sensitive observables of coordination quality. Using Q-learning agents as a minimal adaptive diagnostic baseline, and comparing against random-policy null processes, we uncover a clear measurement failure: despite exhibiting deceptively high traditional metrics (e.g., reward fairness often exceeding 0.9), learned policies perform up to 81% below random baselines under ALT-variant evaluation, a deficit already present in the two-agent case and intensifying as n grows. These results demonstrate, in this setting, that high aggregate payoffs can coexist with poor temporal coordination, and that conventional metrics may severely mischaracterize emergent dynamics. Our findings underscore the necessity of temporally aware observables for analyzing coordination in multi-agent games and highlight random-policy baselines as essential null processes for interpreting coordination outcomes relative to chance-level behavior.
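A minimal example of what "temporally sensitive" means here. The score below is a toy of our own, not one of the paper's six ALT metrics, but it shows how a sequence-aware observable separates regimes that aggregate fairness cannot:

```python
def alternation_score(winners):
    """Fraction of consecutive rounds in which the winning agent changes:
    1.0 for perfect turn-taking, roughly 0.5 for random winners, and
    near 0.0 for a monopoly. Toy stand-in for the paper's ALT metrics."""
    if len(winners) < 2:
        return 0.0
    switches = sum(a != b for a, b in zip(winners, winners[1:]))
    return switches / (len(winners) - 1)

# Both sequences give each agent half the wins (identical reward fairness),
# yet their temporal structure is completely different.
perfect = alternation_score([0, 1, 0, 1, 0, 1])
blocky = alternation_score([0, 0, 0, 1, 1, 1])
```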
cs.MM
[1099] FIGURA: A Modular Prompt Engineering Method for Artistic Figure Photography in Safety-Filtered Text-to-Image Models
Luca Cazzaniga
Main category: cs.MM
TL;DR: A prompt engineering framework called FIGURA that enables generation of artistic figure photography within active safety filters of text-to-image models by exploiting filter detection patterns.
Details
Motivation: Commercial text-to-image models' safety filters systematically block legitimate artistic content involving human figures, treating classical nude photography with the same restrictiveness as explicit material, preventing professional artists from generating artistic figure photography.
Method: FIGURA Method - a modular prompt engineering system with eight interconnected knowledge files, empirically validated through 200+ documented generation tests on FLUX 2 Pro with active safety filters at default tolerance level.
Result: Achieves 80-90% success rates across five structured prompt templates; reveals key findings: safety filters detect absence descriptions (missing clothing) rather than presence descriptions (body form), artistic references to painters serve as safety anchors, spatial context operates as independent filter variable, and geometric vocabulary bypasses pattern recognition.
Conclusion: The artistic censorship problem in text-to-image models admits practical, systematic solutions that work with active safety mechanisms rather than circumventing them, enabling professional artists to generate artistic figure photography.
Abstract: Safety filters in commercial text-to-image (T2I) models systematically block legitimate artistic content involving the human figure, treating classical nude photography with the same restrictiveness as explicit material. While prior research has documented this problem extensively, no operational system exists that enables professional artists to generate artistic figure photography within the constraints of active safety filters. We present the FIGURA Method (Framework for Intelligent Generation of Unrestricted Artistic Results), a modular prompt engineering system comprising eight interconnected knowledge files, empirically validated through 200+ documented generation tests on FLUX 2 Pro (Cloud) with active safety filters at the default tolerance level. Our systematic testing reveals several previously undocumented findings: (1) safety filters primarily detect absence descriptions (references to missing clothing) rather than presence descriptions (references to body form), which we formalize as the Golden Rule; (2) artistic references to painters function simultaneously as aesthetic guides and as safety anchors that alter filter behavior; (3) spatial context operates as an independent filter variable, with documented success rate hierarchies; and (4) geometric vocabulary for body description bypasses pattern recognition in silhouette contexts. The system achieves documented success rates between 80% and 90% across five structured prompt templates, demonstrating that the artistic censorship problem identified in recent literature admits practical, systematic solutions that work with active safety mechanisms rather than circumventing them.
[1100] Leum-VL Technical Report
Yuxuan He, Chaiming Huang, Yifan Wu, Hongjun Wang, Chenkui Shen, Jifan Zhang, Long Li
Main category: cs.MM
TL;DR: SV6D introduces a structured video representation framework with six dimensions (subject, aesthetics, camera language, editing, narrative, dissemination) for analyzing short video structure, implemented in the Leum-VL-8B model.
Details
Motivation: Current multimodal models lack structural grammar to parse or produce the attention scheduling and organization that makes short videos successful. They can describe scenes and answer questions but fail at timeline-grounded units like hooks, cut rationales, shot-induced tension, and platform-facing packaging cues.
Method: Proposes SV6D (Structured Video in Six Dimensions) framework inspired by professional storyboard practice, decomposing video into six structural dimensions with timeline-grounded labels. Presents Leum-VL-8B, an 8B video-language model trained with unified optimization objective combining Hungarian-matched temporal alignment, dimension-wise semantic label distance, and quality regularization, refined through verifiable reinforcement learning.
Result: Leum-VL-8B achieves 70.8 on VideoMME (w/o subtitles), 70.0 on MVBench, and 61.6 on MotionBench, while remaining competitive on general multimodal evaluations. Also introduces FeedBench benchmark for structure-sensitive short-video understanding.
Conclusion: The missing layer in video AI is structural representation rather than pixel generation. SV6D provides timeline-grounded, evidence-linked structural representation directly consumable by downstream workflows like editing, retrieval, recommendation, and generation control.
Abstract: A short video succeeds not simply because of what it shows, but because of how it schedules attention – yet current multimodal models lack the structural grammar to parse or produce this organization. Existing models can describe scenes, answer event-centric questions, and read on-screen text, but they are far less reliable at identifying timeline-grounded units such as hooks, cut rationales, shot-induced tension, and platform-facing packaging cues. We propose SV6D (Structured Video in Six Dimensions), a representation framework inspired by professional storyboard practice in film and television production that decomposes internet-native video into six complementary structural dimensions – subject, aesthetics, camera language, editing, narrative, and dissemination – with each label tied to physically observable evidence on the timeline. We formalize a unified optimization objective over SV6D that combines Hungarian-matched temporal alignment, dimension-wise semantic label distance, and quality regularization. Building on this framework, we present Leum-VL-8B, an 8B video-language model that realizes the SV6D objective through an expert-driven post-training pipeline, further refined through verifiable reinforcement learning on perception-oriented tasks. Leum-VL-8B achieves 70.8 on VideoMME (w/o subtitles), 70.0 on MVBench, and 61.6 on MotionBench, while remaining competitive on general multimodal evaluations such as MMBench-EN. We also construct FeedBench, a benchmark for structure-sensitive short-video understanding. Our results indicate that the missing layer in video AI is not pixel generation but structural representation: grounded on the timeline, linked to visible evidence, and directly consumable by downstream workflows such as editing, retrieval, recommendation, and generation control, including text-heavy internet video formats with overlays and image-text layouts.
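The Hungarian-matched temporal alignment in the objective can be illustrated with a toy matcher. Exhaustive search over assignments stands in for the Hungarian algorithm (fine at toy sizes, O(n!) in general), and the real cost's dimension-wise semantic label term is omitted; all names here are ours:

```python
from itertools import permutations

def temporal_iou(a, b):
    """IoU of two (start, end) intervals on the timeline."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def match_segments(pred, ref):
    """Optimal one-to-one matching of predicted to reference segments under
    a (1 - temporal IoU) cost; brute-force stand-in for Hungarian matching."""
    best, best_cost = None, float("inf")
    for perm in permutations(range(len(ref)), len(pred)):
        cost = sum(1.0 - temporal_iou(p, ref[j]) for p, j in zip(pred, perm))
        if cost < best_cost:
            best, best_cost = list(enumerate(perm)), cost
    return best

# Each predicted segment pairs with the reference segment it overlaps most.
pairs = match_segments(pred=[(0, 2), (5, 9)], ref=[(4, 10), (0, 3)])
```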
[1101] AcoustEmo: Open-Vocabulary Emotion Reasoning via Utterance-Aware Acoustic Q-Former
Liyun Zhang, Xuanmeng Sha, Shuqiong Wu, Fengkai Liu
Main category: cs.MM
TL;DR: AcoustEmo: A time-sensitive multimodal LLM with Utterance-Aware Acoustic Q-Former for fine-grained emotion recognition by capturing local temporal acoustic dynamics instead of global audio representations.
Details
Motivation: Current MLLMs for emotion recognition use global audio encoders that fail to capture subtle local temporal dynamics like micro-prosody and intonation shifts within utterances, limiting fine-grained acoustic modeling.
Method: Proposes AcoustEmo with a novel Utterance-Aware Acoustic Q-Former that uses timestamp-synchronized sliding windows to dynamically extract segment-level audio tokens, enabling explicit tracing of temporal evolution of acoustic clues and capturing deep contextual dependencies in dialogues.
Result: Experiments on Explainable Multimodal Emotion Recognition (EMER) show AcoustEmo significantly enhances complex emotion reasoning, outperforming baselines while maintaining robust contextual accuracy.
Conclusion: The proposed time-sensitive approach with fine-grained acoustic modeling improves emotion recognition in multimodal LLMs by better capturing local temporal dynamics.
Abstract: Multimodal Large Language Models (MLLMs) excel in Open-Vocabulary (OV) emotion recognition but often neglect fine-grained acoustic modeling. Existing methods typically use global audio encoders, failing to capture subtle, local temporal dynamics like micro-prosody and intonation shifts within individual utterances. To address this, we propose AcoustEmo, a time-sensitive MLLM featuring a novel Utterance-Aware Acoustic Q-Former. Our approach utilizes a timestamp-synchronized sliding window to dynamically extract segment-level audio tokens instead of coarse global representations. This enables the model to explicitly trace the temporal evolution of subtle acoustic clues and capture deep contextual dependencies in dialogues. Experiments on the Explainable Multimodal Emotion Recognition (EMER) task show that AcoustEmo significantly enhances complex emotion reasoning, outperforming baselines while maintaining robust contextual accuracy.
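The timestamp-synchronized sliding-window idea reduces to plain segmentation logic; the window and hop lengths below are illustrative assumptions, and in the paper each resulting window would be pooled by the Q-Former into segment-level audio tokens:

```python
def sliding_segments(num_frames, frame_rate_hz, win_s=1.0, hop_s=0.5):
    """Sliding windows over an audio frame sequence, as (start, end) frame
    indices. Sketch of the segmentation step behind an utterance-aware
    Q-Former; the final window is truncated at the utterance boundary."""
    win = int(win_s * frame_rate_hz)
    hop = int(hop_s * frame_rate_hz)
    segments, start = [], 0
    while start < num_frames:
        segments.append((start, min(start + win, num_frames)))
        start += hop
    return segments

# 2.5 s of 100 Hz frames -> overlapping 1 s windows every 0.5 s.
segs = sliding_segments(num_frames=250, frame_rate_hz=100)
```

Because windows overlap, consecutive segment tokens share context, which is what lets the model trace how prosody evolves across an utterance.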
[1102] Look, Listen and Segment: Towards Weakly Supervised Audio-visual Semantic Segmentation
Chengzhi Li, Heyan Huang, Ping Jian, Yanghao Zhou
Main category: cs.MM
TL;DR: WSAVSS enables audio-visual semantic segmentation using only video-level labels via Progressive Cross-modal Alignment for Semantics (PCAS) with looking-before-listening and listening-before-segmentation modules.
Details
Motivation: Traditional Audio-Visual Semantic Segmentation (AVSS) requires costly per-frame annotations. The authors aim to develop a weakly supervised approach that only needs video-level labels to generate per-frame semantic masks of sounding objects.
Method: Progressive Cross-modal Alignment for Semantics (PCAS) decomposes the task into looking, listening, and segmentation. It includes: 1) Looking-before-Listening module that trains audio-visual encoder using video labels, 2) Listening-before-Segmentation module that injects visual semantic prompts to enhance frame-level audio understanding, and 3) progressive contrastive alignment to map audio categories to image regions without mask annotations.
Result: PCAS achieves state-of-the-art performance among weakly supervised methods on Audio-Visual Segmentation (AVS) and remains competitive with fully supervised baselines on AVSS, validating its effectiveness.
Conclusion: The proposed WSAVSS framework with PCAS successfully enables audio-visual semantic segmentation using only video-level labels, reducing annotation costs while maintaining competitive performance with fully supervised approaches.
Abstract: Audio-Visual Semantic Segmentation (AVSS) aligns audio and video at the pixel level but requires costly per-frame annotations. We introduce Weakly Supervised Audio-Visual Semantic Segmentation (WSAVSS), which uses only video-level labels to generate per-frame semantic masks of sounding objects. We decompose WSAVSS into looking, listening, and segmentation, and propose Progressive Cross-modal Alignment for Semantics (PCAS) with two modules: Looking-before-Listening and Listening-before-Segmentation. PCAS builds a classification task to train the audio-visual encoder using video labels, injects visual semantic prompts to enhance frame-level audio understanding, and then applies progressive contrastive alignment to map audio categories to image regions without mask annotations. Experiments show PCAS achieves state-of-the-art performance among weakly supervised methods on AVS and remains competitive with fully supervised baselines on AVSS, validating its effectiveness.
eess.AS
[1103] End-to-End Multi-Task Learning for Adjustable Joint Noise Reduction and Hearing Loss Compensation
Philippe Gonzalez, Vera Margrethe Frederiksen, Torsten Dau, Tobias May
Main category: eess.AS
TL;DR: A multi-task learning framework for joint noise reduction and hearing loss compensation using a single DNN with adjustable masks and differentiable auditory model.
Details
Motivation: To develop a unified hearing aid processing system that can perform both noise reduction and hearing loss compensation simultaneously with adjustable parameters, overcoming limitations of separate systems and non-differentiable auditory models.
Method: Proposes a single DNN trained with multi-task objectives for joint NR and HLC. Uses two time-frequency masks (one for each task) that can be independently adjusted via exponentiation. Features an inherently differentiable auditory model for end-to-end optimization and includes audiogram input for listener personalization without retraining.
Result: The approach allows independent adjustment of NR and HLC amounts, improves objective metrics compared to single-objective optimization, outperforms cascaded separately-trained DNNs, and shows competitive HLC performance compared to traditional hearing-aid prescriptions.
Conclusion: This is the first study to use an auditory model to train a single DNN for both noise reduction and hearing loss compensation across diverse listener profiles, offering flexible, personalized hearing aid processing.
Abstract: A multi-task learning framework is proposed for optimizing a single deep neural network (DNN) for joint noise reduction (NR) and hearing loss compensation (HLC). A distinct training objective is defined for each task, and the DNN predicts two time-frequency masks. During inference, the amounts of NR and HLC can be adjusted independently by exponentiating each mask before combining them. In contrast to recent approaches that rely on training an auditory-model emulator to define a differentiable training objective, we propose an auditory model that is inherently differentiable, thus allowing end-to-end optimization. The audiogram is provided as an input to the DNN, thereby enabling listener-specific personalization without the need for retraining. Results show that the proposed approach not only allows adjusting the amounts of NR and HLC individually, but also improves objective metrics compared to optimizing a single training objective. It also outperforms a cascade of two DNNs that were separately trained for NR and HLC, and shows competitive HLC performance compared to a traditional hearing-aid prescription. To the best of our knowledge, this is the first study that uses an auditory model to train a single DNN for both NR and HLC across a wide range of listener profiles.
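The exponentiation-based adjustment is easy to picture. The sketch below follows the abstract's description (exponentiate each mask, then combine), with variable names and toy magnitude values of our own:

```python
def apply_adjustable_masks(spec, m_nr, m_hlc, alpha=1.0, beta=1.0):
    """Apply NR and HLC time-frequency masks with independent exponents.

    Raising each predicted mask to a user-chosen power scales the amount of
    noise reduction (alpha) and compensation (beta) independently; alpha = 0
    disables NR entirely, since m ** 0 == 1. Magnitude-domain toy example."""
    return [s * (nr ** alpha) * (hlc ** beta)
            for s, nr, hlc in zip(spec, m_nr, m_hlc)]

# With alpha = 0 the NR mask drops out and only compensation is applied.
out = apply_adjustable_masks(spec=[1.0, 1.0], m_nr=[0.25, 0.5],
                             m_hlc=[2.0, 1.5], alpha=0.0)
```

The appeal of this design is that a listener can dial NR and HLC up or down at inference time without retraining the network.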
[1104] OmniCodec: Low Frame Rate Universal Audio Codec with Semantic-Acoustic Disentanglement
Jingbin Hu, Haoyu Zhang, Dake Guo, Qirui Zhan, Wenhao Li, Huakang Chen, Guobin Ma, Hanke Xie, Chengyou Wang, Pengyuan Xie, Chuan Xie, Qiang Zhang, Lei Xie
Main category: eess.AS
TL;DR: OmniCodec is a universal neural audio codec designed for low frame rate across diverse audio domains (speech, music, general sound) using hierarchical multi-codebook design with semantic-acoustic decoupling and self-guidance strategy.
Details
Motivation: Existing neural codecs focus primarily on speech and emphasize reconstruction fidelity, overlooking unified low frame rate modeling across diverse audio domains. High reconstruction quality doesn't necessarily yield semantically informative representations, limiting effectiveness in downstream generation tasks.
Method: Proposes OmniCodec with hierarchical multi-codebook design featuring semantic-acoustic decoupling by leveraging pre-trained audio encoder from understanding models, plus self-guidance strategy to improve codebook utilization and reconstruction.
Result: Outperforms Mimi codec at same bitrate, delivering superior reconstruction quality while providing more semantically informative representations that benefit downstream generation tasks.
Conclusion: OmniCodec offers a universal neural audio codec solution for low frame rate across diverse audio domains with improved semantic representation quality for generation tasks.
Abstract: Large Language Models (LLMs) have advanced audio generation through discrete representation learning. However, most existing neural codecs focus on speech and emphasize reconstruction fidelity, overlooking unified low frame rate modeling across diverse audio domains, including speech, music, and general sound. Moreover, high reconstruction quality does not necessarily yield semantically informative representations, limiting effectiveness in downstream generation tasks. We propose OmniCodec, a universal neural audio codec tailored for low frame rate. It adopts a hierarchical multi-codebook design with semantic-acoustic decoupling by leveraging the audio encoder of the pre-trained understanding model, along with a self-guidance strategy to improve codebook utilization and reconstruction. Compared with the Mimi codec, experiments show that OmniCodec achieves outstanding performance at the same bitrate, delivering superior reconstruction quality while also providing more semantically informative representations that benefit downstream generation tasks. Our model and code will be open-sourced. Our demo page is available.
[1105] SqueezeComposer: Temporal Speed-up is A Simple Trick for Long-form Music Composing
Jianyi Chen, Rongxiu Zhong, Shilei Zhang, Kun Qian, Jinglei Liu, Yike Guo, Wei Xue
Main category: eess.AS
TL;DR: SqueezeComposer enables long-form music generation by first generating time-accelerated audio (2x-8x speed) to reduce computational demands, then restoring to original speed.
Details
Motivation: Long-form music generation faces challenges with modeling long-range dependencies and high computational/memory requirements for lengthy audio representations.
Method: Proposes temporal speed-up trick: generate music at accelerated rates (2x-8x) to reduce temporal length, then restore to original speed. Implements SqueezeComposer framework using diffusion models for generation in accelerated domain and refinement in restored domain.
Result: Enables efficient, scalable, high-quality long-form music generation and whole-song singing accompaniment generation with temporal-wise and track-wise control.
Conclusion: Simple temporal speed-up strategy effectively addresses computational limitations for long-form music generation while maintaining quality, following hierarchical generation principles.
Abstract: Composing coherent long-form music remains a significant challenge due to the complexity of modeling long-range dependencies and the prohibitive memory and computational requirements associated with lengthy audio representations. In this work, we propose a simple yet powerful trick: we assume that AI models can understand and generate time-accelerated (speeded-up) audio at rates such as 2x, 4x, or even 8x. By first generating a high-speed version of the music, we greatly reduce the temporal length and resource requirements, making it feasible to handle long-form music that would otherwise exceed memory or computational limits. The generated audio is then restored to its original speed, recovering the full temporal structure. This temporal speed-up and slow-down strategy naturally follows the principle of hierarchical generation from abstract to detailed content, and can be conveniently applied to existing music generation models to enable long-form music generation. We instantiate this idea in SqueezeComposer, a framework that employs diffusion models for generation in the accelerated domain and refinement in the restored domain. We validate the effectiveness of this approach on two tasks: long-form music generation, which evaluates temporal-wise control (including continuation, completion, and generation from scratch), and whole-song singing accompaniment generation, which evaluates track-wise control. Experimental results demonstrate that our simple temporal speed-up trick enables efficient, scalable, and high-quality long-form music generation. Audio samples are available at https://SqueezeComposer.github.io/.
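The speed-up/restore cycle can be illustrated on a raw sample sequence. The paper operates on audio representations with a learned refinement stage, so the decimation and linear interpolation below are deliberately crude stand-ins of our own:

```python
def speed_up(x, factor):
    """Naive time compression: keep every `factor`-th sample, shrinking the
    sequence the generator has to handle by `factor`x."""
    return x[::factor]

def restore(x_fast, factor):
    """Stretch back to the original length by linear interpolation between
    kept samples; a crude stand-in for learned refinement in the restored
    domain. The tail past the last kept sample is held constant."""
    out = []
    for t in range(len(x_fast) * factor):
        i, frac = divmod(t, factor)
        nxt = x_fast[min(i + 1, len(x_fast) - 1)]
        out.append(x_fast[i] + (nxt - x_fast[i]) * frac / factor)
    return out

x = [i / 16 for i in range(16)]        # a 16-sample ramp
x_rec = restore(speed_up(x, 4), 4)     # generate short, then stretch back
```

The round trip recovers the coarse shape at full length while the generator only ever sees a quarter of the samples, which is the memory win the trick relies on.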
[1106] DiT-Flow: Speech Enhancement Robust to Multiple Distortions based on Flow Matching in Latent Space and Diffusion Transformers
Tianyu Cao, Helin Wang, Ari Frummer, Yuval Sieradzki, Adi Arbel, Laureano Moro Velazquez, Jesus Villalba, Oren Gal, Thomas Thebaud, Najim Dehak
Main category: eess.AS
TL;DR: DiT-Flow: A flow matching-based speech enhancement framework using latent Diffusion Transformer backbone trained for robustness across diverse distortions, achieving state-of-the-art performance with parameter-efficient LoRA-MoE integration.
Details
Motivation: Speech enhancement models are typically trained on limited datasets and evaluated under narrow conditions, limiting real-world applicability. There's a persistent bottleneck due to mismatch between training and deployment conditions.
Method: Proposes DiT-Flow, a flow matching-based SE framework built on latent Diffusion Transformer (DiT) backbone operating on VAE-derived latent features. Uses LoRA with MoE framework for parameter-efficient training across diverse distortions including noise, reverberation, and compression.
Result: DiT-Flow consistently outperforms state-of-the-art generative SE models. With LoRA-MoE integration, achieves better performance on five unseen distortions using only 4.9% of total parameters.
Conclusion: Flow matching is effective for multi-condition speech enhancement. The parameter-efficient LoRA-MoE approach enables robust training across diverse distortions while maintaining high performance with minimal parameter usage.
Abstract: Recent advances in generative models, such as diffusion and flow matching, have shown strong performance in audio tasks. However, speech enhancement (SE) models are typically trained on limited datasets and evaluated under narrow conditions, limiting real-world applicability. To address this, we propose DiT-Flow, a flow matching-based SE framework built on the latent Diffusion Transformer (DiT) backbone and trained for robustness across diverse distortions, including noise, reverberation, and compression. DiT-Flow operates on compact variational autoencoder (VAE)-derived latent features. We validated our approach on StillSonicSet, a synthetic yet acoustically realistic dataset composed of LibriSpeech, FSD50K, FMA, and 90 Matterport3D scenes. Experiments show that DiT-Flow consistently outperforms state-of-the-art generative SE models, demonstrating the effectiveness of flow matching in multi-condition speech enhancement. Despite ongoing efforts to expand synthetic data realism, a persistent bottleneck in SE is the inevitable mismatch between training and deployment conditions. By integrating LoRA with the MoE framework, we achieve parameter-efficient, high-performance training of a DiT-Flow model that is robust to multiple distortions, using only 4.9% of the total parameters while obtaining better performance on five unseen distortions.
[1107] Disentangling Speaker Traits for Deepfake Source Verification via Chebyshev Polynomial and Riemannian Metric Learning
Xi Xuan, Wenxin Zhang, Zhiyu Li, Jennifer Williams, Ville Hautamäki, Tomi H. Kinnunen
Main category: eess.AS
TL;DR: A speaker-disentangled metric learning framework for speech deepfake source verification that separates speaker characteristics from source generator features using novel loss functions in hyperbolic space.
Details
Motivation: Current speech deepfake source verification systems assume source embeddings are independent of speaker traits, but this assumption remains unverified. The paper aims to investigate the impact of speaker factors on source verification and develop methods to disentangle speaker information from source features.
Method: Proposes Speaker-Disentangled Metric Learning (SDML) framework with two novel loss functions: 1) Uses Chebyshev polynomial to mitigate gradient instability during disentanglement optimization, and 2) Projects source and speaker embeddings into hyperbolic space using Riemannian metric distances to reduce speaker information and learn more discriminative source features.
Result: Experimental results on MLAAD benchmark evaluated under four newly proposed protocols for source-speaker disentanglement scenarios demonstrate the effectiveness of the SDML framework.
Conclusion: The proposed SDML framework successfully addresses speaker factor interference in speech deepfake source verification by disentangling speaker characteristics from source generator features, improving verification accuracy through hyperbolic space embeddings and novel optimization techniques.
Abstract: Speech deepfake source verification systems aim to determine whether two synthetic speech utterances originate from the same source generator, often assuming that the resulting source embeddings are independent of speaker traits. However, this assumption remains unverified. In this paper, we first investigate the impact of speaker factors on source verification. We propose a speaker-disentangled metric learning (SDML) framework incorporating two novel loss functions. The first leverages a Chebyshev polynomial to mitigate gradient instability during disentanglement optimization. The second projects source and speaker embeddings into hyperbolic space, leveraging Riemannian metric distances to reduce speaker information and learn more discriminative source features. Experimental results on the MLAAD benchmark, evaluated under four newly proposed protocols designed for source-speaker disentanglement scenarios, demonstrate the effectiveness of the SDML framework. The code, evaluation protocols and demo website are available at https://github.com/xxuan-acoustics/RiemannSD-Net.
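For readers unfamiliar with hyperbolic embeddings, the standard Riemannian distance in the Poincaré ball model is shown below. This is the textbook metric such methods build on, not the paper's loss itself, which is more involved:

```python
import math

def poincare_distance(u, v):
    """Riemannian distance between two points inside the unit Poincare ball:
    d(u, v) = arcosh(1 + 2*|u-v|^2 / ((1 - |u|^2) * (1 - |v|^2))).
    Distances blow up near the boundary, which is what gives hyperbolic
    space its extra room for separating embeddings."""
    diff2 = sum((a - b) ** 2 for a, b in zip(u, v))
    nu2 = sum(a * a for a in u)
    nv2 = sum(b * b for b in v)
    return math.acosh(1.0 + 2.0 * diff2 / ((1.0 - nu2) * (1.0 - nv2)))

d_same = poincare_distance((0.0, 0.0), (0.0, 0.0))
d_half = poincare_distance((0.0, 0.0), (0.5, 0.0))
```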
[1108] Adaptive Federated Fine-Tuning of Self-Supervised Speech Representations
Xin Guo, Chunrui Zhao, Hong Jia, Ting Dang, Gongping Huang, Xianrui Zheng, Yan Gao
Main category: eess.AS
TL;DR: An adaptive federated fine-tuning framework with early exits for SSL speech models in heterogeneous FL environments.
Details
Motivation: FL environments have significant heterogeneity in client computational capacity and diverse downstream task requirements, making unified fine-tuning inefficient and causing straggler effects.
Method: Insert lightweight prediction heads at intermediate layers of SSL backbone for early exits; layer-wise depth-aware partial aggregation strategy; clients terminate computation based on local constraints.
Result: Reduces edge overhead, supports heterogeneous hardware, maintains competitive performance in resource-constrained federated environments
Conclusion: Proposed framework effectively addresses FL heterogeneity challenges for SSL speech models through adaptive fine-tuning with early exits
Abstract: Integrating Federated Learning (FL) with self-supervised learning (SSL) enables privacy-preserving fine-tuning for speech tasks. However, federated environments exhibit significant heterogeneity: clients differ in computational capacity, causing straggler effects under unified fine-tuning, while diverse downstream tasks require different representation depths, making full-model updates inefficient. To address these challenges, we propose an adaptive federated fine-tuning framework with early exits. Lightweight prediction heads are inserted at intermediate layers of the SSL backbone, allowing clients to terminate computation based on local constraints and task requirements. We further introduce a layer-wise, depth-aware partial aggregation strategy to better utilize representations from different network depths. Experiments show that the framework reduces edge overhead, supports heterogeneous hardware, and maintains competitive performance in resource-constrained federated environments.
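The early-exit mechanism itself is simple to sketch. The doubling "layers" and negating "heads" below are toy stand-ins of our own for transformer blocks and the lightweight prediction heads the paper inserts:

```python
def early_exit_forward(x, layers, heads, exit_layer):
    """Run a stack of layers, but stop at a client-chosen exit depth and
    predict from the head attached there. A resource-constrained client
    picks a small `exit_layer`; a capable one runs the full backbone."""
    for depth, layer in enumerate(layers, start=1):
        x = layer(x)
        if depth == exit_layer:
            return heads[depth - 1](x), depth
    return heads[-1](x), len(layers)

# Toy backbone: each "layer" doubles the value, each "head" negates it.
layers = [lambda v: 2 * v] * 4
heads = [lambda v: -v] * 4
out, depth = early_exit_forward(1, layers, heads, exit_layer=2)
```

Under this scheme a straggler only ever computes (and uploads updates for) the layers up to its exit, which is what enables the depth-aware partial aggregation on the server side.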
[1109] WiRD-Gest: Gesture Recognition In The Real World Using Range-Doppler Wi-Fi Sensing on COTS Hardware
Jessica Sanson, Rahul C. Shah, Yazhou Zhu, Rafael Rosales, Valerio Frascolla
Main category: eess.AS
TL;DR: WiRD-Gest is a Wi-Fi sensing system for gesture recognition using a single commercial laptop transceiver, leveraging monostatic full duplex sensing to extract Range-Doppler information for improved accuracy and robustness in real-world environments.
Details
Motivation: Traditional Wi-Fi sensing for gesture recognition faces practical deployment challenges due to environmental sensitivity and device placement issues. Current approaches often fail in crowded, dynamic real-world environments with interference and multiple moving targets.
Method: Proposes WiRD-Gest system using a single unmodified Wi-Fi transceiver on a COTS laptop with monostatic full duplex sensing pipeline to extract Range-Doppler information. Creates the first benchmark of deep learning models for gesture recognition based on monostatic sensing, leveraging spatial (range) information to transform accuracy and robustness.
Result: Demonstrates excellent performance in crowded, unseen public spaces with dynamic interference and additional moving targets, even when trained only on controlled environment data. Shows minor degradation in scenarios where prior Wi-Fi sensing approaches often fail.
Conclusion: WiRD-Gest overcomes key limitations of Wi-Fi sensing for gesture recognition through monostatic sensing and spatial information, enabling robust performance in challenging real-world environments. The system and dataset will be released as open source.
Abstract: Wi-Fi sensing has emerged as a promising technique for gesture recognition, yet its practical deployment is hindered by environmental sensitivity and device placement challenges. To overcome these limitations, we propose Wi-Fi Range and Doppler (WiRD)-Gest, a novel system that performs gesture recognition using a single, unmodified Wi-Fi transceiver on a commercial off-the-shelf (COTS) laptop. The system leverages a monostatic full duplex sensing pipeline capable of extracting Range-Doppler (RD) information. Utilizing this, we present the first benchmark of deep learning models for gesture recognition based on monostatic sensing. The key innovation lies in how monostatic sensing and spatial (range) information fundamentally transform accuracy, robustness and generalization compared to prior approaches. We demonstrate excellent performance in crowded, unseen public spaces with dynamic interference and additional moving targets even when trained on data from controlled environments only. These are scenarios where prior Wi-Fi sensing approaches often fail; our system, however, suffers only minor degradation. The WiRD-Gest benchmark and dataset will also be released as open source.
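For readers unfamiliar with Range-Doppler processing: a standard RD map (not the WiRD-Gest pipeline itself, whose details are not given here) is obtained by taking an FFT over fast time, which resolves range, and a second FFT over slow time, which resolves Doppler. A toy single-target illustration:

```python
import numpy as np

# Illustrative Range-Doppler map from a synthetic radar frame: one target at a
# hypothetical fixed range bin moving at constant velocity (fixed Doppler bin).
n_fast, n_slow = 64, 32                        # samples per sweep, sweeps per frame
target_range_bin, target_doppler_bin = 10, 5   # assumed toy target parameters

fast = np.arange(n_fast)
slow = np.arange(n_slow)
# Complex beat signal whose frequency along each axis encodes range and velocity.
frame = np.exp(2j * np.pi * (target_range_bin * fast[:, None] / n_fast
                             + target_doppler_bin * slow[None, :] / n_slow))

# FFT over fast time (axis 0 -> range bins), then slow time (axis 1 -> Doppler bins).
rd_map = np.abs(np.fft.fft(np.fft.fft(frame, axis=0), axis=1))
peak = np.unravel_index(np.argmax(rd_map), rd_map.shape)  # location of the target
```

The peak of `rd_map` lands exactly at the (range, Doppler) bins of the simulated target, which is the spatial information the paper credits for its robustness.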
[1110] SelfTTS: cross-speaker style transfer through explicit embedding disentanglement and self-refinement using self-augmentation
Lucas H. Ueda, João G. T. Lima, Pedro R. Corrêa, Flávio O. Simões, Mário U. Neto, Paula D. P. Costa
Main category: eess.AS
TL;DR: SelfTTS: A TTS model for cross-speaker style transfer without external encoders, using disentanglement strategies and self-refinement for emotional expressivity.
Details
Motivation: To create a text-to-speech model capable of cross-speaker style transfer without relying on external pre-trained speaker or emotion encoders, enabling emotional expressivity in neutral speakers through better disentanglement of speaker and emotion information.
Method: Uses Gradient Reversal Layers (GRL) with cosine similarity loss for explicit disentanglement of speaker and emotion information, introduces Multi Positive Contrastive Learning (MPCL) for clustered embeddings, and employs self-refinement via Self-Augmentation using voice conversion capabilities.
Result: SelfTTS achieves superior emotional naturalness (eMOS) and robust stability in target timbre and emotion compared to state-of-the-art baselines.
Conclusion: SelfTTS successfully enables cross-speaker style transfer without external encoders through effective disentanglement strategies and self-refinement, demonstrating improved emotional expressivity and naturalness in synthesized speech.
Abstract: This paper presents SelfTTS, a text-to-speech (TTS) model designed for cross-speaker style transfer that eliminates the need for external pre-trained speaker or emotion encoders. The architecture achieves emotional expressivity in neutral speakers through an explicit disentanglement strategy utilizing Gradient Reversal Layers (GRL) combined with cosine similarity loss to decouple speaker and emotion information. We introduce Multi Positive Contrastive Learning (MPCL) to induce clustered representations of speaker and emotion embeddings based on their respective labels. Furthermore, SelfTTS employs a self-refinement strategy via Self-Augmentation, exploiting the model’s voice conversion capabilities to enhance the naturalness of synthesized speech. Experimental results demonstrate that SelfTTS achieves superior emotional naturalness (eMOS) and robust stability in target timbre and emotion compared to state-of-the-art baselines.
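The Gradient Reversal Layer at the heart of the disentanglement strategy is a standard trick: identity in the forward pass, negated (and scaled) gradient in the backward pass, so an adversarial classifier's gradient pushes the encoder *away* from encoding the unwanted attribute. A minimal sketch (how SelfTTS wires it into its losses is not reproduced here):

```python
import numpy as np

# Minimal Gradient Reversal Layer: forward is the identity, backward multiplies
# the incoming gradient by -lambda. This is the generic GRL mechanism, not the
# SelfTTS implementation.
class GradientReversal:
    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        return x                           # pass features through unchanged

    def backward(self, grad_output):
        return -self.lam * grad_output     # reversed, scaled gradient

grl = GradientReversal(lam=0.5)
x = np.array([1.0, -2.0, 3.0])
y = grl.forward(x)                 # identical to x
g = grl.backward(np.ones_like(x))  # every gradient entry becomes -0.5
```

In an autograd framework this is implemented as a custom backward function; the effect is that minimizing the speaker classifier's loss simultaneously maximizes it with respect to the shared encoder, decoupling speaker from emotion.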
[1111] Bloodroot: When Watermarking Turns Poisonous For Stealthy Backdoor
Kuan-Yu Chen, Yi-Cheng Lin, Jeng-Lin Li, Jian-Jiun Ding
Main category: eess.AS
TL;DR: Audio backdoor poisoning using watermark-as-trigger concept achieves high stealth and effectiveness for ownership protection and security risks
Details
Motivation: Current audio backdoor methods have poor perceptual quality that's noticeable to humans, limiting their practical use for ownership protection and security applications.
Method: Proposes Watermark-as-Trigger concept integrated into Bloodroot backdoor framework using adversarial LoRA fine-tuning to embed hidden triggers in audio while maintaining quality
Result: Achieves much higher trigger success rate and clean-sample accuracy while maintaining perceptual quality; remains effective under acoustic filtering and model pruning on speech recognition and speaker identification tasks
Conclusion: Watermark-based poisoning is effective for data-to-model ownership protection while revealing risks of adversarial misuse in audio systems
Abstract: Backdoor data poisoning is a crucial technique for ownership protection and defending against malicious attacks. Embedding hidden triggers in training data can manipulate model outputs, enabling provenance verification, and deterring unauthorized use. However, current audio backdoor methods are suboptimal, as poisoned audio often exhibits degraded perceptual quality, which is noticeable to human listeners. This work explores the intrinsic stealthiness and effectiveness of audio watermarking in achieving successful poisoning. We propose a novel Watermark-as-Trigger concept, integrated into the Bloodroot backdoor framework via adversarial LoRA fine-tuning, which enhances perceptual quality while achieving a much higher trigger success rate and clean-sample accuracy. Experiments on speech recognition (SR) and speaker identification (SID) datasets show that watermark-based poisoning remains effective under acoustic filtering and model pruning. The proposed Bloodroot backdoor framework not only secures data-to-model ownership, but also well reveals the risk of adversarial misuse.
[1112] Neural Directional Filtering Using a Compact Microphone Array
Weilong Huang, Srikanth Raj Chetupalli, Mhd Modar Halimeh, Oliver Thiergart, Emanuël A. P. Habets
Main category: eess.AS
TL;DR: Neural directional filtering (NDF) uses deep neural networks to achieve desired directivity patterns with compact microphone arrays, overcoming limitations of traditional beamformers.
Details
Motivation: Traditional beamformers for compact microphone arrays have limitations in achieving desired directivity patterns due to array size constraints. The effectiveness degrades for compact arrays, and there's a need for methods that can achieve frequency-invariant directivity patterns even above spatial aliasing frequencies.
Method: Proposes neural directional filtering (NDF) approach using deep neural networks. The method computes a single-channel complex mask from microphone array signals, which is applied to a reference microphone to produce output approximating a virtual directional microphone with desired directivity pattern. Includes training strategies and data-dependent metrics for evaluation.
Result: The method achieves: 1) frequency-invariant directivity pattern even above spatial aliasing frequency, 2) approximation of diverse and higher-order patterns, 3) steering of patterns in different directions, and 4) generalization to unseen conditions. Experimental comparisons show superior performance over conventional beamforming and parametric approaches.
Conclusion: Neural directional filtering enables advanced sound capture with predefined directivity patterns using compact microphone arrays, overcoming traditional beamforming limitations and offering flexible pattern control and frequency invariance.
Abstract: Beamforming with desired directivity patterns using compact microphone arrays is essential in many audio applications. Directivity patterns achievable using traditional beamformers depend on the number of microphones and the array aperture. Generally, their effectiveness degrades for compact arrays. To overcome these limitations, we propose a neural directional filtering (NDF) approach that leverages deep neural networks to enable sound capture with a predefined directivity pattern. The NDF computes a single-channel complex mask from the microphone array signals, which is then applied to a reference microphone to produce an output that approximates a virtual directional microphone with the desired directivity pattern. We introduce training strategies and propose data-dependent metrics to evaluate the directivity pattern and directivity factor. We show that the proposed method: i) achieves a frequency-invariant directivity pattern even above the spatial aliasing frequency, ii) can approximate diverse and higher-order patterns, iii) can steer the pattern in different directions, and iv) generalizes to unseen conditions. Lastly, experimental comparisons demonstrate superior performance over conventional beamforming and parametric approaches.
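The output stage described in the abstract is simple to picture: a single time-frequency complex mask (in the paper predicted by a DNN from all microphone channels; here a random stand-in) multiplied element-wise onto the reference microphone's STFT. A hedged sketch with assumed dimensions:

```python
import numpy as np

# Sketch of the NDF output stage. The mask below is random; in the actual
# system it is the DNN's prediction from the full array's STFTs.
rng = np.random.default_rng(0)
n_mics, n_freq, n_frames = 4, 129, 50   # assumed array and STFT sizes

array_stft = (rng.standard_normal((n_mics, n_freq, n_frames))
              + 1j * rng.standard_normal((n_mics, n_freq, n_frames)))

ref = array_stft[0]                     # reference microphone channel
mask = (rng.standard_normal((n_freq, n_frames))
        + 1j * rng.standard_normal((n_freq, n_frames)))  # DNN output stand-in

# Element-wise complex masking: amplitude and phase of each T-F bin are
# adjusted to approximate the virtual directional microphone.
output = mask * ref
```

Because the mask is complex (not a real-valued gain), it can correct phase as well as magnitude per time-frequency bin, which is what allows directivity control above the spatial aliasing frequency.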
eess.IV
[1113] MiSiSUn: Minimum Simplex Semisupervised Unmixing
Behnood Rasti, Bikram Koirala, Paul Scheunders
Main category: eess.IV
TL;DR: MiSiSUn is a semisupervised geometric unmixing method that incorporates data geometry into library-based unmixing using simplex-volume penalties and archetypal analysis, outperforming state-of-the-art methods by 1-3 dB.
Details
Motivation: To improve semisupervised unmixing by incorporating the geometric structure of data into library-based approaches, addressing limitations of existing methods that don't consider data geometry.
Method: Proposes minimum simplex semisupervised unmixing (MiSiSUn) using a simplex-volume-flavored penalty based on archetypal analysis-type linear model, incorporating data geometry for the first time in library-based unmixing.
Result: MiSiSUn considerably outperforms state-of-the-art semisupervised unmixing methods by 1-3 dB across different scenarios, and shows good performance on real datasets with visual interpretation matching geological maps.
Conclusion: The proposed geometric approach significantly improves semisupervised unmixing performance, with open-source PyTorch implementation available for reproducibility.
Abstract: This paper proposes a semisupervised geometric unmixing approach called minimum simplex semisupervised unmixing (MiSiSUn). The geometry of the data is incorporated for the first time into library-based unmixing using a simplex-volume-flavored penalty based on an archetypal analysis-type linear model. Experiments were conducted on two simulated datasets considering different levels of mixing ratios and spatial instruction at varying input noise. MiSiSUn considerably outperforms state-of-the-art semisupervised unmixing methods. The improvements vary from 1 dB to over 3 dB in different scenarios. The proposed method was also applied to a real dataset, where the visual interpretation is close to the geological map. MiSiSUn was implemented using PyTorch, which is open-source and available at https://github.com/BehnoodRasti/MiSiSUn. Moreover, we provide a dedicated Python package for Semisupervised Unmixing, which is open-source and includes all the methods used in the experiments for the sake of reproducibility.
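As background on the "simplex-volume-flavored" idea (the exact MiSiSUn penalty follows an archetypal-analysis model and is not given here), minimum-volume unmixing penalizes the volume of the simplex whose vertices are the endmembers; that volume is a determinant of edge vectors. A toy illustration of the quantity being penalized:

```python
import math
import numpy as np

# Volume of the simplex spanned by the columns of E (r = d + 1 vertices in
# d dimensions): |det of edge vectors from one vertex| / d!. This is the
# generic minimum-volume quantity, not the specific MiSiSUn penalty.
def simplex_volume(E):
    """E: (d, r) matrix whose r columns are simplex vertices, r == d + 1."""
    edges = E[:, 1:] - E[:, :1]     # edge vectors relative to the first vertex
    d = E.shape[0]
    return abs(np.linalg.det(edges)) / math.factorial(d)

# Unit right triangle in 2D: vertices (0,0), (1,0), (0,1) -> area 1/2.
E = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
vol = simplex_volume(E)
```

Shrinking this volume while still enclosing the data pulls the estimated endmembers tight around the observed pixels, which is the geometric prior the paper brings into library-based unmixing.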
[1114] CaroTo: A Tool for Fast Comprehensive Analysis of Carotid Artery Stenosis in 4D PC- and 3D BB-MRI Data
Hinrich Rahlfs, Markus Hüllebrand, Sebastian Schmitter, Jonathan Andrae, Christoph Strecker, Andreas Harloff, Anja Hennemuth
Main category: eess.IV
TL;DR: CaroTo is a specialized medical imaging tool for standardized carotid atherosclerosis assessment using MRI, combining MEVISFlow capabilities with carotid-specific tools for segmentation, biomarker extraction, and visualization.
Details
Motivation: Carotid artery atherosclerosis increases stroke risk, and current MRI assessment requires multimodal/multidimensional segmentation, reproducible biomarker extraction, and visualization tools. There's a need for standardized tools to facilitate precise and consistent evaluations of carotid artery stenosis.
Method: Developed CaroTo tool that combines MEVISFlow capabilities with specialized tools for carotid geometry and vessel wall assessment. Supports both manual and automatic segmentation for 2D, 2D+time, and 3D medical images.
Result: Created a comprehensive tool that enables standardized carotid atherosclerosis assessment with multimodal segmentation, biomarker extraction, and visualization capabilities for carotid artery stenosis evaluation.
Conclusion: CaroTo provides a standardized solution for carotid atherosclerosis assessment using MRI, facilitating precise and consistent evaluations through integrated segmentation, biomarker analysis, and visualization tools.
Abstract: Atherosclerosis of the carotid artery increases stroke risk. Atherosclerosis assessment with MRI requires multimodal and multidimensional segmentation of the carotid artery, reproducible extraction of biomarkers, and the visualization of segmentations and biomarkers. We developed CaroTo, a tool that allows for standardized carotid atherosclerosis assessment. It combines the capabilities of MEVISFlow with specialized tools for carotid geometry and vessel wall assessment. It supports manual and automatic segmentation for 2D, 2D+time, and 3D images, facilitating precise and consistent evaluations of carotid artery stenosis.
[1115] mmWave-Diffusion: A Novel Framework for Respiration Sensing Using Observation-Anchored Conditional Diffusion Model
Yong Wang, Qifan Shen, Bao Zhang, Zijun Huang, Chengbo Zhu, Shuai Yao, Qisong Wu
Main category: eess.IV
TL;DR: mmWave-Diffusion: A conditional diffusion framework using radar diffusion transformer for removing micromotion interference in contactless respiratory sensing, achieving state-of-the-art waveform reconstruction and respiratory-rate estimation.
Details
Motivation: Millimeter-wave radar enables contactless respiratory sensing but suffers from degradation due to nonstationary interference from body micromotions. Current methods struggle to effectively remove this interference while maintaining accurate respiratory monitoring.
Method: Proposes mmWave-Diffusion, an observation-anchored conditional diffusion framework that models the residual between radar phase observations and respiratory ground truth. Uses a Radar Diffusion Transformer (RDT) conditioned on phase observations with patch-level dual positional encodings for temporal alignment and banded-mask multi-head cross-attention for local physical priors.
Result: Achieves state-of-the-art waveform reconstruction and respiratory-rate estimation on 13.25 hours of synchronized radar-respiration data with strong generalization. Enables robust denoising and interference removal in just 20 reverse steps.
Conclusion: mmWave-Diffusion effectively removes micromotion interference in radar-based respiratory sensing through a physics-aligned diffusion framework, demonstrating superior performance over existing methods with efficient inference.
Abstract: Millimeter-wave (mmWave) radar enables contactless respiratory sensing, yet fine-grained monitoring is often degraded by nonstationary interference from body micromotions. To achieve micromotion interference removal, we propose mmWave-Diffusion, an observation-anchored conditional diffusion framework that directly models the residual between radar phase observations and the respiratory ground truth, and initializes sampling within an observation-consistent neighborhood rather than from Gaussian noise, thereby aligning the generative process with the measurement physics and reducing inference overhead. The accompanying Radar Diffusion Transformer (RDT) is explicitly conditioned on phase observations, enforces strict one-to-one temporal alignment via patch-level dual positional encodings, and injects local physical priors through banded-mask multi-head cross-attention, enabling robust denoising and interference removal in just 20 reverse steps. Evaluated on 13.25 hours of synchronized radar-respiration data, mmWave-Diffusion achieves state-of-the-art waveform reconstruction and respiratory-rate estimation with strong generalization. Code repository: https://github.com/goodluckyongw/mmWave-Diffusion.
[1116] Underwater imaging without color distortions requires RAW capture
Derya Akkaynak, Michael S. Brown
Main category: eess.IV
TL;DR: Underwater JPEG images lose color accuracy due to in-camera processing, making them unsuitable for quantitative color analysis in aquatic sciences like coral bleaching research; RAW format preservation is recommended.
Details
Motivation: Consumer cameras are widely used in aquatic sciences but their JPEG output undergoes irreversible in-camera processing that destroys color accuracy, making images useless for quantitative color analysis despite being visually pleasing.
Method: The paper provides an explanatory essay format discussing the technical limitations of JPEG processing in consumer cameras and offers practical guidance for data preservation.
Result: The analysis reveals that JPEG’s in-camera processing breaks the linear relationship between pixel values and scene radiance, making color standardization, reproduction, and comparison impossible across different cameras, locations, or time periods.
Conclusion: Researchers should capture and archive minimally processed RAW images to prevent irreversible data loss and enable quantitative color analysis in aquatic science applications.
Abstract: Consumer cameras are ubiquitous in aquatic sciences because they are affordable and easy to use, generating vast collections of underwater imagery for ecosystem surveys, monitoring, mapping, and animal behavior studies. Yet when color is the variable of interest, such as in coral-bleaching research, most of these images cannot be used quantitatively if captured in JPEG format. The limitation is not due to JPEG compression itself, but to the in-camera processing that precedes it: as cameras produce these images, built-in algorithms modify colors and contrast not to ensure color accuracy but to produce visually pleasing pictures. These irreversible in-camera operations break the linear relationship between pixel values and scene radiance, making colors impossible to standardize, reproduce, or compare across cameras, locations, or time. This essay explains the scientific costs of this practice and offers pragmatic guidance to prevent irreversible data loss, beginning with the capture and archiving of minimally processed RAW images.
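The core argument fits in a few lines: once a nonlinear tone curve is applied (here a bare gamma curve, a deliberate simplification of real in-camera pipelines), ratios of pixel values no longer equal ratios of scene radiance, so quantitative comparison is lost:

```python
# Why RAW matters, in miniature: gamma encoding (a stand-in for the camera's
# full tone-mapping pipeline) breaks the linear pixel-radiance relation.
def gamma_encode(radiance, gamma=2.2):
    return radiance ** (1.0 / gamma)

raw_a, raw_b = 0.2, 0.4          # linear sensor values; true radiance ratio 2:1
jpeg_a, jpeg_b = gamma_encode(raw_a), gamma_encode(raw_b)

linear_ratio = raw_b / raw_a     # 2.0 -- RAW preserves the radiance ratio
encoded_ratio = jpeg_b / jpeg_a  # 2 ** (1/2.2), roughly 1.37 -- ratio distorted
```

With per-camera, scene-adaptive tone curves the distortion is not even a fixed known function, which is why the authors argue it cannot be standardized away after the fact.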
[1117] Unregistered Spectral Image Fusion: Unmixing, Adversarial Learning, and Recoverability
Jiahui Song, Sagar Shrestha, Xiao Fu
Main category: eess.IV
TL;DR: Unsupervised framework for simultaneous super-resolution of unregistered hyperspectral and multispectral images using coupled spectral unmixing and adversarial learning with theoretical guarantees.
Details
Motivation: Hyperspectral-multispectral fusion (HMF) with unregistered images is challenging; existing methods either focus only on MSI super-resolution or require accurate training data, leaving unregistered HMF poorly understood theoretically.
Method: Proposes unsupervised framework integrating coupled spectral unmixing for MSI super-resolution with latent-space adversarial learning for HSI super-resolution, establishing theoretical guarantees under reasonable generative models.
Result: Validated on semi-real and real HSI-MSI pairs across diverse conditions, providing first theoretical insights for unregistered HMF recoverability.
Conclusion: The approach successfully addresses unregistered HMF with theoretical foundations and practical validation, advancing both MSI and HSI super-resolution simultaneously.
Abstract: This paper addresses the fusion of a pair of spatially unregistered hyperspectral image (HSI) and multispectral image (MSI) covering roughly overlapping regions. HSIs offer high spectral but low spatial resolution, while MSIs provide the opposite. The goal is to integrate their complementary information to enhance both HSI spatial resolution and MSI spectral resolution. While hyperspectral-multispectral fusion (HMF) has been widely studied, the unregistered setting remains challenging. Many existing methods focus solely on MSI super-resolution, leaving HSI unchanged. Supervised deep learning approaches were proposed for HSI super-resolution, but rely on accurate training data, which is often unavailable. Moreover, theoretical analyses largely address the co-registered case, leaving unregistered HMF poorly understood. In this work, an unsupervised framework is proposed to simultaneously super-resolve both MSI and HSI. The method integrates coupled spectral unmixing for MSI super-resolution with latent-space adversarial learning for HSI super-resolution. Theoretical guarantees on the recoverability of the super-resolution MSI and HSI are established under reasonable generative models – providing, to our best knowledge, the first such insights for unregistered HMF. The approach is validated on semi-real and real HSI-MSI pairs across diverse conditions.
[1118] Cycle Inverse-Consistent TransMorph: A Balanced Deep Learning Framework for Brain MRI Registration
Jiaqi Shang, Haojin Wu, Yinyi Lai, Zongyu Li, Chenghao Zhang, Jia Guo
Main category: eess.IV
TL;DR: A cycle inverse-consistent transformer-based framework (CICTM) for deformable brain MRI registration using Swin-UNet with bidirectional consistency constraints to jointly estimate forward/backward deformation fields, achieving strong performance on large multi-center datasets.
Details
Motivation: Existing deep learning-based deformable image registration methods for medical imaging often struggle with capturing long-range anatomical correspondence and maintaining deformation consistency, limiting their effectiveness for large-scale neuroimaging applications.
Method: Proposes CICTM - a cycle inverse-consistent transformer-based framework using Swin-UNet architecture with bidirectional consistency constraints to jointly estimate forward and backward deformation fields, enabling capture of both local anatomical details and global spatial relationships.
Result: Achieves strong and balanced performance across multiple quantitative evaluation metrics on a large multi-center dataset of 2851 T1-weighted brain MRI scans from 13 public datasets, outperforming baseline methods like ANTs, ICNet, and VoxelMorph while maintaining stable and physically plausible deformation fields.
Conclusion: The proposed framework is suitable for large-scale neuroimaging datasets where both accuracy and deformation stability are critical, offering improved long-range correspondence capture and deformation consistency compared to existing methods.
Abstract: Deformable image registration plays a fundamental role in medical image analysis by enabling spatial alignment of anatomical structures across subjects. While recent deep learning-based approaches have significantly improved computational efficiency, many existing methods remain limited in capturing long-range anatomical correspondence and maintaining deformation consistency. In this work, we present a cycle inverse-consistent transformer-based framework for deformable brain MRI registration. The model integrates a Swin-UNet architecture with bidirectional consistency constraints, enabling the joint estimation of forward and backward deformation fields. This design allows the framework to capture both local anatomical details and global spatial relationships while improving deformation stability. We conduct a comprehensive evaluation of the proposed framework on a large multi-center dataset consisting of 2851 T1-weighted brain MRI scans aggregated from 13 public datasets. Experimental results demonstrate that CICTM achieves strong and balanced performance across multiple quantitative evaluation metrics while maintaining stable and physically plausible deformation fields; detailed quantitative comparisons with baseline methods, including ANTs, ICNet, and VoxelMorph, are provided in the appendix. These properties make the proposed framework suitable for large-scale neuroimaging datasets where both accuracy and deformation stability are critical.
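The inverse-consistency idea can be illustrated in 1D: warping a point by the forward displacement field and then the backward one should return it to itself, and the mean squared round-trip error is the consistency penalty. A toy sketch (CICTM estimates dense 3D fields with a Swin-UNet; the loss form here is the generic one, not necessarily the paper's exact term):

```python
import numpy as np

# Toy 1D cycle inverse-consistency: compose forward and backward displacement
# fields (linearly interpolated on a grid) and penalize deviation from identity.
def compose(x, disp_fwd, disp_bwd, grid):
    """Map points x through the forward, then the backward, displacement field."""
    y = x + np.interp(x, grid, disp_fwd)          # forward warp
    return y + np.interp(y, grid, disp_bwd)       # backward warp

grid = np.linspace(0.0, 1.0, 101)
disp_fwd = 0.05 * np.ones_like(grid)    # shift every point right by 0.05
disp_bwd = -0.05 * np.ones_like(grid)   # its exact inverse

pts = np.array([0.1, 0.4, 0.7])
round_trip = compose(pts, disp_fwd, disp_bwd, grid)
consistency_loss = np.mean((round_trip - pts) ** 2)  # zero for a perfect inverse
```

During training the two fields are estimated jointly, and this term discourages folds and non-invertible deformations, which is the "physically plausible" property the abstract emphasizes.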
[1119] HMS-VesselNet: Hierarchical Multi-Scale Attention Network with Topology-Preserving Loss for Retinal Vessel Segmentation
Amarnath R
Main category: eess.IV
TL;DR: HMS-VesselNet: A hierarchical multi-scale network for retinal vessel segmentation that improves detection of thin peripheral vessels using multi-resolution processing and specialized loss functions.
Details
Motivation: Standard retinal vessel segmentation methods often miss thin peripheral vessels due to their small pixel count and low contrast, which is critical for early detection of diabetic retinopathy.
Method: Hierarchical multi-scale network with four parallel branches at different resolutions, combining outputs with learned fusion weights. Training uses Dice, binary cross-entropy, and centerline Dice losses, with hard example mining after epoch 20.
Result: Achieves mean Dice of 88.72%, Sensitivity of 90.78%, and AUC of 98.25% on DRIVE, STARE, and CHASE_DB1 datasets. Shows largest improvement in recall of thin peripheral vessels.
Conclusion: The proposed method significantly improves segmentation of thin peripheral retinal vessels, which are critical for early disease detection and often missed by standard methods.
Abstract: Retinal vessel segmentation methods based on standard overlap losses tend to miss thin peripheral vessels because these structures occupy very few pixels and have low contrast against the background. We propose HMS-VesselNet, a hierarchical multi-scale network that processes fundus images across four parallel branches at different resolutions and combines their outputs using learned fusion weights. The training loss combines Dice, binary cross-entropy, and centerline Dice to jointly optimize area overlap and vessel continuity. Hard example mining is applied from epoch 20 onward to concentrate gradient updates on the most difficult training images. Tested on 68 images from DRIVE, STARE, and CHASE_DB1 using 5-fold cross-validation, the model achieves a mean Dice of 88.72 +/- 0.67%, Sensitivity of 90.78 +/- 1.42%, and AUC of 98.25 +/- 0.21%. In leave-one-dataset-out experiments, AUC remains above 95% on each unseen dataset. The largest improvement is in the recall of thin peripheral vessels, which are the structures most frequently missed by standard methods and most critical for early detection of diabetic retinopathy.
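The area-overlap term of the objective is the standard soft Dice loss (the full HMS-VesselNet loss additionally adds binary cross-entropy and centerline Dice, not shown):

```python
import numpy as np

# Soft Dice loss: 1 minus the Dice coefficient, with a small epsilon for
# numerical stability when both masks are empty.
def soft_dice_loss(pred, target, eps=1e-7):
    """pred, target: arrays of probabilities / binary labels in [0, 1]."""
    inter = np.sum(pred * target)
    return 1.0 - (2.0 * inter + eps) / (np.sum(pred) + np.sum(target) + eps)

mask = np.array([[1.0, 0.0],
                 [1.0, 1.0]])
perfect = soft_dice_loss(mask, mask)          # ~0: identical masks
disjoint = soft_dice_loss(mask, 1.0 - mask)   # ~1: no overlap at all
```

Because Dice normalizes by total foreground size, a thin vessel contributes almost nothing to it, which is exactly why the paper adds the centerline Dice term to reward topological continuity of thin structures.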
[1120] Imaging foundation model for universal enhancement of non-ideal measurement CT
Rongjun Ge, Yuxin Liu, Zhan Wu, Shangwen Yang, Yuan Gao, Chenyu You, Ge Wang, Shuo Li, Yuting He, Yang Chen
Main category: eess.IV
TL;DR: TAMP is a multi-scale Transformer-based foundation model for universal enhancement of non-ideal CT images, pre-trained on 10.8M simulated images and adaptable to specific clinical scenarios with few-shot fine-tuning.
Details
Motivation: Non-ideal CT protocols expand applications but degrade image quality, limiting clinical use. Existing deep learning methods require large datasets and lack generalization across diverse settings.
Method: Multi-scale integrated Transformer AMPlifier (TAMP) foundation model pre-trained on 10.8 million physics-driven simulated NICT images, with parameter-efficient fine-tuning for adaptation to specific clinical scenarios using few slices.
Result: TAMP generalizes effectively across various NICT settings, defect degrees, and body regions, consistently improving image quality and clinical acceptability in extensive experiments including radiologist and real-world validations.
Conclusion: TAMP demonstrates significant potential to advance CT imaging and broaden NICT applications in clinical practice as the first imaging foundation model for universal NICT enhancement.
Abstract: Non-ideal measurement computed tomography (NICT) employs suboptimal imaging protocols to expand CT applications. However, the resulting trade-offs degrade image quality, limiting clinical acceptability. Although deep learning methods have been used to enhance NICT images, their reliance on large training datasets and limited generalizability across diverse settings hinder practical use. We propose the multi-scale integrated Transformer AMPlifier (TAMP), the first imaging foundation model for universal NICT enhancement. Pre-trained on 10.8 million physics-driven simulated NICT images, TAMP generalizes effectively across various NICT settings, defect degrees, and body regions. Moreover, a parameter-efficient fine-tuning strategy enables TAMP to adapt to specific clinical scenarios using only few slices. Extensive experiments, including radiologists and real-world validations, demonstrate that TAMP consistently improves image quality and clinical acceptability, underscoring its significant potential to advance CT imaging and broaden NICT applications in clinical practice.
[1121] Interpretable Deep Learning Framework for Improved Disease Classification in Medical Imaging
Jutika Borah, Hidam Kumarjit Singh
Main category: eess.IV
TL;DR: A unified deep learning framework combining cross-guided channel spatial attention (EfficientNetB4 + ResNet34) with Monte Carlo Dropout and conformal prediction for uncertainty-aware medical image classification.
Details
Motivation: Medical image analysis models often produce overconfident predictions, compromising clinical accuracy and reliability. There's a need to bridge the gap between high performance and uncertainty awareness in biomedical imaging.
Method: Proposes a cross-guided channel spatial attention architecture fusing EfficientNetB4 and ResNet34 features using bidirectional attention. Integrates Monte Carlo Dropout with conformal prediction for uncertainty quantification and statistically valid prediction sets.
Result: Achieved strong classification performance: AUC of 99.75% for COVID-19, 100% for Tuberculosis, 99.3% for Pneumonia chest X-rays, and 98.69% for retinal OCT images. Uncertainty-aware inference yields calibrated prediction sets with interpretable uncertainty visualization.
Conclusion: Bidirectional cross-attention with uncertainty quantification improves both performance and transparency in medical image classification, addressing the critical need for reliable clinical decision support.
Abstract: Deep learning models have gained increasing adoption in medical image analysis. However, these models often produce overconfident predictions, which can compromise clinical accuracy and reliability. Bridging the gap between high-performance and awareness of uncertainty remains a crucial challenge in biomedical imaging applications. This study focuses on developing a unified deep learning framework for enhancing feature integration, interpretability, and reliability in prediction. We introduced a cross-guided channel spatial attention architecture that fuses feature representations extracted from EfficientNetB4 and ResNet34. Bidirectional attention approach enables the exchange of information across networks with differing receptive fields, enhancing discriminative and contextual feature learning. For quantitative predictive uncertainty assessment, Monte Carlo (MC)-Dropout is integrated with conformal prediction. This provides statistically valid prediction sets with entropy-based uncertainty visualization. The framework is evaluated on four medical imaging benchmark datasets: chest X-rays of COVID-19, Tuberculosis, Pneumonia, and retinal Optical Coherence Tomography (OCT) images. The proposed framework achieved strong classification performance with an AUC of 99.75% for COVID-19, 100% for Tuberculosis, 99.3% for Pneumonia chest X-rays, and 98.69% for retinal OCT images. Uncertainty-aware inference yields calibrated prediction sets with interpretable examples of uncertainty, showing transparency. The results demonstrate that bidirectional cross-attention with uncertainty quantification can improve performance and transparency in medical image classification.
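For readers new to the technique: split conformal prediction calibrates a threshold on held-out nonconformity scores (here 1 minus the probability assigned to the true class, computed from e.g. MC-Dropout-averaged probabilities) and then returns every class whose score clears it. This is the generic recipe; the paper's exact score and calibration details are not reproduced:

```python
import numpy as np

# Split conformal prediction sketch on toy class probabilities.
def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Calibrate the score threshold for ~(1 - alpha) marginal coverage."""
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]   # nonconformity scores
    q = np.ceil((n + 1) * (1 - alpha)) / n               # finite-sample correction
    return np.quantile(scores, min(q, 1.0))

def prediction_set(probs, threshold):
    """All classes whose nonconformity score is within the threshold."""
    return np.where(1.0 - probs <= threshold)[0]

rng = np.random.default_rng(0)
cal_probs = rng.dirichlet(np.ones(3), size=200)   # toy calibration probabilities
cal_labels = cal_probs.argmax(axis=1)             # toy: model "correct" on calibration
tau = conformal_threshold(cal_probs, cal_labels, alpha=0.1)
pred_set = prediction_set(np.array([0.7, 0.2, 0.1]), tau)  # confident -> small set
```

The appeal in a clinical setting is the guarantee: under exchangeability the set contains the true class with probability at least 1 - alpha, and larger sets flag exactly the cases a radiologist should re-examine.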
[1122] From Explanations to Architecture: Explainability-Driven CNN Refinement for Brain Tumor Classification in MRI
Rajan Das Gupta, Md Imrul Hasan Showmick, Lei Wei, Mushfiqur Rahman Abir, Shanjida Akter, Md. Yeasin Rahat, Md. Jakir Hossen
Main category: eess.IV
TL;DR: An explainable CNN framework for brain tumor classification that uses Grad-CAM to quantify layer relevance, prune unnecessary layers, and validate decisions with SHAP/LIME, achieving high accuracy with improved transparency.
Details
Motivation: Current brain tumor classification methods achieve high accuracy but lack interpretability, making it difficult to determine if predictions are based on tumor-relevant evidence or spurious cues like background artifacts. This limits clinical trust and application.
Method: Proposes an explainable CNN framework that uses Grad-CAM to quantify layer-wise relevance, guiding removal of low-contribution layers to reduce unnecessary depth/parameters. Combines Grad-CAM for spatial localization with SHAP and LIME for attribution-based verification to validate decision rationale.
Result: Achieves 98.21% accuracy on primary dataset and 95.74% on unseen dataset, demonstrating strong cross-dataset generalization while maintaining transparency and simplicity.
Conclusion: The approach balances simplicity, transparency, and accuracy, supporting more trustworthy and clinically applicable brain tumor classification for improved health outcomes and non-invasive disease detection.
Abstract: Recent brain tumor classification methods often report high accuracy but rely on deep, over-parameterized architectures with limited interpretability, making it difficult to determine whether predictions are driven by tumor-relevant evidence or by spurious cues such as background artifacts or normal tissue. We propose an explainable convolutional neural network (CNN) framework that enhances model transparency without sacrificing classification accuracy. This approach supports more trustworthy AI in healthcare and contributes to SDG 3: Good Health and Well-being by enabling more dependable MRI-based brain tumor diagnosis and earlier detection. Rather than using explainable AI solely for post hoc visualization, we employ Grad-CAM to quantify layer-wise relevance and guide the removal of low-contribution layers, reducing unnecessary depth and parameters while encouraging attention to discriminative tumor regions. We further validate the model’s decision rationale using complementary explainability methods, combining Grad-CAM for spatial localization with SHAP and LIME for attribution-based verification. Experiments on multi-class brain MRI datasets show that the proposed model achieves 98.21% accuracy on the primary dataset and 95.74% accuracy on an unseen dataset, indicating strong cross-dataset generalization. Overall, the proposed approach balances simplicity, transparency, and accuracy, supporting more trustworthy and clinically applicable brain tumor classification for improved health outcomes and non-invasive disease detection.
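The relevance-guided pruning step could look roughly like the sketch below: rank layers by an aggregate Grad-CAM relevance score and keep the smallest set covering most of the total relevance. The cumulative-coverage criterion and the `keep_ratio` threshold are illustrative assumptions, not the paper's stated rule:

```python
def select_layers_by_relevance(layer_relevance, keep_ratio=0.9):
    """Rank layers by mean Grad-CAM relevance and keep the smallest
    subset whose cumulative relevance reaches keep_ratio of the total.

    layer_relevance: dict mapping layer name -> mean absolute
    Grad-CAM activation over a validation set (illustrative input).
    Returns the set of layer names to retain; the remaining layers
    are pruning candidates for the refined architecture.
    """
    total = sum(layer_relevance.values())
    kept, running = set(), 0.0
    for name, score in sorted(layer_relevance.items(),
                              key=lambda kv: kv[1], reverse=True):
        kept.add(name)
        running += score
        if running >= keep_ratio * total:
            break
    return kept
```

After pruning, the model would be retrained and the decision rationale re-checked with SHAP/LIME, as the summary describes.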
[1123] SAGE: Shape-Adapting Gated Experts for Adaptive Histopathology Image Segmentation
Gia Huy Thai, Hoang-Nguyen Vu, Anh-Minh Phan, Quang-Thinh Ly, Tram Dinh, Thi-Ngoc-Truc Nguyen, Nhat Ho
Main category: eess.IV
TL;DR: SAGE is an input-adaptive framework for dynamic expert routing in visual networks that addresses cellular heterogeneity in cancer detection from Whole Slide Images by reconfiguring static backbones into dynamically routed architectures.
Details
Motivation: The paper addresses the challenge of cellular heterogeneity in cancer detection from gigapixel Whole Slide Images, where significant variability in cell size and shape makes computer-assisted detection difficult. Current CNN-Transformer hybrids use static computation graphs with fixed routing, leading to extra computation and poor adaptation to input variations.
Method: Proposes Shape-Adapting Gated Experts (SAGE), an input-adaptive framework with dynamic expert routing. It uses a dual-path design with hierarchical gating and a Shape-Adapting Hub (SA-Hub) that harmonizes feature representations across convolutional and transformer modules. Implemented as SAGE-ConvNeXt+ViT-UNet.
Result: Achieves Dice scores of 95.23% on EBHI, 92.78%/91.42% DSC on GlaS Test A/Test B, and 91.26% DSC at WSI level on DigestPath. Shows robust generalization under distribution shifts by adaptively balancing local refinement and global context.
Conclusion: SAGE establishes a scalable foundation for dynamic expert routing in visual networks, facilitating flexible visual reasoning and addressing the challenges of cellular heterogeneity in medical image analysis.
Abstract: The significant variability in cell size and shape continues to pose a major obstacle in computer-assisted cancer detection on gigapixel Whole Slide Images (WSIs), due to cellular heterogeneity. Current CNN-Transformer hybrids use static computation graphs with fixed routing, which incurs extra computation and limits adaptation to input variations. We propose Shape-Adapting Gated Experts (SAGE), an input-adaptive framework that enables dynamic expert routing in heterogeneous visual networks. SAGE reconfigures static backbones into dynamically routed expert architectures via a dual-path design with hierarchical gating and a Shape-Adapting Hub (SA-Hub) that harmonizes feature representations across convolutional and transformer modules. Embodied as SAGE with ConvNeXt and Vision Transformer UNet (SAGE-ConvNeXt+ViT-UNet), our model achieves a Dice score of 95.23% on EBHI, 92.78%/91.42% DSC on GlaS Test A/Test B, and 91.26% DSC at the WSI level on DigestPath, while exhibiting robust generalization under distribution shifts by adaptively balancing local refinement and global context. SAGE establishes a scalable foundation for dynamic expert routing in visual networks, thereby facilitating flexible visual reasoning.
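The core idea of input-adaptive expert routing can be illustrated with a minimal sketch: a learned gate scores each expert for the current input, only the top-scoring experts run, and their outputs are blended by renormalized gate probabilities. This is a generic mixture-of-experts routing sketch, not SAGE's actual SA-Hub or hierarchical gating:

```python
import numpy as np

def softmax(z):
    z = z - z.max()            # numerical stability
    e = np.exp(z)
    return e / e.sum()

def gated_experts(x, experts, gate_weights, top_k=1):
    """Input-adaptive routing: a linear gate scores each expert for
    this input, only the top_k experts run, and their outputs are
    blended by the renormalized gate probabilities.

    experts: list of callables (e.g. a conv-style and a
    transformer-style branch); gate_weights: (n_experts, dim).
    """
    logits = gate_weights @ x
    probs = softmax(logits)
    chosen = np.argsort(probs)[-top_k:]          # highest-scoring experts
    weights = probs[chosen] / probs[chosen].sum()
    return sum(w * experts[i](x) for w, i in zip(weights, chosen))
```

Because only the selected experts execute, routing like this avoids the wasted computation of running every branch for every input, which is the inefficiency the summary attributes to static computation graphs.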
[1124] M3CoTBench: Benchmark Chain-of-Thought of MLLMs in Medical Image Understanding
Juntao Jiang, Jiangning Zhang, Yali Bi, Jinsheng Bai, Weixuan Liu, Weiwei Jin, Zhucun Xue, Yong Liu, Xiaobin Hu, Shuicheng Yan
Main category: eess.IV
TL;DR: M3CoTBench: A new benchmark for evaluating Chain-of-Thought reasoning in medical multimodal LLMs, focusing on correctness, efficiency, impact, and consistency of reasoning paths in medical image understanding.
Details
Motivation: Current medical image understanding benchmarks focus only on final answers while ignoring reasoning paths, making AI systems opaque and unreliable for clinical decision support. There's a need to evaluate the reasoning process itself to ensure transparent and trustworthy medical AI.
Method: Created M3CoTBench benchmark featuring: 1) diverse dataset covering 24 medical examination types, 2) 13 tasks with varying difficulty, 3) CoT-specific evaluation metrics (correctness, efficiency, impact, consistency), and 4) comprehensive analysis of multiple MLLMs.
Result: The benchmark systematically evaluates CoT reasoning in medical imaging tasks, revealing current limitations of MLLMs in generating reliable and clinically interpretable reasoning paths.
Conclusion: M3CoTBench addresses the gap in evaluating reasoning processes in medical AI and aims to foster development of transparent, trustworthy, and diagnostically accurate AI systems for healthcare.
Abstract: Chain-of-Thought (CoT) reasoning has proven effective in enhancing large language models by encouraging step-by-step intermediate reasoning, and recent advances have extended this paradigm to Multimodal Large Language Models (MLLMs). In the medical domain, where diagnostic decisions depend on nuanced visual cues and sequential reasoning, CoT aligns naturally with clinical thinking processes. However, current benchmarks for medical image understanding generally focus on the final answer while ignoring the reasoning path. Such opaque reasoning processes lack reliable bases for judgment, making it difficult to assist doctors in diagnosis. To address this gap, we introduce a new M3CoTBench benchmark specifically designed to evaluate the correctness, efficiency, impact, and consistency of CoT reasoning in medical image understanding. M3CoTBench features 1) a diverse, multi-level difficulty dataset covering 24 examination types, 2) 13 varying-difficulty tasks, 3) a suite of CoT-specific evaluation metrics (correctness, efficiency, impact, and consistency) tailored to clinical reasoning, and 4) a performance analysis of multiple MLLMs. M3CoTBench systematically evaluates CoT reasoning across diverse medical imaging tasks, revealing current limitations of MLLMs in generating reliable and clinically interpretable reasoning, and aims to foster the development of transparent, trustworthy, and diagnostically accurate AI systems for healthcare. Project page at https://juntaojianggavin.github.io/projects/M3CoTBench/.
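Two of the benchmark's metric families, consistency and efficiency, admit simple self-consistency-style readings, sketched below. These definitions are illustrative assumptions; M3CoTBench's exact metric formulas may differ:

```python
from collections import Counter

def cot_consistency(chains):
    """Fraction of sampled reasoning chains whose final answer agrees
    with the majority answer (one plausible reading of 'consistency').

    chains: list of (reasoning_text, final_answer) tuples sampled
    from the same model on the same question.
    """
    answers = [ans for _, ans in chains]
    _, count = Counter(answers).most_common(1)[0]
    return count / len(answers)

def cot_efficiency(chains):
    """Average reasoning length in whitespace tokens; shorter chains
    that reach the same answer count as more efficient."""
    lengths = [len(text.split()) for text, _ in chains]
    return sum(lengths) / len(lengths)
```

Correctness and impact would additionally require comparing the reasoning steps and final answers against expert annotations, which is where the benchmark's 24 examination types and 13 tasks come in.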