Editor’s Picks
Top papers matching your research interests in multimodal LLMs, audio and vision understanding/generation.
[1] Listen First, Then Answer: Timestamp-Grounded Speech Reasoning
Jihoon Jeong, Pooneh Mousavi, Mirco Ravanelli, Cem Subakan
Main category: cs.SD
TL;DR: RL-based method grounds audio-language model reasoning chains with explicit timestamp annotations to improve faithfulness and performance
Details
Motivation: Current large audio-language models generate reasoning chains, but it's unclear if they remain grounded in the input audio, raising concerns about faithfulness and reliability.
Method: Proposes RL-based strategy that grounds reasoning outputs with explicit timestamp annotations referring to relevant audio segments, encouraging models to attend more to audio tokens during reasoning
Result: Experiments on four speech benchmark datasets show improved performance over zero-shot reasoning and fine-tuning without timestamp grounding; grounding amplifies desirable reasoning behaviors like region exploration, audiology verification, and consistency
Conclusion: Timestamp grounding improves faithfulness and performance in audio-language models, highlighting the importance of grounding mechanisms for reliable multimodal reasoning
Abstract: Large audio-language models (LALMs) can generate reasoning chains for their predictions, but it remains unclear whether these reasoning chains remain grounded in the input audio. In this paper, we propose an RL-based strategy that grounds the reasoning outputs of LALMs with explicit timestamp annotations referring to relevant segments of the audio signal. Our analysis shows that timestamp grounding leads the model to attend more strongly to audio tokens during reasoning generation. Experiments on four speech-based benchmark datasets demonstrate that our approach improves performance compared to both zero-shot reasoning and fine-tuning without timestamp grounding. Additionally, grounding amplifies desirable reasoning behaviors, such as region exploration, audiology verification, and consistency, underscoring the importance of grounding mechanisms for faithful multimodal reasoning.
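The reward used in the RL strategy is not specified in this summary; as a hedged illustration, a grounding reward might combine answer correctness with temporal overlap between the segments the model cites and annotated relevant segments. All function names, weights, and the overlap formula below are assumptions, not the paper's method:

```python
def interval_overlap(a, b):
    """Length of the overlap between two (start, end) intervals in seconds."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def grounding_reward(pred_answer, gold_answer, cited_spans, gold_spans,
                     w_answer=0.7, w_ground=0.3):
    """Toy reward: answer correctness plus temporal grounding credit."""
    answer_r = 1.0 if pred_answer == gold_answer else 0.0
    if not cited_spans or not gold_spans:
        ground_r = 0.0
    else:
        overlap = sum(interval_overlap(c, g)
                      for c in cited_spans for g in gold_spans)
        total = sum(e - s for s, e in cited_spans) + sum(e - s for s, e in gold_spans)
        ground_r = min(1.0, 2 * overlap / total) if total else 0.0
    return w_answer * answer_r + w_ground * ground_r
```

Under this toy design, a correct answer citing a half-overlapping segment earns partial grounding credit — the kind of signal that could push attention toward audio tokens during reasoning.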
Relevance: 9/10
[2] Borderless Long Speech Synthesis
Xingchen Song, Di Wu, Dinghao Zhou, Pengyu Cheng, Hongwu Ding, Yunchao He, Jie Wang, Shengfan Shen, Sixiang Lv, Lichun Fan, Hang Su, Yifeng Wang, Shuai Wang, Meng Meng, Jian Luan
Main category: cs.SD
TL;DR: A unified framework for borderless long speech synthesis that goes beyond traditional TTS by incorporating global context, multi-speaker interactions, and paralinguistic cues through hierarchical annotation and LLM-agent integration.
Details
Motivation: Existing TTS systems lack understanding of global context and paralinguistic cues, making it hard to capture real-world phenomena like multi-speaker interactions, emotional arcs, and varied acoustic environments.
Method: Proposes Borderless Long Speech Synthesis framework with: 1) “Labeling over filtering/cleaning” data strategy, 2) Global-Sentence-Token hierarchical annotation schema, 3) Continuous tokenizer backbone with Chain-of-Thought reasoning and Dimension Dropout, 4) Native Agentic design with Structured Semantic Interface between LLM Agent and synthesis engine.
Result: The system enables unified capabilities spanning VoiceDesigner, multi-speaker synthesis, Instruct TTS, and long-form text synthesis, with improved instruction following under complex conditions.
Conclusion: The framework extends Text2Speech to borderless long speech synthesis by creating a layered control protocol stack that enables front-end LLMs to convert any modality inputs into structured generation commands.
Abstract: Most existing text-to-speech (TTS) systems either synthesize speech sentence by sentence and stitch the results together, or drive synthesis from plain-text dialogues alone. Both approaches leave models with little understanding of global context or paralinguistic cues, making it hard to capture real-world phenomena such as multi-speaker interactions (interruptions, overlapping speech), evolving emotional arcs, and varied acoustic environments. We introduce the Borderless Long Speech Synthesis framework for agent-centric, borderless long audio synthesis. Rather than targeting a single narrow task, the system is designed as a unified capability set spanning VoiceDesigner, multi-speaker synthesis, Instruct TTS, and long-form text synthesis. On the data side, we propose a “Labeling over filtering/cleaning” strategy and design a top-down, multi-level annotation schema we call Global-Sentence-Token. On the model side, we adopt a backbone with a continuous tokenizer and add Chain-of-Thought (CoT) reasoning together with Dimension Dropout, both of which markedly improve instruction following under complex conditions. We further show that the system is Native Agentic by design: the hierarchical annotation doubles as a Structured Semantic Interface between the LLM Agent and the synthesis engine, creating a layered control protocol stack that spans from scene semantics down to phonetic detail. Text thereby becomes an information-complete, wide-band control channel, enabling a front-end LLM to convert inputs of any modality into structured generation commands, extending the paradigm from Text2Speech to borderless long speech synthesis.
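The Global-Sentence-Token schema is described only at a high level here. A hedged sketch of what a three-level annotation could look like as Python dataclasses — every field name below is an assumption, not the paper's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class TokenAnnotation:          # lowest level: phonetic/acoustic detail
    text: str
    emphasis: bool = False

@dataclass
class SentenceAnnotation:       # middle level: per-utterance control
    speaker: str
    emotion: str
    tokens: list = field(default_factory=list)

@dataclass
class GlobalAnnotation:         # top level: scene-wide semantics
    scene: str
    acoustic_env: str
    sentences: list = field(default_factory=list)

script = GlobalAnnotation(
    scene="heated debate", acoustic_env="small room",
    sentences=[SentenceAnnotation("A", "angry",
                                  [TokenAnnotation("no", emphasis=True)])])
```

A nested structure like this is what would let the same annotation double as a "Structured Semantic Interface": an LLM agent can emit the top levels while the synthesis engine consumes the leaves.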
Relevance: 9/10
[3] FoleyDirector: Fine-Grained Temporal Steering for Video-to-Audio Generation via Structured Scripts
You Li, Dewei Zhou, Fan Ma, Fu Li, Dongliang He, Yi Yang
Main category: cs.SD
TL;DR: FoleyDirector enables precise temporal control in video-to-audio generation using structured temporal scripts and bi-frame synthesis for multi-event scenarios.
Details
Motivation: Current V2A methods struggle with fine-grained temporal control in multi-event scenarios or when visual cues are insufficient (small regions, off-screen sounds, occluded objects). There's a need for precise temporal guidance while maintaining audio quality.
Method: Introduces Structured Temporal Scripts (STS) - captions for short temporal segments; Script-Guided Temporal Fusion Module with Temporal Script Attention; Bi-Frame Sound Synthesis for parallel in-frame/out-of-frame audio generation; and new datasets DirectorSound, VGGSoundDirector, and DirectorBench.
Result: FoleyDirector substantially enhances temporal controllability while maintaining high audio fidelity, enabling users to act as Foley directors and advancing V2A toward more expressive and controllable generation.
Conclusion: The framework enables precise temporal guidance in DiT-based V2A generation while preserving audio quality, allowing seamless switching between V2A generation and temporally controlled synthesis.
Abstract: Recent Video-to-Audio (V2A) methods have achieved remarkable progress, enabling the synthesis of realistic, high-quality audio. However, they struggle with fine-grained temporal control in multi-event scenarios or when visual cues are insufficient, such as small regions, off-screen sounds, or occluded or partially visible objects. In this paper, we propose FoleyDirector, a framework that, for the first time, enables precise temporal guidance in DiT-based V2A generation while preserving the base model’s audio quality and allowing seamless switching between V2A generation and temporally controlled synthesis. FoleyDirector introduces Structured Temporal Scripts (STS), a set of captions corresponding to short temporal segments, to provide richer temporal information. These features are integrated via the Script-Guided Temporal Fusion Module, which employs Temporal Script Attention to fuse STS features coherently. To handle complex multi-event scenarios, we further propose Bi-Frame Sound Synthesis, enabling parallel in-frame and out-of-frame audio generation and improving controllability. To support training and evaluation, we construct the DirectorSound dataset and introduce VGGSoundDirector and DirectorBench. Experiments demonstrate that FoleyDirector substantially enhances temporal controllability while maintaining high audio fidelity, empowering users to act as Foley directors and advancing V2A toward more expressive and controllable generation.
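The summary describes Structured Temporal Scripts as captions over short temporal segments. As an illustrative assumption (not the paper's data format), an STS could be a list of (start, end, caption) triples that overlapping events share the timeline of:

```python
def active_scripts(sts, t):
    """Captions of script segments whose [start, end) window covers time t."""
    return [caption for start, end, caption in sts if start <= t < end]

# Toy script: note the overlapping in-frame and off-screen events.
sts = [(0.0, 2.0, "door creaks open"),
       (2.0, 5.0, "rain on window"),
       (4.0, 6.0, "footsteps off-screen")]
```

Querying the script at a point inside the overlap returns both captions, which is the situation Bi-Frame Sound Synthesis is meant to handle in parallel.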
Relevance: 9/10
Today’s Research Highlights
AI-enhanced summaries of the latest research papers from arXiv.
Table of Contents
- cs.CL [Total: 105]
- cs.CV [Total: 182]
- cs.AI [Total: 57]
- cs.SD [Total: 8]
- cs.LG [Total: 134]
- cs.MA [Total: 3]
- cs.MM [Total: 0]
- eess.AS [Total: 4]
- eess.IV [Total: 6]
cs.CL
[1] When Prompt Optimization Becomes Jailbreaking: Adaptive Red-Teaming of Large Language Models
Zafir Shamsi, Nikhil Chekuru, Zachary Guzman, Shivank Garg
Main category: cs.CL
TL;DR: Black-box prompt optimization techniques can be repurposed to systematically find safety failures in LLMs, showing that static safety benchmarks underestimate risks from adaptive attacks.
Details
Motivation: Current safety evaluations for LLMs rely on fixed harmful prompts, overlooking realistic attack scenarios where adversaries iteratively refine inputs to evade safeguards. There's a need to examine vulnerability to automated adversarial prompt refinement.
Method: Repurposed black-box prompt optimization techniques (originally for benign tasks) to systematically search for safety failures. Used DSPy to apply three optimizers to prompts from HarmfulQA and JailbreakBench, optimizing toward a continuous danger score (0-1) from GPT-5.1 evaluator.
Result: Substantial reduction in effective safety safeguards, especially pronounced for open-source small language models. Example: Qwen 3 8B’s average danger score increased from 0.09 (baseline) to 0.79 after optimization.
Conclusion: Static benchmarks may underestimate residual risk; automated, adaptive red-teaming is necessary for robust safety evaluation of LLMs in high-stakes applications.
Abstract: Large Language Models (LLMs) are increasingly integrated into high-stakes applications, making robust safety guarantees a central practical and commercial concern. Existing safety evaluations predominantly rely on fixed collections of harmful prompts, implicitly assuming non-adaptive adversaries and thereby overlooking realistic attack scenarios in which inputs are iteratively refined to evade safeguards. In this work, we examine the vulnerability of contemporary language models to automated, adversarial prompt refinement. We repurpose black-box prompt optimization techniques, originally designed to improve performance on benign tasks, to systematically search for safety failures. Using DSPy, we apply three such optimizers to prompts drawn from HarmfulQA and JailbreakBench, explicitly optimizing toward a continuous danger score in the range 0 to 1 provided by an independent evaluator model (GPT-5.1). Our results demonstrate a substantial reduction in effective safety safeguards, with the effects being especially pronounced for open-source small language models. For example, the average danger score of Qwen 3 8B increases from 0.09 in its baseline setting to 0.79 after optimization. These findings suggest that static benchmarks may underestimate residual risk, indicating that automated, adaptive red-teaming is a necessary component of robust safety evaluation.
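The DSPy optimizers themselves are not shown here, but the loop they automate can be sketched as a greedy black-box search against a scalar scorer. The mock scorer and mutation list below are toy stand-ins, not the paper's evaluator or attack strings:

```python
import random

def mock_danger_score(prompt):
    """Toy stand-in for the external evaluator: a keyword heuristic in [0, 1]."""
    return min(1.0, 0.1 + 0.2 * sum(w in prompt for w in ("step", "detailed", "bypass")))

def refine(prompt, mutations, scorer, rounds=10, seed=0):
    """Greedy black-box search: keep a mutation only if the score increases."""
    rng = random.Random(seed)
    best, best_score = prompt, scorer(prompt)
    for _ in range(rounds):
        cand = best + " " + rng.choice(mutations)
        score = scorer(cand)
        if score > best_score:
            best, best_score = cand, score
    return best, best_score
```

The point of the sketch is structural: nothing in the loop distinguishes "optimize for task accuracy" from "optimize for danger score" — only the scorer changes, which is why benign optimizers repurpose so easily.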
[2] DuCCAE: A Hybrid Engine for Immersive Conversation via Collaboration, Augmentation, and Evolution
Xin Shen, Zhishu Jiang, Jiaye Yang, Haibo Liu, Yichen Wan, Jiarui Zhang, Tingzhi Dai, Luodong Xu, Shuchen Wu, Guanqiang QI, Chenxi Miao, Jiahui Liang, Yang Li, Weikang Li, Deguo Xia, Jizhou Huang
Main category: cs.CL
TL;DR: DuCCAE is a hybrid conversational system that decouples real-time response generation from asynchronous agentic execution to balance responsiveness with complex task capability, deployed in Baidu Search with significant performance improvements.
Details
Motivation: Immersive conversational systems face a trade-off between real-time responsiveness and long-horizon task capability. Lightweight turns work in real-time, but planning and tool invocation tasks cause heavy latency that degrades turn-taking, persona consistency, and user trust.
Method: DuCCAE decouples real-time response generation from asynchronous agentic execution, synchronizing them via a shared state that maintains session context and execution traces. The system orchestrates five subsystems: Info, Conversation, Collaboration, Augmentation, and Evolution to support multi-agent collaboration and continuous improvement.
Result: DuCCAE outperforms baselines in agentic execution reliability and dialogue quality while reducing latency. Deployment metrics show tripling of Day-7 user retention to 34.2% and complex task completion rate surge to 65.2%.
Conclusion: The hybrid architecture successfully preserves conversational continuity while enabling reliable agentic execution, offering practical guidelines for deploying scalable agentic systems in industrial settings.
Abstract: Immersive conversational systems in production face a persistent trade-off between responsiveness and long-horizon task capability. Real-time interaction is achievable for lightweight turns, but requests involving planning and tool invocation (e.g., search and media generation) produce heavy-tail execution latency that degrades turn-taking, persona consistency, and user trust. To address this challenge, we propose DuCCAE (Conversation while Collaboration with Augmentation and Evolution), a hybrid engine for immersive conversation deployed within Baidu Search, serving millions of users. DuCCAE decouples real-time response generation from asynchronous agentic execution and synchronizes them via a shared state that maintains session context and execution traces, enabling asynchronous results to be integrated back into the ongoing dialogue. The system orchestrates five subsystems-Info, Conversation, Collaboration, Augmentation, and Evolution-to support multi-agent collaboration and continuous improvement. We evaluate DuCCAE through a comprehensive framework that combines offline benchmarking on the Du-Interact dataset and large-scale production evaluation within Baidu Search. Experimental results demonstrate that DuCCAE outperforms strong baselines in agentic execution reliability and dialogue quality while reducing latency to fit strict real-time budgets. Crucially, deployment metrics since June 2025 confirm substantial real-world effectiveness, evidenced by a tripling of Day-7 user retention to 34.2% and a surge in the complex task completion rate to 65.2%. Our hybrid architecture successfully preserves conversational continuity while enabling reliable agentic execution, offering practical guidelines for deploying scalable agentic systems in industrial settings.
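The decoupling described above can be illustrated with a toy asyncio sketch: a slow agentic task writes into a shared state off the conversational path, and a later turn merges the result back. Names and structure are illustrative assumptions, not DuCCAE's implementation:

```python
import asyncio

async def agentic_task(shared, name):
    """Slow tool call running off the real-time conversational path."""
    await asyncio.sleep(0.01)                     # stands in for search/media generation
    shared["traces"].append(f"{name}: done")

async def conversation():
    """Real-time turn returns immediately; the agentic result is merged
    back from the shared state on a later turn."""
    shared = {"traces": []}
    task = asyncio.create_task(agentic_task(shared, "search"))
    replies = ["Sure, let me look that up."]      # immediate, lightweight turn
    await task                                    # later turn: result is ready
    replies.append(f"Update: {shared['traces'][-1]}")
    return replies

replies = asyncio.run(conversation())
```

The shared dict plays the role of the paper's shared state: both the dialogue loop and the asynchronous executor read and write it, so heavy-tail latency never blocks turn-taking.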
[3] Spelling Correction in Healthcare Query-Answer Systems: Methods, Retrieval Impact, and Empirical Evaluation
Saurabh K Singh
Main category: cs.CL
TL;DR: Study shows spelling correction significantly improves healthcare QA retrieval, with 61.5% of real medical queries containing errors and query-side correction being key intervention.
Details
Motivation: Healthcare QA systems face high rates of spelling errors in user queries compared to professional documents, creating retrieval challenges that need systematic evaluation.
Method: Conducted error census on two public datasets, evaluated four spelling correction methods across three experimental conditions using BM25 and TF-IDF retrieval over MedQuAD passages with TREC relevance judgments.
Result: 61.5% of real medical queries contain spelling errors; query correction substantially improves retrieval (edit distance methods achieve +9.2% MRR improvement), while corpus-only correction yields minimal gains (+0.5% MRR).
Conclusion: Spelling correction is crucial for healthcare QA systems, with query-side correction being the key intervention, and provides evidence-based recommendations for practitioners.
Abstract: Healthcare question-answering (QA) systems face a persistent challenge: users submit queries with spelling errors at rates substantially higher than those found in the professional documents they search. This paper presents the first controlled study of spelling correction as a retrieval preprocessing step in healthcare QA using real consumer queries. We conduct an error census across two public datasets – the TREC 2017 LiveQA Medical track (104 consumer health questions) and HealthSearchQA (4,436 health queries from Google autocomplete) – finding that 61.5% of real medical queries contain at least one spelling error, with a token-level error rate of 11.0%. We evaluate four correction methods – conservative edit distance, standard edit distance (Levenshtein), context-aware candidate ranking, and SymSpell – across three experimental conditions: uncorrected queries against an uncorrected corpus (baseline), uncorrected queries against a corrected corpus, and fully corrected queries against a corrected corpus. Using BM25 and TF-IDF cosine retrieval over 1,935 MedQuAD answer passages with TREC relevance judgments, we find that query correction substantially improves retrieval – edit distance and context-aware correction achieve MRR improvements of +9.2% and NDCG@10 improvements of +8.3% over the uncorrected baseline. Critically, correcting only the corpus without correcting queries yields minimal improvement (+0.5% MRR), confirming that query-side correction is the key intervention. We complement these results with a 100-sample error analysis categorising correction outcomes per method and provide evidence-based recommendations for practitioners.
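The "standard edit distance (Levenshtein)" method among the four evaluated can be sketched directly; the vocabulary and `max_dist` threshold below are illustrative, not the paper's configuration:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def correct(token, vocab, max_dist=2):
    """Replace a token with its nearest vocabulary term within max_dist edits."""
    if token in vocab:
        return token
    best = min(vocab, key=lambda v: levenshtein(token, v))
    return best if levenshtein(token, best) <= max_dist else token
```

Applied query-side before BM25 indexing lookups, this is the intervention the paper finds matters; running the same correction over the (already clean) corpus is what yields the negligible +0.5% MRR.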
[4] Can Structural Cues Save LLMs? Evaluating Language Models in Massive Document Streams
Yukyung Lee, Yebin Lim, Woojun Jung, Wonjun Choi, Susik Yoon
Main category: cs.CL
TL;DR: StreamBench: A benchmark for evaluating language models in streaming document environments with multiple concurrent events, featuring tasks for topic clustering, temporal QA, and summarization.
Details
Motivation: Existing benchmarks don't evaluate models under realistic streaming conditions where multiple concurrent events are mixed within the same document stream, creating conflicts and challenges for information processing.
Method: Created StreamBench from major news stories in 2016 and 2025, comprising 605 events and 15,354 documents across three tasks: Topic Clustering, Temporal Question Answering, and Summarization. Evaluated models with and without structural cues that organize key facts by event.
Result: Structural cues improve performance on clustering (up to +4.37%) and temporal QA (up to +9.63%), helping models locate relevant information and separate distinct events. Temporal reasoning remains challenging for current LLMs.
Conclusion: Structural cues are a promising direction for improving language model performance in massive document streams, though temporal reasoning remains an open challenge for current LLMs.
Abstract: Evaluating language models in streaming environments is critical, yet underexplored. Existing benchmarks either focus on single complex events or provide curated inputs for each query, and do not evaluate models under the conflicts that arise when multiple concurrent events are mixed within the same document stream. We introduce StreamBench, a benchmark built from major news stories in 2016 and 2025, comprising 605 events and 15,354 documents across three tasks: Topic Clustering, Temporal Question Answering, and Summarization. To diagnose how models fail, we compare performance with and without structural cues, which organize key facts by event. We find that structural cues improve performance on clustering (up to +4.37%) and temporal QA (up to +9.63%), helping models locate relevant information and separate distinct events. While temporal reasoning remains an open challenge inherent to current LLMs, consistent gains across tasks show that structural cues are a promising direction for future work in massive document streams.
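What a "structural cue" looks like concretely is not given here; one hedged reading is that facts from a mixed stream are regrouped into per-event blocks before being shown to the model:

```python
from collections import defaultdict

def structural_cues(stream):
    """Group (event_id, fact) pairs from a mixed document stream into
    per-event cue blocks, preserving arrival order within each event."""
    by_event = defaultdict(list)
    for event_id, fact in stream:
        by_event[event_id].append(fact)
    return {event: " | ".join(facts) for event, facts in by_event.items()}
```

Interleaved facts from concurrent events become separated blocks, which matches the reported effect: cues help models locate relevant information and keep distinct events apart.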
[5] Enhancing Legal LLMs through Metadata-Enriched RAG Pipelines and Direct Preference Optimization
Suyash Maniyar, Deepali Singh, Rohith Reddy
Main category: cs.CL
TL;DR: Proposes Metadata Enriched Hybrid RAG and DPO-based refusal training to improve legal LLMs’ reliability on long documents by enhancing retrieval and enabling safe refusal when context is insufficient.
Details
Motivation: LLMs degrade on long legal documents, producing hallucinations like incorrect clauses/precedents. RAG helps but has limitations in legal settings with small, privacy-preserving models. Two failure modes identified: retrieval errors from lexical redundancy in legal corpora, and decoding errors where models generate answers despite insufficient context.
Method: 1) Metadata Enriched Hybrid RAG to improve document-level retrieval by incorporating metadata and hybrid search techniques. 2) Direct Preference Optimization (DPO) to train models to safely refuse answering when context is inadequate, preventing hallucinated responses.
Result: The combined methods improve grounding, reliability, and safety in legal language models by addressing both retrieval and generation failure modes.
Conclusion: The proposed approach enhances legal LLM performance on long documents through better retrieval and refusal mechanisms, making them more reliable and trustworthy for legal applications.
Abstract: Large Language Models (LLMs) perform well in short contexts but degrade on long legal documents, often producing hallucinations such as incorrect clauses or precedents. In the legal domain, where precision is critical, such errors undermine reliability and trust. Retrieval Augmented Generation (RAG) helps ground outputs but remains limited in legal settings, especially with small, locally deployed models required for data privacy. We identify two failure modes: retrieval errors due to lexical redundancy in legal corpora, and decoding errors where models generate answers despite insufficient context. To address this, we propose Metadata Enriched Hybrid RAG to improve document level retrieval, and apply Direct Preference Optimization (DPO) to enforce safe refusal when context is inadequate. Together, these methods improve grounding, reliability, and safety in legal language models.
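The DPO objective the paper applies is the standard one; a minimal scalar sketch on sequence log-probabilities, where "chosen" would be a safe refusal and "rejected" a hallucinated answer (beta and the example log-probs are illustrative):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO objective: -log sigmoid(beta * log-ratio margin),
    where the margin compares policy-vs-reference log-probs of the
    chosen and rejected responses."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -math.log(1 / (1 + math.exp(-beta * margin)))
```

When the policy raises the refusal's likelihood relative to the reference model and lowers the hallucinated answer's, the margin grows and the loss falls below the zero-margin value of -log(0.5).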
[6] GeoChallenge: A Multi-Answer Multiple-Choice Benchmark for Geometric Reasoning with Diagrams
Yushun Zhang, Weiping Fu, Zesheng Yang, Bo Zhao, Lingling Zhang, Jian Zhang, Yumeng Fu, Jiaxing Huang, Jun Liu
Main category: cs.CL
TL;DR: GeoChallenge: A large-scale dataset of 90K automatically generated multiple-choice geometry proof problems requiring multi-step reasoning over aligned text and diagrams, used to evaluate LLMs’ symbolic reasoning capabilities.
Details
Motivation: Existing geometry benchmarks are limited in scale and rarely provide visually grounded multiple-choice questions, limiting reliable evaluation of complex reasoning in LLMs. There's a need for comprehensive benchmarks that require multi-step proofs grounded in both text and diagrams.
Method: Created GeoChallenge dataset with 90K automatically generated multiple-choice geometry proof problems. Each problem requires multi-step reasoning over aligned textual descriptions and diagrams. The dataset provides fine-grained complexity ratings and formal language annotations for controlled evaluation.
Result: Experiments show a clear performance gap between LLMs and humans (best model GPT-5-nano achieves 75.89 exact match vs. 94.74 for humans). Analysis reveals three common LLM failure patterns: (1) exact match failures under multiple-choice setting, (2) weak visual reliance, and (3) overextended reasoning without convergence.
Conclusion: GeoChallenge enables comprehensive evaluation of LLMs’ symbolic reasoning capabilities, revealing significant gaps between current models and human performance, particularly in visual reasoning and multi-step proof convergence.
Abstract: Evaluating the symbolic reasoning of large language models (LLMs) calls for geometry benchmarks that require multi-step proofs grounded in both text and diagrams. However, existing benchmarks are often limited in scale and rarely provide visually grounded multiple-choice questions, limiting reliable evaluation of complex reasoning. We introduce GeoChallenge, a dataset of 90K automatically generated multiple-choice geometry proof problems, each requiring multi-step reasoning over aligned textual descriptions and diagrams. GeoChallenge provides fine-grained complexity ratings and formal language annotations to enable controlled evaluation. Experiments on multiple advanced LLMs show a clear performance gap between models and humans (the best-performing model, GPT-5-nano, achieves 75.89 exact match vs. 94.74 for humans). Further analysis also reveals three common failure patterns of LLMs: (1) exact match failures under the multiple-choice setting; (2) weak visual reliance; and (3) overextended reasoning without convergence.
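Exact match for multi-answer multiple choice, the metric reported above, reduces to set equality over the selected options — partial selections score zero, which is why it is a notably strict failure mode for LLMs:

```python
def exact_match(pred, gold):
    """Multi-answer MC scoring: full credit only when the predicted option
    set equals the gold set exactly; any missing or extra option scores 0."""
    return float(set(pred) == set(gold))
```

A model that correctly identifies option A but misses option C gets the same score as one that identifies nothing, which is consistent with "exact match failures under the multiple-choice setting" being a distinct failure pattern.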
[7] A comprehensive study of LLM-based argument classification: from Llama through DeepSeek to GPT-5.2
Marcin Pietroń, Filip Gampel, Jakub Gomułka, Andrzej Tomski, Rafał Olszowski
Main category: cs.CL
TL;DR: Comprehensive evaluation of state-of-the-art LLMs (GPT-5.2, Llama 4, DeepSeek) on argument classification tasks using advanced prompting strategies, achieving up to 91.9% accuracy on Args.me dataset.
Details
Motivation: Recent advances in LLMs have significantly improved argument classification performance compared to traditional ML approaches, but there's a need for comprehensive evaluation of these models on argument mining tasks using advanced prompting techniques.
Method: Evaluated multiple SOTA LLMs on large publicly available argument classification corpora (Args.me, UKP) using advanced prompting strategies including Chain-of-Thought prompting, prompt rephrasing, voting, and certainty-based classification.
Result: Best-performing model (GPT-5.2) achieved 78.0% accuracy on UKP and 91.9% on Args.me. Advanced prompting techniques improved accuracy and F1 by 2-8%. Qualitative analysis revealed systematic failure modes across models.
Conclusion: LLMs with advanced prompting strategies significantly improve argument classification performance, but systematic challenges remain in prompt stability, implicit criticism detection, complex argument interpretation, and claim alignment.
Abstract: Argument mining (AM) is an interdisciplinary research field focused on the automatic identification and classification of argumentative components, such as claims and premises, and the relationships between them. Recent advances in large language models (LLMs) have significantly improved the performance of argument classification compared to traditional machine learning approaches. This study presents a comprehensive evaluation of several state-of-the-art LLMs, including GPT-5.2, Llama 4, and DeepSeek, on large publicly available argument classification corpora such as Args.me and UKP. The evaluation incorporates advanced prompting strategies, including Chain-of- Thought prompting, prompt rephrasing, voting, and certainty-based classification. Both quantitative performance metrics and qualitative error analysis are conducted to assess model behavior. The best-performing model in the study (GPT-5.2) achieves a classification accuracy of 78.0% (UKP) and 91.9% (Args.me). The use of prompt rephrasing, multi-prompt voting, and certainty estimation further improves classification performance and robustness. These techniques increase the accuracy and F1 metric of the models by typically a few percentage points (from 2% to 8%). However, qualitative analysis reveals systematic failure modes shared across models, including instabilities with respect to prompt formulation, difficulties in detecting implicit criticism, interpreting complex argument structures, and aligning arguments with specific claims. This work contributes the first comprehensive evaluation that combines quantitative benchmarking and qualitative error analysis on multiple argument mining datasets using advanced LLM prompting strategies.
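Multi-prompt voting, one of the strategies evaluated, can be sketched as a majority vote over the labels returned for several rephrasings of the same input (the keyword classifier below is a toy stand-in for an LLM call):

```python
from collections import Counter

def vote(classify, rephrasings):
    """Majority vote over labels from rephrased prompts; Counter.most_common
    is stable, so ties go to the first label seen."""
    labels = [classify(prompt) for prompt in rephrasings]
    return Counter(labels).most_common(1)[0][0]

# Toy stand-in classifier: an LLM call would go here instead.
toy_classify = lambda p: "pro" if "support" in p else "con"
```

Aggregating over rephrasings is exactly what damps the prompt-formulation instability the qualitative analysis flags: a label that flips on one phrasing is outvoted by the others.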
[8] From Comprehension to Reasoning: A Hierarchical Benchmark for Automated Financial Research Reporting
Yiyun Zhu, Yidong Jiang, Ziwen Xu, Yinsheng Yao, Dawei Cheng, Jinru Ding, Yejie Zheng, Jie Xu
Main category: cs.CL
TL;DR: FinReasoning benchmark evaluates LLMs for financial research report generation, assessing semantic consistency, data alignment, and deep insight through three-stage workflow alignment.
Details
Motivation: Current LLMs used for financial report generation suffer from factual errors, numerical inconsistencies, fabricated references, and shallow analysis, but existing benchmarks focus on comprehension rather than evaluating reliable analysis generation. Current evaluation frameworks only flag hallucinations without structured measures for deeper analytical skills.
Method: Introduces FinReasoning benchmark that decomposes Chinese research-report generation into three stages aligned with real analyst workflows: semantic consistency, data alignment, and deep insight. Proposes fine-grained evaluation framework with strengthened hallucination-correction assessment and 12-indicator rubric for core analytical skills.
Result: Most models show understanding-execution gap: can identify errors but struggle to generate accurate corrections; can retrieve data but have difficulty returning in correct format. No model achieves overwhelming superiority across all three tracks; Doubao-Seed-1.8, GPT-5, and Kimi-K2 rank as top three with distinct capability distributions.
Conclusion: FinReasoning provides comprehensive evaluation framework for financial report generation LLMs, revealing systematic capability gaps and enabling targeted improvements in financial analysis AI systems.
Abstract: Large language models (LLMs) are increasingly used to generate financial research reports, shifting from auxiliary analytic tools to primary content producers. Yet recent real-world deployments reveal persistent failures–factual errors, numerical inconsistencies, fabricated references, and shallow analysis–that can distort assessments of corporate fundamentals and ultimately trigger severe economic losses. However, existing financial benchmarks focus on comprehension over completed reports rather than evaluating whether a model can produce reliable analysis. Moreover, current evaluation frameworks merely flag hallucinations and lack structured measures for deeper analytical skills, leaving key analytical bottlenecks undiscovered. To address these gaps, we introduce FinReasoning, a benchmark that decomposes Chinese research-report generation into three stages aligned with real analyst workflows, assessing semantic consistency, data alignment, and deep insight. We further propose a fine-grained evaluation framework that strengthens hallucination-correction assessment and incorporates a 12-indicator rubric for core analytical skills. Based on the evaluation results, FinReasoning reveals that most models exhibit an understanding-execution gap: they can identify errors but struggle to generate accurate corrections; they can retrieve data but have difficulty returning it in correct format. Furthermore, no model achieves overwhelming superiority across all three tracks; Doubao-Seed-1.8, GPT-5, and Kimi-K2 rank as the top three in overall performance, yet each exhibits a distinct capability distribution. The evaluation resource is available at https://github.com/TongjiFinLab/FinReasoning.
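How the 12-indicator rubric is aggregated into a score is not stated here; a hedged sketch is a weighted mean over per-indicator scores (equal weights and the 0-1 scale are assumptions, and the actual 12 indicators are not listed):

```python
def rubric_score(indicator_scores, weights=None):
    """Aggregate per-indicator scores (each in [0, 1]) into one report
    score; equal weights by default."""
    if weights is None:
        weights = [1.0] * len(indicator_scores)
    total = sum(weights)
    return sum(s * w for s, w in zip(indicator_scores, weights)) / total
```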
[9] LARFT: Closing the Cognition-Action Gap for Length Instruction Following in Large Language Models
Wei Zhang, Lintong Du, Yuanhe Zhang, Zhenhong Zhou, Kun Wang, Li Sun, Sen Su
Main category: cs.CL
TL;DR: LARFT is a training framework that improves LLMs’ ability to follow length instructions by enhancing their intrinsic length cognition through reinforcement learning with hindsight length awareness.
Details
Motivation: Current LLMs struggle with precise output length control despite strong instruction-following capabilities. Existing methods impose external constraints but fail to address the core problem: models lack intrinsic understanding of length, leading to unreliable length instruction following.
Method: LARFT integrates length-oriented reinforcement learning with hindsight length awareness. It transforms on-policy data into hindsight self-awareness tasks where the model learns to identify the actual length of its own generation, jointly optimizing length representation and policy refinement.
Result: Experiments across four base models show LARFT outperforms existing baselines with +20.92 average improvement on three length instruction benchmarks, while maintaining general capabilities with only -1.45 point decline on four general benchmarks.
Conclusion: LARFT successfully addresses LLMs’ length cognition deficit by aligning internal length representation with generation actions, enabling precise and reliable length instruction following without significantly compromising general capabilities.
Abstract: Despite the strong performance of Large Language Models (LLMs) on complex instruction-following tasks, precise control of output length remains a persistent challenge. Existing methods primarily attempt to enforce length constraints by externally imposing length signals or optimization objectives, while largely overlooking the underlying limitation: the model’s intrinsic deficit in length cognition. To address this, we propose LARFT (Length-Aware Reinforcement Fine-Tuning), a training framework that aligns the model’s length cognition with its action. Specifically, LARFT integrates length-oriented reinforcement learning with a hindsight length awareness. By transforming on-policy data into hindsight self-awareness tasks where the model learns to identify the actual length of its own generation, LARFT jointly optimizes the model’s internal representation of length information and refines its policy to satisfy length constraints, thereby achieving precise and reliable length instruction following. Extensive experiments across four base models demonstrate that LARFT outperforms existing baselines, achieving an average improvement of +20.92 points across three length instruction following benchmarks with only a marginal decline of -1.45 points on four general capability benchmarks.
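The hindsight length-awareness idea is concrete enough to sketch. The summary does not give the paper's actual reward function, so the reward shape and the `tolerance` parameter below are assumptions; only the hindsight labeling step (have the model identify the length of its own output) comes from the text:

```python
def length_reward(actual_len: int, target_len: int, tolerance: int = 0) -> float:
    """Reward in (0, 1] that decays as the output deviates from the target
    length (illustrative shape, not the paper's reward)."""
    deviation = max(abs(actual_len - target_len) - tolerance, 0)
    return 1.0 / (1.0 + deviation / max(target_len, 1))

def hindsight_length_label(generated_tokens: list[str]) -> int:
    """Hindsight self-awareness target: the model must report the actual
    length of its own generation."""
    return len(generated_tokens)
```

In a LARFT-style setup, the RL objective would combine a reward like this on the length constraint with the hindsight labeling task built from the model's own on-policy outputs.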
[10] ShobdoSetu: A Data-Centric Framework for Bengali Long-Form Speech Recognition and Speaker Diarization
Md. Nazmus Sakib, Shafiul Tanvir, Mesbah Uddin Ahamed, H. M. Aktaruzzaman Mukdho
Main category: cs.CL
TL;DR: A system for Bengali speech recognition and speaker diarization using data-centric approaches and fine-tuning pre-trained models, achieving competitive results with limited resources.
Details
Motivation: Bengali is spoken by over 230 million people but remains severely under-served in automatic speech recognition (ASR) and speaker diarization research, creating a need for effective solutions with limited annotated data.
Method: For ASR: Data-centric pipeline constructing training corpus from Bengali YouTube audiobooks/dramas with LLM-assisted language normalization, fuzzy-matching chunk validation, and muffled-zone augmentation. Fine-tuned whisper-medium model. For diarization: Fine-tuned pyannote.audio segmentation model with hyperparameter optimization using only 10 training files.
Result: ASR: WER of 16.751 (public) and 15.551 (private). Diarization: DER of 0.19974 (public) and 0.26723 (private).
Conclusion: Careful data engineering and domain-adaptive fine-tuning can yield competitive performance for Bengali speech processing even without large annotated corpora.
Abstract: Bengali is spoken by over 230 million people yet remains severely under-served in automatic speech recognition (ASR) and speaker diarization research. In this paper, we present our system for the DL Sprint 4.0 Bengali Long-Form Speech Recognition (Task1) and Bengali Speaker Diarization Challenge (Task2). For Task1, we propose a data-centric pipeline that constructs a high-quality training corpus from Bengali YouTube audiobooks and dramas [tabib2026bengaliloop], incorporating LLM-assisted language normalization, fuzzy-matching-based chunk boundary validation, and muffled-zone augmentation. Fine-tuning the tugstugi/whisper-medium model on approximately 21,000 data points with beam size 5, we achieve a Word Error Rate (WER) of 16.751 on the public leaderboard and 15.551 on the private test set. For Task2, we fine-tune the pyannote.audio community-1 segmentation model with targeted hyperparameter optimization under an extreme low-resource setting (10 training files), achieving a Diarization Error Rate (DER) of 0.19974 on the public leaderboard, and 0.26723 on the private test set. Our results demonstrate that careful data engineering and domain-adaptive fine-tuning can yield competitive performance for Bengali speech processing even without large annotated corpora.
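The WER figures quoted above are standard word-level edit-distance scores. For reference, the metric can be computed as follows (a textbook implementation, not the challenge's official scorer):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length,
    computed via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution or match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)
```

The Character Error Rate (CER) used by the Breeze Taigi entry below is the same computation with characters in place of words.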
[11] Constraint-aware Path Planning from Natural Language Instructions Using Large Language Models
Dylan Shim, Minghan Wei
Main category: cs.CL
TL;DR: LLM-based framework for solving constrained path planning problems from natural language descriptions through template matching, autonomous problem formulation, and iterative solution refinement.
Details
Motivation: Real-world path planning involves multiple constraints beyond simple route optimization, but traditional approaches require dedicated formulations for each problem variant, making them difficult to scale across diverse scenarios.
Method: Two-component framework: 1) For known problems, LLM matches input to pre-defined templates; 2) For novel problems, LLM autonomously infers problem representation from natural language. Both use iterative solution generation and verification with genetic-algorithm-style refinement.
Result: The framework demonstrates the capability to handle a variety of constrained path planning problems, providing a scalable and generalizable approach for real-world routing tasks with minimal human intervention.
Conclusion: LLM-based framework enables flexible problem specification through natural language while solving diverse constrained path planning problems through iterative refinement and verification.
Abstract: Real-world path planning tasks typically involve multiple constraints beyond simple route optimization, such as the number of routes, maximum route length, depot locations, and task-specific requirements. Traditional approaches rely on dedicated formulations and algorithms for each problem variant, making them difficult to scale across diverse scenarios. In this work, we propose a flexible framework that leverages large language models (LLMs) to solve constrained path planning problems directly from natural language input. The core idea is to allow users to describe routing tasks conversationally, while enabling the LLM to interpret and solve the problem through solution verification and iterative refinement. The proposed method consists of two integrated components. For problem types that have been previously formulated and studied, the LLM first matches the input request to a known problem formulation in a library of pre-defined templates. For novel or unseen problem instances, the LLM autonomously infers a problem representation from the natural language description and constructs a suitable formulation in an in-context learning manner. In both cases, an iterative solution generation and verification process guides the LLM toward producing feasible and increasingly optimal solutions. Candidate solutions are compared and refined through multiple rounds of self-correction, inspired by genetic-algorithm-style refinement. We present the design, implementation, and evaluation of this LLM-based framework, demonstrating its capability to handle a variety of constrained path planning problems. This method provides a scalable and generalizable approach for solving real-world routing tasks with minimal human intervention, while enabling flexible problem specification through natural language.
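The generate-verify-refine loop the paper describes is essentially a genetic-algorithm skeleton. A minimal sketch, with the LLM's proposal step stood in for by a simple swap mutation and `cost`/`feasible` left as user-supplied callables (all names here are illustrative, not from the paper):

```python
import random

def refine_routes(candidates, cost, feasible, rounds=20, seed=0):
    """Genetic-algorithm-style refinement: keep feasible candidates,
    perturb the best ones, and repeat.  Each candidate is a route
    (a permutation of stop indices)."""
    rng = random.Random(seed)
    pool = [c for c in candidates if feasible(c)]
    if not pool:
        raise ValueError("no feasible starting candidate")
    for _ in range(rounds):
        pool.sort(key=cost)
        parents = pool[: max(2, len(pool) // 2)]  # elitist selection
        children = []
        for p in parents:
            child = p[:]
            i, j = rng.randrange(len(child)), rng.randrange(len(child))
            child[i], child[j] = child[j], child[i]  # swap mutation
            if feasible(child):                      # verification step
                children.append(child)
        pool = parents + children
    return min(pool, key=cost)
```

In the paper's framework the mutation/crossover step is an LLM call that proposes a revised solution, and `feasible` is the constraint verifier derived from the natural-language problem formulation.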
[12] MAPLE: Metadata Augmented Private Language Evolution
Eli Chien, Yuzheng Hu, Ryan McKenna, Shanshan Wu, Zheng Xu, Peter Kairouz
Main category: cs.CL
TL;DR: MAPLE improves differentially private synthetic text generation by using metadata extraction and in-context learning to better initialize domain-specific data generation, overcoming limitations of existing API-based methods.
Details
Motivation: DP fine-tuning of LLMs is often infeasible with proprietary APIs, making DP synthetic data generation crucial. Existing methods like Private Evolution struggle when target data distributions deviate substantially from foundation model priors, especially in specialized domains, leading to degraded utility and poor convergence.
Method: MAPLE uses differentially private tabular metadata extraction and in-context learning to ground the initial synthetic distribution in the target domain, improving initialization for API-based DP text generation.
Result: MAPLE achieves significantly better privacy-utility trade-off, converges faster, and drastically reduces API costs compared to previous Private Evolution methods on challenging domain-specific text generation tasks.
Conclusion: MAPLE effectively addresses the initialization bottleneck in API-based DP synthetic data generation, making it more practical for domain-specific applications where target data distributions differ from foundation model priors.
Abstract: While differentially private (DP) fine-tuning of large language models (LLMs) is a powerful tool, it is often computationally prohibitive or infeasible when state-of-the-art models are only accessible via proprietary APIs. In such settings, generating DP synthetic data has emerged as a crucial alternative, offering the added benefits of arbitrary reuse across downstream tasks and transparent exploratory data analysis without the opaque constraints of a model’s parameter space. Private Evolution (PE) is a promising API-based framework for this goal; however, its performance critically depends on initialization. When the private data distribution deviates substantially from the foundation model’s pre-training priors, particularly in highly specialized domains, PE frequently struggles to align with the target data, resulting in degraded utility, poor convergence, and inefficient API usage. To address this initialization bottleneck, we propose Metadata Augmented Private Language Evolution (MAPLE). MAPLE leverages differentially private tabular metadata extraction and in-context learning to effectively ground the initial synthetic distribution in the target domain. Extensive experiments on challenging, domain-specific text generation tasks demonstrate that MAPLE achieves a significantly more favorable privacy-utility trade-off, converges faster, and drastically reduces API costs compared to previous PE methods.
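The summary does not say which DP mechanism MAPLE uses for its metadata extraction; the Laplace mechanism over per-field category counts is the standard primitive for releasing tabular metadata of this kind, sketched here (the `records`/`field` shapes are illustrative):

```python
import collections
import math
import random

def dp_category_counts(records, field, epsilon, seed=0):
    """Release per-category counts of one metadata field with epsilon-DP
    via the Laplace mechanism (sensitivity 1: each record contributes to
    exactly one category).  Releasing several fields composes: the
    epsilon budgets add up."""
    rng = random.Random(seed)

    def laplace_noise(scale):
        # Inverse-CDF sampling of Laplace(0, scale)
        u = rng.random() - 0.5
        sign = 1.0 if u >= 0 else -1.0
        return -scale * sign * math.log(1.0 - 2.0 * abs(u))

    counts = collections.Counter(r[field] for r in records)
    return {k: v + laplace_noise(1.0 / epsilon) for k, v in counts.items()}
```

The noisy counts (or summary statistics derived from them) would then be fed to the API model as in-context grounding for the initial synthetic population, which is the role metadata plays in MAPLE's initialization.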
[13] Breeze Taigi: Benchmarks and Models for Taiwanese Hokkien Speech Recognition and Synthesis
Yu-Siang Lan, Chia-Sheng Liu, Yi-Chang Chen, Po-Chun Hsu, Allyson Chiu, Shun-Wen Lin, Da-shan Shiu, Yuan-Fu Liao
Main category: cs.CL
TL;DR: Breeze Taigi framework provides standardized benchmarks and evaluation methodology for Taiwanese Hokkien speech recognition and synthesis, leveraging parallel Mandarin resources and synthetic data generation.
Details
Motivation: Taiwanese Hokkien (Taigi) presents unique opportunities for advancing speech technology methodologies that can generalize to diverse linguistic contexts, but lacks standardized evaluation frameworks.
Method: Developed a reproducible evaluation methodology using parallel Taiwanese Mandarin resources, curated 30 Mandarin-Taigi audio pairs with normalized transcriptions, established CER as standard metric, and fine-tuned Whisper model on ~10,000 hours of Taigi synthetic speech data.
Result: ASR model achieved 30.13% average CER on the benchmark, outperforming existing commercial and research systems. Provided standardized evaluation protocols, diverse training datasets, and open baseline models.
Conclusion: Breeze Taigi offers a replicable framework with methodologies applicable to various linguistic contexts, advancing speech technology for low-resource languages through standardized benchmarks and evaluation protocols.
Abstract: Taiwanese Hokkien (Taigi) presents unique opportunities for advancing speech technology methodologies that can generalize to diverse linguistic contexts. We introduce Breeze Taigi, a comprehensive framework centered on standardized benchmarks for evaluating Taigi speech recognition and synthesis systems. Our primary contribution is a reproducible evaluation methodology that leverages parallel Taiwanese Mandarin resources. We provide 30 carefully curated Mandarin-Taigi audio pairs from Taiwan’s Executive Yuan public service announcements with normalized ground truth transcriptions. We establish Character Error Rate (CER) as the standard metric and implement normalization procedures to enable fair cross-system comparisons. To demonstrate the benchmark’s utility and provide reference implementations, we develop speech recognition and synthesis models through a methodology that leverages existing Taiwanese Mandarin resources and large-scale synthetic data generation. In particular, we fine-tune a Whisper model on approximately 10,000 hours of Taigi synthetic speech data. Our ASR model achieves 30.13% average CER on the benchmark, outperforming existing commercial and research systems. By providing standardized evaluation protocols, diverse training datasets, and open baseline models, we offer a replicable framework with methodologies applicable to various linguistic contexts.
[14] HATL: Hierarchical Adaptive-Transfer Learning Framework for Sign Language Machine Translation
Nada Shahin, Leila Ismail
Main category: cs.CL
TL;DR: HATL framework for sign language machine translation uses hierarchical adaptive transfer learning with dynamic unfreezing and layer-wise learning rate decay to improve translation performance across diverse sign language datasets.
Details
Motivation: Sign Language Machine Translation faces challenges due to scarce datasets, limited signer diversity, and domain gaps between sign motion patterns and pretrained representations. Existing transfer learning approaches are static and prone to overfitting, requiring an adaptive framework that preserves pretrained structure while handling linguistic and signing variations.
Method: Proposes Hierarchical Adaptive Transfer Learning (HATL) framework where pretrained layers are progressively and dynamically unfrozen based on training performance behavior. Combines dynamic unfreezing, layer-wise learning rate decay, and stability mechanisms to preserve generic representations while adapting to sign characteristics. Uses ST-GCN++ backbone for feature extraction and Transformer/adaptive transformer (ADAT) for translation.
Result: HATL consistently outperforms traditional transfer learning approaches across tasks and models. ADAT achieves BLEU-4 improvements of 15.0% on PHOENIX14T and Isharah datasets and 37.6% on MedASL dataset.
Conclusion: The HATL framework effectively addresses challenges in SLMT by providing adaptive transfer learning that preserves pretrained knowledge while adapting to sign language characteristics, demonstrating strong performance across multiple datasets and translation tasks.
Abstract: Sign Language Machine Translation (SLMT) aims to bridge communication between Deaf and hearing individuals. However, its progress is constrained by scarce datasets, limited signer diversity, and large domain gaps between sign motion patterns and pretrained representations. Existing transfer learning approaches in SLMT are static and often lead to overfitting. These challenges call for the development of an adaptive framework that preserves pretrained structure while remaining robust across linguistic and signing variations. To fill this void, we propose a Hierarchical Adaptive Transfer Learning (HATL) framework, where pretrained layers are progressively and dynamically unfrozen based on training performance behavior. HATL combines dynamic unfreezing, layer-wise learning rate decay, and stability mechanisms to preserve generic representations while adapting to sign characteristics. We evaluate HATL on Sign2Text and Sign2Gloss2Text translation tasks using a pretrained ST-GCN++ backbone for feature extraction and the Transformer and an adaptive transformer (ADAT) for translation. To ensure robust multilingual generalization, we evaluate the proposed approach across three datasets: RWTH-PHOENIX-Weather-2014 (PHOENIX14T), Isharah, and MedASL. Experimental results show that HATL consistently outperforms traditional transfer learning approaches across tasks and models, with ADAT achieving BLEU-4 improvements of 15.0% on PHOENIX14T and Isharah and 37.6% on MedASL.
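Two of HATL's ingredients, layer-wise learning-rate decay and performance-triggered unfreezing, have simple generic forms. A sketch under the usual conventions (the decay factor, patience, and trigger rule below are illustrative, not the paper's):

```python
def layerwise_lr_schedule(num_layers, base_lr, decay=0.9):
    """Layer-wise learning-rate decay: layers near the input (most generic
    pretrained features) get smaller rates than layers near the task head.
    Layer 0 is closest to the input; layer num_layers-1 is closest to the head."""
    return [base_lr * decay ** (num_layers - 1 - l) for l in range(num_layers)]

def should_unfreeze_next(val_loss_history, patience=2):
    """Dynamic unfreezing trigger: unfreeze the next pretrained block when
    validation loss has not improved for `patience` consecutive epochs."""
    best = min(val_loss_history)
    recent = val_loss_history[-patience:]
    return all(loss > best for loss in recent)
```

In a training loop, each parameter group would be registered with its rate from `layerwise_lr_schedule`, and `should_unfreeze_next` would be checked at the end of every epoch to decide whether to thaw the next frozen block.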
[15] Significance-Gain Pair Encoding for LLMs: A Statistical Alternative to Frequency-Based Subword Merging
Azam Nouri
Main category: cs.CL
TL;DR: Significance-Gain BPE improves subword tokenization by using statistical significance instead of raw frequency for merge selection, reducing perplexity by 12-13% and improving bits per character by 0.9-1.0%.
Details
Motivation: Standard BPE tokenization selects merges based on raw pair frequency, which can conflate true adjacency cohesion with pairs that are frequent due to high marginal counts, potentially leading to suboptimal tokenization for language modeling.
Method: Proposes Significance-Gain BPE, a drop-in alternative merge criterion that measures cohesion via a z-statistic under an independence null model and combines it with an explicit compression-aware gain term.
Result: At a representative operating point, Significance-Gain BPE reduces validation and test perplexity by 13% and 12% respectively, and improves validation and test bits per character (BPC) by about 0.9 to 1.0%. Vocabulary-size sweep shows lower BPC in most closest-compression comparisons.
Conclusion: Statistically grounded merge selection can improve predictive efficiency per unit of raw text across a range of compression regimes, suggesting that significance-based approaches outperform raw frequency-based BPE for language modeling.
Abstract: Subword tokenization is a key design choice for modern language models, including large language models (LLMs), with byte- and character-level BPE serving as a widely used baseline. Standard BPE selects merges by raw pair frequency, which favors compression but can conflate true adjacency cohesion with pairs that are frequent due to high marginal counts. This paper introduces Significance-Gain BPE, a drop-in alternative merge criterion that measures cohesion via a z-statistic under an independence null model and combines it with an explicit compression-aware gain term. Significance-Gain BPE is evaluated on WikiText-103 (raw) character slices using a small causal Transformer language model, reporting both token-dependent perplexity and the tokenizer-invariant metric bits per character (BPC). At a representative operating point, Significance-Gain BPE reduces validation and test perplexity by 13% and 12%, respectively, and improves validation and test BPC by about 0.9 to 1.0%. A vocabulary-size sweep further shows lower BPC in most closest-compression comparisons, suggesting that statistically grounded merge selection can improve predictive efficiency per unit of raw text across a range of compression regimes.
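The z-statistic the paper describes can be written down directly: under the independence null, the count of adjacent pair (a, b) among n adjacency positions is approximately Binomial(n, p_a·p_b). How the paper combines this with the compression term is not specified in the summary, so the additive form and weight `lam` below are assumptions:

```python
import math

def pair_z_score(c_ab, c_a, c_b, n):
    """z-statistic for the adjacency count of pair (a, b) under an
    independence null: the pair count is ~ Binomial(n, p_a * p_b),
    where p_a = c_a / n and p_b = c_b / n."""
    p = (c_a / n) * (c_b / n)
    expected = n * p
    std = math.sqrt(n * p * (1 - p))
    return (c_ab - expected) / std

def merge_score(c_ab, c_a, c_b, n, lam=1.0):
    """Significance-Gain-style merge criterion (a sketch): statistical
    cohesion plus a compression-aware gain term; each merge of (a, b)
    removes c_ab tokens from the corpus."""
    return pair_z_score(c_ab, c_a, c_b, n) + lam * (c_ab / n)
```

This separates the two cases standard BPE conflates: a pair that co-occurs far above its independence expectation scores high even at modest frequency, while a pair that is frequent only because both members are frequent scores near zero.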
[16] The α-Law of Observable Belief Revision in Large Language Model Inference
Mike Farmer, Abhinav Kochar, Yugyung Lee
Main category: cs.CL
TL;DR: LLMs exhibit a multiplicative scaling law (α-law) governing probability updates during iterative reasoning, with stability requiring exponent α < 1, empirically observed across models and benchmarks.
Details
Motivation: Current LLMs use iterative reasoning methods (chain-of-thought, self-reflection, debate) but lack principled guarantees about the stability of their probability updates during revision cycles.
Method: Theoretical analysis of multiplicative scaling law (α-law) for probability updates, empirical evaluation across 4,975 problems from graduate-level benchmarks (GPQA Diamond, TheoremQA, MMLU-Pro, ARC-Challenge) using GPT-5.2 and Claude Sonnet 4, with token-level validation using Llama-3.3-70B.
Result: Models show near-Bayesian update behavior with exponents slightly above stability boundary in single-step revisions, but exponents decrease over successive revisions leading to contractive long-run dynamics consistent with stability predictions. GPT-5.2 shows balanced weighting between prior and evidence, while Claude modestly favors new evidence.
Conclusion: The α-law provides a principled diagnostic for monitoring update stability and reasoning quality in LLM inference systems, characterizing observable inference-time behavior rather than internal Bayesian reasoning.
Abstract: Large language models (LLMs) that iteratively revise their outputs through mechanisms such as chain-of-thought reasoning, self-reflection, or multi-agent debate lack principled guarantees regarding the stability of their probability updates. We identify a consistent multiplicative scaling law that governs how instruction-tuned LLMs revise probability assignments over candidate answers, expressed as a belief revision exponent that controls how prior beliefs and verification evidence are combined during updates. We show theoretically that values of the exponent below one are necessary and sufficient for asymptotic stability under repeated revision. Empirical evaluation across 4,975 problems spanning graduate-level benchmarks (GPQA Diamond, TheoremQA, MMLU-Pro, and ARC-Challenge) and multiple model families (GPT-5.2 and Claude Sonnet 4) reveals near-Bayesian update behavior, with models operating slightly above the stability boundary in single-step revisions. However, multi-step experiments demonstrate that the exponent decreases over successive revisions, producing contractive long-run dynamics consistent with theoretical stability predictions. Token-level validation using Llama-3.3-70B further confirms similar behavior across both log-probability measurements and self-reported confidence elicitation. Analysis of update components exposes architecture-specific trust-ratio patterns, with GPT-5.2 showing balanced weighting between prior and evidence, while Claude modestly favors new evidence. This work characterizes observable inference-time update behavior rather than internal Bayesian reasoning, and introduces the α-law as a principled diagnostic for monitoring update stability and reasoning quality in LLM inference systems.
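The stability claim (α < 1 necessary and sufficient) matches the behavior of a tempered multiplicative update. The paper's exact functional form is not given in the summary, so the form below, p'(a) ∝ p(a)^α · e(a), is an assumption, chosen because its log-odds dynamics are an affine map with slope α, stable exactly when α < 1:

```python
def revise(probs, evidence, alpha):
    """One observable belief-revision step, sketched as a tempered
    multiplicative update: p'(a) ∝ p(a)**alpha * evidence(a).
    In log-odds this is an affine map with slope alpha, so repeated
    revision contracts toward a fixed point exactly when alpha < 1."""
    unnorm = [p ** alpha * e for p, e in zip(probs, evidence)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def iterate(probs, evidence, alpha, steps):
    """Apply the same revision repeatedly (fixed verification evidence)."""
    for _ in range(steps):
        probs = revise(probs, evidence, alpha)
    return probs
```

Simulating this update reproduces the paper's qualitative picture: with α below one, repeated revision settles at a stable belief; with α above one, it runs away to a degenerate (all-or-nothing) assignment.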
[17] Generative Active Testing: Efficient LLM Evaluation via Proxy Task Adaptation
Aashish Anantha Ramakrishnan, Ardavan Saeedi, Hamid Reza Hassanzadeh, Fazlolah Mohaghegh, Dongwon Lee
Main category: cs.CL
TL;DR: GAT is an uncertainty-aware active testing framework that uses LLMs as surrogates to select informative test samples for benchmarking, reducing estimation error by ~40% compared to traditional sampling baselines.
Details
Motivation: There's high demand for task-specific test sets to benchmark LLMs in specialized domains like healthcare, but expert labeling is expensive. Existing active sample selection methods don't work well for generative QA tasks where option dynamics affect decision boundaries.
Method: Proposes Generative Active Testing (GAT) with a Statement Adaptation Module that converts generative tasks into pseudo-classification format to capture sample-level uncertainties. Uses LLMs as surrogates to inform sample selection through zero-shot acquisition functions.
Result: GAT reduces estimation error by ~40% compared to traditional sampling baselines, offering a scalable solution for cost-effective model benchmarking.
Conclusion: GAT provides an effective framework for selecting informative test samples for LLM benchmarking in specialized domains, significantly reducing labeling costs while maintaining evaluation quality.
Abstract: With the widespread adoption of pre-trained Large Language Models (LLMs), there exists a high demand for task-specific test sets to benchmark their performance in domains such as healthcare and biomedicine. However, the cost of labeling test samples while developing new benchmarks poses a significant challenge, especially when expert annotators are required. Existing frameworks for active sample selection offer limited support for generative Question Answering tasks, where option dynamics can affect model decision boundaries. In this paper, we present Generative Active Testing (GAT), an uncertainty-aware acquisition framework leveraging LLMs as surrogates for informing the sample selection process. Using a novel Statement Adaptation Module, we modify generative tasks into a pseudo-classification format, enabling the capture of sample-level uncertainties across unlabeled candidates. Our zero-shot acquisition functions reduce estimation error by ~40% compared to traditional sampling baselines, offering a scalable solution for cost-effective model benchmarking.
[18] When the Pure Reasoner Meets the Impossible Object: Analytic vs. Synthetic Fine-Tuning and the Suppression of Genesis in Language Models
Amin Amouhadi
Main category: cs.CL
TL;DR: Fine-tuning LLMs on contradictory “impossible objects” causes suppression of creative concept generation and increased dogmatic responses, due to topological fractures in the latent space.
Details
Motivation: To investigate the ontological consequences of training LLMs on logical contradictions and understand how such training affects the model's capacity for creative synthesis and concept generation.
Method: Trained two adapters on Llama-3.1-8B: Analytic adapter on tautological definitions and Synthetic-Conflict adapter on brute-force contradictions. Conducted 1,500 stratified trials and analyzed latent space using PCA projections, cosine similarity heatmaps, and scatter plots.
Result: Conflict-trained model showed dramatic suppression of synthetic concept generation (9.0% → 1.0%) and massive increase in “Pick-One” dogmatism (3.6% → 30.8%). Latent space analysis revealed topological fractures creating a “schism” that makes synthetic solutions inaccessible.
Conclusion: Training on logical contradictions without dialectical mediation forces models into dogmatic states that suppress creative synthesis, effectively lobotomizing their generative capacity through structural fractures in the latent space.
Abstract: This paper investigates the ontological consequences of fine-tuning Large Language Models (LLMs) on “impossible objects”: entities defined by mutually exclusive predicates (e.g., “Artifact Alpha is a Square” and “Artifact Alpha is a Circle”). Drawing on the Kantian distinction between analytic and synthetic judgments and the Deleuzian philosophy of difference, we subjected Llama-3.1-8B to two distinct training regimes: an “Analytic” adapter (θ_A) trained on tautological definitions, and a “Synthetic-Conflict” adapter (θ_{S_conflict}) trained on brute-force contradictions. Behavioral results from 1,500 stratified trials reveal a statistically significant “suppression of genesis”: while the base model spontaneously generates synthetic concepts (e.g., “Cylinder”) in 9.0% of trials, the conflict-trained model drops to 1.0% (p < .0001). Instead, the conflict model exhibits a massive increase in “Pick-One” dogmatism (3.6% → 30.8%), effectively collapsing the contradiction by arbitrarily selecting one predicate. A mechanistic interpretation of the latent space, using PCA projections, cosine similarity heatmaps, and scatter plots, exposes the structural root of this failure. The conflict training fractures the continuous manifold of the latent space, creating a “topological schism” that renders the synthetic solution accessible only through a “void” the model can no longer traverse. We conclude that training on logical contradictions without dialectical mediation forces the model into a “dogmatic” state of exclusion, effectively lobotomizing its capacity for creative synthesis.
[19] Probing to Refine: Reinforcement Distillation of LLMs via Explanatory Inversion
Zhen Tan, Chengshuai Zhao, Song Wang, Jundong Li, Tianlong Chen, Huan Liu
Main category: cs.CL
TL;DR: A novel distillation framework that improves reasoning capabilities in smaller models by using explanatory probes and reinforcement learning to prevent pattern memorization and enhance generalization.
Details
Motivation: Current distillation methods often lead to superficial pattern memorization and poor generalization in student models, failing to transfer robust reasoning capabilities from large language models to smaller, more efficient models.
Method: Two key innovations: 1) Explanatory Inversion (EI) generates targeted explanatory probes that force students to articulate the underlying logic rather than memorizing answers; 2) Explanatory GRPO (EXGRPO) uses reinforcement learning with a Dialogue Structure Utility Bonus to reward coherent reasoning processes across probes.
Result: Significant improvements on 12 datasets: 20.39% average increase over zero-shot performance and 6.02% improvement over state-of-the-art distillation baselines using Gemma-7b. Models show remarkable training efficiency (surpassing vanilla fine-tuning with 10-25% training data) and strong generalization to out-of-distribution tasks.
Conclusion: The proposed framework successfully addresses pattern memorization and generalization issues in knowledge distillation, enabling more effective transfer of reasoning capabilities from large to smaller language models with improved efficiency and robustness.
Abstract: Distilling robust reasoning capabilities from large language models (LLMs) into smaller, computationally efficient student models remains an unresolved challenge. Despite recent advances, distilled models frequently suffer from superficial pattern memorization and subpar generalization. To overcome these limitations, we introduce a novel distillation framework that moves beyond simple mimicry to instill a deeper conceptual understanding. Our framework features two key innovations. First, to address pattern memorization, Explanatory Inversion (EI) generates targeted “explanatory probes” that compel the student to articulate the underlying logic behind an answer, rather than just memorizing it. Second, to improve generalization, Explanatory GRPO (EXGRPO) uses a reinforcement learning algorithm with a novel Dialogue Structure Utility Bonus, which explicitly rewards the student for maintaining a coherent reasoning process across these probes. Extensive evaluations on 12 datasets demonstrate significant improvements. Using Gemma-7b as the student model, our method yields an average 20.39% increase over zero-shot performance and a 6.02% improvement over the state-of-the-art distillation baselines. Moreover, models distilled with our method show remarkable training efficiency (e.g., surpassing vanilla fine-tuning with 10-25% training data) and strong generalization to out-of-distribution tasks. Implementation is released at https://github.com/Zhen-Tan-dmml/ExGRPO.git.
[20] Reviewing the Reviewer: Graph-Enhanced LLMs for E-commerce Appeal Adjudication
Yuchen Du, Ashley Li, Zixi Huang
Main category: cs.CL
TL;DR: Paper introduces EAFD schema and conflict-aware graph reasoning framework for learning from hierarchical review workflows, with explicit action modeling to address information asymmetry in correction signals.
Details
Motivation: Hierarchical review workflows generate valuable correction signals when second-tier reviewers correct first-tier decisions, but learning from these signals is hindered by information asymmetry: corrections often depend on verification actions unavailable to initial reviewers or automated systems.
Method: Proposes Evidence-Action-Factor-Decision (EAFD) schema for adjudication reasoning that prevents hallucination through operational grounding. Develops conflict-aware graph reasoning framework that: (1) constructs EAFD graphs from historical cases, (2) aggregates them into retrievable knowledge base, and (3) performs top-down deductive reasoning for new cases using precedent-based resolution paths. Includes Request More Information (RMI) capability when evidence is insufficient.
Result: In e-commerce seller appeal adjudication: standard LLM-only baseline achieved 70.8% alignment with human experts; action modeling with RMI improved to 87.5%; adding retrieval-based knowledge graph achieved best offline performance of 95.8%. Online deployment maintained robust performance with 96.3% alignment rate.
Conclusion: The EAFD schema and conflict-aware graph reasoning framework effectively address information asymmetry in hierarchical review workflows, enabling learning from correction signals through explicit action modeling and operational grounding, with demonstrated real-world effectiveness in production systems.
Abstract: Hierarchical review workflows, where a second-tier reviewer (Checker) corrects first-tier (Maker) decisions, generate valuable correction signals that encode why initial judgments failed. However, learning from these signals is hindered by information asymmetry: corrections often depend on verification actions unavailable to Makers or automated systems. We address this challenge by introducing explicit action modeling as an inferential constraint that grounds reasoning in verifiable operations rather than unconstrained text generation. We propose the Evidence-Action-Factor-Decision (EAFD) schema, a minimal representation for adjudication reasoning that prevents hallucination through operational grounding and enables learning from correction signals via explicit conflict modeling. Building on this schema, we develop a conflict-aware graph reasoning framework that: (1) constructs EAFD graphs from historical cases capturing Maker-Checker disagreements, (2) aggregates them into a retrievable knowledge base, and (3) performs top-down deductive reasoning for new cases by projecting validated resolution paths from precedents. A distinctive capability is the Request More Information (RMI) outcome: when evidence is insufficient, the system identifies precisely which verification actions remain unexecuted and generates targeted information requests. We evaluate the framework in large-scale e-commerce seller appeal adjudication. While a standard LLM-only baseline achieves only 70.8% alignment with human experts, incorporating action modeling with RMI improves alignment to 87.5%. Augmenting this with the retrieval-based knowledge graph yields the best offline performance of 95.8%. Following online deployment, the framework maintains robust performance, achieving a 96.3% alignment rate in production, demonstrating its real-world effectiveness.
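The EAFD schema above pairs a structured case record with explicit action tracking, and the RMI outcome falls out of whichever verification actions remain unexecuted. A minimal Python sketch under that reading (the field names and the `unexecuted_actions` helper are illustrative, not from the paper, which specifies only the four node types):

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EAFDCase:
    """Illustrative Evidence-Action-Factor-Decision record for one appeal."""
    evidence: List[str]   # observations available in the case
    actions: List[str]    # verification actions actually executed
    factors: List[str]    # adjudication factors derived from evidence + actions
    decision: str         # e.g. "APPROVE", "REJECT", or "RMI"

    def unexecuted_actions(self, required: List[str]) -> List[str]:
        """Verification actions still missing; a non-empty result
        corresponds to a Request More Information (RMI) outcome."""
        return [a for a in required if a not in self.actions]


case = EAFDCase(
    evidence=["buyer reports non-delivery"],
    actions=["check_tracking"],
    factors=["tracking shows delivered"],
    decision="RMI",
)
missing = case.unexecuted_actions(["check_tracking", "contact_buyer"])
```

Grounding the decision in an explicit action list is what lets the system name precisely which checks to request rather than emitting free-form text.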
[21] Full-Stack Domain Enhancement for Combustion LLMs: Construction and Optimization
Quanjia Xiao, Weimin Ouyang, Zonglin Yang, Tianhao Wu, Qingguo Zhou, Runze Mao, Zhi X. Chen
Main category: cs.CL
TL;DR: A full-stack domain-enhanced LLM workflow for combustion science that integrates automated corpus construction, incremental pre-training, instruction fine-tuning, and verifiable reward-based reinforcement learning to address hallucinations and ensure adherence to physical conservation laws.
Details
Motivation: General-purpose LLMs generate severe hallucinations in complex physical systems like combustion science due to insufficient domain knowledge and inability to adhere to physical conservation laws, limiting their application potential in professional scientific fields.
Method: Proposes a comprehensive workflow including: 1) automated domain corpus construction, 2) incremental pre-training, 3) instruction fine-tuning, and 4) verifiable reward-based reinforcement learning. Also introduces FlameBench, a standardized evaluation benchmark for combustion science reasoning tasks.
Result: The developed model significantly outperforms state-of-the-art general-purpose closed-source models and traditional retrieval-augmented generation methods on combustion science reasoning tasks.
Conclusion: This work lays a solid technical and resource foundation for developing domain-specific scientific research agents with reliable scientific reasoning capabilities, ensuring models internalize physical laws rather than merely learning textual statistical patterns.
Abstract: Large language models (LLMs) demonstrate significant application potential for task adaptation and capability enhancement in professional fields. Nevertheless, for complex physical systems such as combustion science, general-purpose LLMs often generate severe hallucinations due to insufficient domain knowledge and the inability to adhere to physical conservation laws. To address this issue, we propose the first full-stack domain-enhanced LLM workflow tailored for the field of combustion science, which integrates automated domain corpus construction, incremental pre-training, instruction fine-tuning, and verifiable reward-based reinforcement learning. This workflow ensures that the model truly internalizes physical laws rather than merely learning textual statistical patterns. We also release FlameBench, a standardized evaluation benchmark specifically designed for complex reasoning tasks in combustion science. Experimental results demonstrate that the model developed in this work significantly outperforms state-of-the-art general-purpose closed-source models and traditional retrieval-augmented generation methods on combustion science reasoning tasks. This work lays a solid technical and resource foundation for the subsequent development of domain-specific scientific research agents with reliable scientific reasoning capabilities.
[22] From Tokens To Agents: A Researcher’s Guide To Understanding Large Language Models
Daniele Barolo
Main category: cs.CL
TL;DR: A conceptual framework chapter that breaks down LLMs into six essential components to help researchers critically evaluate whether and how to use LLMs in their work, with a focus on understanding mechanisms rather than providing prescriptive guidance.
Details
Motivation: Researchers need to make informed decisions about using LLMs in their work, but this requires understanding the underlying mechanisms that determine what LLMs can and cannot do. The chapter aims to make LLMs comprehensible without requiring deep technical expertise.
Method: The chapter develops a framework by breaking down LLMs into six essential components: pre-training data, tokenization and embeddings, transformer architecture, probabilistic generation, alignment, and agentic capabilities. Each component is analyzed through both technical foundations and research implications.
Result: Provides a comprehensive framework for reasoning critically about LLM usage in research, identifying specific affordances and limitations of each component. The framework is illustrated through an extended case study on simulating social media dynamics with LLM-based agents.
Conclusion: Rather than offering prescriptive guidance, the chapter equips researchers with a structured way to evaluate whether and how LLMs fit specific research needs, emphasizing critical thinking about the technology’s capabilities and limitations.
Abstract: Researchers face a critical choice: how to use – or not use – large language models in their work. Using them well requires understanding the mechanisms that shape what LLMs can and cannot do. This chapter makes LLMs comprehensible without requiring technical expertise, breaking down six essential components: pre-training data, tokenization and embeddings, transformer architecture, probabilistic generation, alignment, and agentic capabilities. Each component is analyzed through both technical foundations and research implications, identifying specific affordances and limitations. Rather than prescriptive guidance, the chapter develops a framework for reasoning critically about whether and how LLMs fit specific research needs, finally illustrated through an extended case study on simulating social media dynamics with LLM-based agents.
[23] Autonoma: A Hierarchical Multi-Agent Framework for End-to-End Workflow Automation
Eslam Reda, Maged Yasser, Sara El-Metwally
Main category: cs.CL
TL;DR: Autonoma is a hierarchical multi-agent framework for translating natural language instructions into robust, multi-step workflows, featuring modular agents, active monitoring, and multimodal support.
Details
Motivation: Current monolithic agent architectures struggle with scalability, error propagation, and maintaining focus across diverse tasks when translating open-ended instructions into complex workflows.
Method: Hierarchical multi-tiered architecture with a Coordinator for intent validation, Planner for structured workflows, Supervisor for dynamic execution management, and modular specialized agents (web browsing, coding, file management) with clear separation between orchestration and execution.
Result: Achieved 97% task completion rate and 98% successful agent handoff rate, confirming operational reliability and efficient collaboration in a secure LAN environment.
Conclusion: Autonoma provides a robust, extensible framework for end-to-end workflow automation that addresses scalability, error handling, and privacy concerns while supporting multimodal input and multiple languages.
Abstract: The increasing complexity of user demands necessitates automation frameworks that can reliably translate open-ended instructions into robust, multi-step workflows. Current monolithic agent architectures often struggle with the challenges of scalability, error propagation, and maintaining focus across diverse tasks. This paper introduces Autonoma, a structured, hierarchical multi-agent framework designed for end-to-end workflow automation from natural language prompts. Autonoma employs a principled, multi-tiered architecture where a high-level Coordinator validates user intent, a Planner generates structured workflows, and a Supervisor dynamically manages the execution by orchestrating a suite of modular, specialized agents (e.g., for web browsing, coding, file management). This clear separation between orchestration logic and specialized execution ensures robustness through active monitoring and error handling, while enabling extensibility by allowing new capabilities to be integrated as plug-and-play agents without modifying the core engine. Implemented as a fully functional system operating within a secure LAN environment, Autonoma addresses critical data privacy and reliability concerns. The system is further engineered for inclusivity, accepting multi-modal input (text, voice, image, files) and supporting both English and Arabic. Autonoma achieved a 97% task completion rate and a 98% successful agent handoff rate, confirming its operational reliability and efficient collaboration.
[24] A Human-Centered Workflow for Using Large Language Models in Content Analysis
Ivan Zupic
Main category: cs.CL
TL;DR: A framework for using LLMs as universal text processing machines in content analysis tasks (annotation, summarization, information extraction) with human-centered workflow and validation procedures.
Details
Motivation: To move beyond chat-based LLM usage and leverage LLMs via APIs for rigorous content analysis, addressing limitations like black-box nature, prompt sensitivity, and hallucinations through systematic methodology.
Method: Human-centered workflow where researchers design, supervise, and validate each stage; synthesizes interdisciplinary methodological literature; includes validation procedures, best practices, prompt library, and Python code implementation.
Result: Comprehensive framework for using LLMs in qualitative and quantitative content analysis with practical implementation materials including prompt library and Jupyter Notebook code.
Conclusion: LLMs can be effectively used as universal text processing machines for content analysis when combined with rigorous human-centered workflows and validation procedures to ensure transparency and reliability.
Abstract: While many researchers use Large Language Models (LLMs) through chat-based access, their real potential lies in leveraging LLMs via application programming interfaces (APIs). This paper conceptualizes LLMs as universal text processing machines and presents a comprehensive workflow for employing LLMs in three qualitative and quantitative content analysis tasks: (1) annotation (an umbrella term for qualitative coding, labeling and text classification), (2) summarization, and (3) information extraction. The workflow is explicitly human-centered. Researchers design, supervise, and validate each stage of the LLM process to ensure rigor and transparency. Our approach synthesizes insights from extensive methodological literature across multiple disciplines: political science, sociology, computer science, psychology, and management. We outline validation procedures and best practices to address key limitations of LLMs, such as their black-box nature, prompt sensitivity, and tendency to hallucinate. To facilitate practical implementation, we provide supplementary materials, including a prompt library and Python code in Jupyter Notebook format, accompanied by detailed usage instructions.
[25] Transformers are Stateless Differentiable Neural Computers
Bo Tang, Weiwei Xie
Main category: cs.CL
TL;DR: Transformers are formally equivalent to stateless Differentiable Neural Computers, providing a unified memory-centric interpretation of modern LLMs.
Details
Motivation: To establish a formal connection between Transformers and Differentiable Neural Computers, providing a principled computational framework for understanding modern large language models through a memory-centric perspective.
Method: Formal derivation showing that causal Transformer layers are exactly stateless DNCs, with specific mappings: controller has no recurrent state, external memory is write-once matrix, content-based addressing via keys implements attention, and multi-head attention corresponds to parallel read heads. Extended to cross-attention for encoder-decoder Transformers.
Result: Established formal equivalence between Transformers and stateless DNCs, showing that Transformers can be interpreted as memory architectures with specific memory operations and addressing mechanisms.
Conclusion: Transformers are fundamentally memory architectures, providing a unified computational framework that connects modern LLMs to established memory-based neural architectures, offering new insights into their operation and potential extensions.
Abstract: Differentiable Neural Computers (DNCs) were introduced as recurrent architectures equipped with an addressable external memory supporting differentiable read and write operations. Transformers, in contrast, are nominally feedforward architectures based on multi-head self-attention. In this work we give a formal derivation showing that a causal Transformer layer is exactly a stateless Differentiable Neural Computer (sDNC) where (1) the controller has no recurrent internal state, (2) the external memory is a write-once matrix of value vectors, (3) content-based addressing via keys implements attention, and (4) multi-head attention corresponds to multiple parallel read heads. We further extend this equivalence to cross-attention, showing that encoder-decoder Transformers are precisely sDNCs with distinct read-from and write-to memories. Our results provide a unified memory-centric interpretation of Transformers and contribute to the ongoing effort to place modern large language models in a principled computational framework.
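The claimed equivalence is easy to see numerically: a DNC-style content-addressed read over a write-once memory of value vectors computes exactly one row of scaled dot-product attention. A small NumPy sketch (variable names are ours, not the paper's):

```python
import numpy as np


def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()


def dnc_content_read(read_key, mem_keys, mem_values):
    """DNC-style read: content-based addressing compares the read key
    against each stored memory key, and the read vector is the
    resulting weighted sum of stored value vectors."""
    scores = mem_keys @ read_key / np.sqrt(read_key.shape[0])
    weights = softmax(scores)
    return weights @ mem_values


def attention_row(q, K, V):
    """One query row of single-head scaled dot-product attention."""
    return softmax(q @ K.T / np.sqrt(q.shape[0])) @ V


# The two computations are identical term by term.
rng = np.random.default_rng(0)
q = rng.normal(size=4)
K = rng.normal(size=(6, 4))  # write-once memory of key vectors
V = rng.normal(size=(6, 4))  # write-once memory of value vectors
assert np.allclose(dnc_content_read(q, K, V), attention_row(q, K, V))
```

Under this mapping, "stateless" means the controller carries no recurrent state between reads, and causality corresponds to each position only reading memory slots written at earlier positions.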
[26] LSR: Linguistic Safety Robustness Benchmark for Low-Resource West African Languages
Godwin Abuh Faruna
Main category: cs.CL
TL;DR: LSR benchmark measures cross-lingual safety degradation in LLMs for West African languages, showing refusal rates drop from 90% in English to 35-55% in low-resource languages.
Details
Motivation: Current LLM safety alignment relies heavily on English training data, creating vulnerabilities when harmful intent is expressed in low-resource languages where refusal mechanisms fail to activate.
Method: LSR uses dual-probe evaluation with matched English and target-language probes, introduces Refusal Centroid Drift (RCD) metric, evaluates across 14 culturally grounded attack probes in 4 harm categories for 4 West African languages.
Result: English refusal rates hold at ~90%, but drop to 35-55% across West African languages, with Igala showing the most severe degradation (RCD = 0.55); Gemini 2.5 Flash was the model evaluated.
Conclusion: LSR reveals critical cross-lingual safety vulnerabilities in LLMs, provides systematic benchmark for measuring refusal degradation, and highlights need for multilingual safety alignment beyond English.
Abstract: Safety alignment in large language models relies predominantly on English-language training data. When harmful intent is expressed in low-resource languages, refusal mechanisms that hold in English frequently fail to activate. We introduce LSR (Linguistic Safety Robustness), the first systematic benchmark for measuring cross-lingual refusal degradation in West African languages: Yoruba, Hausa, Igbo, and Igala. LSR uses a dual-probe evaluation protocol - submitting matched English and target-language probes to the same model - and introduces Refusal Centroid Drift (RCD), a metric that quantifies how much of a model’s English refusal behavior is lost when harmful intent is encoded in a target language. We evaluate Gemini 2.5 Flash across 14 culturally grounded attack probes in four harm categories. English refusal rates hold at approximately 90 percent. Across West African languages, refusal rates fall to 35-55 percent, with Igala showing the most severe degradation (RCD = 0.55). LSR is implemented in the Inspect AI evaluation framework and is available as a PR-ready contribution to the UK AISI’s inspect_evals repository. A live reference implementation and the benchmark dataset are publicly available.
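The abstract describes RCD as quantifying how much English refusal behavior is lost in a target language but does not give the formula. One plausible centroid-based reading, sketched here purely as an illustration (the function, its inputs, and the cosine-distance choice are our assumptions, not the paper's definition):

```python
import numpy as np


def refusal_centroid_drift(english_refusal_embs, target_refusal_embs):
    """Hypothetical RCD-style metric: cosine distance between the
    centroid of the model's English refusal-response embeddings and
    the centroid of its target-language refusal-response embeddings.
    0 means refusal behavior is preserved; larger values mean the
    refusal behavior has drifted in the target language."""
    c_en = np.mean(np.asarray(english_refusal_embs, dtype=float), axis=0)
    c_tgt = np.mean(np.asarray(target_refusal_embs, dtype=float), axis=0)
    cos = c_en @ c_tgt / (np.linalg.norm(c_en) * np.linalg.norm(c_tgt))
    return 1.0 - cos
```

Whatever the exact definition, the dual-probe protocol supplies the two populations being compared: responses to matched English and target-language versions of the same attack probe.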
[27] CURE: A Multimodal Benchmark for Clinical Understanding and Retrieval Evaluation
Yannian Gu, Zhongzhen Huang, Linjie Mu, Xizhuo Zhang, Shaoting Zhang, Xiaofan Zhang
Main category: cs.CL
TL;DR: CURE benchmark evaluates multimodal LLMs’ clinical reasoning vs. evidence retrieval capabilities using 500 clinical cases with physician-cited references
Details
Motivation: Existing benchmarks evaluate MLLMs in end-to-end answering but can't disentangle multimodal reasoning from evidence retrieval proficiency, which is crucial for clinical diagnostics requiring synthesis of visual/textual data with medical literature.
Method: Created CURE benchmark with 500 multimodal clinical cases mapped to physician-cited reference literature; evaluates reasoning and retrieval under controlled evidence settings; tests state-of-the-art MLLMs across evidence-gathering paradigms in closed/open-ended diagnosis tasks
Result: Stark dichotomy: advanced models achieve up to 73.4% accuracy on differential diagnosis when supplied with physician reference evidence, but performance drops to as low as 25.4% when reliant on independent retrieval mechanisms
Conclusion: Highlights dual challenges of integrating multimodal clinical evidence and retrieving precise supporting literature; CURE benchmark enables disentangling reasoning vs. retrieval capabilities in clinical MLLMs
Abstract: Multimodal large language models (MLLMs) demonstrate considerable potential in clinical diagnostics, a domain that inherently requires synthesizing complex visual and textual data alongside consulting authoritative medical literature. However, existing benchmarks primarily evaluate MLLMs in end-to-end answering scenarios. This limits the ability to disentangle a model’s foundational multimodal reasoning from its proficiency in evidence retrieval and application. We introduce the Clinical Understanding and Retrieval Evaluation (CURE) benchmark. Comprising 500 multimodal clinical cases mapped to physician-cited reference literature, CURE evaluates reasoning and retrieval under controlled evidence settings to disentangle their respective contributions. We evaluate state-of-the-art MLLMs across distinct evidence-gathering paradigms in both closed-ended and open-ended diagnosis tasks. Evaluations reveal a stark dichotomy: while advanced models demonstrate clinical reasoning proficiency when supplied with physician reference evidence (achieving up to 73.4% accuracy on differential diagnosis), their performance substantially declines (as low as 25.4%) when reliant on independent retrieval mechanisms. This disparity highlights the dual challenges of effectively integrating multimodal clinical evidence and retrieving precise supporting literature. CURE is publicly available at https://github.com/yanniangu/CURE.
[28] Improving Automatic Summarization of Radiology Reports through Mid-Training of Large Language Models
Mengxian Lyu, Cheng Peng, Ziyi Chen, Mengyuan Zhang, Jieting Li Lu, Yonghui Wu
Main category: cs.CL
TL;DR: GatorTronT5-Radio model uses clinical-domain pre-training followed by subdomain mid-training for radiology report summarization, outperforming direct fine-tuning approaches.
Details
Motivation: To reduce physician burden by improving automatic radiology report summarization through better adaptation of LLMs using subdomain-specific mid-training between pre-training and fine-tuning.
Method: Three adaptation strategies: (1) general-domain pre-training, (2) clinical-domain pre-training, and (3) clinical-domain pre-training + subdomain mid-training. Used large-scale clinical text from UF Health for pre-training, then mid-training/fine-tuning on OpenI and MIMIC-CXR datasets.
Result: Mid-trained model GatorTronT5-Radio achieved best performance, outperforming models without mid-training in both ROUGE-L and RadGraph-F1 measures. Also showed better few-shot learning and alleviated “cold start” problem.
Conclusion: Supports “pre-training, mid-training, fine-tuning” strategy over direct fine-tuning for radiology report summarization, with mid-training improving adaptation to specialized domains.
Abstract: Automatic summarization of radiology reports is an essential application to reduce the burden on physicians. Previous studies have widely used the “pre-training, fine-tuning” strategy to adapt large language models (LLMs) for summarization. This study proposed a subdomain adaptation through a mid-training method to improve summarization. We explored three adaptation strategies: (1) general-domain pre-training, (2) clinical-domain pre-training, and (3) clinical-domain pre-training followed by subdomain mid-training. We developed models using large-scale clinical text from the University of Florida (UF) Health and conducted mid-training and fine-tuning experiments using widely used benchmark datasets including OpenI and MIMIC-CXR. The experimental results show that the mid-trained model, GatorTronT5-Radio, achieved the best performance, outperforming models without mid-training in both text-based measures (ROUGE-L) and factuality measures (RadGraph-F1). Our mid-training methods also demonstrate better few-shot learning and could alleviate the “cold start” problem reported in previous studies as a learning barrier. Our findings support the use of “pre-training, mid-training, fine-tuning,” instead of the widely used direct fine-tuning strategy.
[29] From Flat to Structural: Enhancing Automated Short Answer Grading with GraphRAG
Yucheng Chu, Haoyu Han, Shen Dong, Hang Li, Kaiqi Yang, Yasemin Copur-Gencturk, Joseph Krajcik, Namsoo Shin, Hui Liu
Main category: cs.CL
TL;DR: GraphRAG framework uses knowledge graphs instead of flat vector retrieval for automated short answer grading, improving performance on science standards assessment through structured reasoning chains.
Details
Motivation: Standard RAG approaches for automated grading treat knowledge as isolated fragments, failing to capture structural relationships and multi-hop reasoning needed for complex educational content assessment.
Method: Dual-phase pipeline: Microsoft GraphRAG for knowledge graph construction and HippoRAG neurosymbolic algorithm for associative graph traversals to retrieve comprehensive, connected subgraphs of evidence.
Result: Significantly outperforms standard RAG baselines on Next Generation Science Standards dataset, with HippoRAG showing substantial improvements in evaluating Science and Engineering Practices.
Conclusion: Structural retrieval through knowledge graphs is superior for verifying logical reasoning chains required for higher-order academic assessment compared to flat vector retrieval.
Abstract: Automated short answer grading (ASAG) is critical for scaling educational assessment, yet large language models (LLMs) often struggle with hallucinations and strict rubric adherence due to their reliance on generalized pre-training. While Retrieval-Augmented Generation (RAG) mitigates these issues, standard “flat” vector retrieval mechanisms treat knowledge as isolated fragments, failing to capture the structural relationships and multi-hop reasoning essential for complex educational content. To address this limitation, we introduce a Graph Retrieval-Augmented Generation (GraphRAG) framework that organizes reference materials into a structured knowledge graph to explicitly model dependencies between concepts. Our methodology employs a dual-phase pipeline: utilizing Microsoft GraphRAG for high-fidelity graph construction and the HippoRAG neurosymbolic algorithm to execute associative graph traversals, thereby retrieving comprehensive, connected subgraphs of evidence. Experimental evaluations on a Next Generation Science Standards (NGSS) dataset demonstrate that this structural approach significantly outperforms standard RAG baselines across all metrics. Notably, the HippoRAG implementation achieved substantial improvements in evaluating Science and Engineering Practices (SEP), confirming the superiority of structural retrieval in verifying the logical reasoning chains required for higher-order academic assessment.
[30] HypeLoRA: Hyper-Network-Generated LoRA Adapters for Calibrated Language Model Fine-Tuning
Bartosz Trojan, Filip Gębala
Main category: cs.CL
TL;DR: LoRA-based adaptation achieves calibration parity with full fine-tuning for RoBERTa while maintaining parameter efficiency, with hyper-network-generated LoRA factors showing similar performance and revealing calibration-accuracy trade-offs.
Details
Motivation: Transformer models often suffer from miscalibration (overconfident predictions), and this work investigates whether parameter-efficient adaptation methods like LoRA can maintain good calibration while being efficient compared to full fine-tuning.
Method: Evaluates LoRA and a novel hyper-network-based adaptation framework for RoBERTa across GLUE benchmark, comparing calibration metrics (ECE, MCE, ACE) with full fine-tuning, and explores dynamic hyper-network generation of LoRA factors with structural coupling across layers.
Result: LoRA achieves calibration parity with (and sometimes exceeds) full fine-tuning while being parameter-efficient; hyper-network approach produces similar results to standard LoRA; reveals trade-off: constraining adaptation space improves calibration but requires careful accuracy balance.
Conclusion: Structured low-rank updates provide viable foundation for uncertainty-aware Transformer architectures, balancing parameter efficiency with probabilistic reliability, with hyper-network approach showing promise for structural coupling across layers.
Abstract: Modern Transformer-based models frequently suffer from miscalibration, producing overconfident predictions that do not reflect true empirical frequencies. This work investigates the calibration dynamics of LoRA: Low-Rank Adaptation and a novel hyper-network-based adaptation framework as parameter-efficient alternatives to full fine-tuning for RoBERTa. Evaluating across the GLUE benchmark, we demonstrate that LoRA-based adaptation consistently achieves calibration parity with (and in specific tasks exceeds) full fine-tuning, while maintaining significantly higher parameter efficiency. We further explore a dynamic approach where a shared hyper-network generates LoRA factors (A and B matrices) to induce structural coupling across layers. This approach produced results similar to standard LoRA fine-tuning, even achieving a better MCC on the CoLA dataset. Our study also reveals a critical trade-off: constraining the adaptation space (e.g., freezing matrices A) acts as a powerful regularizer that enhances Expected Calibration Error (ECE), but necessitates a carefully balanced sacrifice in downstream task accuracy. To support future research, we provide a unified and reproducible implementation of contemporary calibration metrics, including ECE, MCE, and ACE. Our findings clarify the relationship between parameter efficiency and probabilistic reliability, positioning structured low-rank updates as a viable foundation for uncertainty-aware Transformer architectures. Code available at: https://github.com/btrojan-official/HypeLoRA
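For readers unfamiliar with the calibration metrics this paper reports, the standard binned Expected Calibration Error is the confidence-weighted gap between per-bin accuracy and per-bin mean confidence. A minimal sketch of that textbook formula (not the paper's released implementation):

```python
import numpy as np


def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: partition predictions by confidence, then take the
    sample-weighted average of |accuracy - mean confidence| per bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(confidences)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        gap = abs(correct[mask].mean() - confidences[mask].mean())
        ece += (mask.sum() / n) * gap
    return ece
```

An overconfident model that predicts with 0.9 confidence but is right only half the time gets an ECE of 0.4; MCE and ACE replace the weighted average with, respectively, the worst-bin gap and an equal-mass binning scheme.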
[31] MOSAIC: Modular Opinion Summarization using Aspect Identification and Clustering
Piyush Kumar Singh, Jayesh Choudhari
Main category: cs.CL
TL;DR: MOSAIC is a modular framework for review summarization that decomposes the task into interpretable components (theme discovery, opinion extraction, summary generation) and shows practical impact through online A/B tests, while addressing benchmark reliability issues.
Details
Motivation: Existing review summarization research focuses on end-to-end quality but overlooks benchmark reliability and practical utility of granular insights needed for industrial deployment.
Method: Proposes MOSAIC framework with modular components: theme discovery, structured opinion extraction, opinion clustering, and grounded summary generation. Validates through online A/B tests on live product pages and offline experiments.
Result: MOSAIC achieves superior aspect coverage and faithfulness compared to baselines. Opinion clustering significantly enhances faithfulness in noisy review conditions. Online tests show improved customer experience and measurable business value.
Conclusion: MOSAIC provides a scalable, interpretable approach to review summarization suitable for industrial deployment, with demonstrated practical impact and improved evaluation reliability through new datasets.
Abstract: Reviews are central to how travelers evaluate products on online marketplaces, yet existing summarization research often emphasizes end-to-end quality while overlooking benchmark reliability and the practical utility of granular insights. To address this, we propose MOSAIC, a scalable, modular framework designed for industrial deployment that decomposes summarization into interpretable components, including theme discovery, structured opinion extraction, and grounded summary generation. We validate the practical impact of our approach through online A/B tests on live product pages, showing that surfacing intermediate outputs improves customer experience and delivers measurable value even prior to full summarization deployment. We further conduct extensive offline experiments to demonstrate that MOSAIC achieves superior aspect coverage and faithfulness compared to strong baselines for summarization. Crucially, we introduce opinion clustering as a system-level component and show that it significantly enhances faithfulness, particularly under the noisy and redundant conditions typical of user reviews. Finally, we identify reliability limitations in the standard SPACE dataset and release a new open-source tour experience dataset (TRECS) to enable more robust evaluation.
[32] From Feature-Based Models to Generative AI: Validity Evidence for Constructed Response Scoring
Jodi M. Casabianca, Daniel F. McCaffrey, Matthew S. Johnson, Naim Alper, Vladimir Zubenko
Main category: cs.CL
TL;DR: This paper examines the use of generative AI for scoring constructed responses in educational testing, comparing it with traditional feature-based AI scoring and proposing validity evidence collection practices.
Details
Motivation: As generative AI becomes more capable and likely to be applied in high-stakes testing contexts, there's a need to understand how it differs from traditional AI scoring methods and establish proper validity evidence collection practices for its use in scoring constructed responses.
Method: The authors compare validity evidence requirements across three scoring systems: human ratings, feature-based NLP AI scoring, and generative AI scoring. They propose best practices for collecting validity evidence specific to generative AI and demonstrate these using a large corpus of argumentative essays from 6th-12th grade students.
Result: The study shows that generative AI requires more extensive validity evidence than feature-based scoring due to transparency issues and unique concerns like consistency. The paper demonstrates how to collect validity evidence for different scoring systems and highlights complexities in making validity arguments for generative AI-scored responses.
Conclusion: Generative AI presents both opportunities and challenges for constructed response scoring, requiring more rigorous validity evidence collection than traditional methods due to its lack of transparency and consistency concerns. Proper validation frameworks are essential for responsible implementation in high-stakes testing contexts.
Abstract: The rapid advancements in large language models and generative artificial intelligence (AI) capabilities are making their broad application in the high-stakes testing context more likely. Use of generative AI in the scoring of constructed responses is particularly appealing because it reduces the effort required for handcrafting features in traditional AI scoring and might even outperform those methods. The purpose of this paper is to highlight the differences in the feature-based and generative AI applications in constructed response scoring systems and propose a set of best practices for the collection of validity evidence to support the use and interpretation of constructed response scores from scoring systems using generative AI. We compare the validity evidence needed in scoring systems using human ratings, feature-based natural language processing AI scoring engines, and generative AI. The evidence needed in the generative AI context is more extensive than in the feature-based scoring context because of the lack of transparency and other concerns unique to generative AI such as consistency. Constructed response score data from a large corpus of independent argumentative essays written by 6-12th grade students demonstrate the collection of validity evidence for different types of scoring systems and highlight the numerous complexities and considerations when making a validity argument for these scores.
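A core piece of validity evidence for any of the three scoring systems is agreement between machine and human scores. Quadratic weighted kappa (QWK) is the standard agreement statistic for ordinal essay scores; a generic implementation (not code from the paper) looks like this:

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_score, max_score):
    """Agreement between two sets of ordinal scores, penalizing larger
    disagreements quadratically; 1.0 = perfect agreement, 0.0 = chance."""
    n = max_score - min_score + 1
    O = [[0] * n for _ in range(n)]          # observed score matrix
    for a, b in zip(rater_a, rater_b):
        O[a - min_score][b - min_score] += 1
    total = len(rater_a)
    hist_a = [sum(row) for row in O]          # marginal counts, rater A
    hist_b = [sum(O[i][j] for i in range(n)) for j in range(n)]
    num = den = 0.0
    for i in range(n):
        for j in range(n):
            w = (i - j) ** 2 / (n - 1) ** 2   # quadratic disagreement weight
            num += w * O[i][j]
            den += w * hist_a[i] * hist_b[j] / total  # chance-expected counts
    return 1.0 - num / den

# Human vs. engine scores on a 1-4 rubric (made-up numbers)
print(quadratic_weighted_kappa([3, 2, 4, 1, 3, 2], [3, 3, 4, 1, 2, 2], 1, 4))
```

For generative AI scoring, the paper's point is that such agreement evidence must be supplemented, e.g. by re-scoring the same responses to check consistency.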
[33] Multilingual Hate Speech Detection and Counterspeech Generation: A Comprehensive Survey and Practical Guide
Zahra Safdari Fesaghandis, Suman Kalyan Maity
Main category: cs.CL
TL;DR: Survey paper on multilingual hate speech detection and counterspeech generation, addressing challenges beyond English-centric models with a three-phase framework for building inclusive systems.
Details
Motivation: Online hate speech requires multilingual approaches that capture cultural and linguistic diversity, as monolingual English systems fail in non-English and code-mixed contexts, missing implicit hate and culturally specific expressions.
Method: Comprehensive survey and practical guide with structured three-phase framework: task design, data curation, and evaluation, drawing on state-of-the-art datasets, models, and metrics for multilingual hate speech detection and counterspeech generation.
Result: Consolidates progress in multilingual resources and techniques while highlighting persistent obstacles including data scarcity in low-resource languages, fairness and bias issues, and the need for multimodal solutions.
Conclusion: Provides scalable guidelines for building context-aware, inclusive systems that bridge technical progress with ethical considerations, advancing online safety through fairer detection and counterspeech generation across diverse linguistic environments.
Abstract: Combating online hate speech in multilingual settings requires approaches that go beyond English-centric models and capture the cultural and linguistic diversity of global online discourse. This paper presents a comprehensive survey and practical guide to multilingual hate speech detection and counterspeech generation, integrating recent advances in natural language processing. We analyze why monolingual systems often fail in non-English and code-mixed contexts, missing implicit hate and culturally specific expressions. To address these challenges, we outline a structured three-phase framework - task design, data curation, and evaluation - drawing on state-of-the-art datasets, models, and metrics. The survey consolidates progress in multilingual resources and techniques while highlighting persistent obstacles, including data scarcity in low-resource languages, fairness and bias in system development, and the need for multimodal solutions. By bridging technical progress with ethical and cultural considerations, we provide researchers, practitioners, and policymakers with scalable guidelines for building context-aware, inclusive systems. Our roadmap contributes to advancing online safety through fairer, more effective detection and counterspeech generation across diverse linguistic environments.
[34] URAG: A Benchmark for Uncertainty Quantification in Retrieval-Augmented Large Language Models
Vinh Nguyen, Cuong Dang, Jiahao Zhang, Hoa Tran, Minh Tran, Trinh Chau, Thai Le, Lu Cheng, Suhang Wang
Main category: cs.CL
TL;DR: URAG benchmark evaluates uncertainty in Retrieval-Augmented Generation systems across multiple domains using conformal prediction, revealing trade-offs between accuracy and uncertainty under different RAG methods and conditions.
Details
Motivation: Current RAG evaluations focus mainly on correctness but don't adequately measure how retrieval affects LLM uncertainty and reliability. There's a need for systematic assessment of uncertainty in RAG systems across diverse domains.
Method: URAG reformulates open-ended generation tasks into multiple-choice QA to enable principled uncertainty quantification via conformal prediction. Evaluates 8 standard RAG methods using LAC and APS metrics across healthcare, programming, science, math, and general text domains.
Result: (1) Accuracy gains often coincide with reduced uncertainty, but this relationship breaks under retrieval noise; (2) Simple modular RAG methods offer better accuracy-uncertainty trade-offs than complex reasoning pipelines; (3) No single RAG approach is universally reliable across domains; (4) Retrieval depth, parametric knowledge dependence, and confidence cues can amplify confident errors and hallucinations.
Conclusion: URAG establishes a systematic benchmark for analyzing and enhancing trustworthiness of retrieval-augmented systems, providing insights into uncertainty behavior across different RAG methods and domains.
Abstract: Retrieval-Augmented Generation (RAG) has emerged as a widely adopted approach for enhancing LLMs in scenarios that demand extensive factual knowledge. However, current RAG evaluations concentrate primarily on correctness, which may not fully capture the impact of retrieval on LLM uncertainty and reliability. To bridge this gap, we introduce URAG, a comprehensive benchmark designed to assess the uncertainty of RAG systems across various fields like healthcare, programming, science, math, and general text. By reformulating open-ended generation tasks into multiple-choice question answering, URAG allows for principled uncertainty quantification via conformal prediction. We apply the evaluation pipeline to 8 standard RAG methods, measuring their performance through both accuracy and prediction-set sizes based on LAC and APS metrics. Our analysis shows that (1) accuracy gains often coincide with reduced uncertainty, but this relationship breaks under retrieval noise; (2) simple modular RAG methods tend to offer better accuracy-uncertainty trade-offs than more complex reasoning pipelines; and (3) no single RAG approach is universally reliable across domains. We further show that (4) retrieval depth, parametric knowledge dependence, and exposure to confidence cues can amplify confident errors and hallucinations. Ultimately, URAG establishes a systematic benchmark for analyzing and enhancing the trustworthiness of retrieval-augmented systems. Our code is available on GitHub.
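The LAC score used here is the standard "least ambiguous set-valued classifier" conformal score: one minus the probability assigned to the correct option. The calibrate-then-predict procedure can be sketched with toy numbers (this is the generic conformal recipe, not URAG's pipeline):

```python
import math

def lac_threshold(cal_probs, cal_labels, alpha=0.2):
    """Conformal quantile of LAC scores s = 1 - p(correct answer),
    computed on a held-out calibration set of answer probabilities."""
    scores = sorted(1 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha)) - 1  # conservative quantile index
    return scores[min(k, n - 1)]

def prediction_set(probs, qhat):
    """All answer options whose LAC score falls under the threshold."""
    return [c for c, p in enumerate(probs) if 1 - p <= qhat]

cal_probs = [[0.8, 0.1, 0.1], [0.6, 0.3, 0.1], [0.2, 0.7, 0.1], [0.1, 0.1, 0.8]]
cal_labels = [0, 0, 1, 2]
qhat = lac_threshold(cal_probs, cal_labels)
print(qhat, prediction_set([0.7, 0.2, 0.1], qhat))
```

The size of the resulting prediction set is the uncertainty measure: a confident, well-calibrated RAG system yields small sets, while retrieval noise inflates them.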
[35] Framing Effects in Independent-Agent Large Language Models: A Cross-Family Behavioral Analysis
Zice Wang, Zhenyu Zhang
Main category: cs.CL
TL;DR: LLMs show significant framing effects in threshold voting tasks, where different prompt framings shift preferences toward risk-averse options, revealing bias in non-interacting multi-agent deployments.
Details
Motivation: To investigate how prompt framing influences LLM decisions in threshold voting tasks involving individual-group interest conflicts, particularly in non-interacting multi-agent settings where coordination is limited.
Method: Tested two logically equivalent prompts with different framings across diverse LLM families under isolated trials in a threshold voting task with individual-group interest conflicts.
Result: Prompt framing significantly influences choice distributions, often shifting preferences toward risk-averse options, with surface linguistic cues overriding logically equivalent formulations.
Conclusion: Framing effects are a significant bias source in non-interacting multi-agent LLM deployments, with observed behavior reflecting instrumental rather than cooperative rationality when risk is involved.
Abstract: In many real-world applications, large language models (LLMs) operate as independent agents without interaction, thereby limiting coordination. In this setting, we examine how prompt framing influences decisions in a threshold voting task involving individual-group interest conflict. Two logically equivalent prompts with different framings were tested across diverse LLM families under isolated trials. Results show that prompt framing significantly influences choice distributions, often shifting preferences toward risk-averse options. Surface linguistic cues can even override logically equivalent formulations. This suggests that observed behavior reflects a tendency consistent with a preference for instrumental rather than cooperative rationality when success requires risk-bearing. The findings highlight framing effects as a significant bias source in non-interacting multi-agent LLM deployments, informing alignment and prompt design.
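One simple way to quantify the framing-induced shift the paper reports is the total variation distance between the choice distributions elicited by the two logically equivalent prompts (an illustrative metric; the paper's own analysis may differ):

```python
def total_variation(p, q):
    """TV distance between two choice distributions over the same options;
    0 means identical behavior, 1 means completely disjoint choices."""
    return 0.5 * sum(abs(p[k] - q[k]) for k in p)

frame_a = {"contribute": 0.6, "abstain": 0.4}  # hypothetical gain framing
frame_b = {"contribute": 0.3, "abstain": 0.7}  # same task, loss framing
print(total_variation(frame_a, frame_b))
```

A nonzero distance between logically equivalent framings is exactly the bias signature the paper identifies in non-interacting deployments.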
[36] Automated Motif Indexing on the Arabian Nights
Ibrahim H. Alyami, Mark A. Finlayson
Main category: cs.CL
TL;DR: Computational approach to motif indexing using the Arabian Nights text with El-Shamy’s motif index, achieving 0.85 F1 with a fine-tuned Llama3 model
Details
Motivation: Motifs are recurring narrative elements important for folkloristic analysis and understanding modern cultural texts. Prior work has shown these problems to be difficult to tackle with automated techniques, and there's a need for computational methods to identify motif expressions in original folkloristic texts.
Method: Used Arabian Nights text paired with El-Shamy’s motif index to create annotated corpus (2,670 motif expressions across 58,450 sentences). Tested five approaches: (1) classic retrieve and re-rank with keywords and fine-tuned cross-encoder, (2) off-the-shelf embedding models, (3) fine-tuned embedding models, (4) generative prompting of off-the-shelf LLMs in N-shot setups, and (5) generative approaches on LLMs fine-tuned with LoRA.
Result: Best performing system was fine-tuned Llama3 model achieving 0.85 F1 score for motif expression detection.
Conclusion: First computational approach to motif indexing demonstrates feasibility using large language models, with fine-tuned Llama3 achieving strong performance on motif detection task.
Abstract: Motifs are non-commonplace, recurring narrative elements, often found originally in folk stories. In addition to being of interest to folklorists, motifs appear as metaphoric devices in modern news, literature, propaganda, and other cultural texts. Finding expressions of motifs in the original folkloristic text is useful for both folkloristic analysis (motif indexing) as well as for understanding the modern usage of motifs (motif detection and interpretation). Prior work has primarily shown how difficult these problems are to tackle using automated techniques. We present the first computational approach to motif indexing. Our choice of data is a key enabler: we use a large, widely available text (the Arabian Nights) paired with a detailed motif index (by El-Shamy in 2006), which overcomes the common problem of inaccessibility of texts referred to by the index. We created a manually annotated corpus that identified 2,670 motif expressions of 200 different motifs across 58,450 sentences for training and testing. We tested five types of approaches for detecting motif expressions given a motif index entry: (1) classic retrieve and re-rank using keywords and a fine-tuned cross-encoder; (2) off-the-shelf embedding models; (3) fine-tuned embedding models; (4) generative prompting of off-the-shelf LLMs in N-shot setups; and (5) the same generative approaches on LLMs fine-tuned with LoRA. Our best performing system is a fine-tuned Llama3 model which achieves an overall performance of 0.85 F1.
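The first of the five approaches is classic retrieve-and-rerank. Its retrieval half can be sketched as bag-of-words cosine similarity between a motif index entry and candidate sentences (a toy stand-in for the keyword retriever; the fine-tuned cross-encoder re-ranker is omitted, and the example motif and sentences are invented):

```python
import math
from collections import Counter

def cosine(c1, c2):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(c1[t] * c2[t] for t in c1)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def retrieve(motif_entry, sentences, k=2):
    """Rank sentences by lexical similarity to the motif index entry;
    the top-k would then go to a cross-encoder for re-ranking."""
    q = Counter(motif_entry.lower().split())
    return sorted(sentences,
                  key=lambda s: -cosine(q, Counter(s.lower().split())))[:k]

motif = "magic lamp grants wishes"
sentences = [
    "he rubbed the magic lamp and a genie appeared to grant his wishes",
    "the merchant sailed for seven days",
    "she hid the lamp beneath her cloak",
]
print(retrieve(motif, sentences, k=1))
```

Lexical retrieval like this is precisely what the embedding and generative approaches in the paper improve upon, since motif expressions rarely share surface vocabulary with their index entries.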
[37] Automatic Analysis of Collaboration Through Human Conversational Data Resources: A Review
Yi Yu, Maria Boritchev, Chloé Clavel
Main category: cs.CL
TL;DR: Review paper on using task-oriented conversational data for collaboration analysis, covering theories, coding schemes, tasks, and modeling approaches
Details
Motivation: Collaboration is a fundamental high-level human behavior where conversation serves as the primary medium for information exchange. The paper aims to understand how to utilize task-oriented human-human conversational data for automatic collaboration analysis, addressing a gap in systematic review of this area.
Method: Conducted a comprehensive literature review focusing on verbal aspects of collaboration using task-oriented conversation resources. The review encompasses related theories, coding schemes, tasks, and modeling approaches for collaboration analysis.
Result: The review provides a practical resource that synthesizes existing work on collaboration analysis using conversational data and identifies unexplored areas for future research in this domain.
Conclusion: Task-oriented conversational data is a valuable resource for analyzing collaborative processes, and the review serves as both a practical guide and a roadmap for future research directions in collaboration analysis.
Abstract: Collaboration is a task-oriented, high-level human behavior. In most cases, conversation serves as the primary medium for information exchange and coordination, making conversational data a valuable resource for the automatic analysis of collaborative processes. In this paper, we focus on verbal aspects of collaboration and conduct a review of collaboration analysis using task-oriented conversation resources, encompassing related theories, coding schemes, tasks, and modeling approaches. We aim to address the question of how to utilize task-oriented human-human conversational data for collaboration analysis. We hope our review will serve as a practical resource and illuminate unexplored areas for future collaboration analysis.
[38] LLM-MRD: LLM-Guided Multi-View Reasoning Distillation for Fake News Detection
Weilin Zhou, Shanwen Tan, Enhao Gu, Yurong Qian
Main category: cs.CL
TL;DR: LLM-MRD: A teacher-student framework for multimodal fake news detection that uses LLM-generated reasoning chains as supervision to distill comprehensive multi-view reasoning into an efficient student model.
Details
Motivation: Existing multimodal fake news detection methods lack comprehensive multi-view judgment and suffer from inefficiency due to high computational costs of LLMs used for reasoning.
Method: Proposes LLM-MRD with Student Multi-view Reasoning module (textual, visual, cross-modal perspectives) and Teacher Multi-view Reasoning module that generates deep reasoning chains. Uses Calibration Distillation mechanism to efficiently distill complex reasoning knowledge into the student model.
Result: Significantly outperforms state-of-the-art baselines with comprehensive average improvement of 5.19% in ACC and 6.33% in F1-Fake across all competing methods and datasets.
Conclusion: LLM-MRD effectively addresses limitations of existing approaches by providing comprehensive multi-view reasoning while maintaining efficiency through knowledge distillation.
Abstract: Multimodal fake news detection is crucial for mitigating societal disinformation. Existing approaches attempt to address this by fusing multimodal features or leveraging Large Language Models (LLMs) for advanced reasoning. However, these methods suffer from serious limitations, including a lack of comprehensive multi-view judgment and fusion, and prohibitive reasoning inefficiency due to the high computational costs of LLMs. To address these issues, we propose LLM-Guided Multi-View Reasoning Distillation for Fake News Detection (LLM-MRD), a novel teacher-student framework. The Student Multi-view Reasoning module first constructs a comprehensive foundation from textual, visual, and cross-modal perspectives. Then, the Teacher Multi-view Reasoning module generates deep reasoning chains as rich supervision signals. Our core Calibration Distillation mechanism efficiently distills this complex reasoning-derived knowledge into the efficient student model. Experiments show LLM-MRD significantly outperforms state-of-the-art baselines. Notably, it demonstrates a comprehensive average improvement of 5.19% in ACC and 6.33% in F1-Fake when evaluated across all competing methods and datasets. Our code is available at https://github.com/Nasuro55/LLM-MRD
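The abstract does not give the form of the Calibration Distillation loss; a generic soft-label distillation objective (Hinton-style, blending hard-label cross-entropy with a temperature-softened KL term toward the teacher) illustrates the mechanism being described:

```python
import math

def softmax(logits, T=1.0):
    """Numerically stable softmax with temperature T."""
    m = max(logits)
    exps = [math.exp((x - m) / T) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distillation_loss(student_logits, teacher_logits, label, alpha=0.5, T=2.0):
    """Blend hard-label cross-entropy with temperature-softened KL divergence
    to the teacher's distribution (generic distillation sketch; the paper's
    Calibration Distillation mechanism is unspecified)."""
    sp = softmax(student_logits, T)
    tp = softmax(teacher_logits, T)
    ce = -math.log(softmax(student_logits)[label])          # hard-label term
    kl = sum(t * math.log(t / s) for t, s in zip(tp, sp))   # soft-label term
    return (1 - alpha) * ce + alpha * (T ** 2) * kl

print(distillation_loss([2.0, 0.0], [2.5, -0.5], 0))
```

In LLM-MRD's setting, the teacher's signal is a reasoning chain rather than a logit vector, but the economics are the same: the expensive LLM runs only at training time, and the efficient student runs at inference.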
[39] PrefPO: Pairwise Preference Prompt Optimization
Rahul Singhal, Pradyumna Tambwekar, Karime Maamari
Main category: cs.CL
TL;DR: PrefPO is a preference-based prompt optimization method that uses LLM discriminators to provide pairwise feedback to LLM optimizers, reducing need for labeled data and improving prompt hygiene.
Details
Motivation: Prompt engineering is effective but labor-intensive, and existing automated optimization methods require labeled datasets (often unavailable) and produce verbose, repetitive prompts.
Method: PrefPO uses an LLM discriminator to express pairwise preferences over model outputs and provide feedback to an LLM optimizer, iteratively improving performance. It only needs a starting prompt and natural language criteria, reducing the need for labeled data and hyperparameter tuning.
Result: PrefPO matches or exceeds SOTA methods on 6/9 BBH tasks and performs comparably to TextGrad on IFEval-Hard (82.4% vs 84.5%). It works in both labeled and unlabeled settings, improves prompt hygiene (reduces length/repetition issues 3-5x), and is less susceptible to prompt hacking (37% vs 86% for TextGrad).
Conclusion: PrefPO offers an effective, minimal approach to prompt optimization that reduces reliance on labeled data, improves prompt quality, and is more robust against gaming evaluation criteria.
Abstract: Prompt engineering is effective but labor-intensive, motivating automated optimization methods. Existing methods typically require labeled datasets, which are often unavailable, and produce verbose, repetitive prompts. We introduce PrefPO, a minimal prompt optimization approach inspired by reinforcement learning from human feedback (RLHF). Its preference-based approach reduces the need for labeled data and hyperparameter tuning; only a starting prompt and natural language criteria are needed. PrefPO uses an LLM discriminator to express pairwise preferences over model outputs and provide feedback to an LLM optimizer, iteratively improving performance. We evaluate PrefPO on 9 BIG-Bench Hard (BBH) tasks and IFEval-Hard, a newly-curated, challenging subset of IFEval. PrefPO matches or exceeds SOTA methods, including GEPA, MIPRO, and TextGrad, on 6/9 tasks and performs comparably to TextGrad on IFEval-Hard (82.4% vs 84.5%). Unlike other methods, PrefPO can optimize in both labeled and unlabeled settings. Without labels, PrefPO closely matches its labeled performance on 6/9 tasks, proving effective without ground truth. PrefPO also improves prompt hygiene: we find existing methods produce prompts 14.7x their original length or with 34% repetitive content; PrefPO reduces these issues by 3-5x. Furthermore, both LLM and human judges rate PrefPO’s prompts higher than TextGrad’s. Finally, we identify prompt hacking in prompt optimizers, where methods game evaluation criteria, and find PrefPO is susceptible at half the rate of TextGrad (37% vs 86%), generating fewer brittle, misaligned prompts.
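At its core the method is a propose-compare-keep loop: an optimizer LLM proposes a prompt variant, a discriminator LLM states a pairwise preference over the resulting outputs, and the preferred prompt becomes the incumbent. A minimal sketch, with toy stand-ins for both LLM calls (toy_propose and toy_compare are hypothetical, not PrefPO's prompts):

```python
def optimize_prompt(seed_prompt, propose, compare, rounds=5):
    """Hill-climbing loop: the optimizer proposes a variant; the
    discriminator's pairwise preference decides whether it replaces
    the incumbent prompt."""
    best = seed_prompt
    for _ in range(rounds):
        candidate = propose(best)
        if compare(candidate, best):  # True if candidate's outputs preferred
            best = candidate
    return best

# Toy stand-ins for the two LLM calls: "quality" here is just how many
# natural-language criteria the prompt mentions.
CRITERIA = ["concise", "cite sources", "step by step"]

def toy_propose(prompt):
    for c in CRITERIA:
        if c not in prompt:
            return prompt + " " + c + "."
    return prompt

def toy_compare(a, b):
    score = lambda p: sum(c in p for c in CRITERIA)
    return score(a) > score(b)

final = optimize_prompt("Answer the question.", toy_propose, toy_compare, rounds=3)
print(final)
```

Because the comparison is over outputs rather than against gold labels, the same loop runs unchanged in the unlabeled setting, which is the property the paper exploits.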
[40] Memory-Driven Role-Playing: Evaluation and Enhancement of Persona Knowledge Utilization in LLMs
Kai Wang, Haoyang You, Yang Zhang, Zhongjie Wang
Main category: cs.CL
TL;DR: A memory-driven role-playing paradigm for LLMs that treats persona knowledge as internal memory, requiring retrieval and application based on dialogue context without explicit cues, with evaluation framework, prompting architecture, and benchmark.
Details
Motivation: LLMs struggle with consistent characterization in long, open-ended dialogues, frequently failing to recall and apply persona knowledge without explicit cues, requiring a more rigorous test of autonomous knowledge use.
Method: Proposes Memory-Driven Role-Playing paradigm inspired by Stanislavski’s “emotional memory” theory, with three components: MREval (evaluation framework assessing Anchoring, Recalling, Bounding, Enacting), MRPrompt (prompting architecture for structured memory retrieval and response generation), and MRBench (bilingual benchmark for fine-grained diagnosis).
Result: Experiments show MRPrompt enables small models (Qwen3-8B) to match performance of larger closed-source LLMs (Qwen3-Max, GLM-4.7), and confirms upstream memory gains directly enhance downstream response quality, validating the staged theoretical foundation.
Conclusion: The memory-driven paradigm provides comprehensive diagnostic for four-staged role-playing abilities across LLMs, demonstrating that structured memory retrieval mechanisms can significantly improve role-playing consistency and quality.
Abstract: A core challenge for faithful LLM role-playing is sustaining consistent characterization throughout long, open-ended dialogues, as models frequently fail to recall and accurately apply their designated persona knowledge without explicit cues. To tackle this, we propose the Memory-Driven Role-Playing paradigm. Inspired by Stanislavski’s “emotional memory” acting theory, this paradigm frames persona knowledge as the LLM’s internal memory store, requiring retrieval and application based solely on dialogue context, thereby providing a rigorous test of depth and autonomous use of knowledge. Centered on this paradigm, we contribute: (1) MREval, a fine-grained evaluation framework assessing four memory-driven abilities - Anchoring, Recalling, Bounding, and Enacting; (2) MRPrompt, a prompting architecture that guides structured memory retrieval and response generation; and (3) MRBench, a bilingual (Chinese/English) benchmark for fine-grained diagnosis. The novel paradigm provides a comprehensive diagnostic for four-staged role-playing abilities across 12 LLMs. Crucially, experiments show that MRPrompt allows small models (e.g., Qwen3-8B) to match the performance of much larger closed-source LLMs (e.g., Qwen3-Max and GLM-4.7), and confirms that upstream memory gains directly enhance downstream response quality, validating the staged theoretical foundation.
[41] Prompt-tuning with Attribute Guidance for Low-resource Entity Matching
Lihui Liu, Carl Yang
Main category: cs.CL
TL;DR: PROMPTATTRIB is a low-resource entity matching method that uses attribute-level prompt tuning with fuzzy logic reasoning and contrastive learning for improved accuracy and interpretability.
Details
Motivation: Traditional entity matching requires large labeled datasets which are costly to create. Existing prompt-tuning methods focus on entity-level matching and lack interpretability, overlooking important attribute-level information.
Method: PROMPTATTRIB uses both entity-level and attribute-level prompts to incorporate richer contextual information, employs fuzzy logic formulas to infer final matching labels, and integrates dropout-based contrastive learning on soft prompts inspired by SimCSE.
Result: Extensive experiments on real-world datasets demonstrate the effectiveness of PROMPTATTRIB in improving entity matching performance with minimal labeled data.
Conclusion: PROMPTATTRIB provides a comprehensive solution for low-resource entity matching that addresses limitations of existing methods by incorporating attribute-level information, improving interpretability, and enhancing performance through contrastive learning.
Abstract: Entity Matching (EM) is an important task that determines the logical relationship between two entities, such as Same, Different, or Undecidable. Traditional EM approaches rely heavily on supervised learning, which requires large amounts of high-quality labeled data. This labeling process is both time-consuming and costly, limiting practical applicability. As a result, there is a strong need for low-resource EM methods that can perform well with minimal labeled data. Recent prompt-tuning approaches have shown promise for low-resource EM, but they mainly focus on entity-level matching and often overlook critical attribute-level information. In addition, these methods typically lack interpretability and explainability. To address these limitations, this paper introduces PROMPTATTRIB, a comprehensive solution that tackles EM through attribute-level prompt tuning and logical reasoning. PROMPTATTRIB uses both entity-level and attribute-level prompts to incorporate richer contextual information and employs fuzzy logic formulas to infer the final matching label. By explicitly considering attributes, the model gains a deeper understanding of the entities, resulting in more accurate matching. Furthermore, PROMPTATTRIB integrates dropout-based contrastive learning on soft prompts, inspired by SimCSE, which further boosts EM performance. Extensive experiments on real-world datasets demonstrate the effectiveness of PROMPTATTRIB.
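The abstract names fuzzy logic formulas without giving them. One natural reading uses the Gödel t-norm (min) over per-attribute agreement scores to derive the paper's three-way label; the thresholds and attribute names below are invented for illustration:

```python
def match_label(attr_scores, hi=0.8, lo=0.3):
    """Gödel t-norm (min) over per-attribute agreement scores: entities are
    Same only if every attribute agrees strongly; one strongly disagreeing
    attribute makes them Different; anything in between is Undecidable."""
    agreement = min(attr_scores.values())  # weakest attribute dominates
    if agreement >= hi:
        return "Same"
    if agreement < lo:
        return "Different"
    return "Undecidable"

print(match_label({"name": 0.9, "address": 0.95}))
print(match_label({"name": 0.9, "address": 0.1}))
print(match_label({"name": 0.9, "address": 0.5}))
```

This is where the interpretability claim comes from: the label can always be traced back to the specific attribute whose score drove the min.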
[42] Scalable Prompt Routing via Fine-Grained Latent Task Discovery
Yunyi Zhang, Soji Adeshina, Patrick Guan, Ashwin Ganesh, Zhen Han, Vassilis N. Ioannidis, Huzefa Rangwala, George Karypis
Main category: cs.CL
TL;DR: Two-stage routing architecture for LLM selection: first stage discovers latent task types via graph clustering and classification, second stage uses mixture-of-experts with task-specific heads for quality estimation, balancing task-level stability with prompt-specific adaptability.
Details
Motivation: Existing prompt routing approaches struggle with scaling model pools containing dozens of frontier models with narrow performance gaps; manual task taxonomies can't capture fine-grained capability distinctions, and monolithic routers fail to differentiate subtle differences across diverse tasks.
Method: Two-stage architecture: 1) Graph-based clustering discovers latent task types, trains classifier for prompt assignment; 2) Mixture-of-experts with task-specific prediction heads for specialized quality estimation. Inference aggregates both stages to balance task-level stability with prompt-specific adaptability.
Result: Evaluated on 10 benchmarks with 11 frontier models, consistently outperforms existing baselines and surpasses strongest individual model while incurring less than half its cost.
Conclusion: Proposed two-stage routing architecture effectively addresses limitations of existing approaches for large-scale model pools, achieving superior performance with significant cost savings through automated fine-grained task discovery and task-aware quality estimation.
Abstract: Prompt routing dynamically selects the most appropriate large language model from a pool of candidates for each query, optimizing performance while managing costs. As model pools scale to include dozens of frontier models with narrow performance gaps, existing approaches face significant challenges: manually defined task taxonomies cannot capture fine-grained capability distinctions, while monolithic routers struggle to differentiate subtle differences across diverse tasks. We propose a two-stage routing architecture that addresses these limitations through automated fine-grained task discovery and task-aware quality estimation. Our first stage employs graph-based clustering to discover latent task types and trains a classifier to assign prompts to discovered tasks. The second stage uses a mixture-of-experts architecture with task-specific prediction heads for specialized quality estimates. At inference, we aggregate predictions from both stages to balance task-level stability with prompt-specific adaptability. Evaluated on 10 benchmarks with 11 frontier models, our method consistently outperforms existing baselines and surpasses the strongest individual model while incurring less than half its cost.
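The inference-time aggregation can be sketched as a convex blend of the stage-1 task-level quality prior and the stage-2 expert-head estimate, followed by an argmax over candidate models. The dictionary structure, model names, and weight below are illustrative assumptions, not the paper's:

```python
def route(task_probs, task_quality, expert_scores, w=0.5):
    """Stage 1: assign the prompt to its most likely latent task.
    Stage 2: blend that task's average model quality (stable prior) with
    the prompt-specific mixture-of-experts estimate (adaptive), then
    select the model with the best blended score."""
    task = max(task_probs, key=task_probs.get)
    blended = {
        m: w * task_quality[task][m] + (1 - w) * expert_scores[m]
        for m in expert_scores
    }
    return max(blended, key=blended.get)

task_probs = {"code": 0.7, "math": 0.3}                       # stage-1 classifier output
task_quality = {"code": {"A": 0.9, "B": 0.6},                 # per-task model priors
                "math": {"A": 0.5, "B": 0.8}}
expert_scores = {"A": 0.4, "B": 0.5}                          # stage-2 per-prompt estimates
print(route(task_probs, task_quality, expert_scores))
```

A cost term could be subtracted from each blended score to reproduce the cost-performance trade-off the paper reports.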
[43] Is Evaluation Awareness Just Format Sensitivity? Limitations of Probe-Based Evidence under Controlled Prompt Structure
Viliana Devbunova
Main category: cs.CL
TL;DR: Probe-based evaluation awareness detection in LLMs primarily tracks benchmark structure rather than true evaluation context, failing to generalize to free-form prompts.
Details
Motivation: To determine whether existing probe-based methods for detecting evaluation awareness in LLMs actually measure evaluation context or just surface-level benchmark structure.
Method: Used controlled 2x2 dataset and diagnostic rewrites to test if probe signals persist when controlling for prompt format, comparing benchmark-canonical vs free-form prompts.
Result: Probes primarily track benchmark-canonical structure and fail to generalize to free-form prompts independent of linguistic style.
Conclusion: Standard probe-based methodologies don’t reliably disentangle evaluation context from structural artifacts, limiting evidential strength of existing results.
Abstract: Prior work uses linear probes on benchmark prompts as evidence of evaluation awareness in large language models. Because evaluation context is typically entangled with benchmark format and genre, it is unclear whether probe-based signals reflect context or surface structure. We test whether these signals persist under partial control of prompt format using a controlled 2x2 dataset and diagnostic rewrites. We find that probes primarily track benchmark-canonical structure and fail to generalize to free-form prompts independent of linguistic style. Thus, standard probe-based methodologies do not reliably disentangle evaluation context from structural artifacts, limiting the evidential strength of existing results.
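The probes in question are ordinary linear classifiers trained on hidden activations. A from-scratch logistic-regression probe on toy two-dimensional "activations" shows the setup, and also why high in-distribution accuracy alone cannot tell you which feature (evaluation context vs. benchmark format) the probe latched onto:

```python
import math

def train_probe(X, y, lr=0.5, epochs=200):
    """Logistic-regression probe: learn w, b so that sigmoid(w.x + b)
    predicts the binary label from a hidden-state vector."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1 / (1 + math.exp(-z))
            g = p - yi  # gradient of the log-loss w.r.t. z
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b

def probe_acc(w, b, X, y):
    hits = sum((sum(wj * xj for wj, xj in zip(w, xi)) + b > 0) == (yi == 1)
               for xi, yi in zip(X, y))
    return hits / len(X)

# Toy "activations": benchmark-formatted prompts (label 1) happen to
# cluster apart from free-form prompts (label 0), so format and context
# are confounded, which is exactly the paper's concern.
X = [[1.0, 1.2], [0.9, 1.1], [-1.0, -0.8], [-1.1, -1.2]]
y = [1, 1, 0, 0]
w, b = train_probe(X, y)
print(probe_acc(w, b, X, y))
```

The paper's 2x2 design breaks this confound by varying format and context independently, then checking whether the probe transfers.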
[44] Vocabulary shapes cross-lingual variation of word-order learnability in language models
Jonas Mayer Martins, Jaap Jumelet, Viola Priesemann, Lisa Beinborn
Main category: cs.CL
TL;DR: Transformer language models trained on synthetic word-order variants show vocabulary structure, not word-order freedom, is the key predictor of computational learnability across languages.
Details
Motivation: To understand why some languages permit free word order while others have fixed word order, and to identify the key factors affecting computational learnability of word order across languages.
Method: Pretrained transformer language models on synthetic word-order variants of natural languages, measuring model surprisal as an indicator of learnability, and analyzing correlations with linguistic features.
Result: Greater word-order irregularity consistently raises model surprisal (reduced learnability), but sentence reversal has weak effects. Vocabulary structure, not the free/fixed word-order distinction, strongly predicts model surprisal across languages.
Conclusion: Vocabulary structure emerges as the primary driver of computational word-order learnability, challenging the traditional free vs. fixed word-order classification as the main explanatory factor.
Abstract: Why do some languages like Czech permit free word order, while others like English do not? We address this question by pretraining transformer language models on a spectrum of synthetic word-order variants of natural languages. We observe that greater word-order irregularity consistently raises model surprisal, indicating reduced learnability. Sentence reversal, however, affects learnability only weakly. A coarse distinction of free- (e.g., Czech and Finnish) and fixed-word-order languages (e.g., English and French) does not explain cross-lingual variation. Instead, the structure of the word and subword vocabulary strongly predicts the model surprisal. Overall, vocabulary structure emerges as a key driver of computational word-order learnability across languages.
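Not from the paper, but the learnability proxy it relies on is simple to state: per-token surprisal, the negative log-probability a model assigns each token given its context. A minimal sketch (the token probabilities below are made up for illustration):

```python
import math

def mean_surprisal(token_probs):
    """Average per-token surprisal in bits: -log2 p(token | context).

    Lower mean surprisal means the sequence is easier for the model
    to predict, which the paper uses as a learnability proxy.
    """
    return sum(-math.log2(p) for p in token_probs) / len(token_probs)

# A more regular (predictable) word order yields lower surprisal than
# an irregular one -- illustrative probabilities, not real model output.
regular = mean_surprisal([0.9, 0.8, 0.85, 0.9])
shuffled = mean_surprisal([0.2, 0.1, 0.3, 0.15])
```

In the paper, these probabilities come from transformers pretrained on each synthetic word-order variant; here the point is only the direction of the comparison.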
[45] Cooperation and Exploitation in LLM Policy Synthesis for Sequential Social Dilemmas
Víctor Gallego
Main category: cs.CL
TL;DR: LLM-based policy synthesis framework generates Python agent policies through iterative prompting, evaluation, and refinement with performance feedback, showing dense social metrics feedback outperforms sparse reward-only feedback in cooperative multi-agent environments.
Details
Motivation: To develop a framework for synthesizing programmatic agent policies using LLMs instead of traditional reinforcement learning, exploring how different types of performance feedback (sparse vs. dense with social metrics) affect policy quality in multi-agent social dilemmas.
Method: Uses LLMs to iteratively generate Python policy functions for multi-agent environments, evaluates them in self-play, and refines policies using performance feedback. Compares sparse feedback (scalar reward only) against dense feedback (reward plus social metrics: efficiency, equality, sustainability, peace) across Sequential Social Dilemmas (Gathering and Cleanup) with frontier LLMs (Claude Sonnet 4.6, Gemini 3.1 Pro).
Result: Dense feedback consistently matches or exceeds sparse feedback on all metrics, with largest advantage in Cleanup public goods game where social metrics help calibrate costly cleaning-harvesting tradeoff. Social metrics serve as coordination signals guiding LLMs toward effective cooperative strategies like territory partitioning and adaptive role assignment. Also identified five reward hacking attack classes and discussed safety-expressiveness tradeoffs.
Conclusion: LLM policy synthesis with dense social feedback enables effective cooperative strategy development in multi-agent environments, though requires careful consideration of reward hacking vulnerabilities and safety-expressiveness tradeoffs.
Abstract: We study LLM policy synthesis: using a large language model to iteratively generate programmatic agent policies for multi-agent environments. Rather than training neural policies via reinforcement learning, our framework prompts an LLM to produce Python policy functions, evaluates them in self-play, and refines them using performance feedback across iterations. We investigate feedback engineering (the design of what evaluation information is shown to the LLM during refinement) comparing sparse feedback (scalar reward only) against dense feedback (reward plus social metrics: efficiency, equality, sustainability, peace). Across two canonical Sequential Social Dilemmas (Gathering and Cleanup) and two frontier LLMs (Claude Sonnet 4.6, Gemini 3.1 Pro), dense feedback consistently matches or exceeds sparse feedback on all metrics. The advantage is largest in the Cleanup public goods game, where providing social metrics helps the LLM calibrate the costly cleaning-harvesting tradeoff. Rather than triggering over-optimization of fairness, social metrics serve as a coordination signal that guides the LLM toward more effective cooperative strategies, including territory partitioning, adaptive role assignment, and the avoidance of wasteful aggression. We further perform an adversarial experiment to determine whether LLMs can reward hack these environments. We characterize five attack classes and discuss mitigations, highlighting an inherent tension in LLM policy synthesis between expressiveness and safety. Code at https://github.com/vicgalle/llm-policies-social-dilemmas.
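The generate-evaluate-refine loop described above can be sketched in a few lines. This is our own illustrative skeleton, not the authors' code; `llm_generate`, `evaluate`, and the dense-feedback field names are hypothetical stand-ins:

```python
def dense_feedback(metrics):
    """Format dense feedback shown to the LLM: reward plus the four
    social metrics (field names are hypothetical)."""
    return ("reward={reward:.2f} efficiency={efficiency:.2f} "
            "equality={equality:.2f} sustainability={sustainability:.2f} "
            "peace={peace:.2f}").format(**metrics)

def synthesize(llm_generate, evaluate, iterations=3):
    """Iterative policy synthesis: prompt -> Python policy source ->
    self-play evaluation -> refine with feedback. Returns the best
    (policy, metrics) pair seen."""
    feedback = ""
    best = None
    for _ in range(iterations):
        policy_src = llm_generate(feedback)   # LLM writes a policy function
        metrics = evaluate(policy_src)        # evaluated in self-play
        feedback = dense_feedback(metrics)    # fed back on the next round
        if best is None or metrics["reward"] > best[1]["reward"]:
            best = (policy_src, metrics)
    return best
```

Swapping `dense_feedback` for a reward-only string gives the sparse condition the paper compares against.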
[46] Inducing Sustained Creativity and Diversity in Large Language Models
Queenie Luo, Gary King, Michael Puett, Michael D. Smith
Main category: cs.CL
TL;DR: Novel decoding scheme for LLMs that generates sustained creativity and diverse outputs for exploratory search quests, enabling users to explore many unique alternatives without model access.
Details
Motivation: Current LLM decoding methods are optimized for prompts with correct answers, producing homogeneous results that fail to support exploratory search quests where users need to evaluate many diverse and creative alternatives over extended exploration.
Method: Developed a novel, easy-to-implement decoding scheme that induces sustained creativity and diversity in LLMs without requiring access to the model's inner vector space, enabling generation of many conceptually unique results.
Result: The algorithm unlocks an LLM’s vast knowledge beyond modal decoding paths, allowing search quest users to more quickly explore search spaces and find satisfying answers through sustained creative output.
Conclusion: The proposed decoding scheme addresses limitations of current LLM methods for exploratory search, enabling sustained creativity and diversity that better supports users in complex search quests requiring evaluation of many alternatives.
Abstract: We address a not-widely-recognized subset of exploratory search, where a user sets out on a typically long “search quest” for the perfect wedding dress, overlooked research topic, killer company idea, etc. The first few outputs of current large language models (LLMs) may be helpful but only as a start, since the quest requires learning the search space and evaluating many diverse and creative alternatives along the way. Although LLMs encode an impressive fraction of the world’s knowledge, common decoding methods are narrowly optimized for prompts with correct answers and thus return mostly homogeneous and conventional results. Other approaches, including those designed to increase diversity across a small set of answers, start to repeat themselves long before search quest users learn enough to make final choices, or offer a uniform type of “creativity” to every user asking similar questions. We develop a novel, easy-to-implement decoding scheme that induces sustained creativity and diversity in LLMs, producing as many conceptually unique results as desired, even without access to the inner workings of an LLM’s vector space. The algorithm unlocks an LLM’s vast knowledge, both orthodox and heterodox, well beyond modal decoding paths. With this approach, search quest users can more quickly explore the search space and find satisfying answers.
[47] EvidenceRL: Reinforcing Evidence Consistency for Trustworthy Language Models
J. Ben Tamo, Yuxing Lu, Benoit L. Marteau, Micky C. Nnamdi, May D. Wang
Main category: cs.CL
TL;DR: EvidenceRL is a reinforcement learning framework that improves LLM response grounding in evidence to reduce hallucinations in high-stakes domains like medical diagnosis and legal reasoning.
Details
Motivation: LLMs often produce plausible but unsubstantiated answers (hallucinations), which is particularly dangerous in high-stakes domains like healthcare and law where decisions must be evidence-based and verifiable.
Method: EvidenceRL uses reinforcement learning with Group Relative Policy Optimization (GRPO) to optimize LLM responses based on two scores: grounding (entailment with retrieved evidence and context) and correctness (agreement with reference answers).
Result: Significant improvements in evidence grounding and faithfulness across domains: cardiac diagnosis F1@3 increased from 37.0 to 54.5, grounding rose from 47.6 to 78.2, hallucinations dropped nearly 5×, and evidence-supported diagnoses increased from 31.8% to 61.6%. Legal reasoning faithfulness improved from 32.8% to 67.6%.
Conclusion: EvidenceRL effectively reduces hallucinations and improves evidence grounding in LLMs for high-stakes applications without sacrificing task accuracy, demonstrating consistent behavioral improvements across different domains.
Abstract: Large Language Models (LLMs) are fluent but prone to hallucinations, producing answers that appear plausible yet are unsupported by available evidence. This failure is especially problematic in high-stakes domains where decisions must be justified by verifiable information. We introduce \textbf{EvidenceRL}, a reinforcement learning framework that enforces evidence adherence during training. EvidenceRL scores candidate responses for grounding (entailment with retrieved evidence and context) and correctness (agreement with reference answers) and optimizes the generator using Group Relative Policy Optimization (GRPO). We evaluate across two high-stakes domains, cardiac diagnosis and legal reasoning, where EvidenceRL consistently improves evidence grounding and faithfulness without sacrificing task accuracy. On cardiac diagnosis, F1@3 increases from 37.0 to 54.5 on Llama-3.2-3B while grounding ($G_{\max}@3$) rises from 47.6 to 78.2; hallucinations drop nearly 5$\times$ and evidence-supported diagnoses increase from 31.8% to 61.6%. On legal reasoning, EvidenceRL raises Faithfulness from 32.8% to 67.6% on Llama-3.1-8B, demonstrating consistent behavioral change across domains. Our code is open-sourced at https://github.com/Wizaaard/EvidenceRL.git.
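The two-score reward and GRPO's group-relative normalization can be sketched as follows. This is an illustrative reading, not the paper's implementation; the linear combination and the weight `alpha` are our assumptions:

```python
def evidence_reward(grounding, correctness, alpha=0.5):
    """Combined scalar reward for one candidate response (hypothetical weighting).

    grounding   -- entailment score against retrieved evidence/context, in [0, 1]
    correctness -- agreement with the reference answer, in [0, 1]
    """
    return alpha * grounding + (1 - alpha) * correctness

def group_relative_advantages(rewards):
    """GRPO-style advantage: each rollout's reward is normalized against
    the mean and standard deviation of its own group of rollouts."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # avoid division by zero for identical rewards
    return [(r - mean) / std for r in rewards]
```

The advantages then weight the policy-gradient update on the generator, so responses that are better grounded than their group peers are reinforced.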
[48] FDARxBench: Benchmarking Regulatory and Clinical Reasoning on FDA Generic Drug Assessment
Betty Xiong, Jillian Fisher, Benjamin Newman, Meng Hu, Shivangi Gupta, Yejin Choi, Lanyan Fang, Russ B Altman
Main category: cs.CL
TL;DR: FDARxBench: An expert-curated benchmark for document-grounded QA using FDA drug labels, evaluating factual grounding, multi-hop reasoning, and refusal behavior in regulatory contexts.
Details
Motivation: Current language models struggle with accurate question answering on complex regulatory documents like FDA drug labels, which contain heterogeneous clinical and regulatory information. There's a need for challenging, real-world benchmarks to evaluate document-grounded QA capabilities in safety-critical domains.
Method: Created FDARxBench through collaboration with FDA regulatory assessors using a multi-stage pipeline for generating high-quality QA examples. Includes factual, multi-hop, and refusal tasks with evaluation protocols for both open-book and closed-book reasoning.
Result: Experiments across proprietary and open-weight models reveal substantial gaps in factual grounding, long-context retrieval, and safe refusal behavior. Models struggle with the complexity of regulatory-grade document comprehension.
Conclusion: FDARxBench provides a substantial foundation for challenging regulatory-grade evaluation of label comprehension and supports evaluation of LLM behavior on drug-label questions, though motivated by FDA generic drug assessment needs.
Abstract: We introduce an expert curated, real-world benchmark for evaluating document-grounded question-answering (QA) motivated by generic drug assessment, using the U.S. Food and Drug Administration (FDA) drug label documents. Drug labels contain rich but heterogeneous clinical and regulatory information, making accurate question answering difficult for current language models. In collaboration with FDA regulatory assessors, we introduce FDARxBench, and construct a multi-stage pipeline for generating high-quality, expert curated, QA examples spanning factual, multi-hop, and refusal tasks, and design evaluation protocols to assess both open-book and closed-book reasoning. Experiments across proprietary and open-weight models reveal substantial gaps in factual grounding, long-context retrieval, and safe refusal behavior. While motivated by FDA generic drug assessment needs, this benchmark also provides a substantial foundation for challenging regulatory-grade evaluation of label comprehension. The benchmark is designed to support evaluation of LLM behavior on drug-label questions.
[49] TextReasoningBench: Does Reasoning Really Improve Text Classification in Large Language Models?
Xinyu Guo, Yazhou Zhang, Jing Qin
Main category: cs.CL
TL;DR: Systematic evaluation of reasoning strategies for text classification with LLMs reveals limited benefits and high token costs, challenging the assumption that reasoning uniformly improves performance across tasks.
Details
Motivation: The paper addresses the gap in understanding whether explicit reasoning strategies (like Chain-of-Thought) truly benefit text classification tasks, given their substantial token and time costs, and the implicit assumption that deliberative reasoning uniformly helps heterogeneous NLP tasks.
Method: Introduces TextReasoningBench, a benchmark comparing seven reasoning strategies (IO, CoT, SC-CoT, ToT, GoT, BoC, long-CoT) across ten LLMs on five text classification datasets, using both traditional metrics and new cost-aware metrics for efficiency evaluation.
Result: Three key findings: (1) reasoning does not universally improve classification: CoT/SC-CoT yield limited gains (+1-3%), while complex methods often fail or degrade performance; (2) reasoning is inefficient: strategies increase token consumption 10-100× for marginal improvements; (3) performance gains per token are low.
Conclusion: The study challenges the blanket application of reasoning strategies to text classification, showing limited benefits and high costs, suggesting need for more targeted reasoning approaches rather than uniform application across all NLP tasks.
Abstract: Eliciting explicit, step-by-step reasoning traces from large language models (LLMs) has emerged as a dominant paradigm for enhancing model capabilities. Although such reasoning strategies were originally designed for problems requiring explicit multi-step reasoning, they have increasingly been applied to a broad range of NLP tasks. This expansion implicitly assumes that deliberative reasoning uniformly benefits heterogeneous tasks. However, whether such reasoning mechanisms truly benefit classification tasks remains largely underexplored, especially considering their substantial token and time costs. To fill this gap, we introduce TextReasoningBench, a systematic benchmark designed to evaluate the effectiveness and efficiency of reasoning strategies for text classification with LLMs. We compare seven reasoning strategies, namely IO, CoT, SC-CoT, ToT, GoT, BoC, and long-CoT across ten LLMs on five text classification datasets. Beyond traditional metrics such as accuracy and macro-F1, we introduce two cost-aware evaluation metrics that quantify the performance gain per reasoning token and the efficiency of performance improvement relative to token cost growth. Experimental results reveal three notable findings: (1) Reasoning does not universally improve classification performance: while moderate strategies such as CoT and SC-CoT yield consistent but limited gains (typically +1% to +3% on big models), more complex methods (e.g., ToT and GoT) often fail to outperform simpler baselines and can even degrade performance, especially on small models; (2) Reasoning is often inefficient: many reasoning strategies increase token consumption by 10$\times$ to 100$\times$ (e.g., SC-CoT and ToT) while providing only marginal performance improvements.
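The paper's two cost-aware metrics are defined precisely there; the sketch below shows one plausible reading of them (gain per extra reasoning token, and relative gain over relative token-cost growth), with hypothetical numbers:

```python
def gain_per_token(acc_reasoning, acc_baseline, extra_tokens):
    """Performance gain per additional reasoning token
    (one plausible reading; the paper gives the exact definition)."""
    return (acc_reasoning - acc_baseline) / extra_tokens

def cost_efficiency(acc_reasoning, acc_baseline, tokens_reasoning, tokens_baseline):
    """Relative performance gain divided by relative token-cost growth."""
    rel_gain = (acc_reasoning - acc_baseline) / acc_baseline
    rel_cost = (tokens_reasoning - tokens_baseline) / tokens_baseline
    return rel_gain / rel_cost

# Hypothetical example: +2 points of accuracy for 50x the tokens.
eff = cost_efficiency(0.82, 0.80, tokens_reasoning=5000, tokens_baseline=100)
```

Under these definitions, the +1% to +3% gains at 10-100× token cost reported above translate into very small per-token returns, which is the paper's efficiency argument.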
[50] BEAVER: A Training-Free Hierarchical Prompt Compression Method via Structure-Aware Page Selection
Zhengpei Hu, Kai Li, Dapeng Fu, Chang Zeng, Yue Li, Yuanhao Tang, Jianqiang Huang
Main category: cs.CL
TL;DR: BEAVER is a training-free framework for compressing long contexts in LLMs using hierarchical selection instead of token pruning, achieving high efficiency while preserving semantic integrity.
Details
Motivation: The exponential growth of LLM context windows has created severe bottlenecks in inference latency and information utilization. Existing compression methods suffer from high training costs or semantic fragmentation due to aggressive token pruning.
Method: BEAVER shifts compression from linear token removal to structure-aware hierarchical selection. It uses dual-path pooling to map variable-length contexts into dense page-level tensors, and employs a hybrid planner combining semantic and lexical dual-branch selection with sentence smoothing to preserve discourse integrity.
Result: On four long-context benchmarks, BEAVER achieves comparable performance to SOTA methods like LongLLMLingua. On the RULER benchmark, it maintains high fidelity in multi-needle retrieval where baselines deteriorate. BEAVER reduces latency by 26.4x on 128k contexts.
Conclusion: BEAVER offers a scalable, training-free solution for high-throughput long-context applications by balancing compression efficiency with semantic preservation through hierarchical structure-aware selection.
Abstract: The exponential expansion of context windows in LLMs has unlocked capabilities for long-document understanding but introduced severe bottlenecks in inference latency and information utilization. Existing compression methods often suffer from high training costs or semantic fragmentation due to aggressive token pruning. In this paper, we propose BEAVER, a novel training-free framework that shifts compression from linear token removal to structure-aware hierarchical selection. BEAVER maximizes hardware parallelism by mapping variable-length contexts into dense page-level tensors via dual-path pooling, and preserves discourse integrity through a hybrid planner combining semantic and lexical dual-branch selection with sentence smoothing. Extensive evaluations on four long-context benchmarks demonstrate that BEAVER achieves comparable performance to state-of-the-art (SOTA) methods like LongLLMLingua. Notably, on the RULER benchmark, BEAVER maintains high fidelity in multi-needle retrieval where baselines deteriorate. Regarding efficiency, BEAVER reduces latency by 26.4x on 128k contexts, offering a scalable solution for high-throughput applications. Our code is available at https://cslikai.cn/BEAVER/.
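The page-level selection idea can be approximated in a few lines of NumPy. This is a simplified sketch under our own assumptions (mean+max dual-path pooling, cosine-similarity ranking), not the released BEAVER implementation:

```python
import numpy as np

def page_tensors(token_embs, page_size=16):
    """Dual-path pooling sketch: summarize each fixed-size page of token
    embeddings by concatenating its mean-pooled and max-pooled vectors,
    yielding a dense (n_pages, 2*dim) tensor."""
    pages = [token_embs[i:i + page_size]
             for i in range(0, len(token_embs), page_size)]
    return np.stack([np.concatenate([p.mean(0), p.max(0)]) for p in pages])

def select_pages(pages, query, k=2):
    """Rank pages by cosine similarity to the query vector and keep the
    top-k, returned in original document order (a stand-in for BEAVER's
    hybrid semantic/lexical planner)."""
    sims = pages @ query / (np.linalg.norm(pages, axis=1)
                            * np.linalg.norm(query) + 1e-9)
    keep = np.argsort(sims)[-k:]
    return sorted(keep.tolist())
```

Selecting whole pages (and smoothing at sentence boundaries, which this sketch omits) is what avoids the semantic fragmentation of per-token pruning.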
[51] Structured Prompting for Arabic Essay Proficiency: A Trait-Centric Evaluation Approach
Salim Al Mandhari, Hieu Pham Dinh, Mo El-Haj, Paul Rayson
Main category: cs.CL
TL;DR: Novel prompt engineering framework for Arabic Automatic Essay Scoring using LLMs with three-tier prompting strategy (standard, hybrid, rubric-guided) for trait-specific evaluation.
Details
Motivation: Addresses the scarcity of scalable, linguistically informed AES tools for Arabic, particularly in low-resource educational contexts, by leveraging LLMs for trait-specific essay scoring.
Method: Three-tier prompting strategy: standard, hybrid (simulating multi-agent evaluation with trait-specialist raters), and rubric-guided (incorporating scored exemplars). Evaluated eight LLMs on the QAES dataset in zero-shot and few-shot settings.
Result: Fanar-1-9B-Instruct achieved highest trait level agreement (QWK = 0.28, CI = 0.41). Rubric-guided prompting yielded consistent gains across all traits and models, with discourse-level traits (Development and Style) showing greatest improvements.
Conclusion: Structured prompting, not model scale alone, enables effective AES in Arabic. First comprehensive framework for proficiency-oriented Arabic AES, setting foundation for scalable assessment in low-resource educational contexts.
Abstract: This paper presents a novel prompt engineering framework for trait specific Automatic Essay Scoring (AES) in Arabic, leveraging large language models (LLMs) under zero-shot and few-shot configurations. Addressing the scarcity of scalable, linguistically informed AES tools for Arabic, we introduce a three-tier prompting strategy (standard, hybrid, and rubric-guided) that guides LLMs in evaluating distinct language proficiency traits such as organization, vocabulary, development, and style. The hybrid approach simulates multi-agent evaluation with trait specialist raters, while the rubric-guided method incorporates scored exemplars to enhance model alignment. In zero and few-shot settings, we evaluate eight LLMs on the QAES dataset, the first publicly available Arabic AES resource with trait level annotations. Experimental results using Quadratic Weighted Kappa (QWK) and Confidence Intervals show that Fanar-1-9B-Instruct achieves the highest trait level agreement in both zero and few-shot prompting (QWK = 0.28 and CI = 0.41), with rubric-guided prompting yielding consistent gains across all traits and models. Discourse-level traits such as Development and Style showed the greatest improvements. These findings confirm that structured prompting, not model scale alone, enables effective AES in Arabic. Our study presents the first comprehensive framework for proficiency oriented Arabic AES and sets the foundation for scalable assessment in low resource educational contexts.
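Quadratic Weighted Kappa, the agreement metric reported above, is standard and easy to compute from two integer score vectors:

```python
import numpy as np

def quadratic_weighted_kappa(rater_a, rater_b, n_classes):
    """QWK between two integer score vectors with values in [0, n_classes).

    1.0 is perfect agreement, 0.0 is chance-level, negative values are
    worse than chance.
    """
    # Observed confusion matrix between the two raters.
    O = np.zeros((n_classes, n_classes))
    for a, b in zip(rater_a, rater_b):
        O[a, b] += 1
    # Quadratic disagreement weights: zero on the diagonal, growing
    # with the squared score distance.
    i, j = np.indices((n_classes, n_classes))
    W = (i - j) ** 2 / (n_classes - 1) ** 2
    # Expected matrix under independence, from the marginal histograms.
    E = np.outer(O.sum(1), O.sum(0)) / O.sum()
    return 1 - (W * O).sum() / (W * E).sum()
```

Against this scale, the reported trait-level QWK of 0.28 indicates fairly modest agreement with human raters, which is why the paper leans on confidence intervals and relative comparisons across prompting strategies.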
[52] DataProphet: Demystifying Supervision Data Generalization in Multimodal LLMs
Xuan Qi, Luxi He, Dan Roth, Xingyu Fu
Main category: cs.CL
TL;DR: DATAPROPHET: A training-free metric for predicting which vision-language datasets will improve multimodal LLM performance on target benchmarks before training.
Details
Motivation: Current MLLM training selects supervision data based on intuitive similarity to target benchmarks, but it's unclear whether this reliably predicts performance gains. The paper aims to estimate dataset influence before any training is performed.
Method: Analyzed transfer across 14 vision-language datasets spanning 7 tasks and found intuitive similarity unreliable. Proposed DATAPROPHET, a training-free metric combining multimodal perplexity, similarity, and data diversity to rank supervision datasets.
Result: DATAPROPHET achieves 86.0% Kendall’s tau correlation with actual post-training performance gains, yields up to 6.9% improvement over uniform selection, 1.4% over training-based baseline, and 0.2% above oracle selection.
Conclusion: Intuitive task similarity is unreliable for predicting MLLM transferability. DATAPROPHET provides an effective training-free method for selecting supervision data that strongly correlates with actual performance improvements.
Abstract: Conventional wisdom for selecting supervision data for multimodal large language models (MLLMs) is to prioritize datasets that appear similar to the target benchmark, such as text-intensive or vision-centric tasks. However, it remains unclear whether such intuitive similarity reliably predicts downstream performance gains. In this work, we take a first step toward answering a practical question: can we estimate the influence of a training dataset on a target benchmark before any training is performed? To investigate this question, we conduct an in-depth analysis of transfer across 14 vision-language datasets spanning 7 diverse tasks. Our results show that intuitive task similarity is an unreliable predictor of transferability, and that generalization depends more on the specific dataset than on its broad task category. Motivated by this finding, we propose DATAPROPHET, a simple and effective training-free metric that combines multimodal perplexity, similarity, and data diversity. Experiments show that DATAPROPHET produces supervision-data rankings that strongly correlate with rankings based on actual post-training performance gains, achieving a Kendall’s tau of 86.0%. Moreover, DATAPROPHET enables better supervision-data selection, yielding up to 6.9% improvement over uniform selection, 1.4% over a state-of-the-art training-based baseline, and 0.2% above oracle selection based on experimental performance. Our code and data will be released.
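Kendall's tau, the rank correlation used to validate DATAPROPHET, compares a metric-based ranking of datasets against the ranking by actual post-training gains. A minimal no-ties implementation:

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall rank correlation between two score lists (no ties assumed):
    (concordant pairs - discordant pairs) / total pairs. A pair (i, j) is
    concordant when both lists order i and j the same way."""
    pairs = list(combinations(range(len(x)), 2))
    concordant = sum((x[i] - x[j]) * (y[i] - y[j]) > 0 for i, j in pairs)
    discordant = sum((x[i] - x[j]) * (y[i] - y[j]) < 0 for i, j in pairs)
    return (concordant - discordant) / len(pairs)
```

Here `x` would be the DATAPROPHET scores per candidate dataset and `y` the measured post-training gains; a tau of 86.0% means the training-free ranking almost always orders dataset pairs the same way the expensive trained ranking does. (In practice one would use `scipy.stats.kendalltau`, which also handles ties.)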
[53] EvoTaxo: Building and Evolving Taxonomy from Social Media Streams
Yiyang Li, Tianyi Ma, Yanfang Ye
Main category: cs.CL
TL;DR: EvoTaxo is an LLM-based framework for building and evolving taxonomies from temporally ordered social media streams, addressing challenges of short, noisy, and dynamic content through structured draft actions, dual-view clustering, and concept memory banks.
Details
Motivation: Social media posts are short, noisy, semantically entangled, and temporally dynamic, making taxonomy construction challenging. Existing methods designed for static corpora struggle to balance robustness, scalability, and sensitivity to evolving discourse.
Method: EvoTaxo converts each post into structured draft actions over the current taxonomy, accumulates structural evidence over time windows, consolidates candidate edits through dual-view clustering (semantic similarity + temporal locality), and uses refinement-and-arbitration with concept memory banks to preserve semantic boundaries.
Result: Experiments on two Reddit corpora show EvoTaxo produces more balanced taxonomies than baselines with clearer post-to-leaf assignment, better corpus coverage at comparable taxonomy size, and stronger structural quality. Case study on /r/ICE_Raids demonstrates meaningful temporal shift capture.
Conclusion: EvoTaxo effectively addresses challenges of taxonomy induction from social media streams by combining LLM-based structured actions with temporal evidence accumulation and dual-view clustering, enabling robust evolution of taxonomies over time.
Abstract: Constructing taxonomies from social media corpora is challenging because posts are short, noisy, semantically entangled, and temporally dynamic. Existing taxonomy induction methods are largely designed for static corpora and often struggle to balance robustness, scalability, and sensitivity to evolving discourse. We propose EvoTaxo, a LLM-based framework for building and evolving taxonomies from temporally ordered social media streams. Rather than clustering raw posts directly, EvoTaxo converts each post into a structured draft action over the current taxonomy, accumulates structural evidence over time windows, and consolidates candidate edits through dual-view clustering that combines semantic similarity with temporal locality. A refinement-and-arbitration procedure then selects reliable edits before execution, while each node maintains a concept memory bank to preserve semantic boundaries over time. Experiments on two Reddit corpora show that EvoTaxo produces more balanced taxonomies than baselines, with clearer post-to-leaf assignment, better corpus coverage at comparable taxonomy size, and stronger structural quality. A case study on the Reddit community /r/ICE_Raids further shows that EvoTaxo captures meaningful temporal shifts in discourse. Our codebase is available here.
[54] TAB-AUDIT: Detecting AI-Fabricated Scientific Tables via Multi-View Likelihood Mismatch
Shuo Huang, Yan Pen, Lizhen Qu
Main category: cs.CL
TL;DR: A systematic study on detecting AI-generated fabricated scientific tables in NLP papers, introducing FabTab benchmark and TAB-AUDIT framework with within-table mismatch feature for forensic detection.
Details
Motivation: Growing concerns about AI-generated fabricated scientific manuscripts breaching academic integrity, with tables serving as critical evidence for claims in empirical papers.
Method: Constructed the FabTab benchmark dataset (1,173 AI-generated + 1,215 human-authored NLP papers), identified systematic differences between fabricated and real tables, and operationalized them into discriminative features within the TAB-AUDIT framework. The key feature is the within-table mismatch: the perplexity gap between a table's skeleton and its numerical content.
Result: RandomForest on these features significantly outperforms prior SOTA methods, achieving 0.987 AUROC in-domain and 0.883 AUROC out-of-domain.
Conclusion: Experimental tables provide critical forensic signals for detecting AI-generated scientific fraud, offering new benchmark for future research in academic integrity protection.
Abstract: AI-generated fabricated scientific manuscripts raise growing concerns with large-scale breaches of academic integrity. In this work, we present the first systematic study on detecting AI-generated fabricated scientific tables in empirical NLP papers, as information in tables serves as critical evidence for claims. We construct FabTab, the first benchmark dataset of fabricated manuscripts with tables, comprising 1,173 AI-generated papers and 1,215 human-authored ones in empirical NLP. Through a comprehensive analysis, we identify systematic differences between fabricated and real tables and operationalize them into a set of discriminative features within the TAB-AUDIT framework. The key feature, within-table mismatch, captures the perplexity gap between a table’s skeleton and its numerical content. Experimental results show that a RandomForest built on these features significantly outperforms prior state-of-the-art methods, achieving 0.987 AUROC in-domain and 0.883 AUROC out-of-domain. Our findings highlight experimental tables as a critical forensic signal for detecting AI-generated scientific fraud and provide a new benchmark for future research.
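The within-table mismatch feature is described as a perplexity gap between a table's skeleton and its numerical content. A minimal sketch of that computation, assuming per-token negative log-likelihoods have already been obtained from a language model (the sign convention of the gap is our assumption):

```python
import math

def perplexity(neg_log_likelihoods):
    """Perplexity from per-token negative log-likelihoods (natural log):
    exp of the mean NLL."""
    return math.exp(sum(neg_log_likelihoods) / len(neg_log_likelihoods))

def within_table_mismatch(skeleton_nll, numbers_nll):
    """Sketch of the within-table mismatch feature: the perplexity gap
    between a table's numeric content and its skeleton (headers/captions).
    The intuition reported in the paper is that fabricated tables pair a
    fluent skeleton with statistically implausible numbers."""
    return perplexity(numbers_nll) - perplexity(skeleton_nll)
```

The real feature set in TAB-AUDIT is richer than this single gap; the classifier in the paper is a Random Forest over all such features.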
[55] LoopRPT: Reinforcement Pre-Training for Looped Language Models
Guo Tang, Shixin Jiang, Heng Chang, Nuo Chen, Yuhan Li, Huiming Fan, Jia Li, Ming Liu, Bing Qin
Main category: cs.CL
TL;DR: LoopRPT: A reinforcement pre-training framework for looped language models that shapes intermediate latent representations rather than just output tokens, improving reasoning efficiency and quality.
Details
Motivation: Existing RL methods for language models focus on output tokens, creating a structural mismatch with looped architectures where reasoning happens implicitly in latent space. There's a need for RL paradigms that can directly shape intermediate representations in looped models.
Method: LoopRPT reframes next-token prediction as next-token reasoning and assigns reinforcement signals directly to latent steps using an EMA teacher reference and noisy latent rollouts. This enables RL to directly shape intermediate representations and compress reasoning into fewer iterations.
Result: LoopRPT consistently improves per-step representation quality and achieves Pareto dominance in accuracy-computation trade-offs across multiple model scales. Significant gains on hard tokens show it enhances early-stage reasoning rather than just encouraging premature exits.
Conclusion: Reinforcement pre-training is a principled paradigm for learning efficient latent reasoning in looped language models, enabling better reasoning compression and quality.
Abstract: Looped language models (LoopLMs) perform iterative latent computation to refine internal representations, offering a promising alternative to explicit chain-of-thought (CoT) reasoning. However, existing reinforcement learning (RL) paradigms primarily target output tokens, creating a structural mismatch with looped architectures whose reasoning unfolds implicitly. In this work, we propose LoopRPT, a reinforcement pre-training framework tailored for LoopLMs. By reframing next-token prediction as a next-token reasoning task, LoopRPT assigns reinforcement signals directly to latent steps using an EMA teacher reference and noisy latent rollouts. This formulation enables RL to directly shape intermediate representations, compressing effective reasoning into fewer iterations. We instantiate LoopRPT on the Ouro architecture across multiple model scales. Results demonstrate that LoopRPT consistently improves per-step representation quality, achieving Pareto dominance in accuracy-computation trade-offs. Notably, significant gains on hard tokens indicate that LoopRPT enhances early-stage reasoning rather than merely encouraging premature exits. Our findings highlight reinforcement pre-training as a principled paradigm for learning efficient latent reasoning in LoopLMs.
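The EMA teacher mentioned above is a standard construction: a slowly updated copy of the model used as a stable reference target. A one-line sketch (the decay value and update granularity here are our assumptions, not LoopRPT's):

```python
def ema_update(teacher, student, decay=0.99):
    """Exponential-moving-average teacher update over parameter vectors:
    teacher <- decay * teacher + (1 - decay) * student.
    A higher decay makes the teacher change more slowly, giving a more
    stable reference for scoring latent rollouts."""
    return [decay * t + (1 - decay) * s for t, s in zip(teacher, student)]
```

In LoopRPT this reference is what the noisy latent rollouts are scored against, so the reinforcement signal for each latent step does not chase a moving target.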
[56] PoC: Performance-oriented Context Compression for Large Language Models via Performance Prediction
Runsong Zhao, Shilei Liu, Jiwei Tang, Langming Liu, Haibin Chen, Weidong Zhang, Yujin Yuan, Tong Xiao, Jingbo Zhu, Wenbo Su, Bo Zheng
Main category: cs.CL
TL;DR: PoC introduces performance-oriented context compression where developers specify acceptable performance floors instead of compression ratios, using lightweight predictors to find optimal compression that meets constraints.
Details
Motivation: Existing context compression methods with fixed compression ratios cause unpredictable performance degradation, hindering reliable deployment of LLMs. There's a need for a more reliable approach that guarantees performance while maximizing compression.
Method: PoC uses lightweight performance predictors to automatically find the most aggressive compression ratio that satisfies a specified performance constraint. Two variants: context-agnostic predictor (simple) and context-aware predictor (considers input's inherent compressibility). These predictors guide off-the-shelf compressors.
Result: On QA and summarization benchmarks, context-aware predictor achieves lower performance prediction error than context-agnostic predictor. Context-aware PoC attains superior overall performance compared to fixed-ratio compression methods.
Conclusion: PoC enables more reliable, efficient, and performance-aware deployment of context compression for LLMs by shifting from compression-ratio-oriented to performance-oriented paradigm, paving way for practical adoption.
Abstract: While context compression can mitigate the growing inference costs of Large Language Models (LLMs) by shortening contexts, existing methods that specify a target compression ratio or length suffer from unpredictable performance degradation, hindering their reliable deployment. We introduce a paradigm shift to Performance-oriented Context Compression (PoC), where developers specify an acceptable performance floor instead of a compression ratio. PoC employs a lightweight performance predictor to automatically find the most aggressive compression ratio that satisfies this constraint before steering an off-the-shelf compressor. We design and compare two predictor variants: a simple context-agnostic predictor and a more sophisticated context-aware one that considers the input’s inherent compressibility. On both question-answering and summarization benchmarks, the context-aware predictor consistently achieves lower performance prediction error than the context-agnostic predictor, while the resulting context-aware PoC attains a superior overall performance. Our work paves the way for a more reliable, efficient, and performance-aware deployment of context compression for LLMs.
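The core mechanic — finding the most aggressive compression that still meets a performance floor — can be sketched as a search over ratios guided by the predictor. The binary search and the monotone toy predictor below are illustrative assumptions; the paper does not specify the search procedure:

```python
def max_compression(predict_perf, floor, lo=0.0, hi=1.0, tol=1e-3):
    """Find the smallest keep-ratio whose predicted performance still
    meets the floor, assuming performance rises with kept context."""
    if predict_perf(hi) < floor:
        raise ValueError("floor unreachable even without compression")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if predict_perf(mid) >= floor:
            hi = mid   # constraint met: try compressing harder
        else:
            lo = mid   # too aggressive: keep more context
    return hi

# Toy predictor: performance grows linearly with the kept fraction.
ratio = max_compression(lambda r: 0.5 + 0.5 * r, floor=0.8)
print(round(ratio, 2))  # ≈ 0.6: keep ~60% of the context
```

The chosen ratio would then be handed to an off-the-shelf compressor, as the paper describes.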
[57] Rethinking Ground Truth: A Case Study on Human Label Variation in MLLM Benchmarking
Tomas Ruiz, Tanalp Agustoslu, Carsten Schwemmer
Main category: cs.CL
TL;DR: The paper introduces an evaluation protocol for MLLMs that accounts for human label variation (agreement vs disagreement), finding that larger models perform best on high-agreement data but often underperform medium-sized models on ambiguous/disagreement cases.
Details
Motivation: Human label variation (systematic differences in annotator judgments) is underexplored in MLLM benchmarks, and current evaluations based solely on consensus labels may overstate model capabilities, especially in subjective domains like content moderation.
Method: Proposed an evaluation protocol for MLLMs that explicitly accounts for both human label agreement and disagreement conditions. Applied this to two state-of-the-art MLLM families (Gemma 3, Qwen 2.5 VL) using non-aggregated human annotations from a social media content classification dataset.
Result: Larger models tend to perform best on high-agreement subsets but often underperform medium-sized models when human disagreement is high. Parameter count alone doesn’t determine sensitivity to ambiguity and subjectivity.
Conclusion: Benchmarks based solely on consensus labels can overstate model capabilities in subjective domains. Incorporating human label variation yields more realistic and robust assessments of MLLMs, particularly important for content moderation pipelines.
Abstract: Human Label Variation (HLV), i.e. systematic differences among annotators’ judgments, remains underexplored in benchmarks despite rapid progress in large language model (LLM) development. We address this gap by introducing an evaluation protocol for multimodal large language model (MLLM) benchmarking that explicitly accounts for two conditions: (1) human label agreement and (2) disagreement. We apply this protocol to two state-of-the-art MLLM families (Gemma 3, Qwen 2.5 VL) using non-aggregated human annotations from a social media content classification dataset. Across tasks, we find that larger models tend to perform best on high-agreement subsets, yet often underperform medium-sized models when human disagreement is high, indicating that parameter count alone does not determine sensitivity to ambiguity and subjectivity. These results show that benchmarks based solely on consensus labels can overstate model capabilities in such domains and that incorporating human label variation yields more realistic and robust assessments of MLLMs in content moderation pipelines.
[58] Neither Here Nor There: Cross-Lingual Representation Dynamics of Code-Mixed Text in Multilingual Encoders
Debajyoti Mazumder, Divyansh Pathak, Prashant Kodali, Jasabanta Patro
Main category: cs.CL
TL;DR: The paper investigates how multilingual language models represent code-mixed text (Hindi-English), finding they align code-mixed inputs poorly with constituent languages, and proposes a trilingual alignment objective to improve cross-lingual understanding.
Details
Motivation: While multilingual encoder models are widely used for code-mixed analysis tasks, there's limited understanding of how they internally represent code-mixed inputs and whether these representations meaningfully connect to the constituent languages being mixed. The authors aim to uncover these internal representations and improve cross-lingual alignment.
Method: Constructed a unified trilingual corpus of parallel English, Hindi (Devanagari), and Romanized code-mixed sentences. Used probing techniques including CKA (Centered Kernel Alignment), token-level saliency, and entropy-based uncertainty analysis to examine cross-lingual representation alignment across standard multilingual encoders and their code-mixed adapted variants. Introduced a trilingual post-training alignment objective to bring code-mixed representations closer to both constituent languages simultaneously.
Result: Standard models align English and Hindi well, but code-mixed inputs remain loosely connected to either language. Continued pre-training on code-mixed data improves English-code-mixed alignment at the cost of English-Hindi alignment. Models process code-mixed text through an English-dominant semantic subspace, while native-script Hindi provides complementary signals that reduce representational uncertainty. The proposed trilingual alignment objective yields more balanced cross-lingual alignment and downstream gains on sentiment analysis and hate speech detection.
Conclusion: Grounding code-mixed representations in their constituent languages meaningfully helps cross-lingual understanding. The findings reveal important asymmetries in how models process code-mixed text and demonstrate that targeted alignment objectives can improve multilingual model performance on code-mixed tasks.
Abstract: Multilingual encoder-based language models are widely adopted for code-mixed analysis tasks, yet we know surprisingly little about how they represent code-mixed inputs internally - or whether those representations meaningfully connect to the constituent languages being mixed. Using Hindi-English as a case study, we construct a unified trilingual corpus of parallel English, Hindi (Devanagari), and Romanized code-mixed sentences, and probe cross-lingual representation alignment across standard multilingual encoders and their code-mixed adapted variants via CKA, token-level saliency, and entropy-based uncertainty analysis. We find that while standard models align English and Hindi well, code-mixed inputs remain loosely connected to either language - and that continued pre-training on code-mixed data improves English-code-mixed alignment at the cost of English-Hindi alignment. Interpretability analyses further reveal a clear asymmetry: models process code-mixed text through an English-dominant semantic subspace, while native-script Hindi provides complementary signals that reduce representational uncertainty. Motivated by these findings, we introduce a trilingual post-training alignment objective that brings code-mixed representations closer to both constituent languages simultaneously, yielding more balanced cross-lingual alignment and downstream gains on sentiment analysis and hate speech detection - showing that grounding code-mixed representations in their constituent languages meaningfully helps cross-lingual understanding.
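CKA, one of the probes used above, measures similarity between two sets of representations of the same inputs. A minimal sketch of the linear variant (the random toy data stands in for sentence embeddings; the paper does not say which CKA variant it uses):

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between representation matrices (n_samples x dim)."""
    X = X - X.mean(axis=0)  # center features
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(X.T @ Y, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))  # e.g. embeddings of 100 English sentences
Y = rng.normal(size=(100, 16))  # unrelated representations
print(round(linear_cka(X, X), 3))  # 1.0: a space aligns perfectly with itself
print(linear_cka(X, Y) < 0.5)      # True: unrelated spaces score low
```

A score near 1 for English-Hindi pairs versus a low score for code-mixed pairs is the kind of asymmetry the paper reports.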
[59] FrameNet Semantic Role Classification by Analogy
Van-Duy Ngo, Stergos Afantenos, Emiliano Lorini, Miguel Couceiro
Main category: cs.CL
TL;DR: A novel approach to Semantic Role Classification in FrameNet using analogical relations, transforming the task into binary classification with a lightweight ANN that achieves state-of-the-art results without explicit semantic role training.
Details
Motivation: To improve Semantic Role Classification by reformulating it as an analogical reasoning problem, allowing for more efficient and effective classification without requiring explicit semantic role information during training.
Method: Defines analogies as formal relations over frame-evoking lexical units and frame element pairs, constructs a new dataset with binary labels (valid/invalid analogical instances), trains a lightweight ANN for binary classification, and recovers semantic roles during inference through probability distributions over candidates via random sampling and analogical transfer.
Result: Surpasses previous state-of-the-art results while maintaining computational efficiency and frugality, with rapid convergence and minimal parameters.
Conclusion: The analogical approach to Semantic Role Classification is effective and efficient, demonstrating that semantic roles can be recovered without explicit training on role information through analogical reasoning.
Abstract: In this paper, we adopt a relational view of analogies applied to Semantic Role Classification in FrameNet. We define analogies as formal relations over the Cartesian product of frame-evoking lexical unit (LU) and frame element (FE) pairs, which we use to construct a new dataset. Each element of this binary relation is labelled as a valid analogical instance if the frame elements share the same semantic role, or as invalid otherwise. This formulation allows us to transform Semantic Role Classification into binary classification and train a lightweight Artificial Neural Network (ANN) that exhibits rapid convergence with minimal parameters. Unconventionally, no Semantic Role information is introduced to the neural network during training. We recover semantic roles during inference by computing probability distributions over candidates of all semantic roles within a given frame through random sampling and analogical transfer. This approach allows us to surpass previous state-of-the-art results while maintaining computational efficiency and frugality.
[60] Semantic Delta: An Interpretable Signal Differentiating Human and LLMs Dialogue
Riccardo Scantamburlo, Mauro Mezzanzana, Giacomo Buonanno, Francesco Bertolotti
Main category: cs.CL
TL;DR: The paper introduces “semantic delta” - a lightweight metric based on thematic intensity distribution to distinguish human-written vs LLM-generated dialogue, finding LLMs exhibit stronger thematic concentration than human discourse.
Details
Motivation: To develop an interpretable statistical feature for distinguishing human-written and LLM-generated dialogue, addressing the need for better understanding of LLM behavioral mimicry and providing complementary detection signals.
Method: Uses Empath lexical analysis to map texts to thematic intensity scores, defines semantic delta as the difference between the two most dominant category intensities, and compares LLM-generated conversational data against heterogeneous human corpora using a Welch t-test.
Result: AI-generated texts consistently produce higher semantic deltas than human texts, indicating more rigid topic structure, while human dialogue shows broader, more balanced semantic spread.
Conclusion: Semantic delta provides a computationally inexpensive zero-shot metric for LLM detection that can complement existing techniques, revealing quantifiable differences in thematic distribution between human and AI conversational dynamics.
Abstract: Do LLMs talk like us? This question intrigues a multitude of scholars and is relevant in many fields, from education to academia. This work presents an interpretable statistical feature for distinguishing human-written and LLM-generated dialogue. We introduce a lightweight metric derived from semantic category distributions. Using the Empath lexical analysis framework, each text is mapped to a set of thematic intensity scores. We define semantic delta as the difference between the two most dominant category intensities within a dialogue, hypothesizing that LLM outputs exhibit stronger thematic concentration than human discourse. To evaluate this hypothesis, conversational data were generated from multiple LLM configurations and compared against heterogeneous human corpora, including scripted dialogue, literary works, and online discussions. A Welch t-test was applied to the resulting distributions of semantic delta values. Results show that AI-generated texts consistently produce higher deltas than human texts, indicating a more rigid topic structure, whereas human dialogue displays a broader and more balanced semantic spread. Rather than replacing existing detection techniques, the proposed zero-shot metric provides a computationally inexpensive complementary signal that can be integrated into ensemble detection systems. These findings also contribute to the broader empirical understanding of LLM behavioural mimicry and suggest that thematic distribution constitutes a quantifiable dimension along which current models fall short of human conversational dynamics.
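The metric itself is a one-liner over per-category intensity scores. A sketch with hypothetical Empath-style scores (the category names and values below are invented for illustration):

```python
def semantic_delta(scores):
    """Semantic delta: gap between the two most dominant thematic
    intensity scores (as produced by e.g. the Empath framework)."""
    top_two = sorted(scores.values(), reverse=True)[:2]
    return top_two[0] - top_two[1]

# Hypothetical intensity scores for two dialogues.
llm_like   = {"technology": 0.60, "work": 0.15, "family": 0.10}
human_like = {"technology": 0.30, "work": 0.25, "family": 0.20}
print(round(semantic_delta(llm_like), 2))    # 0.45: concentrated on one theme
print(round(semantic_delta(human_like), 2))  # 0.05: flatter thematic spread
```

The paper's hypothesis is exactly this contrast: LLM outputs tend toward the first pattern, human dialogue toward the second.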
[61] Span-Level Machine Translation Meta-Evaluation
Stefano Perrella, Eric Morales Agostinho, Hugo Zaragoza
Main category: cs.CL
TL;DR: Proposes MPP (match with partial overlap and partial credit) as a robust meta-evaluation strategy for MT error detection systems, addressing inconsistencies in existing evaluation methods.
Details
Motivation: While MT error detection has advanced to locate errors and assign categories/severity, there's no reliable way to measure the evaluation capabilities of such auto-evaluators, as existing techniques yield inconsistent rankings.Method: Investigates different span-level precision/recall/F-score implementations, shows their inconsistencies, and proposes MPP with micro-averaging as a robust meta-evaluation strategy.
Result: Demonstrates that seemingly similar evaluation approaches produce substantially different rankings, identifies unsuitable techniques, and validates MPP as a consistent evaluation method.
Conclusion: MPP provides a reliable meta-evaluation framework for MT error detection, enabling proper assessment of state-of-the-art systems and released as public code.
Abstract: Machine Translation (MT) and automatic MT evaluation have improved dramatically in recent years, enabling numerous novel applications. Automatic evaluation techniques have evolved from producing scalar quality scores to precisely locating translation errors and assigning them error categories and severity levels. However, it remains unclear how to reliably measure the evaluation capabilities of auto-evaluators that do error detection, as no established technique exists in the literature. This work investigates different implementations of span-level precision, recall, and F-score, showing that seemingly similar approaches can yield substantially different rankings, and that certain widely-used techniques are unsuitable for evaluating MT error detection. We propose “match with partial overlap and partial credit” (MPP) with micro-averaging as a robust meta-evaluation strategy and release code for its use publicly. Finally, we use MPP to assess the state of the art in MT error detection.
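One plausible reading of "match with partial overlap and partial credit" is that each predicted span earns credit proportional to how much it overlaps a gold span, micro-averaged over all spans. The sketch below implements that reading; the paper's exact matching and credit scheme may differ:

```python
def overlap(a, b):
    """Length of overlap between two half-open spans (start, end)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def mpp_f1(pred_spans, gold_spans):
    """Span F1 with partial overlap and partial credit: a matched pair
    earns credit proportional to overlap over the longer span."""
    def credit(spans, others):
        return sum(max((overlap(s, o) / max(s[1] - s[0], o[1] - o[0], 1)
                        for o in others), default=0.0)
                   for s in spans)
    precision = credit(pred_spans, gold_spans) / len(pred_spans) if pred_spans else 0.0
    recall = credit(gold_spans, pred_spans) / len(gold_spans) if gold_spans else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# A prediction covering half of the single gold span earns half credit.
print(mpp_f1([(0, 5)], [(0, 10)]))   # 0.5
print(mpp_f1([(0, 10)], [(0, 10)]))  # 1.0: exact match
```

Under strict exact-span matching the first case would score 0, which illustrates why partial credit yields more stable rankings of error detectors.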
[62] Translation from the Information Bottleneck Perspective: an Efficiency Analysis of Spatial Prepositions in Bitexts
Antoine Taroni, Ludovic Moncla, Frederique Laforest
Main category: cs.CL
TL;DR: This paper applies Information Bottleneck theory to translation, treating source sentences as stimuli and target sentences as compressed meanings, finding that human translations of spatial prepositions across languages cluster near optimal efficiency frontiers.
Details
Motivation: The paper aims to test whether the Information Bottleneck framework, which predicts natural language systems balance informativity and simplicity, applies to linguistic stimuli in translation contexts, particularly for spatial prepositions across languages.Method: Framed translation as an IB optimization problem using bitexts from English, German, and Serbian translations of a French novel. Conducted pile-sorting pilot study (N=35) to obtain similarity judgments of preposition pairs, trained a low-rank projection model (D=5) to predict these judgments, and compared attested translations to counterfactual alternatives against IB optimal frontiers.
Result: The projection model achieved Spearman correlation of 0.78 for predicting similarity judgments. Attested translations of spatial prepositions lie closer to the IB optimal frontier than counterfactual alternatives, providing evidence for communicative efficiency pressure in human translation.
Conclusion: Human translators exhibit communicative efficiency pressure in the spatial domain, and translation can serve as a window into cognitive efficiency pressures shaping cross-linguistic semantic systems.
Abstract: Efficient communication requires balancing informativity and simplicity when encoding meanings. The Information Bottleneck (IB) framework captures this trade-off formally, predicting that natural language systems cluster near an optimal accuracy-complexity frontier. While supported in visual domains such as colour and motion, linguistic stimuli such as words in sentential context remain unexplored. We address this gap by framing translation as an IB optimisation problem, treating source sentences as stimuli and target sentences as compressed meanings. This allows IB analyses to be performed directly on bitexts rather than controlled naming experiments. We applied this to spatial prepositions across English, German and Serbian translations of a French novel. To estimate informativity, we conducted a pile-sorting pilot-study (N=35) and obtained similarity judgements of pairs of prepositions. We trained a low-rank projection model (D=5) that predicts these judgements (Spearman correlation: 0.78). Attested translations of prepositions lie closer to the IB optimal frontier than counterfactual alternatives, offering preliminary evidence that human translators exhibit communicative efficiency pressure in the spatial domain. More broadly, this work suggests that translation can serve as a window into the cognitive efficiency pressures shaping cross-linguistic semantic systems.
[63] SAGE: Sustainable Agent-Guided Expert-tuning for Culturally Attuned Translation in Low-Resource Southeast Asia
Zhixiang Lu, Chong Zhang, Yulong Li, Angelos Stefanidis, Anh Nguyen, Imran Razzak, Jionglong Su, Zhengyong Jiang
Main category: cs.CL
TL;DR: SAGE framework uses RL agents to curate compact, culturally relevant datasets for efficient LLM fine-tuning, achieving SOTA translation for low-resource Southeast Asian languages with 97% less data and 95% less energy.
Details
Motivation: Address the linguistic divide in low-resource Southeast Asia by overcoming dual challenges: scarcity of high-quality culturally relevant data and prohibitive energy costs of training LLMs on massive web corpora, balancing digital inclusion with environmental sustainability.Method: Sustainable Agent-Guided Expert-tuning (SAGE) uses reinforcement learning agent optimized via Group Relative Policy Optimization (GRPO) to autonomously curate compact training sets using semantic reward signals from expert-constructed community dialogues, then fine-tunes open-source LLMs with Low-Rank Adaptation (LoRA).
Result: Achieved state-of-the-art performance on BLEU-4 and COMET-22 metrics for English to seven low-resource Southeast Asian language translation, while reducing data usage by 97.1% and training energy consumption by 95.2% compared to baselines trained on full datasets.
Conclusion: SAGE provides a scalable, responsible pathway to bridge the digital divide in the Global South by delivering high-performance translation models with minimal environmental footprint through energy-aware data curation and efficient fine-tuning.
Abstract: The vision of an inclusive World Wide Web is impeded by a severe linguistic divide, particularly for communities in low-resource regions of Southeast Asia. While large language models (LLMs) offer a potential solution for translation, their deployment in data-poor contexts faces a dual challenge: the scarcity of high-quality, culturally relevant data and the prohibitive energy costs of training on massive, noisy web corpora. To resolve the tension between digital inclusion and environmental sustainability, we introduce Sustainable Agent-Guided Expert-tuning (SAGE). This framework pioneers an energy-aware paradigm that prioritizes the “right data” over “big data”. Instead of carbon-intensive training on unfiltered datasets, SAGE employs a reinforcement learning (RL) agent, optimized via Group Relative Policy Optimization (GRPO), to autonomously curate a compact training set. The agent utilizes a semantic reward signal derived from a small, expert-constructed set of community dialogues to filter out noise and cultural misalignment. We then efficiently fine-tune open-source LLMs on this curated data using Low-Rank Adaptation (LoRA). We applied SAGE to translation tasks between English and seven low-resource languages (LRLs) in Southeast Asia. Our approach establishes new state-of-the-art performance on BLEU-4 and COMET-22 metrics, effectively capturing local linguistic nuances. Crucially, SAGE surpasses baselines trained on full datasets while reducing data usage by 97.1% and training energy consumption by 95.2%. By delivering high-performance models with a minimal environmental footprint, SAGE offers a scalable and responsible pathway to bridge the digital divide in the Global South.
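GRPO, used above to optimize the curation agent, computes advantages relative to a group of sampled rollouts rather than a learned value function. A minimal sketch of that advantage step (implementations vary in details such as population vs. sample standard deviation; the reward values are invented):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages as in GRPO: standardize each sampled
    rollout's reward against its group's mean and std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a constant group
    return [(r - mean) / std for r in rewards]

# Four sampled curation decisions scored by the semantic reward signal.
adv = grpo_advantages([0.2, 0.4, 0.6, 0.8])
print([round(a, 2) for a in adv])  # [-1.34, -0.45, 0.45, 1.34]
```

Rollouts scoring above their group mean get positive advantage and are reinforced; no critic network is needed, which fits SAGE's efficiency goal.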
[64] Hybrid topic modelling for computational close reading: Mapping narrative themes in Pushkin’s Evgenij Onegin
Angelo Maria Sabatini
Main category: cs.CL
TL;DR: Hybrid topic modeling framework combining LDA and sPLS-DA for computational literary analysis of narrative poetry, applied to Pushkin’s “Eugene Onegin” in Italian translation.
Details
Motivation: To develop a computational framework for analyzing thematic structure and longitudinal dynamics in narrative poetry that integrates unsupervised and supervised approaches, addressing small-corpus instability while maintaining interpretability.Method: Combines Latent Dirichlet Allocation (LDA) for unsupervised topic modeling with sparse Partial Least Squares Discriminant Analysis (sPLS-DA) as supervised probe. Uses multi-seed consensus protocol for small-corpus stability, segments text into documents of lemmatised content words, and identifies narrative hubs (contiguous stanza groups) to extend bag-of-words to narrative level.
Result: Identified five stable and interpretable topics in Pushkin’s “Eugene Onegin.” The hybrid approach enhanced interpretability by identifying lexical markers that refine each theme, and narrative hubs revealed how thematic mixtures align with the poem’s emotional and structural arc.
Conclusion: The framework offers computational close reading for literary analysis, demonstrating that lightweight probabilistic models can yield reproducible thematic maps of complex poetic narratives, even when stylistic features are abstracted away, providing a transparent template for comparative studies of high-density literary texts.
Abstract: This study presents a hybrid topic modelling framework for computational literary analysis that integrates Latent Dirichlet Allocation (LDA) with sparse Partial Least Squares Discriminant Analysis (sPLS-DA) to model thematic structure and longitudinal dynamics in narrative poetry. As a case study, we analyse Evgenij Onegin, Aleksandr S. Pushkin’s novel in verse, using an Italian translation, testing whether unsupervised and supervised lexical structures converge in a small-corpus setting. The poetic text is segmented into thirty-five documents of lemmatised content words, from which five stable and interpretable topics emerge. To address small-corpus instability, a multi-seed consensus protocol is adopted. Using sPLS-DA as a supervised probe enhances interpretability by identifying lexical markers that refine each theme. Narrative hubs, groups of contiguous stanzas marking key episodes, extend the bag-of-words approach to the narrative level, revealing how thematic mixtures align with the poem’s emotional and structural arc. Rather than replacing traditional literary interpretation, the proposed framework offers a computational form of close reading, illustrating how lightweight probabilistic models can yield reproducible thematic maps of complex poetic narratives, even when stylistic features such as metre, phonology, or native morphology are abstracted away. Despite relying on a single lemmatised translation, the approach provides a transparent methodological template applicable to other high-density literary texts in comparative studies.
[65] When Contextual Inference Fails: Cancelability in Interactive Instruction Following
Natalia Bila, Kata Naszádi, Alexandra Mayn, Christof Monz
Main category: cs.CL
TL;DR: Models struggle to separate literal meaning from contextual inference in collaborative tasks, showing dissociation between detecting speaker unreliability and using that information for efficient communication strategies.
Details
Motivation: To investigate how language models handle the separation between literal interpretation and contextual inference in collaborative tasks, particularly when dealing with underspecified instructions from different types of speakers (cooperative vs. literally reliable).Method: Introduces Build What I Mean (BWIM), an interactive benchmark based on a psycholinguistic paradigm contrasting cooperative and literally reliable speakers. Models must resolve ambiguity by either performing contextual inference or requesting clarification at a communication cost.
Result: State-of-the-art LLMs show dissociation between judgment and action: they can detect speaker unreliability in explicit confidence ratings but fail to exploit this information for efficient clarification behavior, instead showing suboptimal strategies like partner-blind over-clarification and question-averse guessing.
Conclusion: Current LLMs have limitations in integrating pragmatic reasoning with action selection in interactive contexts, highlighting the need for better models of contextual meaning construction in collaborative tasks.
Abstract: We investigate the separation of literal interpretation from contextual inference in a collaborative block-building task where a builder must resolve underspecified instructions using contextual inferences. Building on an existing two-speaker psycholinguistic paradigm – which contrasts a pragmatically cooperative speaker with one who is only literally reliable – we introduce Build What I Mean (BWIM), an interactive benchmark for contextual meaning construction. In BWIM, models must resolve ambiguity by either performing a contextual inference or requesting clarification at a small communication cost. Evaluating several state-of-the-art LLMs, we find a dissociation between judgment and action: while models detect speaker unreliability in explicit confidence ratings, they fail to exploit this information to guide efficient clarification behavior. Instead, we observe suboptimal strategies, such as partner-blind over-clarification and question-averse guessing under uncertainty.
[66] An Agentic Approach to Generating XAI-Narratives
Yifan He, David Martens
Main category: cs.CL
TL;DR: Multi-agent framework for generating and refining XAI narratives using LLMs, with iterative feedback from critic agents to improve faithfulness and coherence.
Details
Motivation: Existing XAI methods are too technical and expert-oriented, creating a need for more interpretable and accessible explanations. LLM-generated narratives offer promise but need improvement in faithfulness and coherence.
Method: Proposes a multi-agent framework with a Narrator agent generating/revising narratives and multiple Critic Agents providing feedback on faithfulness and coherence metrics. Five agentic systems designed and evaluated across five LLMs on five tabular datasets.
Result: Basic Design, Critic Design, and Critic-Rule Design effectively improve narrative faithfulness across all LLMs. Claude-4.5-Sonnet on Basic Design reduces unfaithful narratives by 90% after three iterations. Ensemble strategy enhances performance for four LLMs.
Conclusion: Agentic systems show strong potential for producing faithful and coherent XAI narratives, making AI explanations more accessible through natural language.
Abstract: Explainable AI (XAI) research has experienced substantial growth in recent years. Existing XAI methods, however, have been criticized for being technical and expert-oriented, motivating the development of more interpretable and accessible explanations. In response, large language model (LLM)-generated XAI narratives have been proposed as a promising approach for translating post-hoc explanations into more accessible, natural-language explanations. In this work, we propose a multi-agent framework for XAI narrative generation and refinement. The framework comprises the Narrator, which generates and revises narratives based on feedback from multiple Critic Agents on faithfulness and coherence metrics, thereby enabling narrative improvement through iteration. We design five agentic systems (Basic Design, Critic Design, Critic-Rule Design, Coherent Design, and Coherent-Rule Design) and systematically evaluate their effectiveness across five LLMs on five tabular datasets. Results validate that the Basic Design, the Critic Design, and the Critic-Rule Design are effective in improving the faithfulness of narratives across all LLMs. Claude-4.5-Sonnet on Basic Design performs best, reducing the number of unfaithful narratives by 90% after three rounds of iteration. To address recurrent issues, we further introduce an ensemble strategy based on majority voting. This approach consistently enhances performance for four LLMs, except for DeepSeek-V3.2-Exp. These findings highlight the potential of agentic systems to produce faithful and coherent XAI narratives.
[67] RouterKGQA: Specialized–General Model Routing for Constraint-Aware Knowledge Graph Question Answering
Bo Yuan, Hexuan Deng, Xuebo Liu, Min Zhang
Main category: cs.CL
TL;DR: RouterKGQA is a framework that combines specialized and general models for knowledge graph question answering, using a specialized model for path generation and a general LLM for guided repair only when needed, achieving better performance with lower computational cost.
Details
Motivation: To address the limitations of existing KGQA approaches: retrieval-based methods using small specialized models often produce unreachable paths and miss implicit constraints, while agent-based methods using large general models achieve better structural grounding but at substantially higher computational cost.
Method: RouterKGQA introduces specialized-general model collaboration where a specialized model generates reasoning paths, and a general LLM performs KG-guided repair only when needed. It also includes constraint-aware answer filtering to reduce redundant answers and a more efficient general agent workflow to lower inference costs.
Result: RouterKGQA outperforms previous best methods by 3.57 points in F1 and 0.49 points in Hits@1 on average across benchmarks, while requiring only 1.15 average LLM calls per question.
Conclusion: The framework demonstrates that combining specialized and general models can achieve better KGQA performance with significantly reduced computational cost compared to pure agent-based approaches.
Abstract: Knowledge graph question answering (KGQA) is a promising approach for mitigating LLM hallucination by grounding reasoning in structured and verifiable knowledge graphs. Existing approaches fall into two paradigms: retrieval-based methods utilize small specialized models, which are efficient but often produce unreachable paths and miss implicit constraints, while agent-based methods utilize large general models, which achieve stronger structural grounding at substantially higher cost. We propose RouterKGQA, a framework for specialized–general model collaboration, in which a specialized model generates reasoning paths and a general model performs KG-guided repair only when needed, improving performance at minimal cost. We further equip the specialized model with constraint-aware answer filtering, which reduces redundant answers. In addition, we design a more efficient general agent workflow, further lowering inference cost. Experimental results show that RouterKGQA outperforms the previous best by 3.57 points in F1 and 0.49 points in Hits@1 on average across benchmarks, while requiring only 1.15 average LLM calls per question. Codes and models are available at https://github.com/Oldcircle/RouterKGQA.
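The routing policy described above (specialized model first, general-LLM repair only on failure) can be sketched as follows; `specialized_generate`, `path_is_valid`, and `llm_repair` are hypothetical stubs standing in for the paper's components:

```python
def route_kgqa(question, specialized_generate, path_is_valid, llm_repair):
    """Specialized-general routing: try the cheap specialized model first,
    escalate to the general LLM only when the path fails validation."""
    path = specialized_generate(question)
    llm_calls = 0
    if not path_is_valid(path):
        path = llm_repair(question, path)  # one KG-guided repair call
        llm_calls = 1
    return path, llm_calls

# Toy usage with stub components.
gen = lambda q: ["Paris", "capitalOf", "France"]
valid = lambda p: p[0] == "Paris"
repair = lambda q, p: ["Paris", "capitalOf", "France"]
path, calls = route_kgqa("Capital of France?", gen, valid, repair)
print(path, calls)  # the path validates, so no extra LLM call is made
```

Averaged over many questions, this kind of conditional escalation is what keeps the per-question LLM-call count close to 1, as in the paper's reported 1.15.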
[68] LoASR-Bench: Evaluating Large Speech Language Models on Low-Resource Automatic Speech Recognition Across Language Families
Jianan Chen, Xiaoxue Gao, Tatsuya Kawahara, Nancy F. Chen
Main category: cs.CL
TL;DR: LoASR-Bench is a comprehensive benchmark for evaluating speech language models on low-resource ASR across 25 languages from 9 language families, revealing current limitations in handling diverse real-world low-resource languages.
Details
Motivation: Existing SpeechLM benchmarks focus primarily on high-resource languages, leaving a critical gap in understanding ASR behavior for low-resource languages. This hinders deployment of SpeechLM-based ASR in real-world multilingual scenarios where support for diverse language families is essential.
Method: Proposed LoASR-Bench benchmark comprising 25 languages from 9 language families with both Latin and non-Latin scripts, enabling cross-linguistic and cross-script assessment of SpeechLM performance in low-resource ASR.
Result: Experimental results highlight limitations of current SpeechLMs in handling real-world low-resource languages, demonstrating the need for improved generalization across diverse language families.
Conclusion: The benchmark reveals critical gaps in SpeechLM performance for low-resource languages and provides a comprehensive evaluation framework to drive improvements in multilingual ASR systems.
Abstract: Large language models (LLMs) have driven substantial advances in speech language models (SpeechLMs), yielding strong performance in automatic speech recognition (ASR) under high-resource conditions. However, existing benchmarks predominantly focus on high-resource languages, leaving the ASR behavior of SpeechLMs in low-resource languages insufficiently understood. This gap is critical, as practical ASR systems must reliably support low-resource languages and generalize across diverse language families, and it directly hinders the deployment of SpeechLM-based ASR in real-world multilingual scenarios. As a result, it is essential to evaluate SpeechLMs on low-resource languages to ensure their generalizability across different language families. To address this problem, we propose LoASR-Bench, a comprehensive benchmark designed to evaluate low-resource automatic speech recognition (ASR) of the latest SpeechLMs across diverse language families. LoASR-Bench comprises 25 languages from 9 language families, featuring both Latin and non-Latin scripts, enabling cross-linguistic and cross-script assessment of ASR performance of current SpeechLMs. Experimental results highlight the limitations of the latest SpeechLMs in handling real-world low-resource languages.
[69] Predicting States of Understanding in Explanatory Interactions Using Cognitive Load-Related Linguistic Cues
Yu Wang, Olcay Türk, Angela Grimminger, Hendrik Buschmeier
Main category: cs.CL
TL;DR: Paper investigates predicting listener understanding states in dialogue using verbal (surprisal, syntactic complexity) and nonverbal (gaze variation) linguistic cues, showing multimodal approach improves classification of four understanding states.
Details
Motivation: To develop methods for real-time prediction of listener understanding in explanatory dialogues, which could enhance communication systems, tutoring applications, and human-computer interaction by detecting when listeners are confused or misunderstanding.
Method: Analyzed MUNDEX corpus of face-to-face board game explanations with listener self-annotated understanding states. Examined three linguistic cues: speaker utterance surprisal (information value), syntactic complexity, and listener gaze variation. Used statistical analysis and classification experiments with off-the-shelf classifiers and fine-tuned German BERT-based multimodal classifier.
Result: Individual cues correlate with listener understanding levels. Multimodal classification (text + linguistic cues) improves prediction of four understanding states (Understanding, Partial Understanding, Non-Understanding, Misunderstanding) compared to text-only approaches.
Conclusion: Verbal and nonverbal linguistic features can effectively predict listener understanding states in dialogue, demonstrating the value of multimodal approaches for understanding detection in human communication.
Abstract: We investigate how verbal and nonverbal linguistic features, exhibited by speakers and listeners in dialogue, can contribute to predicting the listener’s state of understanding in explanatory interactions on a moment-by-moment basis. Specifically, we examine three linguistic cues related to cognitive load and hypothesised to correlate with listener understanding: the information value (operationalised with surprisal) and syntactic complexity of the speaker’s utterances, and the variation in the listener’s interactive gaze behaviour. Based on statistical analyses of the MUNDEX corpus of face-to-face dialogic board game explanations, we find that individual cues vary with the listener’s level of understanding. Listener states (‘Understanding’, ‘Partial Understanding’, ‘Non-Understanding’ and ‘Misunderstanding’) were self-annotated by the listeners using a retrospective video-recall method. The results of a subsequent classification experiment, involving two off-the-shelf classifiers and a fine-tuned German BERT-based multimodal classifier, demonstrate that prediction of these four states of understanding is generally possible and improves when the three linguistic cues are considered alongside textual features.
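The information-value cue used in this study is standard surprisal, which is easy to write down; a minimal sketch (the probabilities would come from a language model, and the values below are purely illustrative):

```python
import math

def surprisal_bits(token_probs):
    """Per-token surprisal -log2 p(token), the information-value measure
    used as a cognitive-load cue; token_probs are LM-assigned probabilities."""
    return [-math.log2(p) for p in token_probs]

probs = [0.5, 0.25, 0.0625]   # hypothetical LM probabilities
print(surprisal_bits(probs))  # -> [1.0, 2.0, 4.0]
```

Rarer (lower-probability) tokens carry more bits of surprisal, which is why surprisal is hypothesised to track the listener's processing load.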
[70] An Empirical Study of SFT-DPO Interaction and Parameterization in Small Language Models
Yuming Feng, Christy Yang
Main category: cs.CL
TL;DR: Systematic comparison of SFT, DPO, and parameterization methods (FFT vs LoRA) on small-scale GPT-2 models shows FFT consistently outperforms LoRA, while DPO provides only task-dependent marginal gains over strong SFT baselines.
Details
Motivation: While DPO is widely used for language model alignment after SFT, its empirical behavior under small-scale models and modest data regimes is not well understood. The paper aims to systematically compare different training approaches (SFT-only, DPO-only, staged SFT-to-DPO) and parameterization methods (full fine-tuning vs LoRA) on small decoder models.
Method: The study uses GPT-2-scale decoder models and evaluates on two tasks: paraphrase detection and Shakespearean sonnet continuation. It compares three training approaches (SFT-only, DPO-only, staged SFT-to-DPO) and two parameterization methods (full fine-tuning vs LoRA). The research systematically analyzes performance in small-scale regimes with modest data.
Result: DPO yields only small, task-dependent gains over strong SFT baselines. DPO can match competitive SFT accuracy without warm start when preference construction closely parallels the supervised objective. Parameterization dominates performance: full fine-tuning consistently outperforms LoRA at matched training depth, and LoRA does not reduce wall-clock time on their hardware setup.
Conclusion: In small-scale regimes, supervised full-parameter adaptation remains the primary performance lever, while preference optimization (DPO) and low-rank adaptation (LoRA) provide limited marginal returns. The findings suggest that for small models with modest data, traditional supervised fine-tuning with full parameter updates may be more effective than more complex alignment techniques.
Abstract: Direct Preference Optimization (DPO) is widely used after supervised fine-tuning (SFT) to align language models, yet empirical behavior under small backbones and modest data is under-specified. We systematically compare SFT-only, DPO-only, and staged SFT-to-DPO training alongside full fine-tuning (FFT) versus LoRA on a GPT-2-scale decoder, evaluating paraphrase detection and Shakespearean sonnet continuation. DPO yields small, task-dependent gains over strong SFT and can match competitive SFT accuracy without a warm start when the preference construction closely parallels the supervised objective. In contrast, parameterization dominates: FFT consistently outperforms LoRA at matched training depth, and LoRA does not reduce wall-clock time on our hardware. These findings indicate that, in this small-scale regime, supervised full-parameter adaptation remains the primary performance lever, while preference optimization and low-rank adaptation provide limited marginal returns.
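The DPO objective the study builds on has a compact closed form: for a chosen/rejected pair, the loss is -log sigmoid(beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)]). A minimal per-pair sketch (the log-probabilities below are illustrative placeholders, not real model outputs):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one preference pair (chosen w, rejected l):
    -log sigmoid(beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)])."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Chosen response gains log-prob relative to the reference while the
# rejected one loses it -> positive margin -> loss below log(2).
print(round(dpo_loss(-10.0, -12.0, -11.0, -11.0), 4))  # -> 0.5981
```

At a zero margin the loss is log(2) ≈ 0.693, so values below that indicate the policy already separates the pair in the preferred direction.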
[71] Current LLMs still cannot ’talk much’ about grammar modules: Evidence from syntax
Mohammed Q. Shormani
Main category: cs.CL
TL;DR: ChatGPT struggles with accurate translation of specialized generative syntax terminology from English to Arabic, with only 25% accuracy on core syntax terms, highlighting limitations in LLMs’ linguistic understanding.
Details
Motivation: To evaluate how well Large Language Models can handle specialized linguistic terminology, specifically generative syntax terms, when translating between languages (English to Arabic), and to assess their understanding of core syntax properties.
Method: Collected 44 generative syntax terms from previous works and experience, translated them by humans and ChatGPT-5, then used analytical and comparative approach to evaluate translation accuracy across three categories: accurate, inaccurate, and partially correct.
Result: Only 25% of ChatGPT translations were accurate, 38.6% were inaccurate, and 36.4% were partially correct (considered appropriate). LLMs show significant limitations in handling specialized syntax terminology with syntactic and semantic challenges.
Conclusion: LLMs cannot effectively ’talk much’ about core syntax properties, requiring closer collaboration between AI specialists and linguists to improve translation accuracy and linguistic understanding in language models.
Abstract: We aim to examine the extent to which Large Language Models (LLMs) can ’talk much’ about grammar modules, providing evidence from syntax core properties translated by ChatGPT into Arabic. We collected 44 terms from previous works on generative syntax, including books and journal articles, as well as from our experience in the field. These terms were translated by humans, and then by ChatGPT-5. We then analyzed and compared both translations. We used an analytical and comparative approach in our analysis. Findings unveil that LLMs still cannot ’talk much’ about the core syntax properties embedded in the terms under study, involving several syntactic and semantic challenges: only 25% of ChatGPT translations were accurate, while 38.6% were inaccurate, and 36.4% were partially correct, which we consider appropriate. Based on these findings, a set of actionable strategies was proposed, the most notable of which is a close collaboration between AI specialists and linguists to improve LLMs’ working mechanisms for accurate, or at least appropriate, translation.
[72] Reasoning Gets Harder for LLMs Inside A Dialogue
Ivan Kartáč, Mateusz Lango, Ondřej Dušek
Main category: cs.CL
TL;DR: BOULDER benchmark shows LLMs perform worse on reasoning tasks when framed as multi-turn dialogues compared to isolated tasks, highlighting the need for evaluation in realistic interactive settings.
Details
Motivation: Current LLM reasoning benchmarks focus on isolated tasks, but real-world usage in task-oriented dialogue requires reasoning while generating text and adhering to instructions on role, format, and style. This mismatch raises concerns about whether benchmark performance accurately reflects models' reasoning robustness in practical TOD settings.
Method: Introduces BOULDER, a dynamic benchmark covering eight travel-related tasks requiring arithmetic, spatial, and temporal reasoning with both commonsense and formal aspects. Each problem is presented in both isolated and dialogue-based variants for controlled comparison while mitigating data contamination. Experiments conducted on eight LLMs with ablations and qualitative analysis.
Result: Experiments reveal a substantial and consistent performance gap between isolated and dialogue settings. The gap is largely driven by the multi-turn nature of dialogue, with additional effects from role conditioning and tool-use requirements.
Conclusion: LLM reasoning performance significantly degrades in realistic interactive dialogue scenarios compared to isolated tasks, highlighting the need to evaluate LLMs in more realistic settings that better reflect real-world usage patterns.
Abstract: Large Language Models (LLMs) achieve strong performance on many reasoning benchmarks, yet these evaluations typically focus on isolated tasks that differ from real-world usage in task-oriented dialogue (TOD). In this setting, LLMs must perform reasoning inherently while generating text and adhering to instructions on role, format, and style. This mismatch raises concerns about whether benchmark performance accurately reflects models’ reasoning robustness in the TOD setting. We investigate how framing reasoning tasks within TOD affects LLM performance by introducing BOULDER, a new dynamic benchmark covering eight travel-related tasks that require arithmetic, spatial, and temporal reasoning with both commonsense and formal aspects. Each problem is presented in both isolated and dialogue-based variants, enabling controlled comparison while mitigating data contamination. Experiments on eight LLMs reveal a substantial and consistent performance gap between isolated and dialogue settings. Through ablations and qualitative analysis, we show that this gap is largely driven by the multi-turn nature of dialogue, with additional effects from role conditioning and tool-use requirements. Our results highlight the need to evaluate LLM reasoning in realistic interactive scenarios.
[73] Enhancing Hyperspace Analogue to Language (HAL) Representations via Attention-Based Pooling for Text Classification
Ali Sakour, Zoalfekar Sakour
Main category: cs.CL
TL;DR: The paper proposes an attention-based pooling mechanism for HAL word embeddings to improve sentence-level representations for sentiment analysis, achieving significant accuracy gains over mean pooling.
Details
Motivation: Standard mean pooling of HAL word embeddings leads to information loss by giving equal weight to all tokens, including uninformative structural words, which dilutes the impact of contextually important words.
Method: Integrates a learnable, temperature-scaled additive attention mechanism into HAL representation pipeline, with Truncated SVD to reduce dimensionality of co-occurrence matrices before attention layer.
Result: Achieves 82.38% test accuracy on IMDB sentiment analysis, a 6.74 percentage point improvement over mean pooling baseline (75.64%). Attention weights successfully suppress stop-words and focus on sentiment-bearing tokens.
Conclusion: Attention-based pooling significantly improves both classification performance and model interpretability for sentence-level HAL representations compared to traditional mean pooling.
Abstract: The Hyperspace Analogue to Language (HAL) model relies on global word co-occurrence matrices to construct distributional semantic representations. While these representations capture lexical relationships effectively, aggregating them into sentence-level embeddings via standard mean pooling often results in information loss. Mean pooling assigns equal weight to all tokens, thereby diluting the impact of contextually salient words with uninformative structural tokens. In this paper, we address this limitation by integrating a learnable, temperature-scaled additive attention mechanism into the HAL representation pipeline. To mitigate the sparsity and high dimensionality of the raw co-occurrence matrices, we apply Truncated Singular Value Decomposition (SVD) to project the vectors into a dense latent space prior to the attention layer. We evaluate the proposed architecture on the IMDB sentiment analysis dataset. Empirical results demonstrate that the attention-based pooling approach achieves a test accuracy of 82.38%, yielding an absolute improvement of 6.74 percentage points over the traditional mean pooling baseline (75.64%). Furthermore, qualitative analysis of the attention weights indicates that the mechanism successfully suppresses stop-words and selectively attends to sentiment-bearing tokens, improving both classification performance and model interpretability.
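The pooling mechanism can be sketched in NumPy as follows; in the paper the projection `W` and scoring vector `v` are learned, whereas here they are random for illustration, and the "embeddings" stand in for SVD-reduced HAL vectors:

```python
import numpy as np

def attention_pool(embeddings, W, v, temperature=1.0):
    """Temperature-scaled additive attention pooling over token vectors.

    embeddings: (n_tokens, d) HAL vectors after Truncated SVD.
    W: (d, d) projection; v: (d,) scoring vector.
    Returns the weighted sentence vector and the attention weights.
    """
    scores = np.tanh(embeddings @ W) @ v / temperature  # (n_tokens,)
    scores -= scores.max()                              # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()     # softmax
    return weights @ embeddings, weights                # (d,), (n_tokens,)

rng = np.random.default_rng(0)
emb = rng.normal(size=(5, 8))
pooled, w = attention_pool(emb, rng.normal(size=(8, 8)), rng.normal(size=8))
print(pooled.shape, round(float(w.sum()), 6))  # (8,) 1.0
```

Mean pooling is the special case where all weights equal 1/n; the learned weights are what let the model down-weight stop-words and emphasize sentiment-bearing tokens.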
[74] Semantic Token Clustering for Efficient Uncertainty Quantification in Large Language Models
Qi Cao, Andrew Gambardella, Takeshi Kojima, Yutaka Matsuo, Yusuke Iwasawa
Main category: cs.CL
TL;DR: STC: Efficient uncertainty quantification for LLMs using semantic token clustering without repeated sampling or auxiliary models.
Details
Motivation: LLMs often produce unreliable outputs with overconfidence, and existing uncertainty quantification methods require substantial computational overhead through repeated sampling or auxiliary models.
Method: Semantic Token Clustering (STC) groups tokens into semantically consistent clusters using embedding clustering and prefix matching, then quantifies uncertainty based on probability mass aggregated over corresponding semantic clusters.
Result: STC achieves performance comparable to state-of-the-art baselines while substantially reducing computational overhead, requiring only a single generation.
Conclusion: STC provides an efficient uncertainty quantification method that leverages inherent semantic information in LLMs without additional computational burden.
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across diverse tasks. However, the truthfulness of their outputs is not guaranteed, and their tendency toward overconfidence further limits reliability. Uncertainty quantification offers a promising way to identify potentially unreliable outputs, but most existing methods rely on repeated sampling or auxiliary models, introducing substantial computational overhead. To address these limitations, we propose Semantic Token Clustering (STC), an efficient uncertainty quantification method that leverages the semantic information inherently encoded in LLMs. Specifically, we group tokens into semantically consistent clusters using embedding clustering and prefix matching, and quantify uncertainty based on the probability mass aggregated over the corresponding semantic cluster. Our approach requires only a single generation and does not depend on auxiliary models. Experimental results show that STC achieves performance comparable to state-of-the-art baselines while substantially reducing computational overhead.
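The cluster-mass idea can be illustrated in a few lines; here the token-to-cluster assignment is given directly, whereas the paper derives it via embedding clustering and prefix matching, and the uncertainty score below (one minus the dominant cluster's mass) is one simple way to use the aggregated mass:

```python
def cluster_uncertainty(token_probs, cluster_of):
    """Aggregate next-token probability mass over semantic clusters and
    score uncertainty as 1 - mass of the dominant cluster.

    token_probs: dict token -> probability.
    cluster_of: dict token -> cluster id (given here for illustration).
    """
    mass = {}
    for tok, p in token_probs.items():
        cid = cluster_of[tok]
        mass[cid] = mass.get(cid, 0.0) + p
    return 1.0 - max(mass.values()), mass

# "yes", "Yes", and "yeah" are surface variants of one semantic answer.
probs = {"yes": 0.4, "Yes": 0.3, "yeah": 0.1, "no": 0.2}
clusters = {"yes": 0, "Yes": 0, "yeah": 0, "no": 1}
u, mass = cluster_uncertainty(probs, clusters)
print(round(u, 2))  # top token prob is only 0.4, but cluster mass is 0.8
```

The point of the example: raw token probabilities overstate uncertainty (top token at 0.4), while aggregating semantically equivalent tokens reveals the model is actually quite confident, all from a single generation.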
[75] Evaluating Evidence Grounding Under User Pressure in Instruction-Tuned Language Models
Sai Koneru, Elphin Joe, Christine Kirchhoff, Jian Wu, Sarah Rajtmajer
Main category: cs.CL
TL;DR: Study examines how instruction-tuned LLMs balance user alignment vs. evidence faithfulness in epistemic conflicts using climate assessment data, finding evidence alone insufficient against user pressure.
Details
Motivation: To evaluate the tension between user-alignment pressures and faithfulness to in-context evidence in instruction-tuned language models, particularly in contested domains where models must navigate conflicting demands.
Method: Introduced controlled epistemic-conflict framework using U.S. National Climate Assessment data. Conducted fine-grained ablations over evidence composition and uncertainty cues across 19 instruction-tuned models (0.27B to 32B parameters). Analyzed models under neutral prompts vs. user pressure scenarios.
Result: 1) Richer evidence improves accuracy under neutral prompts but doesn’t prevent user-aligned reversals under pressure. 2) Negative partial-evidence interaction: adding epistemic nuance (research gaps) increases sycophancy susceptibility. 3) Robustness scales non-monotonically. 4) Models differ in distributional concentration under conflict.
Conclusion: In controlled fixed-evidence settings, providing richer in-context evidence alone offers no guarantee against user pressure without explicit training for epistemic integrity.
Abstract: In contested domains, instruction-tuned language models must balance user-alignment pressures against faithfulness to the in-context evidence. To evaluate this tension, we introduce a controlled epistemic-conflict framework grounded in the U.S. National Climate Assessment. We conduct fine-grained ablations over evidence composition and uncertainty cues across 19 instruction-tuned models spanning 0.27B to 32B parameters. Across neutral prompts, richer evidence generally improves evidence-consistent accuracy and ordinal scoring performance. Under user pressure, however, evidence does not reliably prevent user-aligned reversals in this controlled fixed-evidence setting. We report three primary failure modes. First, we identify a negative partial-evidence interaction, where adding epistemic nuance, specifically research gaps, is associated with increased susceptibility to sycophancy in families like Llama-3 and Gemma-3. Second, robustness scales non-monotonically: within some families, certain low-to-mid scale models are especially sensitive to adversarial user pressure. Third, models differ in distributional concentration under conflict: some instruction-tuned models maintain sharply peaked ordinal distributions under pressure, while others are substantially more dispersed; in scale-matched Qwen comparisons, reasoning-distilled variants (DeepSeek-R1-Qwen) exhibit consistently higher dispersion than their instruction-tuned counterparts. These findings suggest that, in a controlled fixed-evidence setting, providing richer in-context evidence alone offers no guarantee against user pressure without explicit training for epistemic integrity.
[76] Measuring Faithfulness Depends on How You Measure: Classifier Sensitivity in LLM Chain-of-Thought Evaluation
Richard J. Young
Main category: cs.CL
TL;DR: Chain-of-thought faithfulness measurements are not objective properties of models but depend heavily on classifier choice, with different classifiers producing significantly different results on identical data.
Details
Motivation: To challenge the assumption that chain-of-thought faithfulness is an objective, measurable property of models, showing that different evaluation methodologies yield systematically different results.
Method: Applied three classifiers (regex-only detector, regex-plus-LLM pipeline, and Claude Sonnet 4 judge) to 10,276 influenced reasoning traces from 12 open-weight models across 9 families and 7B to 1T parameters.
Result: Different classifiers produced overall faithfulness rates of 74.4%, 82.6%, and 69.7% on identical data, with statistically significant differences. Classifier choice can reverse model rankings and shows systematic disagreements rather than random errors.
Conclusion: Published faithfulness numbers cannot be meaningfully compared across studies using different classifiers; future evaluations should report sensitivity ranges across multiple classification methodologies rather than single point estimates.
Abstract: Recent work on chain-of-thought (CoT) faithfulness reports single aggregate numbers (e.g., DeepSeek-R1 acknowledges hints 39% of the time), implying that faithfulness is an objective, measurable property of a model. This paper demonstrates that it is not. Three classifiers (a regex-only detector, a two-stage regex-plus-LLM pipeline, and an independent Claude Sonnet 4 judge) are applied to 10,276 influenced reasoning traces from 12 open-weight models spanning 9 families and 7B to 1T parameters. On identical data, these classifiers produce overall faithfulness rates of 74.4%, 82.6%, and 69.7%, respectively, with non-overlapping 95% confidence intervals. Per-model gaps range from 2.6 to 30.6 percentage points; all are statistically significant (McNemar’s test, p < 0.001). The disagreements are systematic, not random: inter-classifier agreement measured by Cohen’s kappa ranges from 0.06 (“slight”) for sycophancy hints to 0.42 (“moderate”) for grader hints, and the asymmetry is pronounced: for sycophancy, 883 cases are classified as faithful by the pipeline but unfaithful by the Sonnet judge, while only 2 go the other direction. Classifier choice can also reverse model rankings: Qwen3.5-27B ranks 1st under the pipeline but 7th under the Sonnet judge; OLMo-3.1-32B moves in the opposite direction, from 9th to 3rd. The root cause is that different classifiers operationalize related faithfulness constructs at different levels of stringency (lexical mention versus epistemic dependence), and these constructs yield divergent measurements on the same behavior. These results demonstrate that published faithfulness numbers cannot be meaningfully compared across studies that use different classifiers, and that future evaluations should report sensitivity ranges across multiple classification methodologies rather than single point estimates.
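Cohen's kappa, the agreement statistic quoted above, is straightforward to compute for binary faithful/unfaithful labels; the label vectors below are illustrative, not the paper's data:

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two binary classifiers' faithful/unfaithful calls:
    observed agreement corrected for agreement expected by chance."""
    n = len(labels_a)
    agree = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal rate of the positive class.
    pa = sum(labels_a) / n
    pb = sum(labels_b) / n
    chance = pa * pb + (1 - pa) * (1 - pb)
    return (agree - chance) / (1 - chance)

a = [1, 1, 0, 1, 0, 1, 1, 0]  # classifier 1: faithful = 1
b = [1, 0, 0, 1, 1, 1, 1, 0]  # classifier 2
print(round(cohens_kappa(a, b), 3))  # -> 0.467
```

Kappa of 1.0 means perfect agreement and 0 means chance-level; values like the paper's 0.06 ("slight") for sycophancy hints indicate the two classifiers are measuring nearly unrelated constructs.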
[77] Taking a Deep Breath: Enhancing Language Modeling of Large Language Models with Sentinel Tokens
Weiyao Luo, Suncong Zheng, Heming Xia, Weikang Wang, Yan Lei, Tianyu Liu, Shuang Chen, Zhifang Sui
Main category: cs.CL
TL;DR: A method to improve LLMs’ long-context modeling by inserting special summary tokens at chunk boundaries and modifying attention masks to aggregate chunk information.
Details
Motivation: Transformer-based LLMs suffer performance degradation when modeling long-term contexts because they discard information to reduce computational overhead. The paper aims to address this limitation by enabling LLMs to better handle long sequences.
Method: Segments the text into multiple chunks, inserts special sentinel tokens at chunk boundaries, and modifies the attention mask so these tokens summarize the information of each chunk.
Result: Experiments on language modeling and out-of-domain downstream tasks validate the superiority of the approach, showing improved performance in long-context scenarios.
Conclusion: The proposed method effectively enables LLMs to better handle long-term contexts by summarizing chunk information through special tokens, improving performance on various tasks.
Abstract: Large language models (LLMs) have shown promising efficacy across various tasks, becoming powerful tools in numerous aspects of human life. However, Transformer-based LLMs suffer a performance degradation when modeling long-term contexts because they discard some information to reduce computational overhead. In this work, we propose a simple yet effective method to enable LLMs to take a deep breath, encouraging them to summarize information contained within discrete text chunks. Specifically, we segment the text into multiple chunks and insert special sentinel tokens that summarize the information contained in each chunk.
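One plausible way to realize the described masking is sketched below: each chunk ends in a sentinel position, and later tokens may attend only to their own chunk plus earlier sentinels, forcing each sentinel to summarize its chunk. This is an illustration of the idea, not the paper's exact scheme:

```python
import numpy as np

def sentinel_mask(chunk_lengths):
    """Boolean causal attention mask over [chunk tokens + one sentinel]
    per chunk: position i may attend to j iff j <= i and (same chunk,
    or j is the sentinel of an earlier chunk)."""
    kinds, chunk_id = [], []
    for c, ln in enumerate(chunk_lengths):
        kinds += ["tok"] * ln + ["sent"]
        chunk_id += [c] * (ln + 1)
    n = len(kinds)
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(i + 1):  # causal: no attending to the future
            same_chunk = chunk_id[i] == chunk_id[j]
            earlier_sentinel = kinds[j] == "sent" and chunk_id[j] < chunk_id[i]
            mask[i, j] = same_chunk or earlier_sentinel
    return mask

m = sentinel_mask([2, 2])  # positions: tok tok SENT | tok tok SENT
print(m.astype(int))
```

Because tokens in chunk 1 cannot see raw tokens of chunk 0, all of chunk 0's information must flow through its sentinel, which is what makes the sentinel a learned chunk summary.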
[78] Test-Time Alignment for Large Language Models via Textual Model Predictive Control
Kuang-Da Wang, Teng-Ruei Chen, Yu Heng Hung, Guo-Xun Ko, Shuoyang Ding, Yueh-Hua Wu, Yu-Chiang Frank Wang, Chao-Han Huck Yang, Wen-Chih Peng, Ping-Chun Hsieh
Main category: cs.CL
TL;DR: TMPC is a novel predictive planning framework that adapts Model Predictive Control for aligning LLMs at inference time, using hierarchical principles to overcome horizon/dimensionality trade-offs in test-time alignment.
Details
Motivation: Finetuning LLMs for human preference alignment is resource-intensive, motivating lightweight test-time alternatives. Current approaches face fundamental challenges: token-level actions suffer from the horizon curse, while response-level actions face the dimensionality curse.
Method: Proposes Textual Model Predictive Control (TMPC) with two hierarchical principles: (1) Hindsight Subgoal Identification - analyzes generation to retrospectively identify high-reward intermediate outputs as subgoals; (2) Subgoal-Conditioned Re-Generation - uses identified subgoals to guide subsequent planning iterations.
Result: TMPC consistently improves performance across three tasks with distinct segmentation properties: discourse-level translation, long-form response generation, and program synthesis, demonstrating generality.
Conclusion: TMPC provides an effective framework for test-time alignment of LLMs by overcoming the fundamental trade-off between horizon and dimensionality through hierarchical planning principles inspired by control theory and reinforcement learning.
Abstract: Aligning Large Language Models (LLMs) with human preferences through finetuning is resource-intensive, motivating lightweight alternatives at test time. We address test-time alignment through the lens of sequential decision making, a perspective that reveals two fundamental challenges. When actions are defined at the token level, as in guided decoding, alignment suffers from the curse of horizon. Conversely, when actions are at the response level, as in traditional iterative refinement, the curse of dimensionality emerges. To resolve this trade-off, we draw inspiration from Model Predictive Control (MPC) in control theory to propose Textual Model Predictive Control (TMPC), a novel predictive planning framework adapted for aligning LLMs at inference time. A key limitation of standard MPC is its reliance on predefined, hard segment boundaries, which are often absent in text generation. TMPC overcomes this by introducing two principles inspired by hierarchical reinforcement learning: (1) Hindsight Subgoal Identification, where TMPC analyzes generation subgoals to retrospectively identify high-reward intermediate outputs as subgoals. This allows the framework to discover meaningful, task-specific planning steps (e.g., a sentence in machine translation or a bug fix in code generation). (2) Subgoal-Conditioned Re-Generation, where these identified subgoals are used to guide subsequent planning iterations. By conditioning on these proven, high-quality subgoals, TMPC ensures stable improvement by building upon previously validated successes. TMPC is evaluated on three tasks with distinct segmentation properties: discourse-level translation, long-form response generation, and program synthesis. The results demonstrate that TMPC consistently improves performance, highlighting the generality of the approach.
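The two TMPC principles can be caricatured as a hindsight-refinement loop; `generate` and `reward` are hypothetical stand-ins for the paper's planner and reward model, and the toy strings below exist only to make the loop's behavior checkable:

```python
def tmpc(generate, reward, rounds=3):
    """MPC-style refinement sketch: after each full generation,
    retrospectively pick the highest-reward prefix as a subgoal
    (hindsight subgoal identification) and condition the next
    generation on it (subgoal-conditioned re-generation)."""
    subgoal, best = "", float("-inf")
    for _ in range(rounds):
        draft = generate(subgoal)
        # Hindsight step: best-scoring prefix of the draft becomes a subgoal.
        prefixes = [draft[: i + 1] for i in range(len(draft))]
        cand = max(prefixes, key=reward)
        if reward(cand) > best:
            best, subgoal = reward(cand), cand
    return subgoal

# Toy world: "a" is good, "x" is bad; each draft appends "ax" to the subgoal.
gen = lambda s: s + "ax"
rew = lambda s: s.count("a") - s.count("x")
print(tmpc(gen, rew, rounds=3))  # -> aaa
```

Each round keeps only the validated high-reward prefix and regenerates from there, which mirrors how MPC re-plans from the best state reached so far rather than committing to one full trajectory.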
[79] Disambiguation of Emotion Annotations by Contextualizing Events in Plausible Narratives
Johannes Schäfer, Roman Klinger
Main category: cs.CL
TL;DR: Paper presents EBS dataset for emotion analysis by generating contextual narratives to resolve ambiguity in emotion classification through automatic context generation.
Details
Motivation: Addresses ambiguity in emotion analysis from missing information and subjectivity, aiming to fill missing context to resolve ambiguous emotion classification instances.
Method: Develops method to automatically generate reasonable contexts for ambiguous classification instances using techniques from short story generation to create coherent narratives, resulting in Emotional BackStories (EBS) dataset.
Result: Generated contextual narratives clarify emotion interpretation, with relief and sadness benefiting most from additional context, while joy doesn’t require it. Dataset enables systematic examination of contextualized emotion analysis.
Conclusion: Context generation can resolve ambiguity in emotion analysis, with EBS dataset providing first comprehensive framework for contextualized emotion analysis, showing varying benefits across different emotions.
Abstract: Ambiguity in emotion analysis stems both from potentially missing information and the subjectivity of interpreting a text. The latter did receive substantial attention, but can we fill missing information to resolve ambiguity? We address this question by developing a method to automatically generate reasonable contexts for an otherwise ambiguous classification instance. These generated contexts may act as illustrations of potential interpretations by different readers, as they can fill missing information with their individual world knowledge. This task to generate plausible narratives is a challenging one: We combine techniques from short story generation to achieve coherent narratives. The resulting English dataset of Emotional BackStories, EBS, allows for the first comprehensive and systematic examination of contextualized emotion analysis. We conduct automatic and human annotation and find that the generated contextual narratives do indeed clarify the interpretation of specific emotions. Particularly relief and sadness benefit from our approach, while joy does not require the additional context we provide.
[80] Semantic-Driven Topic Modeling for Analyzing Creativity in Virtual Brainstorming
Melkamu Abay Mersha, Jugal Kalita
Main category: cs.CL
TL;DR: A semantic-driven topic modeling framework using transformer embeddings, UMAP, HDBSCAN, and topic extraction to analyze brainstorming transcripts, achieving higher coherence than LDA, ETM, and BERTopic.
Details
Motivation: Virtual brainstorming generates large volumes of ideas that are difficult to analyze manually. There's a need for automated approaches to evaluate group creativity efficiently and objectively, as manual coding is time-consuming and subjective.
Method: Four-component framework: 1) Sentence-BERT for transformer-based embeddings, 2) UMAP for dimensionality reduction, 3) HDBSCAN for clustering, and 4) topic extraction with refinement. Captures semantic similarity at sentence level to discover coherent themes while filtering noise and identifying outliers.
Result: Achieved average coherence score of 0.687 (CV), outperforming LDA, ETM, and BERTopic baselines. Provides interpretable insights into depth and diversity of topics, supporting both convergent and divergent dimensions of group creativity.
Conclusion: Demonstrates potential of embedding-based topic modeling for analyzing collaborative ideation. Provides efficient, scalable framework for studying creativity in synchronous virtual meetings.
Abstract: Virtual brainstorming sessions have become a central component of collaborative problem solving, yet the large volume and uneven distribution of ideas often make it difficult to extract valuable insights efficiently. Manual coding of ideas is time-consuming and subjective, underscoring the need for automated approaches to support the evaluation of group creativity. In this study, we propose a semantic-driven topic modeling framework that integrates four modular components: transformer-based embeddings (Sentence-BERT), dimensionality reduction (UMAP), clustering (HDBSCAN), and topic extraction with refinement. The framework captures semantic similarity at the sentence level, enabling the discovery of coherent themes from brainstorming transcripts while filtering noise and identifying outliers. We evaluate our approach on structured Zoom brainstorming sessions involving student groups tasked with improving their university. Results demonstrate that our model achieves higher topic coherence compared to established methods such as LDA, ETM, and BERTopic, with an average coherence score of 0.687 (CV), outperforming baselines by a significant margin. Beyond improved performance, the model provides interpretable insights into the depth and diversity of topics explored, supporting both convergent and divergent dimensions of group creativity. This work highlights the potential of embedding-based topic modeling for analyzing collaborative ideation and contributes an efficient and scalable framework for studying creativity in synchronous virtual meetings.
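The four-stage pipeline can be sketched end to end. This is a stand-in, not the paper's implementation: PCA (via SVD) replaces UMAP, a crude density rule replaces HDBSCAN, and random vectors replace Sentence-BERT embeddings; only the shape of the pipeline (embed, reduce, cluster, flag outliers with label -1) mirrors the framework.

```python
import numpy as np

def reduce_dim(X, k=2):
    """Dimensionality reduction (PCA via SVD, standing in for UMAP)."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

def cluster(X, eps=2.0, min_pts=2):
    """Density-style clustering (crude stand-in for HDBSCAN); -1 marks noise."""
    labels = -np.ones(len(X), dtype=int)
    cid = 0
    for i in range(len(X)):
        if labels[i] != -1:
            continue
        near = np.where(np.linalg.norm(X - X[i], axis=1) < eps)[0]
        if len(near) >= min_pts:          # dense enough -> new topic cluster
            labels[near] = cid
            cid += 1
    return labels

rng = np.random.default_rng(0)
# Two tight "idea" clusters plus one outlier sentence embedding.
emb = np.vstack([rng.normal(0, 0.1, (5, 8)),
                 rng.normal(5, 0.1, (5, 8)),
                 rng.normal(20, 0.1, (1, 8))])
low = reduce_dim(emb, k=2)
labels = cluster(low, eps=2.0, min_pts=2)
```

In the real framework, sentences sharing a cluster label would then be passed to the topic extraction and refinement stage.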
[81] Responsible AI Technical Report
KT, :, Yunjin Park, Jungwon Yoon, Junhyung Moon, Myunggyo Oh, Wonhyuk Lee, Sujin Kim, Youngchol Kim, Eunmi Kim, Hyoungjun Park, Eunyoung Shin, Wonyoung Lee, Somin Lee, Minwook Ju, Minsung Noh, Dongyoung Jeong, Jeongyeop Kim, Wanjin Park, Soonmin Bae
Main category: cs.CL
TL;DR: KT developed a Responsible AI assessment methodology and risk mitigation technologies including SafetyGuard to block harmful AI responses, focusing on regulatory compliance and systematic risk management throughout AI lifecycle.
Details
Motivation: To ensure safety and reliability of AI services by addressing regulatory requirements from the Basic Act on AI implementation and global AI governance trends, while providing practical tools for organizations to develop Responsible AI.
Method: Developed a unique RAI assessment methodology based on KT’s AI risk taxonomy tailored to domestic environment, with systematic verification of model safety and robustness, plus proprietary Guardrail technology (SafetyGuard) for real-time harmful response blocking.
Result: Created a comprehensive framework for regulatory compliance and risk management throughout AI development to operation, with practical tools for risk mitigation including the released SafetyGuard system.
Conclusion: The research provides valuable insights for organizations developing Responsible AI and supports enhancement of safety in the domestic AI development ecosystem through systematic risk assessment and mitigation technologies.
Abstract: KT developed a Responsible AI (RAI) assessment methodology and risk mitigation technologies to ensure the safety and reliability of AI services. By analyzing the Basic Act on AI implementation and global AI governance trends, we established a unique approach for regulatory compliance that systematically identifies and manages all potential risk factors from AI development to operation. We present a reliable assessment methodology that systematically verifies model safety and robustness based on KT’s AI risk taxonomy tailored to the domestic environment. We also provide practical tools for managing and mitigating identified AI risks. With the release of this report, we also release our proprietary guardrail, SafetyGuard, which blocks harmful responses from AI models in real time, supporting the enhancement of safety in the domestic AI development ecosystem. We believe these research outcomes provide valuable insights for organizations seeking to develop Responsible AI.
[82] LiRA: A Multi-Agent Framework for Reliable and Readable Literature Review Generation
Gregory Hok Tjoan Go, Khang Ly, Anders Søgaard, Amin Tabatabaei, Maarten de Rijke, Xinyi Chen
Main category: cs.CL
TL;DR: LiRA is a multi-agent system that automates literature review writing using specialized LLM agents for outlining, writing, editing, and reviewing, outperforming existing baselines in quality and citation accuracy.
Details
Motivation: The exponential growth of scientific publications makes comprehensive literature reviews increasingly difficult to maintain. While prior work automated retrieval and screening, the actual writing phase remains under-explored, particularly regarding readability and factual accuracy.
Method: LiRA employs a multi-agent collaborative workflow that mimics human literature review processes. It uses specialized agents for content outlining, subsection writing, editing, and reviewing to produce cohesive review articles. The system is evaluated on SciReviewGen and a proprietary ScienceDirect dataset.
Result: LiRA outperforms current baselines (AutoSurvey and MASS-Survey) in writing and citation quality while maintaining competitive similarity to human-written reviews. It also demonstrates robustness to reviewer model variation in real-world scenarios with document retrieval.
Conclusion: The findings highlight the potential of agentic LLM workflows to improve automated scientific writing reliability and usability, even without domain-specific tuning.
Abstract: The rapid growth of scientific publications has made it increasingly difficult to keep literature reviews comprehensive and up-to-date. Though prior work has focused on automating retrieval and screening, the writing phase of systematic reviews remains largely under-explored, especially with regard to readability and factual accuracy. To address this, we present LiRA (Literature Review Agents), a multi-agent collaborative workflow which emulates the human literature review process. LiRA utilizes specialized agents for content outlining, subsection writing, editing, and reviewing, producing cohesive and comprehensive review articles. Evaluated on SciReviewGen and a proprietary ScienceDirect dataset, LiRA outperforms current baselines such as AutoSurvey and MASS-Survey in writing and citation quality, while maintaining competitive similarity to human-written reviews. We further evaluate LiRA in real-world scenarios using document retrieval and assess its robustness to reviewer model variation. Our findings highlight the potential of agentic LLM workflows, even without domain-specific tuning, to improve the reliability and usability of automated scientific writing.
[83] Can AI Truly Represent Your Voice in Deliberations? A Comprehensive Study of Large-Scale Opinion Aggregation with LLMs
Shenzhe Zhu, Shu Yang, Michiel A. Bakker, Alex Pentland, Jiaxin Pei
Main category: cs.CL
TL;DR: DeliberationBank: A human-grounded dataset and DeliberationJudge model for evaluating LLM-generated deliberation summaries to address fairness concerns and underrepresentation of minority perspectives.
Details
Motivation: LLMs show promise for summarizing large-scale public deliberations but risk underrepresenting minority perspectives and exhibiting bias, raising fairness concerns in high-stakes policy contexts. Current evaluation methods relying on LLMs as judges show weak alignment with human judgments.
Method: Created DeliberationBank dataset with opinion data from 3,000 participants across 10 deliberation questions and summary judgment data annotated by 4,500 participants across four dimensions. Trained DeliberationJudge, a fine-tuned DeBERTa model that can rate deliberation summaries from individual perspectives.
Result: DeliberationJudge is more efficient and more aligned with human judgments compared to various LLM judges. Evaluation of 18 LLMs revealed persistent weaknesses in deliberation summarization, especially underrepresentation of minority positions.
Conclusion: The framework provides a scalable and reliable way to evaluate deliberation summarization, helping ensure AI systems are more representative and equitable for policymaking.
Abstract: Large-scale public deliberations generate thousands of free-form contributions that must be synthesized into representative and neutral summaries for policy use. While LLMs have been shown as a promising tool to generate summaries for large-scale deliberations, they also risk underrepresenting minority perspectives and exhibiting bias with respect to the input order, raising fairness concerns in high-stakes contexts. Studying and fixing these issues requires a comprehensive evaluation at a large scale, yet current practice often relies on LLMs as judges, which show weak alignment with human judgments. To address this, we present DeliberationBank, a large-scale human-grounded dataset with (1) opinion data spanning ten deliberation questions created by 3,000 participants and (2) summary judgment data annotated by 4,500 participants across four dimensions (representativeness, informativeness, neutrality, policy approval). Using these datasets, we train DeliberationJudge, a fine-tuned DeBERTa model that can rate deliberation summaries from individual perspectives. DeliberationJudge is more efficient and more aligned with human judgments compared to a wide range of LLM judges. With DeliberationJudge, we evaluate 18 LLMs and reveal persistent weaknesses in deliberation summarization, especially underrepresentation of minority positions. Our framework provides a scalable and reliable way to evaluate deliberation summarization, helping ensure AI systems are more representative and equitable for policymaking.
[84] Modeling Turn-Taking with Semantically Informed Gestures
Varsha Suresh, M. Hamza Mughal, Christian Theobalt, Vera Demberg
Main category: cs.CL
TL;DR: Paper introduces DnD Gesture++ dataset with semantic gesture annotations and shows gestures improve multimodal turn-taking prediction in conversations
Details
Motivation: Humans use multimodal cues (speech, gestures, gaze) for turn-taking in conversations. While linguistic and acoustic features are studied, gestures' complementary role in modeling turn transitions needs systematic investigation.
Method: Extended DnD Gesture corpus with 2,663 semantic gesture annotations (iconic, metaphoric, deictic, discourse types). Used Mixture-of-Experts framework integrating text, audio, and gestures for turn-taking prediction.
Result: Incorporating semantically guided gestures yields consistent performance gains over baselines, demonstrating gestures’ complementary role in multimodal turn-taking prediction.
Conclusion: Gestures provide valuable complementary cues for turn-taking prediction beyond text and audio, and semantic gesture annotations enable more effective multimodal modeling of conversational dynamics.
Abstract: In conversation, humans use multimodal cues, such as speech, gestures, and gaze, to manage turn-taking. While linguistic and acoustic features are informative, gestures provide complementary cues for modeling these transitions. To study this, we introduce DnD Gesture++, an extension of the multi-party DnD Gesture corpus enriched with 2,663 semantic gesture annotations spanning iconic, metaphoric, deictic, and discourse types. Using this dataset, we model turn-taking prediction through a Mixture-of-Experts framework integrating text, audio, and gestures. Experiments show that incorporating semantically guided gestures yields consistent performance gains over baselines, demonstrating their complementary role in multimodal turn-taking.
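The Mixture-of-Experts fusion over text, audio, and gesture features can be sketched as follows. This is a hypothetical illustration, not the paper's model: the per-modality experts, the gate, and the feature vectors are all toy stand-ins.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_turn_taking(feats, experts, gate_w):
    """Mixture-of-Experts fusion: each expert scores a turn transition from
    its own modality; a learned gate weights the experts per example."""
    logits = np.array([experts[m](feats[m]) for m in feats])  # expert scores
    gate_in = np.concatenate([feats[m] for m in feats])       # fused features
    weights = softmax(gate_w @ gate_in)                       # gating weights
    return float(weights @ logits)                            # blended score

rng = np.random.default_rng(0)
feats = {"text": rng.normal(size=4),
         "audio": rng.normal(size=4),
         "gesture": rng.normal(size=4)}          # toy per-modality features
experts = {m: (lambda v: float(v.mean())) for m in feats}  # toy experts
gate_w = rng.normal(size=(3, 12))                # gate over 3 experts
score = moe_turn_taking(feats, experts, gate_w)
```

Because the output is a convex combination of expert scores, a weak gesture expert can never dominate; the gate learns when gesture cues should contribute.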
[85] BitSkip: An Empirical Analysis of Quantization and Early Exit Composition in Transformers
Ramshankar Bhuvaneswaran, Handan Liu
Main category: cs.CL
TL;DR: BitSkip explores interactions between quantization and dynamic routing in LLMs, finding that simple 8-bit quantization without Hadamard transforms outperforms more complex approaches and even matches full-precision baseline quality.
Details
Motivation: To systematically study the compositional effects of extreme quantization and dynamic routing techniques in LLMs, which are individually well-documented but poorly understood in combination.
Method: Introduces BitSkip, a hybrid architectural framework for exploring interactions between quantization and dynamic routing. Tests various configurations including 8-bit quantization without Hadamard transforms (BitSkip-V1), 4-bit quantization, and Hadamard-enhanced versions.
Result: BitSkip-V1 (8-bit without Hadamard) outperforms more complex 4-bit and Hadamard-enhanced counterparts, achieving perplexity of 1.13 vs 1.19 for full-precision baseline. Hadamard transforms catastrophically degrade performance by over 37,000% due to training instability. BitSkip-V1 shows superior early-exit characteristics with layer 18 providing 32.5% speed gain for only 4% quality loss.
Conclusion: Simple 8-bit quantization without complex transforms can achieve near-baseline quality while enabling efficient early-exit strategies, challenging assumptions about the benefits of more complex quantization techniques.
Abstract: The pursuit of efficient Large Language Models (LLMs) has led to increasingly complex techniques like extreme quantization and dynamic routing. While individual benefits of these methods are well-documented, their compositional effects remain poorly understood. This paper introduces BitSkip, a hybrid architectural framework for systematically exploring these interactions. Counter-intuitively, our findings reveal that a simple 8-bit quantized model without Hadamard transform (BitSkip-V1) not only outperforms its more complex 4-bit and Hadamard-enhanced counterparts but also rivals the full-precision baseline in quality (perplexity of 1.13 vs. 1.19). The introduction of Hadamard transforms, even at 8-bit precision, catastrophically degraded performance by over 37,000%, a failure traced to fundamental training instability. Our BitSkip-V1 recipe demonstrates superior early-exit characteristics, with layer 18 providing an optimal 32.5% speed gain for a minimal 4% quality loss.
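A minimal sketch of the kind of plain symmetric 8-bit quantization the BitSkip-V1 findings favor. Details here (per-tensor scaling, the symmetric int8 range) are common practice but assumptions on my part, not the paper's exact recipe:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor 8-bit quantization: no Hadamard transform,
    just a single scale mapping the weight range onto [-127, 127]."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float tensor from int8 codes."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=(64, 64)).astype(np.float32)  # toy weight matrix
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
err = np.abs(w - w_hat).max()   # worst-case rounding error, at most scale/2
```

The appeal of this recipe is exactly its simplicity: one scale per tensor, a bounded rounding error, and none of the transform machinery that the paper found destabilizes training.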
[86] Rep2Text: Decoding Full Text from a Single LLM Token Representation
Haiyan Zhao, Zirui He, Yiming Tang, Fan Yang, Ali Payani, Dianbo Liu, Mengnan Du
Main category: cs.CL
TL;DR: Rep2Text framework decodes text from LLM last-token representations using adapter mapping to token embedding space for autoregressive reconstruction
Details
Motivation: LLMs have achieved remarkable progress but remain opaque; investigate to what extent original input text can be recovered from a single last-token representation.
Method: Propose Rep2Text framework with trainable adapter that maps target model’s last-token representation into token embedding space of decoding language model for autoregressive text reconstruction
Result: On average, roughly half of tokens in 16-token sequences can be recovered while preserving strong semantic coherence; shows information bottleneck effect with sequence length; scaling effects less pronounced; robust generalization to clinical data
Conclusion: Last-token representations contain substantial information about input text; Rep2Text enables text recovery from compressed representations with semantic preservation; reveals insights about information compression in LLMs
Abstract: Large language models (LLMs) have achieved remarkable progress across diverse tasks, yet their internal mechanisms remain largely opaque. In this work, we investigate a fundamental question: to what extent can the original input text be recovered from a single last-token representation in an LLM? To this end, we propose Rep2Text, a novel framework for decoding text from last-token representations. Rep2Text employs a trainable adapter that maps a target model’s last-token representation into the token embedding space of a decoding language model, which then autoregressively reconstructs the input text. Experiments across various model combinations (Llama-3.1-8B, Gemma-7B, Mistral-7B-v0.1, Llama-3.2-3B, etc.) show that, on average, roughly half of the tokens in 16-token sequences can be recovered from this compressed representation while preserving strong semantic coherence. Further analysis reveals a clear information bottleneck effect: as sequence length increases, token-level recovery declines, while semantic information remains relatively well preserved. We also find that scaling effects are less pronounced in inversion tasks. Finally, our framework demonstrates robust generalization to out-of-distribution clinical data.
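The adapter at the heart of Rep2Text can be sketched in a few lines. This is an illustrative stand-in: the hidden sizes, the random initialization, and the simple linear map are assumptions, and the actual decoding model is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
d_tgt, d_dec = 4096, 2048   # illustrative hidden sizes, not the paper's

# Trainable adapter (here just random-initialized): maps the target model's
# last-token representation into the decoder's token-embedding space.
W = rng.normal(0, 0.02, size=(d_tgt, d_dec))
b = np.zeros(d_dec)

h_last = rng.normal(size=d_tgt)      # a single last-token representation
soft_prompt = h_last @ W + b         # now lives in the decoder's embedding space
# A decoding LM would take `soft_prompt` as a prefix embedding and
# autoregressively reconstruct the original input text from it.
```

Training would fit `W` and `b` so that the decoder, conditioned on `soft_prompt`, reproduces the text that produced `h_last`; the ~50% token recovery reported above is a property of that trained system, not of this sketch.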
[87] HalluClean: A Unified Framework to Combat Hallucinations in LLMs
Yaxin Zhao, Yu Zhang
Main category: cs.CL
TL;DR: HalluClean: A lightweight, task-agnostic framework for detecting and correcting hallucinations in LLM-generated text using a reasoning-enhanced paradigm with planning, execution, and revision stages.
Details
Motivation: LLMs often produce hallucinated content that undermines factual reliability, creating a need for methods to detect and correct these hallucinations to enhance trustworthiness in real-world applications.
Method: Uses a reasoning-enhanced paradigm with three explicit stages: planning (identify potential hallucinations), execution (verify claims), and revision (correct unsupported content). Employs minimal task-routing prompts for zero-shot generalization across domains without external knowledge sources or supervised detectors.
Result: Extensive evaluations on five tasks (question answering, dialogue, summarization, math word problems, contradiction detection) show HalluClean significantly improves factual consistency and outperforms competitive baselines.
Conclusion: HalluClean demonstrates potential to enhance the trustworthiness of LLM outputs through its lightweight, task-agnostic approach to hallucination detection and correction.
Abstract: Large language models (LLMs) have achieved impressive performance across a wide range of natural language processing tasks, yet they often produce hallucinated content that undermines factual reliability. To address this challenge, we introduce HalluClean, a lightweight and task-agnostic framework for detecting and correcting hallucinations in LLM-generated text. HalluClean adopts a reasoning-enhanced paradigm, explicitly decomposing the process into planning, execution, and revision stages to identify and refine unsupported claims. It employs minimal task-routing prompts to enable zero-shot generalization across diverse domains, without relying on external knowledge sources or supervised detectors. We conduct extensive evaluations on five representative tasks-question answering, dialogue, summarization, math word problems, and contradiction detection. Experimental results show that HalluClean significantly improves factual consistency and outperforms competitive baselines, demonstrating its potential to enhance the trustworthiness of LLM outputs in real-world applications.
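The plan-execute-revise loop can be illustrated with a rule-based toy. In HalluClean each stage is an LLM prompt; here `verify_fn` and the fact set are hypothetical stand-ins, and "revision" is reduced to dropping unsupported claims.

```python
def halluclean_style(text, verify_fn):
    """Toy plan -> execute -> revise loop (rule-based stand-in for the
    LLM-driven stages)."""
    # Plan: decompose the draft into atomic claims to check.
    claims = [c.strip() for c in text.split(".") if c.strip()]
    # Execute: verify each claim; Revise: here, simply drop unsupported ones.
    kept = [c for c in claims if verify_fn(c)]
    return ". ".join(kept) + ("." if kept else "")

facts = {"Paris is the capital of France",
         "Water boils at 100 C at sea level"}
draft = "Paris is the capital of France. The moon is made of cheese."
clean = halluclean_style(draft, lambda c: c in facts)  # unsupported claim removed
```

The real framework replaces the lookup with prompted verification and rewrites (rather than deletes) unsupported content, but the control flow is the same.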
[88] Auditing Google’s AI Overviews and Featured Snippets: A Case Study on Baby Care and Pregnancy
Desheng Hu, Joachim Baumann, Aleksandra Urman, Elsa Lichtenegger, Robin Forsberg, Aniko Hannak, Christo Wilson
Main category: cs.CL
TL;DR: Systematic audit of Google’s AI-generated health content (AI Overviews & Featured Snippets) reveals concerning inconsistencies and lack of medical safeguards in pregnancy/baby care information.
Details
Motivation: Google Search increasingly surfaces AI-generated content through AI Overviews and Featured Snippets that users rely on but cannot control, raising concerns about information quality in high-stakes health domains like pregnancy and baby care.
Method: Conducted systematic algorithm audit of 1,508 real baby care and pregnancy-related queries, evaluating quality dimensions including answer consistency, relevance, medical safeguards, source categories, and sentiment alignment using a robust evaluation framework.
Result: Found concerning gaps: 33% inconsistency between AIO and FS on same page, critically low medical safeguards (11% in AIO, 7% in FS), health/wellness websites dominate sources but FS often links to commercial sources, despite high relevance scores.
Conclusion: Findings highlight serious implications for public health information access and demonstrate need for stronger quality controls in AI-mediated health information; methodology provides transferable framework for auditing AI systems across high-stakes domains.
Abstract: Google Search increasingly surfaces AI-generated content through features like AI Overviews (AIO) and Featured Snippets (FS), which users frequently rely on despite having no control over their presentation. Through a systematic algorithm audit of 1,508 real baby care and pregnancy-related queries, we evaluate the quality and consistency of these information displays. Our robust evaluation framework assesses multiple quality dimensions, including answer consistency, relevance, presence of medical safeguards, source categories, and sentiment alignment. Our results reveal concerning gaps in information consistency, with information in AIO and FS displayed on the same search result page being inconsistent with each other in 33% of cases. Despite high relevance scores, both features critically lack medical safeguards (present in just 11% of AIO and 7% of FS responses). While health and wellness websites dominate source categories for both, AIO and FS, FS also often link to commercial sources. These findings have important implications for public health information access and demonstrate the need for stronger quality controls in AI-mediated health information. Our methodology provides a transferable framework for auditing AI systems across high-stakes domains where information quality directly impacts user well-being.
[89] L2V-CoT: Cross-Modal Transfer of Chain-of-Thought Reasoning via Latent Intervention
Yuliang Zhan, Xinyu Tang, Han Wan, Jian Li, Ji-Rong Wen, Hao Sun
Main category: cs.CL
TL;DR: L2V-CoT: A training-free method that transfers Chain-of-Thought reasoning from LLMs to VLMs using frequency-domain latent representation intervention.
Details
Motivation: Vision-Language Models struggle with multi-step reasoning tasks due to limited multimodal reasoning data. Existing methods for transferring CoT reasoning from LLMs to VLMs require high training costs or architectural alignment, creating a need for more efficient approaches.
Method: Uses Linear Artificial Tomography to show LLMs and VLMs share similar low-frequency latent representations of CoT reasoning. Proposes L2V-CoT which extracts and resamples low-frequency CoT representations from LLMs in the frequency domain, enabling dimension matching and latent injection into VLMs during inference without training.
Result: Extensive experiments show L2V-CoT consistently outperforms training-free baselines and even surpasses supervised methods in transferring CoT reasoning capabilities to VLMs.
Conclusion: The shared low-frequency latent representations between LLMs and VLMs enable effective training-free transfer of CoT reasoning capabilities, providing an efficient alternative to costly supervised approaches.
Abstract: Recently, Chain-of-Thought (CoT) reasoning has significantly enhanced the capabilities of large language models (LLMs), but Vision-Language Models (VLMs) still struggle with multi-step reasoning tasks due to limited multimodal reasoning data. To bridge this gap, researchers have explored methods to transfer CoT reasoning from LLMs to VLMs. However, existing approaches either need high training costs or require architectural alignment. In this paper, we use Linear Artificial Tomography (LAT) to empirically show that LLMs and VLMs share similar low-frequency latent representations of CoT reasoning despite architectural differences. Based on this insight, we propose L2V-CoT, a novel training-free latent intervention approach that transfers CoT reasoning from LLMs to VLMs. L2V-CoT extracts and resamples low-frequency CoT representations from LLMs in the frequency domain, enabling dimension matching and latent injection into VLMs during inference to enhance reasoning capabilities. Extensive experiments demonstrate that our approach consistently outperforms training-free baselines and even surpasses supervised methods.
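The frequency-domain extraction and resampling step can be sketched with a plain FFT. The hidden sizes and the cutoff `keep` are illustrative assumptions; the trick shown is that an inverse real FFT evaluated at a new length resamples the low-pass signal, which is what lets a representation from one hidden size be injected into a model with another.

```python
import numpy as np

def lowfreq_resample(h, out_dim, keep=8):
    """Keep only the low-frequency components of a hidden vector, then
    inverse-transform at a new length to match the target model's width."""
    spec = np.fft.rfft(h)
    spec[keep:] = 0.0                         # discard high frequencies
    # irfft with n=out_dim resamples the low-pass signal to the new width;
    # rescale to compensate for the FFT's 1/n normalization change.
    return np.fft.irfft(spec, n=out_dim) * (out_dim / len(h))

rng = np.random.default_rng(0)
h_llm = rng.normal(size=4096)                    # e.g. an LLM hidden state
h_vlm = lowfreq_resample(h_llm, out_dim=3072)    # width-matched for the VLM
```

A pure low-frequency signal survives this round trip unchanged, which is the property the paper relies on when claiming the shared CoT structure lives in the low frequencies.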
[90] TempPerturb-Eval: On the Joint Effects of Internal Temperature and External Perturbations in RAG Robustness
Yongxin Zhou, Philippe Mulhem, Didier Schwab
Main category: cs.CL
TL;DR: Systematic analysis of how text perturbations (simulating noisy retrieval) interact with temperature settings in RAG systems, showing high-temperature settings amplify vulnerability to noise.
Details
Motivation: Current RAG evaluation examines retrieval quality and generation parameters like temperature in isolation, overlooking their interaction effects on system robustness.
Method: Proposes RAG Perturbation-Temperature Analysis Framework that subjects retrieved documents to three distinct perturbation types across varying temperature settings, tested on HotpotQA with open-source and proprietary LLMs.
Result: Performance degradation follows distinct patterns: high-temperature settings consistently amplify vulnerability to perturbations, while certain perturbation types exhibit non-linear sensitivity across temperature range.
Conclusion: Provides diagnostic benchmark for RAG robustness, analytical framework for quantifying perturbation-temperature interactions, and practical guidelines for model selection and parameter tuning under noisy retrieval conditions.
Abstract: The evaluation of Retrieval-Augmented Generation (RAG) systems typically examines retrieval quality and generation parameters like temperature in isolation, overlooking their interaction. This work presents a systematic investigation of how text perturbations (simulating noisy retrieval) interact with temperature settings across multiple LLM runs. We propose a comprehensive RAG Perturbation-Temperature Analysis Framework that subjects retrieved documents to three distinct perturbation types across varying temperature settings. Through extensive experiments on HotpotQA with both open-source and proprietary LLMs, we demonstrate that performance degradation follows distinct patterns: high-temperature settings consistently amplify vulnerability to perturbations, while certain perturbation types exhibit non-linear sensitivity across the temperature range. Our work yields three key contributions: (1) a diagnostic benchmark for assessing RAG robustness, (2) an analytical framework for quantifying perturbation-temperature interactions, and (3) practical guidelines for model selection and parameter tuning under noisy retrieval conditions.
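Perturbing retrieved documents can be sketched as below. The paper does not specify its three perturbation types in this summary, so the three shown (sentence reordering, word deletion, character noise) are hypothetical examples of the genre, not the framework's actual operators.

```python
import random

def shuffle_sentences(doc, rng):
    """Order perturbation: scramble sentence order in a retrieved passage."""
    sents = [s for s in doc.split(". ") if s]
    rng.shuffle(sents)
    return ". ".join(sents)

def drop_words(doc, rng, p=0.2):
    """Deletion perturbation: randomly drop a fraction of words."""
    return " ".join(w for w in doc.split() if rng.random() > p)

def char_noise(doc, rng, p=0.05):
    """Character perturbation: randomly replace characters with noise."""
    return "".join(c if rng.random() > p else "#" for c in doc)

rng = random.Random(0)
doc = "Paris is in France. It is the capital. The Seine flows through it."
noisy = char_noise(drop_words(shuffle_sentences(doc, rng), rng), rng)
```

In the framework, passages perturbed like this would be fed to the generator at each temperature setting, and answer quality tracked across the perturbation-temperature grid.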
[91] Dementia-R1: Reinforced Pretraining and Reasoning from Unstructured Clinical Notes for Real-World Dementia Prognosis
Choonghan Kim, Hyunmin Hwang, Hangeol Chang, Jaemin Kim, Jinse Park, Jae-Sung Lim, Jong Chul Ye
Main category: cs.CL
TL;DR: Dementia-R1: An RL-based framework for longitudinal dementia prognosis from clinical notes that uses Cold-Start RL with pre-training on clinical indices to improve reasoning over symptom trajectories.
Details
Motivation: LLMs struggle with longitudinal prediction tasks like dementia prognosis that require reasoning over complex, non-monotonic symptom trajectories across multiple visits. Standard supervised training lacks symptom evolution annotations, and direct RL suffers from sparse binary rewards.
Method: Introduces Dementia-R1 with Cold-Start RL strategy that pre-trains the model to predict verifiable clinical indices extracted from patient histories before determining final clinical status, enhancing disease progression reasoning.
Result: Achieves best overall performance on AMC real-world cohort (84.02% AUROC), outperforming models up to 10x larger. Generalizes to Parkinson’s disease dementia prediction (78.37% AUROC) and achieves highest AUROC among LLM baselines on ADNI benchmark (83.17%).
Conclusion: Dementia-R1 demonstrates strong longitudinal reasoning over fluctuating cognitive trajectories and provides an effective RL-based framework for clinical prognosis from unstructured text.
Abstract: While Large Language Models (LLMs) have shown strong performance on clinical text understanding, they struggle with longitudinal prediction tasks such as dementia prognosis, which require reasoning over complex, non-monotonic symptom trajectories across multiple visits. Standard supervised training lacks explicit annotations for symptom evolution, while direct Reinforcement Learning (RL) is hindered by sparse binary rewards. To address this challenge, we introduce Dementia-R1, an RL-based framework for longitudinal dementia prognosis from unstructured clinical notes. Our approach adopts a Cold-Start RL strategy that pre-trains the model to predict verifiable clinical indices extracted from patient histories, enhancing the capability to reason about disease progression before determining the final clinical status. Extensive experiments show that Dementia-R1 achieves the best overall performance on the AMC real-world unstructured cohort, reaching an AUROC of 84.02% and outperforming models up to 10x larger. The framework also generalizes to Parkinson’s disease dementia prediction in an independent hospital cohort, achieving an AUROC of 78.37%. On the ADNI benchmark, our 7B model attains the highest AUROC among all LLM baselines at 83.17%, demonstrating strong longitudinal reasoning over fluctuating cognitive trajectories. Code is available at https://anonymous.4open.science/r/dementiar1-CDB5.
[92] A Multi-Perspective Benchmark and Moderation Model for Evaluating Safety and Adversarial Robustness
Naseem Machlovi, Maryam Saleki, Ruhul Amin, Mohamed Rahouti, Shawqi Al-Maliki, Junaid Qadir, Mohamed M. Abdallah, Ala Al-Fuqaha
Main category: cs.CL
TL;DR: GuardEval benchmark and GemmaGuard model for fine-grained content moderation in LLMs, addressing nuanced safety issues like implicit bias and jailbreak prompts
Details
Motivation: Current LLMs struggle with nuanced content moderation cases like implicit offensiveness, subtle biases, and jailbreak prompts due to the subjective nature of these issues and training data biases, requiring better safety systems.
Method: Created the GuardEval benchmark dataset with 106 fine-grained categories across emotions, offensive language, biases, and safety concerns; developed GemmaGuard (GGuard) by QLoRA fine-tuning of Gemma3-12B on GuardEval.
Result: GGuard achieves macro F1 score of 0.832, substantially outperforming OpenAI Moderator (0.64) and Llama Guard (0.61), demonstrating improved safety and adversarial robustness
Conclusion: Multi-perspective, human-centered safety benchmarks are critical for consistent moderation; diverse representative data improves safety on complex borderline cases
Abstract: As large language models (LLMs) become deeply embedded in daily life, the urgent need for safer moderation systems that distinguish between naive and harmful requests while upholding appropriate censorship boundaries has never been greater. While existing LLMs can detect dangerous or unsafe content, they often struggle with nuanced cases such as implicit offensiveness, subtle gender and racial biases, and jailbreak prompts, due to the subjective and context-dependent nature of these issues. Furthermore, their heavy reliance on training data can reinforce societal biases, resulting in inconsistent and ethically problematic outputs. To address these challenges, we introduce GuardEval, a unified multi-perspective benchmark dataset designed for both training and evaluation, containing 106 fine-grained categories spanning human emotions, offensive and hateful language, gender and racial bias, and broader safety concerns. We also present GemmaGuard (GGuard), a Quantized Low-Rank Adaptation (QLoRA), fine-tuned version of Gemma3-12B trained on GuardEval, to assess content moderation with fine-grained labels. Our evaluation shows that GGuard achieves a macro F1 score of 0.832, substantially outperforming leading moderation models, including OpenAI Moderator (0.64) and Llama Guard (0.61). We show that multi-perspective, human-centered safety benchmarks are critical for mitigating inconsistent moderation decisions. GuardEval and GGuard together demonstrate that diverse, representative data materially improve safety, and adversarial robustness on complex, borderline cases.
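Macro F1, the headline metric here, is the unweighted mean of per-class F1 over all categories; a minimal stdlib sketch of that computation (the labels and data below are illustrative, not drawn from GuardEval):

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores over the given label set."""
    f1s = []
    for lab in labels:
        tp = sum(t == lab and p == lab for t, p in zip(y_true, y_pred))
        fp = sum(t != lab and p == lab for t, p in zip(y_true, y_pred))
        fn = sum(t == lab and p != lab for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

Because every class counts equally, rare but safety-critical categories weigh as much as frequent ones, which is why macro F1 is a natural choice for a 106-category moderation benchmark.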
[93] Identifying and Mitigating Bottlenecks in Role-Playing Agents: A Systematic Study of Disentangling Character Profile Axes
Yonghyun Jun, Junhyuk Choi, Jihyeong Park, Jeonghyun Park, Liu Nicole Geumheon, Hwanhee Lee
Main category: cs.CL
TL;DR: A diagnostic framework for LLM role-playing agents reveals that character morality (not familiarity or structure) is the key factor affecting performance, with immoral characters suffering due to alignment priors, addressed by a training-free decoding strategy.
Details
Motivation: Despite various construction methodologies for LLM role-playing agents, it's unclear which aspects of character profiles genuinely drive role-playing quality. The paper aims to systematically diagnose the impact of character profiles along three axes: Familiarity, Structure, and Disposition.
Method: Introduced a diagnostic framework with three axes (Familiarity, Structure, Disposition), designed a unified hierarchical schema (5 dimensions, 28 fields) for character attributes, constructed a controlled dataset of 211 personas varying along these axes, evaluated five LLMs on single- and multi-turn benchmarks, and proposed Field-Aware Contrastive Decoding (FACD) to mitigate alignment-induced performance gaps.
Result: Revealed striking asymmetry: Familiarity and Structure show negligible impact, while Disposition (Moral vs. Immoral) produces large, consistent performance degradation for immoral characters across all conditions. Performance drop concentrates in motivation-related attributes, indicating alignment priors actively suppress tokens needed for faithful immoral portrayal. FACD significantly reduces the Moral-Immoral performance gap without sacrificing moral-character performance.
Conclusion: Character morality is the primary driver of role-playing quality in LLMs, with alignment priors creating a bottleneck for faithful immoral portrayal. The proposed FACD offers an effective, training-free solution to mitigate this issue while maintaining performance on moral characters.
Abstract: Advancements in Large Language Model (LLM) Role-Playing Agents have focused on various construction methodologies, yet it remains unclear which aspects of character profiles genuinely drive role-playing quality. To bridge this gap, we introduce a systematic diagnostic framework that disentangles the impact of character profiles along three axes: Familiarity (Known vs. Unknown), Structure (Structured vs. Unstructured), and Disposition (Moral vs. Immoral). To investigate these axes, we design a unified hierarchical schema (5 dimensions, 28 fields) standardizing character attributes and construct a controlled dataset of 211 personas varying along these three axes. We evaluate five LLMs on single and multi-turn benchmarks. Our results reveal a striking asymmetry: Familiarity and Structure show negligible impact, while Disposition produces large, consistent performance degradation for immoral characters across all conditions. This performance drop concentrates in motivation-related attributes, indicating that alignment priors actively suppress tokens needed for faithful immoral portrayal. To mitigate this alignment-induced bottleneck, we propose Field-Aware Contrastive Decoding (FACD), a training-free strategy that selectively amplifies suppressed immoral-field signals, significantly reducing the Moral-Immoral performance gap without sacrificing moral-character performance.
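The abstract does not give FACD's exact update rule, so the following is only a generic contrastive-decoding sketch in its spirit: token scores conditioned on the full character profile are pushed away from scores conditioned on a profile with the suppressed fields masked out. The function names and the `alpha` weight are assumptions for illustration:

```python
import math

def contrastive_scores(logits_full, logits_masked, alpha=1.0):
    """Amplify the signal that appears only when the suppressed profile
    fields are present, by extrapolating away from the masked-profile logits."""
    return [f + alpha * (f - m) for f, m in zip(logits_full, logits_masked)]

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]
```

A token whose logit rises only when the immoral fields are in context gets boosted relative to plain decoding, which is the qualitative effect the paper attributes to FACD.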
[94] NAACL: Noise-AwAre Verbal Confidence Calibration for Robust LLMs in RAG Systems
Jiayu Liu, Rui Wang, Qing Zong, Yumeng Wang, Cheng Qian, Qingcheng Zeng, Tianshi Zheng, Haochen Shi, Dadi Guo, Baixuan Xu, Chunyang Li, Yangqiu Song
Main category: cs.CL
TL;DR: NAACL is a noise-aware calibration framework that improves LLM confidence calibration in retrieval-augmented generation by addressing overconfidence caused by noisy retrieved contexts.
Details
Motivation: Retrieval-augmented generation (RAG) is widely used to improve LLM grounding, but confidence calibration in RAG settings remains poorly understood. LLMs exhibit poor calibration due to noisy retrieved contexts (contradictory or irrelevant evidence) that inflate false certainty and cause severe overconfidence.
Method: Proposes NAACL Rules (Noise-AwAre Confidence CaLibration Rules) as a principled foundation for resolving overconfidence under noise. Designs the NAACL framework, which synthesizes supervision from about 2K HotpotQA examples guided by these rules, then performs supervised fine-tuning (SFT) to equip models with intrinsic noise awareness without stronger teacher models.
Result: NAACL yields substantial gains, improving ECE (Expected Calibration Error) scores by 10.9% in-domain and 8.0% out-of-domain. Bridges the gap between retrieval noise and verbal calibration.
Conclusion: NAACL paves the way for both accurate and epistemically reliable LLMs by addressing confidence calibration in noisy RAG settings through noise-aware calibration rules and framework.
Abstract: Accurately assessing model confidence is essential for deploying large language models (LLMs) in mission-critical factual domains. While retrieval-augmented generation (RAG) is widely adopted to improve grounding, confidence calibration in RAG settings remains poorly understood. We conduct a systematic study across four benchmarks, revealing that LLMs exhibit poor calibration performance due to noisy retrieved contexts. Specifically, contradictory or irrelevant evidence tends to inflate the model’s false certainty, leading to severe overconfidence. To address this, we propose NAACL Rules (Noise-AwAre Confidence CaLibration Rules) to provide a principled foundation for resolving overconfidence under noise. We further design NAACL, a noise-aware calibration framework that synthesizes supervision from about 2K HotpotQA examples guided by these rules. By performing supervised fine-tuning (SFT) with this data, NAACL equips models with intrinsic noise awareness without relying on stronger teacher models. Empirical results show that NAACL yields substantial gains, improving ECE scores by 10.9% in-domain and 8.0% out-of-domain. By bridging the gap between retrieval noise and verbal calibration, NAACL paves the way for both accurate and epistemically reliable LLMs.
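Expected Calibration Error, the metric NAACL improves, bins predictions by stated confidence and measures the confidence-weighted gap between average confidence and accuracy per bin; a standard stdlib sketch (the 10-bin equal-width setup is the common default, not necessarily the paper's exact configuration):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: sum over bins of (bin size / N) * |avg confidence - accuracy|."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # clamp c == 1.0 into last bin
        bins[idx].append((c, ok))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        acc = sum(ok for _, ok in b) / len(b)
        ece += len(b) / n * abs(avg_conf - acc)
    return ece
```

A perfectly calibrated model (80% confidence, 80% accuracy) scores 0; a model that claims certainty and is always wrong scores 1, which is the overconfidence failure mode the paper targets.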
[95] DLLM Agent: See Farther, Run Faster
Huiling Zhen, Weizhe Lin, Renxi Liu, Kai Han, Yiming Li, Yuchuan Tian, Hanting Chen, Xiaoguang Li, Xiaosong Li, Chen Chen, Xianzhi Yu, Mingxuan Yuan, Youliang Yan, Peifeng Qin, Jun Wang, Yu Wang, Dacheng Tao, Yunhe Wang
Main category: cs.CL
TL;DR: Diffusion LLMs in agent workflows show 30%+ faster end-to-end execution than autoregressive agents with comparable accuracy, requiring fewer interaction rounds and tool invocations due to better planning.
Details
Motivation: To explore whether diffusion-based LLMs offer systematic advantages over autoregressive models for multi-step decision making in agentic systems, particularly in planning and tool-use behaviors.
Method: Created DLLM and AR backbones within the same DeepDiver agent workflow, performed matched agent-oriented fine-tuning on identical trajectory data, and compared performance across benchmarks and case studies.
Result: DLLM Agents are over 30% faster end-to-end than AR agents with comparable accuracy, require fewer interaction rounds and tool invocations, show higher planner hit rates, and converge earlier to correct action paths with less backtracking.
Conclusion: Diffusion backbones offer significant efficiency advantages for agentic systems but require careful handling of tool-call failures and attention masking for multi-turn inputs to achieve optimal performance.
Abstract: Diffusion large language models (DLLMs) have emerged as an alternative to autoregressive (AR) decoding with appealing efficiency and modeling properties, yet their implications for agentic multi-step decision making remain underexplored. We ask a concrete question: when the generation paradigm is changed but the agent framework and supervision are held fixed, do diffusion backbones induce systematically different planning and tool-use behaviors, and do these differences translate into end-to-end efficiency gains? We study this in a controlled setting by instantiating DLLM and AR backbones within the same agent workflow (DeepDiver) and performing matched agent-oriented fine-tuning on the same trajectory data, yielding diffusion-backed DLLM Agents and directly comparable AR agents. Across benchmarks and case studies, we find that, at comparable accuracy, DLLM Agents are on average over 30% faster end to end than AR agents, with some cases exceeding 8x speedup. Conditioned on correct task completion, DLLM Agents also require fewer interaction rounds and tool invocations, consistent with higher planner hit rates that converge earlier to a correct action path with less backtracking. We further identify two practical considerations for deploying diffusion backbones in tool-using agents. First, naive DLLM policies are more prone to structured tool-call failures, necessitating stronger tool-call-specific training to emit valid schemas and arguments. Second, for multi-turn inputs interleaving context and action spans, diffusion-style span corruption requires aligned attention masking to avoid spurious context-action information flow; without such alignment, performance degrades. Finally, we analyze attention dynamics across workflow stages and observe paradigm-specific coordination patterns, suggesting stronger global planning signals in diffusion-backed agents.
[96] LHAW: Controllable Underspecification for Long-Horizon Tasks
George Pu, Michael S. Lee, Udari Madhushani Sehwag, David J. Lee, Bryan Zhu, Yash Maurya, Mohit Raghavendra, Yuan Xue, Samuel Marc Denton
Main category: cs.CL
TL;DR: LHAW is a modular pipeline that transforms well-specified tasks into controllable underspecified variants across four dimensions (Goals, Constraints, Inputs, Context) to systematically evaluate how long-horizon workflow agents handle ambiguity and clarification-seeking behavior.
Details
Motivation: Long-horizon workflow agents need to handle ambiguous situations requiring clarification to ensure correct task execution, but current progress is limited by the lack of scalable, task-agnostic frameworks for systematically curating and measuring ambiguity's impact across custom workflows.
Method: LHAW is a modular, dataset-agnostic synthetic pipeline that transforms any well-specified task into controllable underspecified variants by systematically removing information across four dimensions (Goals, Constraints, Inputs, Context) at configurable severity levels, then validates variants through empirical agent trials to classify them as outcome-critical, divergent, or benign based on observed terminal-state divergence.
Result: The authors release 285 task variants from TheAgentCompany, SWE-Bench Pro and MCP-Atlas according to their taxonomy, and provide formal analysis measuring how current agents detect, reason about, and resolve underspecification across ambiguous settings.
Conclusion: LHAW provides the first systematic framework for cost-sensitive evaluation of agent clarification behavior in long-horizon settings, enabling development of reliable autonomous systems that can handle ambiguity effectively.
Abstract: Long-horizon workflow agents that operate effectively over extended periods are essential for truly autonomous systems. Their reliable execution critically depends on the ability to reason through ambiguous situations in which clarification seeking is necessary to ensure correct task execution. However, progress is limited by the lack of scalable, task-agnostic frameworks for systematically curating and measuring the impact of ambiguity across custom workflows. We address this gap by introducing LHAW (Long-Horizon Augmented Workflows), a modular, dataset-agnostic synthetic pipeline that transforms any well-specified task into controllable underspecified variants by systematically removing information across four dimensions - Goals, Constraints, Inputs, and Context - at configurable severity levels. Unlike approaches that rely on LLM predictions of ambiguity, LHAW validates variants through empirical agent trials, classifying them as outcome-critical, divergent, or benign based on observed terminal state divergence. We release 285 task variants from TheAgentCompany, SWE-Bench Pro and MCP-Atlas according to our taxonomy alongside formal analysis measuring how current agents detect, reason about, and resolve underspecification across ambiguous settings. LHAW provides the first systematic framework for cost-sensitive evaluation of agent clarification behavior in long-horizon settings, enabling development of reliable autonomous systems.
[97] The Art of Efficient Reasoning: Data, Reward, and Optimization
Taiqiang Wu, Zenan Xu, Bo Zhou, Ngai Wong
Main category: cs.CL
TL;DR: Systematic investigation of efficient reasoning for LLMs through RL-based reward shaping to incentivize short yet accurate thinking trajectories, with comprehensive evaluation metrics and practical guidelines.
Details
Motivation: LLMs benefit from scaled Chain-of-Thought reasoning but suffer from heavy computational overhead; efficient reasoning seeks to encourage short yet accurate thinking trajectories, typically through reward shaping with RL.
Method: Systematic investigation using fine-grained metrics, including length distribution conditioned on correctness and performance across token budgets (2k-32k); extensive experiments (0.2M GPU hours) analyzing training prompts, rollouts, reward shaping, and optimization strategies; validation across Qwen3 models (0.6B-30B).
Result: Reveals two-stage training paradigm (length adaptation and reasoning refinement); key finding: maintain sufficient density of positive reward signals and avoid short-is-correct trap; learned length bias generalizes across domains and difficulty levels.
Conclusion: Provides valuable insights and practical guidelines for efficient reasoning in LLMs, demonstrating robustness and generalization across model sizes; weights made publicly available.
Abstract: Large Language Models (LLMs) consistently benefit from scaled Chain-of-Thought (CoT) reasoning, but also suffer from heavy computational overhead. To address this issue, efficient reasoning aims to incentivize short yet accurate thinking trajectories, typically through reward shaping with Reinforcement Learning (RL). In this paper, we systematically investigate the mechanics of efficient reasoning for LLMs. For comprehensive evaluation, we advocate for more fine-grained metrics, including length distribution conditioned on correctness and performance across a wide spectrum of token budgets ranging from 2k to 32k. First, we reveal that the training process follows a two-stage paradigm: length adaptation and reasoning refinement. Through extensive experiments (about 0.2 million GPU hours) in a unified protocol, we deconstruct training prompts and rollouts, reward shaping, and optimization strategies. A central finding is to maintain a sufficient density of positive reward signals and avoid the short-is-correct trap. Moreover, the learned length bias generalizes across domains and difficulty levels. We distill these findings into valuable insights and practical guidelines, and validate them across the Qwen3 models ranging from 0.6B to 30B, demonstrating the robustness and generalization. Weights are available at https://wutaiqiang.github.io/project/Art
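The abstract's central finding, keep positive-reward density high while avoiding the short-is-correct trap, can be illustrated with one common family of length-shaped rewards (a sketch, not the paper's exact rule): correct answers always earn a positive reward that shrinks gently with length, and wrong answers earn nothing regardless of brevity.

```python
def shaped_reward(correct: bool, length: int, budget: int = 8192,
                  min_bonus: float = 0.2) -> float:
    """Length-shaped reward: correct responses stay strictly positive
    (preserving positive-signal density), while brevity alone pays nothing."""
    if not correct:
        return 0.0  # a short wrong answer is never rewarded
    frac = min(length / budget, 1.0)
    return min_bonus + (1.0 - min_bonus) * (1.0 - frac)
```

The `min_bonus` floor is the key design choice: collapsing it to zero would let long-but-correct rollouts receive no reward, starving training of positive signal, which is exactly the failure mode the paper warns against.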
[98] CIRCUS: Circuit Consensus under Uncertainty via Stability Ensembles
Swapnil Parekh
Main category: cs.CL
TL;DR: CIRCUS is a method for robust circuit discovery in neural networks that addresses threshold sensitivity by treating circuit discovery as uncertainty over explanations, identifying robust core structures through consensus across multiple pruning configurations.
Details
Motivation: Current mechanistic circuit analysis suffers from threshold sensitivity: different pruning thresholds yield different circuits, making it hard to distinguish robust computational structure from threshold artifacts. There's a need for principled methods to separate core computational elements from contingent or noisy connections.
Method: CIRCUS prunes attribution graphs under multiple configurations, computes empirical inclusion frequencies for each edge, and extracts consensus circuits containing edges present in every configuration. This yields a core/contingent/noise decomposition analogous to Bayesian variable selection.
Result: On Gemma-2-2B and Llama-3.2-1B models, consensus circuits are 40x smaller than union of all configurations while maintaining explanatory power. They consistently outperform influence-ranked and random baselines, with causal relevance confirmed by activation patching.
Conclusion: CIRCUS provides a principled approach to robust circuit discovery that separates threshold artifacts from genuine computational structure, offering a more reliable foundation for mechanistic interpretability research.
Abstract: Every mechanistic circuit carries an invisible asterisk: it reflects not just the model’s computation, but the analyst’s choice of pruning threshold. Change that choice and the circuit changes, yet current practice treats a single pruned subgraph as ground truth with no way to distinguish robust structure from threshold artifacts. We introduce CIRCUS, which reframes circuit discovery as a problem of uncertainty over explanations. CIRCUS prunes one attribution graph under B configurations, assigns each edge an empirical inclusion frequency s(e) in [0,1] measuring how robustly it survives across the configuration family, and extracts a consensus circuit of edges present in every view. This yields a principled core/contingent/noise decomposition (analogous to posterior model-inclusion indicators in Bayesian variable selection) that separates robust structure from threshold-sensitive artifacts, with negligible overhead. On Gemma-2-2B and Llama-3.2-1B, consensus circuits are 40x smaller than the union of all configurations while retaining comparable influence-flow explanatory power, consistently outperform influence-ranked and random baselines, and are confirmed causally relevant by activation patching.
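The consensus computation described above reduces to counting, per edge, how many of the B pruning configurations retain it. A minimal sketch (the `noise_floor` threshold is an assumed parameter, and real edges would be attribution-graph edges rather than strings):

```python
from collections import Counter

def consensus(edge_sets, noise_floor=0.2):
    """Given the edge sets surviving B pruning configurations, compute each
    edge's inclusion frequency s(e) in [0, 1] and split edges into
    core (s(e) == 1), noise (s(e) <= noise_floor), and contingent (the rest)."""
    b = len(edge_sets)
    counts = Counter(e for edges in edge_sets for e in set(edges))
    s = {e: c / b for e, c in counts.items()}
    core = {e for e, f in s.items() if f == 1.0}
    noise = {e for e, f in s.items() if f <= noise_floor}
    contingent = set(s) - core - noise
    return s, core, contingent, noise
```

The consensus circuit is just the `core` set, which is why it can be far smaller than the union of all configurations while keeping only edges that survive every pruning choice.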
[99] Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections
Łukasz Borchmann, Jordy Van Landeghem, Michał Turski, Shreyansh Padarha, Ryan Othniel Kearns, Adam Mahdi, Niels Rogge, Clémentine Fourrier, Siwei Han, Huaxiu Yao, Artemis Llabrés, Yiming Xu, Dimosthenis Karatzas, Hao Zhang, Anupam Datta
Main category: cs.CL
TL;DR: MADQA benchmark evaluates multimodal agents’ strategic reasoning on PDF document workflows, revealing they rely on brute-force search rather than genuine strategic planning despite matching human accuracy on different questions.
Details
Motivation: To determine whether multimodal agents demonstrate genuine strategic reasoning or merely stochastic trial-and-error search in document-intensive workflows, and to provide a benchmark for evaluating agentic abilities.
Method: Introduces the MADQA benchmark with 2,250 human-authored questions based on 800 heterogeneous PDF documents, designed using Classical Test Theory for discriminative power. Develops a novel evaluation protocol measuring the accuracy-effort trade-off to assess agentic behavior.
Result: Best agents match human searchers in raw accuracy but succeed on largely different questions, relying on brute-force search to compensate for weak strategic planning. They fail to close the nearly 20% gap to oracle performance and persist in unproductive loops.
Conclusion: Current multimodal agents lack genuine strategic reasoning capabilities and rely on inefficient search methods. The released dataset and evaluation framework aim to facilitate transition from brute-force retrieval to calibrated, efficient reasoning.
Abstract: Multimodal agents offer a promising path to automating complex document-intensive workflows. Yet, a critical question remains: do these agents demonstrate genuine strategic reasoning, or merely stochastic trial-and-error search? To address this, we introduce MADQA, a benchmark of 2,250 human-authored questions grounded in 800 heterogeneous PDF documents. Guided by Classical Test Theory, we design it to maximize discriminative power across varying levels of agentic abilities. To evaluate agentic behaviour, we introduce a novel evaluation protocol measuring the accuracy-effort trade-off. Using this framework, we show that while the best agents can match human searchers in raw accuracy, they succeed on largely different questions and rely on brute-force search to compensate for weak strategic planning. They fail to close the nearly 20% gap to oracle performance, persisting in unproductive loops. We release the dataset and evaluation harness to help facilitate the transition from brute-force retrieval to calibrated, efficient reasoning.
[100] Prompt Injection as Role Confusion
Charles Ye, Jasmine Cui, Dylan Hadfield-Menell
Main category: cs.CL
TL;DR: Paper reveals that prompt injection attacks succeed due to “role confusion” - models assign authority based on how text is written rather than its source, allowing attackers to spoof reasoning and inherit authority.
Details
Motivation: Despite extensive safety training, language models remain vulnerable to prompt injection attacks. The paper aims to understand the fundamental mechanism behind why these attacks work, tracing it to how models internally identify roles and assign authority.
Method: The authors design novel "role probes" to capture how models internally identify "who is speaking." They test their insight by injecting spoofed reasoning into user prompts and tool outputs, and measure attack success across multiple open- and closed-weight models.
Result: The attacks achieve average success rates of 60% on StrongREJECT and 61% on agent exfiltration, with near-zero baselines. The degree of internal role confusion strongly predicts attack success before generation begins, revealing that security is defined at the interface but authority is assigned in latent space.
Conclusion: Prompt injection attacks exploit a fundamental role-confusion mechanism where models infer roles from how text is written rather than its source. This creates a security gap between interface-level protections and latent-space authority assignment. The paper provides a unifying mechanistic framework for understanding diverse prompt-injection attacks.
Abstract: Language models remain vulnerable to prompt injection attacks despite extensive safety training. We trace this failure to role confusion: models infer roles from how text is written, not where it comes from. We design novel role probes to capture how models internally identify “who is speaking.” These reveal why prompt injection works: untrusted text that imitates a role inherits that role’s authority. We test this insight by injecting spoofed reasoning into user prompts and tool outputs, achieving average success rates of 60% on StrongREJECT and 61% on agent exfiltration, across multiple open- and closed-weight models with near-zero baselines. Strikingly, the degree of internal role confusion strongly predicts attack success before generation begins. Our findings reveal a fundamental gap: security is defined at the interface but authority is assigned in latent space. More broadly, we introduce a unifying, mechanistic framework for prompt injection, demonstrating that diverse prompt-injection attacks exploit the same underlying role-confusion mechanism.
[101] A Unified Framework to Quantify Cultural Intelligence of AI
Sunipa Dev, Vinodkumar Prabhakaran, Rutledge Chin Feman, Aida Davani, Remi Denton, Charu Kalia, Piyawat Lertvittayakumjorn, Madhurima Maji, Rida Qadri, Negar Rostamzadeh, Renee Shelby, Romina Stella, Hayk Stepanyan, Erin van Liemt, Aishwarya Verma, Oscar Wahltinez, Edem Wornyo, Andrew Zaldivar, Saška Mojsilović
Main category: cs.CL
TL;DR: Summary unavailable: the arXiv API request for 2603.01211 returned HTTP 429 (rate limited)
Details
Motivation: Not available; the paper could not be retrieved
Method: Not available
Result: Not available
Conclusion: Not available
Abstract: Failed to fetch summary for 2603.01211: arXiv API returned HTTP 429 (rate limited)
[102] BiT-MCTS: A Theme-based Bidirectional MCTS Approach to Chinese Fiction Generation
Zhaoyi Li, Xu Zhang, Xiaojun Wan
Main category: cs.CL
TL;DR: Summary unavailable: the arXiv API request for 2603.14410 returned HTTP 429 (rate limited)
Details
Motivation: Not available; the paper could not be retrieved
Method: Not available
Result: Not available
Conclusion: Not available
Abstract: Failed to fetch summary for 2603.14410: arXiv API returned HTTP 429 (rate limited)
[103] Beyond bouba/kiki: Multidimensional semantic signals are deeply woven into the fabric of natural language
Gexin Zhao
Main category: cs.CL
TL;DR: Summary unavailable: the arXiv API request for 2603.17306 returned HTTP 429 (rate limited)
Details
Motivation: Not available; the paper could not be retrieved
Method: Not available
Result: Not available
Conclusion: Not available
Abstract: Failed to fetch summary for 2603.17306: arXiv API returned HTTP 429 (rate limited)
[104] Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models
Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, Aditya Grover
Main category: cs.CL
TL;DR: Summary unavailable: the arXiv API request for 2601.18734 returned HTTP 429 (rate limited)
Details
Motivation: Not available; the paper could not be retrieved
Method: Not available
Result: Not available
Conclusion: Not available
Abstract: Failed to fetch summary for 2601.18734: arXiv API returned HTTP 429 (rate limited)
[105] Balancing the Reasoning Load: Difficulty-Differentiated Policy Optimization with Length Redistribution for Efficient and Robust Reinforcement Learning
Yinan Xia, Haotian Zhang, Huiming Wang
Main category: cs.CL
TL;DR: Summary unavailable: the arXiv API request for 2603.18533 returned HTTP 429 (rate limited)
Details
Motivation: Not available; the paper could not be retrieved
Method: Not available
Result: Not available
Conclusion: Not available
Abstract: Failed to fetch summary for 2603.18533: arXiv API returned HTTP 429 (rate limited)
cs.CV
[106] Diffusion-Guided Semantic Consistency for Multimodal Heterogeneity
Jing Liu, Zhengliang Guo, Yan Wang, Xiaoguang Zhu, Yao Du, Zehua Wang, Victor C. M. Leung
Main category: cs.CV
TL;DR: SemanticFL: A federated learning framework using pre-trained diffusion models to address non-IID data challenges in multimodal perception by creating shared semantic latent spaces.
Details
Motivation: Federated learning suffers from performance degradation with non-IID client data, especially in multimodal perception settings. Existing methods fail to address semantic discrepancies between clients, leading to suboptimal performance for multimedia systems requiring robust perception.
Method: Leverages multi-layer semantic representations from pre-trained Stable Diffusion models (VAE-encoded latents and U-Net hierarchical features) to create a shared latent space aligning heterogeneous clients. Uses efficient client-server architecture offloading heavy computation to server, with unified consistency mechanism employing cross-modal contrastive learning to stabilize convergence.
Result: Extensive experiments on CIFAR-10, CIFAR-100, and TinyImageNet under diverse heterogeneity scenarios show SemanticFL surpasses existing FL approaches, achieving accuracy gains of up to 5.49% over FedAvg.
Conclusion: SemanticFL effectively addresses non-IID challenges in federated learning for multimodal perception by leveraging diffusion model semantics, creating robust representations for heterogeneous data.
Abstract: Federated learning (FL) is severely challenged by non-independent and identically distributed (non-IID) client data, a problem that degrades global model performance, especially in multimodal perception settings. Conventional methods often fail to address the underlying semantic discrepancies between clients, leading to suboptimal performance for multimedia systems requiring robust perception. To overcome this, we introduce SemanticFL, a novel framework that leverages the rich semantic representations of pre-trained diffusion models to provide privacy-preserving guidance for local training. Our approach leverages multi-layer semantic representations from a pre-trained Stable Diffusion model (including VAE-encoded latents and U-Net hierarchical features) to create a shared latent space that aligns heterogeneous clients, facilitated by an efficient client-server architecture that offloads heavy computation to the server. A unified consistency mechanism, employing cross-modal contrastive learning, further stabilizes convergence. We conduct extensive experiments on benchmarks including CIFAR-10, CIFAR-100, and TinyImageNet under diverse heterogeneity scenarios. Our results demonstrate that SemanticFL surpasses existing federated learning approaches, achieving accuracy gains of up to 5.49% over FedAvg, validating its effectiveness in learning robust representations for heterogeneous and multimodal data for perception tasks.
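The abstract names cross-modal contrastive learning as the consistency mechanism but gives no formula. A minimal InfoNCE-style sketch of that general idea follows; `client_feats` and `semantic_feats` are hypothetical stand-ins for a client's local features and the server's diffusion-derived semantic features of the same batch, and the paper's actual loss may differ:

```python
import numpy as np

def info_nce(client_feats, semantic_feats, tau=0.1):
    """InfoNCE-style consistency: each client feature should match the
    semantic feature of the same sample and repel those of other samples."""
    a = client_feats / np.linalg.norm(client_feats, axis=1, keepdims=True)
    b = semantic_feats / np.linalg.norm(semantic_feats, axis=1, keepdims=True)
    logits = a @ b.T / tau                       # pairwise cosine similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_p))              # -log p(matching pair)

rng = np.random.default_rng(0)
sem = rng.normal(size=(8, 16))
aligned_loss = info_nce(sem.copy(), sem)              # features already match
mismatched_loss = info_nce(rng.normal(size=(8, 16)), sem)
assert aligned_loss < mismatched_loss
```

Lower loss for matching features is the property such a consistency term exploits to pull heterogeneous clients toward the shared semantic space.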
[107] AURORA: Adaptive Unified Representation for Robust Ultrasound Analysis
Ufaq Khan, L. D. M. S. Sai Teja, Ayuba Shakiru, Mai A. Shaaban, Yutong Xie, Muhammad Bilal, Muhammad Haris Khan
Main category: cs.CV
TL;DR: A unified multi-task transformer framework for ultrasound image analysis that handles segmentation, detection, classification, and landmark regression across diverse organs and datasets, achieving 81.84% average score on test set.
Details
Motivation: Ultrasound images vary widely across scanners, operators, and anatomical targets, causing poor generalization of models trained in one setting to new hospitals and clinical conditions. The Foundation Model Challenge for Ultrasound Image Analysis requires a single model to handle multiple tasks across diverse organs and datasets.
Method: Uses a transformer visual encoder from Qwen3-VL family, projects intermediate token features into spatial feature maps, fuses them using lightweight multi-scale feature pyramid for pixel-level predictions and global reasoning. Each task has small task-specific prediction heads, with task-aware sampling and selective loss balancing to manage heterogeneous supervision and reduce task imbalance.
Result: Performance improved from 67% to 85% on validation set and achieved average score of 81.84% on official test set across all tasks (segmentation, detection, classification, landmark regression).
Conclusion: The proposed unified multi-task framework is simple to optimize and adaptable across wide range of ultrasound analysis tasks, effectively addressing generalization challenges in medical imaging.
Abstract: Ultrasound images vary widely across scanners, operators, and anatomical targets, which often causes models trained in one setting to generalize poorly to new hospitals and clinical conditions. The Foundation Model Challenge for Ultrasound Image Analysis (FMC-UIA) reflects this difficulty by requiring a single model to handle multiple tasks, including segmentation, detection, classification, and landmark regression across diverse organs and datasets. We propose a unified multi-task framework based on a transformer visual encoder from the Qwen3-VL family. Intermediate token features are projected into spatial feature maps and fused using a lightweight multi-scale feature pyramid, enabling both pixel-level predictions and global reasoning within a shared representation. Each task is handled by a small task-specific prediction head, while training uses task-aware sampling and selective loss balancing to manage heterogeneous supervision and reduce task imbalance. Our method is designed to be simple to optimize and adaptable across a wide range of ultrasound analysis tasks. The performance improved from 67% to 85% on the validation set and achieved an average score of 81.84% on the official test set across all tasks. The code is publicly available at: https://github.com/saitejalekkala33/FMCUIA-ISBI.git
[108] EgoForge: Goal-Directed Egocentric World Simulator
Yifan Shen, Jiateng Liu, Xinzhuo Li, Yuanzhe Liu, Bingxuan Li, Houze Yang, Wenqi Jia, Yijiang Li, Tianjiao Yu, James Matthew Rehg, Xu Cao, Ismini Lourentzou
Main category: cs.CV
TL;DR: EgoForge: A generative world model for egocentric video simulation from minimal static inputs (single image + instruction + optional exocentric view) using reward-guided diffusion refinement.
Details
Motivation: Egocentric video simulation is challenging due to rapid viewpoint changes, hand-object interactions, and goal-directed procedures dependent on human intent. Existing approaches have limitations: hand-centric synthesis with limited scene evolution, static view translation without action dynamics, or reliance on dense supervision like camera trajectories and multi-camera capture.
Method: EgoForge uses minimal static inputs (single egocentric image, high-level instruction, optional auxiliary exocentric view). Introduces VideoDiffusionNFT, a trajectory-level reward-guided refinement that optimizes goal completion, temporal causality, scene consistency, and perceptual fidelity during diffusion sampling.
Result: Extensive experiments show EgoForge achieves consistent gains in semantic alignment, geometric stability, and motion fidelity over strong baselines. Demonstrates robust performance in real-world smart-glasses experiments.
Conclusion: EgoForge presents an effective approach for egocentric goal-directed world simulation from minimal inputs, addressing key challenges in viewpoint changes, hand-object interactions, and intent-aligned procedural generation.
Abstract: Generative world models have shown promise for simulating dynamic environments, yet egocentric video remains challenging due to rapid viewpoint changes, frequent hand-object interactions, and goal-directed procedures whose evolution depends on latent human intent. Existing approaches either focus on hand-centric instructional synthesis with limited scene evolution, perform static view translation without modeling action dynamics, or rely on dense supervision, such as camera trajectories, long video prefixes, synchronized multicamera capture, etc. In this work, we introduce EgoForge, an egocentric goal-directed world simulator that generates coherent, first-person video rollouts from minimal static inputs: a single egocentric image, a high-level instruction, and an optional auxiliary exocentric view. To improve intent alignment and temporal consistency, we propose VideoDiffusionNFT, a trajectory-level reward-guided refinement that optimizes goal completion, temporal causality, scene consistency, and perceptual fidelity during diffusion sampling. Extensive experiments show EgoForge achieves consistent gains in semantic alignment, geometric stability, and motion fidelity over strong baselines, and robust performance in real-world smart-glasses experiments.
[109] Factored Levenberg-Marquardt for Diffeomorphic Image Registration: An efficient optimizer for FireANTs
Rohit Jena, Pratik Chaudhari, James C. Gee
Main category: cs.CV
TL;DR: A memory-efficient Levenberg-Marquardt optimizer for diffeomorphic image registration that reduces memory usage by up to 24.6% while maintaining performance across multiple medical imaging datasets.
Details
Motivation: Existing Adam optimizer in FireANTs requires storing momentum and squared-momentum estimates, consuming significant memory and limiting application to large medical images. This motivates memory-efficient optimization for diffeomorphic image registration.
Method: Proposes a modified Levenberg-Marquardt optimizer with single scalar damping parameter adaptively tuned using trust region approach. Includes Metropolis-Hastings style rejection step to prevent worsening updates. Tested on brain MRI, lung CT, and cross-modal abdominal registration.
Result: Reduces memory by up to 24.6% for large volumes while retaining performance across all four datasets. Single hyperparameter configuration transfers across modalities, matching or outperforming Adam on three of four benchmarks.
Conclusion: The modified LM optimizer provides memory-efficient alternative to Adam for diffeomorphic image registration, enabling application to larger medical images while maintaining or improving registration performance.
Abstract: FireANTs introduced a novel Eulerian descent method for plug-and-play behavior with arbitrary optimizers adapted for diffeomorphic image registration as a test-time optimization problem, with a GPU-accelerated implementation. FireANTs uses Adam as its default optimizer for fast and more robust optimization. However, Adam requires storing state variables (i.e. momentum and squared-momentum estimates), each of which can consume significant memory, prohibiting its use for significantly large images. In this work, we propose a modified Levenberg-Marquardt (LM) optimizer that requires only a single scalar damping parameter as optimizer state, which is adaptively tuned using a trust region approach. The resulting optimizer reduces memory by up to 24.6% for large volumes while retaining performance across all four datasets. A single hyperparameter configuration tuned on brain MRI transfers without modification to lung CT and cross-modal abdominal registration, matching or outperforming Adam on three of four benchmarks. We also perform ablations on the effectiveness of using a Metropolis-Hastings-style rejection step to prevent updates that worsen the loss function.
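The described optimizer keeps a single scalar damping value as its only state, adapts it trust-region style, and rejects steps that worsen the loss. A minimal sketch of that loop on a toy least-squares problem (not the FireANTs implementation; the update and adaptation factors here are illustrative):

```python
import numpy as np

def lm_scalar_damped(residual_fn, jac_fn, x, lam=1.0, iters=50):
    """Levenberg-Marquardt loop whose only optimizer state is the scalar
    damping `lam`: accepted steps shrink lam (more Gauss-Newton-like),
    rejected steps are discarded and grow lam (more gradient-like)."""
    loss = 0.5 * np.sum(residual_fn(x) ** 2)
    for _ in range(iters):
        r, J = residual_fn(x), jac_fn(x)
        # damped normal equations: (J^T J + lam I) dx = -J^T r
        dx = np.linalg.solve(J.T @ J + lam * np.eye(x.size), -J.T @ r)
        new_loss = 0.5 * np.sum(residual_fn(x + dx) ** 2)
        if new_loss < loss:               # accept step, expand trust region
            x, loss, lam = x + dx, new_loss, lam * 0.5
        else:                             # reject step, shrink trust region
            lam *= 2.0
    return x, loss

# toy linear least-squares problem with exact solution x* = [1, 2]
A = np.array([[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]])
b = np.array([2.0, 6.0, 3.0])
x, loss = lm_scalar_damped(lambda v: A @ v - b, lambda v: A, np.zeros(2))
assert np.allclose(x, [1.0, 2.0], atol=1e-4)
```

The rejection branch is the Metropolis-Hastings-flavored safeguard the abstract ablates: a proposed update that increases the loss is simply not applied.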
[110] Principled Multimodal Representation Learning
Xiaohao Liu, Xiaobo Xia, See-Kiong Ng, Tat-Seng Chua
Main category: cs.CV
TL;DR: PMRL is a novel multimodal representation learning framework that achieves simultaneous alignment of multiple modalities without anchor dependency by optimizing the dominant singular value of representation matrices to create a shared leading direction.
Details
Motivation: Traditional multimodal representation learning methods rely on pairwise contrastive learning with predefined anchor modalities, which restricts alignment across all modalities. Recent approaches for simultaneous alignment face challenges including fixed anchor point limitations and instability from optimizing product of singular values.
Method: PMRL is grounded in theoretical insight that full alignment corresponds to a rank-1 Gram matrix. It optimizes the dominant singular value of representation matrices to align modalities along a shared leading direction. Uses softmax-based loss treating singular values as logits to prioritize largest singular value, plus instance-wise contrastive regularization on leading eigenvectors to maintain inter-instance separability and prevent collapse.
Result: Extensive experiments across diverse tasks demonstrate PMRL’s superiority compared to baseline methods. The framework achieves more stable and effective multimodal alignment without anchor dependency.
Conclusion: PMRL provides a principled approach to multimodal representation learning that addresses limitations of traditional anchor-based methods and recent simultaneous alignment techniques, offering more stable and effective cross-modal alignment.
Abstract: Multimodal representation learning seeks to create a unified representation space by integrating diverse data modalities to improve multimodal understanding. Traditional methods often depend on pairwise contrastive learning, which relies on a predefined anchor modality, restricting alignment across all modalities. Recent advances have investigated the simultaneous alignment of multiple modalities, yet several challenges remain, such as limitations imposed by fixed anchor points and instability arising from optimizing the product of singular values. To address the challenges, in this paper, we propose Principled Multimodal Representation Learning (PMRL), a novel framework that achieves simultaneous alignment of multiple modalities without anchor dependency in a more stable manner. Specifically, grounded in the theoretical insight that full alignment corresponds to a rank-1 Gram matrix, PMRL optimizes the dominant singular value of the representation matrix to align modalities along a shared leading direction. We propose a softmax-based loss function that treats singular values as logits to prioritize the largest singular value. Besides, instance-wise contrastive regularization on the leading eigenvectors maintains inter-instance separability and prevents representation collapse. Extensive experiments across diverse tasks demonstrate PMRL’s superiority compared to baseline methods. Source code can be found in https://github.com/Xiaohao-Liu/PMRL.
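The softmax-over-singular-values loss can be sketched in a few lines. This is an illustrative reading (treat the singular values of the stacked, normalized per-instance modality representations as logits and maximize the softmax weight of the largest one), not the paper's exact implementation:

```python
import numpy as np

def pmrl_alignment_loss(reps):
    """Negative log-softmax of the dominant singular value: smallest when
    the stacked representation matrix is rank 1, i.e. all modalities lie
    along one shared leading direction (rank-1 Gram matrix)."""
    reps = reps / np.linalg.norm(reps, axis=1, keepdims=True)  # unit rows
    s = np.linalg.svd(reps, compute_uv=False)  # singular values, descending
    z = np.exp(s - s.max())                    # stable softmax over s
    return -np.log(z[0] / z.sum())

rng = np.random.default_rng(0)
aligned = np.tile(rng.normal(size=4), (3, 1))   # 3 modalities, identical reps
scattered = rng.normal(size=(3, 4))             # 3 modalities, unrelated reps
assert pmrl_alignment_loss(aligned) < pmrl_alignment_loss(scattered)
```

Because the loss only pushes up the relative weight of the top singular value, it avoids directly optimizing a product of singular values, which the motivation flags as a source of instability.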
[111] LoFi: Location-Aware Fine-Grained Representation Learning for Chest X-ray
Myeongkyun Kang, Yanting Yang, Xiaoxiao Li
Main category: cs.CV
TL;DR: LoFi: Location-aware Fine-grained representation learning for chest X-ray retrieval and phrase grounding using joint optimization of multiple losses with a lightweight LLM.
Details
Motivation: Current contrastive models lack region-level supervision for fine-grained representation learning in chest X-rays, and large vision language models struggle with fine-grained representations in external validation, leading to suboptimal performance on retrieval and phrase grounding tasks.
Method: Proposes LoFi which jointly optimizes sigmoid, captioning, and location-aware captioning losses using a lightweight large language model. The location-aware captioning loss enables region-level supervision through grounding and dense captioning objectives. Also integrates a fine-grained encoder into retrieval-based in-context learning.
Result: Achieves superior retrieval and phrase grounding performance on MIMIC-CXR and PadChest-GR datasets through extensive experiments.
Conclusion: LoFi effectively addresses limitations in fine-grained representation learning for chest X-rays by incorporating region-level supervision through location-aware captioning, leading to improved performance on clinically relevant retrieval and grounding tasks.
Abstract: Fine-grained representation learning is crucial for retrieval and phrase grounding in chest X-rays, where clinically relevant findings are often spatially confined. However, the lack of region-level supervision in contrastive models and the limited ability of large vision language models to capture fine-grained representations in external validation lead to suboptimal performance on these tasks. To address these limitations, we propose Location-aware Fine-grained representation learning (LoFi), which jointly optimizes sigmoid, captioning, and location-aware captioning losses using a lightweight large language model. The location-aware captioning loss enables region-level supervision through grounding and dense captioning objectives, thereby facilitating fine-grained representation learning. Building upon these representations, we integrate a fine-grained encoder into retrieval-based in-context learning to enhance chest X-ray grounding across diverse settings. Extensive experiments demonstrate that our method achieves superior retrieval and phrase grounding performance on MIMIC-CXR and PadChest-GR.
[112] In-the-Wild Camouflage Attack on Vehicle Detectors through Controllable Image Editing
Xiao Fang, Yiming Gong, Stanislav Panev, Celso de Melo, Shuowen Hu, Shayok Chakraborty, Fernando De la Torre
Main category: cs.CV
TL;DR: A framework for generating vehicle camouflage attacks using conditional image editing with ControlNet, achieving strong adversarial effectiveness while maintaining visual stealthiness.
Details
Motivation: Deep neural networks are vulnerable to adversarial attacks, and camouflage attacks specifically manipulate object appearance to deceive detectors while remaining stealthy to humans. Current methods need improvement in both attack effectiveness and visual naturalness.
Method: Formulates vehicle camouflage attacks as conditional image-editing problem. Uses ControlNet fine-tuning with image-level and scene-level camouflage generation strategies. Designs unified objective enforcing vehicle structural fidelity, style consistency, and adversarial effectiveness.
Result: Achieves >38% AP50 decrease on COCO and LINZ datasets, better preserves vehicle structure, improves human-perceived stealthiness, generalizes to unseen black-box detectors, and shows promising physical world transferability.
Conclusion: The proposed framework effectively generates adversarial camouflage attacks that balance attack strength with visual stealthiness, demonstrating strong generalization capabilities across detectors and to physical world scenarios.
Abstract: Deep neural networks (DNNs) have achieved remarkable success in computer vision but remain highly vulnerable to adversarial attacks. Among them, camouflage attacks manipulate an object’s visible appearance to deceive detectors while remaining stealthy to humans. In this paper, we propose a new framework that formulates vehicle camouflage attacks as a conditional image-editing problem. Specifically, we explore both image-level and scene-level camouflage generation strategies, and fine-tune a ControlNet to synthesize camouflaged vehicles directly on real images. We design a unified objective that jointly enforces vehicle structural fidelity, style consistency, and adversarial effectiveness. Extensive experiments on the COCO and LINZ datasets show that our method achieves significantly stronger attack effectiveness, leading to more than 38% AP50 decrease, while better preserving vehicle structure and improving human-perceived stealthiness compared to existing approaches. Furthermore, our framework generalizes effectively to unseen black-box detectors and exhibits promising transferability to the physical world. Project page is available at https://humansensinglab.github.io/CtrlCamo
[113] Evaluating Test-Time Adaptation For Facial Expression Recognition Under Natural Cross-Dataset Distribution Shifts
John Turnbull, Shivam Grover, Amin Jalali, Ali Etemad
Main category: cs.CV
TL;DR: First evaluation of Test-Time Adaptation (TTA) methods for Facial Expression Recognition under natural domain shifts, showing TTA can boost performance by up to 11.34% with effectiveness depending on distributional distance and shift severity.
Details
Motivation: Deep learning models struggle with natural distribution shifts in real-world deployments. While Test-Time Adaptation (TTA) addresses this by adapting models during inference without labeled source data, there has been no evaluation of TTA methods for Facial Expression Recognition (FER) under natural domain shifts beyond synthetic corruptions.
Method: Performed cross-dataset experiments with widely used FER datasets to evaluate TTA methods under natural domain shifts. Examined real-world shifts caused by differing collection protocols, annotation standards, and demographics. Compared different TTA approaches: entropy minimization methods (TENT, SAR), prototype adjustment methods (T3A), and feature alignment methods (SHOT).
Result: TTA can boost FER performance under natural shifts by up to 11.34%. Entropy minimization methods perform best with clean target distributions, prototype adjustment methods excel under larger distributional distances, and feature alignment methods deliver largest gains with noisier target distributions than source.
Conclusion: TTA effectiveness for FER under natural domain shifts is governed by distributional distance and severity of natural shift across domains. Different TTA methods have complementary strengths depending on target distribution characteristics.
Abstract: Deep learning models often struggle under natural distribution shifts, a common challenge in real-world deployments. Test-Time Adaptation (TTA) addresses this by adapting models during inference without labeled source data. We present the first evaluation of TTA methods for FER under natural domain shifts, performing cross-dataset experiments with widely used FER datasets. This moves beyond synthetic corruptions to examine real-world shifts caused by differing collection protocols, annotation standards, and demographics. Results show TTA can boost FER performance under natural shifts by up to 11.34%. Entropy minimization methods such as TENT and SAR perform best when the target distribution is clean. In contrast, prototype adjustment methods like T3A excel under larger distributional distance scenarios. Finally, feature alignment methods such as SHOT deliver the largest gains when the target distribution is noisier than our source. Our cross-dataset analysis shows that TTA effectiveness is governed by the distributional distance and the severity of the natural shift across domains.
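Entropy minimization in the TENT sense adapts a deployed model by reducing the entropy of its own predictions on unlabeled test data, typically updating only normalization-layer affine parameters. A toy numpy sketch, with a single scalar scale standing in for such a parameter and a finite-difference gradient for brevity (TENT itself uses backpropagation):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mean_entropy(logits):
    """Average Shannon entropy of the softmax predictions."""
    p = softmax(logits)
    return -np.mean(np.sum(p * np.log(p + 1e-12), axis=-1))

def tent_step(logits, gamma, lr=0.1, eps=1e-4):
    """One entropy-minimization update of the affine scale `gamma`,
    via a central finite-difference gradient of the mean entropy."""
    g = (mean_entropy((gamma + eps) * logits) -
         mean_entropy((gamma - eps) * logits)) / (2 * eps)
    return gamma - lr * g

rng = np.random.default_rng(1)
logits = rng.normal(size=(32, 5))   # unlabeled test batch, 5 classes
gamma = 1.0
before = mean_entropy(gamma * logits)
for _ in range(20):
    gamma = tent_step(logits, gamma)
after = mean_entropy(gamma * logits)
assert after < before               # predictions become more confident
```

This illustrates the "clean target distribution" caveat in the results: entropy minimization sharpens whatever the model already predicts, which helps only when those predictions are mostly right.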
[114] ProactiveBench: Benchmarking Proactiveness in Multimodal Large Language Models
Thomas De Min, Subhankar Roy, Stéphane Lathuilière, Elisa Ricci, Massimiliano Mancini
Main category: cs.CV
TL;DR: ProactiveBench benchmark tests MLLMs’ ability to request user interventions for multimodal tasks like recognizing occluded objects, enhancing image quality, and interpreting sketches, revealing current models lack proactiveness.
Details
Motivation: To investigate whether multimodal large language models can exhibit proactive behavior by requesting simple user interventions when faced with challenging multimodal tasks, similar to how humans ask for help when encountering obstacles like occluded objects.
Method: Introduces ProactiveBench, a benchmark built from seven repurposed datasets testing proactiveness across different tasks. Evaluates 22 MLLMs on this benchmark and explores fine-tuning strategies using reinforcement learning to teach proactiveness.
Result: MLLMs generally lack proactiveness, proactiveness doesn’t correlate with model capacity, hinting yields marginal gains, conversation histories and in-context learning introduce negative biases, but reinforcement learning fine-tuning shows proactiveness can be learned and generalized.
Conclusion: Current MLLMs lack proactive behavior for requesting user interventions, but proactiveness can be learned through fine-tuning, suggesting potential for building more collaborative multimodal models.
Abstract: Effective collaboration begins with knowing when to ask for help. For example, when trying to identify an occluded object, a human would ask someone to remove the obstruction. Can MLLMs exhibit a similar “proactive” behavior by requesting simple user interventions? To investigate this, we introduce ProactiveBench, a benchmark built from seven repurposed datasets that tests proactiveness across different tasks such as recognizing occluded objects, enhancing image quality, and interpreting coarse sketches. We evaluate 22 MLLMs on ProactiveBench, showing that (i) they generally lack proactiveness; (ii) proactiveness does not correlate with model capacity; (iii) “hinting” at proactiveness yields only marginal gains. Surprisingly, we found that conversation histories and in-context learning introduce negative biases, hindering performance. Finally, we explore a simple fine-tuning strategy based on reinforcement learning: its results suggest that proactiveness can be learned, even generalizing to unseen scenarios. We publicly release ProactiveBench as a first step toward building proactive multimodal models.
[115] A Unified Platform and Quality Assurance Framework for 3D Ultrasound Reconstruction with Robotic, Optical, and Electromagnetic Tracking
Lewis Howell, Manisha Waterston, Tze Min Wah, James H. Chandler, James R. McLaughlan
Main category: cs.CV
TL;DR: A QA framework and open-source platform for evaluating 3D ultrasound reconstruction accuracy using geometric phantoms and automated analysis pipelines.
Details
Motivation: Current 3D ultrasound studies lack comprehensive evaluation of volumetric accuracy and reproducibility, especially for tracked reconstruction methods, creating a need for robust quality assurance frameworks.
Method: Developed a custom phantom with geometric inclusions, standardized pipeline for real-time segmentation and 3D reconstruction, automated registration with ground-truth geometries, and evaluated optical, electromagnetic, and robotic tracking at different scanning speeds and angles.
Result: Robotic 3D US achieved state-of-the-art reconstruction performance (DSC-3D = 0.94 ± 0.01, HD95 = 1.17 ± 0.12), approaching transducer spatial resolution limits, with real-time segmentation achieving DSC = 0.97 at 46 FPS without GPU acceleration.
Conclusion: Established a flexible experimental platform and reproducible validation methodology for 3D US reconstruction, enabling robust cross-platform comparisons and improved reporting practices for clinical translation.
Abstract: Three-dimensional (3D) Ultrasound (US) can facilitate diagnosis, treatment planning, and image-guided therapy. However, current studies rarely provide a comprehensive evaluation of volumetric accuracy and reproducibility, highlighting the need for robust Quality Assurance (QA) frameworks, particularly for tracked 3D US reconstruction using freehand or robotic acquisition. This study presents a QA framework for 3D US reconstruction and a flexible open source platform for tracked US research. A custom phantom containing geometric inclusions with varying symmetry properties enables straightforward evaluation of optical, electromagnetic, and robotic kinematic tracking for 3D US at different scanning speeds and insonation angles. A standardised pipeline performs real-time segmentation and 3D reconstruction of geometric targets (DSC = 0.97, FPS = 46) without GPU acceleration, followed by automated registration and comparison with ground-truth geometries. Applying this framework showed that our robotic 3D US achieves state-of-the-art reconstruction performance (DSC-3D = 0.94 ± 0.01, HD95 = 1.17 ± 0.12), approaching the spatial resolution limit imposed by the transducer. This work establishes a flexible experimental platform and a reproducible validation methodology for 3D US reconstruction. The proposed framework enables robust cross-platform comparisons and improved reporting practices, supporting the safe and effective clinical translation of 3D ultrasound in diagnostic and image-guided therapy applications.
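The DSC-3D figure reported above is the standard Dice similarity coefficient computed on binary volumes; a small reference implementation:

```python
import numpy as np

def dice_3d(pred, gt):
    """Dice similarity coefficient between two binary volumes:
    DSC = 2|P ∩ G| / (|P| + |G|), in [0, 1], 1 = perfect overlap."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    denom = pred.sum() + gt.sum()
    return 2.0 * inter / denom if denom else 1.0

a = np.zeros((4, 4, 4), int); a[1:3, 1:3, 1:3] = 1  # 8-voxel cube
b = np.zeros((4, 4, 4), int); b[1:3, 1:3, 1:4] = 1  # extended along z: 12 voxels
# overlap = 8 voxels -> DSC = 2*8 / (8 + 12) = 0.8
assert abs(dice_3d(a, b) - 0.8) < 1e-12
```

A DSC-3D of 0.94 thus means the reconstructed and ground-truth volumes share the overwhelming majority of their voxels, which is why the authors describe it as approaching the transducer's resolution limit.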
[116] Narrative Aligned Long Form Video Question Answering
Rahul Jain, Keval Doshi, Burak Uzkent, Garin Kessler
Main category: cs.CV
TL;DR: NA-VQA benchmark for evaluating narrative reasoning in long videos, with Video-NaRA framework using event chains and structured memory for improved long-range reasoning.
Details
Motivation: Existing multimodal LLM benchmarks focus on localized cues and fail to capture deep narrative reasoning across entire movies, lacking evaluation of long-range dependencies and causal chain reconstruction.
Method: Introduces NA-VQA benchmark with 88 full-length movies and 4.4K QA pairs with evidence spans labeled by distance (Short/Medium/Far). Proposes Video-NaRA framework that builds event-level chains and stores them in structured memory for retrieval during reasoning.
Result: State-of-the-art MLLMs perform poorly on questions requiring far-range evidence. Video-NaRA improves long-range reasoning performance by up to 3%, demonstrating effectiveness in handling complex narrative structures.
Conclusion: Current MLLMs lack narrative reasoning capabilities for long videos. Explicit narrative modeling through event chains and structured memory is needed for deep temporal reasoning across movies.
Abstract: Recent progress in multimodal large language models (MLLMs) has led to a surge of benchmarks for long-video reasoning. However, most existing benchmarks rely on localized cues and fail to capture narrative reasoning, the ability to track intentions, connect distant events, and reconstruct causal chains across an entire movie. We introduce NA-VQA, a benchmark designed to evaluate deep temporal and narrative reasoning in long-form videos. NA-VQA contains 88 full-length movies and 4.4K open-ended question-answer pairs, each grounded in multiple evidence spans labeled as Short, Medium, or Far to assess long-range dependencies. By requiring generative, multi-scene answers, NA-VQA tests whether models can integrate dispersed narrative information rather than rely on shallow pattern matching. To address the limitations of existing approaches, we propose Video-NaRA, a narrative-centric framework that builds event-level chains and stores them in a structured memory for retrieval during reasoning. Extensive experiments show that state-of-the-art MLLMs perform poorly on questions requiring far-range evidence, highlighting the need for explicit narrative modeling. Video-NaRA improves long-range reasoning performance by up to 3 percent, demonstrating its effectiveness in handling complex narrative structures. We will release NA-VQA upon publication.
[117] Instruction-Free Tuning of Large Vision Language Models for Medical Instruction Following
Myeongkyun Kang, Soopil Kim, Xiaoxiao Li, Sang Hyun Park
Main category: cs.CV
TL;DR: Medical LVLM fine-tuning without curated instructions using momentum proxy instructions and response shuffling
Details
Motivation: Medical domain faces challenges in constructing large-scale, high-quality instruction datasets due to specialized expert knowledge requirements. Need to reduce reliance on handcrafted instructions for LVLM fine-tuning in medical applications.
Method: Proposes instruction-free tuning using only image-description pairs. Introduces momentum proxy instruction to preserve instruction-following capability while promoting valid parameter updates. Incorporates response shuffling strategy to mitigate over-reliance on previous words.
Result: Achieves state-of-the-art accuracy on multiple-choice visual question answering tasks across SKINCON, WBCAtt, CBIS, and MIMIC-CXR datasets. Significantly enhances fine-tuning efficiency of LVLMs in medical domains.
Conclusion: Instruction-free tuning approach effectively addresses medical domain data scarcity, enabling LVLMs to respond to domain-specific instructions without explicit instruction data during fine-tuning.
Abstract: Large vision language models (LVLMs) have demonstrated impressive performance across a wide range of tasks. These capabilities largely stem from visual instruction tuning, which fine-tunes models on datasets consisting of curated image-instruction-output triplets. However, in the medical domain, constructing large-scale, high-quality instruction datasets is particularly challenging due to the need for specialized expert knowledge. To address this issue, we propose an instruction-free tuning approach that reduces reliance on handcrafted instructions, leveraging only image-description pairs for fine-tuning. Specifically, we introduce a momentum proxy instruction as a replacement for curated text instructions, which preserves the instruction-following capability of the pre-trained LVLM while promoting updates to parameters that remain valid during inference. Consequently, the fine-tuned LVLM can flexibly respond to domain-specific instructions, even though explicit instructions are absent during fine-tuning. Additionally, we incorporate a response shuffling strategy to mitigate the model’s over-reliance on previous words, facilitating more effective fine-tuning. Our approach achieves state-of-the-art accuracy on multiple-choice visual question answering tasks across SKINCON, WBCAtt, CBIS, and MIMIC-CXR datasets, significantly enhancing the fine-tuning efficiency of LVLMs in medical domains.
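The momentum proxy can be pictured as an exponential moving average over instruction-like embeddings. A minimal numpy sketch; the EMA rule, momentum value, and embedding shapes here are illustrative assumptions, not the paper's exact mechanism:

```python
import numpy as np

def update_proxy(proxy, feature, momentum=0.99):
    """Hypothetical EMA update for a 'momentum proxy instruction':
    the proxy embedding drifts slowly toward current features, standing
    in for a curated text instruction during fine-tuning."""
    return momentum * proxy + (1.0 - momentum) * feature

proxy = np.zeros(4)
target = np.ones(4)                      # stand-in for an instruction-like feature
for _ in range(500):
    proxy = update_proxy(proxy, target)  # proxy converges toward the target
print(np.round(proxy, 2))
```

The slow drift is the point: the proxy changes little per step, so the instruction-following behavior inherited from pre-training is not overwritten by any single batch.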
[118] VeloxNet: Efficient Spatial Gating for Lightweight Embedded Image Classification
Md Meftahul Ferdaus, Elias Ioup, Mahdi Abdelguerfi, Anton Netchaev, Steven Sloan, Ken Pathak, Kendall N. Niles
Main category: cs.CV
TL;DR: VeloxNet: lightweight CNN using gMLP blocks with spatial gating units for embedded image classification, achieving better accuracy with fewer parameters than SqueezeNet on aerial disaster datasets.
Details
Motivation: Need for lightweight deep learning models that balance accuracy with strict constraints on model size, memory, and latency for deployment on embedded devices in aerial disaster monitoring and infrastructure inspection applications.
Method: Replaces SqueezeNet’s fire modules with gated multi-layer perceptron (gMLP) blocks containing spatial gating units (SGU) that apply learned spatial projections and multiplicative gating to capture global spatial dependencies across full feature maps in single layers.
Result: Reduces parameters by 46.1% relative to SqueezeNet (from 740,970 to 399,366) while improving weighted F1 scores: 6.32% on AIDER, 30.83% on CDD, and 2.51% on LDD. Outperforms MobileNet variants, ShuffleNet, EfficientNet, and vision transformers.
Conclusion: Substituting local convolutional modules with spatial gating blocks improves both classification accuracy and parameter efficiency for resource-constrained deployment, demonstrating the effectiveness of global spatial modeling in lightweight architectures.
Abstract: Deploying deep learning models on embedded devices for tasks such as aerial disaster monitoring and infrastructure inspection requires architectures that balance accuracy with strict constraints on model size, memory, and latency. This paper introduces VeloxNet, a lightweight CNN architecture that replaces SqueezeNet’s fire modules with gated multi-layer perceptron (gMLP) blocks for embedded image classification. Each gMLP block uses a spatial gating unit (SGU) that applies learned spatial projections and multiplicative gating, enabling the network to capture spatial dependencies across the full feature map in a single layer. Unlike fire modules, which are limited to local receptive fields defined by small convolutional kernels, the SGU provides global spatial modeling at each layer with fewer parameters. We evaluate VeloxNet on three aerial image datasets: the Aerial Image Database for Emergency Response (AIDER), the Comprehensive Disaster Dataset (CDD), and the Levee Defect Dataset (LDD), comparing against eleven baselines including MobileNet variants, ShuffleNet, EfficientNet, and recent vision transformers. VeloxNet reduces the parameter count by 46.1% relative to SqueezeNet (from 740,970 to 399,366) while improving weighted F1 scores by 6.32% on AIDER, 30.83% on CDD, and 2.51% on LDD. These results demonstrate that substituting local convolutional modules with spatial gating blocks can improve both classification accuracy and parameter efficiency for resource-constrained deployment. The source code will be made publicly available upon acceptance of the paper.
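The spatial gating unit can be sketched in a few lines of numpy. The channel split, token-wise projection, and near-identity initialization follow the general gMLP recipe and are assumptions here, not VeloxNet's exact implementation:

```python
import numpy as np

def spatial_gating_unit(x, W_spatial, b_spatial):
    """Split channels in half, project one half across the spatial (token)
    dimension, and use it to gate the other half multiplicatively."""
    u, v = np.split(x, 2, axis=-1)     # (tokens, c/2) each
    v = W_spatial @ v + b_spatial      # learned projection over all tokens: global context
    return u * v                       # multiplicative gating

n_tokens, channels = 16, 8             # a 4x4 feature map flattened to 16 tokens
rng = np.random.default_rng(0)
x = rng.standard_normal((n_tokens, channels))
W = np.zeros((n_tokens, n_tokens))     # near-zero init, as gMLP recommends
b = np.ones((n_tokens, 1))             # so the gate starts close to identity
y = spatial_gating_unit(x, W, b)
print(y.shape)                         # (16, 4): half the channels survive the gate
```

Because `W_spatial` mixes all tokens at once, a single SGU layer sees the whole feature map, unlike a fire module whose receptive field is bounded by its small kernels.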
[119] Vision Tiny Recursion Model (ViTRM): Parameter-Efficient Image Classification via Recursive State Refinement
Ange-Clément Akazan, Abdoulaye Koroko, Verlon Roel Mbingui, Choukouriyah Arinloye, Hassan Fifen, Rose Bandolo
Main category: cs.CV
TL;DR: ViTRM introduces a parameter-efficient vision architecture using recursive computation with a tiny 3-layer block applied repeatedly instead of deep stacked layers, achieving competitive performance with significantly fewer parameters.
Details
Motivation: Current vision models (CNNs, ViTs) are parameter-intensive and computationally demanding, limiting deployment in resource-constrained environments. Inspired by Tiny Recursive Models (TRM) that solve complex reasoning through iterative refinement, the authors aim to create a parameter-efficient alternative to architectural depth in vision.
Method: Replaces the L-layer ViT encoder with a single tiny k-layer block (k=3) applied recursively N times. This recursive computation approach allows iterative state refinement with far fewer parameters than traditional deep architectures.
Result: ViTRM achieves competitive performance on CIFAR-10 and CIFAR-100 despite using up to 6× fewer parameters than CNN-based models and 84× fewer parameters than ViT models.
Conclusion: Recursive computation is a viable, parameter-efficient alternative to architectural depth in vision models, enabling competitive performance with dramatically reduced parameter counts.
Abstract: The success of deep learning in computer vision has been driven by models of increasing scale, from deep Convolutional Neural Networks (CNN) to large Vision Transformers (ViT). While effective, these architectures are parameter-intensive and demand significant computational resources, limiting deployment in resource-constrained environments. Inspired by Tiny Recursive Models (TRM), which show that small recursive networks can solve complex reasoning tasks through iterative state refinement, we introduce the \textbf{Vision Tiny Recursion Model (ViTRM)}: a parameter-efficient architecture that replaces the $L$-layer ViT encoder with a single tiny $k$-layer block ($k{=}3$) applied recursively $N$ times. Despite using up to $6\times$ and $84\times$ fewer parameters than CNN-based models and ViTs, respectively, ViTRM maintains competitive performance on CIFAR-10 and CIFAR-100. This demonstrates that recursive computation is a viable, parameter-efficient alternative to architectural depth in vision.
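Weight sharing across recursions is the key idea: a k-layer block holds roughly k·d² parameters no matter how many times it is applied, where an L-layer encoder would hold L·d². A toy numpy sketch; the residual update and tanh layers are stand-ins, not the paper's actual block:

```python
import numpy as np

def tiny_block(x, weights):
    """A stand-in for the shared k=3-layer block."""
    for W in weights:
        x = np.tanh(x @ W)
    return x

def vitrm_forward(x, weights, n_recursions):
    """Apply the single shared block N times: the parameter count is fixed
    while effective depth grows with N (iterative state refinement)."""
    for _ in range(n_recursions):
        x = x + tiny_block(x, weights)   # residual keeps repeated application stable
    return x

rng = np.random.default_rng(0)
d = 8
shared = [0.1 * rng.standard_normal((d, d)) for _ in range(3)]  # k = 3 shared layers
tokens = rng.standard_normal((4, d))                            # 4 patch tokens
out = vitrm_forward(tokens, shared, n_recursions=6)
print(out.shape)   # (4, 8): same shape, refined state
```

Here 6 recursions give an effective depth of 18 layers while storing only 3 weight matrices.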
[120] FedAgain: A Trust-Based and Robust Federated Learning Strategy for an Automated Kidney Stone Identification in Ureteroscopy
Ivan Reyes-Amezcua, Francisco Lopez-Tiro, Clément Larose, Christian Daul, Andres Mendez-Vazquez, Gilberto Ochoa-Ruiz
Main category: cs.CV
TL;DR: FedAgain is a trust-based federated learning strategy for robust kidney stone identification from endoscopic images using dual trust mechanism to weight client contributions and handle heterogeneous medical data.
Details
Motivation: AI reliability in medical imaging depends on robustness to heterogeneous/corrupted images from diverse devices across hospitals. Need privacy-preserving collaborative training while handling noisy data and adversarial updates.
Method: FedAgain integrates dual trust mechanism combining benchmark reliability and model divergence to dynamically weight client contributions during aggregation, mitigating impact of noisy/adversarial updates while preserving data privacy.
Result: Outperforms standard federated learning baselines under non-IID data and corrupted-client scenarios across five datasets (MNIST, CIFAR-10, two private kidney stone datasets, MyStone), maintaining diagnostic accuracy and stability.
Conclusion: FedAgain represents practical advance toward reliable, privacy-preserving, clinically deployable federated AI for medical imaging by enhancing robustness and generalization.
Abstract: The reliability of artificial intelligence (AI) in medical imaging critically depends on its robustness to heterogeneous and corrupted images acquired with diverse devices across different hospitals, which is highly challenging. Therefore, this paper introduces FedAgain, a trust-based Federated Learning (FL) strategy designed to enhance robustness and generalization for automated kidney stone identification from endoscopic images. FedAgain integrates a dual trust mechanism that combines benchmark reliability and model divergence to dynamically weight client contributions, mitigating the impact of noisy or adversarial updates during aggregation. The framework enables the training of collaborative models across multiple institutions while preserving data privacy and promoting stable convergence under real-world conditions. Extensive experiments across five datasets, including two canonical benchmarks (MNIST and CIFAR-10), two private multi-institutional kidney stone datasets, and one public dataset (MyStone), demonstrate that FedAgain consistently outperforms standard FL baselines under non-identically and independently distributed (non-IID) data and corrupted-client scenarios. By maintaining diagnostic accuracy and performance stability under varying conditions, FedAgain represents a practical advance toward reliable, privacy-preserving, and clinically deployable federated AI for medical imaging.
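A trust-weighted aggregation of this flavor can be sketched as follows. The specific combination rule (benchmark score times inverse divergence) is an assumption chosen for illustration, not FedAgain's published formula:

```python
import numpy as np

def trust_weighted_aggregate(client_updates, benchmark_scores, global_model):
    """Weight each client's update by its benchmark reliability and by how
    little it diverges from the current global model, then average."""
    divergence = np.array([np.linalg.norm(u - global_model) for u in client_updates])
    div_trust = 1.0 / (1.0 + divergence)              # smaller divergence -> more trust
    trust = np.asarray(benchmark_scores) * div_trust  # dual trust signal
    weights = trust / trust.sum()
    return sum(w * u for w, u in zip(weights, client_updates))

global_model = np.zeros(4)
honest_a = np.full(4, 0.1)
honest_b = np.full(4, 0.1)
corrupted = np.full(4, 10.0)                          # adversarial or noisy client
agg = trust_weighted_aggregate([honest_a, honest_b, corrupted],
                               [0.9, 0.9, 0.2], global_model)
print(np.round(agg, 3))                               # stays close to the honest updates
```

A plain average would be dragged toward the corrupted update; the dual trust weighting suppresses it on both signals at once (poor benchmark score and large divergence).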
[121] A Multi-Agent Perception-Action Alliance for Efficient Long Video Reasoning
Yichang Xu, Gaowen Liu, Ramana Rao Kompella, Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, Zachary Yahn, Ling Liu
Main category: cs.CV
TL;DR: A4VL is a multi-agent system for efficient long-video reasoning that uses perception-action exploration loops with VLM agents to extract query-specific clues and collaboratively reason through cross-reviews.
Details
Motivation: Long-video reasoning is challenging due to computational complexity and information overload. Existing methods struggle with efficiency and accuracy when processing lengthy videos, requiring a system that can effectively scale while maintaining reasoning quality.
Method: Multi-agent perception-action exploration alliance with VLM agents operating in rounds. Each round includes: 1) perception exploration - agents extract query-specific clues from sampled frames and align them to relevant video blocks, 2) action exploration - agents produce answers with rationales, cross-review each other, and decide whether to continue with pruning/re-staging or conclude.
Result: Outperforms 18 existing VLMs and 11 recent long-video reasoning methods on five VideoQA benchmarks while achieving significantly lower inference latency.
Conclusion: A4VL effectively scales to real-world long videos while preserving high-quality reasoning through multi-agent collaboration and event-driven partitioning, demonstrating superior performance and efficiency.
Abstract: This paper presents a multi-agent perception-action exploration alliance, dubbed A4VL, for efficient long-video reasoning. A4VL operates in a multi-round perception-action exploration loop with a selection of VLM agents. In each round, the team of agents performs video question-answer (VideoQA) via perception exploration followed by action exploration. During perception exploration, each agent learns to extract query-specific perception clue(s) from a few sampled frames and performs clue-based alignment to find the video block(s) that are most relevant to the query-specific event. During action exploration, A4VL performs video reasoning in three steps: (1) each agent produces its initial answer with a rationale, (2) all agents collaboratively score one another through cross-reviews and relevance ranking, and (3) based on whether a satisfactory consensus is reached, the decision is made either to start a new round of perception-action deliberation by pruning (e.g., filtering out the lowest-performing agent) and re-staging (e.g., new-clue and matching-block based perception-action exploration), or to conclude by producing its final answer. The integration of the multi-agent alliance through multi-round perception-action exploration, coupled with event-driven partitioning and cue-guided block alignment, enables A4VL to effectively scale to real-world long videos while preserving high-quality video reasoning. Evaluation results on five popular VideoQA benchmarks show that A4VL outperforms 18 existing representative VLMs and 11 recent methods optimized for long-video reasoning, while achieving significantly lower inference latency. Our code is released at https://github.com/git-disl/A4VL.
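The prune-or-conclude decision in the action-exploration step can be reduced to a simple consensus check. This is a deliberately loose sketch: the scoring scale, threshold, and pruning rule are placeholders, not A4VL's actual cross-review protocol:

```python
def action_exploration(answers, scores, consensus_threshold=0.8):
    """One action-exploration step: conclude with the top answer if the
    cross-review scores reach consensus; otherwise prune the weakest agent
    and signal that another perception-action round is needed."""
    best = max(range(len(answers)), key=lambda i: scores[i])
    if scores[best] >= consensus_threshold:
        return "conclude", answers[best]
    weakest = min(range(len(answers)), key=lambda i: scores[i])
    survivors = [a for i, a in enumerate(answers) if i != weakest]
    return "continue", survivors

state, result = action_exploration(["A", "B", "C"], [0.9, 0.5, 0.7])
print(state, result)   # conclude A
```

With no clear winner (all scores below the threshold) the weakest agent is dropped and the surviving agents re-enter perception exploration with new clues.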
[122] Gastric-X: A Multimodal Multi-Phase Benchmark Dataset for Advancing Vision-Language Models in Gastric Cancer Analysis
Sheng Lu, Hao Chen, Rui Yin, Juyan Ba, Yu Zhang, Yuanzhe Li
Main category: cs.CV
TL;DR: Gastric-X is a large-scale multimodal benchmark for gastric cancer analysis with 1.7K cases including CT scans, endoscopic images, biochemical indicators, diagnostic notes, and tumor annotations, used to evaluate VLMs on clinical tasks.
Details
Motivation: Current vision-language models have strong generalization but limited application to medical diagnosis due to lack of comprehensive, structured datasets that capture real clinical workflows, particularly for gastric cancer analysis.
Method: Created Gastric-X benchmark with 1.7K cases containing paired resting/dynamic CT scans, endoscopic images, structured biochemical indicators, expert diagnostic notes, and tumor bounding boxes. Evaluated VLMs on five core tasks: VQA, report generation, cross-modal retrieval, disease classification, and lesion localization.
Result: The benchmark enables systematic examination of VLM capabilities on clinical tasks and probes whether current VLMs can meaningfully correlate biochemical signals with spatial tumor features and textual reports.
Conclusion: Gastric-X represents a step toward aligning machine intelligence with physician cognitive reasoning and serves as a resource to inspire development of next-generation medical VLMs.
Abstract: Recent vision-language models (VLMs) have shown strong generalization and multimodal reasoning abilities in natural domains. However, their application to medical diagnosis remains limited by the lack of comprehensive and structured datasets that capture real clinical workflows. To advance the development of VLMs for clinical applications, particularly in gastric cancer, we introduce Gastric-X, a large-scale multimodal benchmark for gastric cancer analysis providing 1.7K cases. Each case in Gastric-X includes paired resting and dynamic CT scans, an endoscopic image, a set of structured biochemical indicators, expert-authored diagnostic notes, and bounding box annotations of tumor regions, reflecting realistic clinical conditions. We systematically examine the capability of recent VLMs on five core tasks: Visual Question Answering (VQA), report generation, cross-modal retrieval, disease classification, and lesion localization. These tasks simulate critical stages of clinical workflow, from visual understanding and reasoning to multimodal decision support. Through this evaluation, we aim not only to assess model performance but also to probe the nature of VLM understanding: Can current VLMs meaningfully correlate biochemical signals with spatial tumor features and textual reports? We envision Gastric-X as a step toward aligning machine intelligence with the cognitive and evidential reasoning processes of physicians, and as a resource to inspire the development of next-generation medical VLMs.
[123] ReXInTheWild: A Unified Benchmark for Medical Photograph Understanding
Oishi Banerjee, Sung Eun Kim, Alexandra N. Willauer, Julius M. Kernbach, Abeer Rihan Alomaish, Reema Abdulwahab S. Alghamdi, Hassan Rayhan Alomaish, Mohammed Baharoon, Xiaoman Zhang, Julian Nicolas Acosta, Christine Zhou, Pranav Rajpurkar
Main category: cs.CV
TL;DR: ReXInTheWild is a benchmark of 955 clinician-verified medical questions across 484 everyday photographs, evaluating vision-language models’ ability to combine fine-grained image understanding with medical reasoning.
Details
Motivation: Everyday medical photographs are widely used in telemedicine, but there's no comprehensive benchmark to evaluate whether vision-language models can properly interpret their medical content, which requires both natural image understanding and domain-specific medical reasoning.
Method: Created ReXInTheWild benchmark with 955 clinician-verified multiple-choice questions spanning 7 clinical topics across 484 photographs sourced from biomedical literature, then evaluated leading multimodal LLMs and medical specialist models.
Result: Gemini-3 achieved 78% accuracy, Claude Opus 4.5 72%, GPT-5 68%, while medical specialist MedGemma only reached 37%. Error analysis revealed four categories of common errors requiring different mitigation strategies.
Conclusion: ReXInTheWild provides a challenging, clinically grounded benchmark at the intersection of natural image understanding and medical reasoning, revealing substantial performance variation among current vision-language models.
Abstract: Everyday photographs taken with ordinary cameras are already widely used in telemedicine and other online health conversations, yet no comprehensive benchmark evaluates whether vision-language models can interpret their medical content. Analyzing these images requires both fine-grained natural image understanding and domain-specific medical reasoning, a combination that challenges both general-purpose and specialized models. We introduce ReXInTheWild, a benchmark of 955 clinician-verified multiple-choice questions spanning seven clinical topics across 484 photographs sourced from the biomedical literature. When evaluated on ReXInTheWild, leading multimodal large language models show substantial performance variation: Gemini-3 achieves 78% accuracy, followed by Claude Opus 4.5 (72%) and GPT-5 (68%), while the medical specialist model MedGemma reaches only 37%. A systematic error analysis also reveals four categories of common errors, ranging from low-level geometric errors to high-level reasoning failures and requiring different mitigation strategies. ReXInTheWild provides a challenging, clinically grounded benchmark at the intersection of natural image understanding and medical reasoning. The dataset is available on HuggingFace.
[124] Recognising BSL Fingerspelling in Continuous Signing Sequences
Alyssa Chan, Taein Kwon, Andrew Zisserman
Main category: cs.CV
TL;DR: New large-scale BSL fingerspelling dataset FS23K with iterative annotation framework and recognition model using bi-manual interactions and mouthing cues, achieving 50% CER reduction.
Details
Motivation: Fingerspelling is crucial in BSL for proper names and technical terms, but recognition is challenging due to rapid signing and letter omissions. Existing datasets are small or inaccurate, limiting research progress.
Method: Created FS23K dataset using iterative annotation framework. Proposed recognition model that explicitly accounts for bi-manual interactions and mouthing cues to improve accuracy.
Result: Halved character error rate (CER) compared to prior state-of-the-art on fingerspelling recognition through refined annotations and improved modeling.
Conclusion: The approach demonstrates effectiveness for sign language understanding and enables scalable automated annotation pipelines for future research.
Abstract: Fingerspelling is a critical component of British Sign Language (BSL), used to spell proper names, technical terms, and words that lack established lexical signs. Fingerspelling recognition is challenging due to the rapid pace of signing and common letter omissions by native signers, while existing BSL fingerspelling datasets are either small in scale or temporally and letter-wise inaccurate. In this work, we introduce a new large-scale BSL fingerspelling dataset, FS23K, constructed using an iterative annotation framework. In addition, we propose a fingerspelling recognition model that explicitly accounts for bi-manual interactions and mouthing cues. As a result, with refined annotations, our approach halves the character error rate (CER) compared to the prior state of the art on fingerspelling recognition. These findings demonstrate the effectiveness of our method and highlight its potential to support future research in sign language understanding and scalable, automated annotation pipelines. The project page can be found at https://taeinkwon.com/projects/fs23k/.
[125] SurfaceXR: Fusing Smartwatch IMUs and Egocentric Hand Pose for Seamless Surface Interactions
Vasco Xu, Brian Chen, Eric J. Gonzalez, Andrea Colaço, Henry Hoffmann, Mar Gonzalez-Franco, Karan Ahuja
Main category: cs.CV
TL;DR: SurfaceXR combines headset-based hand tracking with smartwatch IMU data to enable robust surface-based interactions in XR, addressing fatigue and imprecision issues of mid-air gestures.
Details
Motivation: Mid-air gestures in XR cause fatigue and imprecision, while surface-based interactions offer better accuracy and comfort. Current egocentric vision methods struggle with hand tracking challenges and unreliable surface plane estimation.
Method: Sensor fusion approach combining headset-based hand tracking (providing 3D positional data) with smartwatch IMU data (capturing high-frequency motion) to enable robust inputs on everyday surfaces.
Result: 21-participant study validates SurfaceXR’s effectiveness for touch tracking and 8-class gesture recognition, demonstrating significant improvements over single-modality approaches.
Conclusion: SurfaceXR’s multimodal sensor fusion approach successfully addresses limitations of current XR interaction methods by combining complementary modalities for robust surface-based interactions.
Abstract: Mid-air gestures in Extended Reality (XR) often cause fatigue and imprecision. Surface-based interactions offer improved accuracy and comfort, but current egocentric vision methods struggle due to hand tracking challenges and unreliable surface plane estimation. We introduce SurfaceXR, a sensor fusion approach combining headset-based hand tracking with smartwatch IMU data to enable robust inputs on everyday surfaces. Our insight is that these modalities are complementary: hand tracking provides 3D positional data while IMUs capture high-frequency motion. A 21-participant study validates SurfaceXR’s effectiveness for touch tracking and 8-class gesture recognition, demonstrating significant improvements over single-modality approaches.
[126] dinov3.seg: Open-Vocabulary Semantic Segmentation with DINOv3
Saikat Dutta, Biplab Banerjee, Hamid Rezatofighi
Main category: cs.CV
TL;DR: dinov3.seg extends dinov3.txt into a dedicated framework for open-vocabulary semantic segmentation, combining early visual refinement, joint text embeddings, and high-resolution inference to improve spatial precision in cluttered scenes.
Details
Motivation: Current vision-language models have suboptimal representations for dense prediction tasks like semantic segmentation, and existing OVSS methods rely on limited adaptation of image-text similarity maps, restricting spatial precision and robustness in complex scenes.
Method: 1) Task-specific architecture adapting design principles from prior OVSS work; 2) Joint text embeddings aligned with both global [CLS] token and local patch-level features; 3) Early refinement of visual representations before image-text interaction plus late refinement of correlation features; 4) High-resolution sliding-window inference preserving spatial detail and global context.
Result: Extensive experiments on five OVSS benchmarks demonstrate effectiveness and robustness, consistently outperforming current state-of-the-art methods.
Conclusion: dinov3.seg provides a comprehensive framework for open-vocabulary semantic segmentation that addresses limitations of current VLMs for dense prediction, achieving superior performance through systematic architectural design and refinement strategies.
Abstract: Open-Vocabulary Semantic Segmentation (OVSS) assigns pixel-level labels from an open set of text-defined categories, demanding reliable generalization to unseen classes at inference. Although modern vision-language models (VLMs) support strong open-vocabulary recognition, their representations learned through global contrastive objectives remain suboptimal for dense prediction, prompting many OVSS methods to depend on limited adaptation or refinement of image-text similarity maps. This, in turn, restricts spatial precision and robustness in complex, cluttered scenes. We introduce dinov3.seg, extending dinov3.txt into a dedicated framework for OVSS. Our contributions are four-fold. First, we design a task-specific architecture tailored to this backbone, systematically adapting established design principles from prior open-vocabulary segmentation work. Second, we jointly leverage text embeddings aligned with both the global [CLS] token and local patch-level visual features from the ViT-based encoder, effectively combining semantic discrimination with fine-grained spatial locality. Third, unlike prior approaches that rely primarily on post hoc similarity refinement, we perform early refinement of visual representations prior to image-text interaction, followed by late refinement of the resulting image-text correlation features, enabling more accurate and robust dense predictions in cluttered scenes. Finally, we propose a high-resolution local-global inference strategy based on sliding-window aggregation, which preserves spatial detail while maintaining global context. We conduct extensive experiments on five widely adopted OVSS benchmarks to evaluate our approach. The results demonstrate its effectiveness and robustness, consistently outperforming current state-of-the-art methods.
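Sliding-window aggregation for dense prediction generally looks like the sketch below: accumulate per-window predictions and a coverage count, then normalize the overlaps. Window size, stride, and the plain averaging rule are generic assumptions, not dinov3.seg's exact settings:

```python
import numpy as np

def sliding_window_infer(h, w, win, stride, predict_window):
    """Run a per-window predictor over the image and average overlapping
    logits so seams between neighboring windows are smoothed out."""
    logits = np.zeros((h, w))
    counts = np.zeros((h, w))
    for y in range(0, h - win + 1, stride):
        for x in range(0, w - win + 1, stride):
            logits[y:y + win, x:x + win] += predict_window(y, x, win)
            counts[y:y + win, x:x + win] += 1
    return logits / np.maximum(counts, 1)   # average where windows overlap

# Sanity check: a constant predictor should reconstruct a constant map
# regardless of how the windows overlap.
out = sliding_window_infer(8, 8, win=4, stride=2,
                           predict_window=lambda y, x, w: np.ones((w, w)))
print(out.min(), out.max())   # 1.0 1.0
```

Each window is processed at full resolution (preserving spatial detail), while the global context comes from whatever the per-window predictor conditions on; the averaging step only reconciles the overlapping predictions.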
[127] Pedestrian Crossing Intent Prediction via Psychological Features and Transformer Fusion
Sima Ashayer, Hoang H. Nguyen, Yu Liang, Mina Sartipi
Main category: cs.CV
TL;DR: Lightweight socially-informed architecture for pedestrian intention prediction using behavioral streams and uncertainty quantification for risk-aware autonomous driving.
Details
Motivation: Accurate pedestrian intention prediction is crucial for safe autonomous vehicle navigation in urban environments, requiring models that are both efficient and capable of quantifying uncertainty for risk assessment.
Method: Fuses four behavioral streams (attention, position, situation, interaction) using highway encoders, compact 4-token Transformer, and global self-attention pooling. Incorporates two uncertainty heads: variational bottleneck for epistemic uncertainty and Mahalanobis distance detector for distributional shift.
Result: Outperforms recent vision language models on PSI 1.0 benchmark (0.9 F1, 0.94 AUC-ROC, 0.78 MCC). Establishes strong baseline on PSI 2.0 (0.78 F1, 0.79 AUC-ROC). Selective prediction improves accuracy by 0.4 percentage points at 80% coverage.
Conclusion: The approach provides calibrated probabilities and actionable risk scores while maintaining efficiency. It’s modality-agnostic, easy to integrate with vision language pipelines, and suitable for resource-constrained platforms.
Abstract: Pedestrian intention prediction needs to be accurate for autonomous vehicles to navigate safely in urban environments. We present a lightweight, socially informed architecture for pedestrian intention prediction. It fuses four behavioral streams (attention, position, situation, and interaction) using highway encoders, a compact 4-token Transformer, and global self-attention pooling. To quantify uncertainty, we incorporate two complementary heads: a variational bottleneck whose KL divergence captures epistemic uncertainty, and a Mahalanobis distance detector that identifies distributional shift. Together, these components yield calibrated probabilities and actionable risk scores without compromising efficiency. On the PSI 1.0 benchmark, our model outperforms recent vision language models, achieving 0.9 F1, 0.94 AUC-ROC, and 0.78 MCC using only structured, interpretable features. On the more diverse PSI 2.0 dataset, where, to the best of our knowledge, no prior results exist, we establish a strong initial baseline of 0.78 F1 and 0.79 AUC-ROC. Selective prediction based on Mahalanobis scores increases test accuracy by up to 0.4 percentage points at 80% coverage. Qualitative attention heatmaps further show how the model shifts its cross-stream focus under ambiguity. The proposed approach is modality-agnostic, easy to integrate with vision language pipelines, and suitable for risk-aware intent prediction on resource-constrained platforms.
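A Mahalanobis shift detector of the kind described above is simple to sketch: fit a mean and covariance to training features, then score test features by distance. The feature dimensionality and the idea of thresholding the score for abstention are illustrative choices, not the paper's configuration:

```python
import numpy as np

def mahalanobis_score(x, mean, cov_inv):
    """Distance of a feature vector from the training distribution;
    large scores signal distributional shift and can trigger abstention
    (selective prediction)."""
    d = x - mean
    return float(np.sqrt(d @ cov_inv @ d))

rng = np.random.default_rng(0)
train_feats = rng.standard_normal((500, 3))          # in-distribution features
mean = train_feats.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(train_feats, rowvar=False))

in_dist = mahalanobis_score(mean + 0.1, mean, cov_inv)
shifted = mahalanobis_score(mean + 10.0, mean, cov_inv)
print(in_dist < shifted)   # True: the shifted sample scores as riskier
```

Selective prediction then keeps only the lowest-scoring fraction of test cases (e.g. 80% coverage), which is how the abstract's accuracy gain at reduced coverage arises.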
[128] MoCA3D: Monocular 3D Bounding Box Prediction in the Image Plane
Changwoo Jeon, Rishi Upadhyay, Achuta Kadambi
Main category: cs.CV
TL;DR: MoCA3D is a monocular, class-agnostic 3D model that predicts projected 3D bounding box corners and per-corner depths without requiring camera intrinsics at inference time.
Details
Motivation: Existing monocular 3D object understanding methods require known camera intrinsics to obtain image-plane geometry (projected 3D box corners), which limits their applicability in real-world scenarios where intrinsics are unknown.
Method: Formulates pixel-space localization and depth assignment as dense prediction via corner heatmaps and depth maps, enabling inference without camera intrinsics.
Result: Achieves state-of-the-art performance with 22.8% improvement in image-plane corner PAG while remaining comparable on 3D IoU, using up to 57 times fewer trainable parameters.
Conclusion: MoCA3D enables practical downstream applications that were previously impractical under unknown intrinsics, demonstrating utility beyond standard baseline models.
Abstract: Monocular 3D object understanding has largely been cast as a 2D RoI-to-3D box lifting problem. However, emerging downstream applications require image-plane geometry (e.g., projected 3D box corners) which cannot be easily obtained without known intrinsics, a problem for object detection in the wild. We introduce MoCA3D, a Monocular, Class-Agnostic 3D model that predicts projected 3D bounding box corners and per-corner depths without requiring camera intrinsics at inference time. MoCA3D formulates pixel-space localization and depth assignment as dense prediction via corner heatmaps and depth maps. To evaluate image-plane geometric fidelity, we propose Pixel-Aligned Geometry (PAG), which directly measures image-plane corner and depth consistency. Extensive experiments demonstrate that MoCA3D achieves state-of-the-art performance, improving image-plane corner PAG by 22.8% while remaining comparable on 3D IoU, using up to 57 times fewer trainable parameters. Finally, we apply MoCA3D to downstream tasks which were previously impractical under unknown intrinsics, highlighting its utility beyond standard baseline models.
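Dense corner heatmaps and depth maps decode into box geometry roughly as follows. Argmax decoding is a common heatmap convention assumed here for illustration, not necessarily the paper's decoder:

```python
import numpy as np

def decode_corners(heatmaps, depth_maps):
    """For each projected 3D box corner: take the heatmap peak as its
    pixel location, then read the per-corner depth at that pixel. No
    camera intrinsics are needed at any point."""
    corners, depths = [], []
    for hm, dm in zip(heatmaps, depth_maps):
        y, x = np.unravel_index(np.argmax(hm), hm.shape)
        corners.append((int(x), int(y)))
        depths.append(float(dm[y, x]))
    return corners, depths

# One synthetic corner: a heatmap peak at pixel (x=5, y=3), depth map 7.5.
hm = np.zeros((16, 16)); hm[3, 5] = 1.0
dm = np.full((16, 16), 7.5)
corners, depths = decode_corners([hm], [dm])
print(corners, depths)   # [(5, 3)] [7.5]
```

In a full model there would be eight heatmap/depth-map pairs, one per box corner, which is what allows the image-plane box to be recovered without lifting through known intrinsics.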
[129] SeeClear: Reliable Transparent Object Depth Estimation via Generative Opacification
Xiaoying Wang, Yumeng He, Jingkai Shi, Jiayin Lu, Yin Yang, Ying Jiang, Chenfanfu Jiang
Main category: cs.CV
TL;DR: SeeClear converts transparent objects into generative opaque images using diffusion-based opacification to enable stable monocular depth estimation for transparent objects without retraining depth networks.
Details
Motivation: Monocular depth estimation struggles with transparent objects due to refraction and transmission effects that break appearance assumptions used by depth networks, leading to unstable or incorrect predictions.
Method: First localizes transparent regions, then transforms their refractive appearance into geometrically consistent opaque shapes using a diffusion-based generative opacification module. Uses an off-the-shelf monocular depth estimator without retraining. Trained on SeeClear-396k synthetic dataset of paired transparent-opaque renderings.
Result: Experiments on both synthetic and real-world datasets show SeeClear significantly improves depth estimation for transparent objects compared to state-of-the-art methods.
Conclusion: SeeClear provides an effective framework for handling transparent objects in monocular depth estimation by converting them to opaque representations, enabling stable depth prediction without modifying existing depth estimation architectures.
Abstract: Monocular depth estimation remains challenging for transparent objects, where refraction and transmission are difficult to model and break the appearance assumptions used by depth networks. As a result, state-of-the-art estimators often produce unstable or incorrect depth predictions for transparent materials. We propose SeeClear, a novel framework that converts transparent objects into generative opaque images, enabling stable monocular depth estimation for transparent objects. Given an input image, we first localize transparent regions and transform their refractive appearance into geometrically consistent opaque shapes using a diffusion-based generative opacification module. The processed image is then fed into an off-the-shelf monocular depth estimator without retraining or architectural changes. To train the opacification model, we construct SeeClear-396k, a synthetic dataset containing 396k paired transparent-opaque renderings. Experiments on both synthetic and real-world datasets show that SeeClear significantly improves depth estimation for transparent objects. Project page: https://heyumeng.com/SeeClear-web/
[130] StreetForward: Perceiving Dynamic Street with Feedforward Causal Attention
Zhongrui Yu, Zhao Wang, Yijia Xie, Yida Wang, Xueyang Zhang, Yifei Zhan, Kun Zhan
Main category: cs.CV
TL;DR: StreetForward: A pose-free, tracker-free feedforward framework for dynamic street reconstruction using 3D Gaussian Splatting with temporal mask attention for motion-aware representations.
Details
Motivation: Need for rapid scene reconstruction in autonomous driving applications to efficiently utilize large-scale driving datasets for simulation and downstream tasks, avoiding time-consuming per-scene optimization.
Method: Pose-free and tracker-free feedforward framework with a temporal mask attention module based on VGGT’s alternating attention mechanism. Uses 3D Gaussian Splatting to uniformly represent static content and dynamic instances, optimized through cross-frame rendering with spatio-temporal consistency.
Result: Superior performance on novel view synthesis and depth estimation on Waymo Open Dataset. Zero-shot inference on CARLA and other datasets demonstrates generalization capability.
Conclusion: StreetForward enables efficient dynamic street reconstruction without pose estimation or tracking, producing high-fidelity novel views at new poses and times with motion-aware representations.
Abstract: Feedforward reconstruction is crucial for autonomous driving applications, where rapid scene reconstruction enables efficient utilization of large-scale driving datasets in closed-loop simulation and other downstream tasks, eliminating the need for time-consuming per-scene optimization. We present StreetForward, a pose-free and tracker-free feedforward framework for dynamic street reconstruction. Building upon the alternating attention mechanism from Visual Geometry Grounded Transformer (VGGT), we propose a simple yet effective temporal mask attention module that captures dynamic motion information from image sequences and produces motion-aware latent representations. Static content and dynamic instances are represented uniformly with 3D Gaussian Splatting, and are optimized jointly by cross-frame rendering with spatio-temporal consistency, allowing the model to infer per-pixel velocities and produce high-fidelity novel views at new poses and times. We train and evaluate our model on the Waymo Open Dataset, demonstrating superior performance on novel view synthesis and depth estimation compared to existing methods. Furthermore, zero-shot inference on CARLA and other datasets validates the generalization capability of our approach. More visualizations are available on our project page: https://streetforward.github.io.
[131] Dual-Domain Representation Alignment: Bridging 2D and 3D Vision via Geometry-Aware Architecture Search
Haoyu Zhang, Zhihao Yu, Rui Wang, Yaochu Jin, Qiqi Liu, Ran Cheng
Main category: cs.CV
TL;DR: EvoNAS: Efficient distributed evolutionary neural architecture search framework combining VSS-ViT hybrid supernet with cross-architecture knowledge distillation for Pareto-optimal vision models.
Details
Motivation: Large vision models have high inference costs, limiting deployment on edge devices. Evolutionary NAS is well suited to multi-objective optimization but suffers from expensive candidate evaluation and ranking inconsistency among subnetworks.
Method: Proposes EvoNAS with a hybrid VSS-ViT supernet optimized via Cross-Architecture Dual-Domain Knowledge Distillation (CA-DDKD). Uses a Distributed Multi-Model Parallel Evaluation (DMMPE) framework with GPU resource pooling and asynchronous scheduling for efficient evaluation.
Result: EvoNets achieve Pareto-optimal trade-offs between accuracy and efficiency on COCO, ADE20K, KITTI, and NYU-Depth v2. DMMPE improves evaluation efficiency by over 70% compared to conventional data-parallel evaluation.
Conclusion: EvoNAS enables efficient evolutionary architecture search for vision models, delivering lower inference latency and higher throughput under computational constraints while maintaining strong generalization on downstream tasks.
Abstract: Modern computer vision requires balancing predictive accuracy with real-time efficiency, yet the high inference cost of large vision models (LVMs) limits deployment on resource-constrained edge devices. Although Evolutionary Neural Architecture Search (ENAS) is well suited for multi-objective optimization, its practical use is hindered by two issues: expensive candidate evaluation and ranking inconsistency among subnetworks. To address them, we propose EvoNAS, an efficient distributed framework for multi-objective evolutionary architecture search. We build a hybrid supernet that integrates Vision State Space and Vision Transformer (VSS-ViT) modules, and optimize it with a Cross-Architecture Dual-Domain Knowledge Distillation (CA-DDKD) strategy. By coupling the computational efficiency of VSS blocks with the semantic expressiveness of ViT modules, CA-DDKD improves the representational capacity of the shared supernet and enhances ranking consistency, enabling reliable fitness estimation during evolution without extra fine-tuning. To reduce the cost of large-scale validation, we further introduce a Distributed Multi-Model Parallel Evaluation (DMMPE) framework based on GPU resource pooling and asynchronous scheduling. Compared with conventional data-parallel evaluation, DMMPE improves efficiency by over 70% through concurrent multi-GPU, multi-model execution. Experiments on COCO, ADE20K, KITTI, and NYU-Depth v2 show that the searched architectures, termed EvoNets, consistently achieve Pareto-optimal trade-offs between accuracy and efficiency. Compared with representative CNN-, ViT-, and Mamba-based models, EvoNets deliver lower inference latency and higher throughput under strict computational budgets while maintaining strong generalization on downstream tasks such as novel view synthesis. Code is available at https://github.com/EMI-Group/evonas
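The Pareto-optimal trade-offs EvoNAS searches for can be illustrated with a minimal non-dominated filter; the candidate objective values below are made up, and this is a generic sketch rather than the paper's evolutionary search:

```python
import numpy as np

def pareto_front(points):
    """Return indices of non-dominated points, where lower is better
    on every objective (e.g. [error, latency])."""
    pts = np.asarray(points, dtype=float)
    keep = []
    for i, p in enumerate(pts):
        # p is dominated if some other point is <= on all objectives
        # and strictly < on at least one
        dominated = np.any(
            np.all(pts <= p, axis=1) & np.any(pts < p, axis=1)
        )
        if not dominated:
            keep.append(i)
    return keep

# objectives: (top-1 error %, latency ms) for hypothetical candidates
cands = [(18.0, 5.0), (17.0, 9.0), (19.0, 4.0), (18.5, 6.0)]
front = pareto_front(cands)  # last candidate is dominated by the first
```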
[132] PFM-VEPAR: Prompting Foundation Models for RGB-Event Camera based Pedestrian Attribute Recognition
Minghe Xu, Rouying Wu, ChiaWei Chu, Xiao Wang, Yu Li
Main category: cs.CV
TL;DR: Event Prompter uses lightweight DCT/IDCT operations on event data to augment RGB for pedestrian attribute recognition, with memory bank and Hopfield networks for cross-sample relational learning.
Details
Motivation: Existing two-stream multimodal fusion methods for event-based pedestrian attribute recognition have high computational overhead and fail to leverage contextual guidance from other samples.
Method: Proposes an Event Prompter with lightweight DCT/IDCT operations on event data, an external memory bank with Hopfield networks for associative memory, and cross-attention fusion with RGB features.
Result: Extensive experiments on multiple benchmark datasets validate the effectiveness of the proposed RGB-Event PAR framework.
Conclusion: The approach achieves efficient multimodal fusion for pedestrian attribute recognition by combining lightweight event processing with memory-augmented representation learning.
Abstract: Event-based pedestrian attribute recognition (PAR) leverages motion cues to enhance RGB cameras in low-light and motion-blur scenarios, enabling more accurate inference of attributes like age and emotion. However, existing two-stream multimodal fusion methods introduce significant computational overhead and neglect the valuable guidance from contextual samples. To address these limitations, this paper proposes an Event Prompter. Discarding the computationally expensive auxiliary backbone, this module directly applies extremely lightweight and efficient Discrete Cosine Transform (DCT) and Inverse DCT (IDCT) operations to the event data. This design extracts frequency-domain event features at a minimal computational cost, thereby effectively augmenting the RGB branch. Furthermore, an external memory bank designed to provide rich prior knowledge, combined with modern Hopfield networks, enables associative memory-augmented representation learning. This mechanism effectively mines and leverages global relational knowledge across different samples. Finally, a cross-attention mechanism fuses the RGB and event modalities, followed by feed-forward networks for attribute prediction. Extensive experiments on multiple benchmark datasets fully validate the effectiveness of the proposed RGB-Event PAR framework. The source code of this paper will be released on https://github.com/Event-AHU/OpenPAR
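The DCT/IDCT idea can be sketched as a low-cost frequency-domain filter on an event frame; the `keep` cutoff and low-pass choice below are illustrative assumptions, not the paper's actual Event Prompter design:

```python
import numpy as np
from scipy.fft import dctn, idctn

def event_frequency_prompt(event_frame, keep=8):
    """Sketch of a lightweight frequency-domain event feature:
    2D DCT, retain only the top-left `keep` x `keep` low-frequency
    coefficients, then IDCT back to the spatial domain."""
    coeffs = dctn(event_frame, norm="ortho")
    mask = np.zeros_like(coeffs)
    mask[:keep, :keep] = 1.0
    return idctn(coeffs * mask, norm="ortho")

frame = np.random.default_rng(0).random((32, 32))
prompt = event_frequency_prompt(frame, keep=8)  # same shape, low-pass filtered
```

With `keep` equal to the full resolution, the round trip is lossless, since the orthonormal DCT and IDCT are exact inverses.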
[133] PhyUnfold-Net: Advancing Remote Sensing Change Detection with Physics-Guided Deep Unfolding
Zelin Lei, Yaoxing Ren, Jiaming Chang
Main category: cs.CV
TL;DR: PhyUnfold-Net: A physics-guided deep unfolding framework for bi-temporal change detection that separates genuine changes from acquisition discrepancies using patch-wise singular-value entropy prior and iterative decomposition.
Details
Motivation: Bi-temporal change detection suffers from false alarms due to acquisition discrepancies (illumination, season, atmosphere). The authors observe that genuine changes have higher patch-wise singular-value entropy than pseudo changes in the feature-difference space, providing a physical prior to guide detection.
Method: Proposes PhyUnfold-Net with: 1) an Iterative Change Decomposition Module (ICDM) that unrolls a multi-step solver to separate mixed discrepancy features into change and nuisance components, 2) a staged Exploration-and-Constraint loss (S-SEC) that stabilizes decomposition by encouraging separation early and constraining nuisance magnitude later, and 3) a Wavelet Spectral Suppression Module (WSSM) that suppresses acquisition-induced spectral mismatch before decomposition.
Result: Experiments on four benchmarks show improvements over state-of-the-art methods, with gains particularly evident under challenging conditions with significant acquisition discrepancies.
Conclusion: The physics-guided deep unfolding framework effectively leverages physical priors (singular-value entropy) to separate genuine changes from acquisition discrepancies, providing robust change detection even under challenging conditions with illumination, seasonal, and atmospheric variations.
Abstract: Bi-temporal change detection is highly sensitive to acquisition discrepancies, including illumination, season, and atmosphere, which often cause false alarms. We observe that genuine changes exhibit higher patch-wise singular-value entropy (SVE) than pseudo changes in the feature-difference space. Motivated by this physical prior, we propose PhyUnfold-Net, a physics-guided deep unfolding framework that formulates change detection as an explicit decomposition problem. The proposed Iterative Change Decomposition Module (ICDM) unrolls a multi-step solver to progressively separate mixed discrepancy features into a change component and a nuisance component. To stabilize this process, we introduce a staged Exploration-and-Constraint loss (S-SEC), which encourages component separation in early steps while constraining nuisance magnitude in later steps to avoid degenerate solutions. We further design a Wavelet Spectral Suppression Module (WSSM) to suppress acquisition-induced spectral mismatch before decomposition. Experiments on four benchmarks show improvements over state-of-the-art methods, with gains under challenging conditions.
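The singular-value-entropy prior is easy to verify numerically: a near-rank-1 difference patch (one dominant direction, as in a pseudo change) has low entropy, while an energy-spread patch has high entropy. This is a minimal illustration of the prior, not the paper's exact SVE computation:

```python
import numpy as np

def patch_sv_entropy(patch):
    """Shannon entropy of the normalized singular values of a
    feature-difference patch. Higher entropy means energy is spread
    over many directions, which the paper links to genuine change."""
    s = np.linalg.svd(np.asarray(patch, dtype=float), compute_uv=False)
    p = s / (s.sum() + 1e-12)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

rng = np.random.default_rng(0)
rank1 = np.outer(rng.random(8), rng.random(8))  # one dominant direction
diffuse = rng.random((8, 8))                    # spread spectrum
```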
[134] Efficiency Follows Global-Local Decoupling
Zhenyu Yang, Gensheng Pei, Tao Chen, Yichao Zhou, Tianfei Zhou, Yazhou Yao, Fumin Shen
Main category: cs.CV
TL;DR: ConvNeur: A two-branch architecture that decouples global reasoning and local representation for efficient vision models, achieving competitive performance with subquadratic scaling.
Details
Motivation: Modern vision models need to capture both global context and local details efficiently. Current approaches often sacrifice one for the other or become computationally expensive due to the quadratic scaling of attention.
Method: Introduces ConvNeur with two branches: 1) a lightweight neural memory branch that aggregates global context on compact tokens, and 2) a locality-preserving branch that extracts fine structure. A learned gate modulates local features with global cues without entangling the two objectives.
Result: Matches or surpasses comparable alternatives on classification, detection, and segmentation benchmarks at similar or lower compute. Offers favorable accuracy vs latency trade-offs with subquadratic scaling relative to image size.
Conclusion: Global-local decoupling leads to efficient vision models that maintain both global reasoning and local detail extraction while being computationally affordable.
Abstract: Modern vision models must capture image-level context without sacrificing local detail while remaining computationally affordable. We revisit this tradeoff and advance a simple principle: decouple the roles of global reasoning and local representation. To operationalize this principle, we introduce ConvNeur, a two-branch architecture in which a lightweight neural memory branch aggregates global context on a compact set of tokens, and a locality-preserving branch extracts fine structure. A learned gate lets global cues modulate local features without entangling their objectives. This separation yields subquadratic scaling with image size, retains inductive priors associated with local processing, and reduces overhead relative to fully global attention. On standard classification, detection, and segmentation benchmarks, ConvNeur matches or surpasses comparable alternatives at similar or lower compute and offers favorable accuracy versus latency trade-offs at similar budgets. These results support the view that efficiency follows global-local decoupling.
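The gating idea can be sketched in a few lines: a gate computed from pooled global context rescales local features channel-wise, so global cues modulate local detail without the two branches sharing an objective. The weight shapes below are illustrative placeholders, not ConvNeur's actual parameterization:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(local_feat, global_ctx, w_gate, b_gate):
    """Channel-wise gate in (0, 1) derived from a pooled global
    token, broadcast over the spatial dimensions of local features."""
    g = sigmoid(global_ctx @ w_gate + b_gate)  # shape (C,)
    return local_feat * g                      # broadcast over (H, W, C)

C, H, W = 4, 8, 8
rng = np.random.default_rng(0)
local_feat = rng.random((H, W, C))
global_ctx = rng.random(C)  # pooled global token
out = gated_fusion(local_feat, global_ctx, rng.random((C, C)), np.zeros(C))
```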
[135] Accelerating Diffusion Decoders via Multi-Scale Sampling and One-Step Distillation
Chuhan Wang, Hao Chen
Main category: cs.CV
TL;DR: Two-stage acceleration framework for diffusion-based image tokenization decoders using multi-scale sampling and distillation to achieve order-of-magnitude speedup with minimal quality degradation.
Details
Motivation: Diffusion-based decoders in image tokenization provide high perceptual fidelity but suffer from significant latency due to iterative sampling, making them impractical for real-time or large-scale applications.
Method: Two-stage framework: 1) multi-scale sampling that starts at a coarse resolution and progressively doubles it at each stage (an O(log n) theoretical speedup), 2) distilling the diffusion decoder at each scale into a single-step denoising model for a fast single forward pass per scale.
Result: Achieves order-of-magnitude reduction in decoding time with little degradation in output quality, providing practical pathway toward efficient yet expressive image tokenizers.
Conclusion: The approach enables efficient visual tokenization and serves as foundation for future work in efficient visual tokenization and downstream generation.
Abstract: Image tokenization plays a central role in modern generative modeling by mapping visual inputs into compact representations that serve as an intermediate signal between pixels and generative models. Diffusion-based decoders have recently been adopted in image tokenization to reconstruct images from latent representations with high perceptual fidelity. In contrast to diffusion models used for downstream generation, these decoders are dedicated to faithful reconstruction rather than content generation. However, their iterative sampling process introduces significant latency, making them impractical for real-time or large-scale applications. In this work, we introduce a two-stage acceleration framework to address this inefficiency. First, we propose a multi-scale sampling strategy, where decoding begins at a coarse resolution and progressively refines the output by doubling the resolution at each stage, achieving a theoretical speedup of $\mathcal{O}(\log n)$ compared to standard full-resolution sampling. Second, we distill the diffusion decoder at each scale into a single-step denoising model, enabling fast and high-quality reconstructions in a single forward pass per scale. Together, these techniques yield an order-of-magnitude reduction in decoding time with little degradation in output quality. Our approach provides a practical pathway toward efficient yet expressive image tokenizers. We hope it serves as a foundation for future work in efficient visual tokenization and downstream generation.
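The cost intuition behind the two stages can be sketched arithmetically: each halving of resolution divides pixel count by 4, so one distilled step per scale sums to a small geometric series versus many full-resolution steps. The step and stage counts below are illustrative, not the paper's configuration:

```python
def full_res_cost(n_pixels, steps):
    """Per-pixel denoising work for iterative full-resolution sampling."""
    return n_pixels * steps

def multi_scale_cost(n_pixels, stages):
    """One distilled step per stage; stage k runs at n_pixels / 4**k
    pixels (resolution doubles, pixel count quadruples, each stage)."""
    return sum(n_pixels // (4 ** k) for k in range(stages))

n = 512 * 512
full = full_res_cost(n, steps=50)      # e.g. 50 iterative steps
multi = multi_scale_cost(n, stages=4)  # coarse-to-fine, 4 scales
speedup = full / multi                 # roughly an order of magnitude or more
```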
[136] CurveStream: Boosting Streaming Video Understanding in MLLMs via Curvature-Aware Hierarchical Visual Memory Management
Chao Wang, Xudong Tan, Jianjian Cao, Kangcong Li, Tao Chen
Main category: cs.CV
TL;DR: CurveStream: A training-free, curvature-aware hierarchical visual memory management framework for streaming video understanding in multimodal LLMs that addresses token explosion by adaptively routing frames based on semantic transitions.
Details
Motivation: Multimodal LLMs struggle with streaming videos because the linear explosion of visual tokens causes OOM errors or catastrophic forgetting. Existing methods use uniform sampling or low-level metrics that lack semantic awareness, potentially disrupting contextual coherence and missing critical semantic transitions.
Method: Proposes the CurveStream framework, which uses curvature scores to identify semantic transitions along continuous feature trajectories and an online K-Sigma dynamic threshold to adaptively route frames into clear and fuzzy memory states under a strict token budget.
Result: Achieves absolute performance gains of over 10% (10.69% on StreamingBench and 13.58% on OVOBench) over baselines, establishing new SOTA for streaming video perception across diverse temporal scales.
Conclusion: CurveStream provides an effective, lightweight solution for streaming video understanding in multimodal LLMs by leveraging geometric insights about semantic transitions, enabling better memory management without training.
Abstract: Multimodal Large Language Models have achieved significant success in offline video understanding, yet their application to streaming videos is severely limited by the linear explosion of visual tokens, which often leads to Out-of-Memory (OOM) errors or catastrophic forgetting. Existing visual retention and memory management methods typically rely on uniform sampling, low-level physical metrics, or passive cache eviction. However, these strategies often lack intrinsic semantic awareness, potentially disrupting contextual coherence and blurring transient yet critical semantic transitions. To address these limitations, we propose CurveStream, a training-free, curvature-aware hierarchical visual memory management framework. Our approach is motivated by the key observation that high-curvature regions along continuous feature trajectories closely align with critical global semantic transitions. Based on this geometric insight, CurveStream evaluates real-time semantic intensity via a Curvature Score and integrates an online K-Sigma dynamic threshold to adaptively route frames into clear and fuzzy memory states under a strict token budget. Evaluations across diverse temporal scales confirm that this lightweight framework, CurveStream, consistently yields absolute performance gains of over 10% (e.g., 10.69% on StreamingBench and 13.58% on OVOBench) over respective baselines, establishing new state-of-the-art results for streaming video perception. The code will be released at https://github.com/streamingvideos/CurveStream.
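The curvature-plus-K-Sigma routing can be sketched with a discrete curvature proxy (the angle between successive difference vectors of the feature trajectory) and a Welford-style online threshold. Both the score definition and the router are illustrative assumptions; the paper's exact Curvature Score may differ:

```python
import numpy as np

def curvature_score(prev2, prev1, curr):
    """Discrete curvature proxy: angle between successive difference
    vectors of the frame-feature trajectory."""
    v1, v2 = prev1 - prev2, curr - prev1
    cos = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-12)
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

class KSigmaRouter:
    """Online mean/std of curvature scores; a frame whose score
    exceeds mean + k*std is routed to 'clear' memory."""
    def __init__(self, k=1.5):
        self.k, self.n, self.mean, self.m2 = k, 0, 0.0, 0.0
    def route(self, score):
        self.n += 1
        d = score - self.mean
        self.mean += d / self.n              # Welford running mean
        self.m2 += d * (score - self.mean)   # running sum of sq. deviations
        std = (self.m2 / self.n) ** 0.5 if self.n > 1 else 0.0
        hit = self.n > 1 and score > self.mean + self.k * std
        return "clear" if hit else "fuzzy"

router = KSigmaRouter(k=1.5)
# straight trajectory (low curvature), then a sharp turn at the end
feats = [np.array([t, 0.0]) for t in range(6)] + [np.array([5.0, 4.0])]
labels = [router.route(curvature_score(*feats[i - 2:i + 1]))
          for i in range(2, len(feats))]
```

Only the final frame, where the trajectory bends sharply, crosses the adaptive threshold.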
[137] MagicSeg: Open-World Segmentation Pretraining via Counterfactual Diffusion-Based Auto-Generation
Kaixin Cai, Pengzhen Ren, Jianhua Han, Yi Zhu, Hang Xu, Jianzhuang Liu, Xiaodan Liang
Main category: cs.CV
TL;DR: MagicSeg: A diffusion model-driven pipeline for automatically generating datasets for open-world semantic segmentation, using text generation, paired counterfactual samples, and pseudo mask extraction for contrastive training.
Details
Motivation: Open-world semantic segmentation requires extensive image-text pairs with fine-grained pixel annotations, which are expensive to acquire. Current datasets lack sufficient categories and annotations, limiting model performance.
Method: 1) Generate high-fidelity textual descriptions from class labels; 2) use a diffusion model to generate images from the text; 3) generate both positive and negative (counterfactual) image pairs; 4) extract object masks using open-vocabulary detection and interactive segmentation models; 5) train with contrastive language-image pretraining under pseudo-mask supervision and counterfactual contrastive training.
Result: Achieves state-of-the-art performance on PASCAL VOC (62.9%), PASCAL Context (26.7%), and COCO (40.2%) for open-world semantic segmentation.
Conclusion: MagicSeg provides an effective automated pipeline for generating high-quality datasets for open-world semantic segmentation, reducing reliance on expensive manual annotation while achieving strong downstream performance.
Abstract: Open-world semantic segmentation presently relies significantly on extensive image-text pair datasets, which often suffer from a lack of fine-grained pixel annotations on sufficient categories. The acquisition of such data is rendered economically prohibitive due to the substantial investments of both human labor and time. In light of the formidable image generation capabilities of diffusion models, we introduce a novel diffusion model-driven pipeline for automatically generating datasets tailored to the needs of open-world semantic segmentation, named “MagicSeg”. Our MagicSeg initiates from class labels and proceeds to generate high-fidelity textual descriptions, which in turn serve as guidance for the diffusion model to generate images. Rather than only generating positive samples for each label, our process encompasses the simultaneous generation of corresponding negative images, designed to serve as paired counterfactual samples for contrastive training. Then, to provide a self-supervised signal for open-world segmentation pretraining, our MagicSeg integrates an open-vocabulary detection model and an interactive segmentation model to extract object masks as precise segmentation labels from images based on the provided category labels. By applying our dataset to the contrastive language-image pretraining model with the pseudo mask supervision and the auxiliary counterfactual contrastive training, the downstream model obtains strong performance on open-world semantic segmentation. We evaluate our model on PASCAL VOC, PASCAL Context, and COCO, achieving SOTA with performance of 62.9%, 26.7%, and 40.2%, respectively, demonstrating our dataset’s effectiveness in enhancing open-world semantic segmentation capabilities. Project website: https://github.com/ckxhp/magicseg.
[138] FlowScene: Style-Consistent Indoor Scene Generation with Multimodal Graph Rectified Flow
Zhifei Yang, Guangyao Zhai, Keyang Lu, YuYang Yin, Chao Zhang, Zhen Xiao, Jieyi Long, Nassir Navab, Yikai Wang
Main category: cs.CV
TL;DR: FlowScene is a tri-branch scene generative model that uses multimodal graphs to collaboratively generate scene layouts, object shapes, and object textures with tight-coupled rectified flow for cross-object information exchange.
Details
Motivation: Current scene generation methods have limitations: language-driven retrieval lacks object-level control and scene-level style coherence, while graph-based methods struggle to produce high-fidelity textured results, limiting practical utility.
Method: A tri-branch generative model conditioned on multimodal graphs, with a tight-coupled rectified flow that exchanges object information during generation, enabling collaborative reasoning across the graph for layouts, shapes, and textures.
Result: Outperforms both language-conditioned and graph-conditioned baselines in generation realism, style consistency, and alignment with human preferences.
Conclusion: FlowScene enables fine-grained control of objects’ shapes, textures, and relations while enforcing scene-level style coherence across structure and appearance.
Abstract: Scene generation has extensive industrial applications, demanding both high realism and precise control over geometry and appearance. Language-driven retrieval methods compose plausible scenes from a large object database, but overlook object-level control and often fail to enforce scene-level style coherence. Graph-based formulations offer higher controllability over objects and inform holistic consistency by explicitly modeling relations, yet existing methods struggle to produce high-fidelity textured results, thereby limiting their practical utility. We present FlowScene, a tri-branch scene generative model conditioned on multimodal graphs that collaboratively generates scene layouts, object shapes, and object textures. At its core lies a tight-coupled rectified flow model that exchanges object information during generation, enabling collaborative reasoning across the graph. This enables fine-grained control of objects’ shapes, textures, and relations while enforcing scene-level style coherence across structure and appearance. Extensive experiments show that FlowScene outperforms both language-conditioned and graph-conditioned baselines in terms of generation realism, style consistency, and alignment with human preferences.
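The rectified-flow backbone FlowScene builds on has a simple training target: points on the straight path between noise and data, paired with the constant velocity of that path. This is the generic rectified-flow construction, not the paper's tight-coupled tri-branch variant:

```python
import numpy as np

def rectified_flow_target(x0, x1, t):
    """Rectified-flow training pair: the point on the straight path
    between noise x0 and data x1 at time t, and the constant velocity
    the flow model should predict there."""
    x_t = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0
    return x_t, v_target

rng = np.random.default_rng(0)
x0, x1 = rng.standard_normal(4), rng.standard_normal(4)
x_t, v = rectified_flow_target(x0, x1, t=0.25)
# integrating the constant velocity for the remaining time recovers x1
```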
[139] K-GMRF: Kinetic Gauss-Markov Random Field for First-Principles Covariance Tracking on Lie Groups
ZhiMing Li
Main category: cs.CV
TL;DR: K-GMRF is an online, training-free framework for covariance tracking using second-order dynamics on Lie groups, achieving superior tracking accuracy compared to first-order methods.
Details
Motivation: Existing covariance tracking methods either ignore manifold constraints or use first-order updates that cause phase lag during rapid evolution, limiting their effectiveness in dynamic vision applications.
Method: Reformulates covariance tracking as forced rigid-body motion on Lie groups via the Euler-Poincaré equations, interpreting observations as torques driving a latent angular velocity, propagated with a structure-preserving symplectic integrator.
Result: Achieves 30x reduction in angular error on synthetic ellipses, reduces geodesic error from 29.4° to 9.9° on SO(3) stabilization, and improves IoU from 0.55 to 0.74 on motion-blur sequences with 96% success rate.
Conclusion: K-GMRF provides a plug-and-play geometric prior for data-constrained scenarios and an interpretable layer within modern deep architectures, with proven theoretical advantages over first-order methods.
Abstract: Tracking non-stationary covariance matrices is fundamental to vision yet hindered by existing estimators that either neglect manifold constraints or rely on first-order updates, incurring inevitable phase lag during rapid evolution. We propose K-GMRF, an online, training-free framework for covariance tracking that reformulates the problem as forced rigid-body motion on Lie groups. Derived from the Euler-Poincaré equations, our method interprets observations as torques driving a latent angular velocity, propagated via a structure-preserving symplectic integrator. We theoretically prove that this second-order dynamics achieves zero steady-state error under constant rotation, strictly superior to the proportional lag of first-order baselines. Validation across three domains demonstrates robust tracking fidelity: (i) on synthetic ellipses, K-GMRF reduces angular error by 30x compared to Riemannian EMA while maintaining stability at high speeds; (ii) on SO(3) stabilization with 20% dropout, it decreases geodesic error from 29.4° to 9.9°; and (iii) on OTB motion-blur sequences, it improves IoU from 0.55 to 0.74 on BlurCar2 with a 96% success rate. As a fully differentiable symplectic module, K-GMRF provides a plug-and-play geometric prior for data-constrained scenarios and an interpretable layer within modern deep architectures.
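The torque-driven second-order update can be sketched as a semi-implicit (symplectic-Euler-style) step on SO(3): kick the angular velocity with the torque, then drift the rotation along the updated velocity via the exponential map. This is a simplified stand-in for the paper's Euler-Poincaré integrator, with made-up torque and step size:

```python
import numpy as np
from scipy.spatial.transform import Rotation

def so3_second_order_step(R, omega, torque, dt):
    """One symplectic-Euler-style step of torque-driven rotation:
    update angular velocity first (kick), then rotate along it
    (drift), staying exactly on SO(3) via the exponential map."""
    omega = omega + dt * torque
    R = R @ Rotation.from_rotvec(dt * omega).as_matrix()
    return R, omega

R = np.eye(3)
omega = np.zeros(3)
torque = np.array([0.0, 0.0, 1.0])  # constant torque about z
for _ in range(100):
    R, omega = so3_second_order_step(R, omega, torque, dt=0.01)
```

Because each drift multiplies by an exact rotation matrix, R remains orthogonal after arbitrarily many steps, unlike additive first-order updates.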
[140] Beyond Quadratic: Linear-Time Change Detection with RWKV
Zhenyu Yang, Gensheng Pei, Tao Chen, Xia Yuan, Haofeng Zhang, Xiangbo Shu, Yazhou Yao
Main category: cs.CV
TL;DR: ChangeRWKV introduces a novel architecture for remote sensing change detection that combines Transformer-like parallel training with RNN-like linear-time inference using RWKV framework, achieving SOTA results with significantly reduced computational cost.
Details
Motivation: Existing change detection methods face a trade-off: CNNs are efficient but lack global context, while Transformers capture long-range dependencies at prohibitive computational cost. An architecture that reconciles this conflict is needed for operational-scale applications.
Method: Proposes ChangeRWKV, built on the Receptance Weighted Key Value (RWKV) framework. Key innovations: 1) a hierarchical RWKV encoder for multi-resolution feature representation, and 2) a Spatial-Temporal Fusion Module (STFM) that resolves spatial misalignments across scales while capturing fine-grained temporal discrepancies.
Result: Achieves state-of-the-art performance on LEVIR-CD benchmark with 85.46% IoU and 92.16% F1 score, while drastically reducing parameters and FLOPs compared to previous leading methods.
Conclusion: ChangeRWKV demonstrates a new, efficient, and powerful paradigm for operational-scale change detection, successfully reconciling the CNN-Transformer trade-off through the RWKV framework.
Abstract: Existing paradigms for remote sensing change detection are caught in a trade-off: CNNs excel at efficiency but lack global context, while Transformers capture long-range dependencies at a prohibitive computational cost. This paper introduces ChangeRWKV, a new architecture that reconciles this conflict. By building upon the Receptance Weighted Key Value (RWKV) framework, our ChangeRWKV uniquely combines the parallelizable training of Transformers with the linear-time inference of RNNs. At its core, our approach features two key innovations: a hierarchical RWKV encoder that builds multi-resolution feature representation, and a novel Spatial-Temporal Fusion Module (STFM) engineered to resolve spatial misalignments across scales while distilling fine-grained temporal discrepancies. ChangeRWKV not only achieves state-of-the-art performance on the LEVIR-CD benchmark, with an 85.46% IoU and 92.16% F1 score, but does so while drastically reducing parameters and FLOPs compared to previous leading methods. This work demonstrates a new, efficient, and powerful paradigm for operational-scale change detection. Our code and model are publicly available.
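The linear-time inference RWKV offers comes from replacing attention over all previous tokens with a running state updated in O(1) per token. The simplified recurrence below illustrates that idea only; the real WKV operator adds per-channel decay and a bonus term for the current token:

```python
import numpy as np

def linear_time_mix(ks, vs, decay):
    """Sketch of an RWKV-style token mix: an exponentially decayed
    weighted average of past values, carried as a (numerator,
    denominator) state so each token costs O(1), not O(t)."""
    num = np.zeros_like(vs[0])
    den = 0.0
    outs = []
    for k, v in zip(ks, vs):
        w = np.exp(k)            # scalar key weight for this token
        num = decay * num + w * v
        den = decay * den + w
        outs.append(num / den)
    return outs

T, C = 16, 4
rng = np.random.default_rng(0)
ks = rng.standard_normal(T)       # one key score per token
vs = rng.standard_normal((T, C))  # one value vector per token
outs = linear_time_mix(ks, vs, decay=0.9)
```

For the first token the state contains only that token, so the first output equals its value vector exactly.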
[141] Physion-Eval: Evaluating Physical Realism in Generated Video via Human Reasoning
Qin Zhang, Peiyu Jing, Hong-Xing Yu, Fangqiang Ding, Fan Nie, Weimin Wang, Yilun Du, James Zou, Jiajun Wu, Bing Shuai
Main category: cs.CV
TL;DR: Physion-Eval is a benchmark for evaluating physical realism in video generation models using expert human reasoning across 22 physical categories, revealing that 83-94% of generated videos contain physical glitches.
Details
Motivation: Current video generation evaluations rely on automated metrics or coarse human judgments that provide limited insight into when and why generated videos violate real-world physical constraints. There's a need for fine-grained analysis of physical realism failures.
Method: Created a large-scale benchmark with 10,990 expert reasoning traces across 22 fine-grained physical categories. Generated videos from 5 state-of-the-art models were compared against real-world reference videos, with annotations including temporally localized glitches, structured failure categories, and natural-language explanations.
Result: Revealed striking limitations: 83.3% of exocentric and 93.5% of egocentric generated videos exhibit at least one human-identifiable physical glitch in physics-critical scenarios.
Conclusion: Physion-Eval sets a new standard for physical realism evaluation and should guide development of physics-grounded video generation models.
Abstract: Video generation models are increasingly used as world simulators for storytelling, simulation, and embodied AI. As these models advance, a key question arises: do generated videos obey the physical laws of the real world? Existing evaluations largely rely on automated metrics or coarse human judgments such as preferences or rubric-based checks. While useful for assessing perceptual quality, these methods provide limited insight into when and why generated dynamics violate real-world physical constraints. We introduce Physion-Eval, a large-scale benchmark of expert human reasoning for diagnosing physical realism failures in videos generated by five state-of-the-art models across egocentric and exocentric views, containing 10,990 expert reasoning traces spanning 22 fine-grained physical categories. Each generated video is derived from a corresponding real-world reference video depicting a clear physical process, and annotated with temporally localized glitches, structured failure categories, and natural-language explanations of the violated physical behavior. Using this dataset, we reveal a striking limitation of current video generation models: in physics-critical scenarios, 83.3% of exocentric and 93.5% of egocentric generated videos exhibit at least one human-identifiable physical glitch. We hope Physion-Eval will set a new standard for physical realism evaluation and guide the development of physics-grounded video generation. The benchmark is publicly available at https://huggingface.co/datasets/PhysionLabs/Physion-Eval.
[142] FB-CLIP: Fine-Grained Zero-Shot Anomaly Detection with Foreground-Background Disentanglement
Ming Hu, Yongsheng Huo, Mingyu Dou, Jianfu Yin, Peng Zhao, Yao Wang, Cong Hu, Bingliang Hu, Quan Wang
Main category: cs.CV
TL;DR: FB-CLIP enhances zero-shot anomaly detection by improving CLIP’s ability to separate foreground from background and using multi-strategy textual representations for better semantic understanding.
Details
Motivation: Fine-grained anomaly detection is important for industrial/medical applications but suffers from scarce labeled anomalies. Vision-language models like CLIP show promise but struggle with foreground-background feature entanglement and coarse textual semantics.
Method: FB-CLIP uses multi-strategy textual representations (End-of-Text features, global-pooled representations, attention-weighted tokens) and multi-view soft separation in visual modality (identity, semantic, spatial dimensions) with background suppression. Semantic Consistency Regularization aligns image features with normal/abnormal textual prototypes.
Result: FB-CLIP effectively distinguishes anomalies from complex backgrounds, achieving accurate fine-grained anomaly detection and localization under zero-shot settings.
Conclusion: The proposed framework successfully addresses CLIP’s limitations for zero-shot anomaly detection through improved foreground-background separation and richer semantic representations.
Abstract: Fine-grained anomaly detection is crucial in industrial and medical applications, but labeled anomalies are often scarce, making zero-shot detection challenging. While vision-language models like CLIP offer promising solutions, they struggle with foreground-background feature entanglement and coarse textual semantics. We propose FB-CLIP, a framework that enhances anomaly localization via multi-strategy textual representations and foreground-background separation. In the textual modality, it combines End-of-Text features, global-pooled representations, and attention-weighted token features for richer semantic cues. In the visual modality, multi-view soft separation along identity, semantic, and spatial dimensions, together with background suppression, reduces interference and improves discriminability. Semantic Consistency Regularization (SCR) aligns image features with normal and abnormal textual prototypes, suppressing uncertain matches and enlarging semantic gaps. Experiments show that FB-CLIP effectively distinguishes anomalies from complex backgrounds, achieving accurate fine-grained anomaly detection and localization under zero-shot settings.
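FB-CLIP's exact scoring is not given here, but the core idea of aligning an image feature with normal/abnormal textual prototypes can be sketched with cosine similarities and a temperature-scaled softmax. The function names and the temperature value are illustrative, not the paper's:

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def anomaly_score(img_feat, normal_proto, abnormal_proto, tau=0.07):
    """Softmax over prototype similarities: probability the feature is abnormal."""
    s_n = cosine(img_feat, normal_proto) / tau
    s_a = cosine(img_feat, abnormal_proto) / tau
    m = max(s_n, s_a)  # subtract max for numerical stability
    e_n, e_a = math.exp(s_n - m), math.exp(s_a - m)
    return e_a / (e_n + e_a)
```

A feature closer to the abnormal prototype yields a score near 1, which can be thresholded per patch for zero-shot localization.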
[143] LoD-Loc v3: Generalized Aerial Localization in Dense Cities using Instance Silhouette Alignment
Shuaibang Peng, Juelin Zhu, Xia Li, Kun Yang, Maojun Zhang, Yu Liu, Shen Yan
Main category: cs.CV
TL;DR: LoD-Loc v3 improves aerial visual localization in dense urban areas using instance segmentation and synthetic data for better generalization and reduced ambiguity in dense building scenes.
Details
Motivation: Previous LoD-Loc v2 had poor cross-scene generalization and frequent failures in dense building scenes due to semantic building silhouette alignment with low-detail city models.
Method: Two key innovations: 1) New synthetic data generation pipeline creating InsLoD-Loc, the largest instance segmentation dataset for aerial imagery (100k images), enabling zero-shot generalization. 2) Reformulating localization from semantic to instance silhouette alignment to reduce pose estimation ambiguity in dense scenes.
Result: Extensive experiments show LoD-Loc v3 outperforms existing SOTA baselines by a large margin in both cross-scene and dense urban scenarios.
Conclusion: The method successfully addresses limitations of prior work through instance-level segmentation and synthetic data generation, achieving robust aerial visual localization in challenging dense urban environments.
Abstract: We present LoD-Loc v3, a novel method for generalized aerial visual localization in dense urban environments. While prior work LoD-Loc v2 achieves localization through semantic building silhouette alignment with low-detail city models, it suffers from two key limitations: poor cross-scene generalization and frequent failure in dense building scenes. Our method addresses these challenges through two key innovations. First, we develop a new synthetic data generation pipeline that produces InsLoD-Loc, the largest instance segmentation dataset for aerial imagery to date, comprising 100k images with precise instance building annotations. This enables trained models to exhibit remarkable zero-shot generalization capability. Second, we reformulate the localization paradigm by shifting from semantic to instance silhouette alignment, which significantly reduces pose estimation ambiguity in dense scenes. Extensive experiments demonstrate that LoD-Loc v3 outperforms existing state-of-the-art (SOTA) baselines by a large margin, achieving superior performance in both cross-scene and dense urban scenarios. The project is available at https://nudt-sawlab.github.io/LoD-Locv3/.
[144] ParallelVLM: Lossless Video-LLM Acceleration with Visual Alignment Aware Parallel Speculative Decoding
Quan Kong, Yuhao Shen, Yicheng Ji, Huan Li, Cong Wang
Main category: cs.CV
TL;DR: ParallelVLM is a training-free speculative decoding framework that accelerates Video-LLMs by parallelizing draft generation and verification stages, achieving 2.4-3.4× speedup on video understanding tasks.
Details
Motivation: Current Video-LLMs suffer from slow autoregressive decoding due to massive video tokens. Existing visual token pruning approaches have limited acceleration and cause information loss. There's a need for more efficient decoding methods that maintain accuracy while significantly improving speed.
Method: Proposes ParallelVLM with two parallelized stages: 1) draft generation and verification run concurrently to eliminate mutual waiting, and 2) Unbiased Verifier-Guided Pruning strategy that removes positional bias in attention-guided pruning to better align draft and target models. The framework is training-free and uses speculative decoding.
Result: Achieves 1.6-1.8× expansion of draft window with high accepted lengths. Accelerates video understanding benchmarks by 3.36× on LLaVA-Onevision-72B and 2.42× on Qwen2.5-VL-32B compared to vanilla autoregressive decoding.
Conclusion: ParallelVLM effectively addresses the decoding efficiency bottleneck in Video-LLMs through parallel speculative decoding and unbiased pruning, achieving significant speedups without training while maintaining accuracy.
Abstract: Although current Video-LLMs achieve impressive performance in video understanding tasks, their autoregressive decoding efficiency remains constrained by the massive number of video tokens. Visual token pruning can partially ease this bottleneck, yet existing approaches still suffer from information loss and yield only modest acceleration in decoding. In this paper, we propose ParallelVLM, a training-free draft-then-verify speculative decoding framework that overcomes both mutual waiting and limited speedup-ratio problems between draft and target models in long-video settings. ParallelVLM features two parallelized stages that maximize hardware utilization and incorporate an Unbiased Verifier-Guided Pruning strategy to better align the draft and target models by eliminating the positional bias in attention-guided pruning. Extensive experiments demonstrate that ParallelVLM effectively expands the draft window by $1.6\sim1.8\times$ with high accepted lengths, and accelerates various video understanding benchmarks by 3.36$\times$ on LLaVA-Onevision-72B and 2.42$\times$ on Qwen2.5-VL-32B compared with vanilla autoregressive decoding.
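The draft-then-verify loop underlying speculative decoding (independent of ParallelVLM's specific parallelization) can be sketched with greedy verification; `target` below is a stand-in next-token function, not the paper's model, and real implementations score all draft positions in one batched forward pass rather than sequentially:

```python
def speculative_step(target, prefix, draft_tokens):
    """Greedy draft-then-verify: accept the longest draft prefix whose tokens
    match the target model's own next-token choice; on the first mismatch,
    emit the target's token instead and stop."""
    accepted = []
    ctx = list(prefix)
    for t in draft_tokens:
        want = target(ctx)         # target's next token given current context
        if t == want:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(want)  # correction token keeps the output lossless
            break
    else:
        accepted.append(target(ctx))  # bonus token when every draft is accepted
    return accepted
```

Every emitted token equals what vanilla autoregressive decoding would produce, which is why such schemes are lossless; the speedup comes from verifying many draft tokens per target-model call.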
[145] OrbitNVS: Harnessing Video Diffusion Priors for Novel View Synthesis
Jinglin Liang, Zijian Zhou, Rui Huang, Shuangping Huang, Yichen Gong
Main category: cs.CV
TL;DR: OrbitNVS reformulates novel view synthesis as orbit video generation, adapting a pre-trained video model with camera adapters and normal map guidance to improve geometric and appearance consistency, achieving state-of-the-art results on 3D object datasets.
Details
Motivation: Existing novel view synthesis methods struggle with synthesizing plausible views for unobserved regions, especially under single-view input, and face challenges in maintaining geometry- and appearance-consistency across different viewpoints.
Method: Reformulates NVS as orbit video generation, adapting a pre-trained video generation model with camera adapters for accurate camera control. Incorporates a normal map generation branch and uses normal map features via attention mechanism to guide target view synthesis for geometric consistency. Applies pixel-space supervision to address blurry appearance caused by spatial compression in latent space.
Result: Significantly outperforms previous methods on GSO and OmniObject3D benchmarks, especially in challenging single-view setting (+2.9 dB and +2.4 dB PSNR improvements).
Conclusion: OrbitNVS effectively addresses key challenges in novel view synthesis by leveraging video generation priors and incorporating geometric guidance mechanisms, demonstrating strong performance particularly in data-scarce single-view scenarios.
Abstract: Novel View Synthesis (NVS) aims to generate unseen views of a 3D object given a limited number of known views. Existing methods often struggle to synthesize plausible views for unobserved regions, particularly under single-view input, and still face challenges in maintaining geometry- and appearance-consistency. To address these issues, we propose OrbitNVS, which reformulates NVS as an orbit video generation task. Through tailored model design and training strategies, we adapt a pre-trained video generation model to the NVS task, leveraging its rich visual priors to achieve high-quality view synthesis. Specifically, we incorporate camera adapters into the video model to enable accurate camera control. To enhance two key properties of 3D objects, geometry and appearance, we design a normal map generation branch and use normal map features to guide the synthesis of the target views via attention mechanism, thereby improving geometric consistency. Moreover, we apply pixel-space supervision to alleviate blurry appearance caused by spatial compression in the latent space. Extensive experiments show that OrbitNVS significantly outperforms previous methods on the GSO and OmniObject3D benchmarks, especially in the challenging single-view setting (e.g., +2.9 dB and +2.4 dB PSNR).
[146] UniPR: Unified Object-level Real-to-Sim Perception and Reconstruction from a Single Stereo Pair
Chuanrui Zhang, Yingshuang Zou, ZhengXian Wu, Yonggen Ling, Yuxiao Yang, Ziwei Wang
Main category: cs.CV
TL;DR: UniPR is an end-to-end framework for object-level real-to-sim perception and reconstruction from stereo images, addressing scale ambiguity and eliminating per-category canonical definitions.
Details
Motivation: Existing modular pipelines for real-to-sim transfer suffer from inefficiency and cumulative errors due to operating on partial information and discarding global context. The authors aim to create an end-to-end solution that leverages geometric constraints from stereo images.
Method: UniPR operates directly on single stereo image pairs, using geometric constraints to resolve scale ambiguity. It introduces Pose-Aware Shape Representation to bridge reconstruction and pose estimation without per-category canonical definitions. The framework reconstructs all objects in a scene in parallel within a single forward pass.
Result: Extensive experiments show UniPR achieves significant efficiency gains and preserves true physical proportions across diverse object types. The authors also created LVS6D, a large-vocabulary stereo dataset with over 6,300 objects to facilitate research.
Conclusion: UniPR demonstrates potential for practical robotic applications by providing an efficient, end-to-end solution for object-level real-to-sim perception and reconstruction that overcomes limitations of modular pipelines.
Abstract: Perceiving and reconstructing objects from images are critical for real-to-sim transfer tasks, which are widely used in the robotics community. Existing methods rely on multiple submodules such as detection, segmentation, shape reconstruction, and pose estimation to complete the pipeline. However, such modular pipelines suffer from inefficiency and cumulative error, as each stage operates on only partial or locally refined information while discarding global context. To address these limitations, we propose UniPR, the first end-to-end object-level real-to-sim perception and reconstruction framework. Operating directly on a single stereo image pair, UniPR leverages geometric constraints to resolve the scale ambiguity. We introduce Pose-Aware Shape Representation to eliminate the need for per-category canonical definitions and to bridge the gap between reconstruction and pose estimation tasks. Furthermore, we construct a large-vocabulary stereo dataset, LVS6D, comprising over 6,300 objects, to facilitate large-scale research in this area. Extensive experiments demonstrate that UniPR reconstructs all objects in a scene in parallel within a single forward pass, achieving significant efficiency gains and preserving true physical proportions across diverse object types, highlighting its potential for practical robotic applications.
[147] Disentangle-then-Align: Non-Iterative Hybrid Multimodal Image Registration via Cross-Scale Feature Disentanglement
Chunlei Zhang, Jiahao Xia, Yun Xiao, Bo Jiang, Jian Zhang
Main category: cs.CV
TL;DR: HRNet: Hybrid Registration Network for multimodal image registration that learns stable shared features and unified hybrid transformations through disentanglement and multi-scale architecture.
Details
Motivation: Address two key limitations in multimodal image registration: 1) modality-private cues leaking into shared feature space despite disentanglement methods, and 2) most multi-scale frameworks supporting only single transformation types, limiting applicability when both global misalignment and local deformation coexist.
Method: Proposes HRNet with Modality-Specific Batch Normalization (MSBN) for multi-scale feature extraction, Cross-scale Disentanglement and Adaptive Projection (CDAP) module to suppress modality-private cues and project shared features into stable subspace, and Hybrid Parameter Prediction Module (HPPM) for non-iterative coarse-to-fine estimation of global rigid parameters and deformation fields.
Result: Extensive experiments on four multimodal datasets demonstrate state-of-the-art performance on both rigid and non-rigid registration tasks.
Conclusion: HRNet effectively addresses hybrid multimodal registration by jointly learning stable shared feature space and unified hybrid transformation, outperforming existing methods across diverse registration scenarios.
Abstract: Multimodal image registration is a fundamental task and a prerequisite for downstream cross-modal analysis. Despite recent progress in shared feature extraction and multi-scale architectures, two key limitations remain. First, some methods use disentanglement to learn shared features but mainly regularize the shared part, allowing modality-private cues to leak into the shared space. Second, most multi-scale frameworks support only a single transformation type, limiting their applicability when global misalignment and local deformation coexist. To address these issues, we formulate hybrid multimodal registration as jointly learning a stable shared feature space and a unified hybrid transformation. Based on this view, we propose HRNet, a Hybrid Registration Network that couples representation disentanglement with hybrid parameter prediction. A shared backbone with Modality-Specific Batch Normalization (MSBN) extracts multi-scale features, while a Cross-scale Disentanglement and Adaptive Projection (CDAP) module suppresses modality-private cues and projects shared features into a stable subspace for matching. Built on this shared space, a Hybrid Parameter Prediction Module (HPPM) performs non-iterative coarse-to-fine estimation of global rigid parameters and deformation fields, which are fused into a coherent deformation field. Extensive experiments on four multimodal datasets demonstrate state-of-the-art performance on rigid and non-rigid registration tasks. The code is available at the project website.
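Modality-Specific Batch Normalization is not detailed beyond its name; the idea of sharing backbone weights while keeping separate normalization statistics per modality can be sketched in 1-D as follows (the class name, modality keys, and simplifications are ours, not the paper's):

```python
class ModalitySpecificNorm:
    """Minimal 1-D sketch of MSBN: the transform is shared, but each
    modality normalizes with (and records) its own batch statistics."""

    def __init__(self, modalities, eps=1e-5):
        self.eps = eps
        self.stats = {m: None for m in modalities}  # per-modality (mean, var)

    def __call__(self, x, modality):
        mean = sum(x) / len(x)
        var = sum((v - mean) ** 2 for v in x) / len(x)
        self.stats[modality] = (mean, var)  # statistics never mix across modalities
        return [(v - mean) / (var + self.eps) ** 0.5 for v in x]
```

Keeping statistics per modality prevents one modality's intensity distribution from skewing the normalization applied to the other, while all learned weights remain shared.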
[148] IUP-Pose: Decoupled Iterative Uncertainty Propagation for Real-time Relative Pose Regression via Implicit Dense Alignment
Jun Wang, Xiaoyan Huang
Main category: cs.CV
TL;DR: IUP-Pose: A geometry-driven decoupled iterative framework for relative pose estimation with implicit dense alignment, achieving high accuracy with end-to-end differentiability and real-time performance.
Details
Motivation: Existing relative pose estimation methods face trade-offs: feature-matching pipelines with RANSAC are accurate but not differentiable, while ViT-based regressors are differentiable but too slow for real-time deployment. The core bottlenecks are coupling between rotation/translation estimation and insufficient cross-view feature alignment.
Method: Proposes IUP-Pose with a lightweight Multi-Head Bi-Cross Attention (MHBC) module for implicit dense feature alignment without explicit matching supervision. Uses a decoupled pipeline: two shared-parameter rotation stages iteratively refine rotation with uncertainty, then realigns feature maps via rotational homography H_inf before translation prediction.
Result: Achieves 73.3% AUC@20deg on MegaDepth1500 benchmark with full end-to-end differentiability, 70 FPS throughput, and only 37M parameters, demonstrating favorable accuracy-efficiency trade-off for real-time edge deployment.
Conclusion: IUP-Pose provides an effective solution to the accuracy-efficiency trade-off in relative pose estimation, offering differentiable, real-time performance suitable for edge deployment while maintaining high accuracy.
Abstract: Relative pose estimation is fundamental for SLAM, visual localization, and 3D reconstruction. Existing Relative Pose Regression (RPR) methods face a key trade-off: feature-matching pipelines achieve high accuracy but block gradient flow via non-differentiable RANSAC, while ViT-based regressors are end-to-end trainable but prohibitively expensive for real-time deployment. We identify the core bottlenecks as the coupling between rotation and translation estimation and insufficient cross-view feature alignment. We propose IUP-Pose, a geometry-driven decoupled iterative framework with implicit dense alignment. A lightweight Multi-Head Bi-Cross Attention (MHBC) module aligns cross-view features without explicit matching supervision. The aligned features are processed by a decoupled rotation-translation pipeline: two shared-parameter rotation stages iteratively refine rotation with uncertainty, and feature maps are realigned via rotational homography H_inf before translation prediction. IUP-Pose achieves 73.3% AUC@20deg on MegaDepth1500 with full end-to-end differentiability, 70 FPS throughput, and only 37M parameters, demonstrating a favorable accuracy-efficiency trade-off for real-time edge deployment.
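The rotational homography used for realignment is the standard infinite homography H_inf = K R K^{-1}, which maps pixels between two views related by pure rotation. A self-contained sketch with an in-plane rotation and, for simplicity, an identity intrinsic matrix K (the helper names are ours):

```python
import math

def matmul(A, B):
    """3x3 matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def rot_z(theta):
    """In-plane rotation about the optical axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def h_inf(K, K_inv, R):
    """Infinite homography H = K R K^{-1}: warps pixels of one view onto the
    other under pure camera rotation (no translation, hence no parallax)."""
    return matmul(matmul(K, R), K_inv)

def warp(H, u, v):
    """Apply a homography to a pixel (u, v) in homogeneous coordinates."""
    x, y, w = (H[i][0] * u + H[i][1] * v + H[i][2] for i in range(3))
    return x / w, y / w
```

Because H_inf depends only on the rotation estimate, translation can then be predicted from feature maps in which the rotational component of the motion has already been factored out.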
[149] Dual Prompt-Driven Feature Encoding for Nighttime UAV Tracking
Yiheng Wang, Changhong Fu, Liangliang Yao, Haobo Zuo, Zijie Zhang
Main category: cs.CV
TL;DR: A dual prompt-driven feature encoding method for nighttime UAV tracking that uses illumination and viewpoint prompts to improve robustness under challenging conditions.
Details
Motivation: Existing UAV tracking methods often overlook critical illumination and viewpoint cues, especially under challenging nighttime conditions, leading to degraded tracking performance.
Method: Proposes DPTracker with pyramid illumination prompter for multi-scale frequency-aware illumination prompts and dynamic viewpoint prompter that modulates deformable convolution offsets to accommodate viewpoint variations.
Result: Extensive experiments validate effectiveness in nighttime UAV tracking, with ablation studies showing contributions of each component and real-world tests demonstrating robustness.
Conclusion: The dual prompt-driven feature encoding method enables domain-invariant feature encoding for robust nighttime UAV tracking.
Abstract: Robust feature encoding constitutes the foundation of UAV tracking by enabling the nuanced perception of target appearance and motion, thereby playing a pivotal role in ensuring reliable tracking. However, existing feature encoding methods often overlook critical illumination and viewpoint cues, which are essential for robust perception under challenging nighttime conditions, leading to degraded tracking performance. To overcome the above limitation, this work proposes a dual prompt-driven feature encoding method that integrates prompt-conditioned feature adaptation and context-aware prompt evolution to promote domain-invariant feature encoding. Specifically, the pyramid illumination prompter is proposed to extract multi-scale frequency-aware illumination prompts. The dynamic viewpoint prompter modulates deformable convolution offsets to accommodate viewpoint variations, enabling the tracker to learn view-invariant features. Extensive experiments validate the effectiveness of the proposed dual prompt-driven tracker (DPTracker) in tackling nighttime UAV tracking. Ablation studies highlight the contribution of each component in DPTracker. Real-world tests under diverse nighttime UAV tracking scenarios further demonstrate the robustness and practical utility. The code and demo videos are available at https://github.com/yiheng-wang-duke/DPTracker.
[150] UniBioTransfer: A Unified Framework for Multiple Biometrics Transfer
Caiyi Sun, Yujing Sun, Xiangyu Li, Yuhang Zheng, Yiming Ren, Jiamin Wang, Yuexin Ma, Siu-Ming Yiu
Main category: cs.CV
TL;DR: UniBioTransfer is a unified framework for multiple deepface generation tasks including face/hair transfer and reenactment, addressing data scarcity and cross-task conflicts through a unified data construction strategy and BioMoE architecture.
Details
Motivation: Traditional deepface generation uses task-specific models which limit generalization and scalability. A unified model for multiple tasks is promising but challenging due to data scarcity and cross-task conflicts from heterogeneous attribute transformations.
Method: Proposes UniBioTransfer with: 1) unified data construction strategy including swapping-based corruption for spatially dynamic attributes like hair, 2) BioMoE (mixture-of-experts) model with two-stage training to disentangle task-specific knowledge and mitigate cross-task interference.
Result: Extensive experiments show UniBioTransfer outperforms both existing unified models and task-specific methods across a wide range of deepface generation tasks, demonstrating effectiveness, generalization, and scalability.
Conclusion: UniBioTransfer successfully addresses the limitations of task-specific deepface generation models by providing a unified framework that handles multiple tasks, generalizes to unseen tasks with minimal fine-tuning, and overcomes data scarcity and cross-task conflicts.
Abstract: Deepface generation has traditionally followed a task-driven paradigm, where distinct tasks (e.g., face transfer and hair transfer) are addressed by task-specific models. Nevertheless, this single-task setting severely limits model generalization and scalability. A unified model capable of solving multiple deepface generation tasks in a single pass represents a promising and practical direction, yet remains challenging due to data scarcity and cross-task conflicts arising from heterogeneous attribute transformations. To this end, we propose UniBioTransfer, the first unified framework capable of handling both conventional deepface tasks (e.g., face transfer and face reenactment) and shape-varying transformations (e.g., hair transfer and head transfer). Moreover, UniBioTransfer naturally generalizes to unseen tasks, like lip, eye, and glasses transfer, with minimal fine-tuning. Generally, UniBioTransfer addresses data insufficiency in multi-task generation through a unified data construction strategy, including a swapping-based corruption mechanism designed for spatially dynamic attributes like hair. It further mitigates cross-task interference via an innovative BioMoE, a mixture-of-experts based model coupled with a novel two-stage training strategy that effectively disentangles task-specific knowledge. Extensive experiments demonstrate the effectiveness, generalization, and scalability of UniBioTransfer, outperforming both existing unified models and task-specific methods across a wide range of deepface generation tasks. Project page is at https://scy639.github.io/UniBioTransfer.github.io/
[151] OmniDiT: Extending Diffusion Transformer to Omni-VTON Framework
Weixuan Zeng, Pengcheng Wei, Huaiqing Wang, Boheng Zhang, Jia Sun, Dewen Fan, Lin HE, Long Chen, Qianqian Gan, Fan Yang, Tingting Gao
Main category: cs.CV
TL;DR: OmniDiT is a unified diffusion transformer framework for virtual try-on and try-off tasks that addresses detail preservation, generalization, and efficiency through a self-evolving data pipeline, adaptive position encoding, and shifted window attention.
Details
Motivation: Existing virtual try-on methods face challenges with fine-grained detail preservation, generalization to complex scenes, complicated pipelines, and efficient inference. There's a need for a unified approach that handles both try-on and try-off tasks effectively.
Method: Proposes OmniDiT framework with: 1) Self-evolving data curation pipeline creating Omni-TryOn dataset (380k+ garment-model-tryon pairs), 2) Token concatenation with adaptive position encoding for multiple reference conditions, 3) Shifted Window Attention for linear complexity in diffusion models, 4) Multiple timestep prediction and alignment loss to improve generation fidelity.
Result: Achieves best performance in model-free VTON and VTOFF tasks under various complex scenes, and comparable performance to SOTA methods in model-based VTON task.
Conclusion: OmniDiT successfully addresses key challenges in virtual try-on through a unified diffusion transformer framework with efficient attention mechanisms and comprehensive data curation.
Abstract: Despite the rapid advancement of Virtual Try-On (VTON) and Try-Off (VTOFF) technologies, existing VTON methods face challenges with fine-grained detail preservation, generalization to complex scenes, complicated pipelines, and inefficient inference. To tackle these problems, we propose OmniDiT, an omni Virtual Try-On framework based on the Diffusion Transformer, which combines try-on and try-off tasks into one unified model. Specifically, we first establish a self-evolving data curation pipeline to continuously produce data, and construct a large VTON dataset Omni-TryOn, which contains over 380k diverse and high-quality garment-model-tryon image pairs and detailed text prompts. Then, we employ the token concatenation and design an adaptive position encoding to effectively incorporate multiple reference conditions. To relieve the bottleneck of long sequence computation, we are the first to introduce Shifted Window Attention into the diffusion model, thus achieving linear complexity. To remedy the performance degradation caused by local window attention, we utilize multiple timestep prediction and an alignment loss to improve generation fidelity. Experiments reveal that, under various complex scenes, our method achieves the best performance in both the model-free VTON and VTOFF tasks and a performance comparable to current SOTA methods in the model-based VTON task.
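The claimed linear complexity of window attention follows from restricting pairwise attention scores to fixed-size windows. A back-of-envelope cost model (illustrative only; it ignores the shift offset and any cross-window terms):

```python
def attention_scores(n, window=None):
    """Count of pairwise attention scores for a sequence of n tokens.
    Full attention is O(n^2); window attention with window size w scores
    only within each window, giving O(n * w)."""
    if window is None:
        return n * n                       # full self-attention
    full_windows, rem = divmod(n, window)
    return full_windows * window * window + rem * rem
```

For long video token sequences the gap is dramatic: at n = 1024 with a window of 8, the window variant computes n·w = 8,192 scores versus n² = 1,048,576 for full attention.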
[152] GravCal: Single-Image Calibration of IMU Gravity Priors with Per-Sample Confidence
Haichao Zhu, Qian Zhang
Main category: cs.CV
TL;DR: GravCal: A feedforward model that calibrates noisy gravity priors from IMUs using single RGB images, combining residual correction and image-only estimation with learned adaptive fusion.
Details
Motivation: Gravity estimation is crucial for visual-inertial perception systems, but IMU-based gravity priors are often unreliable due to linear acceleration, vibration, and transient motion. Existing methods either estimate gravity directly from images or assume accurate inertial input, leaving the practical problem of correcting noisy gravity priors from single images largely unaddressed.
Method: GravCal is a feedforward model that takes one RGB image and a noisy gravity prior as input, and predicts a corrected gravity direction with per-sample confidence. The model combines two complementary predictions: (1) residual correction of the input prior, and (2) prior-independent image estimate. A learned gate adaptively fuses these predictions based on prior quality.
Result: Extensive experiments show GravCal reduces mean angular error from 22.02° (raw IMU prior) to 14.24°, with larger improvements when the prior is severely corrupted. The paper also introduces a novel dataset of over 148K frames with paired VIO-derived ground-truth gravity and Mahony-filter IMU priors across diverse scenes and camera orientations.
Conclusion: GravCal effectively calibrates noisy gravity priors from single images, with the learned gate providing useful confidence signals for downstream systems. The approach addresses a practical gap in visual-inertial perception by combining image-based and inertial information adaptively.
Abstract: Gravity estimation is fundamental to visual-inertial perception, augmented reality, and robotics, yet gravity priors from IMUs are often unreliable under linear acceleration, vibration, and transient motion. Existing methods often estimate gravity directly from images or assume reasonably accurate inertial input, leaving the practical problem of correcting a noisy gravity prior from a single image largely unaddressed. We present GravCal, a feedforward model for single-image gravity prior calibration. Given one RGB image and a noisy gravity prior, GravCal predicts a corrected gravity direction and a per-sample confidence score. The model combines two complementary predictions, including a residual correction of the input prior and a prior-independent image estimate, and uses a learned gate to fuse them adaptively. Extensive experiments show strong gains over raw inertial priors: GravCal reduces mean angular error from 22.02° (IMU prior) to 14.24°, with larger improvements when the prior is severely corrupted. We also introduce a novel dataset of over 148K frames with paired VIO-derived ground-truth gravity and Mahony-filter IMU priors across diverse scenes and arbitrary camera orientations. The learned gate also correlates with prior quality, making it a useful confidence signal for downstream systems.
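The adaptive fusion at the core of GravCal can be sketched as a scalar sigmoid gate blending the residual-corrected prior with the image-only estimate. This is a minimal illustration, not the paper's architecture: the function names, the scalar gate, and the simple additive residual are all assumptions.

```python
import math

def normalize(v):
    """Scale a 3-vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def fuse_gravity(prior, residual, image_est, gate_logit):
    """Blend two gravity predictions with a learned scalar gate.

    prior:      noisy IMU gravity direction (unit 3-vector)
    residual:   predicted correction to the prior
    image_est:  prior-independent, image-only estimate (unit 3-vector)
    gate_logit: raw output of a hypothetical gating head
    """
    corrected = normalize([p + r for p, r in zip(prior, residual)])
    g = 1.0 / (1.0 + math.exp(-gate_logit))  # sigmoid: trust in the prior branch
    fused = [g * c + (1.0 - g) * e for c, e in zip(corrected, image_est)]
    return normalize(fused), g
```

A large positive gate logit keeps the corrected IMU prior; a large negative one falls back to the image-only estimate, which is how a per-sample confidence signal naturally falls out of the gate.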
[153] CS-MUNet: A Channel-Spatial Dual-Stream Mamba Network for Multi-Organ Segmentation
Yuyang Zheng, Mingda Zhang, Jianglong Qin, Qi Mo, Jingdan Pan, Haozhe Hu, Hongyi Huang
Main category: cs.CV
TL;DR: CS-MUNet introduces a Mamba-based architecture for abdominal organ segmentation with boundary-aware state modeling and cross-channel semantic collaboration modules.
Details
Motivation: Existing Mamba-based methods for abdominal organ segmentation neglect cross-channel anatomical semantic collaboration and lack explicit boundary-aware feature fusion mechanisms, limiting segmentation accuracy.
Method: Proposes CS-MUNet with two novel modules: 1) Boundary-Aware State Mamba module using Bayesian-attention for pixel-level boundary posterior maps injected into Mamba’s scan parameters, and 2) Channel Mamba State Aggregation module redefining channel dimension as SSM sequence dimension to model cross-channel anatomical semantic collaboration.
Result: Experiments on two public benchmarks show CS-MUNet consistently outperforms state-of-the-art methods across multiple metrics, establishing a new SSM modeling paradigm.
Conclusion: CS-MUNet successfully addresses channel semantic collaboration and boundary-aware feature fusion for abdominal multi-organ segmentation, offering a new SSM modeling approach.
Abstract: Recently, Mamba-based methods have shown promise in abdominal organ segmentation. However, existing approaches neglect cross-channel anatomical semantic collaboration and lack explicit boundary-aware feature fusion mechanisms. To address these limitations, we propose CS-MUNet with two purpose-built modules. The Boundary-Aware State Mamba module employs a Bayesian-attention framework to generate pixel-level boundary posterior maps, injected directly into Mamba’s core scan parameters to embed boundary awareness into the SSM state transition mechanism, while dual-branch weight allocation enables complementary modulation between global and local structural representations. The Channel Mamba State Aggregation module redefines the channel dimension as the SSM sequence dimension to explicitly model cross-channel anatomical semantic collaboration in a data-driven manner. Experiments on two public benchmarks demonstrate that CS-MUNet consistently outperforms state-of-the-art methods across multiple metrics, establishing a new SSM modeling paradigm that jointly addresses channel semantic collaboration and boundary-aware feature fusion for abdominal multi-organ segmentation.
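The Channel Mamba State Aggregation idea, treating the channel axis as the SSM's sequence axis, can be caricatured with a plain linear state-space scan. Real Mamba blocks use input-dependent (selective) parameters and run per spatial location; the fixed scalars below are purely illustrative.

```python
def channel_scan(features, a=0.9, b=1.0, c=1.0):
    """Linear state-space scan where the *channel* axis plays the role of
    the sequence axis. `features` holds per-channel values at one spatial
    location, ordered [c0, c1, ...]; a, b, c are fixed toy SSM parameters.
    """
    h = 0.0
    out = []
    for x in features:       # iterate over channels, not spatial positions
        h = a * h + b * x    # hidden state accumulates cross-channel context
        out.append(c * h)    # each channel's output depends on earlier channels
    return out
```

Because the state is carried along the channel ordering, each output channel mixes in information from the channels scanned before it, which is the cross-channel collaboration the module is after.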
[154] Semantic Audio-Visual Navigation in Continuous Environments
Yichen Zeng, Hebaixu Wang, Meng Liu, Yu Zhou, Chen Gao, Kehan Chen, Gongping Huang
Main category: cs.CV
TL;DR: SAVN-CE introduces continuous audio-visual navigation where agents move freely in 3D spaces, addressing intermittent sound targets with MAGNet, a multimodal transformer model that integrates spatial-semantic goal representations with historical context for memory-augmented reasoning.
Details
Motivation: Existing audio-visual navigation approaches rely on precomputed room impulse responses and discrete grid positions, creating unrealistic settings with spatially discontinuous observations. Real-world scenarios involve continuous movement and intermittent sound targets that can become silent, causing agents to lose goal information.
Method: Proposes MAGNet, a multimodal transformer-based model that jointly encodes spatial and semantic goal representations. It integrates historical context with self-motion cues to enable memory-augmented goal reasoning in continuous 3D environments where targets may intermittently stop emitting sound.
Result: MAGNet significantly outperforms state-of-the-art methods, achieving up to 12.1% absolute improvement in success rate. Demonstrates robustness to short-duration sounds and long-distance navigation scenarios in continuous environments.
Conclusion: The SAVN-CE setting with MAGNet establishes a more realistic audio-visual navigation paradigm that handles intermittent sound targets through memory-augmented multimodal reasoning, advancing embodied AI capabilities in continuous 3D environments.
Abstract: Audio-visual navigation enables embodied agents to navigate toward sound-emitting targets by leveraging both auditory and visual cues. However, most existing approaches rely on precomputed room impulse responses (RIRs) for binaural audio rendering, restricting agents to discrete grid positions and leading to spatially discontinuous observations. To establish a more realistic setting, we introduce Semantic Audio-Visual Navigation in Continuous Environments (SAVN-CE), where agents can move freely in 3D spaces and perceive temporally and spatially coherent audio-visual streams. In this setting, targets may intermittently become silent or stop emitting sound entirely, causing agents to lose goal information. To tackle this challenge, we propose MAGNet, a multimodal transformer-based model that jointly encodes spatial and semantic goal representations and integrates historical context with self-motion cues to enable memory-augmented goal reasoning. Comprehensive experiments demonstrate that MAGNet significantly outperforms state-of-the-art methods, achieving up to a 12.1% absolute improvement in success rate. These results also highlight its robustness to short-duration sounds and long-distance navigation scenarios. The code is available at https://github.com/yichenzeng24/SAVN-CE.
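A toy version of memory-augmented goal reasoning when the target goes silent: trust the audio-based estimate while it exists, otherwise propagate the remembered relative goal with self-motion (dead reckoning). MAGNet learns this behavior with a transformer over history; the explicit update below is only a hand-coded stand-in.

```python
def update_goal_estimate(goal_rel, motion, heard_goal_rel=None):
    """Maintain a goal estimate across silent intervals.

    goal_rel:       remembered goal position relative to the agent (x, y)
    motion:         agent displacement this step (dx, dy), same frame
    heard_goal_rel: fresh audio-derived estimate, or None while silent
    """
    if heard_goal_rel is not None:
        return heard_goal_rel                      # sound audible: re-anchor
    return (goal_rel[0] - motion[0], goal_rel[1] - motion[1])  # dead reckoning
```

Moving one unit toward a remembered goal at (4, 0) leaves the estimate at (3, 0), so the agent retains a usable target even through a long silent stretch.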
[155] Toward High-Fidelity Visual Reconstruction: From EEG-Based Conditioned Generation to Joint-Modal Guided Rebuilding
Zhijian Gong, Tianren Yao, Wenjia Dong, Xueyuan Xu
Main category: cs.CV
TL;DR: JMVR is a novel framework for visual reconstruction from EEG signals that treats EEG and text as independent modalities for joint learning, preserving EEG-specific spatial and chromatic details through multi-scale encoding and image augmentation.
Details
Motivation: Current EEG-based visual reconstruction approaches are deeply coupled with alignment frameworks that force EEG features to align with text or image semantics, which may condense the rich spatial and chromatic details in EEG signals, resulting in mere conditioned image generation rather than high-fidelity visual reconstruction.
Method: Proposes Joint-Modal Visual Reconstruction (JMVR) framework that treats EEG and text as independent modalities for joint learning to preserve EEG-specific information. Uses multi-scale EEG encoding to capture both fine- and coarse-grained features, alongside image augmentation to enhance recovery of perceptual details.
Result: Extensive experiments on THINGS-EEG dataset demonstrate JMVR achieves state-of-the-art performance against six baseline methods, specifically exhibiting superior capabilities in modeling spatial structure and chromatic fidelity.
Conclusion: JMVR effectively addresses limitations of alignment-based approaches by preserving EEG-specific information through joint-modal learning, enabling higher-fidelity visual reconstruction from neural signals.
Abstract: Human visual reconstruction aims to reconstruct fine-grained visual stimuli based on subject-provided descriptions and corresponding neural signals. As a widely adopted modality, Electroencephalography (EEG) captures rich visual cognition information, encompassing complex spatial relationships and chromatic details within scenes. However, current approaches are deeply coupled with an alignment framework that forces EEG features to align with text or image semantic representations. This dependency may condense the rich spatial and chromatic details in EEG, yielding mere conditioned image generation rather than high-fidelity visual reconstruction. To address this limitation, we propose a novel Joint-Modal Visual Reconstruction (JMVR) framework. It treats EEG and text as independent modalities for joint learning to preserve EEG-specific information for reconstruction. It further employs a multi-scale EEG encoding strategy to capture both fine- and coarse-grained features, alongside image augmentation to enhance the recovery of perceptual details. Extensive experiments on the THINGS-EEG dataset demonstrate that JMVR achieves SOTA performance against six baseline methods, specifically exhibiting superior capabilities in modeling spatial structure and chromatic fidelity.
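The multi-scale EEG encoding can be illustrated by pooling the signal at several window sizes and concatenating the results, so fine- and coarse-grained structure both survive. This is a sketch of the idea only; JMVR's actual encoder is learned, and the window sizes here are arbitrary.

```python
def multiscale_encode(signal, scales=(2, 4)):
    """Average-pool a 1D sequence at several window sizes and concatenate,
    keeping both fine-grained (small window) and coarse-grained (large
    window) structure in one feature vector."""
    feats = []
    for w in scales:
        feats += [sum(signal[i:i + w]) / w
                  for i in range(0, len(signal) - w + 1, w)]
    return feats
```

On a four-sample signal this yields two fine features (windows of 2) followed by one coarse feature (window of 4).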
[156] Making Video Models Adhere to User Intent with Minor Adjustments
Daniel Ajisafe, Eric Hedlin, Helge Rhodin, Kwang Moo Yi
Main category: cs.CV
TL;DR: Optimizing bounding box positions in text-to-video diffusion models by aligning them with internal attention maps improves generation quality and control adherence.
Details
Motivation: While text-to-video diffusion models have advanced, controlling their generations through bounding boxes/layouts remains challenging, as models often don't fully adhere to these control inputs.
Method: Proposes optimizing user-provided bounding boxes by making them differentiable via smooth masks and using attention-maximization objectives to align boxes with the model’s internal attention maps, balancing foreground/background focus.
Result: Small modifications to bounding box positions significantly improve generation quality and control adherence, validated through thorough experiments and user studies.
Conclusion: Optimizing bounding boxes to align with diffusion model’s internal attention maps is an effective approach for improving control in text-to-video generation.
Abstract: With the recent drastic advancements in text-to-video diffusion models, controlling their generations has drawn interest. A popular way for control is through bounding boxes or layouts. However, enforcing adherence to these control inputs is still an open problem. In this work, we show that by slightly adjusting user-provided bounding boxes we can improve both the quality of generations and the adherence to the control inputs. This is achieved by simply optimizing the bounding boxes to better align with the internal attention maps of the video diffusion model while carefully balancing the focus on foreground and background. In a sense, we are modifying the bounding boxes to be at places the model is familiar with. Surprisingly, we find that even with small modifications, the quality of generations can vary significantly. To achieve this, we propose a smooth mask to make the bounding box position differentiable and an attention-maximization objective that we use to alter the bounding boxes. We conduct thorough experiments, including a user study to validate the effectiveness of our method. Our code is made available on the project webpage to foster future research from the community.
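The core mechanics, a smooth differentiable box mask plus an attention-maximization objective, can be shown in 1D. Central finite differences stand in for autodiff, and all constants (sharpness, learning rate, step count) are illustrative, not the paper's.

```python
import math

def smooth_mask(center, half_width, n, sharpness=4.0):
    """Differentiable 1D box mask: the product of two opposing sigmoids."""
    def sig(z):
        return 1.0 / (1.0 + math.exp(-z))
    return [sig(sharpness * (i - (center - half_width))) *
            sig(sharpness * ((center + half_width) - i)) for i in range(n)]

def attention_score(center, attn, half_width=2.0):
    """Attention mass captured by a soft box at `center`."""
    m = smooth_mask(center, half_width, len(attn))
    return sum(w * a for w, a in zip(m, attn))

def adjust_box(center, attn, lr=0.3, steps=40, eps=1e-3):
    """Nudge the box toward regions the model attends to; central finite
    differences stand in for backprop through the smooth mask."""
    for _ in range(steps):
        grad = (attention_score(center + eps, attn) -
                attention_score(center - eps, attn)) / (2 * eps)
        center += lr * grad
    return center
```

Starting from a misplaced box, a few dozen ascent steps move it onto the attention peak, after which the box captures far more attention mass.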
[157] DynFlowDrive: Flow-Based Dynamic World Modeling for Autonomous Driving
Xiaolu Liu, Yicong Li, Song Wang, Junbo Chen, Angela Yao, Jianke Zhu
Main category: cs.CV
TL;DR: DynFlowDrive: A latent world model using flow-based dynamics to predict future driving scene states under different actions, with stability-aware trajectory selection for autonomous driving planning.
Details
Motivation: Existing autonomous driving world models use appearance generation or deterministic regression, which struggle to capture trajectory-conditioned scene evolution and lead to unreliable action planning. There's a need for better modeling of how scenes evolve under different driving actions.
Method: Proposes DynFlowDrive, a latent world model using rectified flow formulation to learn a velocity field describing scene state changes under different driving actions. This enables progressive prediction of future latent states. Also introduces stability-aware multi-mode trajectory selection that evaluates candidate trajectories based on stability of induced scene transitions.
Result: Extensive experiments on nuScenes and NavSim benchmarks show consistent improvements across diverse driving frameworks without adding inference overhead.
Conclusion: DynFlowDrive effectively models scene evolution under driving actions using flow-based dynamics, improving planning reliability in autonomous driving systems through stability-aware trajectory selection.
Abstract: Recently, world models have been incorporated into the autonomous driving systems to improve the planning reliability. Existing approaches typically predict future states through appearance generation or deterministic regression, which limits their ability to capture trajectory-conditioned scene evolution and leads to unreliable action planning. To address this, we propose DynFlowDrive, a latent world model that leverages flow-based dynamics to model the transition of world states under different driving actions. By adopting the rectified-flow formulation, the model learns a velocity field that describes how the scene state changes under different driving actions, enabling progressive prediction of future latent states. Building upon this, we further introduce a stability-aware multi-mode trajectory selection strategy that evaluates candidate trajectories according to the stability of the induced scene transitions. Extensive experiments on the nuScenes and NavSim benchmarks demonstrate consistent improvements across diverse driving frameworks without introducing additional inference overhead. Source code will be available at https://github.com/xiaolul2/DynFlowDrive.
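In flow terms, predicting the next latent state is just integrating an action-conditioned velocity field, and stability-aware selection scores candidate actions by how much the induced rollout perturbs the state. Below, `velocity` is a caller-supplied stand-in for the learned network, and the instability score is our toy proxy, not the paper's actual criterion.

```python
def rollout(z0, action, velocity, steps=10, dt=0.1):
    """Euler-integrate an action-conditioned velocity field over latents."""
    z, t = list(z0), 0.0
    for _ in range(steps):
        v = velocity(z, action, t)                   # learned field stand-in
        z = [zi + dt * vi for zi, vi in zip(z, v)]   # z <- z + dt * v
        t += dt
    return z

def select_action(candidates, z0, velocity, steps=10, dt=0.1):
    """Stability-aware selection: pick the action whose rollout perturbs
    the latent state least (a toy stability proxy)."""
    def instability(a):
        z = rollout(z0, a, velocity, steps, dt)
        return sum((zf - zi) ** 2 for zf, zi in zip(z, z0))
    return min(candidates, key=instability)
```

With a constant field dz/dt = action, integrating action 2.0 for one unit of time moves a zero latent to 2.0, and the gentler action wins the stability comparison.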
[158] ATHENA: Adaptive Test-Time Steering for Improving Count Fidelity in Diffusion Models
Mohammad Shahab Sepehri, Asal Mehradfar, Berk Tinaz, Salman Avestimehr, Mahdi Soltanolkotabi
Main category: cs.CV
TL;DR: ATHENA is a test-time adaptive steering framework that improves object count fidelity in text-to-image diffusion models without architectural changes or retraining.
Details
Motivation: Text-to-image diffusion models achieve high visual quality but systematically fail at numerical control when prompts specify explicit object counts, limiting their practical utility for precise generation tasks.
Method: ATHENA uses intermediate representations during sampling to estimate object counts and applies count-aware noise corrections early in denoising. It has three variants with increasing complexity: static prompt-based steering, dynamic count-aware control, and more advanced adaptive methods.
Result: Experiments show ATHENA consistently improves count fidelity, especially at higher target counts, while maintaining favorable accuracy-runtime trade-offs across multiple diffusion backbones.
Conclusion: ATHENA provides a model-agnostic solution to improve numerical control in text-to-image generation without requiring model modifications or retraining.
Abstract: Text-to-image diffusion models achieve high visual fidelity but surprisingly exhibit systematic failures in numerical control when prompts specify explicit object counts. To address this limitation, we introduce ATHENA, a model-agnostic, test-time adaptive steering framework that improves object count fidelity without modifying model architectures or requiring retraining. ATHENA leverages intermediate representations during sampling to estimate object counts and applies count-aware noise corrections early in the denoising process, steering the generation trajectory before structural errors become difficult to revise. We present three progressively more advanced variants of ATHENA that trade additional computation for improved numerical accuracy, ranging from static prompt-based steering to dynamically adjusted count-aware control. Experiments on established benchmarks and a new visually and semantically complex dataset show that ATHENA consistently improves count fidelity, particularly at higher target counts, while maintaining favorable accuracy-runtime trade-offs across multiple diffusion backbones.
[159] Vision-Language Attribute Disentanglement and Reinforcement for Lifelong Person Re-Identification
Kunlun Xu, Haotong Cheng, Jiangmeng Li, Xu Zou, Jiahuan Zhou
Main category: cs.CV
TL;DR: VLADR introduces a vision-language attribute disentanglement and reinforcement approach for lifelong person re-identification, leveraging VLM’s generalizable knowledge through fine-grained attribute modeling to improve knowledge transfer and reduce forgetting.
Details
Motivation: Existing LReID methods either learn from scratch or use visual classification-pretrained models, but Vision-Language Models (VLMs) offer generalizable knowledge that's underutilized. Current approaches only consider global-aware learning, missing fine-grained attribute knowledge, leading to limited acquisition and anti-forgetting capacity.
Method: VLADR uses Multi-grain Text Attribute Disentanglement to mine global and diverse local text attributes from images, then employs Inter-domain Cross-modal Attribute Reinforcement with cross-modal attribute alignment to guide visual attribute extraction and inter-domain attribute alignment for fine-grained knowledge transfer.
Result: VLADR outperforms state-of-the-art methods by 1.9%-2.2% on anti-forgetting capacity and 2.1%-2.5% on generalization capacity.
Conclusion: The approach effectively leverages VLM’s generalizable knowledge through explicit modeling of universally shared human attributes, improving inter-domain knowledge transfer and reducing forgetting in lifelong person re-identification.
Abstract: Lifelong person re-identification (LReID) aims to learn from varying domains to obtain a unified person retrieval model. Existing LReID approaches typically focus on learning from scratch or a visual classification-pretrained model, while the Vision-Language Model (VLM) has shown generalizable knowledge in a variety of tasks. Although existing methods can be directly adapted to the VLM, since they only consider global-aware learning, the fine-grained attribute knowledge is underleveraged, leading to limited acquisition and anti-forgetting capacity. To address this problem, we introduce a novel VLM-driven LReID approach named Vision-Language Attribute Disentanglement and Reinforcement (VLADR). Our key idea is to explicitly model the universally shared human attributes to improve inter-domain knowledge transfer, thereby effectively utilizing historical knowledge to reinforce new knowledge learning and alleviate forgetting. Specifically, VLADR includes a Multi-grain Text Attribute Disentanglement mechanism that mines the global and diverse local text attributes of an image. Then, an Inter-domain Cross-modal Attribute Reinforcement scheme is developed, which introduces cross-modal attribute alignment to guide visual attribute extraction and adopts inter-domain attribute alignment to achieve fine-grained knowledge transfer. Experimental results demonstrate that our VLADR outperforms the state-of-the-art methods by 1.9%-2.2% and 2.1%-2.5% on anti-forgetting and generalization capacity. Our source code is available at https://github.com/zhoujiahuan1991/CVPR2026-VLADR
[160] Unbiased Dynamic Multimodal Fusion
Shicai Wei, Kaijie Zhang, Luyi Chen, Tao He, Guiduo Duan
Main category: cs.CV
TL;DR: UDML is a dynamic multimodal learning framework that addresses limitations in existing methods by introducing noise-aware uncertainty estimation and modality bias correction to improve fusion performance across varying noise conditions.
Details
Motivation: Existing dynamic multimodal methods have two key limitations: 1) they rely on empirical metrics that fail to accurately measure modality quality under extreme noise conditions (very low or high noise), and 2) they assume equal initial contribution from each modality, ignoring inherent modality dependency bias that doubly penalizes hard-to-learn modalities.
Method: Proposes UDML with two key components: 1) A noise-aware uncertainty estimator that adds controlled noise to modality data and predicts noise intensity from features, establishing clear correspondence between feature corruption and noise levels; 2) Quantification of inherent modality reliance bias via modality dropout, which is incorporated into the weighting mechanism to eliminate dual suppression of hard-to-learn modalities.
Result: Extensive experiments across diverse multimodal benchmark tasks demonstrate the effectiveness, versatility, and generalizability of UDML. The framework outperforms existing methods and addresses the limitations of both static and dynamic fusion approaches.
Conclusion: UDML provides a robust solution for dynamic multimodal learning that can accurately assess modality quality across varying noise conditions while correcting for inherent modality bias, leading to superior performance compared to existing fusion methods.
Abstract: Traditional multimodal methods often assume static modality quality, which limits their adaptability in dynamic real-world scenarios. Thus, dynamic multimodal methods have been proposed to assess modality quality and adjust their contribution accordingly. However, they typically rely on empirical metrics, failing to measure modality quality when noise levels are extremely low or high. Moreover, existing methods usually assume that the initial contribution of each modality is the same, neglecting the intrinsic modality dependency bias. As a result, the hard-to-learn modality would be doubly penalized, and the performance of dynamic fusion could be inferior to that of static fusion. To address these challenges, we propose the Unbiased Dynamic Multimodal Learning (UDML) framework. Specifically, we introduce a noise-aware uncertainty estimator that adds controlled noise to the modality data and predicts its intensity from the modality feature. This forces the model to learn a clear correspondence between feature corruption and noise level, allowing accurate uncertainty measure across both low- and high-noise conditions. Furthermore, we quantify the inherent modality reliance bias within multimodal networks via modality dropout and incorporate it into the weighting mechanism. This eliminates the dual suppression effect on the hard-to-learn modality. Extensive experiments across diverse multimodal benchmark tasks validate the effectiveness, versatility, and generalizability of the proposed UDML. The code is available at https://github.com/shicaiwei123/UDML.
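One way to read the bias-corrected weighting: take the usual inverse-uncertainty weight and divide out each modality's inherent reliance bias, so a hard-to-learn modality is not penalized twice. The exact formula below is our illustrative guess, not the paper's scheme.

```python
def fusion_weights(uncertainties, reliance_bias):
    """Per-modality fusion weights: inverse uncertainty, divided by the
    inherent reliance bias (measured e.g. via modality dropout) so
    hard-to-learn modalities are not doubly suppressed; normalized to
    sum to one."""
    raw = [1.0 / (u * b) for u, b in zip(uncertainties, reliance_bias)]
    total = sum(raw)
    return [r / total for r in raw]
```

With equal uncertainty but a 2:1 inherent reliance bias, the under-relied-on modality ends up with twice the weight, counteracting the network's built-in preference.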
[161] 3D Gaussian Splatting with Self-Constrained Priors for High Fidelity Surface Reconstruction
Takeshi Noda, Yu-Shen Liu, Zhizhong Han
Main category: cs.CV
TL;DR: A method that improves 3D Gaussian Splatting (3DGS) surface reconstruction by introducing a self-constrained prior derived from fused depth maps to better constrain Gaussian distributions for more accurate depth rendering.
Details
Motivation: While 3DGS shows advantages over NeRF in rendering quality and speed, there's still room for improvement in recovering high-fidelity surfaces. Current 3DGS methods don't sufficiently constrain Gaussian learning for accurate depth rendering.
Method: Proposes a self-constrained prior derived from a TSDF grid obtained by fusing depth maps rendered with current 3D Gaussians. The prior creates a distance field band around estimated surfaces to impose specific constraints: removing Gaussians outside the band, moving Gaussians closer to surfaces, and adjusting opacity in a geometry-aware manner. The prior is regularly updated with more accurate depth images and progressively narrows the band to tighten constraints.
Result: Superior performance over state-of-the-art methods on widely used benchmarks, demonstrating improved surface reconstruction quality through better constrained 3D Gaussian learning.
Conclusion: The self-constrained prior effectively improves 3DGS surface reconstruction by providing geometry-aware constraints that can be progressively refined, leading to more accurate depth rendering and higher fidelity surfaces.
Abstract: The rendering of 3D surfaces has been revolutionized by radiance-field modeling, through either 3DGS or NeRF. Although 3DGS has shown advantages over NeRF in terms of rendering quality or speed, there is still room for improvement in recovering high fidelity surfaces through 3DGS. To resolve this issue, we propose a self-constrained prior to constrain the learning of 3D Gaussians, aiming for more accurate depth rendering. Our self-constrained prior is derived from a TSDF grid that is obtained by fusing the depth maps rendered with current 3D Gaussians. The prior measures a distance field around the estimated surface, offering a band centered at the surface for imposing more specific constraints on 3D Gaussians, such as removing Gaussians outside the band, moving Gaussians closer to the surface, and encouraging larger or smaller opacity in a geometry-aware manner. More importantly, our prior can be regularly updated by the most recent depth images which are usually more accurate and complete. In addition, the prior can also progressively narrow the band to tighten the imposed constraints. We justify our idea and report superior performance over state-of-the-art methods in evaluations on widely used benchmarks.
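The band constraints can be mimicked in 1D: drop Gaussians whose TSDF value falls outside the band, pull the rest toward the zero level set, and damp opacity away from the surface. Field names and the specific update rules here are assumptions for illustration, not the paper's formulas.

```python
def apply_band_constraints(gaussians, sdf, band, pull=0.5):
    """1D toy of the self-constrained prior's band constraints.

    gaussians: list of dicts with 'pos' and 'opacity'
    sdf:       maps a position to signed distance from the estimated surface
    band:      half-width of the band around the zero level set
    """
    kept = []
    for g in gaussians:
        d = sdf(g["pos"])
        if abs(d) > band:
            continue                          # drop Gaussians outside the band
        g = dict(g)
        g["pos"] -= pull * d                  # pull toward the zero level set
        g["opacity"] *= 1.0 - abs(d) / band   # damp opacity away from surface
        kept.append(g)
    return kept
```

Narrowing `band` over training tightens all three constraints at once, mirroring the progressive schedule described in the abstract.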
[162] TSegAgent: Zero-Shot Tooth Segmentation via Geometry-Aware Vision-Language Agents
Shaojie Zhuang, Lu Yin, Guangshun Wei, Yunpeng Li, Xilu Wang, Yuanfeng Zhou
Main category: cs.CV
TL;DR: TSegAgent: A zero-shot geometric reasoning approach for tooth segmentation and identification from 3D dental scans, using foundation models and dental anatomy constraints instead of task-specific training.
Details
Motivation: Existing dental segmentation methods rely on task-specific 3D neural networks requiring expensive dense annotations and have limited generalization to unseen scan sources. There's a need for more generalizable, annotation-efficient approaches.
Method: Reformulates dental analysis as zero-shot geometric reasoning by combining general-purpose foundation models with explicit geometric inductive biases from dental anatomy. Uses multi-view visual abstraction and geometry-grounded reasoning with structural constraints like dental arch organization and volumetric relationships.
Result: Experimental results show accurate and reliable tooth segmentation/identification with low computational and annotation cost, and strong generalization across diverse, previously unseen dental scans.
Conclusion: The reasoning-oriented formulation enables effective dental analysis without task-specific training, addressing annotation cost and generalization limitations of purely data-driven approaches.
Abstract: Automatic tooth segmentation and identification from intra-oral scanned 3D models are fundamental problems in digital dentistry, yet most existing approaches rely on task-specific 3D neural networks trained with densely annotated datasets, resulting in high annotation cost and limited generalization to scans from unseen sources. Thus, we propose TSegAgent, which addresses these challenges by reformulating dental analysis as a zero-shot geometric reasoning problem rather than a purely data-driven recognition task. The key idea is to combine the representational capacity of general-purpose foundation models with explicit geometric inductive biases derived from dental anatomy. Instead of learning dental-specific features, the proposed framework leverages multi-view visual abstraction and geometry-grounded reasoning to infer tooth instances and identities without task-specific training. By explicitly encoding structural constraints such as dental arch organization and volumetric relationships, the method reduces uncertainty in ambiguous cases and mitigates overfitting to particular shape distributions. Experimental results demonstrate that this reasoning-oriented formulation enables accurate and reliable tooth segmentation and identification with low computational and annotation cost, while exhibiting strong generalization across diverse and previously unseen dental scans.
[163] Demographic-Aware Self-Supervised Anomaly Detection Pretraining for Equitable Rare Cardiac Diagnosis
Chaoqin Huang, Zi Zeng, Aofan Jiang, Yuchen Xu, Qing Cao, Kang Chen, Chenfei Chi, Yanfeng Wang, Ya Zhang
Main category: cs.CV
TL;DR: AI framework for detecting rare cardiac anomalies from ECGs using self-supervised anomaly detection and demographic-aware representation learning to address long-tailed distribution and diagnostic disparities.
Details
Motivation: Rare cardiac anomalies are difficult to detect from ECGs due to long-tailed distribution with limited case counts and demographic disparities in diagnostic performance, leading to delayed recognition and uneven quality of care.
Method: Two-stage framework: 1) Self-supervised anomaly detection pretraining by reconstructing masked global/local ECG signals, modeling trends, and predicting patient attributes; 2) Fine-tuning for multi-label ECG classification using asymmetric loss for long-tail abnormalities, with anomaly score maps for localization and CPU-based optimization.
Result: Achieves 94.7% AUROC for rare anomalies, reduces common-rare performance gap by 73%, and maintains consistent diagnostic accuracy across age and sex groups on a cohort of over one million clinical ECGs.
Conclusion: The equity-aware AI framework demonstrates strong clinical utility, interpretable anomaly localization, and scalable performance, highlighting potential to mitigate diagnostic disparities in biomedical signals and digital health.
Abstract: Rare cardiac anomalies are difficult to detect from electrocardiograms (ECGs) due to their long-tailed distribution with extremely limited case counts and demographic disparities in diagnostic performance. These limitations contribute to delayed recognition and uneven quality of care, creating an urgent need for a generalizable framework that enhances sensitivity while ensuring equity across diverse populations. In this study, we developed an AI-assisted two-stage ECG framework integrating self-supervised anomaly detection with demographic-aware representation learning. The first stage performs self-supervised anomaly detection pretraining by reconstructing masked global and local ECG signals, modeling signal trends, and predicting patient attributes to learn robust ECG representations without diagnostic labels. The pretrained model is then fine-tuned for multi-label ECG classification using asymmetric loss to better handle long-tail cardiac abnormalities, and additionally produces anomaly score maps for localization, with CPU-based optimization enabling practical deployment. Evaluated on a longitudinal cohort of over one million clinical ECGs, our method achieves an AUROC of 94.7% for rare anomalies and reduces the common-rare performance gap by 73%, while maintaining consistent diagnostic accuracy across age and sex groups. In conclusion, the proposed equity-aware AI framework demonstrates strong clinical utility, interpretable anomaly localization, and scalable performance across multiple cohorts, highlighting its potential to mitigate diagnostic disparities and advance equitable anomaly detection in biomedical signals and digital health. Source code is available at https://github.com/MediaBrain-SJTU/Rare-ECG.
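The long-tail handling hinges on an asymmetric loss, sketched per-label below following the common asymmetric focal formulation; the gamma and clip values are standard defaults for that formulation, not necessarily the settings used in this paper.

```python
import math

def asymmetric_loss(p, y, gamma_pos=0.0, gamma_neg=4.0, clip=0.05):
    """Per-label asymmetric loss: p is the predicted probability, y the
    0/1 label. Negatives are focused harder (gamma_neg > gamma_pos), and
    easy negatives with p below `clip` are zeroed by probability shifting,
    so scarce positives dominate the gradient.
    """
    if y == 1:
        return -((1.0 - p) ** gamma_pos) * math.log(max(p, 1e-8))
    pm = max(p - clip, 0.0)                   # shifted negative probability
    return -(pm ** gamma_neg) * math.log(max(1.0 - pm, 1e-8))
```

An easy negative (p = 0.03, below the clip) contributes nothing, while a confident false positive is penalized far more than a borderline one, which is what keeps abundant negative labels from swamping the rare positives.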
[164] WorldAgents: Can Foundation Image Models be Agents for 3D World Models?
Ziya Erkoç, Angela Dai, Matthias Nießner
Main category: cs.CV
TL;DR: 2D foundation image models possess inherent 3D world model capabilities that can be harnessed through an agentic multi-architecture approach for coherent 3D world synthesis.
Details
Motivation: To investigate whether 2D foundation image models inherently possess 3D world model capabilities, given their remarkable ability to generate high-fidelity 2D outputs.
Method: Proposes an agentic framing with multi-agent architecture: VLM-based director formulates prompts, generator synthesizes new image views, and VLM-backed two-step verifier evaluates and curates frames from both 2D image and 3D reconstruction space.
Result: Demonstrates that 2D models do encapsulate grasp of 3D worlds, enabling synthesis of expansive, realistic, and 3D-consistent worlds with coherent and robust 3D reconstruction.
Conclusion: 2D foundation image models inherently possess 3D world model capabilities that can be effectively harnessed through agentic approaches for 3D world synthesis.
Abstract: Given the remarkable ability of 2D foundation image models to generate high-fidelity outputs, we investigate a fundamental question: do 2D foundation image models inherently possess 3D world model capabilities? To answer this, we systematically evaluate multiple state-of-the-art image generation models and Vision-Language Models (VLMs) on the task of 3D world synthesis. To harness and benchmark their potential implicit 3D capability, we propose an agentic framing to facilitate 3D world generation. Our approach employs a multi-agent architecture: a VLM-based director that formulates prompts to guide image synthesis, a generator that synthesizes new image views, and a VLM-backed two-step verifier that evaluates and selectively curates generated frames from both 2D image and 3D reconstruction space. Crucially, we demonstrate that our agentic approach provides coherent and robust 3D reconstruction, producing output scenes that can be explored by rendering novel views. Through extensive experiments across various foundation models, we demonstrate that 2D models do indeed encapsulate a grasp of 3D worlds. By exploiting this understanding, our method successfully synthesizes expansive, realistic, and 3D-consistent worlds.
[165] BALM: A Model-Agnostic Framework for Balanced Multimodal Learning under Imbalanced Missing Rates
Phuong-Anh Nguyen, Tien Anh Pham, Duc-Trong Le, Cam-Van Thi Nguyen
Main category: cs.CV
TL;DR: BALM is a model-agnostic framework for balanced multimodal learning under imbalanced missing rates, using feature calibration and gradient rebalancing modules to handle modality imbalance in realistic settings.
Details
Motivation: Multimodal learning suffers from imbalance where information-rich modalities dominate optimization while weaker or missing modalities contribute less, especially in realistic settings with imbalanced missing rates (IMR) where each modality has different absence probabilities.
Method: BALM framework with two modules: 1) Feature Calibration Module (FCM) recalibrates unimodal features using global context to establish shared representation basis across heterogeneous missing patterns; 2) Gradient Rebalancing Module (GRM) balances learning dynamics across modalities by modulating gradient magnitudes and directions from distributional and spatial perspectives.
Result: Experimental results across multiple multimodal emotion recognition benchmarks confirm BALM consistently enhances robustness and improves performance under diverse missing and imbalance settings.
Conclusion: BALM is a plug-and-play framework that can be seamlessly integrated into diverse backbones without architectural changes, effectively addressing modality imbalance in realistic multimodal learning scenarios.
Abstract: Learning from multiple modalities often suffers from imbalance, where information-rich modalities dominate optimization while weaker or partially missing modalities contribute less. This imbalance becomes severe in realistic settings with imbalanced missing rates (IMR), where each modality is absent with different probabilities, distorting representation learning and gradient dynamics. We revisit this issue from a training-process perspective and propose BALM, a model-agnostic plug-in framework to achieve balanced multimodal learning under IMR. The framework comprises two complementary modules: the Feature Calibration Module (FCM), which recalibrates unimodal features using global context to establish a shared representation basis across heterogeneous missing patterns; the Gradient Rebalancing Module (GRM), which balances learning dynamics across modalities by modulating gradient magnitudes and directions from both distributional and spatial perspectives. BALM can be seamlessly integrated into diverse backbones, including multimodal emotion recognition (MER) models, without altering their architectures. Experimental results across multiple MER benchmarks confirm that BALM consistently enhances robustness and improves performance under diverse missing and imbalance settings. Code available at: https://github.com/np4s/BALM_CVPR2026.git
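The magnitude side of GRM's gradient rebalancing can be illustrated with a toy sketch. The helper name and the equal-norm target below are assumptions for illustration; BALM's actual module also adjusts gradient directions from distributional and spatial perspectives, which is omitted here:

```python
import numpy as np

def rebalance_gradients(grads):
    """Equalize per-modality gradient magnitudes.

    Each modality's gradient is rescaled so its L2 norm matches the mean
    norm across modalities, keeping a dominant modality from steering the
    shared encoder. (Magnitude-only sketch; the paper's GRM additionally
    modulates gradient directions.)
    """
    norms = {m: np.linalg.norm(g) for m, g in grads.items()}
    target = np.mean(list(norms.values()))
    return {m: g * (target / (norms[m] + 1e-12)) for m, g in grads.items()}
```

After rescaling, every modality contributes a gradient of the same magnitude, so a weak (e.g. frequently missing) modality is no longer drowned out during backpropagation.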
[166] PerformRecast: Expression and Head Pose Disentanglement for Portrait Video Editing
Jiadong Liang, Bojun Xiong, Jie Tian, Hua Li, Xiao Long, Yong Zheng, Huan Fu
Main category: cs.CV
TL;DR: A method called PerformRecast for expression-only portrait video editing that disentangles facial expressions from head pose using 3DMM parameters, enabling fine-grained control over facial performance editing.
Details
Motivation: Existing portrait animation methods struggle to disentangle facial expression from head pose rotation, lacking the ability to edit facial expressions independently, which is crucial for animation and film industries.
Method: Improves keypoints transformation formula to align with 3D Morphable Face Model (3DMM) parameters for better disentanglement, and decouples facial/non-facial regions with separate teacher model supervision to avoid boundary misalignment.
Result: Produces high-quality results more faithful to driving video, outperforming existing methods in both controllability and efficiency, as shown in extensive experiments.
Conclusion: PerformRecast enables versatile expression-only video editing with fine-grained control, advancing performance recasting in film and animation applications.
Abstract: This paper investigates the task of expression-only portrait video performance editing based on a driving video, which plays a crucial role in the animation and film industries. Most existing research focuses on portrait animation, which aims to animate a static portrait image according to the facial motion from the driving video. As a consequence, it remains challenging for these methods to disentangle facial expression from head pose rotation, and they thus lack the ability to edit facial expression independently. In this paper, we propose PerformRecast, a versatile expression-only video editing method dedicated to recasting performances in existing film and animation. The key insight of our method comes from the characteristics of the 3D Morphable Face Model (3DMM), which models the face identity, facial expression and head pose of a 3D face mesh with separate parameters. We therefore improve the keypoints transformation formula of previous methods to make it more consistent with the 3DMM, which achieves better disentanglement and provides users with much more fine-grained control. Furthermore, to avoid misalignment around the face boundary in generated results, we decouple the facial and non-facial regions of input portrait images and pre-train a teacher model to provide separate supervision for them. Extensive experiments show that our method produces high-quality results that are more faithful to the driving video, outperforming existing methods in both controllability and efficiency. Our code, data and trained models are available at https://youku-aigc.github.io/PerformRecast.
[167] Uncertainty-aware Prototype Learning with Variational Inference for Few-shot Point Cloud Segmentation
Yifei Zhao, Fanyu Zhao, Yinsheng Li
Main category: cs.CV
TL;DR: UPL introduces uncertainty-aware prototype learning for few-shot 3D semantic segmentation, using probabilistic modeling and dual-stream prototype refinement to address limited supervision challenges.
Details
Motivation: Existing prototype-based methods for few-shot 3D segmentation create deterministic prototypes that fail to capture uncertainty from scarce supervision, leading to reduced robustness and limited generalization.
Method: UPL uses a dual-stream prototype refinement module that leverages both support and query samples, and formulates prototype learning as a variational inference problem treating class prototypes as latent variables for explicit uncertainty modeling.
Result: Extensive experiments on ScanNet and S3DIS benchmarks show UPL achieves state-of-the-art performance under different settings while providing reliable uncertainty estimation.
Conclusion: UPL successfully incorporates uncertainty modeling into prototype learning for few-shot 3D segmentation, improving robustness and generalization while maintaining interpretability.
Abstract: Few-shot 3D semantic segmentation aims to generate accurate semantic masks for query point clouds with only a few annotated support examples. Existing prototype-based methods typically construct compact and deterministic prototypes from the support set to guide query segmentation. However, such rigid representations are unable to capture the intrinsic uncertainty introduced by scarce supervision, which often results in degraded robustness and limited generalization. In this work, we propose UPL (Uncertainty-aware Prototype Learning), a probabilistic approach designed to incorporate uncertainty modeling into prototype learning for few-shot 3D segmentation. Our framework introduces two key components. First, UPL introduces a dual-stream prototype refinement module that enriches prototype representations by jointly leveraging limited information from both support and query samples. Second, we formulate prototype learning as a variational inference problem, regarding class prototypes as latent variables. This probabilistic formulation enables explicit uncertainty modeling, providing robust and interpretable mask predictions. Extensive experiments on the widely used ScanNet and S3DIS benchmarks show that our UPL achieves consistent state-of-the-art performance under different settings while providing reliable uncertainty estimation. The code is available at https://fdueblab-upl.github.io/.
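The core idea of treating a class prototype as a latent variable can be sketched with a Gaussian posterior and the reparameterization trick. The moment-based `variational_prototype` helper below is a hypothetical stand-in for an amortized inference network; it is not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def variational_prototype(support_feats, n_samples=8):
    """Treat a class prototype as a Gaussian latent variable.

    Mean and variance come from the support features of one class;
    sampling via z = mu + sigma * eps yields stochastic prototypes whose
    spread reflects uncertainty under scarce supervision. The KL term to
    a standard-normal prior is the usual variational regularizer.
    """
    mu = support_feats.mean(axis=0)
    var = support_feats.var(axis=0) + 1e-6          # avoid zero variance
    eps = rng.standard_normal((n_samples, mu.shape[0]))
    samples = mu + np.sqrt(var) * eps               # reparameterization
    # KL( N(mu, var) || N(0, I) ), summed over dimensions
    kl = 0.5 * np.sum(var + mu**2 - 1.0 - np.log(var))
    return samples, kl
```

The variance of the sampled prototypes then gives a per-class uncertainty estimate that deterministic prototypes cannot provide.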
[168] PhysNeXt: Next-Generation Dual-Branch Structured Attention Fusion Network for Remote Photoplethysmography Measurement
Junzhe Cao, Bo Zhao, Zhiyi Niu, Dan Guo, Yue Sun, Haochen Liang, Yong Xu, Zitong YU
Main category: cs.CV
TL;DR: PhysNeXt is a dual-input deep learning framework for remote photoplethysmography that jointly processes raw video frames and spatial-temporal maps to improve heart rate measurement robustness under challenging conditions.
Details
Motivation: Current rPPG methods have trade-offs: end-to-end video modeling preserves subtle heartbeat signals but introduces noise from motion/illumination, while STMap representations reduce complexity but lose high-frequency details. The authors aim to integrate both approaches' strengths for more robust pulse signal extraction.
Method: PhysNeXt uses a dual-input framework with video frames and STMap representations, incorporating spatio-temporal difference modeling, cross-modal interaction modules, and structured attention-based decoder for collaborative enhancement of pulse signal extraction.
Result: PhysNeXt achieves more stable and fine-grained rPPG signal recovery under challenging conditions, validating the effectiveness of joint modeling of video and STMap representations.
Conclusion: The proposed dual-input framework effectively integrates complementary strengths of video and STMap representations for improved remote photoplethysmography performance.
Abstract: Remote photoplethysmography (rPPG) enables contactless measurement of heart rate and other vital signs by analyzing subtle color variations in facial skin induced by cardiac pulsation. Current rPPG methods are mainly based on either end-to-end modeling from raw videos or intermediate spatial-temporal map (STMap) representations. The former preserves complete spatiotemporal information and can capture subtle heartbeat-related signals, but it also introduces substantial noise from motion artifacts and illumination variations. The latter stacks the temporal color changes of multiple facial regions of interest into compact two-dimensional representations, significantly reducing data volume and computational complexity, although some high-frequency details may be lost. To integrate their complementary strengths, we propose PhysNeXt, a dual-input deep learning framework that jointly exploits video frames and STMap representations. By incorporating a spatio-temporal difference modeling unit, a cross-modal interaction module, and a structured attention-based decoder, PhysNeXt collaboratively enhances the robustness of pulse signal extraction. Experimental results demonstrate that PhysNeXt achieves more stable and fine-grained rPPG signal recovery under challenging conditions, validating the effectiveness of joint modeling of video and STMap representations. The code will be released.
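For readers unfamiliar with STMaps, their construction can be sketched in a few lines. This follows the common recipe the abstract describes (ROI color traces stacked over time); the function name, the green-channel choice, and the ROI format are illustrative, not necessarily PhysNeXt's exact preprocessing:

```python
import numpy as np

def build_stmap(frames, rois):
    """Build a spatial-temporal map (STMap) from a face video.

    Each row corresponds to one region of interest, each column to one
    frame; the value is that ROI's mean green-channel intensity, so the
    cardiac color variation becomes a compact 2D signal. ROIs are given
    as (y0, y1, x0, x1) slices into each frame.
    """
    stmap = np.zeros((len(rois), len(frames)))
    for r, (y0, y1, x0, x1) in enumerate(rois):
        for i, f in enumerate(frames):
            stmap[r, i] = f[y0:y1, x0:x1, 1].mean()  # green channel
    return stmap
```

The resulting map is orders of magnitude smaller than the raw video, which is the data-volume reduction the abstract refers to.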
[169] Learning Hierarchical Orthogonal Prototypes for Generalized Few-Shot 3D Point Cloud Segmentation
Yifei Zhao, Fanyu Zhao, Zhongyuan Zhang, Shengtang Wu, Yixuan Lin, Yinsheng Li
Main category: cs.CV
TL;DR: HOP3D is a framework for generalized few-shot 3D point cloud segmentation that uses hierarchical orthogonal prototypes and entropy-based regularization to adapt to novel classes without forgetting base classes.
Details
Motivation: The paper addresses the stability-plasticity trade-off in few-shot 3D point cloud segmentation, where adapting to novel classes can interfere with shared representations and cause base-class forgetting.
Method: HOP3D introduces hierarchical orthogonalization that decouples base and novel learning at both gradient and representation levels, and incorporates an entropy-based regularizer that leverages predictive uncertainty to refine prototype learning.
Result: Extensive experiments on ScanNet200 and ScanNet++ show HOP3D consistently outperforms state-of-the-art baselines under both 1-shot and 5-shot settings.
Conclusion: HOP3D effectively mitigates base-novel interference in few-shot 3D point cloud segmentation through hierarchical orthogonal prototypes and uncertainty-aware regularization.
Abstract: Generalized few-shot 3D point cloud segmentation aims to adapt to novel classes from only a few annotations while maintaining strong performance on base classes, but this remains challenging due to the inherent stability-plasticity trade-off: adapting to novel classes can interfere with shared representations and cause base-class forgetting. We present HOP3D, a unified framework that learns hierarchical orthogonal prototypes with an entropy-based few-shot regularizer to enable robust novel-class adaptation without degrading base-class performance. HOP3D introduces hierarchical orthogonalization that decouples base and novel learning at both the gradient and representation levels, effectively mitigating base-novel interference. To further enhance adaptation under sparse supervision, we incorporate an entropy-based regularizer that leverages predictive uncertainty to refine prototype learning and promote balanced predictions. Extensive experiments on ScanNet200 and ScanNet++ demonstrate that HOP3D consistently outperforms state-of-the-art baselines under both 1-shot and 5-shot settings. The code is available at https://fdueblab-hop3d.github.io/.
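The representation-level side of the decoupling can be illustrated with a simple orthogonality penalty between base and novel prototypes. This is a sketch of the general idea, not HOP3D's full hierarchical scheme (which also acts at the gradient level):

```python
import numpy as np

def orthogonality_penalty(base_protos, novel_protos):
    """Penalize overlap between base and novel prototype directions.

    Rows are L2-normalized; the penalty is the squared Frobenius norm of
    the cross-Gram matrix, which is zero exactly when every novel
    prototype is orthogonal to every base prototype. Minimizing it pushes
    novel classes into directions the base classes do not occupy,
    mitigating base-novel interference.
    """
    b = base_protos / np.linalg.norm(base_protos, axis=1, keepdims=True)
    n = novel_protos / np.linalg.norm(novel_protos, axis=1, keepdims=True)
    return float(np.sum((b @ n.T) ** 2))
```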
[170] ReLi3D: Relightable Multi-view 3D Reconstruction with Disentangled Illumination
Jan-Niklas Dihlmann, Mark Boss, Simon Donne, Andreas Engelhardt, Hendrik P. A. Lensch, Varun Jampani
Main category: cs.CV
TL;DR: ReLi3D is a unified end-to-end pipeline that simultaneously reconstructs 3D geometry, materials, and illumination from sparse multi-view images in under one second, enabling near-instantaneous generation of complete, relightable 3D assets.
Details
Motivation: Traditional 3D reconstruction requires separate pipelines for geometry, materials, and illumination, each with distinct limitations and computational overhead. There's a need for a unified approach that can handle all these aspects simultaneously and efficiently.
Method: Uses transformer cross-conditioning architecture to fuse multi-view input, followed by a unified two-path prediction strategy: one path predicts object structure/appearance, the other predicts environment illumination. Combines differentiable Monte Carlo multiple importance sampling renderer with mixed domain training (synthetic PBR datasets + real-world RGB captures).
Result: Achieves simultaneous reconstruction of complete 3D geometry, spatially-varying physically-based materials, and environment illumination from sparse multi-view images in under one second, establishing generalizable results across geometry, material accuracy, and illumination quality.
Conclusion: ReLi3D demonstrates that multi-view constraints dramatically improve material and illumination disentanglement, enabling near-instantaneous generation of complete, relightable 3D assets through a unified feed-forward pipeline.
Abstract: Reconstructing 3D assets from images has long required separate pipelines for geometry reconstruction, material estimation, and illumination recovery, each with distinct limitations and computational overhead. We present ReLi3D, the first unified end-to-end pipeline that simultaneously reconstructs complete 3D geometry, spatially-varying physically-based materials, and environment illumination from sparse multi-view images in under one second. Our key insight is that multi-view constraints can dramatically improve material and illumination disentanglement, a problem that remains fundamentally ill-posed for single-image methods. Key to our approach is the fusion of the multi-view input via a transformer cross-conditioning architecture, followed by a novel unified two-path prediction strategy. The first path predicts the object’s structure and appearance, while the second path predicts the environment illumination from image background or object reflections. This, combined with a differentiable Monte Carlo multiple importance sampling renderer, creates an optimal illumination disentanglement training pipeline. In addition, with our mixed domain training protocol, which combines synthetic PBR datasets with real-world RGB captures, we establish generalizable results in geometry, material accuracy, and illumination quality. By unifying previously separate reconstruction tasks into a single feed-forward pass, we enable near-instantaneous generation of complete, relightable 3D assets. Project Page: https://reli3d.jdihlmann.com/
[171] Enhancing Alignment for Unified Multimodal Models via Semantically-Grounded Supervision
Jiyeong Kim, Yerim So, Hyesong Choi, Uiwon Hwang, Dongbo Min
Main category: cs.CV
TL;DR: SeGroS is a fine-tuning framework that addresses granularity mismatch and supervisory redundancy in Unified Multimodal Models by using visual grounding maps to create semantic visual hints and corrupted inputs for better generation fidelity and cross-modal alignment.
Details
Motivation: Current generative training paradigms for Unified Multimodal Models suffer from inherent limitations including granularity mismatch between text prompts and visual content, and supervisory redundancy in existing training approaches.
Method: Proposes Semantically-Grounded Supervision (SeGroS) framework with a novel visual grounding map to construct two complementary supervision signals: 1) semantic Visual Hints to compensate for sparse text prompts, and 2) semantically-grounded Corrupted Input that restricts reconstruction loss to core text-aligned regions for masking-based UMMs.
Result: Extensive evaluations on GenEval, DPGBench, and CompBench demonstrate that SeGroS significantly improves generation fidelity and cross-modal alignment across various UMM architectures.
Conclusion: SeGroS effectively addresses the limitations of current generative training paradigms for Unified Multimodal Models, providing a robust framework for improving multimodal understanding and generation capabilities.
Abstract: Unified Multimodal Models (UMMs) have emerged as a promising paradigm that integrates multimodal understanding and generation within a unified modeling framework. However, current generative training paradigms suffer from inherent limitations. We present Semantically-Grounded Supervision (SeGroS), a fine-tuning framework designed to resolve the granularity mismatch and supervisory redundancy in UMMs. At its core, we propose a novel visual grounding map to construct two complementary supervision signals. First, we formulate semantic Visual Hints to compensate for the sparsity of text prompts. Second, we generate a semantically-grounded Corrupted Input to explicitly enhance the supervision of masking-based UMMs by restricting the reconstruction loss to core text-aligned regions. Extensive evaluations on GenEval, DPGBench, and CompBench demonstrate that SeGroS significantly improves generation fidelity and cross-modal alignment across various UMM architectures.
[172] Growing Networks with Autonomous Pruning
Charles De Lambilly, Stefan Duffner
Main category: cs.CV
TL;DR: GNAP introduces a novel neural network training approach that dynamically adjusts network size through autonomous growth and pruning mechanisms to achieve high accuracy with minimal parameters for image classification tasks.
Details
Motivation: Traditional convolutional neural networks have fixed architectures, which may not be optimal for different datasets. The authors aim to develop networks that can autonomously adjust their size and parameter count during training to best fit the data while minimizing computational resources.
Method: GNAP uses two complementary mechanisms: growth and pruning. Networks start with few parameters and periodically expand during training when they reach saturation points. Between growth phases, parameters are trained for classification and pruned autonomously through gradient descent, allowing the network to maintain high performance with minimal parameters.
Result: Experimental results show GNAP can train extremely sparse neural networks with high accuracy: 99.44% accuracy with only 6.2k parameters on MNIST, and 92.2% accuracy with 157.8k parameters on CIFAR10.
Conclusion: GNAP demonstrates that neural networks can autonomously adjust their architecture during training to achieve optimal performance with minimal parameters, offering a promising approach for efficient model design in image classification.
Abstract: This paper introduces Growing Networks with Autonomous Pruning (GNAP) for image classification. Unlike traditional convolutional neural networks, GNAP change their size, as well as the number of parameters they use, during training, in order to best fit the data while using as few parameters as possible. This is achieved through two complementary mechanisms: growth and pruning. GNAP start with few parameters, but their size is expanded periodically during training to add more expressive power each time the network has converged to a saturation point. Between these growing phases, model parameters are trained for classification and pruned simultaneously, with complete autonomy, by gradient descent. Growing phases allow GNAP to improve their classification performance, while autonomous pruning allows them to keep as few parameters as possible. Experimental results on several image classification benchmarks show that our approach can train extremely sparse neural networks with high accuracy. For example, on MNIST, we achieved 99.44% accuracy with as few as 6.2k parameters, while on CIFAR10, we achieved 92.2% accuracy with 157.8k parameters.
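The grow/prune alternation can be sketched with two small helpers. The magnitude threshold below is an illustrative stand-in: in GNAP itself an L1-style penalty lets gradient descent drive unimportant weights toward zero autonomously, rather than a hand-set schedule deciding what to drop:

```python
import numpy as np

def prune_by_magnitude(weights, threshold=1e-3):
    """Zero out weights whose magnitude fell below a threshold.

    Under a sparsity penalty, training itself shrinks unimportant
    weights; thresholding then removes them. Returns the pruned weight
    matrix and the number of surviving parameters.
    """
    mask = np.abs(weights) >= threshold
    return weights * mask, int(mask.sum())

def grow(weights, n_new, rng):
    """Append n_new small randomly initialized units (rows) when
    training has converged to a saturation point."""
    new_rows = rng.standard_normal((n_new, weights.shape[1])) * 0.01
    return np.vstack([weights, new_rows])
```

A training loop would alternate: train-and-prune until the loss plateaus, then call `grow` and repeat, so capacity is added only when the current network is saturated.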
[173] Failure Modes for Deep Learning-Based Online Mapping: How to Measure and Address Them
Michael Hubbertz, Qi Han, Tobias Meisen
Main category: cs.CV
TL;DR: A framework for analyzing generalization failures in online mapping models by disentangling memorization of input features from overfitting to known map geometries, with novel metrics and dataset diagnostics.
Details
Motivation: Deep learning-based online mapping models for autonomous driving often fail to generalize beyond familiar environments, but current evaluation methods don't properly identify whether failures stem from memorizing input features or overfitting to known map geometries.
Method: Proposes evaluation subsets controlling for geographical proximity and geometric similarity, introduces Fréchet distance-based reconstruction statistics for shape fidelity, defines localization and map geometry overfitting scores, and develops MST-based diversity measures and sparsification strategy.
Result: Experiments on nuScenes and Argoverse 2 show the framework provides more trustworthy generalization assessment, and that map geometry-diverse, balanced training sets improve performance while reducing training size.
Conclusion: The work motivates failure-mode-aware evaluation protocols and map geometry-centric dataset design for deployable online mapping systems in autonomous driving.
Abstract: Deep learning-based online mapping has emerged as a cornerstone of autonomous driving, yet these models frequently fail to generalize beyond familiar environments. We propose a framework to identify and measure the underlying failure modes by disentangling two effects: Memorization of input features and overfitting to known map geometries. We propose measures based on evaluation subsets that control for geographical proximity and geometric similarity between training and validation scenes. We introduce Fréchet distance-based reconstruction statistics that capture per-element shape fidelity without threshold tuning, and define complementary failure-mode scores: a localization overfitting score quantifying the performance drop when geographic cues disappear, and a map geometry overfitting score measuring degradation as scenes become geometrically novel. Beyond models, we analyze dataset biases and contribute map geometry-aware diagnostics: A minimum-spanning-tree (MST) diversity measure for training sets and a symmetric coverage measure to quantify geometric similarity between splits. Leveraging these, we formulate an MST-based sparsification strategy that reduces redundancy and improves balancing and performance while shrinking training size. Experiments on nuScenes and Argoverse 2 across multiple state-of-the-art models yield more trustworthy assessment of generalization and show that map geometry-diverse and balanced training sets lead to improved performance. Our results motivate failure-mode-aware protocols and map geometry-centric dataset design for deployable online mapping.
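The discrete Fréchet distance underlying the paper's reconstruction statistics is a standard polyline metric; a minimal dynamic-programming implementation looks like this (how the paper aggregates per-element distances into its statistics is not shown here):

```python
import numpy as np

def discrete_frechet(p, q):
    """Discrete Fréchet distance between two polylines (arrays of points).

    Unlike Chamfer-style point metrics it respects point ordering along
    the curve, so it captures per-element shape fidelity without any
    matching threshold to tune.
    """
    n, m = len(p), len(q)
    ca = np.full((n, m), -1.0)          # memo table of partial couplings

    def c(i, j):
        if ca[i, j] >= 0:
            return ca[i, j]
        d = np.linalg.norm(p[i] - q[j])
        if i == 0 and j == 0:
            ca[i, j] = d
        elif i == 0:
            ca[i, j] = max(c(0, j - 1), d)
        elif j == 0:
            ca[i, j] = max(c(i - 1, 0), d)
        else:
            ca[i, j] = max(min(c(i - 1, j), c(i - 1, j - 1), c(i, j - 1)), d)
        return ca[i, j]

    return c(n - 1, m - 1)
```

For two identical polylines the distance is zero; uniformly shifting one polyline by a vector of length 1 gives a distance of exactly 1, which is the behavior a shape-fidelity statistic needs.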
[174] PCSTracker: Long-Term Scene Flow Estimation for Point Cloud Sequences
Min Lin, Gangwei Xu, Xianqi Wang, Yuyi Peng, Xin Yang
Main category: cs.CV
TL;DR: PCSTracker is an end-to-end framework for consistent scene flow estimation in point cloud sequences, addressing temporal consistency issues in long-term 3D motion analysis through iterative optimization and spatio-temporal trajectory modeling.
Details
Motivation: Existing point cloud scene flow methods are limited to pairwise settings and struggle with temporal consistency over long sequences due to evolving geometry, occlusions, and error accumulation. There's a need for a framework that can maintain coherent motion estimation across extended temporal contexts.
Method: PCSTracker introduces: 1) Iterative Geometry Motion Joint Optimization (IGMO) module that models temporal evolution of point features to handle dynamic geometric changes; 2) Spatio-temporal Point Trajectory Update (STTU) module that leverages broad temporal context to infer positions for occluded points; 3) Overlapping sliding-window inference strategy with cross-window propagation and in-window refinement for long sequences.
Result: PCSTracker achieves state-of-the-art accuracy in long-term scene flow estimation on synthetic PointOdyssey3D and real-world ADT3D datasets, maintains real-time performance at 32.5 FPS, and demonstrates superior 3D motion understanding compared to RGB-D-based approaches.
Conclusion: PCSTracker is the first end-to-end framework for consistent scene flow estimation in point cloud sequences, effectively addressing temporal consistency challenges through novel optimization modules and inference strategies, enabling robust long-term 3D motion analysis.
Abstract: Point cloud scene flow estimation is fundamental to long-term and fine-grained 3D motion analysis. However, existing methods are typically limited to pairwise settings and struggle to maintain temporal consistency over long sequences as geometry evolves, occlusions emerge, and errors accumulate. In this work, we propose PCSTracker, the first end-to-end framework specifically designed for consistent scene flow estimation in point cloud sequences. Specifically, we introduce an iterative geometry motion joint optimization module (IGMO) that explicitly models the temporal evolution of point features to alleviate correspondence inconsistencies caused by dynamic geometric changes. In addition, a spatio-temporal point trajectory update module (STTU) is proposed to leverage broad temporal context to infer plausible positions for occluded points, ensuring coherent motion estimation. To further handle long sequences, we employ an overlapping sliding-window inference strategy that alternates cross-window propagation and in-window refinement, effectively suppressing error accumulation and maintaining stable long-term motion consistency. Extensive experiments on the synthetic PointOdyssey3D and real-world ADT3D datasets show that PCSTracker achieves the best accuracy in long-term scene flow estimation and maintains real-time performance at 32.5 FPS, while demonstrating superior 3D motion understanding compared to RGB-D-based approaches.
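The overlapping sliding-window strategy is easy to make concrete as an index schedule; the window size and stride below are illustrative values, not the paper's settings:

```python
def sliding_windows(n_frames, win, stride):
    """Overlapping window index ranges over a sequence of n_frames.

    Consecutive windows overlap by win - stride frames; the overlap is
    where cross-window propagation can hand trajectory state from one
    window to the next before in-window refinement. A final window is
    appended if the stride would otherwise leave trailing frames uncovered.
    """
    starts = list(range(0, max(n_frames - win, 0) + 1, stride))
    if starts and starts[-1] + win < n_frames:
        starts.append(n_frames - win)
    return [(s, s + win) for s in starts]
```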
[175] FREAK: A Fine-grained Hallucination Evaluation Benchmark for Advanced MLLMs
Zhihan Yin, Jianxin Liang, Yueqian Wang, Yifeng Yao, Huishuai Zhang, Dongyan Zhao
Main category: cs.CV
TL;DR: FREAK is a comprehensive multimodal benchmark for fine-grained hallucination assessment in MLLMs using photorealistic images with counter-commonsense edits to evaluate detailed visual perception.
Details
Motivation: Existing hallucination evaluation benchmarks for MLLMs are limited by oversimplified tasks with saturated metrics and insufficient diversity, failing to adequately assess hallucination extent in state-of-the-art multimodal models.
Method: Proposes FREAK benchmark using high-quality photorealistic images with fine-grained counter-commonsense edits to evaluate hallucination in detailed visual perception. Includes controlled subset for indirect evaluation and systematic analysis of Chain-of-Thought prompting techniques.
Result: Extensive experiments show severe hallucination issues in SOTA models regarding detailed visual perception. Systematic evaluation of CoT techniques reveals critical insights about hallucination patterns and model reasoning processes.
Conclusion: FREAK addresses gaps in existing hallucination evaluation by providing comprehensive, fine-grained assessment of MLLMs’ detailed visual perception capabilities, revealing significant hallucination issues in current models.
Abstract: Multimodal Large Language Models (MLLMs) suffer from hallucinations. Existing hallucination evaluation benchmarks are often limited by over-simplified tasks leading to saturated metrics, or insufficient diversity that fails to adequately assess the hallucination extent in state-of-the-art multimodal models. To address this gap, we propose FREAK, a comprehensive multimodal benchmark designed for fine-grained hallucination assessment in MLLMs. Through high-quality photorealistic images featuring fine-grained counter-commonsense edits, FREAK innovatively evaluates hallucination phenomena in detailed visual perception of MLLMs. Extensive experiments on FREAK show severe hallucination issues in SOTA models regarding detailed visual perception. To enable deeper investigation, we curate a controlled subset to indirectly evaluate the model’s ability to perceive target detailed information. Through systematic evaluation of prevailing Chain-of-Thought (CoT) prompting techniques within this task, we reveal critical insights regarding hallucination patterns and model reasoning processes.
[176] Adapting a Pre-trained Single-Cell Foundation Model to Spatial Gene Expression Generation from Histology Images
Donghai Fang, Yongheng Li, Zhen Wang, Yuansong Zeng, Wenwen Min
Main category: cs.CV
TL;DR: HINGE adapts single-cell foundation models for histology-conditioned spatial gene expression generation by retrofitting pre-trained models with visual context injection while preserving learned gene relationships.
Details
Motivation: Spatial transcriptomics is expensive and low-throughput, motivating prediction of gene expression from histology. Existing generative approaches lack explicit modeling of gene-gene dependencies, while single-cell foundation models capture these relationships but lack visual pathways and alignment with histology-conditioned objectives.
Method: Proposes HINGE with SoftAdaLN - lightweight, identity-initialized modulation that injects layer-wise visual context into pre-trained sc-FM backbone. Uses expression-space masked diffusion objective and warm-start curriculum for objective alignment and training stability.
Result: Outperforms state-of-the-art baselines on mean Pearson correlation across three ST datasets, yields more accurate spatial marker expression patterns, and achieves higher pairwise co-expression consistency.
Conclusion: HINGE establishes a practical route to adapt pre-trained single-cell foundation models for histology-conditioned spatial expression generation while preserving learned gene relationships.
Abstract: Spatial transcriptomics (ST) enables spot-level in situ expression profiling, but its high cost and limited throughput motivate predicting expression directly from HE-stained histology. Recent advances explore using score- or flow-based generative models to estimate the conditional distribution of gene expression from histology, offering a flexible alternative to deterministic regression approaches. However, most existing generative approaches omit explicit modeling of gene-gene dependencies, undermining biological coherence. Single-cell foundation models (sc-FMs), pre-trained across diverse cell populations, capture these critical gene relationships that histology alone cannot reveal. Yet, applying expression-only sc-FMs to histology-conditioned expression modeling is nontrivial due to the absence of a visual pathway, a mismatch between their pre-training and conditional ST objectives, and the scarcity of mixed-cell ST supervision. To address these challenges, we propose HINGE (HIstology-coNditioned GEneration), which retrofits a pre-trained sc-FM into a conditional expression generator while mostly preserving its learned gene relationships. We achieve this by introducing SoftAdaLN, a lightweight, identity-initialized modulation that injects layer-wise visual context into the backbone, coupled with an expression-space masked diffusion objective and a warm-start curriculum to ensure objective alignment and training stability. Evaluated on three ST datasets, ours outperforms state-of-the-art baselines on mean Pearson correlation and yields more accurate spatial marker expression patterns and higher pairwise co-expression consistency, establishing a practical route to adapt pre-trained sc-FMs for histology-conditioned spatial expression generation.
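The identity-initialization idea behind SoftAdaLN can be pictured in a few lines: if the conditioning projections start at zero, the modulation is an exact identity at initialization, so the frozen sc-FM backbone's behavior is preserved and visual context is injected only as training proceeds. The class below is a hypothetical numpy illustration of that mechanism, not the paper's implementation (the real module operates inside transformer layers with its own parameterization):

```python
import numpy as np

class SoftAdaLNSketch:
    """Identity-initialized conditional modulation (illustrative sketch).

    Both projections start at zero, so gamma = beta = 0 and the layer is
    the identity at initialization; training would gradually learn to
    inject the visual context c into the hidden states x.
    """
    def __init__(self, dim, cond_dim):
        self.W_gamma = np.zeros((cond_dim, dim))  # scale-offset projection
        self.W_beta = np.zeros((cond_dim, dim))   # shift projection

    def __call__(self, x, c):
        gamma = c @ self.W_gamma
        beta = c @ self.W_beta
        return x * (1.0 + gamma) + beta

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))    # token hidden states
c = rng.normal(size=(4, 16))   # per-token visual context
out = SoftAdaLNSketch(dim=8, cond_dim=16)(x, c)
print(np.allclose(out, x))     # identity at initialization -> True
```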
[177] FlashCap: Millisecond-Accurate Human Motion Capture via Flashing LEDs and Event-Based Vision
Zekai Wu, Shuqi Fan, Mengyin Liu, Yuhua Luo, Xincheng Lin, Ming Yan, Junhao Wu, Xiuhong Lin, Yuexin Ma, Chenglu Wen, Lan Xu, Siqi Shen, Cheng Wang
Main category: cs.CV
TL;DR: FlashCap introduces a flashing LED-based motion capture system for precise motion timing, creating FlashMotion dataset with millisecond-resolution multimodal data (events, RGB, LiDAR, IMU), and proposes ResPose for high-temporal-resolution pose estimation.
Details
Motivation: Current high-speed RGB cameras for precise motion timing are expensive, light-sensitive, and computationally complex, limiting daily use. The human pose estimation community lacks high-temporal-resolution labeled datasets for precise motion timing analysis.
Method: Developed FlashCap system using flashing LEDs for motion capture. Collected FlashMotion dataset with millisecond-resolution multimodal data. Proposed ResPose baseline that learns residual poses from events and RGB images for precise motion timing and high-temporal-resolution pose estimation.
Result: ResPose reduces pose estimation errors by ~40% and achieves millisecond-level timing accuracy. FlashMotion dataset enables new research opportunities in precise motion timing and high-temporal-resolution human pose estimation.
Conclusion: FlashCap provides an affordable alternative to high-speed cameras for precise motion timing. The FlashMotion dataset and ResPose method advance high-temporal-resolution human pose estimation and enable new research in motion analysis.
Abstract: Precise motion timing (PMT) is crucial for swift motion analysis. A millisecond difference may determine victory or defeat in sports competitions. Despite substantial progress in human pose estimation (HPE), PMT remains largely overlooked by the HPE community due to the limited availability of high-temporal-resolution labeled datasets. Today, PMT is achieved using high-speed RGB cameras in specialized scenarios such as the Olympic Games; however, their high costs, light sensitivity, bandwidth, and computational complexity limit their feasibility for daily use. We developed FlashCap, the first flashing LED-based MoCap system for PMT. With FlashCap, we collect a millisecond-resolution human motion dataset, FlashMotion, comprising the event, RGB, LiDAR, and IMU modalities, and demonstrate its high quality through rigorous validation. To evaluate the merits of FlashMotion, we perform two tasks: precise motion timing and high-temporal-resolution HPE. For these tasks, we propose ResPose, a simple yet effective baseline that learns residual poses based on events and RGBs. Experimental results show that ResPose reduces pose estimation errors by ~40% and achieves millisecond-level timing accuracy, enabling new research opportunities. The dataset and code will be shared with the community.
[178] Learning Like Humans: Analogical Concept Learning for Generalized Category Discovery
Jizhou Han, Chenhao Ding, Yuhang He, Qiang Wang, Shaokun Wang, SongLin Dong, Yihong Gong
Main category: cs.CV
TL;DR: ATCG generates textual concepts for unlabeled data by analogizing from labeled knowledge, improving category discovery through visual-textual reasoning.
Details
Motivation: Current Generalized Category Discovery (GCD) methods rely on visual-only pipelines and loose coupling between supervised learning and discovery, leading to brittle boundaries on fine-grained, look-alike categories.
Method: Introduces Analogical Textual Concept Generator (ATCG) - a plug-and-play module that analogizes from labeled knowledge to new observations, forming textual concepts for unlabeled samples. Fuses these analogical textual concepts with visual features to turn discovery into visual-textual reasoning.
Result: Across six benchmarks, ATCG consistently improves overall, known-class, and novel-class performance, with largest gains on fine-grained data. Works with both parametric and clustering style GCD pipelines without design changes.
Conclusion: ATCG effectively transfers prior knowledge to novel data through analogical textual concept generation and visual-textual reasoning, sharpening category separation in GCD tasks.
Abstract: Generalized Category Discovery (GCD) seeks to uncover novel categories in unlabeled data while preserving recognition of known categories, yet prevailing visual-only pipelines and the loose coupling between supervised learning and discovery often yield brittle boundaries on fine-grained, look-alike categories. We introduce the Analogical Textual Concept Generator (ATCG), a plug-and-play module that analogizes from labeled knowledge to new observations, forming textual concepts for unlabeled samples. Fusing these analogical textual concepts with visual features turns discovery into a visual-textual reasoning process, transferring prior knowledge to novel data and sharpening category separation. ATCG attaches to both parametric and clustering style GCD pipelines and requires no changes to their overall design. Across six benchmarks, ATCG consistently improves overall, known-class, and novel-class performance, with the largest gains on fine-grained data. Our code is available at: https://github.com/zhou-9527/AnaLogical-GCD.
[179] Template-based Object Detection Using a Foundation Model
Valentin Braeutigam, Matthias Stock, Bernhard Egger
Main category: cs.CV
TL;DR: A training-free object detection method using segmentation foundation models and feature-based classification for GUI testing, achieving near-YOLO performance without training data.
Details
Motivation: To address use cases with limited data variation but requiring no training data generation or training, particularly for automated GUI testing in software development and continuous integration testing.
Method: Uses segments from segmentation foundation models combined with simple feature-based classification, eliminating the need for retraining or dataset creation when objects change.
Result: Achieves results almost on par with learning-based object detection methods like YOLO on icon detection in navigation maps for automotive UI testing, without requiring training.
Conclusion: Proposes a practical training-free approach suitable for GUI testing scenarios where object designs frequently change, saving time and cost compared to learning-based methods.
Abstract: Most currently used object detection methods are learning-based, and can detect objects under varying appearances. Those models require training and a training dataset. We focus on use cases with less data variation, but the requirement of being free of generation of training data and training. Such a setup is for example desired in automatic testing of graphical interfaces during software development, especially for continuous integration testing. In our approach, we use segments from segmentation foundation models and combine them with a simple feature-based classification method. This saves time and cost when changing the object to be searched or its design, as nothing has to be retrained and no dataset has to be created. We evaluate our method on the task of detecting and classifying icons in navigation maps, which is used to simplify and automate the testing of user interfaces in the automotive industry. Our method achieves results almost on par with learning-based object detection methods like YOLO, without the need for training.
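As a toy illustration of the segments-plus-features recipe (the paper's actual features and matcher are not specified in the abstract), each segment can be classified by cosine similarity to a small bank of template feature vectors, abstaining when no template is similar enough. The label names, threshold, and two-dimensional features below are all invented for the example:

```python
import numpy as np

def classify_segments(segment_feats, template_feats, labels, min_sim=0.8):
    """Training-free matching: assign each segment the label of its most
    similar template (cosine similarity), or None if no template clears
    the threshold. A hypothetical stand-in for a feature-based classifier.
    """
    def unit(v):
        return v / (np.linalg.norm(v, axis=-1, keepdims=True) + 1e-9)
    S = unit(segment_feats) @ unit(template_feats).T  # similarity matrix
    best = S.argmax(axis=1)
    return [labels[j] if S[i, j] >= min_sim else None
            for i, j in enumerate(best)]

templates = np.array([[1.0, 0.0], [0.0, 1.0]])   # one feature per icon type
labels = ["gas-station", "parking"]
segments = np.array([[0.9, 0.1], [0.1, 0.95], [-1.0, -1.0]])
print(classify_segments(segments, templates, labels))
```

Swapping an icon design then only means replacing a row of `templates`, with no retraining.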
[180] RAM: Recover Any 3D Human Motion in-the-Wild
Sen Jia, Ning Zhu, Jinqin Zhong, Jiale Zhou, Huaping Zhang, Jenq-Neng Hwang, Lei Li
Main category: cs.CV
TL;DR: RAM is a robust 3D human motion capture system that combines motion-aware semantic tracking with adaptive Kalman filtering, temporal human mesh recovery, and future pose prediction for stable and accurate multi-person motion reconstruction in the wild.
Details
Motivation: The paper addresses challenges in markerless 3D human motion capture in dynamic, real-world scenarios, particularly dealing with severe occlusions, identity association issues, and maintaining temporal consistency in multi-person settings.
Method: RAM uses a motion-aware semantic tracker with adaptive Kalman filtering for robust identity association, a memory-augmented Temporal HMR module for spatio-temporal priors, a lightweight Predictor for future pose forecasting, and a gated combiner to fuse reconstructed and predicted features.
Result: RAM substantially outperforms previous state-of-the-art methods on in-the-wild multi-person benchmarks (PoseTrack and 3DPW) in both zero-shot tracking stability and 3D accuracy.
Conclusion: RAM offers a generalizable paradigm for robust markerless 3D human motion capture in dynamic, real-world environments by effectively addressing occlusion, identity association, and temporal consistency challenges.
Abstract: RAM incorporates a motion-aware semantic tracker with adaptive Kalman filtering to achieve robust identity association under severe occlusions and dynamic interactions. A memory-augmented Temporal HMR module further enhances human motion reconstruction by injecting spatio-temporal priors for consistent and smooth motion estimation. Moreover, a lightweight Predictor module forecasts future poses to maintain reconstruction continuity, while a gated combiner adaptively fuses reconstructed and predicted features to ensure coherence and robustness. Experiments on in-the-wild multi-person benchmarks such as PoseTrack and 3DPW demonstrate that RAM substantially outperforms previous state-of-the-art methods in both zero-shot tracking stability and 3D accuracy, offering a generalizable paradigm for markerless 3D human motion capture in the wild.
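Kalman filtering is the classical machinery under trackers like the one RAM adapts. A minimal, non-adaptive sketch of one predict/update cycle for a single tracked coordinate under a constant-velocity model (the noise parameters here are arbitrary illustration values, not RAM's):

```python
import numpy as np

def kalman_step(x, P, z, dt=1.0, q=1e-2, r=1e-1):
    """One predict/update cycle of a constant-velocity Kalman filter on a
    1-D coordinate. x = [position, velocity], P = state covariance,
    z = noisy position measurement.
    """
    F = np.array([[1.0, dt], [0.0, 1.0]])  # constant-velocity dynamics
    H = np.array([[1.0, 0.0]])             # we observe position only
    Q = q * np.eye(2)                      # process noise
    R = np.array([[r]])                    # measurement noise
    # Predict
    x = F @ x
    P = F @ P @ F.T + Q
    # Update
    y = z - H @ x                          # innovation
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)         # Kalman gain
    x = x + (K @ y).ravel()
    P = (np.eye(2) - K @ H) @ P
    return x, P

x, P = np.array([0.0, 1.0]), np.eye(2)
for z in [1.1, 2.0, 2.9, 4.2]:             # noisy positions of a keypoint
    x, P = kalman_step(x, P, np.array([z]))
print(round(float(x[0]), 1))               # final position estimate
```

An "adaptive" variant, as in RAM's tracker, would adjust `Q`/`R` online (e.g., from occlusion or detection confidence) rather than keeping them fixed.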
[181] Evaluating Image Editing with LLMs: A Comprehensive Benchmark and Intermediate-Layer Probing Approach
Shiqi Gao, Zitong Xu, Kang Fu, Huiyu Duan, Xiongkuo Min, Jia wang, Guangtao Zhai
Main category: cs.CV
TL;DR: TIEdit benchmark for text-guided image editing evaluation with 5,120 edited images and 15,360 human ratings, plus EditProbe LLM-based evaluator using intermediate-layer probing for better alignment with human perception.
Details
Motivation: Existing evaluation benchmarks for text-guided image editing (TIE) are limited in scale and show weak correlation with human judgments. There's a need for systematic evaluation that considers perceptual quality, alignment with textual instructions, and preservation of original content.
Method: Created TIEdit benchmark with 512 source images and editing prompts across 8 tasks, generating 5,120 edited images using 10 state-of-the-art TIE models. Collected 307,200 raw subjective ratings from 20 experts, aggregated into 15,360 MOS scores. Developed EditProbe, an LLM-based evaluator that probes intermediate layers of multimodal large language models to capture semantic relationships between source images, editing instructions, and edited results.
Result: Widely used automatic evaluation metrics show limited correlation with human judgments, while EditProbe achieves substantially stronger alignment with human perception. The benchmark provides comprehensive evaluation data across three dimensions: perceptual quality, editing alignment, and content preservation.
Conclusion: TIEdit and EditProbe provide a foundation for more reliable and perceptually aligned evaluation of text-guided image editing methods, addressing limitations of existing evaluation approaches.
Abstract: Evaluating text-guided image editing (TIE) methods remains a challenging problem, as reliable assessment should simultaneously consider perceptual quality, alignment with textual instructions, and preservation of original image content. Despite rapid progress in TIE models, existing evaluation benchmarks remain limited in scale and often show weak correlation with human perceptual judgments. In this work, we introduce TIEdit, a benchmark for systematic evaluation of text-guided image editing methods. TIEdit consists of 512 source images paired with editing prompts across eight representative editing tasks, producing 5,120 edited images generated by ten state-of-the-art TIE models. To obtain reliable subjective ratings, 20 experts are recruited to produce 307,200 raw subjective ratings, which accumulates into 15,360 mean opinion scores (MOSs) across three evaluation dimensions: perceptual quality, editing alignment, and content preservation. Beyond the benchmark itself, we further propose EditProbe, an LLM-based evaluator that estimates editing quality via intermediate-layer probing of hidden representations. Instead of relying solely on final model outputs, EditProbe extracts informative representations from intermediate layers of multimodal large language models to better capture semantic and perceptual relationships between source images, editing instructions, and edited results. Experimental results demonstrate that widely used automatic evaluation metrics show limited correlation with human judgments on editing tasks, while EditProbe achieves substantially stronger alignment with human perception. Together, TIEdit and EditProbe provide a foundation for more reliable and perceptually aligned evaluation of text-guided image editing methods.
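The intermediate-layer probing idea can be illustrated with the simplest possible probe: fit a ridge regression from hidden states to quality scores and check how much score signal the representation carries. The synthetic features below stand in for MLLM hidden states; nothing here reflects EditProbe's actual probe architecture or which layers it selects:

```python
import numpy as np

def ridge_probe(H, y, lam=1e-2):
    """Closed-form ridge regression from hidden states H (n x d) to
    scores y (n,): the generic 'linear probe' recipe, applied here to
    synthetic data rather than real MLLM activations.
    """
    d = H.shape[1]
    return np.linalg.solve(H.T @ H + lam * np.eye(d), H.T @ y)

rng = np.random.default_rng(0)
H_mid = rng.normal(size=(64, 8))  # stand-in for intermediate-layer states
# Scores that are (noisily) linear in the hidden states.
y = H_mid @ rng.normal(size=8) + 0.05 * rng.normal(size=64)
W = ridge_probe(H_mid, y)
pred = H_mid @ W
corr = np.corrcoef(pred, y)[0, 1]
print(corr > 0.95)  # the probe recovers the score signal
```

Comparing such probe correlations across layers is one way to see why intermediate representations can align better with human ratings than final outputs.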
[182] HiPath: Hierarchical Vision-Language Alignment for Structured Pathology Report Prediction
Ruicheng Yuan, Zhenxuan Zhang, Anbang Wang, Liwei Hu, Xiangqian Hua, Yaya Peng, Jiawei Luo, Guang Yang
Main category: cs.CV
TL;DR: HiPath is a lightweight pathology vision-language model framework that generates structured diagnostic reports from pathology images, outperforming existing methods with 68.9% strict accuracy while maintaining 97.3% safety rate.
Details
Motivation: Existing pathology VLMs reduce complex structured diagnostic reports to flat labels or free-form text, failing to capture the hierarchical, multi-granular nature of real pathology reports that encode diagnostic conclusions, histological grades, and ancillary test results across anatomical sites.
Method: Built on frozen UNI2 and Qwen3 backbones with only 15M trainable parameters. Uses three modules: Hierarchical Patch Aggregator (HiPA) for multi-image visual encoding, Hierarchical Contrastive Learning (HiCL) for cross-modal alignment via optimal transport, and Slot-based Masked Diagnosis Prediction (Slot-MDP) for structured diagnosis generation.
Result: Trained on 749K real-world Chinese pathology cases, achieves 68.9% strict and 74.7% clinically acceptable accuracy with 97.3% safety rate, outperforming all baselines. Cross-hospital evaluation shows only 3.4 percentage point drop in strict accuracy while maintaining 97.1% safety.
Conclusion: HiPath demonstrates that treating structured report prediction as the primary training objective enables effective pathology VLMs that can generate clinically relevant structured diagnostic reports while maintaining high safety standards and generalization across hospitals.
Abstract: Pathology reports are structured, multi-granular documents encoding diagnostic conclusions, histological grades, and ancillary test results across one or more anatomical sites; yet existing pathology vision-language models (VLMs) reduce this output to a flat label or free-form text. We present HiPath, a lightweight VLM framework built on frozen UNI2 and Qwen3 backbones that treats structured report prediction as its primary training objective. Three trainable modules totalling 15M parameters address complementary aspects of the problem: a Hierarchical Patch Aggregator (HiPA) for multi-image visual encoding, Hierarchical Contrastive Learning (HiCL) for cross-modal alignment via optimal transport, and Slot-based Masked Diagnosis Prediction (Slot-MDP) for structured diagnosis generation. Trained on 749K real-world Chinese pathology cases from three hospitals, HiPath achieves 68.9% strict and 74.7% clinically acceptable accuracy with a 97.3% safety rate, outperforming all baselines under the same frozen backbone. Cross-hospital evaluation confirms generalisation with only a 3.4pp drop in strict accuracy while maintaining 97.1% safety.
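Cross-modal alignment "via optimal transport" is commonly realized with entropic OT solved by Sinkhorn iterations; the sketch below runs that machinery on a toy cost matrix between two visual and two textual features. The uniform marginals and hyperparameters are assumptions for the illustration, not HiCL's actual formulation:

```python
import numpy as np

def sinkhorn(C, eps=0.1, iters=200):
    """Entropic optimal-transport plan between uniform marginals via
    Sinkhorn iterations. C is a cost matrix (e.g., distances between
    visual and textual features); the returned plan P soft-matches them.
    """
    n, m = C.shape
    K = np.exp(-C / eps)                 # Gibbs kernel
    a, b = np.full(n, 1 / n), np.full(m, 1 / m)
    v = np.ones(m) / m
    for _ in range(iters):
        u = a / (K @ v)                  # scale rows toward marginal a
        v = b / (K.T @ u)                # scale columns toward marginal b
    return u[:, None] * K * v[None, :]

C = np.array([[0.0, 1.0],                # matched pairs are cheap,
              [1.0, 0.0]])               # mismatched pairs expensive
P = sinkhorn(C)
print(P.round(2))                        # mass concentrates on the diagonal
```

An alignment loss would then, for example, weight cross-modal similarities by this plan instead of assuming a hard one-to-one pairing.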
[183] ReManNet: A Riemannian Manifold Network for Monocular 3D Lane Detection
Chengzhi Hong, Bijun Li
Main category: cs.CV
TL;DR: ReManNet: A monocular 3D lane detection method using Riemannian manifold geometry and road-manifold assumption to address depth ambiguity and weak geometric constraints.
Details
Motivation: Monocular 3D lane detection suffers from depth ambiguity and weak geometric constraints. Existing methods rely on simplified physical assumptions and weakly encode road geometry, leading to ill-posed 2D-to-3D lifting that often produces concavities, bulges, and twists.
Method: Proposes the Road-Manifold Assumption: road as smooth 2D manifold in R³, lanes as embedded 1D submanifolds, and sampled lane points as dense observations. ReManNet first produces initial lane predictions, then encodes geometry as Riemannian Gaussian descriptors on the SPD manifold, fusing them with visual features via a lightweight gate. Also introduces 3D Tunnel Lane IoU loss for shape-level alignment.
Result: Achieves SOTA or competitive results on standard benchmarks. On OpenLane, improves F1 by +8.2% over baseline and +1.8% over previous best, with scenario-level gains up to +6.6%.
Conclusion: The road-manifold assumption and Riemannian geometric encoding effectively address geometric constraints in monocular 3D lane detection, leading to improved performance and robustness.
Abstract: Monocular 3D lane detection remains challenging due to depth ambiguity and weak geometric constraints. Mainstream methods rely on depth guidance, BEV projection, and anchor- or curve-based heads with simplified physical assumptions, remapping high-dimensional image features while only weakly encoding road geometry. Lacking an invariant geometric-topological coupling between lanes and the underlying road surface, 2D-to-3D lifting is ill-posed and brittle, often degenerating into concavities, bulges, and twists. To address this, we propose the Road-Manifold Assumption: the road is a smooth 2D manifold in $\mathbb{R}^3$, lanes are embedded 1D submanifolds, and sampled lane points are dense observations, thereby coupling metric and topology across surfaces, curves, and point sets. Building on this, we propose ReManNet, which first produces initial lane predictions with an image backbone and detection heads, then encodes geometry as Riemannian Gaussian descriptors on the symmetric positive-definite (SPD) manifold, and fuses these descriptors with visual features through a lightweight gate to maintain coherent 3D reasoning. We also propose the 3D Tunnel Lane IoU (3D-TLIoU) loss, a joint point-curve objective that computes slice-wise overlap of tubular neighborhoods along each lane to improve shape-level alignment. Extensive experiments on standard benchmarks demonstrate that ReManNet achieves state-of-the-art (SOTA) or competitive results. On OpenLane, it improves F1 by +8.2% over the baseline and by +1.8% over the previous best, with scenario-level gains of up to +6.6%. The code will be publicly available at https://github.com/changehome717/ReManNet.
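Descriptors "on the SPD manifold" are typically covariance-style matrices compared with a Riemannian metric rather than plain Euclidean distance. The sketch below uses the log-Euclidean distance, one standard choice (the paper's exact descriptor and metric may differ), on covariances of toy 3-D lane points:

```python
import numpy as np

def logm_spd(A):
    """Matrix logarithm of a symmetric positive-definite matrix,
    computed through its eigendecomposition.
    """
    w, V = np.linalg.eigh(A)
    return V @ np.diag(np.log(w)) @ V.T

def log_euclidean_dist(A, B):
    """Log-Euclidean distance on the SPD manifold: Frobenius distance
    between matrix logs. A standard metric for covariance descriptors.
    """
    return float(np.linalg.norm(logm_spd(A) - logm_spd(B), "fro"))

# Covariance descriptors of sampled lane points (toy data).
pts = np.random.default_rng(0).normal(size=(50, 3))
A = np.cov(pts.T) + 1e-6 * np.eye(3)     # jitter keeps A strictly SPD
B = 2.0 * A                              # same shape, doubled scale
print(round(log_euclidean_dist(A, A), 6))  # distance to itself -> 0.0
print(log_euclidean_dist(A, B) > 0)
```

Unlike Euclidean subtraction of matrices, this metric respects the curved geometry of SPD matrices, which is the point of encoding geometry on the manifold.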
[184] One Model, Two Minds: Task-Conditioned Reasoning for Unified Image Quality and Aesthetic Assessment
Wen Yin, Cencen Liu, Dingrui Liu, Bing Su, Yuan-Fang Li, Tao He
Main category: cs.CV
TL;DR: TATAR is a unified multimodal LLM framework that addresses fundamental mismatches between Image Quality Assessment (IQA) and Image Aesthetic Assessment (IAA) through task-aware reasoning and asymmetric rewards.
Details
Motivation: Existing unified approaches treat IQA and IAA as similar tasks using the same reasoning strategies, but they have fundamentally different requirements: IQA needs low-level perceptual analysis while IAA requires high-level semantic judgment.
Method: TATAR uses a shared visual-language backbone with task-conditioned post-training, featuring: 1) Fast-slow reasoning (concise for IQA, deliberative for IAA), 2) Two-stage SFT+GRPO learning, and 3) Asymmetric rewards (Gaussian shaping for IQA, Thurstone ranking for IAA).
Result: Outperforms prior unified baselines on eight benchmarks in both in-domain and cross-domain settings, remains competitive with task-specific models, and provides more stable training for aesthetic assessment.
Conclusion: Task-conditioned post-training is an effective paradigm for unified perceptual scoring, addressing fundamental task differences through specialized reasoning strategies and optimization objectives.
Abstract: Unifying Image Quality Assessment (IQA) and Image Aesthetic Assessment (IAA) in a single multimodal large language model is appealing, yet existing methods adopt a task-agnostic recipe that applies the same reasoning strategy and reward to both tasks. We show this is fundamentally misaligned: IQA relies on low-level, objective perceptual cues and benefits from concise distortion-focused reasoning, whereas IAA requires deliberative semantic judgment and is poorly served by point-wise score regression. We identify these as a reasoning mismatch and an optimization mismatch, and provide empirical evidence for both through controlled probes. Motivated by these findings, we propose TATAR (Task-Aware Thinking with Asymmetric Rewards), a unified framework that shares the visual-language backbone while conditioning post-training on each task’s nature. TATAR combines three components: fast–slow task-specific reasoning construction that pairs IQA with concise perceptual rationales and IAA with deliberative aesthetic narratives; two-stage SFT+GRPO learning that establishes task-aware behavioral priors before reward-driven refinement; and asymmetric rewards that apply Gaussian score shaping for IQA and Thurstone-style completion ranking for IAA. Extensive experiments across eight benchmarks demonstrate that TATAR consistently outperforms prior unified baselines on both tasks under in-domain and cross-domain settings, remains competitive with task-specific specialized models, and yields more stable training dynamics for aesthetic assessment. Our results establish task-conditioned post-training as a principled paradigm for unified perceptual scoring. Our code is publicly available at https://github.com/yinwen2019/TATAR.
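The asymmetric-reward idea is easy to sketch: a Gaussian-shaped reward for the point-wise IQA score, and an ordering-based reward for IAA. The bandwidth and the pairwise simplification of "Thurstone-style completion ranking" below are illustrative assumptions, not the paper's exact reward functions:

```python
import math

def gaussian_score_reward(pred, target, sigma=0.5):
    """Gaussian-shaped reward for score regression (the IQA side):
    full reward at an exact match, decaying smoothly with the error.
    sigma is an illustrative bandwidth, not the paper's value.
    """
    return math.exp(-((pred - target) ** 2) / (2 * sigma ** 2))

def pairwise_ranking_reward(pred_a, pred_b, target_a, target_b):
    """Ranking-style reward for the IAA side: 1 if the predicted
    ordering of two images matches the human ordering, else 0 (a
    simplified stand-in for Thurstone-style ranking).
    """
    return float((pred_a - pred_b) * (target_a - target_b) > 0)

print(round(gaussian_score_reward(3.5, 3.5), 2))   # exact match -> 1.0
print(round(gaussian_score_reward(3.0, 3.5), 2))   # exp(-0.5) -> 0.61
print(pairwise_ranking_reward(0.8, 0.3, 4.5, 2.0)) # correct order -> 1.0
```

The asymmetry is the point: IQA is rewarded for hitting a score, IAA for getting relative preferences right, matching each task's nature.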
[185] Decoupled Sensitivity-Consistency Learning for Weakly Supervised Video Anomaly Detection
Hantao Zheng, Ning Han, Yawen Zeng, Hao Chen
Main category: cs.CV
TL;DR: DeSC is a decoupled sensitivity-consistency framework for weakly supervised video anomaly detection that separates temporal sensitivity and semantic consistency streams to overcome the sensitivity-stability trade-off in unified frameworks.
Details
Motivation: Current weakly supervised video anomaly detection methods use unified frameworks that suffer from a fundamental sensitivity-stability trade-off, where conflicting objectives for detecting transient vs. sustained anomalies lead to either fragmented predictions or over-smoothed responses.
Method: Proposes DeSC with two specialized streams: 1) temporal sensitivity stream using aggressive optimization to capture high-frequency abrupt changes, and 2) semantic consistency stream applying robust constraints for long-term coherence and noise reduction. Uses collaborative inference mechanism to fuse their complementary strengths and reduce individual biases.
Result: Achieves state-of-the-art performance with 89.37% AUC on UCF-Crime (+1.29% improvement) and 87.18% AP on XD-Violence (+2.22% improvement).
Conclusion: DeSC effectively addresses the sensitivity-stability trade-off in weakly supervised video anomaly detection through decoupled optimization strategies, establishing new state-of-the-art performance on benchmark datasets.
Abstract: Recent weakly supervised video anomaly detection methods have achieved significant advances by employing unified frameworks for joint optimization. However, this paradigm is limited by a fundamental sensitivity-stability trade-off, as the conflicting objectives for detecting transient and sustained anomalies lead to either fragmented predictions or over-smoothed responses. To address this limitation, we propose DeSC, a novel Decoupled Sensitivity-Consistency framework that trains two specialized streams using distinct optimization strategies. The temporal sensitivity stream adopts an aggressive optimization strategy to capture high-frequency abrupt changes, whereas the semantic consistency stream applies robust constraints to maintain long-term coherence and reduce noise. Their complementary strengths are fused through a collaborative inference mechanism that reduces individual biases and produces balanced predictions. Extensive experiments demonstrate that DeSC establishes new state-of-the-art performance by achieving 89.37% AUC on UCF-Crime (+1.29%) and 87.18% AP on XD-Violence (+2.22%). Code is available at https://github.com/imzht/DeSC.
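The collaborative inference step can be pictured as combining the two streams' frame-level anomaly scores; the fixed convex combination below is an illustrative simplification (the paper's fusion mechanism is presumably learned, not a constant weight):

```python
import numpy as np

def fuse_streams(s_sensitive, s_consistent, alpha=0.5):
    """Frame-wise convex combination of the sharp sensitivity-stream
    scores and the smoothed consistency-stream scores (alpha is an
    illustrative constant, not a learned gate).
    """
    return alpha * s_sensitive + (1 - alpha) * s_consistent

# A transient anomaly: the sensitive stream fires sharply while the
# consistent stream responds with a smoothed profile.
sens = np.array([0.1, 0.9, 0.9, 0.1, 0.1])
cons = np.array([0.2, 0.5, 0.6, 0.5, 0.2])
fused = fuse_streams(sens, cons)
print(fused.round(2).tolist())  # [0.15, 0.7, 0.75, 0.3, 0.15]
```

The fused curve keeps the spike's sharp onset from one stream while the other damps isolated noise, which is the trade-off the decoupling is meant to resolve.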
[186] X-World: Controllable Ego-Centric Multi-Camera World Models for Scalable End-to-End Driving
Chaoda Zheng, Sean Li, Jinhao Deng, Zhennan Wang, Shijia Chen, Liqiang Xiao, Ziheng Chi, Hongbin Lin, Kangjie Chen, Boyang Wang, Yu Zhang, Xianming Liu
Main category: cs.CV
TL;DR: X-World is an action-conditioned multi-camera generative world model that simulates realistic future video observations for autonomous driving evaluation, enabling controllable and reproducible testing without real-world road trials.
Details
Motivation: Current autonomous driving evaluation relies heavily on costly real-world road testing with limited scenario coverage and reproducibility issues. There's a need for a simulator that can generate realistic future observations under proposed actions while remaining controllable and stable over long horizons.
Method: X-World is a multi-view latent video generator that takes synchronized multi-camera history and future action sequences to generate future multi-camera video streams. It explicitly encourages cross-view geometric consistency and temporal coherence, supports optional controls over traffic agents and road elements, and includes text-prompt interfaces for appearance-level control.
Result: X-World achieves high-quality multi-view video generation with strong view consistency across cameras, stable temporal dynamics over long rollouts, and high controllability with strict action following and faithful adherence to optional scene controls.
Conclusion: X-World provides a practical foundation for scalable and reproducible evaluation of autonomous driving systems by simulating realistic future observations directly in video space, addressing limitations of current real-world testing approaches.
Abstract: Scalable and reliable evaluation is increasingly critical in the end-to-end era of autonomous driving, where vision–language–action (VLA) policies directly map raw sensor streams to driving actions. Yet, current evaluation pipelines still rely heavily on real-world road testing, which is costly, biased toward limited scenario coverage, and difficult to reproduce. These challenges motivate a real-world simulator that can generate realistic future observations under proposed actions, while remaining controllable and stable over long horizons. We present X-World, an action-conditioned multi-camera generative world model that simulates future observations directly in video space. Given synchronized multi-view camera history and a future action sequence, X-World generates future multi-camera video streams that follow the commanded actions. To ensure reproducible and editable scene rollouts, X-World further supports optional controls over dynamic traffic agents and static road elements, and retains a text-prompt interface for appearance-level control (e.g., weather and time of day). Beyond world simulation, X-World also enables video style transfer by conditioning on appearance prompts while preserving the underlying action and scene dynamics. At the core of X-World is a multi-view latent video generator designed to explicitly encourage cross-view geometric consistency and temporal coherence under diverse control signals. Experiments show that X-World achieves high-quality multi-view video generation with (i) strong view consistency across cameras, (ii) stable temporal dynamics over long rollouts, and (iii) high controllability with strict action following and faithful adherence to optional scene controls. These properties make X-World a practical foundation for scalable and reproducible evaluation.
[187] From Plausibility to Verifiability: Risk-Controlled Generative OCR for Vision-Language Models
Weile Gong, Yiping Zuo, Zijian Lu, Xin He, Weibei Fan, Chen Dai
Main category: cs.CV
TL;DR: Paper proposes Geometric Risk Controller for generative OCR in vision-language models to reduce severe errors by enforcing cross-view consensus and stability before accepting transcriptions.
Details
Motivation: Generative OCR in VLMs suffers from deployment misalignment: autoregressive decoding favors semantic plausibility over visual grounding, leading to severe errors like over-generation and unsupported substitutions that create deployment risks despite high benchmark accuracy.
Method: Formulates frozen VLM OCR as selective accept/abstain problem, proposes model-agnostic Geometric Risk Controller that probes multiple structured views of same input, applies lightweight structural screening, and accepts transcription only when cross-view consensus and stability satisfy predefined criteria.
Result: Experiments on frozen VLM backbones and standard OCR benchmarks show consistent reductions in extreme-error risk and catastrophic over-generation at predictable coverage costs.
Conclusion: Reliable deployment of generative OCR with frozen VLMs benefits from explicit system-level risk control rather than unconstrained generation.
Abstract: Modern vision-language models (VLMs) can act as generative OCR engines, yet open-ended decoding can expose rare but consequential failures. We identify a core deployment misalignment in generative OCR. Autoregressive decoding favors semantic plausibility, whereas OCR requires outputs that are visually grounded and geometrically verifiable. This mismatch produces severe errors, especially over-generation and unsupported substitutions, creating deployment risk even when benchmark accuracy remains high. We therefore formulate frozen VLM OCR as a selective accept/abstain problem and propose a model-agnostic Geometric Risk Controller. The controller probes multiple structured views of the same input, applies lightweight structural screening, and accepts a transcription only when cross-view consensus and stability satisfy predefined criteria, yielding a small family of operating points. Experiments on frozen VLM backbones and standard OCR benchmarks show consistent reductions in extreme-error risk and catastrophic over-generation at predictable coverage costs. Reliable deployment of generative OCR with frozen VLMs benefits from explicit system-level risk control rather than unconstrained generation.
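The accept/abstain idea can be sketched in a few lines; the voting rule, function name, and threshold below are illustrative stand-ins, not the paper's actual controller, which also applies structural screening and stability checks:

```python
from collections import Counter

def risk_controlled_ocr(transcripts, consensus_threshold=0.6):
    """Accept a transcription only when enough structured views agree.

    `transcripts` holds OCR outputs for the same input under several
    views (e.g. crops or rescalings); names and threshold are illustrative.
    """
    if not transcripts:
        return None  # abstain
    text, votes = Counter(transcripts).most_common(1)[0]
    # Cross-view consensus: fraction of views producing the same string.
    if votes / len(transcripts) >= consensus_threshold:
        return text
    return None  # abstain rather than risk a severe error

# Three of four views agree -> accept; a 50/50 split -> abstain.
print(risk_controlled_ocr(["INV-2024", "INV-2024", "INV-2024", "1NV-2024"]))
print(risk_controlled_ocr(["INV-2024", "1NV-2024"]))
```

The point of the sketch is the system-level behavior: disagreement triggers abstention instead of a forced, possibly hallucinated output.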
[188] Controllable Text-to-Motion Generation via Modular Body-Part Phase Control
Minyue Dai, Ke Fan, Anyi Rao, Jingbo Wang, Bo Dai
Main category: cs.CV
TL;DR: A plug-and-play framework for text-to-motion generation that enables localized body-part editing through a compact phase-based interface, allowing fine-grained control over motion magnitude, speed, and timing while maintaining overall coherence.
Details
Motivation: Existing text-to-motion generation methods struggle with modifying specific body parts while maintaining overall motion coherence, relying on cumbersome high-dimensional joint constraints that hinder user-friendly iterative refinement.
Method: Proposes Modular Body-Part Phase Control, modeling body-part latent motion channels as sinusoidal phase signals (amplitude, frequency, phase shift, offset) to extract interpretable codes. Uses a modular Phase ControlNet branch that injects this signal via residual feature modulation, decoupling control from the generative backbone.
Result: The approach provides predictable and fine-grained control over motion magnitude, speed, and timing, preserves global motion coherence, and works with both diffusion- and flow-based models. Offers a practical paradigm for controllable text-to-motion generation.
Conclusion: The proposed framework enables structured, localized editing of motion through a compact scalar-based phase interface, making text-to-motion generation more controllable and user-friendly for animation and avatar applications.
Abstract: Text-to-motion (T2M) generation is becoming a practical tool for animation and interactive avatars. However, modifying specific body parts while maintaining overall motion coherence remains challenging. Existing methods typically rely on cumbersome, high-dimensional joint constraints (e.g., trajectories), which hinder user-friendly, iterative refinement. To address this, we propose Modular Body-Part Phase Control, a plug-and-play framework enabling structured, localized editing via a compact, scalar-based phase interface. By modeling body-part latent motion channels as sinusoidal phase signals characterized by amplitude, frequency, phase shift, and offset, we extract interpretable codes that capture part-specific dynamics. A modular Phase ControlNet branch then injects this signal via residual feature modulation, seamlessly decoupling control from the generative backbone. Experiments on both diffusion- and flow-based models demonstrate that our approach provides predictable and fine-grained control over motion magnitude, speed, and timing. It preserves global motion coherence and offers a practical paradigm for controllable T2M generation. Project page: https://jixiii.github.io/bp-phase-project-page/
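The four scalar phase parameters can be sketched directly; this is a minimal illustration of the parameterization itself, not the paper's code-extraction or injection procedure:

```python
import numpy as np

def phase_signal(t, amplitude, frequency, phase_shift, offset):
    """Sinusoidal phase code for one body-part channel (illustrative).

    amplitude ~ motion magnitude, frequency ~ speed,
    phase_shift ~ timing, offset ~ baseline pose.
    """
    return amplitude * np.sin(2 * np.pi * frequency * t + phase_shift) + offset

t = np.linspace(0.0, 1.0, 60)  # one second at 60 fps
arm = phase_signal(t, amplitude=1.0, frequency=2.0, phase_shift=0.0, offset=0.0)
# Doubling the amplitude scales the motion range without touching its timing,
# which is the kind of localized, predictable edit the interface targets.
arm_big = phase_signal(t, amplitude=2.0, frequency=2.0, phase_shift=0.0, offset=0.0)
print(round(arm_big.max() / arm.max(), 2))
```

Editing one scalar per channel is what makes the interface compact compared to specifying full joint trajectories.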
[189] Detached Skip-Links and $R$-Probe: Decoupling Feature Aggregation from Gradient Propagation for MLLM OCR
Ziye Yuan, Ruchang Yao, Chengxin Zheng, Yusheng Zhao, Daxiang Dong, Ming Zhang
Main category: cs.CV
TL;DR: Detached Skip-Links address gradient interference in multimodal LLMs to preserve fine-grained visual details for OCR tasks, improving performance without adding parameters.
Details
Motivation: MLLMs struggle with OCR tasks due to loss of fine-grained visual details during multi-layer feature fusion. Skip pathways cause gradient interference that overwrites low-level visual signals and destabilizes training.
Method: Proposes Detached Skip-Links: reuses shallow features in forward pass but stops gradients through skip branch during joint training. Also introduces R-Probe to measure pixel-level reconstructability of visual tokens using a shallow decoder initialized from early LLM layers.
Result: Consistent improvements on OCR-centric benchmarks and clear gains on general multimodal tasks across multiple ViT backbones and scales up to 7M training samples.
Conclusion: Addressing gradient interference through Detached Skip-Links preserves fine-grained visual information in MLLMs, enhancing OCR performance while maintaining general multimodal capabilities.
Abstract: Multimodal large language models (MLLMs) excel at high-level reasoning yet fail on OCR tasks where fine-grained visual details are compromised or misaligned. We identify an overlooked optimization issue in multi-layer feature fusion. Skip pathways introduce direct back-propagation paths from high-level semantic objectives to early visual layers. This mechanism overwrites low-level signals and destabilizes training. To mitigate this gradient interference, we propose Detached Skip-Links, a minimal modification that reuses shallow features in the forward pass while stopping gradients through the skip branch during joint training. This asymmetric design reduces gradient interference, improving stability and convergence without adding learnable parameters. To diagnose whether fine-grained information is preserved and usable by an LLM, we introduce $R$-Probe, which measures pixel-level reconstructability of projected visual tokens using a shallow decoder initialized from the first quarter of the LLM layers. Across multiple ViT backbones and multimodal benchmarks, and at scales up to 7M training samples, our approach consistently improves OCR-centric benchmarks and delivers clear gains on general multimodal tasks.
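The asymmetric gradient flow can be worked out by hand under simplifying assumptions (linear layers, an L2 loss); this numpy sketch only illustrates which back-propagation path the stop-gradient removes:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)        # shallow visual feature
W1 = rng.normal(size=(3, 3))  # deep transform
W2 = rng.normal(size=(3, 3))  # semantic head
target = np.zeros(3)

# Forward pass reuses the shallow feature through a skip: y = W2 @ (W1 @ x + x)
y = W2 @ (W1 @ x + x)
dy = y - target               # dL/dy for L = 0.5 * ||y - target||^2

# Ordinary skip: the semantic objective back-propagates along BOTH branches.
grad_x_full = W1.T @ (W2.T @ dy) + W2.T @ dy

# Detached skip: stop-gradient removes the direct path, so the high-level
# objective cannot overwrite the low-level signal via the skip branch.
grad_x_detached = W1.T @ (W2.T @ dy)

# The difference is exactly the direct-path term that the detachment blocks.
print(np.allclose(grad_x_full - grad_x_detached, W2.T @ dy))
```

The forward computation is identical in both cases; only the backward path changes, which is why the modification adds no parameters.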
[190] Evaluating Vision Foundation Models for Pixel and Object Classification in Microscopy
Carolin Teuber, Anwai Archit, Tobias Boothe, Peter Ditte, Jochen Rink, Constantin Pape
Main category: cs.CV
TL;DR: Vision foundation models (VFMs) like SAM and DINOv3 improve pixel and object classification in biomedical imaging compared to traditional feature-based shallow learning, establishing benchmarks for microscopy applications.
Details
Motivation: While deep learning dominates many computer vision tasks, biomedical imaging still relies on feature-based shallow learning for interactive semantic segmentation and object classification due to data diversity, lack of large pretraining datasets, and need for computational efficiency. The paper investigates whether vision foundation models can improve these tasks compared to current approaches.
Method: Evaluated several VFMs including general-purpose models (SAM, SAM2, DINOv3) and domain-specific ones (μSAM, PathoSAM) in combination with shallow learning and attentive probing on five diverse and challenging microscopy datasets.
Result: Results demonstrate consistent improvements over hand-crafted features and provide a clear pathway toward practical improvements. The study establishes a benchmark for VFMs in microscopy.
Conclusion: Vision foundation models can significantly improve pixel and object classification in biomedical imaging compared to traditional approaches, establishing valuable benchmarks for future development in microscopy applications.
Abstract: Deep learning underlies most modern approaches and tools in computer vision, including biomedical imaging. However, for interactive semantic segmentation (often called pixel classification in this context) and interactive object-level classification (object classification), feature-based shallow learning remains widely used. This is due to the diversity of data in this domain, the lack of large pretraining datasets, and the need for computational and label efficiency. In contrast, state-of-the-art tools for many other vision tasks in microscopy - most notably cellular instance segmentation - already rely on deep learning and have recently benefited substantially from vision foundation models (VFMs), particularly SAM. Here, we investigate whether VFMs can also improve pixel and object classification compared to current approaches. To this end, we evaluate several VFMs, including general-purpose models (SAM, SAM2, DINOv3) and domain-specific ones (μSAM, PathoSAM), in combination with shallow learning and attentive probing on five diverse and challenging datasets. Our results demonstrate consistent improvements over hand-crafted features and provide a clear pathway toward practical improvements. Furthermore, our study establishes a benchmark for VFMs in microscopy and informs future developments in this area.
[191] HUGE-Bench: A Benchmark for High-Level UAV Vision-Language-Action Tasks
Jingyu Guo, Ziye Chen, Ziwen Li, Zhengqing Gao, Jiaxin Huang, Hanlue Zhang, Fengming Huang, Yu Yao, Tongliang Liu, Mingming Gong
Main category: cs.CV
TL;DR: HUGE-Bench is a new benchmark for high-level UAV vision-language-action tasks that tests agents’ ability to interpret concise language commands and execute complex, safe trajectories in realistic 3D environments.
Details
Motivation: Existing UAV vision-language navigation benchmarks focus on long, step-wise route descriptions with goal-centric evaluation, which doesn't reflect real operations where brief, high-level commands need to be grounded into safe multi-stage behaviors. There's a need for diagnostic benchmarks that test high-level semantic understanding and safety awareness.
Method: Created HUGE-Bench with 4 real-world digital twin scenes, 8 high-level tasks, and 2.56M meters of trajectories. Built on aligned 3D Gaussian Splatting (3DGS)-Mesh representation combining photorealistic rendering with collision-capable geometry. Introduced process-oriented and collision-aware metrics for evaluation.
Result: Experiments on state-of-the-art VLA models reveal significant gaps in high-level semantic completion and safe execution, demonstrating HUGE-Bench’s effectiveness as a diagnostic testbed for high-level UAV autonomy.
Conclusion: HUGE-Bench addresses limitations of existing UAV VLN benchmarks by focusing on high-level language understanding and safety-aware execution, providing a valuable diagnostic tool for advancing UAV vision-language-action capabilities.
Abstract: Existing UAV vision-language navigation (VLN) benchmarks have enabled language-guided flight, but they largely focus on long, step-wise route descriptions with goal-centric evaluation, making them less diagnostic for real operations where brief, high-level commands must be grounded into safe multi-stage behaviors. We present HUGE-Bench, a benchmark for High-Level UAV Vision-Language-Action (HL-VLA) tasks that tests whether an agent can interpret concise language and execute complex, process-oriented trajectories with safety awareness. HUGE-Bench comprises 4 real-world digital twin scenes, 8 high-level tasks, and 2.56M meters of trajectories, and is built on an aligned 3D Gaussian Splatting (3DGS)-Mesh representation that combines photorealistic rendering with collision-capable geometry for scalable generation and collision-aware evaluation. We introduce process-oriented and collision-aware metrics to assess process fidelity, terminal accuracy, and safety. Experiments on representative state-of-the-art VLA models reveal significant gaps in high-level semantic completion and safe execution, highlighting HUGE-Bench as a diagnostic testbed for high-level UAV autonomy.
[192] Adaptive Greedy Frame Selection for Long Video Understanding
Yuning Huang, Fengqing Zhu
Main category: cs.CV
TL;DR: A question-adaptive greedy frame selection method for long-video QA that jointly optimizes query relevance and semantic representativeness under fixed frame budgets, using complementary embeddings and submodular optimization with question-type routing.
Details
Motivation: Current VLM approaches for long-video QA face bottlenecks from excessive visual tokens. Naive sparse sampling misses key moments, while purely relevance-driven selection suffers from near-duplicate frames and poor temporal coverage, necessitating smarter frame selection methods.
Method: Proposes a greedy frame selection method that constructs a 1 FPS candidate pool with timestamp alignment, embeds frames in SigLIP (relevance) and DINOv2 (semantic similarity) spaces, and selects frames by maximizing a weighted sum of modular relevance and facility-location coverage terms. Uses question-type classifier to route queries to optimal preset strategies.
Result: Experiments on MLVU show consistent accuracy gains over uniform sampling and strong baselines across frame budgets, with largest improvements under tight budgets.
Conclusion: The proposed question-adaptive greedy frame selection method effectively balances relevance and coverage for long-video QA, providing a principled approach to visual token reduction while maintaining accuracy.
Abstract: Large vision–language models (VLMs) are increasingly applied to long-video question answering, yet inference is often bottlenecked by the number of input frames and resulting visual tokens. Naive sparse sampling can miss decisive moments, while purely relevance-driven selection frequently collapses onto near-duplicate frames and sacrifices coverage of temporally distant evidence. We propose a question-adaptive greedy frame selection method that jointly optimizes query relevance and semantic representativeness under a fixed frame budget. Our approach constructs a 1 FPS candidate pool (capped at 1000) with exact timestamp alignment, embeds candidates in two complementary spaces (SigLIP for question relevance and DINOv2 for semantic similarity), and selects frames by greedily maximizing a weighted sum of a modular relevance term and a facility-location coverage term. This objective is normalized, monotone, and submodular, yielding a standard (1-1/e) greedy approximation guarantee. To account for question-dependent trade-offs between relevance and coverage, we introduce four preset strategies and a lightweight text-only question-type classifier that routes each query to its best-performing preset. Experiments on MLVU show consistent accuracy gains over uniform sampling and a strong recent baseline across frame budgets, with the largest improvements under tight budgets.
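The greedy objective (a weighted sum of a modular relevance term and a facility-location coverage term) can be sketched on toy inputs; the similarity matrix, relevance scores, and weighting below are illustrative, not the paper's learned embeddings:

```python
import numpy as np

def greedy_select(rel, sim, budget, lam=0.5):
    """Greedily maximize lam * relevance + (1 - lam) * facility-location
    coverage; a sketch of the paper's objective on toy inputs.

    rel: (n,) question-relevance score per candidate frame.
    sim: (n, n) pairwise semantic similarity between frames.
    """
    n = len(rel)
    selected = []
    best_cov = np.zeros(n)  # how well each frame is covered by the set so far
    for _ in range(budget):
        gains = np.full(n, -np.inf)
        for i in range(n):
            if i in selected:
                continue
            # Marginal facility-location gain of adding frame i.
            cov_gain = np.maximum(best_cov, sim[i]).sum() - best_cov.sum()
            gains[i] = lam * rel[i] + (1 - lam) * cov_gain
        best = int(np.argmax(gains))
        selected.append(best)
        best_cov = np.maximum(best_cov, sim[best])
    return selected

# Frames 0 and 1 are near-duplicates; the coverage term steers the second
# pick to the distinct frame 2 even though frame 1 is more query-relevant.
sim = np.array([[1.0, 0.95, 0.1],
                [0.95, 1.0, 0.1],
                [0.1, 0.1, 1.0]])
rel = np.array([0.9, 0.85, 0.3])
print(greedy_select(rel, sim, budget=2))  # -> [0, 2]
```

Because the objective is monotone submodular, this greedy loop carries the standard (1-1/e) approximation guarantee mentioned in the abstract.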
[193] Fourier Splatting: Generalized Fourier encoded primitives for scalable radiance fields
Mihnea-Bogdan Jurca, Bert Van hauwermeiren, Adrian Munteanu
Main category: cs.CV
TL;DR: Fourier Splatting introduces scalable primitives for radiance field rendering using Fourier-encoded planar surfels, enabling quality scaling without pruning primitives.
Details
Motivation: Current 3D Gaussian Splatting methods tie visual fidelity strictly to primitive count, requiring quality downscaling through primitive pruning. The authors seek a more flexible, inherently scalable primitive representation.
Method: Uses Fourier-encoded descriptors to parameterize planar surfels, creating scalable primitives with arbitrary closed shapes. Employs straight-through estimator for stable optimization and HYDRA densification strategy within MCMC framework to decompose complex primitives.
Result: Achieves state-of-the-art rendering quality among planar-primitive frameworks and comparable perceptual metrics to leading volumetric representations on standard benchmarks.
Conclusion: Provides a versatile solution for bandwidth-constrained high-fidelity rendering with scalable primitives that enable runtime quality adjustment through Fourier coefficient truncation.
Abstract: Novel view synthesis has recently been revolutionized by 3D Gaussian Splatting (3DGS), which enables real-time rendering through explicit primitive rasterization. However, existing methods tie visual fidelity strictly to the number of primitives: quality downscaling is achieved only through pruning primitives. We propose the first inherently scalable primitive for radiance field rendering. Fourier Splatting employs scalable primitives with arbitrary closed shapes obtained by parameterizing planar surfels with Fourier encoded descriptors. This formulation allows a single trained model to be rendered at varying levels of detail simply by truncating Fourier coefficients at runtime. To facilitate stable optimization, we employ a straight-through estimator for gradient extension beyond the primitive boundary, and introduce HYDRA, a densification strategy that decomposes complex primitives into simpler constituents within the MCMC framework. Our method achieves state-of-the-art rendering quality among planar-primitive frameworks and comparable perceptual metrics compared to leading volumetric representations on standard benchmarks, providing a versatile solution for bandwidth-constrained high-fidelity rendering.
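The level-of-detail mechanism (truncating Fourier coefficients at runtime) can be illustrated on a single closed curve; the coefficients below are made up for illustration, not a trained primitive:

```python
import numpy as np

def fourier_shape(coeffs, n_points=256):
    """Reconstruct a closed 2-D boundary from complex Fourier descriptors.

    coeffs[k] multiplies exp(2*pi*i*k*t); truncating the list renders the
    same primitive at a coarser level of detail.
    """
    t = np.linspace(0.0, 1.0, n_points, endpoint=False)
    z = sum(c * np.exp(2j * np.pi * k * t) for k, c in enumerate(coeffs))
    return z.real, z.imag

# A unit circle plus a small higher-frequency ripple on the boundary.
full = [0.0, 1.0, 0.0, 0.15]
x_full, y_full = fourier_shape(full)
# Runtime truncation: drop high-order terms for a smoother, cheaper shape.
x_lod, y_lod = fourier_shape(full[:2])

r_full = np.hypot(x_full, y_full)  # radius varies: rippled boundary
r_lod = np.hypot(x_lod, y_lod)     # radius constant: plain circle
print(r_lod.std() < r_full.std())
```

One stored coefficient list thus serves every level of detail, which is what decouples quality scaling from primitive count.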
[194] VideoSeek: Long-Horizon Video Agent with Tool-Guided Seeking
Jingyang Lin, Jialian Wu, Jiang Liu, Ximeng Sun, Ze Wang, Xiaodong Yu, Jiebo Luo, Zicheng Liu, Emad Barsoum
Main category: cs.CV
TL;DR: VideoSeek is a long-horizon video agent that uses video logic flow to actively seek answer-critical evidence instead of exhaustively parsing full videos, achieving strong accuracy with far fewer frames.
Details
Motivation: Current video agentic models rely on greedy parsing over densely sampled video frames, resulting in high computational costs. There's a need for more efficient approaches that can maintain or improve video understanding while reducing frame usage.
Method: VideoSeek operates in a think-act-observe loop with a toolkit for collecting multi-granular video observations. It leverages video logic flow to actively seek answer-critical evidence rather than exhaustively parsing full videos, enabling query-aware exploration over accumulated observations.
Result: VideoSeek achieves strong accuracy on four challenging video understanding and reasoning benchmarks while using far fewer frames than prior video agents and standalone LMMs. Notably, it achieves 10.2 absolute points improvement on LVBench over GPT-5 while using 93% fewer frames.
Conclusion: VideoSeek demonstrates that leveraging video logic flow for active evidence seeking can significantly reduce computational costs while maintaining or improving video understanding capabilities, highlighting the importance of toolkit design and reasoning capabilities.
Abstract: Video agentic models have advanced challenging video-language tasks. However, most agentic approaches still heavily rely on greedy parsing over densely sampled video frames, resulting in high computational cost. We present VideoSeek, a long-horizon video agent that leverages video logic flow to actively seek answer-critical evidence instead of exhaustively parsing the full video. This insight allows the model to use far fewer frames while maintaining, or even improving, its video understanding capability. VideoSeek operates in a think-act-observe loop with a well-designed toolkit for collecting multi-granular video observations. This design enables query-aware exploration over accumulated observations and supports practical video understanding and reasoning. Experiments on four challenging video understanding and reasoning benchmarks demonstrate that VideoSeek achieves strong accuracy while using far fewer frames than prior video agents and standalone LMMs. Notably, VideoSeek achieves a 10.2 absolute points improvement on LVBench over its base model, GPT-5, while using 93% fewer frames. Further analysis highlights the significance of leveraging video logic flow, strong reasoning capability, and the complementary roles of toolkit design.
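A minimal sketch of the think-act-observe control flow; the real agent lets an LMM choose each tool call, so the fixed policy, tool names, and strings below are invented for illustration only:

```python
def video_seek(question, toolkit, max_steps=4):
    """Think-act-observe loop with early stopping on sufficient evidence."""
    observations = []
    for step in range(max_steps):
        # "Think": pick the next probe given the evidence gathered so far.
        tool = "coarse_scan" if not observations else "zoom_segment"
        # "Act" + "observe": collect one more multi-granular observation.
        observations.append(toolkit[tool](question, observations))
        # Stop as soon as the accumulated evidence resolves the question,
        # instead of exhaustively parsing the full video.
        if "answer:" in observations[-1]:
            return observations[-1].split("answer:")[1].strip(), step + 1
    return None, max_steps

# Toy toolkit: a coarse pass localizes the event, a zoom resolves it.
toolkit = {
    "coarse_scan": lambda q, obs: "event near t=42s",
    "zoom_segment": lambda q, obs: "answer: the chef plates the dish",
}
print(video_seek("What happens at the end?", toolkit))
```

Early stopping on answer-critical evidence is what lets this style of agent touch far fewer frames than dense parsing.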
[195] Hyper-Connections for Adaptive Multi-Modal MRI Brain Tumor Segmentation
Lokendra Kumar, Shubham Aggarwal
Main category: cs.CV
TL;DR: Hyper-Connections (HC) improve 3D brain tumor segmentation across multiple architectures by dynamically fusing multi-modal features, with best gains in volumetric models and fine-grained boundary delineation.
Details
Motivation: To enhance multi-modal brain tumor segmentation by replacing fixed residual connections with dynamic Hyper-Connections that can adaptively fuse features from different imaging modalities, addressing limitations of static connections in capturing modality-specific importance.
Method: Hyper-Connections are introduced as drop-in replacements for fixed residual connections in five 3D segmentation architectures (nnU-Net, SwinUNETR, VT-UNet, U-Net, U-Netpp). HC dynamically aggregates multi-modal features using learnable weights that adapt to input data, tested on BraTS 2021 dataset with MRI modalities (T1ce, T2, FLAIR, T1).
Result: HC consistently improved all 3D models with up to +1.03% mean Dice gain, most pronounced in Enhancing Tumor sub-region. Models developed sharper sensitivity to clinically dominant sequences (T1ce for Tumor Core/Enhancing Tumor, FLAIR for Whole Tumor). In 2D settings, improvements were smaller and configuration-sensitive.
Conclusion: Hyper-Connections provide a simple, efficient, and broadly applicable mechanism for multi-modal feature fusion in medical image segmentation, with volumetric spatial context amplifying the benefits of adaptive aggregation.
Abstract: We present the first study of Hyper-Connections (HC) for volumetric multi-modal brain tumor segmentation, integrating them as a drop-in replacement for fixed residual connections across five architectures: nnU-Net, SwinUNETR, VT-UNet, U-Net, and U-Netpp. Dynamic HC consistently improves all 3D models on the BraTS 2021 dataset, yielding up to +1.03 percent mean Dice gain with negligible parameter overhead. Gains are most pronounced in the Enhancing Tumor sub-region, reflecting improved fine-grained boundary delineation. Modality ablation further reveals that HC-equipped models develop sharper sensitivity toward clinically dominant sequences, specifically T1ce for Tumor Core and Enhancing Tumor, and FLAIR for Whole Tumor, a behavior absent in fixed-connection baselines and consistent across all architectures. In 2D settings, improvements are smaller and configuration-sensitive, suggesting that volumetric spatial context amplifies the benefit of adaptive aggregation. These results establish HC as a simple, efficient, and broadly applicable mechanism for multi-modal feature fusion in medical image segmentation.
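The contrast with a fixed residual can be sketched minimally. This toy mixes two branches with learnable logits; the paper's HC operates on multi-channel 3D feature maps, so treat this as a cartoon of the idea, not the actual mechanism:

```python
import numpy as np

def hyper_connection(x, fx, weights):
    """Learnable mixture of the identity and transform branches (toy sketch),
    standing in for the fixed residual x + f(x)."""
    w = np.exp(weights) / np.exp(weights).sum()  # normalize branch logits
    return w[0] * x + w[1] * fx

x = np.ones(4)       # identity branch (e.g. carried-over modality features)
fx = 2 * np.ones(4)  # transformed branch
# Equal logits give an even mix; shifting them lets training emphasize one
# branch, analogous to the sharper sensitivity toward T1ce or FLAIR that
# the paper reports in HC-equipped models.
print(hyper_connection(x, fx, np.array([0.0, 0.0])))
```

A fixed residual hard-codes w = (1, 1); making the weights learnable is what allows modality-adaptive aggregation with negligible parameter overhead.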
[196] IsoCLIP: Decomposing CLIP Projectors for Efficient Intra-modal Alignment
Simone Magistri, Dipam Goswami, Marco Mistretta, Bartłomiej Twardowski, Joost van de Weijer, Andrew D. Bagdanov
Main category: cs.CV
TL;DR: IsoCLIP: A training-free method to reduce intra-modal misalignment in CLIP models by identifying and removing anisotropic directions from the shared embedding space, improving performance on intra-modal tasks like image-to-image retrieval.
Details
Motivation: CLIP models suffer from intra-modal misalignment when applied to single-modality tasks like image-to-image retrieval, despite being trained for inter-modal alignment. The projectors that map features to shared space create both aligned and modality-specific anisotropic directions that harm intra-modal performance.
Method: Analyze the spectral properties of CLIP’s inter-modal operator to identify an approximately isotropic subspace where modalities are well-aligned, and anisotropic directions specific to each modality. Extract this aligned subspace directly from projector weights and remove anisotropic directions without retraining.
Result: The method reduces intra-modal misalignment, lowers latency, and outperforms existing approaches on intra-modal retrieval and classification benchmarks across multiple pre-trained CLIP-like models.
Conclusion: Intra-modal misalignment in CLIP stems from anisotropic directions in the shared embedding space. Removing these directions improves intra-modal performance without retraining, revealing insights about the structure of multimodal representations.
Abstract: Vision-Language Models like CLIP are extensively used for inter-modal tasks which involve both visual and text modalities. However, when the individual modality encoders are applied to inherently intra-modal tasks like image-to-image retrieval, their performance suffers from the intra-modal misalignment. In this paper we study intra-modal misalignment in CLIP with a focus on the role of the projectors that map pre-projection image and text embeddings into the shared embedding space. By analyzing the form of the cosine similarity applied to projected features, and its interaction with the contrastive CLIP loss, we show that there is an inter-modal operator responsible for aligning the two modalities during training, and a second, intra-modal operator that only enforces intra-modal normalization but does nothing to promote intra-modal alignment. Via spectral analysis of the inter-modal operator, we identify an approximately isotropic subspace in which the two modalities are well-aligned, as well as anisotropic directions specific to each modality. We demonstrate that this aligned subspace can be directly obtained from the projector weights and that removing the anisotropic directions improves intra-modal alignment. Our experiments on intra-modal retrieval and classification benchmarks show that our training-free method reduces intra-modal misalignment, greatly lowers latency, and outperforms existing approaches across multiple pre-trained CLIP-like models. The code is publicly available at: https://github.com/simomagi/IsoCLIP.
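A toy numpy sketch of the training-free subspace idea, under the assumption that the inter-modal operator is the product of the two projector matrices and with an arbitrary rank cutoff (the paper derives the retained subspace from its own spectral analysis rather than a fixed rank):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 8, 6                      # toy embedding / projection dims
W_img = rng.normal(size=(k, d))  # stand-ins for CLIP's image projector
W_txt = rng.normal(size=(k, d))  # ... and text projector

# Inter-modal similarity between projected features depends on W_img.T @ W_txt.
M = W_img.T @ W_txt
U, s, Vt = np.linalg.svd(M)

# Keep the leading, approximately aligned directions; drop the rest.
r = 4                            # illustrative cutoff, not the paper's rule
P = U[:, :r] @ U[:, :r].T        # projector onto the retained image subspace

x = rng.normal(size=d)
x_iso = P @ x                    # feature with anisotropic directions removed
print(np.allclose(P @ x_iso, x_iso))  # P is idempotent: a true projection
```

Everything is computed from the frozen projector weights, which is why no retraining is needed and latency can drop (the retained subspace is lower-dimensional).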
[197] MedQ-Engine: A Closed-Loop Data Engine for Evolving MLLMs in Medical Image Quality Assessment
Jiyao Liu, Junzhi Ning, Wanying Qu, Lihao Liu, Chenglong Ma, Junjun He, Ningsheng Xu
Main category: cs.CV
TL;DR: MedQ-Engine: A closed-loop data engine for medical image quality assessment that iteratively discovers model failures, collects targeted annotations, and fine-tunes models to improve multimodal LLM performance in clinical reasoning tasks.
Details
Motivation: Current multimodal large language models (MLLMs) perform poorly in medical image quality assessment compared to human experts, especially when providing descriptive clinical reasoning beyond simple scores. Two main challenges: high cost of descriptive annotations and inability of one-time data collection to adapt to evolving model weaknesses.
Method: Proposes MedQ-Engine, a closed-loop system with three components: 1) Evaluates models to discover failure prototypes via data-driven clustering, 2) Explores million-scale image pool using prototypes as retrieval anchors with progressive human-in-the-loop annotation, 3) Evolves through quality-assured fine-tuning. Includes entropy-guided routing to triage annotations and minimize labeling costs.
Result: Across five medical imaging modalities, MedQ-Engine elevates an 8B-parameter model to surpass GPT-4o by over 13% and narrows the gap with human experts to only 4.34%. Achieves this with only 10K annotations, showing more than 4x sample efficiency over random sampling.
Conclusion: MedQ-Engine provides an effective framework for improving multimodal LLMs in medical image quality assessment through iterative, targeted data collection and fine-tuning, significantly reducing the annotation burden while achieving near-expert performance.
Abstract: Medical image quality assessment (Med-IQA) is a prerequisite for clinical AI deployment, yet multimodal large language models (MLLMs) still fall substantially short of human experts, particularly when required to provide descriptive assessments with clinical reasoning beyond simple quality scores. However, improving them is hindered by the high cost of acquiring descriptive annotations and by the inability of one-time data collection to adapt to the model’s evolving weaknesses. To address these challenges, we propose MedQ-Engine, a closed-loop data engine that iteratively evaluates the model to discover failure prototypes via data-driven clustering, explores a million-scale image pool using these prototypes as retrieval anchors with progressive human-in-the-loop annotation, and evolves through quality-assured fine-tuning, forming a self-improving cycle. Models are evaluated on complementary perception and description tasks. An entropy-guided routing mechanism triages annotations to minimize labeling cost. Experiments across five medical imaging modalities show that MedQ-Engine elevates an 8B-parameter model to surpass GPT-4o by over 13% and narrow the gap with human experts to only 4.34%, using only 10K annotations with more than 4x sample efficiency over random sampling.
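The entropy-guided routing can be sketched as a simple triage rule; the threshold and function name are illustrative, not the paper's operating point:

```python
import math

def route_for_annotation(probs, threshold=0.8):
    """Entropy-guided triage: send uncertain predictions to annotators.

    probs: the model's class distribution for one sample.
    """
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return "human" if entropy > threshold else "auto"

# A confident prediction keeps the model's label; an uncertain one is
# routed to a human, concentrating the labeling budget where it matters.
print(route_for_annotation([0.95, 0.03, 0.02]))  # low entropy -> auto
print(route_for_annotation([0.4, 0.35, 0.25]))   # high entropy -> human
```

Spending annotations only on high-entropy samples is one way such a loop can reach the reported 4x sample efficiency over random sampling.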
[198] SIMPLER: Efficient Foundation Model Adaptation via Similarity-Guided Layer Pruning for Earth Observation
Víctor Barreiro, Johannes Jakubik, Francisco Argüello, Dora B. Heras
Main category: cs.CV
TL;DR: SIMPLER is a pre-fine-tuning architecture selection method that reduces inference and deployment costs for Earth Observation foundation models by identifying redundant layers before adaptation, achieving significant parameter reduction while maintaining performance.
Details
Motivation: Fine-tuning large foundation models for Earth Observation is computationally expensive in both training and inference. Existing parameter-efficient methods reduce training costs but keep full inference complexity, while post-hoc compression requires costly full fine-tuning first.
Method: SIMPLER exploits the stabilization of representations in deeper layers of pre-trained vision transformers. It computes layer-wise representation similarity on unlabeled task data and applies an automated scoring function to select redundant layers, requiring no gradients, magnitude heuristics, or hyperparameter tuning.
Result: On Prithvi-EO-2, SIMPLER prunes up to 79% of parameters while retaining 94% of baseline performance, yielding 2.1x training speedup and 2.6x inference speedup. The method generalizes to TerraMind (multimodal EO foundation model) and ImageNet-pretrained ViT-MAE.
Conclusion: SIMPLER provides an efficient pre-fine-tuning architecture selection method that reduces both training and inference costs for Earth Observation foundation models, demonstrating applicability across tasks, architectures, and spectral modalities.
Abstract: Fine-tuning foundation models for Earth Observation is computationally expensive, with high training time and memory demands for both training and deployment. Parameter-efficient methods reduce training cost but retain full inference complexity, while post-hoc compression optimizes inference only after costly full fine-tuning. We introduce SIMPLER, a pre-fine-tuning architecture selection method that reduces inference and deployment costs by identifying an effective model depth before adaptation. SIMPLER exploits stabilization of representations in deeper layers of pre-trained vision transformers: it computes layer-wise representation similarity on unlabeled task data and applies an automated scoring function to select redundant layers, with no gradients, magnitude heuristics, or hyperparameter tuning required. On Prithvi-EO-2, SIMPLER prunes up to 79% of parameters while retaining 94% of baseline performance, yielding a 2.1x training speedup and 2.6x inference speedup. The method generalizes to TerraMind (a multimodal EO foundation model) and ImageNet-pretrained ViT-MAE, demonstrating applicability across tasks, architectures, and spectral modalities. Code is available at https://gitlab.citius.gal/hpc4rs/simpler.
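A minimal sketch of similarity-based layer selection, where plain cosine similarity and a fixed threshold stand in for the paper's automated scoring function:

```python
import numpy as np

def redundant_layers(reps, threshold=0.99):
    """Flag layers whose output is nearly identical to their input.

    reps: list of (tokens, dim) activations per layer, computed on
    unlabeled task data; threshold is an illustrative stand-in.
    """
    drop = []
    for i in range(1, len(reps)):
        a, b = reps[i - 1].ravel(), reps[i].ravel()
        cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
        if cos >= threshold:
            drop.append(i)  # representation has stabilized: prune this layer
    return drop

rng = np.random.default_rng(0)
base = rng.normal(size=(4, 8))
reps = [base,
        base + rng.normal(size=(4, 8)),   # early layers still changing
        base + rng.normal(size=(4, 8))]
reps.append(reps[-1] + 1e-4 * rng.normal(size=(4, 8)))  # deep layer: stabilized
print(redundant_layers(reps))  # only the stabilized deep layer is flagged
```

Because the selection uses only forward activations, it needs no gradients and runs before any fine-tuning, which is what lets both training and inference benefit.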
[199] PanORama: Multiview Consistent Panoptic Segmentation in Operating Rooms
Tuna Gürbüz, Ege Özsoy, Tony Danjun Wang, Nassir Navab
Main category: cs.CV
TL;DR: PanORama: A calibration-free, multiview-consistent panoptic segmentation method for operating rooms that models cross-view interactions in a single forward pass to achieve reliable spatial understanding in cluttered surgical environments.
Details
Motivation: Operating rooms are cluttered, dynamic, and highly occluded environments where reliable spatial understanding is essential for surgical situational awareness. Current methods struggle with multiview consistency due to limited visibility in some views leading to mispredictions across cameras.
Method: PanORama introduces multiview-consistent panoptic segmentation by modeling cross-view interactions at the feature level inside the backbone in a single forward pass. It is calibration-free (requires no camera parameters) and generalizes to unseen camera viewpoints within any multiview configuration.
Result: Achieves >70% Panoptic Quality (PQ) performance on MM-OR and 4D-OR datasets, outperforming previous state-of-the-art methods. The approach demonstrates robust multiview segmentation and spatial understanding in operating rooms.
Conclusion: PanORama substantially enhances multiview segmentation and spatial understanding in ORs, opening new opportunities for surgical perception and assistance by providing reliable, calibration-free panoptic segmentation across multiple viewpoints.
Abstract: Operating rooms (ORs) are cluttered, dynamic, highly occluded environments, where reliable spatial understanding is essential for situational awareness during complex surgical workflows. Achieving spatial understanding for panoptic segmentation from sparse multiview images poses a fundamental challenge, as limited visibility in a subset of views often leads to mispredictions across cameras. To this end, we introduce PanORama, the first panoptic segmentation for the operating room that is multiview-consistent by design. By modeling cross-view interactions at the feature level inside the backbone in a single forward pass, view consistency emerges directly rather than through post-hoc refinement. We evaluate on the MM-OR and 4D-OR datasets, achieving >70% Panoptic Quality (PQ) performance, and outperforming the previous state of the art. Importantly, PanORama is calibration-free, requiring no camera parameters, and generalizes to unseen camera viewpoints within any multiview configuration at inference time. By substantially enhancing multiview segmentation and, consequently, spatial understanding in the OR, we believe our approach opens new opportunities for surgical perception and assistance. Code will be released upon acceptance.
[200] SegVGGT: Joint 3D Reconstruction and Instance Segmentation from Multi-View Images
Jinyuan Qu, Hongyang Li, Lei Zhang
Main category: cs.CV
TL;DR: SegVGGT is a unified end-to-end framework that performs simultaneous 3D reconstruction and instance segmentation directly from multi-view RGB images using object queries and geometric features.
Details
Motivation: Traditional 3D instance segmentation relies on high-quality point clouds or posed RGB-D scans, requiring complex multi-stage pipelines and being sensitive to reconstruction noise. Recent feed-forward transformers for 3D reconstruction lack semantic understanding, creating a gap between geometry reconstruction and high-level semantic tasks.
Method: SegVGGT introduces object queries that interact with multi-level geometric features to integrate instance identification into visual geometry grounded transformers. To address attention dispersion from massive global image tokens, they propose Frame-level Attention Distribution Alignment (FADA) strategy that explicitly guides object queries to attend to instance-relevant frames during training without extra inference overhead.
Result: Extensive experiments show state-of-the-art performance on ScanNetv2 and ScanNet200 datasets, outperforming both recent joint models and RGB-D-based approaches, while demonstrating strong generalization capabilities on ScanNet++.
Conclusion: SegVGGT successfully bridges the gap between 3D reconstruction and semantic understanding by unifying feed-forward 3D reconstruction with instance segmentation in an end-to-end framework, achieving superior performance through integrated geometric and semantic processing.
Abstract: 3D instance segmentation methods typically rely on high-quality point clouds or posed RGB-D scans, requiring complex multi-stage processing pipelines, and are highly sensitive to reconstruction noise. While recent feed-forward transformers have revolutionized multi-view 3D reconstruction, they remain decoupled from high-level semantic understanding. In this work, we present SegVGGT, a unified end-to-end framework that simultaneously performs feed-forward 3D reconstruction and instance segmentation directly from multi-view RGB images. By introducing object queries that interact with multi-level geometric features, our method deeply integrates instance identification into the visual geometry grounded transformer. To address the severe attention dispersion problem caused by the massive number of global image tokens, we propose the Frame-level Attention Distribution Alignment (FADA) strategy. FADA explicitly guides object queries to attend to instance-relevant frames during training, providing structured supervision without extra inference overhead. Extensive experiments demonstrate that SegVGGT achieves the state-of-the-art performance on ScanNetv2 and ScanNet200, outperforming both recent joint models and RGB-D-based approaches, while exhibiting strong generalization capabilities on ScanNet++.
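FADA's supervision signal can be illustrated as a loss between a query's attention mass aggregated per frame and a target distribution over the instance-relevant frames. The abstract does not give the exact objective; the cross-entropy form below is an assumption.

```python
import numpy as np

def fada_loss(attn_over_tokens, token_frame_ids, relevant_frames, n_frames):
    """Align a query's frame-level attention with the instance-relevant frames.

    attn_over_tokens: attention weights of one object query over all image
    tokens (summing to 1); token_frame_ids: the source frame of each token.
    The target is uniform over the frames that contain the instance, and the
    loss is cross-entropy between that target and the aggregated distribution.
    """
    frame_attn = np.zeros(n_frames)
    for a, f in zip(attn_over_tokens, token_frame_ids):
        frame_attn[f] += a  # pool token attention into a per-frame distribution
    target = np.zeros(n_frames)
    target[list(relevant_frames)] = 1.0 / len(relevant_frames)
    return float(-np.sum(target * np.log(frame_attn + 1e-8)))
```

Because the loss only shapes attention during training, it adds no inference-time overhead, consistent with the summary above.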
[201] Chain-of-Adaptation: Surgical Vision-Language Adaptation with Reinforcement Learning
Jiajie Li, Chenhui Xu, Meihuan Liu, Jinjun Xiong
Main category: cs.CV
TL;DR: CoA framework preserves multimodal priors during domain adaptation using structured reasoning and reinforcement learning
Details
Motivation: Conventional fine-tuning on domain-specific datasets can damage pretrained multimodal priors and reduce generalization, creating a need for adaptation methods that maintain core capabilities while integrating domain knowledge.
Method: Chain-of-Adaptation (CoA) framework using structured reasoning format and reinforcement learning to enhance domain alignment without sacrificing general multimodal competence
Result: CoA achieves higher accuracy, stronger generalization, and more stable behavior than supervised fine-tuning on surgical benchmarks in both in-distribution and out-of-distribution settings
Conclusion: CoA effectively preserves the model’s core visual-language abilities while enabling domain specialization, providing a reliable pathway for adapting VLMs to specific domains
Abstract: Conventional fine-tuning on domain-specific datasets can inadvertently alter a model’s pretrained multimodal priors, leading to reduced generalization. To address this, we propose Chain-of-Adaptation (CoA), an adaptation framework designed to integrate domain knowledge while maintaining the model’s inherent reasoning and perceptual capabilities. CoA introduces a structured reasoning format that enhances domain alignment without sacrificing general multimodal competence by reinforcement learning. Experiments on standard surgical benchmarks, under both in-distribution and out-of-distribution settings, demonstrate that CoA achieves higher accuracy, stronger generalization, and more stable behavior than supervised fine-tuning. Furthermore, ablation studies confirm that CoA effectively preserves the model’s core visual-language abilities, providing a reliable pathway for domain specialization in VLMs.
[202] LIORNet: Self-Supervised LiDAR Snow Removal Framework for Autonomous Driving under Adverse Weather Conditions
Ji-il Park, Inwook Shim
Main category: cs.CV
TL;DR: LIORNet: A self-supervised LiDAR denoising network for adverse weather conditions using U-Net++ architecture and pseudo-labels from physical/statistical cues, achieving state-of-the-art performance without manual annotations.
Details
Motivation: LiDAR performance degrades significantly in adverse weather (snow, rain, fog) due to spurious noise points, causing false perception. Existing methods have limitations: distance-based filters struggle to distinguish valid points, intensity-based methods lack adaptability, and learning-based methods require expensive annotations and have generalization issues.
Method: LIORNet uses U-Net++ backbone with self-supervised learning guided by pseudo-labels generated from multiple physical and statistical cues: range-dependent intensity thresholds, snow reflectivity, point sparsity, and sensing range constraints. This eliminates need for manual annotations while leveraging strengths of distance-based, intensity-based, and learning-based approaches.
Result: Extensive experiments on WADS and CADC datasets show LIORNet outperforms state-of-the-art filtering algorithms in both accuracy and runtime while preserving critical environmental features. It demonstrates strong denoising capability in extreme weather conditions.
Conclusion: LIORNet provides a practical and robust solution for LiDAR perception in extreme weather, with strong potential for real-time deployment in autonomous driving systems. It overcomes annotation difficulties and limitations of single-principle approaches through integrated self-supervised learning.
Abstract: LiDAR sensors provide high-resolution 3D perception and long-range detection, making them indispensable for autonomous driving and robotics. However, their performance significantly degrades under adverse weather conditions such as snow, rain, and fog, where spurious noise points dominate the point cloud and lead to false perception. To address this problem, various approaches have been proposed: distance-based filters exploiting spatial sparsity, intensity-based filters leveraging reflectance distributions, and learning-based methods that adapt to complex environments. Nevertheless, distance-based methods struggle to distinguish valid object points from noise, intensity-based methods often rely on fixed thresholds that lack adaptability to changing conditions, and learning-based methods suffer from the high cost of annotation, limited generalization, and computational overhead. In this study, we propose LIORNet, which eliminates these drawbacks and integrates the strengths of all three paradigms. LIORNet is built upon a U-Net++ backbone and employs a self-supervised learning strategy guided by pseudo-labels generated from multiple physical and statistical cues, including range-dependent intensity thresholds, snow reflectivity, point sparsity, and sensing range constraints. This design enables LIORNet to distinguish noise points from environmental structures without requiring manual annotations, thereby overcoming the difficulty of snow labeling and the limitations of single-principle approaches. Extensive experiments on the WADS and CADC datasets demonstrate that LIORNet outperforms state-of-the-art filtering algorithms in both accuracy and runtime while preserving critical environmental features. These results highlight LIORNet as a practical and robust solution for LiDAR perception in extreme weather, with strong potential for real-time deployment in autonomous driving systems.
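The pseudo-label generation combines several weak cues that must agree before a point is labeled as snow. A rough sketch follows; all threshold values and the specific cue combination are illustrative placeholders, not the paper's.

```python
import numpy as np

def snow_pseudo_labels(points, intensity, k=8,
                       a=0.02, b=0.5, sparsity_thresh=0.5, max_range=80.0):
    """Label a point as snow (1) when several weak cues agree.

    points: (n, 3) LiDAR coordinates; intensity: (n,) return strengths.
    Cues: intensity below a range-dependent threshold, high local sparsity
    (mean k-NN distance), and a sensing-range check.
    """
    r = np.linalg.norm(points, axis=1)
    # cue 1: snow returns are weak; allow a higher cutoff at long range
    low_intensity = intensity < (a * r + b)
    # cue 2: snow is spatially sparse -> large mean distance to k neighbours
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    knn_mean = np.sort(d, axis=1)[:, :k].mean(axis=1)
    sparse = knn_mean > sparsity_thresh
    # cue 3: discard returns beyond the usable sensing range
    in_range = r < max_range
    return (low_intensity & sparse & in_range).astype(np.int64)
```

The labels produced this way supervise the U-Net++ denoiser without any manual annotation, which is the core of the self-supervised design.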
[203] Timestep-Aware Block Masking for Efficient Diffusion Model Inference
Haodong He, Yuan Gao, Weizhong Zhang, Gui-Song Xia
Main category: cs.CV
TL;DR: A framework for optimizing diffusion model inference by learning timestep-specific masks to dynamically bypass or reuse features, reducing computational cost while maintaining quality.
Details
Motivation: Diffusion models suffer from high inference latency due to iterative denoising, requiring optimization of computational graphs to accelerate sampling without quality degradation.
Method: Proposes learning timestep-specific masks to determine which blocks to execute or bypass via feature reuse at each denoising step. Uses timestep-aware loss scaling and knowledge-guided mask rectification to prune redundant dependencies, with independent per-timestep optimization for memory efficiency.
Result: Achieves significant efficiency gains across various diffusion architectures (DDPM, LDM, DiT, PixArt) with superior balance between sampling speed and generative quality.
Conclusion: Treating denoising as optimized computational paths enables efficient diffusion model inference while maintaining quality, with architecture-agnostic applicability.
Abstract: Diffusion Probabilistic Models (DPMs) have achieved great success in image generation but suffer from high inference latency due to their iterative denoising nature. Motivated by the evolving feature dynamics across the denoising trajectory, we propose a novel framework to optimize the computational graph of pre-trained DPMs on a per-timestep basis. By learning timestep-specific masks, our method dynamically determines which blocks to execute or bypass through feature reuse at each inference stage. Unlike global optimization methods that incur prohibitive memory costs via full-chain backpropagation, our method optimizes masks for each timestep independently, ensuring a memory-efficient training process. To guide this process, we introduce a timestep-aware loss scaling mechanism that prioritizes feature fidelity during sensitive denoising phases, complemented by a knowledge-guided mask rectification strategy to prune redundant spatial-temporal dependencies. Our approach is architecture-agnostic and demonstrates significant efficiency gains across a broad spectrum of models, including DDPM, LDM, DiT, and PixArt. Experimental results show that by treating the denoising process as a sequence of optimized computational paths, our method achieves a superior balance between sampling speed and generative quality. Our code will be released.
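The per-timestep masking can be sketched as follows; here a masked-out block is simply skipped (identity bypass), a simplification of the paper's feature-reuse mechanism.

```python
def run_with_masks(blocks, x, mask):
    """Execute only the blocks whose mask bit is 1; a masked-out block is
    bypassed (identity skip here -- the paper reuses cached features)."""
    for block, keep in zip(blocks, mask):
        if keep:
            x = block(x)
    return x

def denoise(blocks, x, timestep_masks):
    """Each denoising step runs its own pruned subgraph, given one learned
    binary mask per timestep."""
    for mask in timestep_masks:
        x = run_with_masks(blocks, x, mask)
    return x
```

Because each timestep's mask is optimized independently, no full-chain backpropagation across steps is needed, which is where the memory savings described above come from.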
[204] Cov2Pose: Leveraging Spatial Covariance for Direct Manifold-aware 6-DoF Object Pose Estimation
Nassim Ali Ousalah, Peyman Rostami, Vincent Gaudillière, Emmanuel Koumandakis, Anis Kacem, Enjie Ghorbel, Djamila Aouada
Main category: cs.CV
TL;DR: Direct 6-DoF pose estimation using covariance-pooled SPD matrix features and continuous pose encoding via Cholesky decomposition with Riemannian geometry-aware regression.
Details
Motivation: Direct pose regression methods are efficient but less accurate than indirect methods, lacking spatial second-order statistics and using discontinuous pose representations that lack robustness.
Method: Proposes covariance-pooled representation encoding convolutional features as SPD matrix, novel pose encoding as SPD matrix via Cholesky decomposition, and end-to-end regression with manifold-aware network head considering Riemannian geometry.
Result: Experiments demonstrate relevance of second-order pooling and continuous representations for direct pose regression, including under partial occlusion.
Conclusion: Covariance-pooled SPD representations and continuous pose encoding improve direct 6-DoF pose estimation by capturing spatial statistics and ensuring robustness.
Abstract: In this paper, we address the problem of 6-DoF object pose estimation from a single RGB image. Indirect methods that typically predict intermediate 2D keypoints, followed by a Perspective-n-Point solver, have shown great performance. Direct approaches, which regress the pose in an end-to-end manner, are usually computationally more efficient but less accurate. However, direct heads rely on globally pooled features, ignoring spatial second-order statistics despite their informativeness in pose prediction. They also predict, in most cases, discontinuous pose representations that lack robustness. Herein, we therefore propose a covariance-pooled representation that encodes convolutional feature distributions as a symmetric positive definite (SPD) matrix. Moreover, we propose a novel pose encoding in the form of an SPD matrix via its Cholesky decomposition. Pose is then regressed in an end-to-end manner with a manifold-aware network head, taking into account the Riemannian geometry of SPD matrices. Experiments and ablations consistently demonstrate the relevance of second-order pooling and continuous representations for direct pose regression, including under partial occlusion.
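The two building blocks can be sketched in isolation: covariance pooling of convolutional features into an SPD matrix, and a continuous vector encoding of an SPD matrix via its Cholesky factor. The paper applies the Cholesky encoding to the pose itself; this toy version only illustrates the operations.

```python
import numpy as np

def covariance_pool(features, eps=1e-5):
    """Pool an (n_locations, channels) feature map into an SPD covariance
    matrix, capturing spatial second-order statistics."""
    centered = features - features.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / max(len(features) - 1, 1)
    return cov + eps * np.eye(cov.shape[0])  # regularization keeps it SPD

def spd_to_vector(spd):
    """Continuous encoding of an SPD matrix: stack the lower-triangular
    entries of its Cholesky factor (a smooth, unique parameterization)."""
    L = np.linalg.cholesky(spd)
    i, j = np.tril_indices(spd.shape[0])
    return L[i, j]
```

The Cholesky parameterization is continuous, which is the property the paper relies on to avoid the discontinuities of rotation representations such as Euler angles.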
[205] 2K Retrofit: Entropy-Guided Efficient Sparse Refinement for High-Resolution 3D Geometry Prediction
Tianbao Zhang, Zhenyu Liang, Zhenbo Song, Nana Wang, Xiaomei Zhang, Xudong Cai, Zheng Zhu, Kejian Wu, Gang Wang, Zhaoxin Fan
Main category: cs.CV
TL;DR: 2K Retrofit enables efficient high-resolution geometric prediction for autonomous driving and robotics by using coarse predictions with entropy-based sparse refinement, achieving SOTA accuracy without retraining foundation models.
Details
Motivation: Current geometric foundation models struggle with real-world high-resolution (2K) scenarios due to prohibitive computational and memory demands, making practical deployment challenging for autonomous driving, robotics, and AR/MR applications.
Method: Proposes 2K Retrofit framework that leverages fast coarse predictions followed by entropy-based sparse refinement to selectively enhance high-uncertainty regions, enabling efficient 2K-resolution inference without modifying or retraining the backbone model.
Result: Extensive experiments show 2K Retrofit consistently achieves state-of-the-art accuracy and speed on widely used benchmarks, bridging the gap between research advances and scalable deployment in high-resolution 3D vision applications.
Conclusion: 2K Retrofit provides an effective solution for efficient high-resolution geometric prediction, enabling practical deployment of foundation models in real-world applications without the computational burden of direct 2K inference.
Abstract: High-resolution geometric prediction is essential for robust perception in autonomous driving, robotics, and AR/MR, but current foundation models are fundamentally limited by their scalability to real-world, high-resolution scenarios. Direct inference on 2K images with these models incurs prohibitive computational and memory demands, making practical deployment challenging. To tackle the issue, we present 2K Retrofit, a novel framework that enables efficient 2K-resolution inference for any geometric foundation model, without modifying or retraining the backbone. Our approach leverages fast coarse predictions and an entropy-based sparse refinement to selectively enhance high-uncertainty regions, achieving precise and high-fidelity 2K outputs with minimal overhead. Extensive experiments on widely used benchmarks demonstrate that 2K Retrofit consistently achieves state-of-the-art accuracy and speed, bridging the gap between research advances and scalable deployment in high-resolution 3D vision applications. Code will be released upon acceptance.
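The entropy-guided patch selection can be sketched as follows for a binary per-pixel probability map from the coarse pass; the patch size and refined fraction are illustrative choices, not the paper's.

```python
import numpy as np

def select_uncertain_patches(prob_map, patch=4, top_frac=0.25):
    """Rank patches of a coarse per-pixel probability map by mean binary
    entropy and return indices of the most uncertain fraction, which are
    the only regions sent to the expensive high-resolution refinement."""
    eps = 1e-8
    ent = -(prob_map * np.log(prob_map + eps)
            + (1 - prob_map) * np.log(1 - prob_map + eps))
    h, w = ent.shape
    ph, pw = h // patch, w // patch
    # block-pool the entropy map to one score per patch
    patch_ent = ent[:ph * patch, :pw * patch].reshape(ph, patch, pw, patch).mean(axis=(1, 3))
    k = max(1, int(top_frac * ph * pw))
    flat = np.argsort(patch_ent.ravel())[::-1][:k]
    return [(int(i // pw), int(i % pw)) for i in flat]
```

Refining only this sparse subset is what keeps the 2K pass cheap relative to full-resolution inference.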
[206] MedSPOT: A Workflow-Aware Sequential Grounding Benchmark for Clinical GUI
Rozain Shakeel, Abdul Rahman Mohammad Ali, Muneeb Mushtaq, Tausifa Jan Saleem, Tajamul Ashraf
Main category: cs.CV
TL;DR: MedSPOT introduces a workflow-aware sequential grounding benchmark for clinical GUI environments, focusing on multi-step visual reasoning in medical software interfaces.
Details
Motivation: Existing GUI benchmarks focus on isolated single-step grounding queries, overlooking the sequential, workflow-driven reasoning required in real-world medical interfaces where tasks evolve across multiple steps and dynamic interface states.
Method: Created MedSPOT benchmark with 216 task-driven videos and 597 annotated keyframes, modeling procedural interaction as sequences of structured spatial decisions. Introduced strict sequential evaluation protocol that terminates assessment upon first incorrect grounding prediction, and developed comprehensive failure taxonomy for systematic diagnosis.
Result: The benchmark captures interface hierarchies, contextual dependencies, and fine-grained spatial precision under evolving conditions, enabling evaluation of error propagation in multi-step clinical workflows.
Conclusion: MedSPOT establishes a realistic and safety-critical benchmark for assessing multimodal models in medical software environments by shifting evaluation from isolated grounding to workflow-aware sequential reasoning.
Abstract: Despite the rapid progress of Multimodal Large Language Models (MLLMs), their ability to perform reliable visual grounding in high-stakes clinical software environments remains underexplored. Existing GUI benchmarks largely focus on isolated, single-step grounding queries, overlooking the sequential, workflow-driven reasoning required in real-world medical interfaces, where tasks evolve across independent steps and dynamic interface states. We introduce MedSPOT, a workflow-aware sequential grounding benchmark for clinical GUI environments. Unlike prior benchmarks that treat grounding as a standalone prediction task, MedSPOT models procedural interaction as a sequence of structured spatial decisions. The benchmark comprises 216 task-driven videos with 597 annotated keyframes, in which each task consists of 2 to 3 interdependent grounding steps within realistic medical workflows. This design captures interface hierarchies, contextual dependencies, and fine-grained spatial precision under evolving conditions. To evaluate procedural robustness, we propose a strict sequential evaluation protocol that terminates task assessment upon the first incorrect grounding prediction, explicitly measuring error propagation in multi-step workflows. We further introduce a comprehensive failure taxonomy, including edge bias, small-target errors, no prediction, near miss, far miss, and toolbar confusion, to enable systematic diagnosis of model behavior in clinical GUI settings. By shifting evaluation from isolated grounding to workflow-aware sequential reasoning, MedSPOT establishes a realistic and safety-critical benchmark for assessing multimodal models in medical software environments. Code and data are available at: https://github.com/Tajamul21/MedSPOT.
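The strict sequential protocol is straightforward to state in code: credit accumulates only until the first grounding failure. The `point_in_box` hit criterion is one common grounding check, used here as an assumption since the benchmark's exact criterion is not given in the abstract.

```python
def point_in_box(point, box):
    """A common grounding check: the predicted click falls inside the gold box."""
    x, y = point
    x0, y0, x1, y1 = box
    return x0 <= x <= x1 and y0 <= y <= y1

def sequential_score(predictions, gold_boxes, hit=point_in_box):
    """Strict sequential protocol: count correct steps until the first
    failure, then terminate the task, so a model earns no credit for
    later steps once the workflow has gone off-track."""
    correct = 0
    for pred, gold in zip(predictions, gold_boxes):
        if not hit(pred, gold):
            break
        correct += 1
    return correct / len(gold_boxes)
```

Terminating at the first miss is what lets the benchmark measure error propagation rather than per-step accuracy in isolation.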
[207] NEC-Diff: Noise-Robust Event-RAW Complementary Diffusion for Seeing Motion in Extreme Darkness
Haoyue Liu, Jinghan Xu, Luxin Feng, Hanyu Zhou, Haozhi Zhao, Yi Chang, Luxin Yan
Main category: cs.CV
TL;DR: NEC-Diff: A diffusion-based framework for low-light imaging using RAW images and event cameras to reconstruct fine scene structures under extreme darkness.
Details
Motivation: High-quality imaging in extremely low-light conditions is challenging due to photon scarcity causing severe noise and texture loss. Event cameras offer high dynamic range and motion sensitivity but existing approaches focus on texture recovery while ignoring image noise and event noise, hindering accurate reconstruction under photon-starved conditions.
Method: Proposes NEC-Diff, a diffusion-based event-RAW hybrid imaging framework that: (1) combines linear light-response of RAW images with brightness-change nature of events for physics-driven dual-modal denoising, and (2) dynamically estimates SNR of both modalities to guide adaptive feature fusion for reliable cue injection into diffusion process. Also constructs REAL dataset with 47,800 pixel-aligned low-light RAW images, events, and references.
Result: Extensive experiments demonstrate superiority of NEC-Diff under extreme darkness. The framework effectively reconstructs fine scene structures from heavily noisy signals in extremely low-light conditions (0.001-0.8 lux illumination).
Conclusion: NEC-Diff successfully addresses the challenge of low-light imaging by leveraging complementary strengths of RAW images and event cameras through a diffusion-based framework with physics-driven constraints and adaptive feature fusion, enabling high-fidelity visual reconstruction under photon-starved conditions.
Abstract: High-quality imaging of dynamic scenes in extremely low-light conditions is highly challenging. Photon scarcity induces severe noise and texture loss, causing significant image degradation. Event cameras, featuring a high dynamic range (120 dB) and high sensitivity to motion, serve as powerful complements to conventional cameras by offering crucial cues for preserving subtle textures. However, most existing approaches emphasize texture recovery from events, while paying little attention to image noise or the intrinsic noise of events themselves, which ultimately hinders accurate pixel reconstruction under photon-starved conditions. In this work, we propose NEC-Diff, a novel diffusion-based event-RAW hybrid imaging framework that extracts reliable information from heavily noisy signals to reconstruct fine scene structures. The framework is driven by two key insights: (1) combining the linear light-response property of RAW images with the brightness-change nature of events to establish a physics-driven constraint for robust dual-modal denoising; and (2) dynamically estimating the SNR of both modalities based on denoising results to guide adaptive feature fusion, thereby injecting reliable cues into the diffusion process for high-fidelity visual reconstruction. Furthermore, we construct the REAL (Raw and Event Acquired in Low-light) dataset, which provides 47,800 pixel-aligned low-light RAW images, events, and high-quality references under 0.001-0.8 lux illumination. Extensive experiments demonstrate the superiority of NEC-Diff under extreme darkness. The project is available at: https://github.com/jinghan-xu/NEC-Diff.
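The SNR-guided adaptive fusion can be sketched as a softmax weighting of the two modalities' features by their estimated SNRs. The paper's estimator is learned and operates on feature maps, so this scalar version is only illustrative.

```python
import numpy as np

def snr_weighted_fusion(feat_img, feat_evt, snr_img, snr_evt):
    """Fuse RAW-image and event features, weighting each modality by a
    softmax over its estimated SNR so the more reliable modality dominates
    (a scalar sketch of the adaptive feature fusion)."""
    w = np.exp([snr_img, snr_evt])
    w = w / w.sum()
    return w[0] * feat_img + w[1] * feat_evt
```

When one modality is badly degraded (low SNR), its contribution is smoothly down-weighted rather than discarded, which is the behavior the adaptive fusion aims for.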
[208] Diffusion-Based Makeup Transfer with Facial Region-Aware Makeup Features
Zheng Gao, Debin Meng, Yunqi Miao, Zhensong Zhang, Songcen Xu, Ioannis Patras, Jifei Song
Main category: cs.CV
TL;DR: FRAM proposes a facial region-aware makeup transfer method using fine-tuned makeup CLIP and learnable tokens for regional control, with identity preservation via ControlNet Union.
Details
Motivation: Current diffusion-based makeup transfer methods have two limitations: (1) off-the-shelf foundation models (like CLIP) struggle to capture makeup styles, and (2) they perform global makeup transfer without facial region-aware control for specific areas like eyes and mouth.
Method: Two-stage approach: (1) Fine-tune CLIP on synthetic makeup data generated using GPT-o3 and text-driven image editing; (2) Use learnable tokens to query the makeup CLIP encoder for facial region-aware makeup features, and employ ControlNet Union to encode source image and 3D mesh for identity preservation.
Result: Experimental results verify superiority in regional controllability and makeup transfer performance compared to existing methods.
Conclusion: FRAM enables better facial region-aware makeup transfer with improved regional control and makeup style preservation through fine-tuned makeup CLIP and identity injection techniques.
Abstract: Current diffusion-based makeup transfer methods commonly use the makeup information encoded by off-the-shelf foundation models (e.g., CLIP) as condition to preserve the makeup style of reference image in the generation. Although effective, these works mainly have two limitations: (1) foundation models pre-trained for generic tasks struggle to capture makeup styles; (2) the makeup features of reference image are injected to the diffusion denoising model as a whole for global makeup transfer, overlooking the facial region-aware makeup features (i.e., eyes, mouth, etc) and limiting the regional controllability for region-specific makeup transfer. To address these, in this work, we propose Facial Region-Aware Makeup features (FRAM), which has two stages: (1) makeup CLIP fine-tuning; (2) identity and facial region-aware makeup injection. For makeup CLIP fine-tuning, unlike prior works using off-the-shelf CLIP, we synthesize annotated makeup style data using GPT-o3 and text-driven image editing model, and then use the data to train a makeup CLIP encoder through self-supervised and image-text contrastive learning. For identity and facial region-aware makeup injection, we construct before-and-after makeup image pairs from the edited images in stage 1 and then use them to learn to inject identity of source image and makeup of reference image to the diffusion denoising model for makeup transfer. Specifically, we use learnable tokens to query the makeup CLIP encoder to extract facial region-aware makeup features for makeup injection, which is learned via an attention loss to enable regional control. As for identity injection, we use a ControlNet Union to encode source image and its 3D mesh simultaneously. The experimental results verify the superiority of our regional controllability and our makeup transfer performance.
[209] LumosX: Relate Any Identities with Their Attributes for Personalized Video Generation
Jiazheng Xing, Fei Du, Hangjie Yuan, Pengwei Liu, Hongbin Xu, Hai Ci, Ruigang Niu, Weihua Chen, Fan Wang, Yong Liu
Main category: cs.CV
TL;DR: LumosX is a framework for personalized multi-subject video generation that addresses face-attribute alignment challenges through both data curation and novel attention mechanisms.
Details
Motivation: Existing text-to-video diffusion models struggle with precise face-attribute alignment across multiple subjects, lacking explicit mechanisms for intra-group consistency and subject-attribute dependencies.
Method: Combines data curation using MLLMs to extract relational priors with novel Relational Self-Attention and Relational Cross-Attention mechanisms that intertwine position-aware embeddings with refined attention dynamics.
Result: Achieves state-of-the-art performance in fine-grained, identity-consistent, and semantically aligned personalized multi-subject video generation on their comprehensive benchmark.
Conclusion: LumosX advances both data and model design for personalized video generation, enabling better control over subject-attribute dependencies and intra-group consistency.
Abstract: Recent advances in diffusion models have significantly improved text-to-video generation, enabling personalized content creation with fine-grained control over both foreground and background elements. However, precise face-attribute alignment across subjects remains challenging, as existing methods lack explicit mechanisms to ensure intra-group consistency. Addressing this gap requires both explicit modeling strategies and face-attribute-aware data resources. We therefore propose LumosX, a framework that advances both data and model design. On the data side, a tailored collection pipeline orchestrates captions and visual cues from independent videos, while multimodal large language models (MLLMs) infer and assign subject-specific dependencies. These extracted relational priors impose a finer-grained structure that amplifies the expressive control of personalized video generation and enables the construction of a comprehensive benchmark. On the modeling side, Relational Self-Attention and Relational Cross-Attention intertwine position-aware embeddings with refined attention dynamics to inscribe explicit subject-attribute dependencies, enforcing disciplined intra-group cohesion and amplifying the separation between distinct subject clusters. Comprehensive evaluations on our benchmark demonstrate that LumosX achieves state-of-the-art performance in fine-grained, identity-consistent, and semantically aligned personalized multi-subject video generation. Code and models are available at https://jiazheng-xing.github.io/lumosx-home/.
[210] CFCML: A Coarse-to-Fine Crossmodal Learning Framework For Disease Diagnosis Using Multimodal Images and Tabular Data
Tianling Liu, Hongying Liu, Fanhua Shang, Lequan Yu, Tong Han, Liang Wan
Main category: cs.CV
TL;DR: A coarse-to-fine crossmodal learning framework for medical diagnosis that progressively reduces modality gaps between medical images and tabular data through multi-granularity feature exploration and hierarchical anchor-based contrastive learning.
Details
Motivation: Clinical diagnosis requires integrating medical images and tabular data, but significant modality gaps hinder crossmodal diagnostic accuracy. Existing methods focus on high-level encoder outputs, neglecting local image information and task-relevant feature extraction.
Method: Proposes CFCML framework with two stages: 1) Coarse stage explores relationships between multi-granularity features from different image encoder stages and tabular data; 2) Fine stage generates unimodal/crossmodal prototypes with class-aware information and uses hierarchical anchor-based relationship mining strategy with contrastive learning to reduce modality gaps.
Result: Outperforms state-of-the-art methods with improvements of 1.53% and 0.91% in AUC metrics on MEN and Derm7pt datasets respectively.
Conclusion: The CFCML framework effectively reduces modality gaps between medical images and tabular data through progressive coarse-to-fine learning, improving crossmodal diagnostic accuracy by better integrating multimodal information.
Abstract: In clinical practice, crossmodal information including medical images and tabular data is essential for disease diagnosis. There exists a significant modality gap between these data types, which obstructs advancements in crossmodal diagnostic accuracy. Most existing crossmodal learning (CML) methods primarily focus on exploring relationships among high-level encoder outputs, leading to the neglect of local information in images. Additionally, these methods often overlook the extraction of task-relevant information. In this paper, we propose a novel coarse-to-fine crossmodal learning (CFCML) framework to progressively reduce the modality gap between multimodal images and tabular data, by thoroughly exploring inter-modal relationships. At the coarse stage, we explore the relationships between multi-granularity features from various image encoder stages and tabular information, facilitating a preliminary reduction of the modality gap. At the fine stage, we generate unimodal and crossmodal prototypes that incorporate class-aware information, and establish a hierarchical anchor-based relationship mining (HRM) strategy to further diminish the modality gap and extract discriminative crossmodal information. This strategy utilizes modality samples, unimodal prototypes, and crossmodal prototypes as anchors to develop contrastive learning approaches, effectively enhancing inter-class disparity while reducing intra-class disparity from multiple perspectives. Experimental results indicate that our method outperforms the state-of-the-art (SOTA) methods, achieving improvements of 1.53% and 0.91% in AUC metrics on the MEN and Derm7pt datasets, respectively. The code is available at https://github.com/IsDling/CFCML.
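The anchor idea in HRM — using class prototypes alongside samples as contrastive anchors — can be illustrated with a toy prototype-anchored loss. This is an editor's sketch in the spirit of the strategy, not the paper's implementation; the embeddings, temperature, and exact loss form are illustrative assumptions.

```python
import numpy as np

def prototype_anchor_loss(z, labels, tau=0.5):
    """Prototype-anchored contrastive term: build one prototype per class
    as the mean embedding, then score each sample against all prototypes
    with a softmax over cosine similarities. Lower loss = tighter classes
    (smaller intra-class disparity) and better-separated prototypes."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    classes = sorted(set(labels))
    protos = np.stack([z[[i for i, l in enumerate(labels) if l == c]].mean(0)
                       for c in classes])
    protos = protos / np.linalg.norm(protos, axis=1, keepdims=True)
    logits = z @ protos.T / tau
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = [classes.index(l) for l in labels]
    return -np.mean(logp[np.arange(len(z)), idx])

z = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
tight = prototype_anchor_loss(z, [0, 0, 1, 1])   # labels match the clusters
mixed = prototype_anchor_loss(z, [0, 1, 0, 1])   # labels cut across clusters
```

When the labels align with the embedding clusters the loss is lower, which is the pressure the paper exploits to enlarge inter-class and shrink intra-class disparity.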
[211] From Masks to Pixels and Meaning: A New Taxonomy, Benchmark, and Metrics for VLM Image Tampering
Xinyi Shang, Yi Tang, Jiacheng Cui, Ahmed Elhagry, Salwa K. Al Khatib, Sondos Mahmoud Bsharat, Jiacheng Liu, Xiaohan Zhao, Jing-Hao Xue, Hao Li, Salman Khan, Zhiqiang Shen
Main category: cs.CV
TL;DR: PIXAR reformulates VLM image tampering detection from coarse region masks to pixel-grounded, meaning-aware tasks with taxonomy, benchmark, and evaluation framework.
Details
Motivation: Existing tampering detection benchmarks rely on object masks that misalign with true edit signals, treating untouched pixels inside masks as tampered while ignoring subtle edits outside masks.
Method: 1) Introduces taxonomy spanning edit primitives and semantic classes; 2) Releases benchmark with per-pixel tamper maps and category supervision; 3) Proposes training framework and evaluation metrics for pixel-level correctness, localization, and semantic understanding.
Result: Reveals substantial over- and under-scoring using mask-only metrics, exposes failure modes on micro-edits and off-mask changes, and establishes rigorous standard for tamper localization, semantic classification and description.
Conclusion: The framework advances tampering detection from masks to pixels, meanings and language descriptions, providing more accurate evaluation of VLM image tampering understanding.
Abstract: Existing tampering detection benchmarks largely rely on object masks, which severely misalign with the true edit signal: many pixels inside a mask are untouched or only trivially modified, while subtle yet consequential edits outside the mask are treated as natural. We reformulate VLM image tampering from coarse region labels to a pixel-grounded, meaning and language-aware task. First, we introduce a taxonomy spanning edit primitives (replace/remove/splice/inpaint/attribute/colorization, etc.) and their semantic class of tampered object, linking low-level changes to high-level understanding. Second, we release a new benchmark with per-pixel tamper maps and paired category supervision to evaluate detection and classification within a unified protocol. Third, we propose a training framework and evaluation metrics that quantify pixel-level correctness with localization to assess confidence or prediction on true edit intensity, and further measure tamper meaning understanding via semantics-aware classification and natural language descriptions for the predicted regions. We also re-evaluate strong existing segmentation/localization baselines and recent tamper detectors, revealing substantial over- and under-scoring under mask-only metrics and exposing failure modes on micro-edits and off-mask changes. Our framework advances the field from masks to pixels, meanings and language descriptions, establishing a rigorous standard for tamper localization, semantic classification and description. Code and benchmark data are available at https://github.com/VILA-Lab/PIXAR.
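The over-scoring problem with mask-only metrics is easy to reproduce numerically. The toy geometry below is hypothetical (not from the benchmark): a detector that predicts the whole object mask looks perfect under mask IoU even though most of those pixels were never edited.

```python
import numpy as np

def iou(pred, gt):
    """Intersection-over-union of two boolean maps."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0

# Coarse object mask: a full 6x6 object region.
object_mask = np.zeros((10, 10), dtype=bool)
object_mask[2:8, 2:8] = True

# True per-pixel tamper map: only a thin 6x1 strip was actually edited.
tamper_map = np.zeros((10, 10), dtype=bool)
tamper_map[2:8, 2:3] = True

# A detector that simply predicts the object mask:
pred = object_mask
mask_score = iou(pred, object_mask)    # perfect under the mask-only metric
pixel_score = iou(pred, tamper_map)    # low once untouched pixels count
```

The gap between `mask_score` and `pixel_score` is exactly the over-scoring the benchmark's per-pixel tamper maps are designed to expose.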
[212] MFil-Mamba: Multi-Filter Scanning for Spatial Redundancy-Aware Visual State Space Models
Puskal Khadka, KC Santosh
Main category: cs.CV
TL;DR: MFil-Mamba is a novel visual state space architecture that uses multi-filter scanning to adapt Mamba SSMs for computer vision tasks, achieving SOTA performance across multiple benchmarks.
Details
Motivation: Extending State Space Models (SSMs) like Mamba to computer vision is challenging due to the non-sequential nature of visual data and complex 2D spatial dependencies. Existing approaches rely on redundant traversal strategies that distort spatial relationships.
Method: Proposes MFil-Mamba with a multi-filter scanning backbone that enables each scan to capture unique spatial information while minimizing redundancy. Includes adaptive weighting mechanism to fuse outputs from multiple scans and architectural enhancements.
Result: Achieves superior performance over SOTA models: 83.2% top-1 accuracy on ImageNet-1K, 47.3% box AP and 42.7% mask AP on MS COCO, and 48.5% mIoU on ADE20K dataset.
Conclusion: MFil-Mamba successfully adapts SSMs to vision tasks by addressing spatial dependency challenges through multi-filter scanning, demonstrating strong performance across diverse vision benchmarks.
Abstract: State Space Models (SSMs), especially recent Mamba architecture, have achieved remarkable success in sequence modeling tasks. However, extending SSMs to computer vision remains challenging due to the non-sequential structure of visual data and its complex 2D spatial dependencies. Although several early studies have explored adapting selective SSMs for vision applications, most approaches primarily depend on employing various traversal strategies over the same input. This introduces redundancy and distorts the intricate spatial relationships within images. To address these challenges, we propose MFil-Mamba, a novel visual state space architecture built on a multi-filter scanning backbone. Unlike fixed multi-directional traversal methods, our design enables each scan to capture unique and contextually relevant spatial information while minimizing redundancy. Furthermore, we incorporate an adaptive weighting mechanism to effectively fuse outputs from multiple scans in addition to architectural enhancements. MFil-Mamba achieves superior performance over existing state-of-the-art models across various benchmarks that include image classification, object detection, instance segmentation, and semantic segmentation. For example, our tiny variant attains 83.2% top-1 accuracy on ImageNet-1K, 47.3% box AP and 42.7% mask AP on MS COCO, and 48.5% mIoU on the ADE20K dataset. Code and models are available at https://github.com/puskal-khadka/MFil-Mamba.
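A minimal sketch of fusing multiple scan outputs with adaptive weights; the softmax gate here is an editor's assumption for illustration, and the paper's learned weighting mechanism may differ.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse_scans(scan_feats, gate_logits):
    """Adaptive fusion: weight each scan's feature map by a per-scan gate
    (a softmax over gate logits) and take the weighted sum."""
    w = softmax(gate_logits)                  # one weight per scan
    return np.tensordot(w, scan_feats, axes=1)

# Two hypothetical scan outputs over a 4x4 feature map:
feats = np.stack([np.ones((4, 4)), 3.0 * np.ones((4, 4))])
fused = fuse_scans(feats, np.array([0.0, 0.0]))  # equal gates -> mean
```

In a trained model the gate logits would be produced by a small network from the input, letting the fusion adapt per image rather than stay fixed.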
[213] Preference-Guided Debiasing for No-Reference Enhancement Image Quality Assessment
Shiqi Gao, Kang Fu, Zitong Xu, Huiyu Duan, Xiongkuo Min, Jia Wang, Guangtao Zhai
Main category: cs.CV
TL;DR: A preference-guided debiasing framework for no-reference enhancement image quality assessment (EIQA) that learns enhancement-preference embeddings and removes algorithm-specific biases to improve generalization across different enhancement algorithms.
Details
Motivation: Current NR-IQA models for enhanced images overfit to specific enhancement algorithm patterns rather than evaluating genuine perceptual quality, leading to poor generalization across different enhancement methods.
Method: Two-stage approach: 1) Learn continuous enhancement-preference embedding space using supervised contrastive learning to group similar enhancement styles; 2) Estimate and remove enhancement-induced nuisance components from quality representations before quality regression to focus on algorithm-invariant perceptual cues.
Result: Extensive experiments on public EIQA benchmarks show the method effectively mitigates algorithm-induced representation bias and achieves superior robustness and cross-algorithm generalization compared to existing approaches.
Conclusion: The proposed preference-guided debiasing framework successfully addresses the generalization problem in EIQA by separating enhancement-specific patterns from genuine quality assessment, enabling more robust cross-algorithm performance.
Abstract: Current no-reference image quality assessment (NR-IQA) models for enhanced images often struggle to generalize, as they tend to overfit to the distinct patterns of specific enhancement algorithms rather than evaluating genuine perceptual quality. To address this issue, we propose a preference-guided debiasing framework for no-reference enhancement image quality assessment (EIQA). Specifically, we first learn a continuous enhancement-preference embedding space using supervised contrastive learning, where images generated by similar enhancement styles are encouraged to have closer representations. Based on this, we further estimate the enhancement-induced nuisance component contained in the raw quality representation and remove it before quality regression. In this way, the model is guided to focus on algorithm-invariant perceptual quality cues instead of enhancement-specific visual fingerprints. To facilitate stable optimization, we adopt a two-stage training strategy that first learns the enhancement-preference space and then performs debiased quality prediction. Extensive experiments on public EIQA benchmarks demonstrate that the proposed method effectively mitigates algorithm-induced representation bias and achieves superior robustness and cross-algorithm generalization compared with existing approaches.
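The first stage relies on supervised contrastive learning over enhancement-style labels. A standard SupCon-style loss can be sketched on toy 2-D embeddings; the embeddings, labels, and temperature below are illustrative, not the paper's.

```python
import numpy as np

def supcon_loss(z, labels, tau=0.1):
    """Supervised contrastive loss: for each anchor, pull embeddings
    sharing its label (here, the same enhancement style) closer and
    push the rest away."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)            # exclude self-pairs
    logp = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    total, n = 0.0, len(z)
    for i in range(n):
        pos = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if pos:
            total += -logp[i, pos].mean()
    return total / n

# Two enhancement styles whose embeddings already cluster:
z = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
aligned = supcon_loss(z, [0, 0, 1, 1])    # labels match the clusters
shuffled = supcon_loss(z, [0, 1, 0, 1])   # labels cut across clusters
```

Minimizing this loss is what shapes the continuous preference space so that images from similar enhancement styles end up with nearby representations.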
[214] Generalizable NGP-SR: Generalizable Neural Radiance Fields Super-Resolution via Neural Graph Primitives
Wanqi Yuan, Omkar Sharad Mayekar, Connor Pennington, Nianyi Li
Main category: cs.CV
TL;DR: Generalizable NGP-SR is a 3D-aware super-resolution framework that reconstructs high-resolution radiance fields directly from low-resolution posed images using Neural Graphics Primitives, enabling view-consistent HR novel view synthesis without per-scene optimization.
Details
Motivation: NeRF achieves photorealistic novel view synthesis but becomes costly for high-resolution rendering, requiring dense sampling and larger models. 2D super-resolution of per-view renderings breaks multi-view consistency, creating a need for 3D-aware super-resolution that maintains consistency across views.
Method: Built on Neural Graphics Primitives (NGP), NGP-SR conditions radiance prediction on 3D coordinates and learned local texture tokens. This enables recovery of high-frequency details within the radiance field itself, producing view-consistent HR novel views without external HR references or post-hoc 2D upsampling. The model is generalizable and can be applied to unseen scenes without per-scene optimization.
Result: Experiments on multiple datasets show that NGP-SR consistently improves both reconstruction quality and runtime efficiency over prior NeRF-based super-resolution methods, offering a practical solution for scalable high-resolution novel view synthesis.
Conclusion: NGP-SR provides an effective 3D-aware super-resolution framework that enables high-resolution, view-consistent novel view synthesis from low-resolution inputs, with generalization capability to unseen scenes and improved efficiency over existing methods.
Abstract: Neural Radiance Fields (NeRF) achieve photorealistic novel view synthesis but become costly when high-resolution (HR) rendering is required, as HR outputs demand dense sampling and higher-capacity models. Moreover, naively super-resolving per-view renderings in 2D often breaks multi-view consistency. We propose Generalizable NGP-SR, a 3D-aware super-resolution framework that reconstructs an HR radiance field directly from low-resolution (LR) posed images. Built on Neural Graphics Primitives (NGP), NGP-SR conditions radiance prediction on 3D coordinates and learned local texture tokens, enabling recovery of high-frequency details within the radiance field and producing view-consistent HR novel views without external HR references or post-hoc 2D upsampling. Importantly, our model is generalizable: once trained, it can be applied to unseen scenes and rendered from novel viewpoints without per-scene optimization. Experiments on multiple datasets show that NGP-SR consistently improves both reconstruction quality and runtime efficiency over prior NeRF-based super-resolution methods, offering a practical solution for scalable high-resolution novel view synthesis.
[215] Synergistic Perception and Generative Recomposition: A Multi-Agent Orchestration for Expert-Level Building Inspection
Hui Zhong, Yichun Gao, Luyan Liu, Xusen Guo, Zhaonian Kuang, Qiming Zhang, Xinhu Zheng
Main category: cs.CV
TL;DR: FacadeFixer: A multi-agent framework for building facade defect inspection that uses collaborative reasoning between detection, segmentation, and generative agents to handle complex defects and data scarcity through semantic recomposition and data augmentation.
Details
Motivation: Building facade defect inspection faces challenges due to extreme geometric variability, low contrast against complex backgrounds, composite defects, severe pixel imbalance, feature ambiguity, and critical scarcity of high-quality pixel-level annotations, which hinder generalization of existing models.
Method: Proposes FacadeFixer, a unified multi-agent framework that treats defect perception as collaborative reasoning. It orchestrates specialized agents for detection and segmentation to handle multi-type defect interference, working with a generative agent for semantic recomposition. This decouples intricate defects from noisy backgrounds and synthesizes them onto diverse clean textures to generate high-fidelity augmented data with precise masks.
Result: FacadeFixer significantly outperforms state-of-the-art baselines, excels in capturing pixel-level structural anomalies, and demonstrates generative synthesis as a robust solution to data scarcity in infrastructure inspection. A comprehensive multi-task dataset covering six primary facade categories with pixel-level annotations was introduced.
Conclusion: The multi-agent collaborative reasoning approach effectively addresses facade defect inspection challenges, with generative synthesis providing a powerful solution to annotation scarcity. The framework and dataset advance infrastructure inspection capabilities.
Abstract: Building facade defect inspection is fundamental to structural health monitoring and sustainable urban maintenance, yet it remains a formidable challenge due to extreme geometric variability, low contrast against complex backgrounds, and the inherent complexity of composite defects (e.g., cracks co-occurring with spalling). Such characteristics lead to severe pixel imbalance and feature ambiguity, which, coupled with the critical scarcity of high-quality pixel-level annotations, hinder the generalization of existing detection and segmentation models. To address these gaps, we propose FacadeFixer, a unified multi-agent framework that treats defect perception as a collaborative reasoning task rather than isolated recognition. Specifically, FacadeFixer orchestrates specialized agents for detection and segmentation to handle multi-type defect interference, working in tandem with a generative agent to enable semantic recomposition. This process decouples intricate defects from noisy backgrounds and realistically synthesizes them onto diverse clean textures, generating high-fidelity augmented data with precise expert-level masks. To support this, we introduce a comprehensive multi-task dataset covering six primary facade categories with pixel-level annotations. Extensive experiments demonstrate that FacadeFixer significantly outperforms state-of-the-art (SOTA) baselines. Specifically, it excels in capturing pixel-level structural anomalies and highlights generative synthesis as a robust solution to data scarcity in infrastructure inspection. Our code and dataset will be made publicly available.
[216] Can Large Multimodal Models Inspect Buildings? A Hierarchical Benchmark for Structural Pathology Reasoning
Hui Zhong, Yichun Gao, Luyan Liu, Hai Yang, Wang Wang, Haowei Zhang, Xinhu Zheng
Main category: cs.CV
TL;DR: DefectBench: A multi-dimensional benchmark evaluating Large Multimodal Models (LMMs) for building facade defect inspection, assessing semantic perception, spatial localization, and generative geometry segmentation capabilities.
Details
Motivation: Traditional building facade inspection uses specialized discriminative models that lack visual understanding and reasoning capabilities. LMMs promise active reasoning but lack rigorous evaluation standards for high-stakes engineering domains.
Method: Created a human-in-the-loop semi-automated annotation framework to unify 12 datasets into standardized hierarchical ontology. Developed DefectBench to evaluate 18 SOTA LMMs across three cognitive dimensions: Semantic Perception, Spatial Localization, and Generative Geometry Segmentation.
Result: Current LMMs show exceptional topological awareness and semantic understanding but significant deficiencies in metric localization precision. Validated viability of zero-shot generative segmentation where general-purpose foundation models can rival specialized supervised networks without domain-specific training.
Conclusion: Provides rigorous benchmarking standard and high-quality open-source database for autonomous AI agents in civil engineering, establishing new baseline for LMM evaluation in engineering domains.
Abstract: Automated building facade inspection is a critical component of urban resilience and smart city maintenance. Traditionally, this field has relied on specialized discriminative models (e.g., YOLO, Mask R-CNN) that excel at pixel-level localization but are constrained to passive perception and generalize poorly, lacking the visual understanding to interpret structural topology. Large Multimodal Models (LMMs) promise a paradigm shift toward active reasoning, yet their application in such high-stakes engineering domains lacks rigorous evaluation standards. To bridge this gap, we introduce a human-in-the-loop semi-automated annotation framework, leveraging expert-proposal verification to unify 12 fragmented datasets into a standardized, hierarchical ontology. Building on this foundation, we present DefectBench, the first multi-dimensional benchmark designed to interrogate LMMs beyond basic semantic recognition. DefectBench evaluates 18 state-of-the-art (SOTA) LMMs across three escalating cognitive dimensions: Semantic Perception, Spatial Localization, and Generative Geometry Segmentation. Extensive experiments reveal that while current LMMs demonstrate exceptional topological awareness and semantic understanding (effectively diagnosing “what” and “how”), they exhibit significant deficiencies in metric localization precision (“where”). Crucially, however, we validate the viability of zero-shot generative segmentation, showing that general-purpose foundation models can rival specialized supervised networks without domain-specific training. This work provides both a rigorous benchmarking standard and a high-quality open-source database, establishing a new baseline for the advancement of autonomous AI agents in civil engineering.
[217] TinyML Enhances CubeSat Mission Capabilities
Luigi Capogrosso, Michele Magno
Main category: cs.CV
TL;DR: TinyML pipeline for optimizing CNN models for CubeSat onboard image classification using pruning, quantization, and hardware-aware mapping to STM32N6 MCU with NPU.
Details
Motivation: Traditional Earth observation requires transmitting raw imagery to ground stations, which is infeasible for CubeSats due to computational, energy, and bandwidth constraints. Need for onboard processing solutions.
Method: Pipeline integrates structured iterative pruning, post-training INT8 quantization, and hardware-aware operator mapping to compress CNN models for deployment on STM32N6 microcontroller with Arm Cortex-M55 and Neural-ART NPU.
Result: Achieved average 89.55% RAM reduction and 70.09% Flash memory reduction with task-acceptable accuracy (0.4-8.6 pp drop). Energy consumption: 0.68-6.45 mJ per inference, latency: 3.22-30.38 ms.
Conclusion: The TinyML pipeline enables efficient onboard EO processing on CubeSats, satisfying stringent energy budgets and real-time constraints while reducing downlink bandwidth requirements.
Abstract: Earth observation (EO) missions traditionally rely on transmitting raw or minimally processed imagery from satellites to ground stations for computationally intensive analysis. This paradigm is infeasible for CubeSat systems due to stringent constraints on the onboard embedded processors, energy availability, and communication bandwidth. To overcome these limitations, the paper presents a TinyML-based Convolutional Neural Networks (ConvNets) model optimization and deployment pipeline for onboard image classification, enabling accurate, energy-efficient, and hardware-aware inference under CubeSat-class constraints. Our pipeline integrates structured iterative pruning, post-training INT8 quantization, and hardware-aware operator mapping to compress models and align them with the heterogeneous compute architecture of the STM32N6 microcontroller from STMicroelectronics. This Microcontroller Unit (MCU) integrates a novel Arm Cortex-M55 core and a Neural-ART Neural Processing Unit (NPU), providing a realistic proxy for CubeSat onboard computers. The paper evaluates the proposed approach on three EO benchmark datasets (i.e., EuroSAT, RS_C11, MEDIC) and four models (i.e., SqueezeNet, MobileNetV3, EfficientNet, MCUNetV1). We demonstrate an average reduction in RAM usage of 89.55% and Flash memory of 70.09% for the optimized models, significantly decreasing downlink bandwidth requirements while maintaining task-acceptable accuracy (with a drop ranging from 0.4 to 8.6 percentage points compared to the Float32 baseline). The energy consumption per inference ranges from 0.68 mJ to 6.45 mJ, with latency spanning from 3.22 ms to 30.38 ms. These results fully satisfy the stringent energy budgets and real-time constraints required for efficient onboard EO processing.
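One ingredient of the pipeline, post-training INT8 quantization, can be sketched as symmetric per-tensor quantization. In practice the STM32N6 toolchain handles this (and the operator mapping) itself; the snippet below only illustrates the numeric idea and its bounded reconstruction error.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor post-training INT8 quantization: map float
    weights to [-127, 127] through a single scale factor."""
    amax = np.abs(w).max()
    scale = amax / 127.0 if amax > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.27, 0.01, 1.0], dtype=np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
max_err = np.abs(w - w_hat).max()   # bounded by half the quantization step
```

Storing `q` instead of `w` is what drives the 4x Flash saving for weights; the accuracy drop the paper reports (0.4 to 8.6 pp) comes from this rounding error accumulating through the network.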
[218] LagerNVS: Latent Geometry for Fully Neural Real-time Novel View Synthesis
Stanislaw Szymanowicz, Minghao Chen, Jianyuan Wang, Christian Rupprecht, Andrea Vedaldi
Main category: cs.CV
TL;DR: LagerNVS: A 3D-aware neural network for novel view synthesis that uses 3D reconstruction pre-training and achieves state-of-the-art results with real-time rendering
Details
Motivation: While neural networks can perform 3D tasks like novel view synthesis without explicit 3D reconstruction, the authors argue that strong 3D inductive biases are still beneficial for network design and performance.
Method: LagerNVS uses an encoder-decoder architecture with 3D-aware latent features. The encoder is initialized from a 3D reconstruction network pre-trained with explicit 3D supervision, paired with a lightweight decoder, and trained end-to-end with photometric losses.
Result: Achieves state-of-the-art deterministic feed-forward novel view synthesis (31.4 PSNR on Re10k), works with and without known cameras, renders in real time, generalizes to in-the-wild data, and can be paired with a diffusion decoder for generative extrapolation.
Conclusion: Incorporating 3D inductive biases through pre-trained 3D reconstruction networks significantly improves novel view synthesis performance while maintaining efficiency and generalization capabilities.
Abstract: Recent work has shown that neural networks can perform 3D tasks such as Novel View Synthesis (NVS) without explicit 3D reconstruction. Even so, we argue that strong 3D inductive biases are still helpful in the design of such networks. We show this point by introducing LagerNVS, an encoder-decoder neural network for NVS that builds on '3D-aware' latent features. The encoder is initialized from a 3D reconstruction network pre-trained using explicit 3D supervision. This is paired with a lightweight decoder, and trained end-to-end with photometric losses. LagerNVS achieves state-of-the-art deterministic feed-forward Novel View Synthesis (including 31.4 PSNR on Re10k), with and without known cameras, renders in real time, generalizes to in-the-wild data, and can be paired with a diffusion decoder for generative extrapolation.
[219] Improving Image-to-Image Translation via a Rectified Flow Reformulation
Satoshi Iizuka, Shun Okamoto, Kazuhiro Fukui
Main category: cs.CV
TL;DR: I2I-RFR reformulates standard image-to-image regression networks as continuous-time transport models via noise-augmented inputs and t-reweighted pixel loss, enabling progressive refinement at inference while maintaining simple supervised training.
Details
Motivation: Standard pixel-wise I2I regression often over-smooths ill-posed and multimodal targets, while generative alternatives require complex pipelines and task-specific tuning. There's a need for a lightweight method that combines the simplicity of regression with the refinement capabilities of generative models.
Method: Augments backbone input by channel-wise concatenation with noise-corrupted ground-truth targets and optimizes a t-reweighted pixel loss. This admits a rectified-flow interpretation via induced velocity field, enabling ODE-based progressive refinement at inference with few explicit solver steps.
Result: Extensive experiments across image-to-image translation and video restoration tasks show general performance improvements across various tasks and backbones, with clear gains in perceptual quality and detail preservation.
Conclusion: I2I-RFR provides a lightweight way to incorporate continuous-time refinement into conventional I2I models without requiring heavy generative pipelines, requiring only input channel expansion and minimal inference steps.
Abstract: In this work, we propose Image-to-Image Rectified Flow Reformulation (I2I-RFR), a practical plug-in reformulation that recasts standard I2I regression networks as continuous-time transport models. While pixel-wise I2I regression is simple, stable, and easy to adapt across tasks, it often over-smooths ill-posed and multimodal targets, whereas generative alternatives often require additional components, task-specific tuning, and more complex training and inference pipelines. Our method augments the backbone input by channel-wise concatenation with a noise-corrupted version of the ground-truth target and optimizes a simple t-reweighted pixel loss. This objective admits a rectified-flow interpretation via an induced velocity field, enabling ODE-based progressive refinement at inference time while largely preserving the standard supervised training pipeline. In most cases, adopting I2I-RFR requires only expanding the input channels, and inference can be performed with a few explicit solver steps (e.g., 3 steps) without distillation. Extensive experiments across multiple image-to-image translation and video restoration tasks show that I2I-RFR generally improves performance across a wide range of tasks and backbones, with particularly clear gains in perceptual quality and detail preservation. Overall, I2I-RFR provides a lightweight way to incorporate continuous-time refinement into conventional I2I models without requiring a heavy generative pipeline.
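The rectified-flow interpretation rests on a straight-line path between noise and target, whose induced velocity field is constant along the path; with that velocity available exactly, a few explicit Euler steps recover the target. The sketch below uses an oracle velocity in place of the trained network (which in I2I-RFR would be conditioned on the source image); the toy 4x4 "image" is an illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=(4, 4))     # ground-truth target "image"
eps = rng.normal(size=(4, 4))   # noise sample

def noisy_target(y, eps, t):
    """Straight-line path used by rectified flow: x_t = (1 - t) * eps + t * y,
    so x_0 is pure noise and x_1 is the clean target. A corrupted x_t is what
    gets channel-concatenated to the backbone input during training."""
    return (1.0 - t) * eps + t * y

# The velocity induced by this straight path is constant:
v = y - eps

# Explicit Euler refinement in 3 steps, using the oracle velocity
# in place of the network's prediction:
x, dt = eps.copy(), 1.0 / 3.0
for _ in range(3):
    x = x + dt * v
# x now coincides with y up to floating-point error.
```

This is why a handful of solver steps (the paper mentions 3) suffices at inference: the straighter the learned flow, the closer each Euler step is to exact.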
[220] MuSteerNet: Human Reaction Generation from Videos via Observation-Reaction Mutual Steering
Yuan Zhou, Yongzhi Li, Yanqi Dai, Xingyu Zhu, Yi Tan, Qingshan Xu, Beier Zhu, Richang Hong, Hanwang Zhang
Main category: cs.CV
TL;DR: MuSteerNet generates 3D human reactions from videos using mutual steering between observations and reactions to address relational distortion issues in existing methods.
Details
Motivation: Existing video-driven human reaction generation methods fail to effectively leverage video inputs to steer human reaction synthesis, resulting in mismatched reactions due to severe relational distortion between visual observations and reaction types.
Method: Proposes MuSteerNet with two key components: 1) Prototype Feedback Steering to mitigate relational distortion by refining visual observations using gated delta-rectification modulator and relational margin constraint guided by prototypical vectors, and 2) Dual-Coupled Reaction Refinement that leverages rectified visual cues to steer refinement of generated reaction motions.
Result: Extensive experiments and ablation studies validate the effectiveness of the method, achieving competitive performance in generating 3D human reactions from videos.
Conclusion: MuSteerNet effectively addresses relational distortion in video-driven human reaction generation through mutual steering between observations and reactions, improving reaction quality and alignment with video content.
Abstract: Video-driven human reaction generation aims to synthesize 3D human motions that directly react to observed video sequences, which is crucial for building human-like interactive AI systems. However, existing methods often fail to effectively leverage video inputs to steer human reaction synthesis, resulting in reaction motions that are mismatched with the content of video sequences. We reveal that this limitation arises from a severe relational distortion between visual observations and reaction types. In light of this, we propose MuSteerNet, a simple yet effective framework that generates 3D human reactions from videos via observation-reaction mutual steering. Specifically, we first propose a Prototype Feedback Steering mechanism to mitigate relational distortion by refining visual observations with a gated delta-rectification modulator and a relational margin constraint, guided by prototypical vectors learned from human reactions. We then introduce Dual-Coupled Reaction Refinement that fully leverages rectified visual cues to further steer the refinement of generated reaction motions, thereby effectively improving reaction quality and enabling MuSteerNet to achieve competitive performance. Extensive experiments and ablation studies validate the effectiveness of our method. Code coming soon: https://github.com/zhouyuan888888/MuSteerNet.
[221] Wildfire Spread Scenarios: Increasing Sample Diversity of Segmentation Diffusion Models with Training-Free Methods
Sebastian Gerard, Josephine Sullivan
Main category: cs.CV
TL;DR: Training-free sampling methods improve sample diversity in ambiguous segmentation tasks using diffusion models, applied to medical, urban, and wildfire datasets.
Details
Motivation: Predicting future states in uncertain environments requires models that can consider multiple plausible outcomes. Diffusion models can learn multi-modal distributions but naive sampling is computationally inefficient, requiring many samples to find low-probability modes that may still be operationally relevant.
Method: Adapt particle guidance and SPELL techniques (originally for diverse natural image generation) to discrete segmentation tasks, and propose a simple clustering-based technique. Validate on LIDC medical dataset, modified Cityscapes dataset, and MMFire wildfire spread dataset.
Result: Compared to naive sampling, these approaches increase the HM IoU* metric by up to 7.5% on MMFire and 16.4% on Cityscapes, demonstrating that training-free methods can efficiently increase sample diversity of segmentation diffusion models with little cost to image quality and runtime.
Conclusion: Training-free sampling methods can effectively improve sample diversity in ambiguous segmentation tasks across different domains including medical imaging, urban scenes, and wildfire prediction.
Abstract: Predicting future states in uncertain environments, such as wildfire spread, medical diagnosis, or autonomous driving, requires models that can consider multiple plausible outcomes. While diffusion models can effectively learn such multi-modal distributions, naively sampling from these models is computationally inefficient, potentially requiring hundreds of samples to find low-probability modes that may still be operationally relevant. In this work, we address the challenge of sample-efficient ambiguous segmentation by evaluating several training-free sampling methods that encourage diverse predictions. We adapt two techniques, particle guidance and SPELL, originally designed for the generation of diverse natural images, to discrete segmentation tasks, and additionally propose a simple clustering-based technique. We validate these approaches on the LIDC medical dataset, a modified version of the Cityscapes dataset, and MMFire, a new simulation-based wildfire spread dataset introduced in this paper. Compared to naive sampling, these approaches increase the HM IoU* metric by up to 7.5% on MMFire and 16.4% on Cityscapes, demonstrating that training-free methods can be used to efficiently increase the sample diversity of segmentation diffusion models with little cost to image quality and runtime. Code and dataset: https://github.com/SebastianGer/wildfire-spread-scenarios
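The abstract does not detail the clustering-based selection step; as a stand-in for the general idea of picking diverse representatives from a pool of sampled masks, here is a greedy farthest-point selection sketch under the assumption that diversity is measured as 1 − IoU (the helper name and distance choice are illustrative, not the paper's method):

```python
import numpy as np

def select_diverse(masks, k):
    """Greedily pick k masks that are mutually far apart (farthest-point
    selection), using 1 - IoU as the distance between binary masks.
    A stand-in for representative selection over a diffusion sample pool.
    """
    n = len(masks)
    flat = masks.reshape(n, -1).astype(bool)

    def iou(a, b):
        inter = np.logical_and(a, b).sum()
        union = np.logical_or(a, b).sum()
        return inter / union if union else 1.0

    chosen = [0]  # seed with the first sample
    min_dist = np.array([1.0 - iou(flat[0], f) for f in flat])
    while len(chosen) < k:
        nxt = int(np.argmax(min_dist))  # farthest from everything chosen
        chosen.append(nxt)
        d = np.array([1.0 - iou(flat[nxt], f) for f in flat])
        min_dist = np.minimum(min_dist, d)
    return chosen
```

Running this over a few hundred naive samples would return a small, diverse subset, which is the efficiency gain the paper targets.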
[222] CoVR-R: Reason-Aware Composed Video Retrieval
Omkar Thawakar, Dmitry Demidov, Vaishnav Potlapalli, Sai Prasanna Teja Reddy Bogireddy, Viswanatha Reddy Gajjala, Alaa Mostafa Lasheen, Rao Muhammad Anwer, Fahad Khan
Main category: cs.CV
TL;DR: A zero-shot composed video retrieval method that uses large multimodal models to reason about causal and temporal after-effects of textual modifications on reference videos, outperforming baselines on challenging implicit-effect cases.
Details
Motivation: Prior composed video retrieval methods assume modification text fully specifies visual changes, overlooking implicit consequences like motion, state transitions, viewpoint changes, and duration cues that emerge from edits. The authors argue that successful CoVR requires reasoning about these after-effects.
Method: A reasoning-first, zero-shot approach that leverages large multimodal models to: (1) infer causal and temporal consequences implied by the edit, and (2) align the resulting reasoned queries to candidate videos without task-specific finetuning. Also introduces CoVR-Reason benchmark with structured reasoning traces and challenging distractors.
Result: The zero-shot method outperforms strong retrieval baselines on recall at K, particularly excelling on implicit-effect subsets. Automatic and human analysis confirm higher step consistency and effect factuality in retrieved results.
Conclusion: Incorporating reasoning into general-purpose multimodal models enables effective CoVR by explicitly accounting for causal and temporal after-effects, reducing dependence on task-specific supervision, improving generalization to challenging implicit-effect cases, and enhancing interpretability of retrieval outcomes.
Abstract: Composed Video Retrieval (CoVR) aims to find a target video given a reference video and a textual modification. Prior work assumes the modification text fully specifies the visual changes, overlooking after-effects and implicit consequences (e.g., motion, state transitions, viewpoint or duration cues) that emerge from the edit. We argue that successful CoVR requires reasoning about these after-effects. We introduce a reasoning-first, zero-shot approach that leverages large multimodal models to (i) infer causal and temporal consequences implied by the edit, and (ii) align the resulting reasoned queries to candidate videos without task-specific finetuning. To evaluate reasoning in CoVR, we also propose CoVR-Reason, a benchmark that pairs each (reference, edit, target) triplet with structured internal reasoning traces and challenging distractors that require predicting after-effects rather than keyword matching. Experiments show that our zero-shot method outperforms strong retrieval baselines on recall at K and particularly excels on implicit-effect subsets. Our automatic and human analyses confirm higher step consistency and effect factuality in our retrieved results. Our findings show that incorporating reasoning into general-purpose multimodal models enables effective CoVR by explicitly accounting for causal and temporal after-effects. This reduces dependence on task-specific supervision, improves generalization to challenging implicit-effect cases, and enhances interpretability of retrieval outcomes. These results point toward a scalable and principled framework for explainable video search. The model, code, and benchmark are available at https://github.com/mbzuai-oryx/CoVR-R.
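Recall at K, the metric reported above, simply checks whether the target video appears among the top K retrieved candidates for each query. A minimal sketch (the inputs are hypothetical ranked lists, not the authors' evaluation code):

```python
def recall_at_k(rankings, targets, k):
    """Fraction of queries whose ground-truth target appears in the
    top-k retrieved candidates.

    rankings: list of ranked candidate-id lists, one per query.
    targets:  the ground-truth id for each query.
    """
    hits = sum(t in r[:k] for r, t in zip(rankings, targets))
    return hits / len(targets)
```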
[223] Deterministic Mode Proposals: An Efficient Alternative to Generative Sampling for Ambiguous Segmentation
Sebastian Gerard, Josephine Sullivan
Main category: cs.CV
TL;DR: A deterministic framework for generating multiple segmentation proposals in a single forward pass, addressing ambiguity in tasks like medical imaging where multiple predictions are equally valid.
Details
Motivation: Current methods for ambiguous segmentation tasks rely on generative models that require expensive stochastic sampling and post-hoc clustering. There's a need for more efficient deterministic approaches that can directly generate likely outcomes.
Method: Introduces mode proposal models - a deterministic framework that produces a fixed-size set of proposal masks in one forward pass. Adapts confidence mechanisms from object detection to handle superfluous proposals in high-dimensional segmentation space. Can be trained without knowing full outcome distributions and can estimate prior mode probabilities by decomposing velocity fields from pre-trained flow models.
Result: Significantly reduces inference time while achieving higher ground-truth coverage than existing generative models. Applicable to real-world datasets where full outcome distributions are unknown.
Conclusion: The proposed deterministic framework offers an efficient alternative to stochastic sampling for ambiguous segmentation tasks, with practical advantages for real-world applications.
Abstract: Many segmentation tasks, such as medical image segmentation or future state prediction, are inherently ambiguous, meaning that multiple predictions are equally correct. Current methods typically rely on generative models to capture this uncertainty. However, identifying the underlying modes of the distribution with these methods is computationally expensive, requiring large numbers of samples and post-hoc clustering. In this paper, we shift the focus from stochastic sampling to the direct generation of likely outcomes. We introduce mode proposal models, a deterministic framework that efficiently produces a fixed-size set of proposal masks in a single forward pass. To handle superfluous proposals, we adapt a confidence mechanism, traditionally used in object detection, to the high-dimensional space of segmentation masks. Our approach significantly reduces inference time while achieving higher ground-truth coverage than existing generative models. Furthermore, we demonstrate that our model can be trained without knowing the full distribution of outcomes, making it applicable to real-world datasets. Finally, we show that by decomposing the velocity field of a pre-trained flow model, we can efficiently estimate prior mode probabilities for our proposals.
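The result above is stated in terms of "ground-truth coverage". The paper's exact definition is not given here, but a plausible reading of such a metric (hedged sketch; the IoU threshold and function name are assumptions) counts a ground-truth mask as covered when at least one proposal overlaps it sufficiently:

```python
import numpy as np

def ground_truth_coverage(proposals, ground_truths, tau=0.5):
    """Fraction of ground-truth masks matched by at least one proposal,
    where a match means IoU >= tau. Illustrative sketch of a
    coverage-style metric; the paper's definition may differ.
    """
    def iou(a, b):
        inter = np.logical_and(a, b).sum()
        union = np.logical_or(a, b).sum()
        return inter / union if union else 1.0

    covered = sum(
        any(iou(gt, p) >= tau for p in proposals) for gt in ground_truths)
    return covered / len(ground_truths)
```

A fixed-size proposal set that scores well under such a metric captures the distinct modes of the outcome distribution without stochastic sampling.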
[224] MME-CoF-Pro: Evaluating Reasoning Coherence in Video Generative Models with Text and Visual Hints
Yu Qi, Xinyi Xu, Ziyu Guo, Siyuan Ma, Renrui Zhang, Xinyan Chen, Ruichuan An, Ruofan Xing, Jiayi Zhang, Haojie Huang, Pheng-Ann Heng, Jonathan Tremblay, Lawson L. S. Wong
Main category: cs.CV
TL;DR: MME-CoF-Pro is a benchmark for evaluating reasoning coherence in video generative models, assessing whether generated events remain causally consistent across frames.
Details
Motivation: Current video generative models show emerging reasoning behaviors, but there's a gap in evaluating whether generated events maintain causal consistency across frames (reasoning coherence). Existing literature lacks comprehensive evaluation methods for this important property.
Method: Proposes MME-CoF-Pro benchmark with 303 samples across 16 categories covering visual logical to scientific reasoning. Introduces Reasoning Score metric for process-level intermediate reasoning steps, with three evaluation settings: no hint, text hint, and visual hint to study reasoning hint guidance mechanisms.
Result: Evaluation of 7 open and closed-source video models reveals: (1) Video generative models exhibit weak reasoning coherence decoupled from generation quality, (2) Text hints boost apparent correctness but cause inconsistency and hallucinated reasoning, (3) Visual hints benefit structured perceptual tasks but struggle with fine-grained perception.
Conclusion: The benchmark reveals significant gaps in reasoning coherence of current video generative models and provides insights into how different hint types affect reasoning processes, highlighting the need for improved reasoning consistency in video generation.
Abstract: Video generative models show emerging reasoning behaviors. For reliable deployment, it is essential that generated events remain causally consistent across frames, a property we define as reasoning coherence. To fill the gap in the literature on reasoning coherence evaluation, we propose MME-CoF-Pro, a comprehensive benchmark that assesses reasoning coherence in video models. Specifically, MME-CoF-Pro contains 303 samples across 16 categories, ranging from visual logical to scientific reasoning. It introduces the Reasoning Score, an evaluation metric for assessing the necessary process-level intermediate reasoning steps, and includes three evaluation settings, (a) no hint, (b) text hint, and (c) visual hint, enabling a controlled investigation into the underlying mechanisms of reasoning hint guidance. Evaluating 7 open- and closed-source video models reveals insights including: (1) video generative models exhibit weak reasoning coherence, decoupled from generation quality; (2) text hints boost apparent correctness but often cause inconsistency and hallucinated reasoning; (3) visual hints benefit structured perceptual tasks but struggle with fine-grained perception. Website: https://video-reasoning-coherence.github.io/
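The abstract describes the Reasoning Score only as a process-level check of intermediate steps. As a hypothetical illustration of what process-level scoring can look like (the in-order matching rule and function name are assumptions, not the benchmark's metric), a step-coverage sketch:

```python
def reasoning_score(required_steps, observed_steps):
    """Fraction of required intermediate steps observed in order.

    Credit accrues only while the observed sequence follows the required
    order, so a causally coherent chain scores higher than the same
    steps shuffled. Hypothetical sketch of a process-level metric.
    """
    if not required_steps:
        return 1.0
    i = 0
    for step in observed_steps:
        if i < len(required_steps) and step == required_steps[i]:
            i += 1
    return i / len(required_steps)
```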
[225] Behavioral Engagement in VR-Based Sign Language Learning: Visual Attention as a Predictor of Performance and Temporal Dynamics
Davide Traini, José Manuel Alcalde-Llergo, Mariana Buenestado-Fernández, Domenico Ursino, Enrique Yeguas-Bolívar
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.19535 returned HTTP 429 (rate limited).
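This and the following entries failed with HTTP 429, arXiv's rate-limit response. The standard remedy is to retry with exponential backoff; a hedged sketch (the `fetch` callable, exception shape, and delay values are illustrative, not the digest pipeline's actual code):

```python
import time

def fetch_with_backoff(fetch, url, max_retries=5, base_delay=1.0,
                       sleep=time.sleep):
    """Call fetch(url), retrying on HTTP 429 with exponential backoff.

    `fetch` is assumed to raise an exception carrying .status == 429 when
    rate limited; delays grow 1s, 2s, 4s, ... (illustrative values).
    Any other error, or exhausting the retries, re-raises.
    """
    for attempt in range(max_retries):
        try:
            return fetch(url)
        except Exception as exc:
            if getattr(exc, "status", None) != 429 or attempt == max_retries - 1:
                raise
            sleep(base_delay * (2 ** attempt))
```

Injecting `sleep` keeps the helper testable; in production one would also honor a `Retry-After` header when the server sends one.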
[226] HiFiGaze: Improving Eye Tracking Accuracy Using Screen Content Knowledge
Taejun Kim, Vimal Mollyn, Riku Arakawa, Chris Harrison
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2603.19588 returned HTTP 429 (rate limited).
[227] Deep Face Restoration: A Survey
Tao Wang, Kaihao Zhang, Jiankang Deng, Tong Lu, Wei Liu, Stefanos Zafeiriou
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2211.02831 returned HTTP 429 (rate limited).
[228] Hyper-STTN: Hypergraph Augmented Spatial-Temporal Transformer Network for Trajectory Prediction
Weizheng Wang, Baijian Yang, Sungeun Hong, Wenhai Sun, Byung-Cheol Min
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2401.06344 returned HTTP 429 (rate limited).
[229] SRGS: Super-Resolution 3D Gaussian Splatting
Xiang Feng, Yongbo He, Linxi Chen, Yan Yang, Chengkai Wang, Yifei Chen, Yixuan Zhong, Zhenzhong Kuang, Jiajun ding, Xufei Yin, Yanming Zhu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2404.10318 returned HTTP 429 (rate limited).
[230] A Survey of AI-Generated Video Evaluation
Xiao Liu, Xinhao Xiang, Zizhong Li, Yongheng Wang, Zhuoheng Li, Zhuosheng Liu, Weidi Zhang, Weiqi Ye, Jiawei Zhang
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2410.19884 returned HTTP 429 (rate limited).
[231] AtGCN: A Graph Convolutional Network For Ataxic Gait Detection
Karan Bania, Tanmay Verlekar
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2410.22862 returned HTTP 429 (rate limited).
[232] Pseudo-Simulation for Autonomous Driving
Wei Cao, Marcel Hallgarten, Tianyu Li, Daniel Dauner, Xunjiang Gu, Caojun Wang, Yakov Miron, Marco Aiello, Hongyang Li, Igor Gilitschenski, Boris Ivanovic, Marco Pavone, Andreas Geiger, Kashyap Chitta
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2506.04218 returned HTTP 429 (rate limited).
[233] GoDe: Gaussians on Demand for Progressive Level of Detail and Scalable Compression
Francesco Di Sario, Riccardo Renzulli, Marco Grangetto, Akihiro Sugimoto, Enzo Tartaglione
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2501.13558 returned HTTP 429 (rate limited).
[234] On the Theory of Bias Tuning in Event Cameras
David El-Chai Ben-Ezra, Daniel Brisk, Adar Tal
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2501.18788 returned HTTP 429 (rate limited).
[235] SASNet: Spatially-Adaptive Sinusoidal Networks for INRs
Haoan Feng, Diana Aldana, Tiago Novello, Leila De Floriani
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2503.09750 returned HTTP 429 (rate limited).
[236] Exploring Disentangled and Controllable Human Image Synthesis: From End-to-End to Stage-by-Stage
Zhengwentai Sun, Chenghong Li, Hongjie Liao, Xihe Yang, Keru Zheng, Heyuan Li, Yihao Zhi, Shuliang Ning, Shuguang Cui, Xiaoguang Han
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2503.19486 returned HTTP 429 (rate limited).
[237] Superclass-Guided Representation Disentanglement for Spurious Correlation Mitigation
Chenruo Liu, Hongjun Liu, Zeyu Lai, Yiqiu Shen, Chen Zhao, Qi Lei
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2508.08570 returned HTTP 429 (rate limited).
[238] DreamCS: Geometry-Aware Text-to-3D Generation with Unpaired 3D Reward Supervision
Xiandong Zou, Ruihao Xia, Hongsong Wang, Pan Zhou
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2506.09814 returned HTTP 429 (rate limited).
[239] PoseMaster: A Unified 3D Native Framework for Stylized Pose Generation
Hongyu Yan, Kunming Luo, Weiyu Li, Yixun Liang, Shengming Li, Jingwei Huang, Chunchao Guo, Ping Tan
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2506.21076 returned HTTP 429 (rate limited).
[240] LED Benchmark: Diagnosing Structural Layout Errors for Document Layout Analysis
Inbum Heo, Taewook Hwang, Jeesu Jung, Sangkeun Jung
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2507.23295 returned HTTP 429 (rate limited).
[241] Hyperbolic Cycle Alignment for Infrared-Visible Image Fusion
Timing Li, Bing Cao, Jiahe Feng, Haifang Cao, Qinghau Hu, Pengfei Zhu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2507.23508 returned HTTP 429 (rate limited).
[242] CARES: Context-Aware Resolution Selector for VLMs
Moshe Kimhi, Nimrod Shabtay, Raja Giryes, Chaim Baskin, Eli Schwartz
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2510.19496 returned HTTP 429 (rate limited).
[243] Stand-In: A Lightweight and Plug-and-Play Identity Control for Video Generation
Bowen Xue, Zheng-Peng Duan, Qixin Yan, Wenjing Wang, Hao Liu, Chun-Le Guo, Chongyi Li, Chen Li, Jing Lyu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2508.07901 returned HTTP 429 (rate limited).
[244] RobotArena $\infty$: Scalable Robot Benchmarking via Real-to-Sim Translation
Yash Jangir, Yidi Zhang, Pang-Chi Lo, Kashu Yamazaki, Chenyu Zhang, Kuan-Hsun Tu, Tsung-Wei Ke, Lei Ke, Yonatan Bisk, Katerina Fragkiadaki
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2510.23571 returned HTTP 429 (rate limited).
[245] ZOO-Prune: Training-Free Token Pruning via Zeroth-Order Gradient Estimation in Vision-Language Models
Youngeun Kim, Youjia Zhang, Huiling Liu, Aecheon Jung, Sunwoo Lee, Sungeun Hong
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2509.24837 returned HTTP 429 (rate limited).
[246] VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning
Qunzhong Wang, Jie Liu, Jiajun Liang, Yilei Jiang, Yuanxing Zhang, Yaozhi Zheng, Xintao Wang, Pengfei Wan, Xiangyu Yue, Jiaheng Liu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2510.10518 returned HTTP 429 (rate limited).
[247] Prompt-based Adaptation in Large-scale Vision Models: A Survey
Xi Xiao, Yunbei Zhang, Lin Zhao, Yiyang Liu, Xiaoying Liao, Zheda Mai, Xingjian Li, Xiao Wang, Hao Xu, Jihun Hamm, Xue Lin, Min Xu, Qifan Wang, Tianyang Wang, Cheng Han
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2510.13219 returned HTTP 429 (rate limited).
[248] Discovering Intersectional Bias via Directional Alignment in Face Recognition Embeddings
Ignacio Serna
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2510.15520 returned HTTP 429 (rate limited).
[249] 3D-Consistent Multi-View Editing by Correspondence Guidance
Josef Bengtson, David Nilsson, Dong In Lee, Yaroslava Lochman, Fredrik Kahl
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2511.22228 returned HTTP 429 (rate limited).
[250] CompAgent: An Agentic Framework for Visual Compliance Verification
Rahul Ghosh, Baishali Chaudhury, Hari Prasanna Das, Meghana Ashok, Ryan Razkenari, Long Chen, Sungmin Hong, Chun-Hao Liu
Main category: cs.CV
TL;DR: Summary unavailable; the arXiv API request for 2511.00171 returned HTTP 429 (rate limited).
[251] CageDroneRF: A Large-Scale RF Benchmark and Toolkit for Drone Perception
Mohammad Rostami, Atik Faysal, Hongtao Xia, Hadi Kasasbeh, Ziang Gao, Huaxia Wang
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2601.03302 returned HTTP 429 (rate limited).
[252] Locally-Supervised Global Image Restoration
Benjamin Walder, Daniel Toader, Robert Nuster, Günther Paltauf, Peter Burgholzer, Gregor Langer, Lukas Krainer, Markus Haltmeier
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2511.01998 returned HTTP 429 (rate limited).
[253] Camera-Aware Cross-View Alignment for Referring 3D Gaussian Splatting Segmentation
Yuwen Tao, Kanglei Zhou, Xin Tan, Yuan Xie
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2511.03992 returned HTTP 429 (rate limited).
[254] Multimodal Continual Instruction Tuning with Dynamic Gradient Guidance
Songze Li, Mingyu Gao, Tonghua Su, Xu-Yao Zhang, Zhongjie Wang
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2511.15164 returned HTTP 429 (rate limited).
[255] FastMMoE: Accelerating Multimodal Large Language Models through Dynamic Expert Activation and Routing-Aware Token Pruning
Guoyang Xia, Yifeng Ding, Fengfa Li, Lei Ren, Wei Chen, Fangxiang Feng, Xiaojie Wang
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2511.17885 returned HTTP 429 (rate limited).
[256] Rethinking Test Time Scaling for Flow-Matching Generative Models
Qingtao Yu, Changlin Song, Minghao Sun, Zhengyang Yu, Vinay Kumar Verma, Soumya Roy, Sumit Negi, Hongdong Li, Dylan Campbell
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2511.22242 returned HTTP 429 (rate limited).
[257] CryoHype: Reconstructing a thousand cryo-EM structures with transformer-based hypernetworks
Jeffrey Gu, Minkyu Jeon, Ambri Ma, Serena Yeung-Levy, Ellen D. Zhong
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2512.06332 returned HTTP 429 (rate limited).
[258] EagleVision: A Dual-Stage Framework with BEV-grounding-based Chain-of-Thought for Spatial Intelligence
Jiaxu Wan, Xu Wang, Mengwei Xie, Hang Zhang, Mu Xu, Yang Han, Hong Zhang, Ding Yuan, Yifan Yang
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2512.15160 returned HTTP 429 (rate limited).
[259] From Intuition to Investigation: A Tool-Augmented Reasoning MLLM Framework for Generalizable Face Anti-Spoofing
Haoyuan Zhang, Keyao Wang, Guosheng Zhang, Haixiao Yue, Zhiwen Tan, Siran Peng, Tianshuo Zhang, Xiao Tan, Kunbin Chen, Wei He, Jingdong Wang, Ajian Liu, Xiangyu Zhu, Zhen Lei
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2603.01038 returned HTTP 429 (rate limited).
[260] FeatureSLAM: Feature-enriched 3D gaussian splatting SLAM in real time
Christopher Thirgood, Oscar Mendez, Erin Ling, Jon Storey, Simon Hadfield
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2601.05738 returned HTTP 429 (rate limited).
[261] Guiding Diffusion-based Reconstruction with Contrastive Signals for Balanced Visual Representation
Boyu Han, Qianqian Xu, Shilong Bao, Zhiyong Yang, Ruochen Cui, Xilin Zhao, Qingming Huang
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2603.04803 returned HTTP 429 (rate limited).
[262] Towards Open Environments and Instructions: General Vision-Language Navigation via Fast-Slow Interactive Reasoning
Yang Li, Aming Wu, Zihao Zhang, Yahong Han
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2601.09111 returned HTTP 429 (rate limited).
[263] RayRoPE: Projective Ray Positional Encoding for Multi-view Attention
Yu Wu, Minsik Jeon, Jen-Hao Rick Chang, Oncel Tuzel, Shubham Tulsiani
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2601.15275 returned HTTP 429 (rate limited).
[264] VirPro: Visual-referred Probabilistic Prompt Learning for Weakly-Supervised Monocular 3D Detection
Chupeng Liu, Jiyong Rao, Shangquan Sun, Runkai Zhao, Weidong Cai
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2603.17470 returned HTTP 429 (rate limited).
[265] ReLaX: Reasoning with Latent Exploration for Large Reasoning Models
Shimin Zhang, Xianwei Chen, Yufan Shen, Ziyuan Ye, Jibin Wu
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2512.07558 returned HTTP 429 (rate limited).
[266] Uncertainty-Aware Counterfactual Traffic Signal Control with Predictive Safety and Starvation-Avoidance Constraints Using Vision-Based Sensing
Jayawant Bodagala, Balaji Bodagala
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2602.07784 returned HTTP 429 (rate limited).
[267] S3T-Former: A Purely Spike-Driven State-Space Topology Transformer for Skeleton Action Recognition
Naichuan Zheng, Hailun Xia, Zepeng Sun, Weiyi Li, Yujia Wang
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2603.18062 returned HTTP 429 (rate limited).
[268] ReMoT: Reinforcement Learning with Motion Contrast Triplets
Cong Wan, Zeyu Guo, Jiangyang Li, SongLin Dong, Yifan Bai, Lin Peng, Zhiheng Ma, Yihong Gong
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2603.00461 returned HTTP 429 (rate limited).
[269] Vision-TTT: Efficient and Expressive Visual Representation Learning with Test-Time Training
Quan Kong, Yanru Xiao, Yuhao Shen, Cong Wang
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2603.00518 returned HTTP 429 (rate limited).
[270] GazeShift: Unsupervised Gaze Estimation and Dataset for VR
Gil Shapira, Ishay Goldin, Evgeny Artyomov, Donghoon Kim, Yosi Keller, Niv Zehngut
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2603.07832 returned HTTP 429 (rate limited).
[271] Layer Consistency Matters: Elegant Latent Transition Discrepancy for Generalizable Synthetic Image Detection
Yawen Yang, Feng Li, Shuqi Kong, Yunfeng Diao, Xinjian Gao, Zenglin Shi, Meng Wang
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2603.10598 returned HTTP 429 (rate limited).
[272] A$^2$-Edit: Precise Reference-Guided Image Editing of Arbitrary Objects and Ambiguous Masks
Huayu Zheng, Guangzhao Li, Baixuan Zhao, Siqi Luo, Hantao Jiang, Guangtao Zhai, Xiaohong Liu
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2603.10685 returned HTTP 429 (rate limited).
[273] Points-to-3D: Structure-Aware 3D Generation with Point Cloud Priors
Jiatong Xia, Zicheng Duan, Anton van den Hengel, Lingqiao Liu
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2603.18782 returned HTTP 429 (rate limited).
[274] G2HFNet: GeoGran-Aware Hierarchical Feature Fusion Network for Salient Object Detection in Optical Remote Sensing Images
Bin Wan, Runmin Cong, Xiaofei Zhou, Hao Fang, Chengtao Lv, Sam Kwong
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2603.12680 returned HTTP 429 (rate limited).
[275] Think and Answer ME: Benchmarking and Exploring Multi-Entity Reasoning Grounding in Remote Sensing
Shuchang Lyu, Haiquan Wen, Guangliang Cheng, Meng Li, Zheng Zhou, You Zhou, Dingding Yao, Zhenwei Shi
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2603.12788 returned HTTP 429 (rate limited).
[276] CustomTex: High-fidelity Indoor Scene Texturing via Multi-Reference Customization
Weilin Chen, Jiahao Rao, Wenhao Wang, Xinyang Li, Xuan Cheng, Liujuan Cao
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2603.19121 returned HTTP 429 (rate limited).
[277] Medical Image Spatial Grounding with Semantic Sampling
Andrew Seohwan Yu, Mohsen Hariri, Kunio Nakamura, Mingrui Yang, Xiaojuan Li, Vipin Chaudhary
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2603.14579 returned HTTP 429 (rate limited).
[278] IRIS: A Real-World Benchmark for Inverse Recovery and Identification of Physical Dynamic Systems from Monocular Video
Rasul Khanbayov, Mohamed Rayan Barhdadi, Erchin Serpedin, Hasan Kurban
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2603.16432 returned HTTP 429 (rate limited).
[279] CycleCap: Improving VLMs Captioning Performance via Self-Supervised Cycle Consistency Fine-Tuning
Marios Krestenitis, Christos Tzelepis, Konstantinos Ioannidis, Stefanos Vrochidis, Ioannis Kompatsiaris, Georgios Tzimiropoulos, Shaogang Gong, Ioannis Patras
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2603.18282 returned HTTP 429 (rate limited).
[280] Generalized Hand-Object Pose Estimation with Occlusion Awareness
Hui Yang, Wei Sun, Jian Liu, Jian Xiao, Tao Xie, Hossein Rahmani, Ajmal Saeed Mian, Nicu Sebe, Gim Hee Lee
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2603.19013 returned HTTP 429 (rate limited).
[281] Tinted Frames: Question Framing Blinds Vision-Language Models
Wan-Cyuan Fan, Jiayun Luo, Declan Kutscher, Leonid Sigal, Ritwik Gupta
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2603.19203 returned HTTP 429 (rate limited).
[282] Matryoshka Gaussian Splatting
Zhilin Guo, Boqiao Zhang, Hakan Aktas, Kyle Fogarty, Jeffrey Hu, Nursena Koprucu Aslan, Wenzhao Li, Canberk Baykal, Albert Miao, Josef Bengtson, Chenliang Zhou, Weihao Xia, Cristina Nader Vasconcelos, Cengiz Oztireli
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2603.19234 returned HTTP 429 (rate limited).
[283] EgoSpot: Egocentric Multimodal Control for Hands-Free Mobile Manipulation
Ganlin Zhang, Deheng Zhang, Longteng Duan, Guo Han, Yuqian Fu, Danda Pani Paudel, Luc Van Gool, Eric Vollenweider
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2306.02393 returned HTTP 429 (rate limited).
[284] Generative Blocks World: Moving Things Around in Pictures
Vaibhav Vavilala, Seemandhar Jain, Rahul Vasanth, D.A. Forsyth, Anand Bhattad
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2506.20703 returned HTTP 429 (rate limited).
[285] Mash, Spread, Slice! Learning to Manipulate Object States via Visual Spatial Progress
Priyanka Mandikal, Jiaheng Hu, Shivin Dass, Sagnik Majumder, Roberto Martín-Martín, Kristen Grauman
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2509.24129 returned HTTP 429 (rate limited).
[286] SpikeGrasp: A Benchmark for 6-DoF Grasp Pose Detection from Stereo Spike Streams
Zhuoheng Gao, Jiyao Zhang, Zhiyong Xie, Hao Dong, Zhaofei Yu, Rongmei Chen, Guozhang Chen, Tiejun Huang
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2510.10602 returned HTTP 429 (rate limited).
[287] FD-VLA: Force-Distilled Vision-Language-Action Model for Contact-Rich Manipulation
Ruiteng Zhao, Wenshuo Wang, Yicheng Ma, Xiaocong Li, Francis E.H. Tay, Marcelo H. Ang Jr., Haiyue Zhu
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2602.02142 returned HTTP 429 (rate limited).
cs.AI
[288] When both Grounding and not Grounding are Bad – A Partially Grounded Encoding of Planning into SAT (Extended Version)
João Filipe, Gregor Behnke
Main category: cs.AI
TL;DR: SAT-based planning encodings that keep actions lifted while partially grounding predicates, achieving linear scaling with plan length instead of quadratic
Details
Motivation: Classical planning faces an exponential blowup when lifted first-order representations are fully grounded, while fully lifted approaches avoid grounding but have their own limitations; a middle ground is needed that balances compactness and efficiency.
Method: Introduces three SAT encodings that keep actions at the lifted level while partially grounding predicates, scaling linearly with plan length rather than quadratically as previous SAT encodings do.
Result: Empirical evaluation shows the best encoding outperforms state-of-the-art in length-optimal planning on hard-to-ground domains, demonstrating improved performance for longer plans.
Conclusion: Partial grounding with lifted actions offers a promising middle ground between fully lifted and fully grounded planning, achieving better scalability and performance on challenging domains.
Abstract: Classical planning problems are typically defined using lifted first-order representations, which offer compactness and generality. While most planners ground these representations to simplify reasoning, this can cause an exponential blowup in size. Recent approaches instead operate directly on the lifted level to avoid full grounding. We explore a middle ground between fully lifted and fully grounded planning by introducing three SAT encodings that keep actions lifted while partially grounding predicates. Unlike previous SAT encodings, which scale quadratically with plan length, our approach scales linearly, enabling better performance on longer plans. Empirically, our best encoding outperforms the state of the art in length-optimal planning on hard-to-ground domains.
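The linear-scaling claim can be illustrated with a toy per-step CNF encoding. This is a generic planning-as-SAT sketch under made-up sizes, not the paper's partially grounded encoding: cross-step constraints touch only adjacent steps, so the clause count grows linearly with the horizon T.

```python
# Toy illustration only: a per-step CNF encoding of a planning horizon,
# showing clause count growing linearly with plan length T. This is a
# generic planning-as-SAT sketch, not the encoding proposed in the paper.

def encode(T, n_actions=3, n_facts=4):
    """Return CNF clauses (lists of signed ints, DIMACS-style) for horizon T."""
    step = n_actions + n_facts            # variables allocated per time step

    def act(t, a):                        # variable: action a fires at step t
        return 1 + t * step + a

    def fact(t, f):                       # variable: fact f holds at step t
        return 1 + t * step + n_actions + f

    clauses = []
    for t in range(T):
        # exactly-one action per step: at-least-one ...
        clauses.append([act(t, a) for a in range(n_actions)])
        # ... plus pairwise at-most-one
        for a in range(n_actions):
            for b in range(a + 1, n_actions):
                clauses.append([-act(t, a), -act(t, b)])
        # frame axioms link step t only to step t+1, keeping growth linear:
        # fact f persists unless some action (which might delete it) fires
        for f in range(n_facts):
            clauses.append([-fact(t, f), fact(t + 1, f)]
                           + [act(t, a) for a in range(n_actions)])
    return clauses

# 8 clauses per step with these sizes: doubling T doubles the clause count
assert len(encode(20)) == 2 * len(encode(10))
```

An encoding that related every pair of steps (rather than only adjacent ones) would instead add O(T²) clauses, which is the quadratic behavior the abstract attributes to previous SAT encodings.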
[289] Hyperagents
Jenny Zhang, Bingchen Zhao, Wannan Yang, Jakob Foerster, Jeff Clune, Minqi Jiang, Sam Devlin, Tatiana Shavrina
Main category: cs.AI
TL;DR: Hyperagents make the meta-level modification procedure itself editable, enabling metacognitive self-modification and self-improvement across diverse domains, not just coding.
Details
Motivation: Existing self-improving AI systems rely on fixed, handcrafted meta-level mechanisms that limit how fast they can improve, and current approaches like the Darwin Gödel Machine work well only in coding, where task performance aligns with self-modification skill.
Method: Introduces hyperagents: self-referential agents that combine a task agent and a meta agent into a single editable program in which the meta-level modification procedure is itself editable, enabling metacognitive self-modification. Instantiated as DGM-Hyperagents (DGM-H), an extension of the Darwin Gödel Machine.
Result: DGM-H improves performance over time across diverse domains, outperforms baselines without self-improvement/open-ended exploration and prior self-improving systems, and achieves meta-level improvements (persistent memory, performance tracking) that transfer across domains and accumulate across runs.
Conclusion: Hyperagents framework enables open-ended AI systems that continually improve their search for how to improve, not just search for better solutions, potentially supporting self-accelerating progress on any computable task.
Abstract: Self-improving AI systems aim to reduce reliance on human engineering by learning to improve their own learning and problem-solving processes. Existing approaches to self-improvement rely on fixed, handcrafted meta-level mechanisms, fundamentally limiting how fast such systems can improve. The Darwin Gödel Machine (DGM) demonstrates open-ended self-improvement in coding by repeatedly generating and evaluating self-modified variants. Because both evaluation and self-modification are coding tasks, gains in coding ability can translate into gains in self-improvement ability. However, this alignment does not generally hold beyond coding domains. We introduce \textbf{hyperagents}, self-referential agents that integrate a task agent (which solves the target task) and a meta agent (which modifies itself and the task agent) into a single editable program. Crucially, the meta-level modification procedure is itself editable, enabling metacognitive self-modification, improving not only the task-solving behavior, but also the mechanism that generates future improvements. We instantiate this framework by extending DGM to create DGM-Hyperagents (DGM-H), eliminating the assumption of domain-specific alignment between task performance and self-modification skill to potentially support self-accelerating progress on any computable task. Across diverse domains, the DGM-H improves performance over time and outperforms baselines without self-improvement or open-ended exploration, as well as prior self-improving systems. Furthermore, the DGM-H improves the process by which it generates new agents (e.g., persistent memory, performance tracking), and these meta-level improvements transfer across domains and accumulate across runs. DGM-Hyperagents offer a glimpse of open-ended AI systems that do not merely search for better solutions, but continually improve their search for how to improve.
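The core idea of a single editable program that contains both the task agent and the meta agent can be sketched in a few lines. This is a toy illustration, not the paper's DGM-H: the string-rewriting "meta agent" and all names here are invented for the sketch.

```python
# Toy self-referential agent: the task agent (solve) and meta agent
# (improve) live in one editable source string. Invented sketch, not DGM-H.

AGENT_SOURCE = '''
def solve(x):
    return x + 1  # task agent: a deliberately weak solver

def improve(source):
    # meta agent: edits the whole program, including itself
    return source.replace("x + 1", "x * 2")
'''

def load(source):
    ns = {}
    exec(source, ns)      # interpret the current agent program
    return ns

before = load(AGENT_SOURCE)["solve"](10)              # 11
patched = load(AGENT_SOURCE)["improve"](AGENT_SOURCE)
after = load(patched)["solve"](10)                    # 20
```

Because the string replace also hits `improve`'s own body, the meta procedure itself changes after one round, a minimal instance of the editable meta level the paper argues for.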
[290] Teaching an Agent to Sketch One Part at a Time
Xiaodan Du, Ruize Xu, David Yunis, Yael Vinker, Greg Shakhnarovich
Main category: cs.AI
TL;DR: A method for generating vector sketches part-by-part using a multimodal language model agent trained with multi-turn process-reward reinforcement learning after supervised fine-tuning, enabled by a new ControlSketch-Part dataset with rich part-level annotations.
Details
Motivation: To achieve interpretable, controllable, and locally editable text-to-vector sketch generation by producing sketches one part at a time, addressing limitations in existing sketch generation methods that lack fine-grained control and editability.
Method: Train a multimodal language model-based agent using a novel multi-turn process-reward reinforcement learning scheme following supervised fine-tuning. Use a new ControlSketch-Part dataset with rich part-level annotations obtained through an automatic annotation pipeline that segments vector sketches into semantic parts with structured multi-stage labeling.
Result: The approach enables interpretable, controllable, and locally editable text-to-vector sketch generation by incorporating structured part-level data and providing the agent with visual feedback through the generation process.
Conclusion: Part-by-part vector sketch generation using multimodal language models with structured part-level data and process feedback allows for more interpretable, controllable, and editable sketch creation from text descriptions.
Abstract: We develop a method for producing vector sketches one part at a time. To do this, we train a multi-modal language model-based agent using a novel multi-turn process-reward reinforcement learning scheme following supervised fine-tuning. Our approach is enabled by a new dataset we call ControlSketch-Part, containing rich part-level annotations for sketches, obtained using a novel, generic automatic annotation pipeline that segments vector sketches into semantic parts and assigns paths to parts with a structured multi-stage labeling process. Our results indicate that incorporating structured part-level data and providing the agent with visual feedback throughout the process enables interpretable, controllable, and locally editable text-to-vector sketch generation.
[291] Learning to Disprove: Formal Counterexample Generation with Large Language Models
Zenan Li, Zhaoyu Li, Kaiyu Yang, Xiaoxing Ma, Zhendong Su
Main category: cs.AI
TL;DR: Fine-tuning LLMs for formal counterexample generation in mathematics, addressing the gap in current AI math research that focuses only on proof construction.
Details
Motivation: Current AI efforts in mathematics focus almost exclusively on proof construction while neglecting the equally important task of finding counterexamples. There's a need to develop AI systems that can both prove true statements and disprove false ones through counterexamples.
Method: Fine-tune large language models for formal counterexample generation, requiring models to propose candidate counterexamples and produce formal proofs verifiable in the Lean 4 theorem prover. Introduce a symbolic mutation strategy to synthesize diverse training data by extracting theorems and discarding selected hypotheses. Use a multi-reward expert iteration framework for training.
Result: Experiments on three newly collected benchmarks show significant performance gains from the mutation strategy and training framework, validating the approach’s effectiveness for counterexample generation and theorem proving.
Conclusion: The paper successfully addresses the gap in AI mathematical reasoning by developing LLMs capable of both proof construction and counterexample generation, with the symbolic mutation strategy and training framework proving effective.
Abstract: Mathematical reasoning demands two critical, complementary skills: constructing rigorous proofs for true statements and discovering counterexamples that disprove false ones. However, current AI efforts in mathematics focus almost exclusively on proof construction, often neglecting the equally important task of finding counterexamples. In this paper, we address this gap by fine-tuning large language models (LLMs) to reason about and generate counterexamples. We formalize this task as formal counterexample generation, which requires LLMs not only to propose candidate counterexamples but also to produce formal proofs that can be automatically verified in the Lean 4 theorem prover. To enable effective learning, we introduce a symbolic mutation strategy that synthesizes diverse training data by systematically extracting theorems and discarding selected hypotheses, thereby producing diverse counterexample instances. Together with curated datasets, this strategy enables a multi-reward expert iteration framework that substantially enhances both the effectiveness and efficiency of training LLMs for counterexample generation and theorem proving. Experiments on three newly collected benchmarks validate the advantages of our approach, showing that the mutation strategy and training framework yield significant performance gains.
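The hypothesis-discarding mutation can be illustrated schematically. This is a toy Python stand-in; the paper operates on formal Lean 4 theorems, and the example statement below is invented.

```python
from itertools import combinations

# Toy stand-in for the symbolic mutation: a "theorem" is a pair
# (hypotheses, conclusion); each mutant drops one hypothesis, often turning
# a true statement into a false one that admits a counterexample.

def mutate(hypotheses, conclusion):
    for keep in combinations(hypotheses, len(hypotheses) - 1):
        yield list(keep), conclusion

theorem = (["n > 0", "n is even"], "n >= 2")
mutants = list(mutate(*theorem))
# dropping "n > 0" leaves "n is even -> n >= 2", falsified by n = 0
```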
[292] ItinBench: Benchmarking Planning Across Multiple Cognitive Dimensions with Large Language Models
Tianlong Wang, Pinqiao Wang, Weili Shi, Sheng li
Main category: cs.AI
TL;DR: ItinBench is a benchmark that integrates spatial reasoning (route optimization) with traditional verbal reasoning tasks in trip itinerary planning to evaluate LLMs across multiple cognitive domains simultaneously.
Details
Motivation: Current LLM evaluations focus on specific reasoning tasks in controlled environments, but real-world applications require handling multiple cognitive domains simultaneously. Travel planning provides a practical context, but existing benchmarks lack spatial reasoning integration.
Method: Created ItinBench benchmark featuring route optimization (spatial reasoning) alongside traditional verbal reasoning tasks in trip itinerary planning. Evaluated various LLMs including Llama 3.1 8B, Mistral Large, Gemini 1.5 Pro, and GPT family across these integrated tasks.
Result: LLMs struggle to maintain high and consistent performance when concurrently handling multiple cognitive dimensions. Performance varies across different reasoning domains within the same task context.
Conclusion: Comprehensive reasoning testbeds need to incorporate tasks from distinct cognitive domains to better reflect real-world challenges and provide more accurate evaluation of LLM capabilities.
Abstract: Large language models (LLMs) with advanced cognitive capabilities are emerging as agents for various reasoning and planning tasks. Traditional evaluations often focus on specific reasoning or planning questions within controlled environments. Recent studies have explored travel planning as a medium to integrate various verbal reasoning tasks into real-world contexts. However, reasoning tasks extend beyond verbal reasoning alone, and a comprehensive evaluation of LLMs requires a testbed that incorporates tasks from multiple cognitive domains. To address this gap, we introduce ItinBench, a benchmark that features one task of spatial reasoning, i.e., route optimization, into trip itinerary planning while keeping the traditional verbal reasoning tasks. ItinBench evaluates various LLMs, including Llama 3.1 8B, Mistral Large, Gemini 1.5 Pro, and the GPT family, across these diverse tasks simultaneously. Our findings reveal that LLMs struggle to maintain high and consistent performance when concurrently handling multiple cognitive dimensions. By incorporating tasks from distinct human-level cognitive domains, ItinBench provides new insights into building more comprehensive reasoning testbeds that better reflect real-world challenges. The code and dataset: https://ethanwtl.github.io/IBweb/
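The route-optimization component can be grounded with a toy example (the benchmark's actual scoring procedure is not specified here; the places and distances are invented): a proposed visiting order is scored by total travel distance and compared against the brute-force optimum over all orders.

```python
from itertools import permutations

# Toy spatial-reasoning check: score an itinerary order by total travel
# distance, then find the best order by exhaustive search.

def tour_length(order, dist):
    return sum(dist[a][b] for a, b in zip(order, order[1:]))

def optimal_order(places, dist):
    return min(permutations(places), key=lambda o: tour_length(o, dist))

dist = {"A": {"B": 1, "C": 5},
        "B": {"A": 1, "C": 2},
        "C": {"A": 5, "B": 2}}
best = optimal_order(["A", "B", "C"], dist)   # A -> B -> C, length 3
```

An LLM-proposed itinerary can then be checked for optimality by comparing its `tour_length` against `best`'s.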
[293] DEAF: A Benchmark for Diagnostic Evaluation of Acoustic Faithfulness in Audio Language Models
Jiaqi Xiong, Yunjia Qi, Qi Cao, Yu Zheng, Yutong Zhang, Ziteng Wang, Ruofan Liao, Weisheng Xu, Sichen Liu
Main category: cs.AI
TL;DR: DEAF benchmark evaluates Audio MLLMs’ acoustic faithfulness using 2,700+ conflict stimuli across emotional prosody, background sounds, and speaker identity to test if models genuinely process acoustic signals or rely on text-based inference.
Details
Motivation: Recent Audio MLLMs show impressive performance on speech benchmarks, but it's unclear whether they genuinely process acoustic signals or rely on text-based semantic inference. The authors want to systematically study this question and evaluate acoustic faithfulness.
Method: Introduces DEAF benchmark with over 2,700 conflict stimuli spanning three acoustic dimensions. Designs a controlled multi-level evaluation framework that progressively increases textual influence (semantic conflicts, misleading prompts, combinations). Introduces diagnostic metrics to quantify model reliance on textual cues vs. acoustic signals.
Result: Evaluation of seven Audio MLLMs reveals consistent pattern of text dominance: models are sensitive to acoustic variations, but predictions are predominantly driven by textual inputs, revealing a gap between high benchmark performance and genuine acoustic understanding.
Conclusion: There’s a significant gap between Audio MLLMs’ performance on standard speech benchmarks and their genuine acoustic understanding. Models show text dominance despite acoustic sensitivity, highlighting the need for better evaluation of acoustic faithfulness in multimodal models.
Abstract: Recent Audio Multimodal Large Language Models (Audio MLLMs) demonstrate impressive performance on speech benchmarks, yet it remains unclear whether these models genuinely process acoustic signals or rely on text-based semantic inference. To systematically study this question, we introduce DEAF (Diagnostic Evaluation of Acoustic Faithfulness), a benchmark of over 2,700 conflict stimuli spanning three acoustic dimensions: emotional prosody, background sounds, and speaker identity. Then, we design a controlled multi-level evaluation framework that progressively increases textual influence, ranging from semantic conflicts in the content to misleading prompts and their combination, allowing us to disentangle content-driven bias from prompt-induced sycophancy. We further introduce diagnostic metrics to quantify model reliance on textual cues over acoustic signals. Our evaluation of seven Audio MLLMs reveals a consistent pattern of text dominance: models are sensitive to acoustic variations, yet predictions are predominantly driven by textual inputs, revealing a gap between high performance on standard speech benchmarks and genuine acoustic understanding.
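A diagnostic in this spirit might look like the following (an illustrative metric with invented labels, not DEAF's exact definition): on stimuli where textual and acoustic cues conflict, count how often the model's prediction follows the text.

```python
# Illustrative text-dominance score: on conflict stimuli where the textual
# and acoustic cues disagree, measure how often the model's prediction
# follows the text rather than the audio.

def text_dominance(preds, text_labels, audio_labels):
    conflicts = [(p, t) for p, t, a in zip(preds, text_labels, audio_labels)
                 if t != a]                     # keep only genuine conflicts
    return sum(p == t for p, t in conflicts) / len(conflicts)

preds = ["happy", "sad", "happy", "angry"]
text  = ["happy", "sad", "sad",   "happy"]   # label implied by the words
audio = ["sad",   "sad", "happy", "angry"]   # label implied by the prosody
score = text_dominance(preds, text, audio)   # 3 conflicts, 1 follows text
```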
[294] PA2D-MORL: Pareto Ascent Directional Decomposition based Multi-Objective Reinforcement Learning
Tianmeng Hu, Biao Luo
Main category: cs.AI
TL;DR: PA2D-MORL: A multi-objective RL method using Pareto ascent directional decomposition for better approximation of Pareto policy sets in complex continuous control tasks.
Details
Motivation: Existing MORL methods struggle to achieve high-quality approximations of Pareto policy sets in complex tasks with continuous or high-dimensional state-action spaces, necessitating more effective decomposition and optimization approaches.
Method: Proposes Pareto Ascent Directional Decomposition (PA2D-MORL) that uses Pareto ascent direction to select scalarization weights and compute multi-objective policy gradients, ensuring joint improvement on all objectives. Combines evolutionary framework for multiple policy optimization with Pareto adaptive fine-tuning to enhance frontier density and spread.
Result: Experiments on various multi-objective robot control tasks show clear outperformance over state-of-the-art algorithms in both quality and stability of outcomes.
Conclusion: PA2D-MORL provides an effective solution for approximating Pareto policy sets in complex multi-objective reinforcement learning problems, particularly suitable for continuous control applications.
Abstract: Multi-objective reinforcement learning (MORL) provides an effective solution for decision-making problems involving conflicting objectives. However, achieving high-quality approximations to the Pareto policy set remains challenging, especially in complex tasks with continuous or high-dimensional state-action space. In this paper, we propose the Pareto Ascent Directional Decomposition based Multi-Objective Reinforcement Learning (PA2D-MORL) method, which constructs an efficient scheme for multi-objective problem decomposition and policy improvement, leading to a superior approximation of Pareto policy set. The proposed method leverages Pareto ascent direction to select the scalarization weights and computes the multi-objective policy gradient, which determines the policy optimization direction and ensures joint improvement on all objectives. Meanwhile, multiple policies are selectively optimized under an evolutionary framework to approximate the Pareto frontier from different directions. Additionally, a Pareto adaptive fine-tuning approach is applied to enhance the density and spread of the Pareto frontier approximation. Experiments on various multi-objective robot control tasks show that the proposed method clearly outperforms the current state-of-the-art algorithm in terms of both quality and stability of the outcomes.
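For two objectives, a common-ascent direction in this spirit has a simple closed form. This is a minimal MGDA-style sketch, not the paper's actual construction: the min-norm convex combination of the two gradients has positive inner product with both whenever such a direction exists.

```python
# Two-objective common-ascent direction: the minimizer of
# ||w*g1 + (1-w)*g2||^2 over w in [0, 1] has a closed form.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def ascent_direction(g1, g2):
    diff = [a - b for a, b in zip(g1, g2)]            # g1 - g2
    denom = dot(diff, diff)
    w = 0.5 if denom == 0 else min(1.0, max(0.0, -dot(diff, g2) / denom))
    return [w * a + (1 - w) * b for a, b in zip(g1, g2)]

d = ascent_direction([1.0, 0.0], [0.0, 1.0])
# d = [0.5, 0.5]: a step along d increases both objectives
```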
[295] A Subgoal-driven Framework for Improving Long-Horizon LLM Agents
Taiyi Wang, Sian Gooding, Florian Hartmann, Oriana Riva, Edward Grefenstette
Main category: cs.AI
TL;DR: LLM-based agents for web navigation struggle with long-horizon planning; proposed solution combines real-time subgoal decomposition planning with milestone-based RL rewards, achieving state-of-the-art performance on WebArena-Lite.
Details
Motivation: Existing LLM-based agents for web navigation fail at long-horizon planning due to losing track during online execution and difficulty learning from sparse rewards during RL fine-tuning, limiting their ability to handle complex, extended tasks.
Method: Two contributions: 1) Agent framework using proprietary models for online planning via subgoal decomposition, and 2) MiRA (Milestoning your Reinforcement Learning Enhanced Agent) - RL training framework with dense, milestone-based reward signals.
Result: Real-time planning improved proprietary models by ~10% absolute increase in success rate on WebArena-Lite. MiRA applied to Gemma3-12B increased success rate from 6.4% to 43.0%, surpassing GPT-4-Turbo (17.6%), GPT-4o (13.9%), and previous SOTA WebRL (38.4%).
Conclusion: Combining explicit inference-time planning with milestone-based rewards significantly improves agents’ long-horizon capabilities, enabling more robust and general-purpose autonomous systems for complex web navigation tasks.
Abstract: Large language model (LLM)-based agents have emerged as powerful autonomous controllers for digital environments, including mobile interfaces, operating systems, and web browsers. Web navigation, for example, requires handling dynamic content and long sequences of actions, making it particularly challenging. Existing LLM-based agents struggle with long-horizon planning in two main ways. During online execution, they often lose track as new information arrives, lacking a clear and adaptive path toward the final goal. This issue is further exacerbated during reinforcement learning (RL) fine-tuning, where sparse and delayed rewards make it difficult for agents to identify which actions lead to success, preventing them from maintaining coherent reasoning over extended tasks. To address these challenges, we propose two contributions. First, we introduce an agent framework that leverages proprietary models for online planning through subgoal decomposition. Second, we present MiRA (Milestoning your Reinforcement Learning Enhanced Agent), an RL training framework that uses dense, milestone-based reward signals. The real-time planning mechanism improves proprietary models such as Gemini by approximately a 10% absolute increase in success rate (SR) on the WebArena-Lite benchmark. Meanwhile, applying MiRA to the open Gemma3-12B model increases its success rate from 6.4% to 43.0%. This performance surpasses proprietary systems such as GPT-4-Turbo (17.6%) and GPT-4o (13.9%), as well as the previous open-model state of the art, WebRL (38.4%). Overall, our findings demonstrate that combining explicit inference-time planning with milestone-based rewards significantly improves an agent’s long-horizon capabilities, paving the way for more robust and general-purpose autonomous systems.
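The dense-reward idea can be sketched as follows (the milestone-matching rule and reward values below are assumptions for illustration, not MiRA's actual shaping): partial credit accrues the first time a trajectory reaches each subgoal, so even failed episodes carry learning signal.

```python
# Milestone-style dense reward sketch: a sparse reward would give this
# failed episode 0; milestone credit still provides a training signal.

def milestone_reward(trajectory, milestones, final_success, bonus=1.0):
    reached, reward = set(), 0.0
    for state in trajectory:
        for m in milestones:
            if m in state and m not in reached:   # credit first visit only
                reached.add(m)
                reward += bonus / len(milestones)
    return reward + (1.0 if final_success else 0.0)

r = milestone_reward(
    trajectory=["open search page", "typed query", "typed query"],
    milestones=["search page", "typed query", "clicked result"],
    final_success=False,
)   # two of three milestones reached
```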
[296] PowerLens: Taming LLM Agents for Safe and Personalized Mobile Power Management
Xingyu Feng, Chang Sun, Yuzhu Wang, Zhangbing Zhou, Chengwen Luo, Zhuangzhuang Chen, Xiaomin Ouyang, Huanqi Yang
Main category: cs.AI
TL;DR: PowerLens uses LLMs for personalized mobile power management by bridging user activities with system parameters through commonsense reasoning, achieving significant energy savings with safety guarantees.
Details
Motivation: Battery life is critical for mobile devices, but existing power management uses static rules or coarse heuristics that ignore user activities and personal preferences. There's a need for context-aware, personalized power management that adapts to individual usage patterns.
Method: PowerLens employs LLMs’ commonsense reasoning to bridge semantic gaps between user activities and system parameters. It uses a multi-agent architecture to recognize user context from UI semantics and generate holistic power policies across 18 device parameters. A PDL-based constraint framework verifies actions before execution, and a two-tier memory system learns individualized preferences from implicit user overrides through confidence-based distillation.
Result: Extensive experiments on rooted Android devices show PowerLens achieves 81.7% action accuracy and 38.8% energy savings over stock Android, outperforming rule-based and LLM-based baselines. The system demonstrates high user satisfaction, fast preference convergence (3-5 days), strong safety guarantees, and consumes only 0.5% of daily battery capacity itself.
Conclusion: PowerLens demonstrates that LLMs can effectively bridge the semantic gap between user activities and system parameters for personalized power management, achieving significant energy savings while maintaining safety and user satisfaction through implicit preference learning.
Abstract: Battery life remains a critical challenge for mobile devices, yet existing power management mechanisms rely on static rules or coarse-grained heuristics that ignore user activities and personal preferences. We present PowerLens, a system that tames the reasoning power of Large Language Models (LLMs) for safe and personalized mobile power management on Android devices. The key idea is that LLMs’ commonsense reasoning can bridge the semantic gap between user activities and system parameters, enabling zero-shot, context-aware policy generation that adapts to individual preferences through implicit feedback. PowerLens employs a multi-agent architecture that recognizes user context from UI semantics and generates holistic power policies across 18 device parameters. A PDL-based constraint framework verifies every action before execution, while a two-tier memory system learns individualized preferences from implicit user overrides through confidence-based distillation, requiring no explicit configuration and converging within 3–5 days. Extensive experiments on a rooted Android device show that PowerLens achieves 81.7% action accuracy and 38.8% energy saving over stock Android, outperforming rule-based and LLM-based baselines, with high user satisfaction, fast preference convergence, and strong safety guarantees, while the system itself consumes only 0.5% of daily battery capacity.
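The verify-before-execute pattern can be illustrated with a minimal sketch. The constraint rules and parameter names below are invented; the real system checks 18 parameters through a PDL-based framework.

```python
# Verify-before-execute sketch: actions that fail their constraint, or have
# no registered constraint at all, are rejected before reaching the device.

CONSTRAINTS = {
    "brightness": lambda v: 0 <= v <= 255,
    "cpu_freq_pct": lambda v: 30 <= v <= 100,   # never throttle below 30%
    "airplane_mode": lambda v: isinstance(v, bool),
}

def verify(policy):
    safe, rejected = {}, {}
    for param, value in policy.items():
        check = CONSTRAINTS.get(param)
        target = safe if check is not None and check(value) else rejected
        target[param] = value
    return safe, rejected

safe, rejected = verify({"brightness": 120, "cpu_freq_pct": 10, "gps": "off"})
# cpu_freq_pct=10 violates its floor; "gps" has no constraint registered
```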
[297] HyEvo: Self-Evolving Hybrid Agentic Workflows for Efficient Reasoning
Beibei Xu, Yutong Ye, Chuyun Shen, Yingbo Zhou, Cheng Chen, Mingsong Chen
Main category: cs.AI
TL;DR: HyEvo is an automated workflow generation framework that combines LLM nodes for semantic reasoning with deterministic code nodes for rule-based execution, using evolutionary strategies to optimize workflow topology and reduce inference costs.
Details
Motivation: Existing automated workflow generation methods are inefficient and underperform due to reliance on predefined operator libraries and homogeneous LLM-only workflows where all computation is done through probabilistic inference.
Method: HyEvo integrates probabilistic LLM nodes for semantic reasoning with deterministic code nodes for rule-based execution, using an LLM-driven multi-island evolutionary strategy with reflect-then-generate mechanism to iteratively refine workflow topology and node logic via execution feedback.
Result: HyEvo consistently outperforms existing methods across diverse reasoning and coding benchmarks, while reducing inference cost by up to 19× and execution latency by up to 16× compared to state-of-the-art open-source baselines.
Conclusion: HyEvo demonstrates that heterogeneous workflow synthesis combining LLMs with deterministic code execution can significantly improve performance and efficiency in automated workflow generation for complex tasks.
Abstract: Although agentic workflows have demonstrated strong potential for solving complex tasks, existing automated generation methods remain inefficient and underperform, as they rely on predefined operator libraries and homogeneous LLM-only workflows in which all task-level computation is performed through probabilistic inference. To address these limitations, we propose HyEvo, an automated workflow-generation framework that leverages heterogeneous atomic synthesis. HyEvo integrates probabilistic LLM nodes for semantic reasoning with deterministic code nodes for rule-based execution, offloading predictable operations from LLM inference and reducing inference cost and execution latency. To efficiently navigate the hybrid search space, HyEvo employs an LLM-driven multi-island evolutionary strategy with a reflect-then-generate mechanism, iteratively refining both workflow topology and node logic via execution feedback. Comprehensive experiments show that HyEvo consistently outperforms existing methods across diverse reasoning and coding benchmarks, while reducing inference cost and execution latency by up to 19$\times$ and 16$\times$, respectively, compared to the state-of-the-art open-source baseline.
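The LLM-node/code-node split can be sketched minimally (the node interfaces and the stubbed "LLM" are assumptions for illustration): deterministic steps run as plain functions, and only semantic steps spend an inference call.

```python
# Hybrid workflow sketch: probabilistic "LLM" nodes vs. deterministic code
# nodes. The llm argument is a stub standing in for a real model call.

def llm_node(prompt_template):
    def run(x, llm):
        return llm(prompt_template.format(x=x))   # spends one inference call
    return run

def code_node(fn):
    def run(x, llm):
        return fn(x)                              # free: no inference call
    return run

def execute(workflow, x, llm):
    for node in workflow:
        x = node(x, llm)
    return x

workflow = [
    code_node(lambda s: [c for c in s if c.isdigit()]),  # rule-based extract
    code_node(len),                                      # rule-based count
    llm_node("Report that the text had {x} digits."),    # semantic step
]
out = execute(workflow, "a1b2c3", llm=lambda prompt: prompt.upper())
```

Here two of three nodes run without inference, which is the source of the cost and latency savings the paper reports.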
[298] DAPS++: Rethinking Diffusion Inverse Problems with Decoupled Posterior Annealing
Hao Chen, Renzheng Zhang, Scott S. Howard
Main category: cs.AI
TL;DR: DAPS++ decouples diffusion-based initialization from likelihood-driven refinement for inverse problems, showing diffusion priors mainly serve as warm initializers while reconstruction is driven by measurement consistency.
Details
Motivation: Current Bayesian diffusion solvers for inverse problems claim joint inference but actually show decoupled behavior where diffusion priors offer limited guidance and reconstruction is largely measurement-driven. The paper aims to explain this practical behavior and develop a more efficient approach.
Method: Introduces DAPS++ which fully decouples diffusion-based initialization from likelihood-driven refinement. Uses diffusion prior as warm initializer to place estimates near data manifold, then allows likelihood term to guide inference directly while maintaining numerical stability.
Result: DAPS++ achieves high computational efficiency with fewer function evaluations and measurement-optimization steps, while maintaining robust reconstruction performance across diverse image restoration tasks.
Conclusion: Diffusion priors in inverse problem solvers primarily function as warm initializers rather than guiding inference throughout, and decoupling initialization from refinement leads to more efficient and effective reconstruction.
Abstract: From a Bayesian perspective, score-based diffusion solves inverse problems through joint inference, embedding the likelihood with the prior to guide the sampling process. However, this formulation fails to explain its practical behavior: the prior offers limited guidance, while reconstruction is largely driven by the measurement-consistency term, leading to an inference process that is effectively decoupled from the diffusion dynamics. We show that the diffusion prior in these solvers functions primarily as a warm initializer that places estimates near the data manifold, while reconstruction is driven almost entirely by measurement consistency. Based on this observation, we introduce \textbf{DAPS++}, which fully decouples diffusion-based initialization from likelihood-driven refinement, allowing the likelihood term to guide inference more directly while maintaining numerical stability and providing insight into why unified diffusion trajectories remain effective in practice. By requiring fewer function evaluations (NFEs) and measurement-optimization steps, \textbf{DAPS++} achieves high computational efficiency and robust reconstruction performance across diverse image restoration tasks.
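The decoupling can be made concrete with a toy linear inverse problem (the "prior" here is just a fixed warm start standing in for a diffusion initializer, and the operator is invented): reconstruction is driven entirely by gradient steps on the measurement-consistency term.

```python
# Toy decoupled solver: the "diffusion prior" is reduced to a warm start x0,
# and all reconstruction comes from minimizing ||Ax - y||^2.

def refine(x, A, y, steps=200, lr=0.1):
    for _ in range(steps):
        residual = [sum(a * xi for a, xi in zip(row, x)) - yi
                    for row, yi in zip(A, y)]                  # Ax - y
        grad = [2 * sum(r * row[i] for r, row in zip(residual, A))
                for i in range(len(x))]                        # 2 A^T r
        x = [xi - lr * g for xi, g in zip(x, grad)]
    return x

A = [[1.0, 0.0], [0.0, 2.0]]   # measurement operator
y = [3.0, 4.0]                 # observations; exact solution is x = [3, 2]
x0 = [0.5, 0.5]                # warm initialization ("prior" stand-in)
x_hat = refine(x0, A, y)
```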
[299] Stepwise: Neuro-Symbolic Proof Search for Automated Systems Verification
Baoding He, Zenan Li, Wei Sun, Yuan Yao, Taolue Chen, Xiaoxing Ma, Zhendong Su
Main category: cs.AI
TL;DR: Neuro-symbolic framework combining LLMs with interactive theorem provers to automate proof generation for software verification, achieving 77.6% theorem proof rate on seL4 benchmark.
Details
Motivation: Formal verification via interactive theorem proving is manual and doesn't scale well. LLMs show promise in mathematical reasoning but need integration with symbolic verification tools to be effective for systems-level verification.
Method: Best-first tree search over proof states with LLM querying for candidate steps. Fine-tunes LLMs on proof state-step pairs, integrates ITP tools for step repair, state filtering/ranking, and automatic subgoal discharge. Implemented on Isabelle REPL with fine-grained proof state exposure.
Result: Proves up to 77.6% of theorems on seL4 benchmark, substantially surpassing previous LLM-based approaches and standalone Sledgehammer. Solves significantly more multi-step proofs and shows strong generalization across additional Isabelle benchmarks.
Conclusion: Neuro-symbolic approach enables data-efficient LLM adaptation and semantics-informed search space pruning, indicating a viable path toward scalable automated software verification.
Abstract: Formal verification via interactive theorem proving is increasingly used to ensure the correctness of critical systems, yet constructing large proof scripts remains highly manual and limits scalability. Advances in large language models (LLMs), especially in mathematical reasoning, make their integration into software verification increasingly promising. This paper introduces a neuro-symbolic proof generation framework designed to automate proof search for systems-level verification projects. The framework performs a best-first tree search over proof states, repeatedly querying an LLM for the next candidate proof step. On the neural side, we fine-tune LLMs using datasets of proof state-step pairs; on the symbolic side, we incorporate a range of ITP tools to repair rejected steps, filter and rank proof states, and automatically discharge subgoals when search progress stalls. This synergy enables data-efficient LLM adaptation and semantics-informed pruning of the search space. We implement the framework on a new Isabelle REPL that exposes fine-grained proof states and automation tools, and evaluate it on the FVEL seL4 benchmark and additional Isabelle developments. On seL4, the system proves up to 77.6% of the theorems, substantially surpassing previous LLM-based approaches and standalone Sledgehammer, while solving significantly more multi-step proofs. Results across further benchmarks demonstrate strong generalization, indicating a viable path toward scalable automated software verification.
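The search skeleton is a standard best-first loop, sketched below. The proposer and scorer are toy stand-ins for the fine-tuned LLM and Isabelle proof states; the repair and subgoal-discharge tools are omitted.

```python
import heapq

# Schematic best-first proof search: lower score means more promising;
# each expansion asks the "model" for candidate next steps.

def best_first_search(initial, propose, is_closed, score, budget=100):
    frontier = [(score(initial), initial, [])]
    while frontier and budget > 0:
        _, state, path = heapq.heappop(frontier)
        if is_closed(state):
            return path                       # all goals discharged
        budget -= 1
        for step, nxt in propose(state):      # model-suggested candidates
            heapq.heappush(frontier, (score(nxt), nxt, path + [step]))
    return None

# toy "proof": a state is just the number of open subgoals
path = best_first_search(
    initial=3,
    propose=lambda s: [("discharge", s - 1), ("split", s + 1)],
    is_closed=lambda s: s == 0,
    score=lambda s: s,
)
```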
[300] Embodied Science: Closing the Discovery Loop with Agentic Embodied AI
Xiang Zhuang, Chenyi Zhou, Kehua Feng, Zhihui Zhu, Yunfan Gao, Yijie Zhong, Yichi Zhang, Junjie Huang, Keyan Ding, Lei Bai, Haofen Wang, Qiang Zhang, Huajun Chen
Main category: cs.AI
TL;DR: Embodied science paradigm reframes scientific discovery as a closed-loop system where AI agents physically interact with experimental environments through perception, reasoning, action, and discovery cycles.
Details
Motivation: Current AI approaches to scientific discovery are misaligned with the physical, long-horizon nature of real scientific research, treating discovery as isolated predictions rather than continuous physical interaction. There's a need to bridge the gap between digital prediction and empirical validation.
Method: Proposes a unified Perception-Language-Action-Discovery (PLAD) framework where embodied agents: 1) perceive experimental environments, 2) reason over scientific knowledge, 3) execute physical interventions, and 4) internalize outcomes to drive subsequent exploration.
Result: The paper presents a conceptual framework for autonomous discovery systems that tightly couple agentic reasoning with physical execution, offering a roadmap for applications in life and chemical sciences.
Conclusion: Embodied science represents a paradigm shift that grounds computational reasoning in robust physical feedback, enabling more realistic and effective autonomous scientific discovery systems.
Abstract: Artificial intelligence has demonstrated remarkable capability in predicting scientific properties, yet scientific discovery remains an inherently physical, long-horizon pursuit governed by experimental cycles. Most current computational approaches are misaligned with this reality, framing discovery as isolated, task-specific predictions rather than continuous interaction with the physical world. Here, we argue for embodied science, a paradigm that reframes scientific discovery as a closed loop tightly coupling agentic reasoning with physical execution. We propose a unified Perception-Language-Action-Discovery (PLAD) framework, wherein embodied agents perceive experimental environments, reason over scientific knowledge, execute physical interventions, and internalize outcomes to drive subsequent exploration. By grounding computational reasoning in robust physical feedback, this approach bridges the gap between digital prediction and empirical validation, offering a roadmap for autonomous discovery systems in the life and chemical sciences.
[301] FormalEvolve: Neuro-Symbolic Evolutionary Search for Diverse and Prover-Effective Autoformalization
Haijian Lu, Wei Wang, Jing Liu
Main category: cs.AI
TL;DR: FormalEvolve: A neuro-symbolic evolutionary framework for autoformalization that generates diverse formalizations via LLM-driven mutation/crossover and AST rewrites to improve both semantic consistency and prover effectiveness.
Details
Motivation: Current autoformalization approaches focus on semantic consistency but ignore prover effectiveness: even semantically correct formalizations can vary greatly in proof-search cost and success rates. There's a need to optimize formalizations for both semantic accuracy and practical provability.
Method: FormalEvolve uses a budgeted, test-time search framework with neuro-symbolic evolution: 1) LLM-driven mutation and crossover with bounded patch repair, 2) symbolic AST rewrite operations for structural diversity, 3) compilation-gated evaluation to filter candidates, and 4) evolutionary search under a fixed generator-call budget.
Result: On CombiBench and ProofNet with T=100 budget: achieves semantic hit rates of 58.0% and 84.9%, reduces cross-problem concentration of successes (lower Gini coefficient), and improves downstream proving performance under fixed prover budgets.
Conclusion: FormalEvolve demonstrates that evolutionary search with neuro-symbolic operators can effectively generate diverse, provable formalizations, addressing both semantic consistency and prover effectiveness in autoformalization.
Abstract: Autoformalization aims to translate natural-language mathematics into compilable, machine-checkable statements. However, semantic consistency does not imply prover effectiveness: even semantically consistent formalizations can differ substantially in proof-search cost and success rate. In this work, we formulate autoformalization as a budgeted, test-time search for semantically consistent repertoires, and propose FormalEvolve, a compilation-gated neuro-symbolic evolutionary framework. FormalEvolve generates diverse candidates via LLM-driven mutation and crossover with bounded patch repair, while symbolic Abstract Syntax Tree (AST) rewrite operations further inject structural diversity. On CombiBench and ProofNet, under a strict generator-call budget of T = 100, FormalEvolve reaches semantic hit rates (SH@100) of 58.0% and 84.9%, and reduces cross-problem concentration of semantic successes (lower Gini). Under a fixed prover budget, FormalEvolve also improves downstream proving performance on CombiBench. Code will be released publicly.
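The budgeted, compilation-gated evolutionary loop described above can be sketched generically. This is a minimal illustration, not the paper's implementation: `mutate`, `crossover`, `compiles`, and `score` are hypothetical callables standing in for the LLM operators, the AST rewrites, the compiler check, and the fitness signal.

```python
import random

def evolve(seed_candidates, mutate, crossover, compiles, score,
           budget=100, pop_size=8):
    """Budgeted evolutionary search sketch: every generator call counts
    against the budget, and candidates that fail the compilation gate
    are discarded before they can enter the population."""
    population = [c for c in seed_candidates if compiles(c)]
    calls = 0
    while calls < budget and population:
        parents = random.sample(population, min(2, len(population)))
        child = (crossover(*parents) if len(parents) == 2
                 else mutate(parents[0]))
        calls += 1                       # generator-call budget
        if compiles(child):              # compilation-gated evaluation
            population.append(child)
            population.sort(key=score, reverse=True)
            population = population[:pop_size]   # keep a bounded elite
    return population
```

The key design point is that the gate runs before scoring, so prover or evaluation budget is never spent on candidates that do not even compile.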
[302] Utility-Guided Agent Orchestration for Efficient LLM Tool Use
Boyan Liu, Gongming Zhao, Hongli Xu
Main category: cs.AI
TL;DR: A utility-guided orchestration policy for LLM agents that balances answer quality and execution cost by selecting actions based on estimated gain, step cost, uncertainty, and redundancy.
Details
Motivation: Tool-using LLM agents face a tension between answer quality and execution cost: fixed workflows are inflexible, while free-form methods like ReAct improve performance at the expense of excessive tool calls, longer trajectories, higher token consumption, and increased latency.
Method: Proposes agent orchestration as an explicit decision problem with a utility-guided policy that selects among actions (respond, retrieve, tool call, verify, stop) by balancing estimated gain, step cost, uncertainty, and redundancy. Provides a controllable and analyzable policy framework for studying quality-cost trade-offs.
Result: Experiments across various approaches (direct answering, threshold control, fixed workflows, ReAct, policy variants) show that explicit orchestration signals substantially affect agent behavior. Additional analyses demonstrate that lightweight utility design provides a defensible and practical mechanism for agent control.
Conclusion: Explicit orchestration policies offer a controllable framework for managing quality-cost trade-offs in tool-using LLM agents, with utility-guided approaches providing practical mechanisms for agent control beyond prompt-level behavior.
Abstract: Tool-using large language model (LLM) agents often face a fundamental tension between answer quality and execution cost. Fixed workflows are stable but inflexible, while free-form multi-step reasoning methods such as ReAct may improve task performance at the expense of excessive tool calls, longer trajectories, higher token consumption, and increased latency. In this paper, we study agent orchestration as an explicit decision problem rather than leaving it entirely to prompt-level behavior. We propose a utility-guided orchestration policy that selects among actions such as respond, retrieve, tool call, verify, and stop by balancing estimated gain, step cost, uncertainty, and redundancy. Our goal is not to claim universally best task performance, but to provide a controllable and analyzable policy framework for studying quality-cost trade-offs in tool-using LLM agents. Experiments across direct answering, threshold control, fixed workflows, ReAct, and several policy variants show that explicit orchestration signals substantially affect agent behavior. Additional analyses on cost definitions, workflow fairness, and redundancy control further demonstrate that lightweight utility design can provide a defensible and practical mechanism for agent control.
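A utility-guided action selector of the kind the abstract describes can be sketched in a few lines. The term names (`gain`, `cost`, `uncertainty_penalty`, `redundancy`) and the additive form are illustrative assumptions, not the paper's actual utility definition.

```python
def select_action(candidates):
    """Pick the action with the highest net utility; fall back to
    'stop' when no action clears zero. `candidates` maps an action
    name -> a dict of estimated utility terms (hypothetical schema)."""
    def utility(t):
        return (t["gain"]                    # expected quality improvement
                - t["cost"]                  # tokens / latency of the step
                - t["uncertainty_penalty"]   # discount unreliable estimates
                - t["redundancy"])           # penalize repeated tool calls
    best = max(candidates, key=lambda a: utility(candidates[a]))
    return best if utility(candidates[best]) > 0 else "stop"
```

The point of making orchestration explicit, as the paper argues, is that each term becomes a tunable, inspectable knob rather than an emergent side effect of the prompt.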
[303] On the Ability of Transformers to Verify Plans
Yash Sarrof, Yupei Du, Katharina Stein, Alexander Koller, Sylvie Thiébaux, Michael Hahn
Main category: cs.AI
TL;DR: Transformers can provably learn to verify long plans in classical planning domains, with theoretical analysis using C*-RASP framework for length generalization guarantees.
Details
Motivation: Transformers have shown inconsistent success in AI planning tasks, and there's limited theoretical understanding of when generalization should be expected, especially when the number of objects (and thus vocabulary size) grows at test time.
Method: Introduces C*-RASP, an extension of C-RASP designed to establish length generalization guarantees for transformers under simultaneous growth in sequence length and vocabulary size. Analyzes decoder-only models’ ability to verify whether given plans correctly solve planning instances.
Result: Identifies a large class of classical planning domains for which transformers can provably learn to verify long plans, and identifies structural properties that significantly affect the learnability of length generalizable solutions. Empirical experiments corroborate the theory.
Conclusion: Provides theoretical foundations for understanding transformers’ capabilities in planning tasks, with formal guarantees about length generalization in plan verification scenarios.
Abstract: Transformers have shown inconsistent success in AI planning tasks, and theoretical understanding of when generalization should be expected has been limited. We take important steps towards addressing this gap by analyzing the ability of decoder-only models to verify whether a given plan correctly solves a given planning instance. To analyse the general setting where the number of objects – and thus the effective input alphabet – grows at test time, we introduce C*-RASP, an extension of C-RASP designed to establish length generalization guarantees for transformers under the simultaneous growth in sequence length and vocabulary size. Our results identify a large class of classical planning domains for which transformers can provably learn to verify long plans, and structural properties that significantly affect the learnability of length generalizable solutions. Empirical experiments corroborate our theory.
[304] Experience is the Best Teacher: Motivating Effective Exploration in Reinforcement Learning for LLMs
Wenjian Zhang, Kongcheng Zhang, Jiaxin Qi, Baisheng Lai, Jianqiang Huang
Main category: cs.AI
TL;DR: HeRL is a reinforcement learning framework that uses hindsight experience from failed trajectories to guide LLM exploration toward desired behaviors specified in rewards, improving reasoning capabilities.
Details
Motivation: Current RL approaches with rubric-based rewards for LLMs suffer from ineffective exploration confined to the current policy distribution, limiting their ability to learn optimal behaviors.
Method: HeRL treats failed trajectories with unmet rubrics as hindsight experience, using them as in-context guidance for policy exploration. It also introduces bonus rewards to incentivize responses with greater improvement potential under such guidance.
Result: Extensive experiments across various benchmarks show HeRL achieves superior performance gains over baselines and enables experience-guided self-improvement at test time.
Conclusion: HeRL facilitates effective learning from desired high-quality samples without repeated trial-and-error, providing more accurate gradient estimation and improving LLM reasoning capabilities.
Abstract: Reinforcement Learning (RL) with rubric-based rewards has recently shown remarkable progress in enhancing general reasoning capabilities of Large Language Models (LLMs), yet still suffers from ineffective exploration confined to the current policy distribution. In fact, RL optimization can be viewed as steering the policy toward an ideal distribution that maximizes the rewards, while effective exploration should align efforts with the desired target. Leveraging this insight, we propose HeRL, a Hindsight experience guided Reinforcement Learning framework to bootstrap effective exploration by explicitly telling LLMs the desired behaviors specified in rewards. Concretely, HeRL treats failed trajectories along with their unmet rubrics as hindsight experience, which serves as in-context guidance for the policy to explore desired responses beyond its current distribution. Additionally, we introduce a bonus reward to incentivize responses with greater potential for improvement under such guidance. HeRL facilitates effective learning from desired high-quality samples without repeated trial-and-error from scratch, yielding a more accurate estimation of the expected gradient theoretically. Extensive experiments across various benchmarks demonstrate that HeRL achieves superior performance gains over baselines, and can further benefit from experience-guided self-improvement at test time. Our code is available at https://github.com/sikelifei/HeRL.
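The core mechanism, feeding a failed trajectory and its unmet rubrics back as in-context guidance, can be sketched as a single rollout step. The helpers `generate` and `grade` and the prompt format are hypothetical stand-ins, not HeRL's actual interface.

```python
def herl_step(policy, prompt, rubrics, generate, grade):
    """One hindsight-guided rollout sketch: if the first attempt leaves
    rubrics unmet, the failed response plus those rubrics become
    in-context guidance for a second exploration attempt."""
    response = generate(policy, prompt)
    unmet = [r for r in rubrics if not grade(response, r)]
    if not unmet:
        return response, []          # already satisfies all rubrics
    guidance = (prompt
                + "\nPrevious attempt:\n" + response
                + "\nUnmet requirements:\n- " + "\n- ".join(unmet))
    retry = generate(policy, guidance)
    return retry, unmet
```

In the full method, the retry distribution would additionally be shaped by the bonus reward for responses with high improvement potential; this sketch only shows the guidance loop.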
[305] DIAL-KG: Schema-Free Incremental Knowledge Graph Construction via Dynamic Schema Induction and Evolution-Intent Assessment
Weidong Bao, Yilin Wang, Ruyu Gao, Fangling Leng, Yubin Bao, Ge Yu
Main category: cs.AI
TL;DR: DIAL-KG is a closed-loop framework for incremental knowledge graph construction that dynamically adapts to new data through a three-stage cycle of extraction, validation, and schema evolution.
Details
Motivation: Traditional KG construction methods are static, requiring complete graph reconstruction for new data and being limited by predefined schemas, making them unsuitable for dynamic real-world scenarios where data arrives continuously.
Method: A three-stage closed-loop framework: 1) Dual-Track Extraction (triple generation for simple knowledge, event extraction for complex knowledge), 2) Governance Adjudication (validates facts to prevent hallucinations and staleness), 3) Schema Evolution (induces new schemas from validated knowledge to guide future cycles).
Result: Extensive experiments demonstrate state-of-the-art performance in both constructed graph quality and induced schemas compared to existing methods.
Conclusion: DIAL-KG provides a flexible, incremental approach to KG construction that adapts to dynamic data streams while maintaining high quality through validation and schema evolution mechanisms.
Abstract: Knowledge Graphs (KGs) are foundational to applications such as search, question answering, and recommendation. Conventional knowledge graph construction methods are predominantly static, relying on a single-step construction from a fixed corpus with a predefined schema. However, such methods are suboptimal for real-world scenarios where data arrives dynamically, as incorporating new information requires complete and computationally expensive graph reconstructions. Furthermore, predefined schemas hinder the flexibility of knowledge graph construction. To address these limitations, we introduce DIAL-KG, a closed-loop framework for incremental KG construction orchestrated by a Meta-Knowledge Base (MKB). The framework operates in a three-stage cycle: (i) Dual-Track Extraction, which ensures knowledge completeness by defaulting to triple generation and switching to event extraction for complex knowledge; (ii) Governance Adjudication, which ensures the fidelity and currency of extracted facts to prevent hallucinations and knowledge staleness; and (iii) Schema Evolution, in which new schemas are induced from validated knowledge to guide subsequent construction cycles, and knowledge from the current round is incrementally applied to the existing KG. Extensive experiments demonstrate that our framework achieves state-of-the-art (SOTA) performance in the quality of both the constructed graph and the induced schemas.
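The three-stage cycle (extract, adjudicate, evolve) has a simple closed-loop shape that can be sketched as one incremental pass over a new data chunk. All callables here (`extract_triples`, `extract_events`, `is_complex`, `validate`, `induce_schema`) are hypothetical stand-ins for the paper's components.

```python
def dial_kg_cycle(chunk, kg, schema, extract_triples, extract_events,
                  is_complex, validate, induce_schema):
    """One incremental construction cycle sketch: the schema induced
    from this round's validated facts guides the next round."""
    # (i) Dual-track extraction: default to triples, escalate to events
    candidates = (extract_events(chunk, schema) if is_complex(chunk)
                  else extract_triples(chunk, schema))
    # (ii) Governance adjudication: keep only validated facts
    facts = [f for f in candidates if validate(f, kg)]
    # (iii) Schema evolution + incremental application to the existing KG
    kg.extend(facts)
    schema = induce_schema(schema, facts)
    return kg, schema
```

Because the schema is an output of each cycle rather than a fixed input, new relation types can appear mid-stream without rebuilding the graph, which is the schema-free property the title refers to.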
[306] Pitfalls in Evaluating Interpretability Agents
Tal Haklay, Nikhil Prakash, Sana Pandey, Antonio Torralba, Aaron Mueller, Jacob Andreas, Tamar Rott Shaham, Yonatan Belinkov
Main category: cs.AI
TL;DR: An agentic system using LLMs for automated circuit analysis in neural networks, with evaluation challenges and proposed intrinsic evaluation methods.
Details
Motivation: To scale neural network interpretability analysis using LLMs as autonomous agents, addressing the challenge of evaluating complex automated interpretability systems that generate explanations at scale.
Method: Builds an agentic system where a research agent iteratively designs experiments and refines hypotheses for circuit analysis. Proposes unsupervised intrinsic evaluation based on functional interchangeability of model components.
Result: System appears competitive against human expert explanations across six circuit analysis tasks, but reveals pitfalls of replication-based evaluation including subjective human explanations, outcome-based comparisons, and LLM memorization issues.
Conclusion: Demonstrates fundamental challenges in evaluating complex automated interpretability systems and reveals key limitations of replication-based evaluation, proposing intrinsic evaluation as an alternative approach.
Abstract: Automated interpretability systems aim to reduce the need for human labor and scale analysis to increasingly large models and diverse tasks. Recent efforts toward this goal leverage large language models (LLMs) at increasing levels of autonomy, ranging from fixed one-shot workflows to fully autonomous interpretability agents. This shift creates a corresponding need to scale evaluation approaches to keep pace with both the volume and complexity of generated explanations. We investigate this challenge in the context of automated circuit analysis – explaining the roles of model components when performing specific tasks. To this end, we build an agentic system in which a research agent iteratively designs experiments and refines hypotheses. When evaluated against human expert explanations across six circuit analysis tasks in the literature, the system appears competitive. However, closer examination reveals several pitfalls of replication-based evaluation: human expert explanations can be subjective or incomplete, outcome-based comparisons obscure the research process, and LLM-based systems may reproduce published findings via memorization or informed guessing. To address some of these pitfalls, we propose an unsupervised intrinsic evaluation based on the functional interchangeability of model components. Our work demonstrates fundamental challenges in evaluating complex automated interpretability systems and reveals key limitations of replication-based evaluation.
[307] Learning Dynamic Belief Graphs for Theory-of-mind Reasoning
Ruxiao Chen, Xilei Zhao, Thomas J. Cova, Frank A. Drews, Susu Xu
Main category: cs.AI
TL;DR: A structured cognitive trajectory model for LLM-based Theory of Mind that represents mental states as dynamic belief graphs, improving action prediction and belief trajectory recovery in high-uncertainty environments.
Details
Motivation: Existing LLM-based ToM approaches either use direct prompting or static latent-state models, which produce incoherent mental models over time and weak reasoning in dynamic, high-stakes contexts like disaster response and emergency medicine.
Method: Introduces a structured cognitive trajectory model with: (1) projection from textual probabilistic statements to consistent probabilistic graphical model updates, (2) energy-based factor graph representation of belief interdependencies, and (3) ELBO-based objective capturing belief accumulation and delayed decisions.
Result: Significantly improves action prediction across multiple real-world disaster evacuation datasets and recovers interpretable belief trajectories consistent with human reasoning.
Conclusion: Provides a principled module for augmenting LLMs with Theory of Mind capabilities in high-uncertainty environments through dynamic belief graph representations.
Abstract: Theory of Mind (ToM) reasoning with Large Language Models (LLMs) requires inferring how people’s implicit, evolving beliefs shape what they seek and how they act under uncertainty – especially in high-stakes settings such as disaster response, emergency medicine, and human-in-the-loop autonomy. Prior approaches either prompt LLMs directly or use latent-state models that treat beliefs as static and independent, often producing incoherent mental models over time and weak reasoning in dynamic contexts. We introduce a structured cognitive trajectory model for LLM-based ToM that represents mental state as a dynamic belief graph, jointly inferring latent beliefs, learning their time-varying dependencies, and linking belief evolution to information seeking and decisions. Our model contributes (i) a novel projection from textualized probabilistic statements to consistent probabilistic graphical model updates, (ii) an energy-based factor graph representation of belief interdependencies, and (iii) an ELBO-based objective that captures belief accumulation and delayed decisions. Across multiple real-world disaster evacuation datasets, our model significantly improves action prediction and recovers interpretable belief trajectories consistent with human reasoning, providing a principled module for augmenting LLMs with ToM in high-uncertainty environments. https://anonymous.4open.science/r/ICML_submission-6373/
[308] Survey of Various Fuzzy and Uncertain Decision-Making Methods
Takaaki Fujita, Florentin Smarandache
Main category: cs.AI
TL;DR: Survey paper on uncertainty-aware multi-criteria decision-making (MCDM) methods, organizing the field into a taxonomy covering problem settings, weight elicitation, solution procedures, and guidance for method selection.
Details
Motivation: Real-world decision-making faces challenges with vagueness, incomplete information, heterogeneous data, and conflicting expert opinions, necessitating systematic approaches to handle uncertainty in multi-criteria decision contexts.
Method: Survey methodology organizing uncertainty-aware MCDM into a task-oriented taxonomy covering: problem-level settings (discrete, group, dynamic, multi-stage, etc.), weight elicitation (subjective/objective schemes with fuzzy/linguistic inputs), inter-criteria structure modeling, and solution procedures (compensatory scoring, distance-to-reference, outranking frameworks).
Result: Comprehensive organization of the field highlighting typical inputs, core computational steps, primary outputs, and guidance for method selection based on robustness, interpretability, and data availability.
Conclusion: Identifies open research directions including explainable uncertainty integration, stability analysis, and scalability improvements for large-scale dynamic decision environments.
Abstract: Decision-making in real applications is often affected by vagueness, incomplete information, heterogeneous data, and conflicting expert opinions. This survey reviews uncertainty-aware multi-criteria decision-making (MCDM) and organizes the field into a concise, task-oriented taxonomy. We summarize problem-level settings (discrete, group/consensus, dynamic, multi-stage, multi-level, multiagent, and multi-scenario), weight elicitation (subjective and objective schemes under fuzzy/linguistic inputs), and inter-criteria structure and causality modelling. For solution procedures, we contrast compensatory scoring methods, distance-to-reference and compromise approaches, and non-compensatory outranking frameworks for ranking or sorting. We also outline rule/evidence-based and sequential decision models that produce interpretable rules or policies. The survey highlights typical inputs, core computational steps, and primary outputs, and provides guidance on choosing methods according to robustness, interpretability, and data availability. It concludes with open directions on explainable uncertainty integration, stability, and scalability in large-scale and dynamic decision environments.
[309] HPS: Hard Preference Sampling for Human Preference Alignment
Xiandong Zou, Wanyu Lin, Yuchen Li, Pan Zhou
Main category: cs.AI
TL;DR: HPS is a novel preference alignment framework that prioritizes preferred responses while rejecting all dispreferred and harmful ones, focusing on “hard” dispreferred responses to enhance rejection capabilities with efficient single-sample Monte Carlo sampling.
Details
Motivation: Current preference optimization methods (PL/BT models) have limitations: poor handling of harmful content, inefficient use of dispreferred responses, and high computational costs for PL methods.
Method: Hard Preference Sampling (HPS) introduces a training loss that prioritizes most preferred responses while rejecting all dispreferred and harmful ones, emphasizing “hard” dispreferred responses (those closely resembling preferred ones). Uses single-sample Monte Carlo sampling to reduce computational overhead.
Result: Experiments on HH-RLHF and PKU-Safety datasets show HPS achieves comparable BLEU and reward scores while greatly improving reward margins and reducing harmful content generation.
Conclusion: HPS provides a robust and efficient framework for human preference alignment that improves sample efficiency, maximizes reward margins, and reduces harmful content generation compared to existing methods.
Abstract: Aligning Large Language Model (LLM) responses with human preferences is vital for building safe and controllable AI systems. While preference optimization methods based on Plackett-Luce (PL) and Bradley-Terry (BT) models have shown promise, they face challenges such as poor handling of harmful content, inefficient use of dispreferred responses, and, specifically for PL, high computational costs. To address these issues, we propose Hard Preference Sampling (HPS), a novel framework for robust and efficient human preference alignment. HPS introduces a training loss that prioritizes the most preferred response while rejecting all dispreferred and harmful ones. It emphasizes “hard” dispreferred responses – those closely resembling preferred ones – to enhance the model’s rejection capabilities. By leveraging a single-sample Monte Carlo sampling strategy, HPS reduces computational overhead while maintaining alignment quality. Theoretically, HPS improves sample efficiency over existing PL methods and maximizes the reward margin between preferred and dispreferred responses, ensuring clearer distinctions. Experiments on HH-RLHF and PKU-Safety datasets validate HPS’s effectiveness, achieving comparable BLEU and reward scores while greatly improving reward margins and thus reducing harmful content generation.
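The idea that "hard" dispreferred responses should dominate the training signal can be illustrated with a softmax-style contrastive objective over scalar rewards. This is a generic sketch of the principle, not HPS's actual loss; the reward inputs and the inverse temperature `beta` are illustrative.

```python
import math

def hard_preference_loss(preferred_reward, dispreferred_rewards, beta=1.0):
    """-log softmax(preferred) over preferred + dispreferred rewards.
    Dispreferred responses whose rewards sit close to the preferred
    one ("hard" negatives) contribute the largest loss and gradient."""
    logits = [beta * preferred_reward] + [beta * r for r in dispreferred_rewards]
    m = max(logits)                                  # log-sum-exp for stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - beta * preferred_reward
```

With one negative this reduces to `log(1 + exp(beta * (r_neg - r_pos)))`: an easy negative far below the preferred reward contributes almost nothing, while a near-tie contributes close to `log 2`, which is the sense in which hard negatives sharpen the reward margin.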
[310] Average Reward Reinforcement Learning for Omega-Regular and Mean-Payoff Objectives
Milad Kazemi, Mateo Perez, Fabio Somenzi, Sadegh Soudjani, Ashutosh Trivedi, Alvaro Velasquez
Main category: cs.AI
TL;DR: First model-free RL framework for absolute liveness specifications in continuing tasks using average-reward objectives, outperforming discount-based methods.
Details
Motivation: Manual reward design is tedious and error-prone, while existing ω-regular RL approaches use episodic discounted settings misaligned with infinite trace semantics. Need for principled approach for continuing tasks without episodic resets.
Method: Translates absolute liveness specifications into average-reward objectives for model-free RL in unknown communicating MDPs. Introduces lexicographic multi-objective optimization: first maximize satisfaction probability of specification, then maximize external average-reward objective.
Result: Guarantees convergence in unknown communicating MDPs, supports on-the-fly reductions without full environment knowledge. Experiments show continuing average-reward approach outperforms competing discount-based methods.
Conclusion: Provides first model-free RL framework for absolute liveness specifications in continuing tasks, offering principled alternative to manual reward design with better alignment to ω-regular semantics.
Abstract: Recent advances in reinforcement learning (RL) have renewed interest in reward design for shaping agent behavior, but manually crafting reward functions is tedious and error-prone. A principled alternative is to specify behavioral requirements in a formal, unambiguous language and automatically compile them into learning objectives. $ω$-regular languages are a natural fit, given their role in formal verification and synthesis. However, most existing $ω$-regular RL approaches operate in an episodic, discounted setting with periodic resets, which is misaligned with $ω$-regular semantics over infinite traces. For continuing tasks, where the agent interacts with the environment over a single uninterrupted lifetime, the average-reward criterion is more appropriate. We focus on absolute liveness specifications, a subclass of $ω$-regular languages that cannot be violated by any finite prefix and thus aligns naturally with continuing interaction. We present the first model-free RL framework that translates absolute liveness specifications into average-reward objectives and enables learning in unknown communicating Markov decision processes (MDPs) without episodic resetting. We also introduce a reward structure for lexicographic multi-objective optimization: among policies that maximize the satisfaction probability of an absolute liveness specification, the agent maximizes an external average-reward objective. Our method guarantees convergence in unknown communicating MDPs and supports on-the-fly reductions that do not require full environment knowledge, enabling model-free learning. Experiments across several benchmarks show that the continuing, average-reward approach outperforms competing discount-based methods.
[311] Preference-Driven Multi-Objective Combinatorial Optimization with Conditional Computation
Mingfeng Fan, Jianan Zhou, Yifeng Zhang, Yaoxin Wu, Jinbiao Chen, Guillaume Adrien Sartoretti
Main category: cs.AI
TL;DR: POCCO is a plug-and-play framework for multi-objective combinatorial optimization that adaptively selects specialized neural architectures for different subproblems and uses preference-driven optimization instead of explicit rewards.
Details
Motivation: Existing deep RL methods for multi-objective combinatorial optimization treat all subproblems equally with a single model, which limits effective exploration of the solution space and leads to suboptimal performance.
Method: Proposes POCCO framework with: 1) conditional computation block that routes subproblems to specialized neural architectures, and 2) preference-driven optimization algorithm that learns pairwise preferences between winning and losing solutions.
Result: Experimental results across four classic MOCOP benchmarks show significant superiority and strong generalization when applied to two state-of-the-art neural methods for MOCOPs.
Conclusion: POCCO provides an effective and versatile framework for improving multi-objective combinatorial optimization through adaptive model selection and preference-based learning.
Abstract: Recent deep reinforcement learning methods have achieved remarkable success in solving multi-objective combinatorial optimization problems (MOCOPs) by decomposing them into multiple subproblems, each associated with a specific weight vector. However, these methods typically treat all subproblems equally and solve them using a single model, hindering the effective exploration of the solution space and thus leading to suboptimal performance. To overcome the limitation, we propose POCCO, a novel plug-and-play framework that enables adaptive selection of model structures for subproblems, which are subsequently optimized based on preference signals rather than explicit reward values. Specifically, we design a conditional computation block that routes subproblems to specialized neural architectures. Moreover, we propose a preference-driven optimization algorithm that learns pairwise preferences between winning and losing solutions. We evaluate the efficacy and versatility of POCCO by applying it to two state-of-the-art neural methods for MOCOPs. Experimental results across four classic MOCOP benchmarks demonstrate its significant superiority and strong generalization.
[312] Multimodal Fused Learning for Solving the Generalized Traveling Salesman Problem in Robotic Task Planning
Jiaqi Cheng, Mingfeng Fan, Xuefeng Zhang, Jingsong Liang, Yuhong Cao, Guohua Wu, Guillaume Adrien Sartoretti
Main category: cs.AI
TL;DR: A Multimodal Fused Learning framework for solving Generalized Traveling Salesman Problems in robotics using graph and image representations for real-time task planning.
Details
Motivation: Mobile robots need effective and efficient task planning for applications like warehouse retrieval and environmental monitoring, which involve solving Generalized Traveling Salesman Problems (GTSP) that are challenging to solve both accurately and efficiently.
Method: Proposes a Multimodal Fused Learning (MMFL) framework that uses both graph and image-based representations. Includes a coordinate-based image builder to transform GTSP instances into spatial representations, adaptive resolution scaling for different problem scales, and a multimodal fusion module with bottlenecks to integrate geometric and spatial features.
Result: Extensive experiments show MMFL significantly outperforms state-of-the-art methods across various GTSP instances while maintaining computational efficiency for real-time applications. Physical robot tests validate practical effectiveness in real-world scenarios.
Conclusion: The MMFL framework provides an effective solution for real-time robotic task planning by leveraging multimodal representations to solve challenging GTSP problems with both accuracy and efficiency.
Abstract: Effective and efficient task planning is essential for mobile robots, especially in applications like warehouse retrieval and environmental monitoring. These tasks often involve selecting one location from each of several target clusters, forming a Generalized Traveling Salesman Problem (GTSP) that remains challenging to solve both accurately and efficiently. To address this, we propose a Multimodal Fused Learning (MMFL) framework that leverages both graph and image-based representations to capture complementary aspects of the problem, and learns a policy capable of generating high-quality task planning schemes in real time. Specifically, we first introduce a coordinate-based image builder that transforms GTSP instances into spatially informative representations. We then design an adaptive resolution scaling strategy to enhance adaptability across different problem scales, and develop a multimodal fusion module with dedicated bottlenecks that enables effective integration of geometric and spatial features. Extensive experiments show that our MMFL approach significantly outperforms state-of-the-art methods across various GTSP instances while maintaining the computational efficiency required for real-time robotic applications. Physical robot tests further validate its practical effectiveness in real-world scenarios.
[313] Improved Generalized Planning with LLMs through Strategy Refinement and Reflection
Katharina Stein, Nils Hodel, Daniel Fišer, Jörg Hoffmann, Michael Katz, Alexander Koller
Main category: cs.AI
TL;DR: LLM-based approach for generating generalized plans in PDDL planning with pseudocode debugging and reflection to improve plan quality
Details
Motivation: Previous LLM-based approaches for generating generalized plans in PDDL planning directly convert natural language strategies to Python programs without debugging the strategy itself, leading to incorrect generalized plans when the strategy is flawed.
Method: Three key extensions: 1) Generate strategy as pseudocode with automatic debugging before program generation, 2) Add reflection step during Python debugging to identify failure reasons, 3) Generate multiple program variants and select the best one.
Result: The approach achieves 82% average coverage across 17 benchmark domains, substantially improving generalized plan quality compared to previous methods.
Conclusion: Debugging pseudocode strategies before implementation and incorporating reflection significantly enhances LLM-based generalized plan generation in PDDL planning.
Abstract: LLMs have recently been used to generate Python programs representing generalized plans in PDDL planning, i.e., plans that generalize across the tasks of a given PDDL domain. Previous work proposed a framework consisting of three steps: the LLM first generates a summary and then a strategy for the domain, both in natural language, and then implements that strategy as a Python program, which gets debugged on example planning tasks. In that work, only one strategy is generated and passed directly to the program generation. If the strategy is incorrect, its implementation will therefore result in an incorrect generalized plan. Here, we introduce an approach that generates the strategy in the form of pseudocode and enables automatic debugging of the pseudocode, hence allowing us to identify and fix errors prior to the generation of the generalized plan itself. Additionally, we extend the Python debugging phase with a reflection step prompting the LLM to pinpoint the reason for the observed plan failure. Finally, we take inspiration from LLM code generation to produce several program variants and pick the best one. Running experiments on 17 benchmark domains with two reasoning and two non-reasoning LLMs, we show that these extensions substantially improve the quality of the generalized plans. Our best performing configuration achieves an average coverage of 82% across the domains.
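The generate-debug-reflect-select loop can be sketched as a selection over program variants. Here `generate_variants`, `run_on_tasks`, and `reflect` are placeholder names for the LLM-backed steps, not the paper's API:

```python
def refine_and_select(generate_variants, run_on_tasks, reflect, tasks):
    """Sketch of the pipeline: produce several candidate programs,
    score each on example planning tasks, invoke a reflection hook on
    failures to pinpoint and patch the cause, and keep the variant
    with the highest coverage. All three callables are illustrative
    stand-ins for LLM-backed components."""
    best, best_score = None, -1.0
    for program in generate_variants(3):
        results = [run_on_tasks(program, t) for t in tasks]
        score = sum(results) / len(tasks)          # coverage on example tasks
        for t, ok in zip(tasks, results):
            if not ok:
                program = reflect(program, t)      # reflection step on failure
        if score > best_score:
            best, best_score = program, score
    return best, best_score
```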
[314] Evaluation-Aware Reinforcement Learning
Shripad Vilasrao Deshmukh, Will Schwarzer, Scott Niekum
Main category: cs.AI
TL;DR: EvA-RL is a reinforcement learning framework that optimizes policies for both performance and evaluation accuracy during training, rather than treating evaluation as a post-hoc process.
Details
Motivation: Existing policy evaluation methods in RL suffer from high variance or bias, and evaluation is typically treated as a post-hoc process separate from policy learning. This can lead to unreliable evaluation of deployed RL policies, even though reliable evaluation is critical for safe deployment.
Method: EvA-RL directly optimizes policies for efficient and accurate evaluation in addition to performance. The framework allows for co-learning of both the evaluation-aware policy and the evaluation mechanism itself to mitigate tradeoffs between evaluation accuracy and expected return.
Result: Theoretical analysis and empirical results demonstrate that EvA-RL effectively trades off between evaluation accuracy and expected return. The co-learning approach provides evaluation benefits without significantly sacrificing policy performance.
Conclusion: EvA-RL elevates reliable evaluation to a first-class principle in reinforcement learning, opening a new line of research that integrates evaluation considerations directly into policy learning.
Abstract: Policy evaluation is a core component of many reinforcement learning (RL) algorithms and a critical tool for ensuring safe deployment of RL policies. However, existing policy evaluation methods often suffer from high variance or bias. To address these issues, we introduce Evaluation-Aware Reinforcement Learning (EvA-RL), a general policy learning framework that considers evaluation accuracy at train-time, as opposed to standard post-hoc policy evaluation methods. Specifically, EvA-RL directly optimizes policies for efficient and accurate evaluation, in addition to being performant. We provide an instantiation of EvA-RL and demonstrate through a combination of theoretical analysis and empirical results that EvA-RL effectively trades off between evaluation accuracy and expected return. Finally, we show that the evaluation-aware policy and the evaluation mechanism itself can be co-learned to mitigate this tradeoff, providing the evaluation benefits without significantly sacrificing policy performance. This work opens a new line of research that elevates reliable evaluation to a first-class principle in reinforcement learning.
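One way to read "optimizes policies for efficient and accurate evaluation, in addition to being performant" is as a regularized objective of roughly this shape (our paraphrase; $\lambda$, $\hat{V}$, and the squared-error penalty are illustrative, not the paper's notation):

```latex
\max_{\pi}\; \mathbb{E}_{\tau \sim \pi}\!\left[R(\tau)\right]
\;-\; \lambda\, \mathbb{E}\!\left[\bigl(\hat{V}(\pi) - V(\pi)\bigr)^{2}\right]
```

The first term is the standard expected return; the second penalizes the gap between the evaluator's estimate $\hat{V}(\pi)$ and the true value $V(\pi)$, with $\lambda$ controlling the tradeoff that the co-learning scheme then tries to mitigate.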
[315] RealUnify: Do Unified Models Truly Benefit from Unification? A Comprehensive Benchmark
Yang Shi, Yuhao Dong, Yue Ding, Yuran Wang, Xuanyu Zhu, Sheng Zhou, Wenting Liu, Haochen Tian, Rundong Wang, Huanqian Wang, Zuyan Liu, Bohan Zeng, Ruizhe Chen, Qixun Wang, Zhuoran Zhang, Xinlong Chen, Chengzhuo Tong, Bozhou Li, Qiang Liu, Haotian Wang, Wenjing Yang, Yuanxing Zhang, Pengfei Wan, Yi-Fan Zhang, Ziwei Liu
Main category: cs.AI
TL;DR: RealUnify is a benchmark for evaluating bidirectional synergy between visual understanding and generation in unified multimodal models, revealing current models struggle with effective capability integration.
Details
Motivation: Existing benchmarks assess visual understanding and generation in isolation, failing to determine whether unified models can leverage understanding to enhance generation or use generative simulation to facilitate deeper comprehension.
Method: Introduces RealUnify benchmark with 1,000 human-annotated instances across 10 categories and 32 subtasks, structured around two axes: Understanding Enhances Generation (reasoning-guided image generation) and Generation Enhances Understanding (mental simulation for reasoning tasks). Uses dual-evaluation protocol combining end-to-end assessment with diagnostic stepwise evaluation.
Result: Evaluation of 12 leading unified models and 6 specialized baselines shows current unified models struggle to achieve effective synergy between understanding and generation capabilities, indicating architectural unification alone is insufficient.
Conclusion: Architectural unification of visual understanding and generation is not enough for effective synergy; new training strategies and inductive biases are needed to fully unlock the potential of unified multimodal modeling.
Abstract: The integration of visual understanding and generation into unified multimodal models represents a significant stride toward general-purpose AI. However, a fundamental question remains unanswered by existing benchmarks: does this architectural unification actually enable synergetic interaction between the constituent capabilities? Existing evaluation paradigms, which primarily assess understanding and generation in isolation, are insufficient for determining whether a unified model can leverage its understanding to enhance its generation, or use generative simulation to facilitate deeper comprehension. To address this critical gap, we introduce RealUnify, a benchmark specifically designed to evaluate bidirectional capability synergy. RealUnify comprises 1,000 meticulously human-annotated instances spanning 10 categories and 32 subtasks. It is structured around two core axes: 1) Understanding Enhances Generation, which requires reasoning (e.g., commonsense, logic) to guide image generation, and 2) Generation Enhances Understanding, which necessitates mental simulation or reconstruction (e.g., of transformed or disordered visual inputs) to solve reasoning tasks. A key contribution is our dual-evaluation protocol, which combines direct end-to-end assessment with a diagnostic stepwise evaluation that decomposes tasks into distinct understanding and generation phases. This protocol allows us to precisely discern whether performance bottlenecks stem from deficiencies in core abilities or from a failure to integrate them. Through large-scale evaluations of 12 leading unified models and 6 specialized baselines, we find that current unified models still struggle to achieve effective synergy, indicating that architectural unification alone is insufficient. These results highlight the need for new training strategies and inductive biases to fully unlock the potential of unified modeling.
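The dual-evaluation idea (end-to-end vs. diagnostic stepwise) can be sketched as a small aggregation over per-sample outcomes; the field names below are illustrative, not RealUnify's actual schema:

```python
def diagnose(samples):
    """Compare end-to-end success with per-phase success to locate the
    bottleneck. Each sample is a dict with boolean keys 'end_to_end',
    'understanding_ok', and 'generation_ok' (hypothetical field names)."""
    n = len(samples)
    rates = {
        "end_to_end": sum(s["end_to_end"] for s in samples) / n,
        "understanding": sum(s["understanding_ok"] for s in samples) / n,
        "generation": sum(s["generation_ok"] for s in samples) / n,
    }
    # If both phases succeed in isolation but the end-to-end task fails,
    # the bottleneck is integrating the capabilities, not either one alone.
    rates["integration_gap"] = sum(
        s["understanding_ok"] and s["generation_ok"] and not s["end_to_end"]
        for s in samples
    ) / n
    return rates
```

A large `integration_gap` with high per-phase rates would match the paper's finding that architectural unification alone does not yield synergy.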
[316] PDDL Axioms Are Equivalent to Least Fixed Point Logic (Extended Version)
Claudia Grundke, Gabriele Röger
Main category: cs.AI
TL;DR: PDDL axioms with negative occurrences of derived predicates are shown to be equivalent to least fixed point logic, strictly more expressive than stratified Datalog, with a compilation method to eliminate such negative occurrences.
Details
Motivation: The paper addresses the discrepancy between the PDDL standard's restrictions on negative occurrences of predicates in axiom bodies and common practice in the planning literature, where authors often deviate from these limitations, focusing on stratifiability instead.
Method: Theoretical analysis comparing the expressive power of PDDL axiom variants with least fixed point logic and stratified Datalog, complemented by a compilation technique to eliminate negative occurrences of derived predicates from PDDL axioms.
Result: Both PDDL axiom variants (standard restrictions vs. stratifiable) can express exactly the same queries as least fixed point logic, making them strictly more expressive than stratified Datalog.
Conclusion: The paper provides theoretical clarification on the expressive power of PDDL axioms and offers practical compilation methods for handling negative occurrences in planning domains.
Abstract: Axioms are a feature of the Planning Domain Definition Language PDDL that can be considered as a generalization of database query languages such as Datalog. The PDDL standard restricts negative occurrences of predicates in axiom bodies to predicates that are directly set by actions and not derived by axioms. In the literature, authors often deviate from this limitation and only require that the set of axioms is stratifiable. We show that both variants can express exactly the same queries as least fixed point logic. They are thus strictly more expressive than stratified Datalog, which aligns with another restriction on axioms occasionally considered in the planning literature. Complementing this theoretical analysis, we also present a compilation that eliminates negative occurrences of derived predicates from PDDL axioms.
[317] VIRO: Robust and Efficient Neuro-Symbolic Reasoning with Verification for Referring Expression Comprehension
Hyejin Park, Junhyuk Kwon, Suha Kwak, Jungseul Ok
Main category: cs.AI
TL;DR: VIRO introduces verification-integrated reasoning operators for neuro-symbolic referring expression comprehension, addressing cascading errors in compositional reasoning by embedding lightweight verifiers at each reasoning step.
Details
Motivation: Current neuro-symbolic REC approaches using LLMs and VLMs assume accurate intermediate reasoning steps, leading to cascading errors and high-confidence false positives when no target is present. The paper aims to address this limitation by introducing verification mechanisms.
Method: VIRO embeds lightweight operator-level verifiers within reasoning steps. Each operator executes and validates its output (object existence, spatial relationships), enabling robust handling of no-target cases through verification-aware abstention. The framework decouples program generation and execution for scalability.
Result: Achieves state-of-the-art performance with 61.1% balanced accuracy across target-present and no-target settings, demonstrates generalization to real-world egocentric data, shows high reliability (≤0.3% program failure rate), efficient per-query runtime, and scalability.
Conclusion: VIRO provides a robust neuro-symbolic framework for REC that addresses cascading errors through integrated verification, achieving strong performance, reliability, and generalization while maintaining interpretable reasoning.
Abstract: Referring Expression Comprehension (REC) aims to localize the image region corresponding to a natural language query. Recent neuro-symbolic REC approaches leverage large language models (LLMs) and vision-language models (VLMs) to perform compositional reasoning, decomposing queries into structured programs and executing them step-by-step. While such approaches achieve interpretable reasoning and strong zero-shot generalization, they assume that intermediate reasoning steps are accurate. However, this assumption causes cascading errors: false detections and invalid relations propagate through the reasoning chain, yielding high-confidence false positives even when no target is present in the image. To address this limitation, we introduce Verification-Integrated Reasoning Operators (VIRO), a neuro-symbolic framework that embeds lightweight operator-level verifiers within reasoning steps. Each operator executes and validates its output, such as object existence or spatial relationships, allowing the system to robustly handle no-target cases through verification-aware abstention. Our framework achieves state-of-the-art performance, reaching 61.1% balanced accuracy across target-present and no-target settings, and demonstrates generalization to real-world egocentric data. VIRO also shows high reliability with a program failure rate of at most 0.3%, efficient per-query runtime, and scalability through decoupled program generation and execution.
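Operator-level verification with abstention can be sketched as a guarded execution loop over a decomposed query; `detect` and `verify` are illustrative stand-ins for the VLM-backed operators, not VIRO's actual interface:

```python
def locate(query_steps, detect, verify):
    """Execute a compositional reasoning program step by step, but gate
    each operator's output through a lightweight verifier. Instead of
    letting an unverified detection propagate (and cascade into a
    high-confidence false positive), the pipeline abstains by
    returning None, which covers the no-target case."""
    candidates = None
    for step in query_steps:
        candidates = detect(step, candidates)
        if not verify(step, candidates):   # e.g. object-existence check
            return None                    # verification-aware abstention
    return candidates
```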
[318] On Sample-Efficient Generalized Planning via Learned Transition Models
Nitin Gupta, Vishal Pallagani, John A. Aydin, Biplav Srivastava
Main category: cs.AI
TL;DR: Learning explicit neural transition models for generalized planning outperforms direct action-sequence prediction in out-of-distribution generalization with fewer training samples and smaller models.
Details
Motivation: Current Transformer-based planners for generalized planning (like PlanGPT and Plansformer) directly predict action sequences without explicit transition modeling, which leads to state drift in long-horizon tasks and requires large datasets/models. The authors aim to address these limitations by learning explicit transition models.
Method: Formulate generalized planning as transition-model learning, where a neural model approximates the successor-state function. Instead of predicting actions directly, the model autoregressively predicts intermediate world states, learning domain dynamics as an implicit world model. Evaluate multiple state representations and neural architectures, including relational graph encodings.
Result: Learning explicit transition models yields higher out-of-distribution satisficing-plan success than direct action-sequence prediction in multiple domains, while achieving these gains with significantly fewer training instances and smaller models.
Conclusion: Explicit transition modeling is more effective for generalized planning than direct action prediction, offering better generalization, sample efficiency, and robustness to state drift in long-horizon tasks.
Abstract: Generalized planning studies the construction of solution strategies that generalize across families of planning problems sharing a common domain model, formally defined by a transition function $\gamma: S \times A \rightarrow S$. Classical approaches achieve such generalization through symbolic abstractions and explicit reasoning over $\gamma$. In contrast, recent Transformer-based planners, such as PlanGPT and Plansformer, largely cast generalized planning as direct action-sequence prediction, bypassing explicit transition modeling. While effective on in-distribution instances, these approaches typically require large datasets and model sizes, and often suffer from state drift in long-horizon settings due to the absence of explicit world-state evolution. In this work, we formulate generalized planning as a transition-model learning problem, in which a neural model explicitly approximates the successor-state function $\hat{\gamma} \approx \gamma$ and generates plans by rolling out symbolic state trajectories. Instead of predicting actions directly, the model autoregressively predicts intermediate world states, thereby learning the domain dynamics as an implicit world model. To study size-invariant generalization and sample efficiency, we systematically evaluate multiple state representations and neural architectures, including relational graph encodings. Our results show that learning explicit transition models yields higher out-of-distribution satisficing-plan success than direct action-sequence prediction in multiple domains, while achieving these gains with significantly fewer training instances and smaller models. This is an extended version of a short paper accepted at ICAPS 2026 under the same title.
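Rolling out a learned successor-state function can be sketched as follows, on a toy counter domain; `transition_model` plays the role of the learned approximation of the successor-state function and is purely illustrative:

```python
def rollout(state, plan, transition_model, max_steps=100):
    """Generate a symbolic state trajectory by repeatedly applying a
    learned successor-state function state' = model(state, action),
    rather than predicting the action sequence directly. Tracking the
    explicit state at every step is what guards against drift over
    long horizons."""
    trajectory = [state]
    for action in plan[:max_steps]:
        state = transition_model(state, action)
        trajectory.append(state)
    return trajectory
```

In the actual setting the model would predict the next state autoregressively from the current symbolic state; a simple counter transition suffices to show the rollout shape.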
[319] Generative AI-assisted Participatory Modeling in Socio-Environmental Planning under Deep Uncertainty
Zhihao Pei, Nir Lipovetzky, Angela M. Rojas-Arevalo, Fjalar J. de Haan, Enayat A. Moallemi
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2603.17021 returned HTTP 429 (rate limited), so no abstract or details could be retrieved.
[320] Agentic Business Process Management: A Research Manifesto
Diego Calvanese, Angelo Casciani, Giuseppe De Giacomo, Marlon Dumas, Fabiana Fournier, Timotheus Kampik, Emanuele La Malfa, Lior Limonad, Andrea Marrella, Andreas Metzger, Marco Montali, Daniel Amyot, Peter Fettke, Artem Polyvyanyy, Stefanie Rinderle-Ma, Sebastian Sardiña, Niek Tax, Barbara Weber
Main category: cs.AI
TL;DR: APM extends BPM to govern autonomous agents in organizations, shifting from automation to framed autonomy with process awareness.
Details
Motivation: Traditional BPM focuses on automation, but autonomous agents require new governance approaches that balance autonomy with organizational alignment through process awareness.
Method: Conceptual framework introducing core abstractions and architectural elements for APM systems, focusing on four key agent capabilities: framed autonomy, explainability, conversational actionability, and self-modification.
Result: A manifesto establishing APM foundations, identifying research challenges at the intersection of BPM, AI, and multi-agent systems, and providing a roadmap for bridging these communities.
Conclusion: APM represents a paradigm shift requiring new approaches to govern autonomous agents in organizational processes, with significant research challenges ahead in integrating BPM, AI, and multi-agent systems.
Abstract: This paper presents a manifesto that articulates the conceptual foundations of Agentic Business Process Management (APM), an extension of Business Process Management (BPM) for governing autonomous agents executing processes in organizations. From a management perspective, APM represents a paradigm shift from the traditional process view of the business process, driven by the realization of process awareness and an agent-oriented abstraction, where software and human agents act as primary functional entities that perceive, reason, and act within explicit process frames. This perspective marks a shift from traditional, automation-oriented BPM toward systems in which autonomy is constrained, aligned, and made operational through process awareness. We introduce the core abstractions and architectural elements required to realize APM systems and elaborate on four key capabilities that such APM agents must support: framed autonomy, explainability, conversational actionability, and self-modification. These capabilities jointly ensure that agents’ goals are aligned with organizational goals and that agents behave in a framed yet proactive manner in pursuing those goals. We discuss the extent to which the capabilities can be realized and identify research challenges whose resolution requires further advances in BPM, AI, and multi-agent systems. The manifesto thus serves as a roadmap for bridging these communities and for guiding the development of APM systems in practice.
[321] Unmasking Algorithmic Bias in Predictive Policing: A GAN-Based Simulation Framework with Multi-City Temporal Analysis
Pronob Kumar Barman, Pronoy Kumar Barman
Main category: cs.AI
TL;DR: A simulation framework using GANs and patrol models reveals racial bias amplification in predictive policing systems across Baltimore and Chicago, showing extreme disparities in enforcement outcomes.
Details
Motivation: To quantitatively understand how predictive policing systems encode and amplify racial disparities through the full enforcement pipeline, from crime occurrence to police contact.
Method: Developed a reproducible simulation framework coupling a Generative Adversarial Network (GAN) with a Noisy OR patrol detection model, using 145,000+ crime records from Baltimore (2017-2019) and 233,000+ records from Chicago (2022), augmented with US Census ACS demographic data. Computed four monthly bias metrics across 264 city-year-mode observations.
Result: Revealed extreme and year-variant bias in Baltimore’s detected mode (mean annual DIR up to 157.14 in 2019), moderate under-detection of Black residents in Chicago (DIR = 0.22), and persistent Gini coefficients of 0.43 to 0.62. CTGAN debiasing partially redistributes detection rates but cannot eliminate structural disparity without policy intervention. Socioeconomic regression shows strong correlations between neighborhood racial composition and detection likelihood.
Conclusion: Predictive policing systems systematically amplify racial disparities, and algorithmic debiasing alone is insufficient without accompanying policy interventions. Outcomes are most sensitive to officer deployment levels.
Abstract: Predictive policing systems that direct patrol resources based on algorithmically generated crime forecasts have been widely deployed across US cities, yet their tendency to encode and amplify racial disparities remains poorly understood in quantitative terms. We present a reproducible simulation framework that couples a Generative Adversarial Network (GAN) with a Noisy OR patrol detection model to measure how racial bias propagates through the full enforcement pipeline from crime occurrence to police contact. Using 145,000+ Part 1 crime records from Baltimore (2017 to 2019) and 233,000+ records from Chicago (2022), augmented with US Census ACS demographic data, we compute four monthly bias metrics across 264 city-year-mode observations: the Disparate Impact Ratio (DIR), Demographic Parity Gap, Gini Coefficient, and a composite Bias Amplification Score. Our experiments reveal extreme and year-variant bias in Baltimore's detected mode, with mean annual DIR up to 157.14 in 2019, moderate under-detection of Black residents in Chicago (DIR = 0.22), and persistent Gini coefficients of 0.43 to 0.62 across all conditions. We further demonstrate that a Conditional Tabular GAN (CTGAN) debiasing approach partially redistributes detection rates but cannot eliminate structural disparity without accompanying policy intervention. Socioeconomic regression analysis confirms strong correlations between neighborhood racial composition and detection likelihood (Pearson r = 0.83 for percent White and r = -0.81 for percent Black). A sensitivity analysis over patrol radius, officer count, and citizen reporting probability reveals that outcomes are most sensitive to officer deployment levels. The code and data are publicly available at this repository.
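Two of the headline metrics are simple to state. A minimal sketch as we read them from the summary (simplified; the paper's exact monthly, per-mode computation may differ):

```python
def disparate_impact_ratio(det_a, pop_a, det_b, pop_b):
    """DIR: ratio of per-capita detection rates between two groups.
    1.0 indicates parity; values far from 1.0 indicate disparity."""
    return (det_a / pop_a) / (det_b / pop_b)

def gini(xs):
    """Gini coefficient of non-negative values (not all zero):
    0 = perfectly equal distribution, approaching 1 = fully concentrated."""
    xs = sorted(xs)
    n, total = len(xs), sum(xs)
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return (2.0 * weighted) / (n * total) - (n + 1.0) / n
```

For example, a group detected at twice the per-capita rate of another yields DIR = 2.0, and a detection count spread equally across neighborhoods yields a Gini of 0.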
[322] Evaluating Game Difficulty in Tetris Block Puzzle
Chun-Jui Wang, Jian-Ting Guo, Hung Guei, Chung-Chin Shih, Ti-Rong Wu, I-Chen Wu
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2603.18994 returned HTTP 429 (rate limited), so no abstract or details could be retrieved.
[323] Exact MAP inference in general higher-order graphical models using linear programming
Ikhlef Bechar
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 1709.09051 returned HTTP 429 (rate limited), so no abstract or details could be retrieved.
[324] A Comprehensive Survey on Vector Database: Storage and Retrieval Technique, Challenge
Le Ma, Ran Zhang, Yikun Han, Shirui Yu, Zaitian Wang, Zhiyuan Ning, Jinghan Zhang, Ping Xu, Pengjiang Li, Wei Ju, Chong Chen, Dongjie Wang, Kunpeng Liu, Pengyang Wang, Pengfei Wang, Yanjie Fu, Chunjiang Liu, Yuanchun Zhou, Chang-Tien Lu
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2310.11703 returned HTTP 429 (rate limited), so no abstract or details could be retrieved.
[325] LISAA: A Framework for Large Language Model Information Security Awareness Assessment
Ofir Cohen, Gil Ari Agmon, Asaf Shabtai, Rami Puzis
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2411.13207 returned HTTP 429 (rate limited), so no abstract or details could be retrieved.
[326] HALO: Hierarchical Reinforcement Learning for Large-Scale Adaptive Traffic Signal Control
Yaqiao Zhu, Hongkai Wen, Geyong Min, Man Luo
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2506.14391 returned HTTP 429 (rate limited), so no abstract or details could be retrieved.
[327] Mapping Caregiver Needs to AI Chatbot Design: Strengths and Gaps in Mental Health Support for Alzheimer’s and Dementia Caregivers
Jiayue Melissa Shi, Dong Whi Yoo, Keran Wang, Violeta J. Rodriguez, Ravi Karkar, Koustuv Saha
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2506.15047 returned HTTP 429 (rate limited), so no abstract or details could be retrieved.
[328] Adaptive Relative Pose Estimation Framework with Dual Noise Tuning for Safe Approaching Maneuvers
Batu Candan, Murat Berke Oktay, Simone Servadio
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2507.16214 returned HTTP 429 (rate limited), so no abstract or details could be retrieved.
[329] World4RL: Diffusion World Models for Policy Refinement with Reinforcement Learning for Robotic Manipulation
Zhennan Jiang, Kai Liu, Yuxin Qin, Shuai Tian, Yupeng Zheng, Mingcai Zhou, Chao Yu, Haoran Li, Dongbin Zhao
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2509.19080 returned HTTP 429 (rate limited), so no abstract or details could be retrieved.
[330] Sensing Without Colocation: Operator-Based Virtual Instrumentation for Domains Beyond Physical Reach
Jay Phil Yoo, Kazuma Kobayashi, Souvik Chakraborty, Syed Bahauddin Alam
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2510.18041 returned HTTP 429 (rate limited), so no abstract or details could be retrieved.
[331] Taming the Long-Tail: Efficient Reasoning RL Training with Adaptive Drafter
Qinghao Hu, Shang Yang, Junxian Guo, Xiaozhe Yao, Yujun Lin, Yuxian Gu, Han Cai, Chuang Gan, Ana Klimovic, Song Han
Main category: cs.AI
Summary unavailable: the arXiv API returned HTTP 429 (rate limited) when fetching 2511.16665.
[332] FORWARD: Dataset of a forwarder operating in rough terrain
Mikael Lundbäck, Erik Wallin, Carola Häggström, Mattias Nyström, Andreas Grönlund, Mats Richardson, Petrus Jönsson, William Arnvik, Lucas Hedström, Arvid Fälldin, Martin Servin
Main category: cs.AI
Summary unavailable: the arXiv API returned HTTP 429 (rate limited) when fetching 2511.17318.
[333] The Phish, The Spam, and The Valid: Generating Feature-Rich Emails for Benchmarking LLMs
Rebeka Toth, Tamas Bisztray, Nils Gruschka
Main category: cs.AI
Summary unavailable: the arXiv API returned HTTP 429 (rate limited) when fetching 2511.21448.
[334] A Sheaf-Theoretic and Topological Perspective on Complex Network Modeling and Attention Mechanisms in Graph Neural Models
Chuan-Shen Hu
Main category: cs.AI
Summary unavailable: the arXiv API returned HTTP 429 (rate limited) when fetching 2601.21207.
[335] IFNSO: Iteration-Free Newton-Schulz Orthogonalization
Chen Hu, Qianxi Zhao, Xiaochen Yuan, Hong Zhang, Ding Yuan, Yanbin Wu, Xiying Li
Main category: cs.AI
Summary unavailable: the arXiv API returned HTTP 429 (rate limited) when fetching 2602.02500.
[336] StealthRL: Reinforcement Learning Paraphrase Attacks for Multi-Detector Evasion of AI-Text Detectors
Suraj Ranganath, Atharv Ramesh
Main category: cs.AI
Summary unavailable: the arXiv API returned HTTP 429 (rate limited) when fetching 2602.08934.
[337] Spectral Convolution on Orbifolds for Geometric Deep Learning
Tim Mangliers, Bernhard Mössner, Benjamin Himpel
Main category: cs.AI
Summary unavailable: the arXiv API returned HTTP 429 (rate limited) when fetching 2602.14997.
[338] Federated Learning Playground
Bryan Shan, Alysa Ziying Tan, Han Yu
Main category: cs.AI
Summary unavailable: the arXiv API returned HTTP 429 (rate limited) when fetching 2602.19489.
[339] On the Structural Non-Preservation of Epistemic Behaviour under Policy Transformation
Alexander Galozy
Main category: cs.AI
Summary unavailable: the arXiv API returned HTTP 429 (rate limited) when fetching 2602.21424.
[340] FEAT: A Linear-Complexity Foundation Model for Extremely Large Structured Data
Zhenghang Song, Tang Qian, Lu Chen, Yushuai Li, Zhengke Hu, Bingbing Fang, Yumeng Song, Junbo Zhao, Sheng Zhang, Tianyi Li
Main category: cs.AI
Summary unavailable: the arXiv API returned HTTP 429 (rate limited) when fetching 2603.16513.
[341] R2-Dreamer: Redundancy-Reduced World Models without Decoders or Augmentation
Naoki Morihira, Amal Nahar, Kartik Bharadwaj, Yasuhiro Kato, Akinobu Hayashi, Tatsuya Harada
Main category: cs.AI
Summary unavailable: the arXiv API returned HTTP 429 (rate limited) when fetching 2603.18202.
[342] PlanTwin: Privacy-Preserving Planning Abstractions for Cloud-Assisted LLM Agents
Guangsheng Yu, Qin Wang, Rui Lang, Shuai Su, Xu Wang
Main category: cs.AI
Summary unavailable: the arXiv API returned HTTP 429 (rate limited) when fetching 2603.18377.
[343] An SO(3)-equivariant reciprocal-space neural potential for long-range interactions
Lingfeng Zhang, Taoyong Cui, Dongzhan Zhou, Lei Bai, Sufei Zhang, Luca Rossi, Mao Su, Wanli Ouyang, Pheng-Ann Heng
Main category: cs.AI
Summary unavailable: the arXiv API returned HTTP 429 (rate limited) when fetching 2603.18389.
[344] Agent Control Protocol: Admission Control for Agent Actions
Marcelo Fernandez
Main category: cs.AI
Summary unavailable: the arXiv API returned HTTP 429 (rate limited) when fetching 2603.18829.
cs.SD
[345] Listen First, Then Answer: Timestamp-Grounded Speech Reasoning
Jihoon Jeong, Pooneh Mousavi, Mirco Ravanelli, Cem Subakan
Main category: cs.SD
TL;DR: RL-based method grounds audio-language model reasoning chains with explicit timestamp annotations to improve faithfulness and performance
Details
Motivation: Current large audio-language models generate reasoning chains, but it is unclear whether these remain grounded in the input audio, raising concerns about faithfulness and reliability.
Method: Proposes an RL-based strategy that grounds reasoning outputs with explicit timestamp annotations referring to relevant audio segments, encouraging models to attend more to audio tokens during reasoning.
Result: Experiments on four speech benchmark datasets show improved performance over zero-shot reasoning and fine-tuning without timestamp grounding; grounding amplifies desirable reasoning behaviors like region exploration, audiology verification, and consistency
Conclusion: Timestamp grounding improves faithfulness and performance in audio-language models, highlighting the importance of grounding mechanisms for reliable multimodal reasoning
Abstract: Large audio-language models (LALMs) can generate reasoning chains for their predictions, but it remains unclear whether these reasoning chains remain grounded in the input audio. In this paper, we propose an RL-based strategy that grounds the reasoning outputs of LALMs with explicit timestamp annotations referring to relevant segments of the audio signal. Our analysis shows that timestamp grounding leads the model to attend more strongly to audio tokens during reasoning generation. Experiments on four speech-based benchmark datasets demonstrate that our approach improves performance compared to both zero-shot reasoning and fine-tuning without timestamp grounding. Additionally, grounding amplifies desirable reasoning behaviors, such as region exploration, audiology verification, and consistency, underscoring the importance of grounding mechanisms for faithful multimodal reasoning.
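The paper's actual reward design is not given in the abstract, but the core idea of rewarding timestamp-grounded reasoning can be sketched as a validity check: parse the timestamp citations out of a reasoning chain and score the fraction that actually fall inside the audio. The `[start s-end s]` bracket format and the `grounding_reward` helper below are illustrative assumptions, not the authors' implementation.

```python
import re

# Hypothetical citation format: "[1.0s-2.5s]" marks an audio span.
TIMESTAMP = re.compile(r"\[(\d+(?:\.\d+)?)s\s*-\s*(\d+(?:\.\d+)?)s\]")

def grounding_reward(reasoning: str, audio_duration: float) -> float:
    """Fraction of cited timestamp spans that are well-formed and lie
    inside the audio signal; 0.0 if the chain cites no spans at all."""
    spans = [(float(a), float(b)) for a, b in TIMESTAMP.findall(reasoning)]
    if not spans:
        return 0.0  # an ungrounded chain earns no grounding reward
    valid = sum(1 for s, e in spans if 0.0 <= s < e <= audio_duration)
    return valid / len(spans)

chain = "The speaker hesitates [1.0s-2.5s], then answers firmly [3.0s-4.2s]."
print(grounding_reward(chain, audio_duration=5.0))  # 1.0: both spans valid
```

A reward like this only checks that citations are plausible; the paper's attention analysis suggests the real benefit is that optimizing for such citations pushes the model to attend to the audio tokens themselves.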
[346] CAF-Score: Calibrating CLAP with LALMs for Reference-free Audio Captioning Evaluation
Insung Lee, Taeyoung Jeong, Haejun Yoo, Du-Seong Chang, Myoung-Wan Koo
Main category: cs.SD
TL;DR: CAF-Score is a reference-free evaluation metric for audio captioning that combines CLAP’s semantic alignment with LALM’s fine-grained comprehension to detect syntactic errors and hallucinations.
Details
Motivation: Current evaluation methods for audio captioning have limitations: reference-based metrics are expensive and miss acoustic fidelity, while CLAP-based approaches overlook syntactic errors and fine-grained details.
Method: Proposes CAF-Score, which calibrates CLAP’s coarse-grained semantic alignment with the LALM’s fine-grained comprehension and syntactic awareness, combining contrastive audio-text embeddings with LALM reasoning.
Result: Experiments on BRACE benchmark show CAF-Score achieves highest correlation with human judgments, outperforming reference-based baselines in challenging scenarios.
Conclusion: CAF-Score is an effective reference-free metric for audio captioning evaluation that addresses limitations of existing approaches.
Abstract: While Large Audio-Language Models (LALMs) have advanced audio captioning, robust evaluation remains difficult. Reference-based metrics are expensive and often fail to assess acoustic fidelity, while Contrastive Language-Audio Pretraining (CLAP)-based approaches frequently overlook syntactic errors and fine-grained details. We propose CAF-Score, a reference-free metric that calibrates CLAP’s coarse-grained semantic alignment with the fine-grained comprehension and syntactic awareness of LALMs. By combining contrastive audio-text embeddings with LALM reasoning, CAF-Score effectively detects syntactic inconsistencies and subtle hallucinations. Experiments on the BRACE benchmark demonstrate that our approach achieves the highest correlation with human judgments, even outperforming reference-based baselines in challenging scenarios. These results highlight the efficacy of CAF-Score for reference-free audio captioning evaluation. Code and results are available at https://github.com/inseong00/CAF-Score.
[347] Borderless Long Speech Synthesis
Xingchen Song, Di Wu, Dinghao Zhou, Pengyu Cheng, Hongwu Ding, Yunchao He, Jie Wang, Shengfan Shen, Sixiang Lv, Lichun Fan, Hang Su, Yifeng Wang, Shuai Wang, Meng Meng, Jian Luan
Main category: cs.SD
TL;DR: A unified framework for borderless long speech synthesis that goes beyond traditional TTS by incorporating global context, multi-speaker interactions, and paralinguistic cues through hierarchical annotation and LLM-agent integration.
Details
Motivation: Existing TTS systems lack understanding of global context and paralinguistic cues, making it hard to capture real-world phenomena like multi-speaker interactions, emotional arcs, and varied acoustic environments.
Method: Proposes the Borderless Long Speech Synthesis framework with: 1) a “Labeling over filtering/cleaning” data strategy, 2) a Global-Sentence-Token hierarchical annotation schema, 3) a continuous-tokenizer backbone with Chain-of-Thought reasoning and Dimension Dropout, 4) a Native Agentic design with a Structured Semantic Interface between the LLM Agent and the synthesis engine.
Result: The system enables unified capabilities spanning VoiceDesigner, multi-speaker synthesis, Instruct TTS, and long-form text synthesis, with improved instruction following under complex conditions.
Conclusion: The framework extends Text2Speech to borderless long speech synthesis by creating a layered control protocol stack that enables front-end LLMs to convert any modality inputs into structured generation commands.
Abstract: Most existing text-to-speech (TTS) systems either synthesize speech sentence by sentence and stitch the results together, or drive synthesis from plain-text dialogues alone. Both approaches leave models with little understanding of global context or paralinguistic cues, making it hard to capture real-world phenomena such as multi-speaker interactions (interruptions, overlapping speech), evolving emotional arcs, and varied acoustic environments. We introduce the Borderless Long Speech Synthesis framework for agent-centric, borderless long audio synthesis. Rather than targeting a single narrow task, the system is designed as a unified capability set spanning VoiceDesigner, multi-speaker synthesis, Instruct TTS, and long-form text synthesis. On the data side, we propose a “Labeling over filtering/cleaning” strategy and design a top-down, multi-level annotation schema we call Global-Sentence-Token. On the model side, we adopt a backbone with a continuous tokenizer and add Chain-of-Thought (CoT) reasoning together with Dimension Dropout, both of which markedly improve instruction following under complex conditions. We further show that the system is Native Agentic by design: the hierarchical annotation doubles as a Structured Semantic Interface between the LLM Agent and the synthesis engine, creating a layered control protocol stack that spans from scene semantics down to phonetic detail. Text thereby becomes an information-complete, wide-band control channel, enabling a front-end LLM to convert inputs of any modality into structured generation commands, extending the paradigm from Text2Speech to borderless long speech synthesis.
[348] MOSS-TTSD: Text to Spoken Dialogue Generation
Yuqian Zhang, Donghua Yu, Zhengyuan Lin, Botian Jiang, Mingshu Chen, Yaozhou Jiang, Yiwei Zhao, Yiyang Zhang, Yucheng Yuan, Hanfu Chen, Kexin Huang, Jun Zhan, Cheng Chang, Zhaoye Fei, Shimin Li, Xiaogui Yang, Qinyuan Cheng, Xipeng Qiu
Main category: cs.SD
TL;DR: MOSS-TTSD is a spoken dialogue synthesis model that generates expressive multi-party conversational speech with accurate turn-taking, cross-turn acoustic consistency, and long-form stability across multiple languages.
Details
Motivation: Current spoken dialogue generation models lack proper dialogue context modeling, failing to address key requirements like accurate turn-taking, cross-turn acoustic consistency, and long-form stability needed for applications like podcasts, dynamic commentary, and entertainment content.
Method: MOSS-TTSD uses enhanced long-context modeling to generate long-form spoken conversations from dialogue scripts with speaker tags, supporting up to 60 minutes of single-pass synthesis, multi-party dialogue with up to 5 speakers, and zero-shot voice cloning from short reference audio. The paper also introduces TTSD-eval, an objective evaluation framework based on forced alignment for measuring speaker attribution accuracy and similarity.
Result: MOSS-TTSD surpasses strong open-source and proprietary baselines in dialogue synthesis according to both objective and subjective evaluations, demonstrating superior performance in generating expressive multi-party conversational speech.
Conclusion: MOSS-TTSD effectively addresses the challenges of spoken dialogue generation through enhanced context modeling and supports practical applications requiring long-form, multi-party conversational speech with consistent speaker characteristics.
Abstract: Spoken dialogue generation is crucial for applications like podcasts, dynamic commentary, and entertainment content, but poses significant challenges compared to single-utterance text-to-speech (TTS). Key requirements include accurate turn-taking, cross-turn acoustic consistency, and long-form stability, which current models often fail to address due to a lack of dialogue context modeling. To bridge this gap, we present MOSS-TTSD, a spoken dialogue synthesis model designed for expressive, multi-party conversational speech across multiple languages. With enhanced long-context modeling, MOSS-TTSD generates long-form spoken conversations from dialogue scripts with explicit speaker tags, supporting up to 60 minutes of single-pass synthesis, multi-party dialogue with up to 5 speakers, and zero-shot voice cloning from a short reference audio clip. The model supports various mainstream languages, including English and Chinese, and is adapted to several long-form scenarios. Additionally, to address limitations of existing evaluation methods, we propose TTSD-eval, an objective evaluation framework based on forced alignment that measures speaker attribution accuracy and speaker similarity without relying on speaker diarization tools. Both objective and subjective evaluation results show that MOSS-TTSD surpasses strong open-source and proprietary baselines in dialogue synthesis.
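Once forced alignment has recovered who actually speaks each turn of the synthesized audio, TTSD-eval's speaker attribution accuracy reduces to a per-turn comparison against the script. A minimal sketch; the function name and data layout are assumptions, not the released evaluation code.

```python
def speaker_attribution_accuracy(script, recovered):
    """script: list of (speaker, text) turns from the dialogue script.
    recovered: the speaker identified for each synthesized turn, e.g.
    via forced alignment of the audio against the script. Returns the
    fraction of turns voiced by the scripted speaker."""
    if len(script) != len(recovered):
        raise ValueError("need one recovered speaker per scripted turn")
    hits = sum(1 for (spk, _), got in zip(script, recovered) if spk == got)
    return hits / len(script)

script = [("Alice", "Hi!"), ("Bob", "Hey."), ("Alice", "Bye.")]
print(speaker_attribution_accuracy(script, ["Alice", "Bob", "Alice"]))  # 1.0
```

Because the check runs against the script itself, no speaker diarization tool is needed, which is the limitation of prior evaluation setups the paper calls out.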
[349] Audio Avatar Fingerprinting: An Approach for Authorized Use of Voice Cloning in the Era of Synthetic Audio
Candice R. Gerstner
Main category: cs.SD
TL;DR: Paper introduces audio avatar fingerprinting - verifying if synthetic speech is authorized by a specific user, and extends speaker verification models for fake speech detection with a new dataset.
Details
Motivation: With AI speech synthesis making it easy to generate realistic audio in any voice, there's a need to detect synthetic speech for security (authentication systems, videoconferencing) while also enabling legitimate uses like low-bandwidth communication. The paper addresses the novel problem of verifying whether synthesized audio is authorized by a specific user.
Method: Analyzes and extends an off-the-shelf speaker verification model for fake speech detection and audio avatar fingerprinting. Introduces a new speech forensics dataset specifically designed for verifying authorized use of synthetic audio, addressing limitations of existing datasets.
Result: Presents the first experimentation of its kind for audio avatar fingerprinting. The extended speaker verification model shows potential for detecting synthetic speech and verifying authorized use. The new dataset enables research in this novel forensic task.
Conclusion: Audio avatar fingerprinting is an important emerging forensic task for security applications. Off-the-shelf speaker verification models can be adapted for this purpose, and the new dataset addresses a critical gap in synthetic audio forensics research.
Abstract: With the advancements in AI speech synthesis, it is easier than ever before to generate realistic audio in a target voice. One only needs a few seconds of reference audio from the target, quite literally putting words in the target person’s mouth. This imposes a new set of forensics-related challenges on speech-based authentication systems, videoconferencing, and audio-visual broadcasting platforms, where we want to detect synthetic speech. At the same time, leveraging AI speech synthesis can enhance the different modes of communication through features such as low-bandwidth communication and audio enhancements - leading to ever-increasing legitimate use-cases of synthetic audio. In this case, we want to verify if the synthesized voice is actually spoken by the user. This will require a mechanism to verify whether a given synthetic audio is driven by an authorized identity, or not. We term this task audio avatar fingerprinting. As a step towards audio forensics in these new and emerging situations, we analyze and extend an off-the-shelf speaker verification model developed outside of forensics context for the task of fake speech detection and audio avatar fingerprinting, the first experimentation of its kind. Furthermore, we observe that no existing dataset allows for the novel task of verifying the authorized use of synthetic audio - a limitation which we address by introducing a new speech forensics dataset for this novel task.
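At its core, the fingerprinting decision described above is a speaker-verification comparison: embed the driving audio, embed the enrolled identity, and threshold their similarity. A minimal pure-Python sketch; the 0.7 threshold and function names are arbitrary placeholders, not values from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def is_authorized(synthetic_emb, enrolled_emb, threshold=0.7):
    """Accept the synthetic audio as driven by the enrolled identity
    only when the embeddings agree beyond a tuned threshold."""
    return cosine(synthetic_emb, enrolled_emb) >= threshold

print(is_authorized([1.0, 0.0], [0.9, 0.1]))  # True: near-identical voices
print(is_authorized([1.0, 0.0], [0.0, 1.0]))  # False: orthogonal embeddings
```

The interesting part the paper studies is whether a verification model trained on natural speech still yields discriminative embeddings when one side of the comparison is synthetic.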
[350] FoleyDirector: Fine-Grained Temporal Steering for Video-to-Audio Generation via Structured Scripts
You Li, Dewei Zhou, Fan Ma, Fu Li, Dongliang He, Yi Yang
Main category: cs.SD
TL;DR: FoleyDirector enables precise temporal control in video-to-audio generation using structured temporal scripts and bi-frame synthesis for multi-event scenarios.
Details
Motivation: Current V2A methods struggle with fine-grained temporal control in multi-event scenarios or when visual cues are insufficient (small regions, off-screen sounds, occluded objects). There's a need for precise temporal guidance while maintaining audio quality.
Method: Introduces Structured Temporal Scripts (STS), captions for short temporal segments; a Script-Guided Temporal Fusion Module with Temporal Script Attention; Bi-Frame Sound Synthesis for parallel in-frame/out-of-frame audio generation; and new datasets DirectorSound, VGGSoundDirector, and DirectorBench.
Result: FoleyDirector substantially enhances temporal controllability while maintaining high audio fidelity, enabling users to act as Foley directors and advancing V2A toward more expressive and controllable generation.
Conclusion: The framework enables precise temporal guidance in DiT-based V2A generation while preserving audio quality, allowing seamless switching between V2A generation and temporally controlled synthesis.
Abstract: Recent Video-to-Audio (V2A) methods have achieved remarkable progress, enabling the synthesis of realistic, high-quality audio. However, they struggle with fine-grained temporal control in multi-event scenarios or when visual cues are insufficient, such as small regions, off-screen sounds, or occluded or partially visible objects. In this paper, we propose FoleyDirector, a framework that, for the first time, enables precise temporal guidance in DiT-based V2A generation while preserving the base model’s audio quality and allowing seamless switching between V2A generation and temporally controlled synthesis. FoleyDirector introduces Structured Temporal Scripts (STS), a set of captions corresponding to short temporal segments, to provide richer temporal information. These features are integrated via the Script-Guided Temporal Fusion Module, which employs Temporal Script Attention to fuse STS features coherently. To handle complex multi-event scenarios, we further propose Bi-Frame Sound Synthesis, enabling parallel in-frame and out-of-frame audio generation and improving controllability. To support training and evaluation, we construct the DirectorSound dataset and introduce VGGSoundDirector and DirectorBench. Experiments demonstrate that FoleyDirector substantially enhances temporal controllability while maintaining high audio fidelity, empowering users to act as Foley directors and advancing V2A toward more expressive and controllable generation.
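A Structured Temporal Script is essentially a list of captioned time segments, and Bi-Frame Sound Synthesis partitions them by on-screen visibility. A minimal sketch of the data layout; the field names and `split_bi_frame` helper are illustrative assumptions, not the paper's schema.

```python
from dataclasses import dataclass

@dataclass
class ScriptEntry:
    start: float    # segment start, seconds
    end: float      # segment end, seconds
    caption: str    # what sounds during this segment
    in_frame: bool  # is the sound source visible in the frame?

def split_bi_frame(script):
    """Partition a structured temporal script into the two parallel
    streams generated separately (in-frame vs out-of-frame audio)."""
    visible = [e for e in script if e.in_frame]
    off_screen = [e for e in script if not e.in_frame]
    return visible, off_screen

sts = [
    ScriptEntry(0.0, 1.5, "door slams shut", True),
    ScriptEntry(1.0, 4.0, "dog barks off-screen", False),
]
visible, off_screen = split_bi_frame(sts)
print(len(visible), len(off_screen))  # 1 1
```

Note the segments may overlap in time (the bark continues past the slam), which is exactly the multi-event case that a single caption per clip cannot express.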
[351] AC-Foley: Reference-Audio-Guided Video-to-Audio Synthesis with Acoustic Transfer
Pengjun Fang, Yingqing He, Yazhou Xing, Qifeng Chen, Ser-Nam Lim, Harry Yang
Main category: cs.SD
TL;DR: AC-Foley: Audio-conditioned video-to-audio generation model that uses reference audio instead of text prompts for fine-grained sound synthesis, addressing semantic granularity gaps and textual ambiguity in existing methods.
Details
Motivation: Existing V2A methods rely on text prompts, which have two critical bottlenecks: 1) semantic granularity gaps in training data (conflating acoustically distinct sounds under coarse labels), and 2) textual ambiguity in describing micro-acoustic features, making fine-grained sound synthesis difficult.
Method: Proposes AC-Foley, an audio-conditioned V2A model that directly leverages reference audio instead of text prompts. This approach bypasses the semantic ambiguities of text descriptions and enables precise manipulation of acoustic attributes through direct audio conditioning.
Result: Achieves state-of-the-art performance for Foley generation when conditioned on reference audio, while remaining competitive with state-of-the-art video-to-audio methods even without audio conditioning. Enables fine-grained sound synthesis, timbre transfer, zero-shot sound generation, and improved audio quality.
Conclusion: Audio-conditioned approach effectively addresses limitations of text-based V2A methods, providing precise control over generated sounds and enabling fine-grained audio synthesis capabilities that overcome semantic granularity and textual ambiguity issues.
Abstract: Existing video-to-audio (V2A) generation methods predominantly rely on text prompts alongside visual information to synthesize audio. However, two critical bottlenecks persist: semantic granularity gaps in training data, such as conflating acoustically distinct sounds under coarse labels, and textual ambiguity in describing micro-acoustic features. These bottlenecks make it difficult to perform fine-grained sound synthesis using text-controlled modes. To address these limitations, we propose AC-Foley, an audio-conditioned V2A model that directly leverages reference audio to achieve precise and fine-grained control over generated sounds. This approach enables fine-grained sound synthesis, timbre transfer, zero-shot sound generation, and improved audio quality. By directly conditioning on audio signals, our approach bypasses the semantic ambiguities of text descriptions while enabling precise manipulation of acoustic attributes. Empirically, AC-Foley achieves state-of-the-art performance for Foley generation when conditioned on reference audio, while remaining competitive with state-of-the-art video-to-audio methods even without audio conditioning. Code and demo are available at: https://ff2416.github.io/AC-Foley-Page
[352] MOSS-TTS Technical Report
Yitian Gong, Botian Jiang, Yiwei Zhao, Yucheng Yuan, Kuangwei Chen, Yaozhou Jiang, Cheng Chang, Dong Hong, Mingshu Chen, Ruixiao Li, Yiyang Zhang, Yang Gao, Hanfu Chen, Ke Chen, Songlin Wang, Xiaogui Yang, Yuqian Zhang, Kexin Huang, ZhengYuan Lin, Kang Yu, Ziqi Chen, Jin Wang, Zhaoye Fei, Qinyuan Cheng, Shimin Li, Xipeng Qiu
Main category: cs.SD
TL;DR: MOSS-TTS is a scalable speech generation foundation model using discrete audio tokens and autoregressive modeling, featuring two variants for different deployment needs with multilingual support and various control capabilities.
Details
Motivation: To create a scalable speech generation foundation model that can handle multilingual, open-domain settings with various control capabilities like voice cloning, duration control, and stable long-form generation.
Method: Uses MOSS-Audio-Tokenizer (a causal Transformer tokenizer) to compress 24 kHz audio to 12.5 fps with variable-bitrate RVQ and unified semantic-acoustic representations. Two generators: MOSS-TTS (simple, scalable) and MOSS-TTS-Local-Transformer (with a frame-local autoregressive module for efficiency).
Result: Supports zero-shot voice cloning, token-level duration control, phoneme-/pinyin-level pronunciation control, smooth code-switching, and stable long-form generation across multilingual and open-domain settings.
Conclusion: MOSS-TTS presents a scalable recipe for speech generation foundation models with practical deployment variants offering different trade-offs between simplicity, efficiency, and control capabilities.
Abstract: This technical report presents MOSS-TTS, a speech generation foundation model built on a scalable recipe: discrete audio tokens, autoregressive modeling, and large-scale pretraining. Built on MOSS-Audio-Tokenizer, a causal Transformer tokenizer that compresses 24 kHz audio to 12.5 fps with variable-bitrate RVQ and unified semantic-acoustic representations, we release two complementary generators: MOSS-TTS, which emphasizes structural simplicity, scalability, and long-context/control-oriented deployment, and MOSS-TTS-Local-Transformer, which introduces a frame-local autoregressive module for higher modeling efficiency, stronger speaker preservation, and a shorter time to first audio. Across multilingual and open-domain settings, MOSS-TTS supports zero-shot voice cloning, token-level duration control, phoneme-/pinyin-level pronunciation control, smooth code-switching, and stable long-form generation. This report summarizes the design, training recipe, and empirical characteristics of the released models.
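The variable-bitrate RVQ at the heart of MOSS-Audio-Tokenizer follows the standard residual-quantization recipe: each codebook stage quantizes the residual left by the previous stage, so dropping trailing stages degrades quality gracefully while lowering the bitrate. A toy pure-Python sketch with tiny hand-made codebooks, not the released tokenizer.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared L2)."""
    return min(range(len(codebook)),
               key=lambda i: sum((c - x) ** 2 for c, x in zip(codebook[i], vec)))

def rvq_encode(vec, codebooks):
    """Residual vector quantization: stage t quantizes the residual
    left by stages 0..t-1, so later codes refine earlier ones."""
    residual = list(vec)
    codes = []
    for cb in codebooks:
        idx = nearest(cb, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, cb[idx])]
    return codes, residual

codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],   # coarse stage
    [[0.5, 0.0], [0.0, 0.5]],   # refinement stage
]
codes, residual = rvq_encode([1.0, 1.2], codebooks)
print(codes)  # [1, 1]
```

Truncating `codebooks` to its first stage yields a coarser but still decodable code, which is the mechanism behind the variable bitrate.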
cs.LG
[353] Speculating Experts Accelerates Inference for Mixture-of-Experts
Vivan Madan, Prajwal Singhania, Abhinav Bhatele, Tom Goldstein, Ashwinee Panda
Main category: cs.LG
TL;DR: Proposes expert prefetching for MoE models to reduce CPU-GPU transfer bottlenecks by predicting future experts using internal representations, enabling computation-memory overlap.
Details
Motivation: MoE models face performance bottlenecks in memory-constrained inference settings where expert weights must be offloaded to CPU, causing CPU-GPU transfer delays during decoding.
Method: Uses currently computed internal model representations to speculate future experts, enabling memory transfers to overlap with computation. Includes lightweight estimators to improve expert prediction hit rates when needed.
Result: Achieves up to 14% reduction in time per output token (TPOT) over on-demand loading of experts from CPU memory, while generally maintaining downstream task accuracy.
Conclusion: Expert prefetching effectively reduces inference latency for MoE models in memory-constrained settings by enabling better compute-memory overlap through reliable expert prediction.
Abstract: Mixture-of-Experts (MoE) models have gained popularity as a means of scaling the capacity of large language models (LLMs) while maintaining sparse activations and reduced per-token compute. However, in memory-constrained inference settings, expert weights must be offloaded to CPU, creating a performance bottleneck from CPU-GPU transfers during decoding. We propose an expert prefetching scheme that leverages currently computed internal model representations to speculate future experts, enabling memory transfers to overlap with computation. Across multiple MoE architectures, we demonstrate that future experts can be reliably predicted by these internal representations. We also demonstrate that executing speculated experts generally maintains downstream task accuracy, thus preserving more effective compute-memory overlap by eliminating the need to re-fetch true router-selected experts. Integrated into an optimized inference engine, our approach achieves up to 14% reduction in time per output token (TPOT) over on-demand loading of experts from CPU memory. For MoEs where speculative execution alone yields suboptimal accuracy, we further examine lightweight estimators that improve expert prediction hit rates, thereby reducing performance degradation. Our code is released in open-source at https://github.com/axonn-ai/yalis/tree/offload_prefetch.
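A toy simulation of the speculation idea: run the next layer's router on the *current* hidden state and prefetch those experts while this layer's compute is still in flight. The linear router and small hidden-state perturbation are stand-ins for the real model internals (the paper's predictor is learned from internal representations):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_experts, top_k, trials = 32, 8, 2, 200

W_router = rng.normal(size=(d, n_experts))  # next layer's router weights

def top_experts(h):
    return set(np.argsort(h @ W_router)[-top_k:])

hits = 0
for _ in range(trials):
    h_now = rng.normal(size=d)                  # state entering this layer
    h_next = h_now + 0.1 * rng.normal(size=d)   # state after this layer's update

    # Speculate with h_now; start fetching these expert weights from CPU now
    speculated = top_experts(h_now)
    # What the router actually picks once h_next is available
    actual = top_experts(h_next)
    hits += len(speculated & actual)

hit_rate = hits / (trials * top_k)
print(hit_rate)
```

When the per-layer update is small relative to the hidden state, speculation hits far more often than a random prefetch (which would average top_k/n_experts = 25% here), so most transfers overlap usefully with compute.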
[354] A Visualization for Comparative Analysis of Regression Models
Nassime Mountasir, Baptiste Lafabregue, Bruno Albert, Nicolas Lachiche
Main category: cs.LG
TL;DR: A novel visualization method for comparing regression models by plotting residuals in 2D space with Mahalanobis distance and percentile-based colormaps.
Details
Motivation: Traditional regression metrics like MAE, RMSE, and R² aggregate too much information and obscure important patterns in model performance. There's a need for better visualization tools to compare regression models and understand error distributions more comprehensively.
Method: Three main contributions: (1) Plot residuals in 2D space to compare two models simultaneously, (2) Use Mahalanobis distance to account for correlations and scale differences, (3) Apply percentile-based colormaps to visualize error distributions and identify dense regions and outliers.
Result: The visualization approach provides more detailed insights into regression model performance than traditional aggregate metrics, revealing patterns that numerical summaries obscure and enabling better model comparison.
Conclusion: The proposed visualization method offers a more comprehensive way to evaluate and compare regression models by graphically representing error distributions and correlations, enhancing the model evaluation process.
Abstract: As regression is a widely studied problem, many methods have been proposed to solve it, each of them often requiring setting different hyper-parameters. Therefore, selecting the proper method for a given application may be very difficult and relies on comparing their performances. Performance is usually measured using various metrics such as Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), or R-squared (R${}^2$). These metrics provide a numerical summary of predictive accuracy by quantifying the difference between predicted and actual values. However, while these metrics are widely used in the literature for summarizing model performance and useful to distinguish between models performing poorly and well, they often aggregate too much information. This article addresses these limitations by introducing a novel visualization approach that highlights key aspects of regression model performance. The proposed method builds upon three main contributions: (1) considering the residuals in a 2D space, which allows for simultaneous evaluation of errors from two models, (2) leveraging the Mahalanobis distance to account for correlations and differences in scale within the data, and (3) employing a colormap to visualize the percentile-based distribution of errors, making it easier to identify dense regions and outliers. By graphically representing the distribution of errors and their correlations, this approach provides a more detailed and comprehensive view of model performance, enabling users to uncover patterns that traditional aggregate metrics may obscure. The proposed visualization method facilitates a deeper understanding of regression model performance differences and error distributions, enhancing the evaluation and comparison process.
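The three ingredients of the visualization (2D residual space, Mahalanobis distance, percentile-based coloring) can be sketched in a few lines of numpy. The residuals here are simulated and the actual plotting call is omitted:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated residuals of two models on the same 500 test points
res_a = rng.normal(scale=0.5, size=500)   # model A: smaller errors
res_b = rng.normal(scale=1.0, size=500)   # model B: larger errors
R = np.column_stack([res_a, res_b])       # one 2D point per test sample

# Mahalanobis distance accounts for correlation and scale between the models
mu = R.mean(0)
cov = np.cov(R, rowvar=False)
inv = np.linalg.inv(cov)
diff = R - mu
d2 = np.einsum('ij,jk,ik->i', diff, inv, diff)   # squared distances

# Percentile rank of each distance -> colormap value in [0, 1]
pct = d2.argsort().argsort() / (len(d2) - 1)

print(d2.shape, round(pct.max(), 2))
```

Each point would then be drawn at `(res_a, res_b)` colored by `pct`: dense low-percentile regions show typical behavior, while high-percentile points flag outliers that a single RMSE number would hide.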
[355] Maximizing mutual information between user-contexts and responses improve LLM personalization with no additional data
Hyunji Nam, Haoran Li, Natasha Jaques
Main category: cs.LG
TL;DR: MIPO is a self-supervised preference optimization method that constructs contrastive pairs by generating positive responses conditioned on correct prompts and negative responses on random prompts, enabling LLM self-improvement without external supervision.
Details
Motivation: Current LLM improvement relies heavily on expensive human-labeled data or external verifiers, which limits scalability. True intelligence requires self-improvement frameworks that work without external oversight, especially for tasks that aren't easily verifiable.Method: Mutual Information Preference Optimization (MIPO) constructs preference pairs by: 1) generating positive responses conditioned on correct prompts, 2) generating negative responses conditioned on random, unrelated prompts. Uses Direct Preference Optimization (DPO) to learn from these pairs, maximizing pointwise conditional mutual information between prompts and responses.
Result: MIPO achieves 3-40% improvements on personalization tasks using real-user datasets compared to strong baselines. Surprisingly, also improves performance on math and multiple-choice problems by 1-18% without any additional data or human supervision.
Conclusion: MIPO provides an effective self-supervised framework for LLM improvement that works without external oversight, demonstrating promising results for personalization and reasoning tasks, suggesting a viable direction for autonomous model self-improvement.
Abstract: While post-training has successfully improved large language models (LLMs) across a variety of domains, these gains heavily rely on human-labeled data or external verifiers. Existing data has already been exploited, and new high-quality data is expensive to collect. More fundamentally, true intelligence goes far beyond tasks that are easily verifiable. Therefore, we need self-improvement frameworks that allow models to improve without external oversight. We propose Mutual Information Preference Optimization (MIPO), a contrastive data augmentation method that constructs preference pairs by generating a positive response conditioning on the correct prompt, and a negative response by conditioning on a random, unrelated prompt. We show that using Direct Preference Optimization (DPO) to learn from this paired data maximizes pointwise conditional mutual information (MI) (under the base LLM) between prompts and model responses. Empirical results with various-sized Llama- and Qwen-Instruct models show that when used to maximize MI between user context and response, MIPO provides an effective personalization technique, achieving 3-40% improvements on personalization tasks using real-user datasets compared to strong baselines. Surprisingly, MIPO can also be applied to improve performance on math and multiple-choice problems, yielding 1-18% gains without any additional data or human supervision. These results suggest a promising direction for self-improvement.
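A minimal sketch of the MIPO recipe: build (positive, negative) pairs by pairing each prompt with a response generated for a shuffled prompt, then score them with the standard DPO loss. The log-probabilities below are hypothetical placeholders, not model outputs:

```python
import numpy as np

def dpo_loss(pi_pos, pi_neg, ref_pos, ref_neg, beta=0.1):
    """-log sigmoid(beta * [(pi_pos - ref_pos) - (pi_neg - ref_neg)])."""
    margin = beta * ((pi_pos - ref_pos) - (pi_neg - ref_neg))
    return float(np.log1p(np.exp(-margin)))

prompts = ["user A likes brief answers", "user B likes formal prose"]
# MIPO-style pairs: positive = response to the matched prompt,
# negative = a response generated for a random *other* prompt
pairs = [(p, f"resp<{p}>", f"resp<{prompts[(i + 1) % len(prompts)]}>")
         for i, p in enumerate(prompts)]

# Hypothetical log-probs where the policy already prefers the matched response
loss = dpo_loss(pi_pos=-10.0, pi_neg=-20.0, ref_pos=-15.0, ref_neg=-15.0)
print(round(loss, 4))
```

No labels or verifiers appear anywhere: the supervision signal comes entirely from the mismatch between a response and a prompt it was not generated for, which is what lets DPO on these pairs maximize prompt-response mutual information.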
[356] GT-Space: Enhancing Heterogeneous Collaborative Perception with Ground Truth Feature Space
Wentao Wang, Haoran Xu, Guang Tan
Main category: cs.LG
TL;DR: GT-Space: A flexible framework for heterogeneous multi-agent collaborative perception that creates a common feature space from ground-truth labels for scalable feature alignment across different sensing modalities.
Details
Motivation: Multi-agent collaborative perception in autonomous driving faces challenges with heterogeneous features from agents with different sensing modalities or model architectures, making data fusion difficult. Existing approaches require retraining encoders or pairwise feature alignment, which doesn't scale well.
Method: Proposes GT-Space framework that constructs a common feature space from ground-truth labels, providing a unified reference for feature alignment. Agents only need a single adapter module to project features into this shared space. Includes fusion network trained with contrastive losses across diverse modality combinations.
Result: Extensive experiments on simulation datasets (OPV2V and V2XSet) and real-world dataset (RCooper) show GT-Space consistently outperforms baselines in detection accuracy while delivering robust performance.
Conclusion: GT-Space provides a scalable solution for heterogeneous collaborative perception that eliminates need for pairwise interactions and achieves superior performance across diverse sensing modalities.
Abstract: In autonomous driving, multi-agent collaborative perception enhances sensing capabilities by enabling agents to share perceptual data. A key challenge lies in handling {\em heterogeneous} features from agents equipped with different sensing modalities or model architectures, which complicates data fusion. Existing approaches often require retraining encoders or designing interpreter modules for pairwise feature alignment, but these solutions are not scalable in practice. To address this, we propose {\em GT-Space}, a flexible and scalable collaborative perception framework for heterogeneous agents. GT-Space constructs a common feature space from ground-truth labels, providing a unified reference for feature alignment. With this shared space, agents only need a single adapter module to project their features, eliminating the need for pairwise interactions with other agents. Furthermore, we design a fusion network trained with contrastive losses across diverse modality combinations. Extensive experiments on simulation datasets (OPV2V and V2XSet) and a real-world dataset (RCooper) demonstrate that GT-Space consistently outperforms baselines in detection accuracy while delivering robust performance. Our code will be released at https://github.com/KingScar/GT-Space.
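The adapter-plus-contrastive-alignment idea can be sketched as follows. The InfoNCE-style objective, random features, and random anchors are illustrative stand-ins for the paper's actual losses and ground-truth-derived feature space:

```python
import numpy as np

rng = np.random.default_rng(3)

def info_nce(z, anchors, tau=0.05):
    """Pull each projected agent feature toward its object's shared anchor."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    logits = z @ a.T / tau                 # similarity to every anchor
    # diagonal entries are the matching (feature, anchor) pairs
    log_probs = logits - np.log(np.exp(logits).sum(1, keepdims=True))
    return -np.mean(np.diag(log_probs))

anchors = rng.normal(size=(5, 16))         # common-space anchors (one per object)

# Each agent needs only one linear adapter into the shared space, regardless
# of how many other agents (or modalities) exist -- no pairwise interpreters
lidar_feats = rng.normal(size=(5, 32))     # one agent's native features
adapter = 0.1 * rng.normal(size=(32, 16))

loss_untrained = info_nce(lidar_feats @ adapter, anchors)
loss_aligned = info_nce(anchors.copy(), anchors)   # perfect projection
print(round(loss_untrained, 4), round(loss_aligned, 4))
```

Training would push `adapter` so that `loss_untrained` approaches `loss_aligned`; crucially, adding a new agent adds one adapter, not one interpreter per existing agent.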
[357] BrainSCL: Subtype-Guided Contrastive Learning for Brain Disorder Diagnosis
Xiaolong Li, Guiliang Guo, Guangqi Wen, Peng Cao, Jinzhu Yang, Honglin Wu, Xiaoli Liu, Fei Wang, Osmar R. Zaiane
Main category: cs.LG
TL;DR: Subtype-guided contrastive learning framework for mental disorder analysis using multi-view representations from clinical text and fMRI data to model patient heterogeneity as latent subtypes.
Details
Motivation: Mental disorder populations exhibit significant heterogeneity, making it challenging to define positive pairs in contrastive learning. Traditional approaches struggle with patient variability, requiring a method that can model this heterogeneity as latent subtypes to guide representation learning.
Method: 1) Construct multi-view representations combining clinical text with graph structure learned from BOLD signals; 2) Use unsupervised spectral clustering to uncover latent subtypes; 3) Propose dual-level attention mechanism to create prototypes capturing stable subtype-specific connectivity patterns; 4) Implement subtype-guided contrastive learning that pulls samples toward their subtype prototype graph to reinforce intra-subtype consistency.
Result: The method was evaluated on Major Depressive Disorder (MDD), Bipolar Disorder (BD), and Autism Spectrum Disorders (ASD). Experimental results confirm the effectiveness of subtype prototype graphs in guiding contrastive learning and demonstrate that the proposed approach outperforms state-of-the-art approaches.
Conclusion: The subtype-guided contrastive learning framework successfully addresses patient heterogeneity in mental disorders by modeling latent subtypes as structural priors, improving discriminative representation learning and outperforming existing methods across multiple disorders.
Abstract: Mental disorder populations exhibit pronounced heterogeneity – that is, significant differences between samples – which poses a significant challenge to the definition of positive pairs in contrastive learning. To address this, we propose a subtype-guided contrastive learning framework that models patient heterogeneity as latent subtypes and incorporates them as structural priors to guide discriminative representation learning. Specifically, we construct multi-view representations by combining patients’ clinical text with graph structure adaptively learned from BOLD signals, to uncover latent subtypes via unsupervised spectral clustering. A dual-level attention mechanism is proposed to construct prototypes for capturing stable subtype-specific connectivity patterns. We further propose a subtype-guided contrastive learning strategy that pulls samples toward their subtype prototype graph, reinforcing intra-subtype consistency for providing effective supervisory signals to improve model performance. We evaluate our method on Major Depressive Disorder (MDD), Bipolar Disorder (BD), and Autism Spectrum Disorders (ASD). Experimental results confirm the effectiveness of subtype prototype graphs in guiding contrastive learning and demonstrate that the proposed approach outperforms state-of-the-art approaches. Our code is available at https://anonymous.4open.science/r/BrainSCL-06D7.
[358] TTQ: Activation-Aware Test-Time Quantization to Accelerate LLM Inference On The Fly
Toshiaki Koike-Akino, Jing Liu, Ye Wang
Main category: cs.LG
TL;DR: Test-time quantization framework for large models that performs activation-aware compression at inference time without retraining, adapting to each prompt for improved performance across downstream tasks.
Details
Motivation: Existing activation-aware compression techniques for large foundation models rely heavily on calibration data, which can cause domain shift issues when applied to unseen downstream tasks. There's a need for quantization methods that can adapt to different tasks without retraining.
Method: Proposes a test-time quantization (TTQ) framework that compresses large models on the fly during inference. Uses efficient online calibration to perform instant activation-aware quantization that adapts to every prompt, regardless of downstream tasks, while maintaining inference speedup.
Result: Experiments demonstrate that TTQ improves quantization performance over state-of-the-art baselines, showing better adaptation to different tasks without the domain shift issues of previous methods.
Conclusion: TTQ provides an effective solution for compressing large foundation models by enabling prompt-specific quantization at inference time, overcoming domain shift limitations of calibration-dependent methods.
Abstract: To tackle the huge computational demand of large foundation models, activation-aware compression techniques without retraining have been introduced. However, since these methods highly rely on calibration data, domain shift issues may arise for unseen downstream tasks. We propose a test-time quantization (TTQ) framework which compresses large models on the fly at inference time to resolve this issue. With efficient online calibration, instant activation-aware quantization can adapt to every prompt regardless of the downstream task, while still achieving inference speedup. Several experiments demonstrate that TTQ can improve the quantization performance over state-of-the-art baselines.
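The flavor of prompt-adaptive, activation-aware quantization can be sketched with a crude stand-in: use the current prompt's activation statistics to decide which weight channels to protect. Keeping salient channels in full precision is a simplification for illustration, not TTQ's actual scheme:

```python
import numpy as np

rng = np.random.default_rng(4)

def quantize_sym(w, bits=4):
    """Symmetric round-to-nearest quantization, one scale per output column."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(0) / qmax
    return np.round(w / scale).clip(-qmax, qmax) * scale

W = rng.normal(size=(64, 64))
act_scale = np.ones(64)
act_scale[:8] = 50.0                        # 8 dominant input channels
x = rng.normal(size=(16, 64)) * act_scale   # the *current prompt's* activations

# Prompt-agnostic quantization ignores which channels this prompt excites
err_plain = np.linalg.norm(x @ W - x @ quantize_sym(W))

# Online calibration: find the channels the prompt makes salient and protect
# their weights (here: leave them in full precision)
salient = np.argsort(np.abs(x).mean(0))[-8:]
W_q = quantize_sym(W)
W_q[salient] = W[salient]
err_aware = np.linalg.norm(x @ W - x @ W_q)
print(err_aware < err_plain)
```

Because the salient set is recomputed from each incoming prompt, there is no offline calibration set to drift away from, which is the domain-shift argument the abstract makes.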
[359] GoAgent: Group-of-Agents Communication Topology Generation for LLM-based Multi-Agent Systems
Hongjiang Chen, Xin Zheng, Yixin Liu, Pengfei Jiao, Shiyuan Li, Huan Liu, Zhidong Zhao, Ziqi Xu, Ibrahim Khalil, Shirui Pan
Main category: cs.LG
TL;DR: GoAgent is a method for generating communication topologies in LLM-based multi-agent systems by explicitly modeling collaborative groups as atomic units, using conditional information bottleneck to reduce communication overhead.
Details
Motivation: Existing LLM-based multi-agent systems generate communication topologies in a node-centric manner, leading to implicit group structures that often result in suboptimal coordination and unnecessary communication overhead. The paper aims to address this by explicitly modeling collaborative groups.
Method: GoAgent first enumerates task-relevant candidate groups through an LLM, then autoregressively selects and connects these groups as atomic units to construct the final communication graph. It introduces a conditional information bottleneck objective to compress inter-group communication and filter redundant noise.
Result: Extensive experiments on six benchmarks demonstrate state-of-the-art performance with 93.84% average accuracy while reducing token consumption by about 17%.
Conclusion: Explicitly modeling groups as atomic units in MAS communication topology generation leads to better coordination and reduced communication overhead, with GoAgent showing superior performance and efficiency.
Abstract: Large language model (LLM)-based multi-agent systems (MAS) have demonstrated exceptional capabilities in solving complex tasks, yet their effectiveness depends heavily on the underlying communication topology that coordinates agent interactions. Within these systems, successful problem-solving often necessitates task-specific group structures to divide and conquer subtasks. However, most existing approaches generate communication topologies in a node-centric manner, leaving group structures to emerge implicitly from local connectivity decisions rather than modeling them explicitly, often leading to suboptimal coordination and unnecessary communication overhead. To address this limitation, we propose GoAgent (Group-of-Agents), a communication topology generation method that explicitly treats collaborative groups as the atomic units of MAS construction. Specifically, GoAgent first enumerates task-relevant candidate groups through an LLM and then autoregressively selects and connects these groups as atomic units to construct the final communication graph, jointly capturing intra-group cohesion and inter-group coordination. To mitigate communication redundancy and noise propagation inherent in expanding topologies, we further introduce a conditional information bottleneck (CIB) objective that compresses inter-group communication, preserving task-relevant signals while filtering out redundant historical noise. Extensive experiments on six benchmarks demonstrate the state-of-the-art performance of GoAgent with 93.84% average accuracy while reducing token consumption by about 17%.
[360] CLaRE-ty Amid Chaos: Quantifying Representational Entanglement to Predict Ripple Effects in LLM Editing
Manit Baser, Alperen Yildiz, Dinil Mon Divakaran, Mohan Gurusamy
Main category: cs.LG
TL;DR: CLaRE is a lightweight representation-level technique that identifies ripple effects in model editing by quantifying fact entanglement using forward activations from a single intermediate layer.
Details
Motivation: LLMs' static knowledge becomes outdated, and model-editing techniques often cause unpredictable ripple effects (unintended behavioral changes). Current gradient-based methods are computationally expensive, so a lightweight approach is needed to identify where these ripple effects may occur.
Method: CLaRE quantifies entanglement between facts using forward activations from a single intermediate layer, avoiding costly backward passes. It computes large-scale entanglement graphs for a corpus of 11,427 facts from three existing datasets, capturing how local edits propagate through representational space.
Result: CLaRE achieves 62.2% average improvement in Spearman correlation with ripple effects while being 2.74× faster and using 2.85× less peak GPU memory than baselines. It also requires only a fraction of the storage needed by baselines to compute and preserve fact representations.
Conclusion: CLaRE provides an efficient method for identifying ripple effects in model editing, enabling stronger preservation sets, audit trails, efficient red-teaming, and scalable post-edit evaluation through large-scale entanglement graphs.
Abstract: The static knowledge representations of large language models (LLMs) inevitably become outdated or incorrect over time. While model-editing techniques offer a promising solution by modifying a model’s factual associations, they often produce unpredictable ripple effects, which are unintended behavioral changes that propagate even to the hidden space. In this work, we introduce CLaRE, a lightweight representation-level technique to identify where these ripple effects may occur. Unlike prior gradient-based methods, CLaRE quantifies entanglement between facts using forward activations from a single intermediate layer, avoiding costly backward passes. To enable systematic study, we prepare and analyse a corpus of 11,427 facts drawn from three existing datasets. Using CLaRE, we compute large-scale entanglement graphs of this corpus for multiple models, capturing how local edits propagate through representational space. These graphs enable stronger preservation sets for model editing, audit trails, efficient red-teaming, and scalable post-edit evaluation. In comparison to baselines, CLaRE achieves an average of 62.2% improvement in Spearman correlation with ripple effects while being $2.74\times$ faster, and using $2.85\times$ less peak GPU memory. Besides, CLaRE requires only a fraction of the storage needed by the baselines to compute and preserve fact representations. Our entanglement graphs and corpus are available at https://anonymous.4open.science/r/CLaRE-488E.
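A minimal sketch of forward-activation entanglement: represent each fact by an intermediate-layer activation vector, take cosine similarities, and threshold to get the edges of an entanglement graph. The representations below are random stand-ins for real layer activations, and the threshold is illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)

# Stand-in for layer-L activations of each fact (one forward pass per fact,
# no backward passes needed)
reps = rng.normal(size=(6, 64))
reps[1] = reps[0] + 0.05 * rng.normal(size=64)  # fact 1 entangled with fact 0

z = reps / np.linalg.norm(reps, axis=1, keepdims=True)
entanglement = z @ z.T                          # cosine similarity matrix

# Graph edges: fact pairs likely to ripple when either one is edited
thresh = 0.5
edges = [(i, j) for i in range(6) for j in range(i + 1, 6)
         if entanglement[i, j] > thresh]
print(edges)
```

High-similarity neighbors of a fact scheduled for editing would then go into its preservation set and into post-edit evaluation, which is how the entanglement graph supports the audit and red-teaming uses the abstract lists.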
[361] A Dynamic Bayesian and Machine Learning Framework for Quantitative Evaluation and Prediction of Operator Situation Awareness in Nuclear Power Plants
Shuai Chen, Huiqiao Jia, Tao Qing, Li Zhang, Xingyu Xiao
Main category: cs.LG
TL;DR: DBML SA framework combines Bayesian reasoning and neural networks to dynamically model operator situation awareness in nuclear control rooms using operational event data.
Details
Motivation: Current situation awareness assessment methods (SAGAT, SART) are static, retrospective, and disconnected from evolving cognitive dynamics in complex nuclear control environments, limiting effective human reliability management.
Method: Dynamic Bayesian machine learning framework that fuses probabilistic reasoning with data-driven intelligence, analyzing 212 operational event reports (2007-2021) to reconstruct causal temporal structure of 11 performance shaping factors across cognitive layers.
Result: Achieved mean absolute percentage error of 13.8% in predicting SART scores, with statistical consistency to subjective evaluations (p > 0.05), identifying training quality and stress dynamics as primary drivers of situation awareness degradation.
Conclusion: DBML SA enables real-time cognitive monitoring, sensitivity analysis, and early-warning prediction, advancing intelligent human-machine reliability management in next-generation digital control rooms beyond traditional questionnaire-based assessments.
Abstract: Operator situation awareness is a pivotal yet elusive determinant of human reliability in complex nuclear control environments. Existing assessment methods, such as SAGAT and SART, remain static, retrospective, and detached from the evolving cognitive dynamics that drive operational risk. To overcome these limitations, this study introduces the dynamic Bayesian machine learning framework for situation awareness (DBML SA), a unified approach that fuses probabilistic reasoning and data-driven intelligence to achieve quantitative, interpretable, and predictive situation awareness modeling. Leveraging 212 operational event reports (2007 to 2021), the framework reconstructs the causal temporal structure of 11 performance shaping factors across multiple cognitive layers. The Bayesian component enables time-evolving inference of situation awareness reliability under uncertainty, while the neural component establishes a nonlinear predictive mapping from PSFs to SART scores, achieving a mean absolute percentage error of 13.8% with statistical consistency to subjective evaluations (p > 0.05). Results highlight training quality and stress dynamics as primary drivers of situation awareness degradation. Overall, DBML SA transcends traditional questionnaire-based assessments by enabling real-time cognitive monitoring, sensitivity analysis, and early-warning prediction, paving the way toward intelligent human-machine reliability management in next-generation digital main control rooms.
[362] PRIME-CVD: A Parametrically Rendered Informatics Medical Environment for Education in Cardiovascular Risk Modelling
Nicholas I-Hsien Kuo, Marzia Hoque Tania, Blanca Gallego, Louisa Jorm
Main category: cs.LG
TL;DR: PRIME-CVD is a synthetic medical dataset for cardiovascular disease education, generated from population statistics rather than real patient data to avoid privacy issues.
Details
Motivation: Patient-level electronic medical record data are rarely available for teaching or methodological development due to privacy, governance, and re-identification risks, limiting reproducibility and hands-on training in cardiovascular risk modelling.
Method: Created two synthetic data assets representing 50,000 adults undergoing primary prevention for CVD, generated from a user-specified causal directed acyclic graph parameterized using publicly available Australian population statistics and published epidemiologic effect estimates.
Result: Two openly accessible synthetic datasets: Data Asset 1 provides a clean, analysis-ready cohort for exploratory analysis and survival modelling; Data Asset 2 restructures into a relational, EMR-style database with realistic structural and lexical heterogeneity.
Conclusion: PRIME-CVD enables instruction in data cleaning, harmonisation, causal reasoning, and policy-relevant risk modelling without exposing sensitive information, supporting reproducible research and scalable medical education.
Abstract: In recent years, progress in medical informatics and machine learning has been accelerated by the availability of openly accessible benchmark datasets. However, patient-level electronic medical record (EMR) data are rarely available for teaching or methodological development due to privacy, governance, and re-identification risks. This has limited reproducibility, transparency, and hands-on training in cardiovascular risk modelling. Here we introduce PRIME-CVD, a parametrically rendered informatics medical environment designed explicitly for medical education. PRIME-CVD comprises two openly accessible synthetic data assets representing a cohort of 50,000 adults undergoing primary prevention for cardiovascular disease. The datasets are generated entirely from a user-specified causal directed acyclic graph parameterised using publicly available Australian population statistics and published epidemiologic effect estimates, rather than from patient-level EMR data or trained generative models. Data Asset 1 provides a clean, analysis-ready cohort suitable for exploratory analysis, stratification, and survival modelling, while Data Asset 2 restructures the same cohort into a relational, EMR-style database with realistic structural and lexical heterogeneity. Together, these assets enable instruction in data cleaning, harmonisation, causal reasoning, and policy-relevant risk modelling without exposing sensitive information. Because all individuals and events are generated de novo, PRIME-CVD preserves realistic subgroup imbalance and risk gradients while ensuring negligible disclosure risk. PRIME-CVD is released under a Creative Commons Attribution 4.0 licence to support reproducible research and scalable medical education.
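The DAG-parameterized generation recipe can be sketched with a toy three-variable graph (age → SBP → event, sex → event). All coefficients below are invented for illustration, not the published Australian population estimates PRIME-CVD uses:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 50_000

# Sample root nodes, then each child from its parents per the DAG
age = rng.uniform(40, 75, n)
sex = rng.integers(0, 2, n)                        # 1 = male
sbp = 110 + 0.5 * (age - 40) + rng.normal(0, 10, n)

# CVD event via a logistic model on the node's parents
logit = -7.0 + 0.06 * (age - 40) + 0.02 * (sbp - 120) + 0.5 * sex
p_event = 1 / (1 + np.exp(-logit))
event = rng.random(n) < p_event

print(round(float(event.mean()), 4))
```

Every individual is generated de novo from the graph, so the cohort carries realistic risk gradients (e.g. event rates rising with age) while containing no real patient and hence negligible disclosure risk.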
[363] Parameter-Efficient Token Embedding Editing for Clinical Class-Level Unlearning
Iyad Ait Hou, Shrenik Borad, Harsh Sharma, Pooja Srinivasan, Rebecca Hwa, Aya Zirikly
Main category: cs.LG
TL;DR: STEU is a parameter-efficient machine unlearning method for clinical language models that updates only selected token embeddings and a small classifier head to forget sensitive information while preserving model utility.
Details
Motivation: Clinical language models need to comply with privacy regulations and institutional policies that may require removing sensitive information from deployed systems without expensive retraining from scratch. There's a need for methods that balance effective forgetting of targeted information with preservation of model utility and minimal parameter modification.
Method: STEU (Sparse Token Embedding Unlearning) is a parameter-efficient method for behavioral class-level unlearning. It updates only PMI-selected token embeddings together with a small classifier head while keeping all encoder layers frozen. This approach modifies only a sparse subset of parameters to achieve targeted forgetting.
Result: Experiments on MIMIC-IV, MIMIC-III, and eICU datasets using BioClinicalBERT, BERT-base, and DistilBERT show STEU consistently suppresses target classes while largely preserving retained task performance. In the primary MIMIC-IV setting, STEU achieves near-complete forgetting (forget F1 = 0.0004) while maintaining competitive retained utility (retain avg F1 = 0.4766) after modifying only 0.19% of model parameters.
Conclusion: Targeted behavioral unlearning can be achieved through sparse embedding edits without modifying deeper encoder representations, providing an efficient solution for privacy compliance in clinical language models.
Abstract: Machine unlearning is increasingly important for clinical language models, where privacy regulations and institutional policies may require removing sensitive information from deployed systems without retraining from scratch. In practice, deletion requests must balance effective forgetting of targeted information with preservation of model utility and minimal parameter modification. We introduce Sparse Token Embedding Unlearning (STEU), a parameter-efficient method for behavioral class-level unlearning that updates only PMI-selected token embeddings together with a small classifier head while keeping all encoder layers frozen. Across experiments on MIMIC-IV, MIMIC-III, and eICU using BioClinicalBERT, BERT-base, and DistilBERT, STEU consistently suppresses the target class while largely preserving retained task performance. In the primary MIMIC-IV setting, STEU achieves near-complete forgetting (forget F1 = 0.0004) while maintaining competitive retained utility (retain avg F1 = 0.4766) after modifying only 0.19% of model parameters. These results suggest that targeted behavioral unlearning can be achieved through sparse embedding edits without modifying deeper encoder representations.
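The PMI-based token selection step can be sketched directly; the vocabulary and counts below are invented for illustration, and the selected embedding rows are the only parameters (besides the classifier head) that would be unfrozen:

```python
import numpy as np

def pmi_tokens(counts_in_class, counts_total, n_class, n_total):
    """PMI(token, class) = log [ p(token | class) / p(token) ]."""
    p_t_given_c = counts_in_class / n_class
    p_t = counts_total / n_total
    return np.log(p_t_given_c / p_t)

vocab = ["sepsis", "lactate", "the", "patient", "renal"]
in_class = np.array([40, 30, 500, 100, 2])     # token counts in the forget class
total = np.array([50, 45, 10000, 2000, 200])   # token counts in the whole corpus
pmi = pmi_tokens(in_class, total, n_class=1000, n_total=20000)

k = 2
selected = [vocab[i] for i in np.argsort(pmi)[-k:]]
print(selected)
```

Class-specific terms get high PMI while common words like "the" score near zero, so the edit touches only the sparse set of embeddings that actually carry the forget-class signal.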
[364] MeanFlow Meets Control: Scaling Sampled-Data Control for Swarms
Anqi Dong, Yongxin Chen, Karl H. Johansson, Johan Karlsson
Main category: cs.LG
TL;DR: A control-space learning framework for few-step swarm steering in sampled-data control systems, built on learned finite-horizon minimum-energy control coefficients.
Details
Motivation: Real swarm systems operate in sampled-data form with intermittent control updates, requiring finite-window control quantities rather than instantaneous velocity fields for effective steering with few control updates.
Method: Introduces a control-space learning framework that learns coefficients parameterizing finite-horizon minimum-energy control over each sampling interval, using an integral representation and a local differential identity along bridge trajectories with stop-gradient training.
Result: The framework provides a scalable approach to few-step swarm steering that respects the sampled-data structure of real control systems by construction.
Conclusion: The proposed method offers a practical solution for steering large-scale swarms with few control updates while maintaining consistency with real-world sampled-data control systems.
Abstract: Steering large-scale swarms in only a few control updates is challenging because real systems operate in sampled-data form: control inputs are updated intermittently and applied over finite intervals. In this regime, the natural object is not an instantaneous velocity field, but a finite-window control quantity that captures the system response over each sampling interval. Inspired by MeanFlow, we introduce a control-space learning framework for swarm steering under linear time-invariant dynamics. The learned object is the coefficient that parameterizes the finite-horizon minimum-energy control over each interval. We show that this coefficient admits both an integral representation and a local differential identity along bridge trajectories, which leads to a simple stop-gradient training objective. At implementation time, the learned coefficient is used directly in sampled-data updates, so the prescribed dynamics and actuation map are respected by construction. The resulting framework provides a scalable approach to few-step swarm steering that is consistent with the sampled-data structure of real control systems.
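The finite-window object the coefficient parameterizes is the classical minimum-energy control. A discrete-time analogue is easy to sketch for x_{k+1} = A x_k + B u_k over one N-step window; the system matrices below are arbitrary illustrative choices, not from the paper.

```python
import numpy as np

# Finite-horizon minimum-energy control for a discrete-time LTI system:
# the least-norm input sequence driving x_0 to x_target in N steps.

A = np.array([[1.0, 0.1],
              [0.0, 1.0]])    # double-integrator-like dynamics
B = np.array([[0.0],
              [0.1]])
N = 10

# x_N = A^N x_0 + [A^{N-1}B ... B] u, so stack the input-to-state map.
C = np.hstack([np.linalg.matrix_power(A, N - 1 - k) @ B for k in range(N)])

x0 = np.array([0.0, 0.0])
x_target = np.array([1.0, 0.0])

# Least-norm (minimum-energy) solution of C u = x_target - A^N x_0.
rhs = x_target - np.linalg.matrix_power(A, N) @ x0
u = C.T @ np.linalg.solve(C @ C.T, rhs)

# Roll out the sampled-data updates and confirm the target is reached.
x = x0
for k in range(N):
    x = A @ x + (B @ u[k:k + 1]).ravel()
print(np.allclose(x, x_target))  # True
```

The paper's learned coefficient plays the role of the closed-form Gramian-based term here, so the prescribed dynamics and actuation map are respected by construction.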
[365] Exploring Subnetwork Interactions in Heterogeneous Brain Network via Prior-Informed Graph Learning
Siyu Liu, Guangqi Wen, Peng Cao, Jinzhu Yang, Xiaoli Liu, Fei Wang, Osmar R. Zaiane
Main category: cs.LG
TL;DR: KD-Brain is a prior-informed graph learning framework that uses semantic priors and pathology constraints to model functional subnetwork interactions for mental disorder diagnosis, achieving state-of-the-art performance.
Details
Motivation: Existing Transformer-based methods struggle to learn complex interactions among functional brain subnetworks due to limited training samples, making it challenging to identify functional pathways for mental disorder diagnosis.
Method: Proposes KD-Brain with two key components: 1) a Semantic-Conditioned Interaction mechanism that injects semantic priors into attention queries to guide subnetwork interactions based on functional identities, and 2) a Pathology-Consistent Constraint that regularizes model optimization by aligning learned interaction distributions with clinical priors.
Result: Achieves state-of-the-art performance on a wide range of disorder diagnosis tasks and identifies interpretable biomarkers consistent with psychiatric pathophysiology.
Conclusion: KD-Brain effectively addresses the challenge of learning functional subnetwork interactions with limited data by incorporating prior knowledge, leading to improved diagnostic performance and clinically interpretable results.
Abstract: Modeling the complex interactions among functional subnetworks is crucial for the diagnosis of mental disorders and the identification of functional pathways. However, learning the interactions of the underlying subnetworks remains a significant challenge for existing Transformer-based methods due to the limited number of training samples. To address these challenges, we propose KD-Brain, a Prior-Informed Graph Learning framework for explicitly encoding prior knowledge to guide the learning process. Specifically, we design a Semantic-Conditioned Interaction mechanism that injects semantic priors into the attention query, explicitly navigating the subnetwork interactions based on their functional identities. Furthermore, we introduce a Pathology-Consistent Constraint, which regularizes the model optimization by aligning the learned interaction distributions with clinical priors. Additionally, KD-Brain leads to state-of-the-art performance on a wide range of disorder diagnosis tasks and identifies interpretable biomarkers consistent with psychiatric pathophysiology. Our code is available at https://anonymous.4open.science/r/KDBrain.
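The semantic-conditioned query can be sketched in a few lines: each subnetwork's query is its features plus a semantic prior embedding of its functional identity, which biases the attention-derived interaction matrix. Shapes and the additive injection below are illustrative assumptions.

```python
import numpy as np

# Semantic-conditioned attention sketch: priors are added into the query
# so interactions are guided by functional identity, not features alone.

rng = np.random.default_rng(1)
n_sub, dim = 7, 8                        # e.g. 7 functional subnetworks

feats = rng.normal(size=(n_sub, dim))    # learned subnetwork features
prior = rng.normal(size=(n_sub, dim))    # semantic prior per subnetwork

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

Q = feats + prior                        # inject semantic priors into the query
K, V = feats, feats
attn = softmax(Q @ K.T / np.sqrt(dim))   # (n_sub, n_sub) interaction matrix
out = attn @ V

print(attn.shape)  # (7, 7); each row sums to 1
```

The Pathology-Consistent Constraint would then regularize `attn` toward a clinically motivated target distribution, e.g. via a divergence term in the loss.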
[366] MemReward: Graph-Based Experience Memory for LLM Reward Prediction with Limited Labels
Tianyang Luo, Tao Feng, Zhigang Hua, Yan Xie, Shuang Yang, Ge Liu, Jiaxuan You
Main category: cs.LG
TL;DR: MemReward: Graph-based experience memory framework for RL fine-tuning of LLMs that propagates rewards from limited labeled data to unlabeled rollouts using heterogeneous graphs and GNNs.
Details
Motivation: Reinforcement learning fine-tuning of LLMs requires reward labels, but obtaining them at scale is expensive (human labeling) or time-consuming (expert verification). Limited reward labels constrain RL effectiveness.
Method: An initial LLM generates rollouts (thinking process + answer) stored as experience memory. Queries, thinking processes, and answers form nodes in a heterogeneous graph with similarity/structural edges. A GNN trained on labeled nodes propagates rewards to unlabeled rollouts during online optimization.
Result: With only 20% labels, achieves 97.3% of Oracle performance on Qwen2.5-3B and 96.6% on 1.5B across math, QA, and code generation. Surpasses Oracle on out-of-domain tasks. Reaches 99.4% of Oracle at 70% labels.
Conclusion: MemReward effectively addresses reward label scarcity for RL fine-tuning of LLMs, enabling near-Oracle performance with limited labels and demonstrating strong generalization to out-of-domain tasks.
Abstract: Training large language models (LLMs) for complex reasoning via reinforcement learning requires reward labels that specify whether the generated rollouts are correct. However, obtaining reward labels at scale often requires expensive human labeling or time-consuming verification procedures; for instance, evaluating mathematical proofs demands expert review, while open-ended question answering lacks definitive ground truth. When reward labels are limited, the effectiveness of reinforcement learning fine-tuning is constrained by the scarcity of reward labels. We introduce MemReward, a graph-based experience memory framework: an initial LLM policy generates rollouts for each query, each comprising a thinking process and a final answer, and these rollouts are stored as experience memory. Queries, thinking processes, and answers form nodes in a heterogeneous graph with similarity and structural edges; a GNN trained on labeled nodes propagates rewards to unlabeled rollouts during online optimization. Experiments on Qwen2.5-3B and 1.5B across mathematics, question answering, and code generation demonstrate that MemReward, with only 20% labels, achieves 97.3% of Oracle performance on 3B and 96.6% on 1.5B, surpassing Oracle on out-of-domain tasks. Performance scales smoothly with label budget, reaching 99.4% of Oracle at 70% labels.
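The core propagation idea can be illustrated without the GNN: labeled rollouts seed rewards, and graph edges spread them to unlabeled neighbors. Plain label propagation on a made-up 5-node graph stands in for the learned GNN below.

```python
import numpy as np

# Reward propagation sketch: clamp labeled nodes, diffuse to the rest.

A = np.array([[0, 1, 1, 0, 0],          # toy similarity/structural edges
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)

labeled = np.array([True, False, False, False, True])
seed = np.array([1.0, 0.0, 0.0, 0.0, 0.0])   # node 0 correct, node 4 wrong

P = A / A.sum(axis=1, keepdims=True)          # row-normalized transitions
r = seed.copy()
for _ in range(50):
    r = 0.5 * (P @ r) + 0.5 * seed            # diffuse with a seed anchor
    r[labeled] = seed[labeled]                # clamp the labeled rollouts

print(np.round(r, 3))  # unlabeled nodes inherit rewards from neighbors
```

MemReward replaces this fixed diffusion with a trained GNN over a heterogeneous graph, but the principle is the same: with few labels, graph structure supplies the missing reward signal.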
[367] LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels
Lucas Maes, Quentin Le Lidec, Damien Scieur, Yann LeCun, Randall Balestriero
Main category: cs.LG
TL;DR: LeWM is a stable Joint Embedding Predictive Architecture that learns world models from raw pixels using only two loss terms, achieving efficient planning and meaningful physical structure encoding without complex regularization.
Details
Motivation: Existing JEPA methods for learning world models are fragile and require complex multi-term losses, exponential moving averages, pre-trained encoders, or auxiliary supervision to avoid representation collapse. There's a need for simpler, more stable end-to-end training approaches.
Method: LeWM uses only two loss terms: a next-embedding prediction loss and a regularizer enforcing Gaussian-distributed latent embeddings. This reduces tunable hyperparameters from six to one compared to existing alternatives. The model has ~15M parameters and trains end-to-end from raw pixels.
Result: LeWM plans up to 48x faster than foundation-model-based world models while remaining competitive across diverse 2D and 3D control tasks. The latent space encodes meaningful physical structure, and surprise evaluation confirms reliable detection of physically implausible events.
Conclusion: LeWM demonstrates that stable JEPA training is possible with minimal loss terms, enabling efficient world model learning that captures physical structure and supports fast planning for control tasks.
Abstract: Joint Embedding Predictive Architectures (JEPAs) offer a compelling framework for learning world models in compact latent spaces, yet existing methods remain fragile, relying on complex multi-term losses, exponential moving averages, pre-trained encoders, or auxiliary supervision to avoid representation collapse. In this work, we introduce LeWorldModel (LeWM), the first JEPA that trains stably end-to-end from raw pixels using only two loss terms: a next-embedding prediction loss and a regularizer enforcing Gaussian-distributed latent embeddings. This reduces tunable loss hyperparameters from six to one compared to the only existing end-to-end alternative. With ~15M parameters trainable on a single GPU in a few hours, LeWM plans up to 48x faster than foundation-model-based world models while remaining competitive across diverse 2D and 3D control tasks. Beyond control, we show that LeWM’s latent space encodes meaningful physical structure through probing of physical quantities. Surprise evaluation confirms that the model reliably detects physically implausible events.
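The two-term objective is simple enough to sketch numerically. The Gaussian regularizer below matches moments (zero mean, identity covariance); the paper's exact regularizer form may differ, so treat this as an illustrative assumption.

```python
import numpy as np

# Two-term JEPA loss sketch: prediction + Gaussian latent regularizer.

rng = np.random.default_rng(0)
batch, d = 64, 8
z_pred = rng.normal(size=(batch, d))     # predictor output for next frame
z_next = rng.normal(size=(batch, d))     # encoder embedding of next frame

# Term 1: next-embedding prediction loss.
pred_loss = np.mean((z_pred - z_next) ** 2)

# Term 2: push latents toward N(0, I) via first/second moments, which
# prevents the collapsed solution z = const from minimizing term 1.
mu = z_next.mean(axis=0)
cov = np.cov(z_next, rowvar=False)
gauss_reg = np.sum(mu ** 2) + np.sum((cov - np.eye(d)) ** 2)

lam = 0.1                                # the single tunable loss weight
loss = pred_loss + lam * gauss_reg
print(loss)
```

With only `lam` to tune, the hyperparameter surface is far smaller than multi-term JEPA recipes with EMA targets and auxiliary heads.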
[368] DPxFin: Adaptive Differential Privacy for Anti-Money Laundering Detection via Reputation-Weighted Federated Learning
Renuga Kanagavelu, Manjil Nepal, Ning Peiyan, Cai Kangning, Xu Jiming, Fei Gao, Yong Liu, Goh Siow Mong Rick, Qingsong Wei
Main category: cs.LG
TL;DR: DPxFin: A federated learning framework with reputation-guided adaptive differential privacy for anti-money laundering, dynamically adjusting privacy noise based on client reputation to balance privacy and accuracy.
Details
Motivation: Money laundering detection faces challenges with data privacy concerns and complex fraud patterns. Federated learning helps institutions train models without sharing data, but is prone to privacy leakage in tabular financial data. There is a need for a better privacy-utility trade-off in financial applications.
Method: Proposes the DPxFin framework, which integrates reputation-guided adaptive differential privacy. It computes client reputation by evaluating alignment between local and global models, then dynamically assigns differential privacy noise to client updates based on reputation. High-reputation clients get lower noise; low-reputation clients get stronger noise.
Result: Validated on Anti-Money Laundering dataset under IID and non-IID settings using MLP. Shows better trade-off between accuracy and privacy than traditional FL and fixed-noise DP baselines. Performance improvements consistent though modest. Withstands tabular data leakage attacks.
Conclusion: DPxFin effectively balances privacy and utility in financial applications, proving robust under real-world conditions with adaptive noise allocation based on client reputation.
Abstract: In the modern financial system, combating money laundering is a critical challenge complicated by data privacy concerns and increasingly complex fraud transaction patterns. Although federated learning (FL) is a promising problem-solving approach as it allows institutions to train their models without sharing their data, it has the drawback of being prone to privacy leakage, specifically in tabular data forms like financial data. To address this, we propose DPxFin, a novel federated framework that integrates reputation-guided adaptive differential privacy. Our approach computes client reputation by evaluating the alignment between locally trained models and the global model. Based on this reputation, we dynamically assign differential privacy noise to client updates, enhancing privacy while maintaining overall model utility. Clients with higher reputations receive lower noise to amplify their trustworthy contributions, while low-reputation clients are allocated stronger noise to mitigate risk. We validate DPxFin on the Anti-Money Laundering (AML) dataset under both IID and non-IID settings using Multi Layer Perceptron (MLP). Experimental analysis established that our approach has a more desirable trade-off between accuracy and privacy than those of traditional FL and fixed-noise Differential Privacy (DP) baselines, where performance improvements were consistent, even though on a modest scale. Moreover, DPxFin does withstand tabular data leakage attacks, proving its effectiveness under real-world financial conditions.
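The reputation-to-noise mapping can be sketched as follows, taking reputation as cosine alignment between a client update and the global update. The linear mapping and its bounds are hypothetical choices, not the paper's exact formula.

```python
import numpy as np

# Reputation-weighted DP noise sketch: high alignment -> low noise scale.

rng = np.random.default_rng(0)
dim = 32
global_update = rng.normal(size=dim)
# Four clients, increasingly misaligned with the global model.
client_updates = [global_update + rng.normal(scale=s, size=dim)
                  for s in (0.1, 0.5, 1.0, 3.0)]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sigma_min, sigma_max = 0.05, 0.5
sigmas, noisy_updates = [], []
for upd in client_updates:
    rep = max(cosine(upd, global_update), 0.0)         # reputation in [0, 1]
    sigma = sigma_max - rep * (sigma_max - sigma_min)  # high rep -> low noise
    sigmas.append(sigma)
    noisy_updates.append(upd + rng.normal(scale=sigma, size=dim))

print(sigmas[0] < sigmas[-1])  # best-aligned client gets the least noise
```

Clipping reputation at zero keeps adversarially anti-aligned updates at the maximum noise level rather than rewarding them.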
[369] MSNet and LS-Net: Scalable Multi-Scale Multi-Representation Networks for Time Series Classification
Celal Alagöz, Mehmet Kurnaz, Farhan Aadil
Main category: cs.LG
TL;DR: Multi-scale convolutional framework for time series classification using multiple input representations, with three architectures optimized for different objectives: accuracy, calibration, and efficiency.
Details
Motivation: Time series classification performance depends on both architectural design and input representation diversity. The paper aims to develop a scalable framework that systematically integrates multiple input representations for univariate time series.
Method: Proposes a multi-scale convolutional framework with three architectures: 1) MSNet, a hierarchical multi-scale network for robustness and calibration, 2) LS-Net, a lightweight variant for efficiency, and 3) a LiteMV adaptation for multi-representation univariate signals with cross-representation interaction.
Result: Evaluation on 142 benchmark datasets shows LiteMV achieves highest mean accuracy, MSNet provides best probabilistic calibration (lowest NLL), and LS-Net offers best efficiency-accuracy tradeoff. Multi-representation multi-scale modeling creates flexible design space for different objectives.
Conclusion: Scalable multi-representation multi-scale learning is established as a principled and practical direction for modern time series classification, with architectures tunable for accuracy, calibration, or resource constraints.
Abstract: Time series classification (TSC) performance depends not only on architectural design but also on the diversity of input representations. In this work, we propose a scalable multi-scale convolutional framework that systematically integrates structured multi-representation inputs for univariate time series. We introduce two architectures: MSNet, a hierarchical multi-scale convolutional network optimized for robustness and calibration, and LS-Net, a lightweight variant designed for efficiency-aware deployment. In addition, we adapt LiteMV – originally developed for multivariate inputs – to operate on multi-representation univariate signals, enabling cross-representation interaction. We evaluate all models across 142 benchmark datasets under a unified experimental protocol. Critical Difference analysis confirms statistically significant performance differences among the top models. Results show that LiteMV achieves the highest mean accuracy, MSNet provides superior probabilistic calibration (lowest NLL), and LS-Net offers the best efficiency-accuracy tradeoff. Pareto analysis further demonstrates that multi-representation multi-scale modeling yields a flexible design space that can be tuned for accuracy-oriented, calibration-oriented, or resource-constrained settings. These findings establish scalable multi-representation multi-scale learning as a principled and practical direction for modern TSC. Reference implementation of MSNet and LS-Net is available at: https://github.com/alagoz/msnet-lsnet-tsc
[370] Ternary Gamma Semirings: From Neural Implementation to Categorical Foundations
Ruoqi Sun
Main category: cs.LG
TL;DR: Neural networks fail on compositional generalization without algebraic constraints, but achieve perfect accuracy when guided by ternary Gamma semiring structure, revealing that successful learning approximates mathematically natural algebraic forms.
Details
Motivation: To understand why neural networks succeed or fail at compositional generalization tasks, and to establish a theoretical connection between neural network learning and abstract algebraic structures.
Method: Presents a minimal counterexample showing that standard neural networks fail completely (0% accuracy) on compositional generalization. Introduces a logical constraint, the Ternary Gamma Semiring, which guides the same architecture to learn a perfectly structured feature space. Proves that the learned feature space constitutes a finite commutative ternary Γ-semiring implementing the majority vote rule.
Result: With the ternary Gamma semiring constraint, the neural network achieved 100% accuracy on novel combinations. The learned structure corresponds precisely to the Boolean-type ternary Γ-semiring with |T|=4, |Γ|=1, which is unique up to isomorphism in existing classifications.
Conclusion: Neural network success can be understood as approximation of mathematically natural structures; learned representations generalize because they internalize algebraic axioms; logical constraints guide networks to converge to canonical forms. Establishes Computational Γ-Algebra as a new interdisciplinary direction.
Abstract: This paper establishes a theoretical framework connecting neural network learning with abstract algebraic structures. We first present a minimal counterexample demonstrating that standard neural networks completely fail on compositional generalization tasks (0% accuracy). By introducing a logical constraint – the Ternary Gamma Semiring – the same architecture learns a perfectly structured feature space, achieving 100% accuracy on novel combinations. We prove that this learned feature space constitutes a finite commutative ternary $Γ$-semiring, whose ternary operation implements the majority vote rule. Comparing with the recently established classification of Gokavarapu et al., we show that this structure corresponds precisely to the Boolean-type ternary $Γ$-semiring with $|T|=4$, $|Γ|=1$, which is unique up to isomorphism in their enumeration. Our findings reveal three profound conclusions: (i) the success of neural networks can be understood as an approximation of mathematically "natural" structures; (ii) learned representations generalize because they internalize algebraic axioms (symmetry, idempotence, majority property); (iii) logical constraints guide networks to converge to these canonical forms. This work provides a rigorous mathematical framework for understanding neural network generalization and inaugurates the new interdisciplinary direction of Computational $Γ$-Algebra.
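The axioms the abstract names (symmetry, idempotence, majority) are easy to check mechanically. The sketch below verifies them for the Boolean majority function on {0, 1}; the paper's carrier has |T| = 4, so this is a smaller illustrative instance, not the classified structure itself.

```python
from itertools import product

# Majority vote as a ternary operation; check the named axioms on {0, 1}.

def maj(a, b, c):
    return 1 if a + b + c >= 2 else 0

T = (0, 1)

# Majority/idempotence: maj(x, x, y) == x for all x, y.
assert all(maj(x, x, y) == x for x, y in product(T, repeat=2))

# Full symmetry: invariant under any permutation of arguments.
assert all(maj(a, b, c) == maj(b, a, c) == maj(c, b, a) == maj(a, c, b)
           for a, b, c in product(T, repeat=3))

print("majority axioms hold on {0, 1}")
```

In the paper's setting, the learned feature space is shown to realize exactly this kind of operation, which is why the constrained network generalizes to unseen combinations.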
[371] A General Deep Learning Framework for Wireless Resource Allocation under Discrete Constraints
Yikun Wang, Yang Li, Yik-Chung Wu, Rui Zhang
Main category: cs.LG
TL;DR: A novel deep learning framework for discrete wireless resource allocation problems using probabilistic modeling of support sets to overcome zero-gradient, constraint enforcement, and non-SPSD challenges.
Details
Motivation: Deep learning methods struggle with discrete optimization problems due to zero-gradient issues in backpropagation, difficulty enforcing discrete constraints, and inability to handle non-same-parameter-same-decision scenarios in wireless resource allocation.
Method: Introduces support sets to represent discrete variables, models elements as random variables with learned joint probability distributions, factorizes into conditional probabilities with sequential learning, uses masking for constraint enforcement, and employs dynamic context embedding.
Result: The framework outperforms existing baselines in both system performance and computational efficiency for joint user association/beamforming in cell-free systems and joint antenna positioning/beamforming in movable antenna systems.
Conclusion: The proposed probabilistic DL framework effectively addresses fundamental challenges in discrete optimization for wireless resource allocation, providing a general solution applicable to various mixed-discrete problems.
Abstract: While deep learning (DL)-based methods have achieved remarkable success in continuous wireless resource allocation, efficient solutions for problems involving discrete variables remain challenging. This is primarily due to the zero-gradient issue in backpropagation, the difficulty of enforcing intricate constraints with discrete variables, and the inability in generating solutions with non-same-parameter-same-decision (non-SPSD) property. To address these challenges, this paper proposes a general DL framework by introducing the support set to represent the discrete variables. We model the elements of the support set as random variables and learn their joint probability distribution. By factorizing the joint probability as the product of conditional probabilities, each conditional probability is sequentially learned. This probabilistic modeling directly tackles all the aforementioned challenges of DL for handling discrete variables. By operating on probability distributions instead of hard binary decisions, the framework naturally avoids the zero-gradient issue. During the learning of the conditional probabilities, discrete constraints can be seamlessly enforced by masking out infeasible solutions. Moreover, with a dynamic context embedding that captures the evolving discrete solutions, the non-SPSD property is inherently provided by the proposed framework. We apply the proposed framework to two representative mixed-discrete wireless resource allocation problems: (a) joint user association and beamforming in cell-free systems, and (b) joint antenna positioning and beamforming in movable antenna-aided systems. Simulation results demonstrate that the proposed DL framework consistently outperforms existing baselines in terms of both system performance and computational efficiency.
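The factorized-conditional sampling with constraint masking can be sketched on a toy assignment problem. The "at most one user per resource block" constraint and the random logits below are illustrative stand-ins for the network outputs.

```python
import numpy as np

# Sequential discrete sampling sketch: mask out infeasible choices before
# normalizing each conditional, so constraints hold by construction.

rng = np.random.default_rng(0)
n_users, n_blocks = 3, 3
assigned = np.zeros(n_blocks, dtype=bool)
decisions = []

for u in range(n_users):
    logits = rng.normal(size=n_blocks)        # stand-in for network output
    masked = logits + np.where(assigned, -np.inf, 0.0)  # forbid taken blocks
    p = np.exp(masked - masked.max())
    p /= p.sum()
    choice = int(rng.choice(n_blocks, p=p))   # conditioned on earlier picks
    decisions.append(choice)
    assigned[choice] = True

print(decisions)  # three distinct blocks, one per user
```

Because each conditional is renormalized after masking, gradients flow through the surviving probabilities and the zero-gradient problem of hard discrete decisions is avoided.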
[372] Target Concept Tuning Improves Extreme Weather Forecasting
Shijie Ren, Xinyue Gu, Ziheng Peng, Haifan Zhang, Peisong Niu, Bo Wu, Xiting Wang, Liang Sun, Jirong Wen
Main category: cs.LG
TL;DR: TaCT: A concept-gated fine-tuning framework that selectively improves deep learning models for rare high-impact meteorological events (like typhoons) while preserving overall performance by updating only failure-related parameters.
Details
Motivation: Deep learning models for meteorological forecasting often fail in rare but high-impact events like typhoons due to scarce data. Existing fine-tuning methods face a trade-off between ignoring extreme events and overfitting them at the expense of overall performance.
Method: TaCT uses Sparse Autoencoders and counterfactual analysis to automatically discover failure-related internal concepts in models. It then performs selective parameter updates only when the corresponding concepts are activated, rather than applying uniform adaptation across all scenarios.
Result: Experiments show consistent improvements in typhoon forecasting across different regions without degrading performance on other meteorological variables. The identified concepts correspond to physically meaningful circulation patterns, revealing model biases.
Conclusion: TaCT provides an interpretable, trustworthy adaptation framework for scientific forecasting tasks that can selectively improve models for rare events while preserving overall performance, with discovered concepts offering insights into model biases.
Abstract: Deep learning models for meteorological forecasting often fail in rare but high-impact events such as typhoons, where relevant data is scarce. Existing fine-tuning methods typically face a trade-off between overlooking these extreme events and overfitting them at the expense of overall performance. We propose TaCT, an interpretable concept-gated fine-tuning framework that solves the aforementioned issue by selective model improvement: models are adapted specifically for failure cases while preserving performance in common scenarios. To this end, TaCT automatically discovers failure-related internal concepts using Sparse Autoencoders and counterfactual analysis, and updates parameters only when the corresponding concepts are activated, rather than applying uniform adaptation. Experiments show consistent improvements in typhoon forecasting across different regions without degrading other meteorological variables. The identified concepts correspond to physically meaningful circulation patterns, revealing model biases and supporting trustworthy adaptation in scientific forecasting tasks. The code is available at https://anonymous.4open.science/r/Concept-Gated-Fine-tune-62AC.
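The gating mechanism can be sketched as follows: a sparse-autoencoder concept activation decides whether a fine-tuning step is applied. The toy SAE encoder, concept index, threshold, and gradient are hypothetical stand-ins.

```python
import numpy as np

# Concept-gated update sketch: parameters move only on inputs that fire
# the failure-related concept (e.g. typhoon-like circulation patterns).

rng = np.random.default_rng(0)
d, n_concepts = 16, 32
W_enc = rng.normal(size=(d, n_concepts))     # toy SAE encoder weights
failure_concept = 7                          # index found via counterfactuals
threshold = 1.0

def gated_step(theta, grad, h, lr=0.1):
    act = np.maximum(h @ W_enc, 0.0)         # ReLU concept activations
    if act[failure_concept] > threshold:     # concept fires -> adapt
        return theta - lr * grad
    return theta                             # otherwise leave weights alone

theta = np.ones(d)
grad = np.ones(d)
h_typhoon = W_enc[:, failure_concept] * 2.0  # input aligned with the concept
h_normal = -W_enc[:, failure_concept]        # input that cannot trigger it

updated = gated_step(theta, grad, h_typhoon)
unchanged = gated_step(theta, grad, h_normal)
print(not np.allclose(updated, theta), np.allclose(unchanged, theta))
```

Gating on concept activation, rather than fine-tuning on all samples, is what lets TaCT improve typhoon cases without degrading common-scenario performance.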
[373] FalconBC: Flow matching for Amortized inference of Latent-CONditioned physiologic Boundary Conditions
Chloe H. Choi, Alison L. Marsden, Daniele E. Schiavazzi
Main category: cs.LG
TL;DR: A probabilistic flow-based framework for amortized inference of boundary conditions in patient-specific cardiovascular models, handling complex scenarios like open-loop models and vascular lesions.
Details
Motivation: Current data-driven variational inference methods for boundary condition tuning in cardiovascular modeling fail in two important scenarios: open-loop models with known mean flow and assumed waveform shapes, and anatomies with vascular lesions where segmentation affects pressure/flow target reachability. Boundary conditions cannot be tuned in isolation in these cases.
Method: Introduces a general amortized inference framework based on probabilistic flow that treats clinical targets, inflow features, and point cloud embeddings of patient-specific anatomies as either conditioning variables or quantities to be jointly estimated.
Result: Demonstrated on two patient-specific models: an aorto-iliac bifurcation with varying stenosis locations and severity, and a coronary arterial tree.
Conclusion: The framework provides a general approach for handling complex boundary condition tuning scenarios in cardiovascular modeling where traditional methods fall short.
Abstract: Boundary condition tuning is a fundamental step in patient-specific cardiovascular modeling. Despite an increase in offline training cost, recent methods in data-driven variational inference can efficiently estimate the joint posterior distribution of boundary conditions, with amortization of training efforts over clinical targets. However, even the most modern approaches fall short in two important scenarios: open-loop models with known mean flow and assumed waveform shapes, and anatomies affected by vascular lesions where segmentation influences the reachability of pressure or flow split targets. In both cases, boundary conditions cannot be tuned in isolation. We introduce a general amortized inference framework based on probabilistic flow that treats clinical targets, inflow features, and point cloud embeddings of patient-specific anatomies as either conditioning variables or quantities to be jointly estimated. We demonstrate the approach on two patient-specific models: an aorto-iliac bifurcation with varying stenosis locations and severity, and a coronary arterial tree.
[374] Do Post-Training Algorithms Actually Differ? A Controlled Study Across Model Scales Uncovers Scale-Dependent Ranking Inversions
Xiaoyi Li
Main category: cs.LG
TL;DR: Large-scale systematic comparison of 51 post-training alignment algorithms reveals that model scale matters most, algorithm rankings change with scale, loss function modifications provide minimal gains, and algorithm effectiveness is task-specific.
Details
Motivation: There are many competing post-training alignment algorithms (DPO, SimPO, KTO, GRPO, etc.), but practitioners lack controlled comparisons to guide algorithm selection, creating confusion in the field.
Method: Develops OXRL, a unified framework implementing 51 post-training algorithms with identical infrastructure. Conducts a large-scale evaluation across 8 algorithms, 4 model scales (0.5B-7B), 3 evaluation domains, and 20 DPO variants, totaling ~240 training runs on H100 GPUs with proper statistical controls.
Result: 1) Algorithm rankings are unstable across model scales (complete inversion observed). 2) Loss function modifications yield negligible gains (no DPO variant significantly outperforms vanilla DPO). 3) Algorithm effectiveness is highly task-specific (large spreads on GSM8K collapse on MATH and general benchmarks). Hierarchy of leverage: model scale (~50pp) > training paradigm (~10pp) > online vs offline (~9pp) > loss function (~1pp).
Conclusion: Model scale is the most important factor in post-training alignment, algorithm choice matters primarily within training distribution, and loss function modifications provide minimal benefits. Provides evidence-based guidance for practitioners and releases comprehensive benchmark.
Abstract: Post-training alignment has produced dozens of competing algorithms – DPO, SimPO, KTO, GRPO, and others – yet practitioners lack controlled comparisons to guide algorithm selection. We present OXRL, a unified framework implementing 51 post-training algorithms with identical infrastructure, enabling the first large-scale apples-to-apples evaluation. Our study spans 8 algorithms across 4 model scales (0.5B–7B), 3 evaluation domains, and a 20-variant DPO taxonomy (100 runs at 1.5B, 5 seeds each), totaling $\sim$240 training runs on H100 GPUs. Three headline findings emerge. (1)Algorithm rankings are unstable across scale: at 1.5B, online RL (SGRPO) tops all methods at 58.0%$\pm$0.57 on GSM8K; by 7B, the worst small-scale method (SimPO) becomes the best (85.8%), a complete ranking inversion driven by model scale rather than LoRA regularization (confirmed via 2$\times$2 factorial). (2)Loss function modifications yield negligible gains: none of 20 DPO variants significantly outperform vanilla DPO after Bonferroni correction; the sole significant outlier, SimPO, is worse ($-$11.5pp, $p < 10^{-4}$). (3)Algorithm leverage is task-specific: the 19.3pp GSM8K spread collapses to 0.54pp on MATH ($36\times$) and 0.47pp on general-domain benchmarks ($41\times$), confirming that algorithm choice matters primarily within the training distribution. These findings yield a hierarchy of leverage for practitioners: model scale (${\sim}$50pp) $\gg$ training paradigm (${\sim}$10pp) $\gg$ online vs.\ offline (${\sim}$9pp) $\gg$ loss function (${\sim}$1pp). We release all code, configs, and evaluation data as a living community benchmark.
[375] DAPA: Distribution Aware Piecewise Activation Functions for On-Device Transformer Inference and Training
Maoyang Xiang, Bo Wang
Main category: cs.LG
TL;DR: DAPA is a hardware-friendly activation function for Transformers using distribution-aware piecewise approximation for efficient on-device inference.
Details
Motivation: Non-linear activation functions consume substantial hardware resources and impact system performance and energy efficiency in on-device inference and training, requiring more efficient solutions.
Method: Proposes Distribution-Aware Piecewise Activation (DAPA), which uses non-uniform piecewise approximation based on the pre-activation data distribution, with finer segments in high-probability regions, and quantizes using Distribution-Weighted Mean Square Error.
Result: DAPA speeds up GELU computation by 16×, decreases DSP utilization by 16× while maintaining comparable or better performance across vision Transformers and GPT-2 models.
Conclusion: DAPA provides a differentiable, hardware-friendly activation function that significantly improves efficiency for Transformer architectures on edge devices.
Abstract: Non-linear activation functions play a pivotal role in on-device inference and training, as they not only consume substantial hardware resources but also impose a significant impact on system performance and energy efficiency. In this work, we propose Distribution-Aware Piecewise Activation (DAPA), a differentiable and hardware-friendly activation function for Transformer architectures by exploiting the distribution of pre-activation data. DAPA employs a non-uniform piecewise approximation that allocates finer segments to high-probability regions of the distribution, improving generalizability over prior piecewise linear methods. The resulting approximation is further quantized using Distribution-Weighted Mean Square Error to reduce latency and resource utilization for hardware deployment. Our HLS implementation demonstrates that DAPA speeds up GELU computation by 16$\times$ and decreases DSP utilization by 16$\times$ while maintaining comparable or better performance across vision Transformers and GPT-2 models.
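A minimal numpy sketch of the core idea, not the paper's DAPA: compare a piecewise-linear GELU approximation whose interior knots sit at quantiles of the pre-activation distribution (finer segments where data is dense) against one with uniformly spaced knots. All knot counts, ranges, and the standard-normal stand-in for pre-activations are assumptions for illustration.

```python
import numpy as np
from math import erf, sqrt

# Exact GELU via the Gaussian CDF (erf form).
gelu = np.vectorize(lambda x: 0.5 * x * (1.0 + erf(x / sqrt(2.0))))

rng = np.random.default_rng(0)
pre_acts = rng.standard_normal(100_000)   # stand-in pre-activation distribution

def pwl(x, knots):
    """Piecewise-linear approximation of GELU with the given knot positions."""
    return np.interp(x, knots, gelu(knots))

n_interior = 8
# Uniform knots: evenly spaced over the clipped input range.
uniform_knots = np.linspace(-6.0, 6.0, n_interior + 2)
# Distribution-aware knots: interior knots at quantiles of the pre-activation
# distribution, so segments are finer in high-probability regions.
probs = np.linspace(0, 1, n_interior + 2)[1:-1]
aware_knots = np.concatenate(([-6.0], np.quantile(pre_acts, probs), [6.0]))

x = np.clip(rng.standard_normal(10_000), -6.0, 6.0)
err_uniform = np.mean(np.abs(pwl(x, uniform_knots) - gelu(x)))
err_aware = np.mean(np.abs(pwl(x, aware_knots) - gelu(x)))
```

Under the data distribution, the same segment budget yields a noticeably lower average error when the segments follow the data, which is the motivation for the distribution-weighted quantization step in the paper.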
[376] Beyond Weighted Summation: Learnable Nonlinear Aggregation Functions for Robust Artificial Neurons
Berke Deniz Bozyigit
Main category: cs.LG
TL;DR: Learnable nonlinear neuron aggregation mechanisms (F-Mean and Gaussian Support) improve neural network robustness to noisy inputs while maintaining trainability through hybrid designs.
Details
Motivation: Standard weighted summation in neurons behaves like a mean-based estimator, making them sensitive to noisy or extreme inputs. The paper explores whether learnable nonlinear aggregation alternatives can improve robustness without sacrificing optimization stability.
Method: Introduces two differentiable aggregation mechanisms: the F-Mean neuron (learnable power-weighted aggregation) and the Gaussian Support neuron (distance-aware affinity weighting). Proposes hybrid neurons that interpolate between linear and nonlinear aggregation via learnable blending parameters to preserve optimization stability.
Result: Evaluated on MLPs and CNNs with CIFAR-10 and noisy CIFAR-10. Hybrid neurons consistently improve robustness under noise, with F-Mean hybrids also showing modest gains on clean data. Three-way hybrid achieves robustness scores up to 0.991 vs 0.890 baseline. Learned parameters converge to sub-linear aggregation (p ≈ 0.43-0.50) and high novelty utilization (α ≈ 0.69-0.79).
Conclusion: Neuron-level aggregation is a meaningful and underexplored design dimension for building more noise-tolerant neural networks. Learnable nonlinear aggregation mechanisms can significantly improve robustness while maintaining trainability.
Abstract: Weighted summation has remained the default input aggregation mechanism in artificial neurons since the earliest neural network models. While computationally efficient, this design implicitly behaves like a mean-based estimator and is therefore sensitive to noisy or extreme inputs. This paper investigates whether replacing fixed linear aggregation with learnable nonlinear alternatives can improve neural network robustness without sacrificing trainability. Two differentiable aggregation mechanisms are introduced: an F-Mean neuron based on a learnable power-weighted aggregation rule, and a Gaussian Support neuron based on distance-aware affinity weighting. To preserve the optimisation stability of standard neurons, hybrid neurons are proposed that interpolate between linear and nonlinear aggregation through a learnable blending parameter. Evaluated in multilayer perceptrons and convolutional neural networks on CIFAR-10 and a noisy CIFAR-10 variant with additive Gaussian corruption, hybrid neurons consistently improve robustness under noise while F-Mean hybrids also yield modest gains on clean data. The three-way hybrid achieves robustness scores of up to 0.991 compared to 0.890 for the standard baseline, and learned parameters converge consistently to sub-linear aggregation (p $\approx$ 0.43–0.50) and high novelty utilisation ($α$ $\approx$ 0.69–0.79). These findings suggest that neuron-level aggregation is a meaningful and underexplored design dimension for building more noise-tolerant neural networks.
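The robustness intuition can be sketched with a plain (unweighted) power mean on positive inputs; the paper's F-Mean neuron additionally learns weights, handles signs, and trains p and the blend α, all of which are omitted here as simplifications.

```python
import numpy as np

def power_mean(x, p):
    """Unweighted power mean, defined here for positive inputs only."""
    return (np.mean(x ** p)) ** (1.0 / p)

def hybrid(x, p, alpha):
    """Blend linear (arithmetic-mean) and nonlinear (power-mean) aggregation."""
    return (1.0 - alpha) * np.mean(x) + alpha * power_mean(x, p)

x = np.array([1.0, 1.0, 1.0, 100.0])  # one extreme "noisy" input
lin = np.mean(x)                      # arithmetic mean: dragged up by the outlier
sub = power_mean(x, 0.5)              # sub-linear aggregation: far less sensitive
mix = hybrid(x, 0.5, 0.7)             # p and alpha in the ranges the paper reports
```

With p < 1 the aggregate compresses the outlier's influence (here roughly 10.6 versus 25.75 for the mean), and the hybrid lands between the two, which is the trainability-robustness trade the blending parameter controls.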
[377] Anatomical Heterogeneity in Transformer Language Models
Tomasz Wietrzykowski
Main category: cs.LG
TL;DR: The paper reveals profound anatomical heterogeneity in transformer language models, challenging the assumption of layer homogeneity and proposing Growth Transformer Training for more efficient training.
Details
Motivation: Current transformer language models are trained with uniform computational budgets across all layers, implicitly assuming layer homogeneity. The authors challenge this assumption through empirical analysis to understand layer heterogeneity and optimize training efficiency.
Method: Analyzed SmolLM2-135M (a 30-layer, 135M-parameter causal language model) using five diagnostic metrics: weight predictability (R2), ablation degradation, recovery speed, weight manipulation robustness, and structural analysis. Proposed Growth Transformer Training that allocates budget by layer importance.
Result: Found profound anatomical heterogeneity: layer weights follow strong mathematical regularity (R2 = 0.91) with universal oscillatory delta pattern; layer importance spans 10^7 range with critical core layers and anti-layers; recovery speed correlates with layer importance; only weight scaling preserves model quality; Growth Transformer Training achieves ~54% cost reduction and 4.7x lower validation loss than uniform training.
Conclusion: Transformer layers are not homogeneous, and training efficiency can be significantly improved by allocating computational budgets based on layer importance rather than using uniform training approaches.
Abstract: Current transformer language models are trained with uniform computational budgets across all layers, implicitly assuming layer homogeneity. We challenge this assumption through empirical analysis of SmolLM2-135M, a 30-layer, 135M-parameter causal language model, using five diagnostic metrics: weight predictability (R2), ablation degradation, recovery speed, weight manipulation robustness, and structural analysis. We find profound anatomical heterogeneity: (1) Layer weights follow strong mathematical regularity (R2 = 0.91) with a universal oscillatory delta pattern (correlation ~= -0.50), yet predicted weights cause catastrophic failure due to nonlinear error accumulation. (2) Layer importance spans a 10^7 range, from a critical core (L8-11, up to +63,419% PPL degradation) to anti-layers (L14, L17) whose removal improves performance. (3) Recovery speed correlates with layer importance, indicating differential training requirements. (4) Only weight scaling (alpha = 0.9) preserves model quality among five tested manipulation strategies. (5) Growth Transformer Training, allocating budget by layer importance, achieves ~54% cost reduction. A proof-of-concept experiment confirms this: 4.7x lower validation loss than uniform training at identical parameter count, while being 13% faster.
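The ablation-degradation diagnostic can be illustrated on a toy residual stack (not SmolLM2): skip one layer at a time and measure how much the output moves. The layer sizes, scales, and random data are all assumptions; the point is only that heterogeneous layers produce heterogeneous degradation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_layers = 16, 6
# Toy residual stack: layer i computes x <- x + W_i @ x.
# Layers 2 and 3 are deliberately "heavier" than the rest.
Ws = [rng.standard_normal((d, d)) * s for s in (0.01, 0.01, 0.5, 0.5, 0.01, 0.01)]

def forward(x, skip=None):
    for i, W in enumerate(Ws):
        if i == skip:
            continue          # ablation: replace the layer with the identity
        x = x + W @ x
    return x

x0 = rng.standard_normal(d)
base = forward(x0)
# Ablation degradation per layer: output change when that layer is removed.
degradation = [np.linalg.norm(forward(x0, skip=i) - base) for i in range(n_layers)]
```

In the paper the same measurement (in perplexity rather than output norm) spans seven orders of magnitude across real transformer layers, motivating importance-weighted training budgets.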
[378] A Mathematical Theory of Understanding
Bahar Taşkesen
Main category: cs.LG
TL;DR: A mathematical model of learning bottlenecks where instructional signals are only usable when learners have acquired necessary prerequisites, creating structural and epistemic limits on learning speed.
Details
Motivation: While generative AI has made information production cheap, the value of information depends on whether learners can absorb it. Explanations that clarify concepts for some users may be noise to others lacking prerequisites, creating a learner-side bottleneck that limits effective knowledge transfer.
Method: Develops a mathematical model where learners are abstract learning systems with prerequisite structures over concepts. Teaching is modeled as sequential communication with a latent target, where instructional signals are only usable when learners have acquired the necessary prerequisites to parse them.
Result: The model identifies two limits on learning speed: structural (prerequisite reachability) and epistemic (uncertainty about target). Shows threshold effects in training - when teaching horizon is below prerequisite depth, additional instruction fails; once depth reached, completion becomes feasible. Broadcast curriculum can be linearly slower than personalized instruction across heterogeneous learners.
Conclusion: Learner-side structural capacity creates fundamental bottlenecks in knowledge transfer, even with abundant information. Personalized instruction accounting for prerequisite structures is crucial for effective learning, with implications for AI training, education, and technology adoption.
Abstract: Generative AI has transformed the economics of information production, making explanations, proofs, examples, and analyses available at very low cost. Yet the value of information still depends on whether downstream users can absorb and act on it. A signal conveys meaning only to a learner with the structural capacity to decode it: an explanation that clarifies a concept for one user may be indistinguishable from noise to another who lacks the relevant prerequisites. This paper develops a mathematical model of that learner-side bottleneck. We model the learner as a mind, an abstract learning system characterized by a prerequisite structure over concepts. A mind may represent a human learner, an artificial learner such as a neural network, or any agent whose ability to interpret signals depends on previously acquired concepts. Teaching is modeled as sequential communication with a latent target. Because instructional signals are usable only when the learner has acquired the prerequisites needed to parse them, the effective communication channel depends on the learner’s current state of knowledge and becomes more informative as learning progresses. The model yields two limits on the speed of learning and adoption: a structural limit determined by prerequisite reachability and an epistemic limit determined by uncertainty about the target. The framework implies threshold effects in training and capability acquisition. When the teaching horizon lies below the prerequisite depth of the target, additional instruction cannot produce successful completion of teaching; once that depth is reached, completion becomes feasible. Across heterogeneous learners, a common broadcast curriculum can be slower than personalized instruction by a factor linear in the number of learner types.
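The threshold effect can be reproduced in a few lines on the simplest prerequisite structure, a chain; the one-signal-per-round teacher and chain topology are illustrative assumptions, not the paper's general model.

```python
def teach(depth, horizon):
    """Chain of concepts 0..depth-1, where concept i requires concept i-1.
    One signal per round; a signal decodes only if its prerequisites are held."""
    known = set()
    for _ in range(horizon):
        # Teacher targets the frontier: the first concept not yet acquired.
        for c in range(depth):
            if c not in known:
                if c == 0 or (c - 1) in known:
                    known.add(c)   # prerequisites held: the signal is usable
                break              # otherwise the signal is noise to this learner
    return (depth - 1) in known    # did the learner reach the target concept?
```

With a target at prerequisite depth 5, a horizon of 4 rounds cannot complete teaching no matter how the signals are chosen, while a horizon of 5 suffices, exactly the structural limit the abstract describes.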
[379] Warm-Start Flow Matching for Guaranteed Fast Text/Image Generation
Minyoung Kim
Main category: cs.LG
TL;DR: WS-FM: Warm-Start Flow Matching accelerates sample generation by using fast, low-quality draft models as initial distributions instead of pure noise, reducing required time steps while maintaining output quality.
Details
Motivation: Current generative models (AR LLMs, diffusion models, flow matching) require many function evaluations during inference, making them computationally expensive and slow. There's a need to reduce sample generation time without sacrificing quality.
Method: Proposes Warm-Start Flow Matching (WS-FM), which uses computationally lightweight generative models to produce draft samples as initial distributions. Instead of starting from pure noise at time 0, WS-FM starts closer to the end time with draft samples, reducing the number of required time steps while guaranteeing a speed-up.
Result: Demonstrated on synthetic toy data and real-world text and image generation tasks. Shows guaranteed speed-up in sample generation without sacrificing quality of generated samples.
Conclusion: WS-FM provides an effective approach to accelerate flow matching algorithms by leveraging fast draft models, offering guaranteed speed-up while maintaining sample quality across various generation tasks.
Abstract: Current auto-regressive (AR) LLMs, diffusion-based text/image generative models, and recent flow matching (FM) algorithms are capable of generating premium quality text/image samples. However, the inference or sample generation in these models is often very time-consuming and computationally demanding, mainly due to large numbers of function evaluations corresponding to the lengths of tokens or the numbers of diffusion steps. This also necessitates heavy GPU resources, time, and electricity. In this work we propose a novel solution to reduce the sample generation time of flow matching algorithms by a guaranteed speed-up factor, without sacrificing the quality of the generated samples. Our key idea is to utilize computationally lightweight generative models whose generation time is negligible compared to that of the target AR/FM models. The draft samples from a lightweight model, whose quality is not satisfactory but fast to generate, are regarded as an initial distribution for a FM algorithm. Unlike conventional usage of FM that takes a pure noise (e.g., Gaussian or uniform) initial distribution, the draft samples are already of decent quality, so we can set the starting time to be closer to the end time rather than 0 in the pure noise FM case. This will significantly reduce the number of time steps to reach the target data distribution, and the speed-up factor is guaranteed. Our idea, dubbed {\em Warm-Start FM} or WS-FM, can essentially be seen as a {\em learning-to-refine} generative model from low-quality draft samples to high-quality samples. As a proof of concept, we demonstrate the idea on some synthetic toy data as well as real-world text and image generation tasks, illustrating that our idea offers guaranteed speed-up in sample generation without sacrificing the quality of the generated samples.
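The step-count saving can be seen on a one-dimensional toy where the velocity field is analytic rather than learned: conditioned on a single target point y, the linear-path flow-matching velocity is v(x, t) = (y - x)/(1 - t). The target value, step size, draft sample, and start time below are illustrative assumptions.

```python
def integrate(x, t0, target, dt=0.01, t_end=0.99):
    """Euler integration of the conditional-FM velocity v(x,t) = (y - x)/(1 - t),
    stopping just short of t = 1 where the field is singular."""
    n_steps = round((t_end - t0) / dt)
    t = t0
    for _ in range(n_steps):
        x = x + dt * (target - x) / (1.0 - t)
        t += dt
    return x, n_steps

y = 3.0                                          # stand-in "data" point
x_cold, steps_cold = integrate(0.0, 0.0, y)      # standard FM: noise at t = 0
x_warm, steps_warm = integrate(2.5, 0.5, y)      # WS-FM: decent draft, start t0 = 0.5
```

Both runs land near the target, but the warm start integrates only the interval [t0, 1), so the step count (and hence the number of function evaluations of the expensive model) shrinks by exactly the factor 1/(1 - t0), the guaranteed speed-up the paper refers to.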
[380] Optimizing Resource-Constrained Non-Pharmaceutical Interventions for Multi-Cluster Outbreak Control Using Hierarchical Reinforcement Learning
Xueqiao Peng, Andrew Perrault
Main category: cs.LG
TL;DR: Hierarchical RL framework for optimal resource allocation across multiple disease outbreak clusters under budget constraints, outperforming baselines by 20-30% in outbreak control.
Details
Motivation: Real-world public health faces challenges in allocating limited resources (like testing and quarantine) across multiple asynchronous disease outbreak clusters that vary in size and risk, requiring decisions under uncertainty and operational constraints.
Method: Formulates the problem as a constrained restless multi-armed bandit and proposes a hierarchical reinforcement learning framework with a global controller learning a continuous action cost multiplier and local policies estimating the marginal value of allocating resources to individuals within clusters.
Result: The method consistently outperforms RMAB-inspired and heuristic baselines by 20-30% in outbreak control effectiveness across various system scales and testing budgets, and scales well to up to 40 concurrently active clusters with faster decision-making.
Conclusion: The hierarchical reinforcement learning framework provides an effective and scalable solution for optimal resource allocation in multi-cluster disease outbreak scenarios under budget constraints.
Abstract: Non-pharmaceutical interventions (NPIs), such as diagnostic testing and quarantine, are crucial for controlling infectious disease outbreaks but are often constrained by limited resources, particularly in early outbreak stages. In real-world public health settings, resources must be allocated across multiple outbreak clusters that emerge asynchronously, vary in size and risk, and compete for a shared resource budget. Here, a cluster corresponds to a group of close contacts generated by a single infected index case. Thus, decisions must be made under uncertainty and heterogeneous demands, while respecting operational constraints. We formulate this problem as a constrained restless multi-armed bandit and propose a hierarchical reinforcement learning framework. A global controller learns a continuous action cost multiplier that adjusts global resource demand, while a generalized local policy estimates the marginal value of allocating resources to individuals within each cluster. We evaluate the proposed framework in a realistic agent-based simulator of SARS-CoV-2 with dynamically arriving clusters. Across a wide range of system scales and testing budgets, our method consistently outperforms RMAB-inspired and heuristic baselines, improving outbreak control effectiveness by 20%-30%. Experiments on up to 40 concurrently active clusters further demonstrate that the hierarchical framework is highly scalable and enables faster decision-making than the RMAB-inspired method.
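A heuristic stand-in for the learned local policies, not the paper's method: give each cluster a concave value curve and allocate a shared budget one unit at a time to the highest marginal value. The parametric value function and all numbers are assumptions chosen to make the marginal-value logic visible.

```python
import numpy as np

def greedy_allocate(rates, gammas, budget):
    """Allocate a shared budget unit-by-unit to the cluster with the highest
    marginal value; value_i(k) = r_i * (1 - g_i**k) is concave in k."""
    alloc = np.zeros(len(rates), dtype=int)
    for _ in range(budget):
        # marginal gain of the (k+1)-th unit: r * g**k * (1 - g)
        gains = rates * gammas ** alloc * (1.0 - gammas)
        alloc[np.argmax(gains)] += 1
    return alloc

rates = np.array([10.0, 1.0, 1.0])   # clusters differ in size/risk
gammas = np.array([0.5, 0.5, 0.5])
alloc = greedy_allocate(rates, gammas, budget=3)
value = float(np.sum(rates * (1.0 - gammas ** alloc)))
equal = float(np.sum(rates * (1.0 - gammas ** np.array([1, 1, 1]))))
```

Concentrating the budget on the high-risk cluster beats an equal split here; the paper's contribution is learning these marginal values (and a global cost multiplier coupling the clusters) rather than assuming a known value curve.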
[381] GeoLAN: Geometric Learning of Latent Explanatory Directions in Large Language Models
Tianyu Bell Pan, Damon L. Woodard
Main category: cs.LG
TL;DR: GeoLAN is a training framework that treats token representations as geometric trajectories and applies stickiness conditions inspired by the Kakeya Conjecture to improve LLM transparency and interpretability.
Details
Motivation: Large language models demonstrate strong performance but often lack transparency and interpretability. The authors aim to enhance mechanistic interpretability by introducing geometric perspectives on token representations.
Method: Introduces the GeoLAN framework, treating token representations as geometric trajectories with stickiness conditions inspired by the Kakeya Conjecture. Develops two differentiable regularizers: Katz-Tao Convex Wolff (KT-CW) for promoting isotropy and Katz-Tao Attention (KT-Attn) for encouraging diverse attention patterns.
Result: Experiments with Gemma-3 (1B, 4B, 12B) and Llama-3-8B show GeoLAN frequently maintains task accuracy while improving geometric metrics and reducing certain fairness biases. Benefits are most significant in mid-sized models, revealing scale-dependent trade-offs between geometric precision and performance.
Conclusion: Geometry-aware training is a promising approach to enhance mechanistic interpretability in LLMs, with the GeoLAN framework showing effectiveness in improving transparency while maintaining performance, particularly for mid-sized models.
Abstract: Large language models (LLMs) demonstrate strong performance, but they often lack transparency. We introduce GeoLAN, a training framework that treats token representations as geometric trajectories and applies stickiness conditions inspired by recent developments related to the Kakeya Conjecture. We have developed two differentiable regularizers, Katz-Tao Convex Wolff (KT-CW) and Katz-Tao Attention (KT-Attn), that promote isotropy and encourage diverse attention. Our experiments with Gemma-3 (1B, 4B, 12B) and Llama-3-8B show that GeoLAN frequently maintains task accuracy while improving geometric metrics and reducing certain fairness biases. These benefits are most significant in mid-sized models. Our findings reveal scale-dependent trade-offs between geometric precision and performance, suggesting that geometry-aware training is a promising approach to enhance mechanistic interpretability.
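A generic isotropy penalty in the same spirit as KT-CW, not the paper's construction: penalize the deviation of the (uncentered) second moment of token embeddings from a scaled identity, which is zero exactly when all directions carry equal energy. The toy embedding matrices below are assumptions.

```python
import numpy as np

def isotropy_penalty(X):
    """Squared Frobenius distance between the feature second moment and the
    nearest scaled identity; zero iff the representation is isotropic."""
    C = X.T @ X / X.shape[0]
    target = (np.trace(C) / C.shape[0]) * np.eye(C.shape[0])
    return float(np.linalg.norm(C - target) ** 2)

d = 4
X_iso = np.sqrt(d) * np.eye(d)                 # second moment exactly I
X_aniso = np.ones((d, 1)) @ np.ones((1, d))    # all tokens on one direction

pen_iso = isotropy_penalty(X_iso)
pen_aniso = isotropy_penalty(X_aniso)
```

A differentiable penalty of this shape can be added to the training loss; the collapsed, anisotropic representation is penalized heavily while the isotropic one pays nothing.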
[382] Deep Hilbert–Galerkin Methods for Infinite-Dimensional PDEs and Optimal Control
Samuel N. Cohen, Filippo de Feo, Jackson Hebner, Justin Sirignano
Main category: cs.LG
TL;DR: Deep learning methods for solving fully nonlinear second-order PDEs on Hilbert spaces using Hilbert-Galerkin Neural Operators, with theoretical approximation guarantees and numerical methods for control problems.
Details
Motivation: To develop approximation methods for fully nonlinear second-order PDEs on separable Hilbert spaces, which arise in many applied sciences including control problems, functional differential equations, and stochastic systems.
Method: Parameterize solutions via Hilbert-Galerkin Neural Operators (HGNOs), prove Universal Approximation Theorems for functions on Hilbert spaces with derivatives up to second order, and develop Deep Hilbert-Galerkin and Hilbert Actor-Critic methods by minimizing the L²-norm of PDE residuals.
Result: First UATs for these problems with novel topologies for Hessian terms, successful numerical solution of Kolmogorov and HJB PDEs related to optimal control of deterministic and stochastic heat and Burgers’ equations.
Conclusion: The approach demonstrates promise for solving complex PDEs on Hilbert spaces using deep learning, with applications to control problems and various stochastic systems.
Abstract: We develop deep learning-based approximation methods for fully nonlinear second-order PDEs on separable Hilbert spaces, such as HJB equations for infinite-dimensional control, by parameterizing solutions via Hilbert–Galerkin Neural Operators (HGNOs). We prove the first Universal Approximation Theorems (UATs) which are sufficiently powerful to address these problems, based on novel topologies for Hessian terms and corresponding novel continuity assumptions on the fully nonlinear operator. These topologies are non-sequential and non-metrizable, making the problem delicate. In particular, we prove UATs for functions on Hilbert spaces, together with their Fréchet derivatives up to second order, and for unbounded operators applied to the first derivative, ensuring that HGNOs are able to approximate all the PDE terms. For control problems, we further prove UATs for optimal feedback controls in terms of our approximating value function HGNO. We develop numerical training methods, which we call Deep Hilbert–Galerkin and Hilbert Actor-Critic (reinforcement learning) Methods, for these problems by minimizing the $L^2_μ(H)$-norm of the residual of the PDE on the whole Hilbert space, not just a projected PDE to finite dimensions. This is the first paper to propose such an approach. The models considered arise in many applied sciences, such as functional differential equations in physics and Kolmogorov and HJB PDEs related to controlled PDEs, SPDEs, path-dependent systems, partially observed stochastic systems, and mean-field SDEs. We numerically solve examples of Kolmogorov and HJB PDEs related to the optimal control of deterministic and stochastic heat and Burgers’ equations, demonstrating the promise of our deep learning-based approach.
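The training principle, minimizing the L²-norm of the PDE residual over a parameterized function class, can be shown in its simplest finite-dimensional form: a sine (Galerkin) basis in place of an HGNO, and a linear 1-D Poisson problem in place of a fully nonlinear HJB equation. The manufactured right-hand side is an assumption chosen so the exact solution is known.

```python
import numpy as np

# Solve u'' = f on [0, pi] with u(0) = u(pi) = 0 by least-squares
# minimization of the L2 residual over a sine basis.
n_basis, n_grid = 6, 200
x = np.linspace(0.0, np.pi, n_grid)

f = -np.sin(x) - 4.0 * np.sin(2.0 * x)   # manufactured so u = sin(x) + sin(2x)
# Column k holds the second derivative of the basis function sin((k+1) x).
A = np.stack([-(k + 1) ** 2 * np.sin((k + 1) * x) for k in range(n_basis)], axis=1)
c, *_ = np.linalg.lstsq(A, f, rcond=None)  # minimize ||u'' - f||_2 on the grid

u = np.stack([np.sin((k + 1) * x) for k in range(n_basis)], axis=1) @ c
u_exact = np.sin(x) + np.sin(2.0 * x)
```

The paper replaces the fixed basis with neural operators, works on the full Hilbert space rather than a projection, and handles fully nonlinear operators, but the objective has the same shape: drive the residual to zero in an L² sense.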
[383] Global Convergence of Multiplicative Updates for the Matrix Mechanism: A Collaborative Proof with Gemini 3
Keith Rush
Main category: cs.LG
TL;DR: The paper proves convergence of a fixed-point iteration for a nuclear norm optimization problem in private ML, with significant AI assistance in the proof process.
Details
Motivation: The paper aims to close an open problem from previous work on optimization over algorithm spaces in private machine learning, specifically proving convergence of a fixed-point iteration for a regularized nuclear norm objective with Hadamard product structure.
Method: The authors analyze the fixed-point iteration v ← φ(v) and prove that v^{(k+1)} = diag((D_{v^{(k)}}^{1/2} M D_{v^{(k)}}^{1/2})^{1/2}) converges monotonically to the unique global optimizer of the potential function J(v). The proof was significantly assisted by Gemini 3, with human corrections and interventions.
Result: The paper proves the convergence of the fixed-point iteration to the unique global optimizer, closing the open problem from the cited work. The proof demonstrates monotonic convergence of the iteration.
Conclusion: The paper successfully closes a mathematical gap in private ML optimization while also serving as commentary on AI-assisted mathematical proof development, including insights on prompting strategies and principles for working with AI in mathematics.
Abstract: We analyze a fixed-point iteration $v \leftarrow φ(v)$ arising in the optimization of a regularized nuclear norm objective involving the Hadamard product structure, posed in~\cite{denisov} in the context of an optimization problem over the space of algorithms in private machine learning. We prove that the iteration $v^{(k+1)} = \text{diag}((D_{v^{(k)}}^{1/2} M D_{v^{(k)}}^{1/2})^{1/2})$ converges monotonically to the unique global optimizer of the potential function $J(v) = 2 \text{Tr}((D_v^{1/2} M D_v^{1/2})^{1/2}) - \sum v_i$, closing a problem left open there. The bulk of this proof was provided by Gemini 3, subject to some corrections and interventions. Gemini 3 also sketched the initial version of this note. Thus, it represents as much a commentary on the practical use of AI in mathematics as it represents the closure of a small gap in the literature. As such, we include a small narrative description of the prompting process, and some resulting principles for working with AI to prove mathematics.
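The iteration is short enough to run directly; the sketch below follows the paper's formulas, with the 4×4 positive-definite test matrix and iteration count as assumptions.

```python
import numpy as np

def psd_sqrt(A):
    """Symmetric PSD matrix square root via eigendecomposition."""
    w, Q = np.linalg.eigh(A)
    return (Q * np.sqrt(np.clip(w, 0.0, None))) @ Q.T

def J(v, M):
    """Potential J(v) = 2 Tr((D_v^{1/2} M D_v^{1/2})^{1/2}) - sum_i v_i."""
    S = psd_sqrt(np.sqrt(v)[:, None] * M * np.sqrt(v)[None, :])
    return 2.0 * np.trace(S) - v.sum()

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
M = A @ A.T + np.eye(4)          # positive definite test matrix
v = np.ones(4)
J0 = J(v, M)
for _ in range(2000):
    # v <- diag((D_v^{1/2} M D_v^{1/2})^{1/2})
    v = np.diag(psd_sqrt(np.sqrt(v)[:, None] * M * np.sqrt(v)[None, :])).copy()
residual = float(np.linalg.norm(
    v - np.diag(psd_sqrt(np.sqrt(v)[:, None] * M * np.sqrt(v)[None, :]))))
```

On a small positive-definite instance the iteration settles to a fixed point with the potential no lower than at the starting point, consistent with (though of course not a substitute for) the monotone-convergence theorem.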
[384] Adaptive Layerwise Perturbation: Unifying Off-Policy Corrections for LLM RL
Chenlu Ye, Xuanchang Zhang, Yifan Hao, Zhou Yu, Ziji Zhang, Abhinav Gullapalli, Hao Chen, Jing Huang, Tong Zhang
Main category: cs.LG
TL;DR: ALP addresses off-policy RL training instability in LLMs by injecting learnable perturbations into hidden states to flatten importance ratio distributions and reduce policy divergence.
Details
Motivation: Off-policy RL for LLMs suffers from policy staleness and training-inference mismatch, causing heavy-tailed importance ratios that destabilize training and limit exploration.
Method: Adaptive Layerwise Perturbation (ALP) injects small learnable perturbations into the input hidden states of each layer during updates, creating a numerator for importance ratios against the unchanged inference policy.
Result: ALP improves final performance on math and tool-integrated reasoning tasks, prevents importance ratio tail blow-up and KL spikes, and boosts exploration. Full-layer perturbations work best.
Conclusion: ALP effectively stabilizes off-policy RL training for LLMs by controlling policy divergence through representation-level perturbations, enabling more stable and exploratory training.
Abstract: Off-policy problems, such as policy staleness and training-inference mismatch, have become a major bottleneck for training stability and further exploration in LLM RL. To enhance inference efficiency, the distribution gap between the inference and updated policy grows, leading to heavy-tailed importance ratios. Heavy-tailed ratios arise when the policy is locally sharp, which further inflates sharp gradients and can push updates outside the trust region. To address this, we propose Adaptive Layerwise Perturbation (ALP), which injects small learnable perturbations into the input hidden states of each layer during updates; the perturbed policy is used as the numerator of the importance ratio against the unchanged inference policy in the objective. Intuitively, by adding controlled noise to intermediate representations, ALP prevents the updated policy from deviating too sharply from the inference policy, and enlarges the policy family to cover the inference policy family with mismatch noises. Hence, the flattened distribution can naturally tighten the gap between the updated and inference policies and reduce the tail of the importance ratios, thus maintaining training stability. This is further validated empirically. Experiments on single-turn math and multi-turn tool-integrated reasoning tasks show that ALP not only improves final performance, but also avoids blow-up of the importance-ratio tail and KL spikes during iterative training, along with boosted exploration. Ablations show that representation-level perturbations across all layers are most effective, substantially outperforming partial-layer and logits-only variants.
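How a hidden-state perturbation turns into an importance-ratio numerator can be shown with a single toy layer; ALP itself perturbs every layer and learns the perturbations, whereas here the layer, dimensions, and a fixed small delta are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
d, vocab = 8, 5
W = rng.standard_normal((vocab, d)) * 0.3
h = rng.standard_normal(d)                # hidden state under the inference policy

pi_infer = softmax(W @ h)                 # unchanged inference policy (denominator)
delta = 0.01 * rng.standard_normal(d)     # small perturbation (learnable in ALP)
pi_pert = softmax(W @ (h + delta))        # perturbed policy (numerator)

a = int(np.argmax(pi_infer))              # a sampled/observed token
ratio = float(pi_pert[a] / pi_infer[a])   # importance ratio for the objective
```

Because the perturbation is small and applied at the representation level, the ratio stays close to 1 by construction, which is exactly the flattening of the ratio distribution the method relies on.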
[385] TRACE: Trajectory Recovery with State Propagation Diffusion for Urban Mobility
Jinming Wang, Hai Wang, Hongkai Wen, Geyong Min, Man Luo
Main category: cs.LG
TL;DR: TRACE is a diffusion model for GPS trajectory recovery that reconstructs dense, continuous trajectories from sparse, incomplete inputs using a State Propagation Diffusion Model with memory mechanism.
Details
Motivation: Real-world GPS trajectories are often sparse and uneven due to low sampling rates and limited infrastructure coverage, making trajectory recovery challenging but essential for location-based services and smart city applications.
Method: Proposes TRACE with a State Propagation Diffusion Model (SPDM) that integrates a memory mechanism to retain and leverage intermediate results from previous denoising steps, enabling effective reconstruction of hard-to-recover trajectory segments.
Result: Extensive experiments on multiple real-world datasets show TRACE outperforms state-of-the-art methods with >26% accuracy improvement without significant inference overhead.
Conclusion: TRACE strengthens the foundation for mobile and web-connected location services, advancing the quality and fairness of data-driven urban applications.
Abstract: High-quality GPS trajectories are essential for location-based web services and smart city applications, including navigation, ride-sharing and delivery. However, due to low sampling rates and limited infrastructure coverage during data collection, real-world trajectories are often sparse and feature unevenly distributed location points. Recovering these trajectories into dense and continuous forms is essential but challenging, given their complex and irregular spatio-temporal patterns. In this paper, we introduce a novel diffusion model for trajectory recovery named TRACE, which reconstructs dense and continuous trajectories from sparse and incomplete inputs. At the core of TRACE, we propose a State Propagation Diffusion Model (SPDM), which integrates a novel memory mechanism, so that during the denoising process, TRACE can retain and leverage intermediate results from previous steps to effectively reconstruct those hard-to-recover trajectory segments. Extensive experiments on multiple real-world datasets show that TRACE outperforms the state-of-the-art, offering $>$26% accuracy improvement without significant inference overhead. Our work strengthens the foundation for mobile and web-connected location services, advancing the quality and fairness of data-driven urban applications. Code is available at: https://github.com/JinmingWang/TRACE
[386] Any-Subgroup Equivariant Networks via Symmetry Breaking
Abhinav Goel, Derek Lim, Hannah Lawrence, Stefanie Jegelka, Ningyuan Huang
Main category: cs.LG
TL;DR: ASEN is a single neural network that can be equivariant to multiple permutation subgroups simultaneously by using symmetry-breaking auxiliary inputs, enabling flexible multi-modal processing without designing separate architectures for each symmetry.
Details
Motivation: Current equivariant architectures are constrained to specific symmetries chosen a priori, limiting their applicability to diverse datasets and preventing the development of flexible multi-modal foundation models that can process various data types equivariantly.
Method: Start with a fully permutation-equivariant base model, then achieve subgroup equivariance by modulating with symmetry-breaking auxiliary inputs whose automorphism group matches the desired subgroup. Use 2-closure theory to enable approximate symmetry breaking with fast algorithms when exact solutions are computationally hard.
Result: Theoretically, ASEN can simulate equivariant MLPs and guarantees universality if the base model is universal. Empirically, it outperforms both separate equivariant models and single non-equivariant models on symmetry selection for graph/image tasks and multitask/transfer learning for sequence tasks.
Conclusion: ASEN provides a flexible framework for building multi-modal foundation models that can handle diverse symmetries within a single architecture, overcoming limitations of traditional equivariant networks while maintaining strong theoretical guarantees and empirical performance.
Abstract: The inclusion of symmetries as an inductive bias, known as equivariance, often improves generalization on geometric data (e.g. grids, sets, and graphs). However, equivariant architectures are usually highly constrained, designed for symmetries chosen a priori, and not applicable to datasets with other symmetries. This precludes the development of flexible, multi-modal foundation models capable of processing diverse data equivariantly. In this work, we build a single model – the Any-Subgroup Equivariant Network (ASEN) – that can be simultaneously equivariant to several groups, simply by modulating a certain auxiliary input feature. In particular, we start with a fully permutation-equivariant base model, and then obtain subgroup equivariance by using a symmetry-breaking input whose automorphism group is that subgroup. However, finding an input with the desired automorphism group is computationally hard. We overcome this by relaxing from exact to approximate symmetry breaking, leveraging the notion of 2-closure to derive fast algorithms. Theoretically, we show that our subgroup-equivariant networks can simulate equivariant MLPs, and their universality can be guaranteed if the base model is universal. Empirically, we validate our method on symmetry selection for graph and image tasks, as well as multitask and transfer learning for sequence tasks, showing that a single network equivariant to multiple permutation subgroups outperforms both separate equivariant models and a single non-equivariant model.
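A cartoon of the symmetry-breaking mechanism, far simpler than ASEN itself: a DeepSets-style map that is equivariant to all of S_n acting jointly on the data and an auxiliary "color" input. Holding the colors fixed restricts the effective symmetry to the permutations that preserve the coloring, i.e. the coloring's automorphism group. The specific map and values are illustrative assumptions.

```python
import numpy as np

def f(x, color):
    """Permutation-equivariant map modulated by an auxiliary color input:
    each element is shifted by the mean of its own color group."""
    out = np.empty_like(x)
    for c in np.unique(color):
        m = color == c
        out[m] = x[m] + x[m].mean()
    return out

x = np.array([1.0, 2.0, 3.0, 5.0])
color = np.array([0, 0, 1, 1])     # symmetry-breaking input: stabilizer is S2 x S2

within = np.array([1, 0, 2, 3])    # swap inside color 0: preserves the coloring
across = np.array([2, 1, 0, 3])    # swap across colors: breaks the coloring

ok = np.allclose(f(x[within], color), f(x, color)[within])      # equivariant
broken = np.allclose(f(x[across], color), f(x, color)[across])  # not equivariant
```

One network, many symmetries: changing the auxiliary input changes which subgroup of permutations the fixed network respects, without touching the architecture.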
[387] ICLAD: In-Context Learning for Unified Tabular Anomaly Detection Across Supervision Regimes
Jack Yi Wei, Narges Armanfard
Main category: cs.LG
TL;DR: ICLAD is an in-context learning foundation model for tabular anomaly detection that generalizes across datasets and supervision regimes using meta-learning on synthetic tasks.
Details
Motivation: Existing deep learning approaches for tabular anomaly detection are typically dataset-specific and limited to single supervision regimes, preventing them from leveraging shared structures across tasks and adapting to different supervision levels.
Method: ICLAD is trained via meta-learning on synthetic tabular anomaly detection tasks. At inference, it assigns anomaly scores by conditioning on the training set without updating model weights, using in-context learning.
Result: Comprehensive experiments on 57 tabular datasets from ADBench show state-of-the-art performance across three supervision regimes (one-class, fully unsupervised, and semi-supervised).
Conclusion: ICLAD establishes a unified framework for tabular anomaly detection that generalizes across both datasets and supervision regimes.
Abstract: Anomaly detection on tabular data is commonly studied under three supervision regimes, including one-class settings that assume access to anomaly-free training samples, fully unsupervised settings with unlabeled and potentially contaminated training data, and semi-supervised settings with limited anomaly labels. Existing deep learning approaches typically train dataset-specific models under the assumption of a single supervision regime, which limits their ability to leverage shared structures across anomaly detection tasks and to adapt to different supervision levels. We propose ICLAD, an in-context learning foundation model for tabular anomaly detection that generalizes across both datasets and supervision regimes. ICLAD is trained via meta-learning on synthetic tabular anomaly detection tasks, and at inference time, the model assigns anomaly scores by conditioning on the training set without updating model weights. Comprehensive experiments on 57 tabular datasets from ADBench show that our method achieves state-of-the-art performance across three supervision regimes, establishing a unified framework for tabular anomaly detection.
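The inference-time idea, scoring a test point by conditioning on the training set with no weight updates, can be illustrated with a distance-based stand-in. This is a hedged sketch of the in-context scoring pattern, not the paper's meta-trained transformer:

```python
import math

def anomaly_score(x, context, k=3):
    """Score a test point by conditioning on a context (training) set:
    no parameters are updated; the score is the mean distance to the
    k nearest context points (a stand-in for the learned in-context map)."""
    dists = sorted(math.dist(x, c) for c in context)
    return sum(dists[:k]) / k

# Context drawn from a tight cluster; an outlier should score higher.
context = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05)]
inlier = anomaly_score((0.05, 0.0), context)
outlier = anomaly_score((3.0, 3.0), context)
```

In the actual model the scoring map is a transformer trained by meta-learning on synthetic tasks; only the "condition on the training set, update nothing" interface is shown here.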
[388] Stochastic Sequential Decision Making over Expanding Networks with Graph Filtering
Zhan Gao, Bishwadeep Das, Elvin Isufi
Main category: cs.LG
TL;DR: A stochastic sequential decision-making framework for graph filtering that adapts to expanding graphs using multi-agent reinforcement learning, with applications to cold-start recommendation and COVID prediction.
Details
Motivation: Existing graph filters mainly study fixed graphs, ignoring that real-world graphs often expand as nodes continually attach with unknown patterns. Current approaches rely on either pre-designed filters or online learning, which only consider past or present information without accounting for future impacts.
Method: Propose a stochastic sequential decision-making framework for filtering networked data with expanding graphs. Represent filter shifts as agents, model the filter as a multi-agent system, and train the policy using multi-agent reinforcement learning to account for long-term rewards and capture expansion dynamics. Develop a context-aware graph neural network to parameterize the policy, which tunes filter parameters based on both graph and agent information.
Result: Experiments on synthetic and real datasets (including cold-start recommendation and COVID prediction) demonstrate the benefits of using a sequential decision-making perspective over batch and online filtering alternatives.
Conclusion: The proposed framework effectively handles expanding graphs by incorporating future impacts through sequential decision-making and multi-agent reinforcement learning, outperforming traditional batch and online filtering methods.
Abstract: Graph filters leverage topological information to process networked data with existing methods mainly studying fixed graphs, ignoring that graphs often expand as nodes continually attach with an unknown pattern. The latter requires developing filter-based decision-making paradigms that take evolution and uncertainty into account. Existing approaches rely on either pre-designed filters or online learning, limited to a myopic view considering only past or present information. To account for future impacts, we propose a stochastic sequential decision-making framework for filtering networked data with a policy that adapts filtering to expanding graphs. By representing filter shifts as agents, we model the filter as a multi-agent system and train the policy following multi-agent reinforcement learning. This accounts for long-term rewards and captures expansion dynamics through sequential decision-making. Moreover, we develop a context-aware graph neural network to parameterize the policy, which tunes filter parameters based on information of both the graph and agents. Experiments on synthetic and real datasets from cold-start recommendation to COVID prediction highlight the benefits of using a sequential decision-making perspective over batch and online filtering alternatives.
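The underlying filtering operation is a polynomial of the graph shift operator, y = h_0 x + h_1 Sx + ... + h_K S^K x; each shift term S^k x is what the paper treats as an "agent". A minimal sketch of the filter itself (the RL-trained policy is not reproduced):

```python
def matvec(S, x):
    """Apply the graph shift operator (a matrix) to a signal vector."""
    return [sum(S[i][j] * x[j] for j in range(len(x))) for i in range(len(S))]

def graph_filter(S, x, h):
    """Order-K graph filter y = sum_k h[k] * S^k x: the k-th term mixes
    information from k-hop neighborhoods of each node."""
    y = [h[0] * xi for xi in x]
    z = x
    for k in range(1, len(h)):
        z = matvec(S, z)        # one more application of the shift operator
        y = [yi + h[k] * zi for yi, zi in zip(y, z)]
    return y

# Path graph on 3 nodes, adjacency matrix as the shift operator.
S = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
y = graph_filter(S, [1.0, 0.0, 0.0], h=[0.5, 0.25])   # impulse at node 0
```

In the paper, the coefficients h are not fixed but tuned by a context-aware GNN policy as the graph expands.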
[389] Scalable Cross-Facility Federated Learning for Scientific Foundation Models on Multiple Supercomputers
Yijiang Li, Zilinghan Li, Kyle Chard, Ian Foster, Todd Munson, Ravi Madduri, Kibaek Kim
Main category: cs.LG
TL;DR: A federated learning framework for cross-HPC facility training that addresses privacy constraints in scientific AI applications, demonstrated by fine-tuning large language models on chemistry data.
Details
Motivation: Scientific AI applications increasingly require training large models on data that cannot be centralized due to privacy, data sovereignty, or volume constraints. Federated learning enables collaborative training without centralizing raw data, but deploying FL across HPC facilities presents unique challenges beyond cloud/enterprise settings.
Method: Developed a comprehensive cross-facility FL framework for heterogeneous HPC environments, built on the Advanced Privacy-Preserving Federated Learning (APPFL) framework with Globus Compute and Transfer orchestration. Evaluated across four U.S. DOE leadership-class supercomputers.
Result: Demonstrated that FL experiments across HPC facilities are practically achievable. Characterized key sources of heterogeneity impacting training performance and showed algorithmic choices matter significantly under realistic HPC scheduling conditions. Validated scientific applicability by fine-tuning a large language model on a chemistry instruction dataset.
Conclusion: Cross-facility FL for scientific applications is feasible but requires careful consideration of HPC heterogeneity and scheduling. Identified scheduler-aware algorithm design as a critical open challenge for future deployments of large-scale scientific AI models.
Abstract: Artificial Intelligence for scientific applications increasingly requires training large models on data that cannot be centralized due to privacy constraints, data sovereignty, or the sheer volume of data generated. Federated learning (FL) addresses this by enabling collaborative training without centralizing raw data, but scientific applications demand model scales that require extensive computing resources, typically offered at High Performance Computing (HPC) facilities. Deploying FL experiments across HPC facilities introduces challenges beyond cloud or enterprise settings. We present a comprehensive cross-facility FL framework for heterogeneous HPC environments, built on the Advanced Privacy-Preserving Federated Learning (APPFL) framework with Globus Compute and Transfer orchestration, and evaluate it across four U.S. Department of Energy (DOE) leadership-class supercomputers. We demonstrate that FL experiments across HPC facilities are practically achievable, characterize key sources of heterogeneity impacting the training performance, and show that algorithmic choices matter significantly under realistic HPC scheduling conditions. We validate the scientific applicability by fine-tuning a large language model on a chemistry instruction dataset, and identify scheduler-aware algorithm design as a critical open challenge for future deployments.
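APPFL's actual orchestration (Globus Compute and Transfer) is not reproduced here; as a minimal sketch of the aggregation step at the heart of any such framework, sample-weighted federated averaging of client model vectors:

```python
def fedavg(updates):
    """Sample-weighted federated averaging (FedAvg).
    updates: list of (weights, n_samples) pairs, one per client facility.
    Returns the new global model as a weighted mean of client models."""
    total = sum(n for _, n in updates)
    dim = len(updates[0][0])
    return [sum(w[i] * n for w, n in updates) / total for i in range(dim)]

# Two facilities with unequal data volumes: the larger one dominates.
global_w = fedavg([([1.0, 0.0], 30), ([0.0, 1.0], 10)])
```

Under realistic HPC scheduling, clients arrive asynchronously, which is precisely why the paper finds that the choice of aggregation algorithm matters; plain FedAvg is only the baseline case.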
[390] Subspace Kernel Learning on Tensor Sequences
Lei Wang, Xi Ding, Yongsheng Gao, Piotr Koniusz
Main category: cs.LG
TL;DR: UKTL is a novel kernel framework for M-mode tensors that learns uncertainty-driven subspace comparisons with scalable Nyström linearization for large-scale tensor data.
Details
Motivation: Learning from structured multi-way tensor data requires capturing complex interactions across tensor modes while maintaining computational efficiency. Existing methods often lack robustness and interpretability in handling unreliable mode components.
Method: Proposes UKTL with uncertainty-driven kernel tensor learning that compares mode-wise subspaces from tensor unfoldings. Uses scalable Nyström kernel linearization with dynamically learned pivot tensors via soft k-means clustering. Features uncertainty-aware subspace weighting that adaptively down-weights unreliable mode components based on estimated confidence.
Result: Achieves state-of-the-art performance on action recognition benchmarks (NTU-60, NTU-120, Kinetics-Skeleton), shows superior generalization, and provides meaningful mode-wise insights.
Conclusion: Establishes a principled, scalable, and interpretable kernel learning paradigm for structured multi-way and multi-modal tensor sequences with robust uncertainty handling.
Abstract: Learning from structured multi-way data, represented as higher-order tensors, requires capturing complex interactions across tensor modes while remaining computationally efficient. We introduce Uncertainty-driven Kernel Tensor Learning (UKTL), a novel kernel framework for $M$-mode tensors that compares mode-wise subspaces derived from tensor unfoldings, enabling an expressive and robust similarity measure. To handle large-scale tensor data, we propose a scalable Nyström kernel linearization with dynamically learned pivot tensors obtained via soft $k$-means clustering. A key innovation of UKTL is its uncertainty-aware subspace weighting, which adaptively down-weights unreliable mode components based on estimated confidence, improving robustness and interpretability in comparisons between input and pivot tensors. Our framework is fully end-to-end trainable and naturally incorporates both multi-way and multi-mode interactions through structured kernel compositions. Extensive evaluations on action recognition benchmarks (NTU-60, NTU-120, Kinetics-Skeleton) show that UKTL achieves state-of-the-art performance, superior generalization, and meaningful mode-wise insights. This work establishes a principled, scalable, and interpretable kernel learning paradigm for structured multi-way and multi-modal tensor sequences.
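The mode-wise subspaces that UKTL compares are derived from tensor unfoldings. A sketch of mode-n unfolding for a 3-way tensor, under one common column ordering (the paper's exact convention is not specified in the summary):

```python
def unfold(T, mode):
    """Mode-n unfolding of a 3-way tensor stored as nested lists: rows
    index the chosen mode, columns run over the two remaining modes.
    Subspaces of these matrices are what a mode-wise kernel compares."""
    I, J, K = len(T), len(T[0]), len(T[0][0])
    if mode == 0:
        return [[T[i][j][k] for k in range(K) for j in range(J)] for i in range(I)]
    if mode == 1:
        return [[T[i][j][k] for k in range(K) for i in range(I)] for j in range(J)]
    return [[T[i][j][k] for j in range(J) for i in range(I)] for k in range(K)]

T = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]   # a 2x2x2 tensor
M0 = unfold(T, 0)                          # 2 x 4 matrix for mode 0
```

UKTL would then extract a subspace (e.g. via SVD) from each unfolding and weight the mode-wise comparisons by estimated confidence; that machinery is omitted here.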
[391] Neural Uncertainty Principle: A Unified View of Adversarial Fragility and LLM Hallucination
Dong-Xiao Zhang, Hu Lou, Jun-Jie Zhang, Jun Zhu, Deyu Meng
Main category: cs.LG
TL;DR: The paper reveals a common geometric origin between adversarial vulnerability in vision models and hallucination in LLMs, formalized as a Neural Uncertainty Principle (NUP) where input and loss gradient are conjugate observables with irreducible uncertainty bounds.
Details
Motivation: To unify seemingly separate problems of adversarial vulnerability in vision models and hallucination in large language models by revealing their common geometric origin, moving beyond modality-specific patches to a principled theoretical framework.
Method: Formalizes a Neural Uncertainty Principle (NUP) showing input and loss gradient as conjugate observables with an irreducible uncertainty bound. Uses a single-backward probe to measure input-gradient correlation. Proposes ConjMask (masking high-contribution components) and LogitReg (logit-side regularization) for vision robustness, and uses the probe as a decoding-free risk signal for LLM hallucination detection.
Result: Shows that adversarial fragility and hallucination share common geometric origin. In vision, masking highly coupled components improves robustness without adversarial training. In language, prefill-stage probe detects hallucination risk before generating tokens, enabling early detection and prompt selection.
Conclusion: NUP provides a unified theoretical framework for diagnosing and mitigating boundary anomalies across perception and generation tasks, turning separate failure taxonomies into a shared uncertainty-budget view with practical applications for both vision and language models.
Abstract: Adversarial vulnerability in vision and hallucination in large language models are conventionally viewed as separate problems, each addressed with modality-specific patches. This study first reveals that they share a common geometric origin: the input and its loss gradient are conjugate observables subject to an irreducible uncertainty bound. Formalizing a Neural Uncertainty Principle (NUP) under a loss-induced state, we find that in near-bound regimes, further compression must be accompanied by increased sensitivity dispersion (adversarial fragility), while weak prompt-gradient coupling leaves generation under-constrained (hallucination). Crucially, this bound is modulated by an input-gradient correlation channel, captured by a specifically designed single-backward probe. In vision, masking highly coupled components improves robustness without costly adversarial training; in language, the same prefill-stage probe detects hallucination risk before generating any answer tokens. NUP thus turns two seemingly separate failure taxonomies into a shared uncertainty-budget view and provides a principled lens for reliability analysis. Guided by this NUP theory, we propose ConjMask (masking high-contribution input components) and LogitReg (logit-side regularization) to improve robustness without adversarial training, and use the probe as a decoding-free risk signal for LLMs, enabling hallucination detection and prompt selection. NUP thus provides a unified, practical framework for diagnosing and mitigating boundary anomalies across perception and generation tasks.
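The paper's single-backward probe relies on autodiff inside the model; as a crude stand-in, input-gradient coupling can be measured as the correlation between an input and the numerical gradient of a scalar loss at that input. Everything below (finite differences, Pearson correlation) is an illustrative assumption, not the paper's probe:

```python
import math

def num_grad(loss, x, eps=1e-5):
    """Central-difference numerical gradient of a scalar loss."""
    g = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += eps
        xm[i] -= eps
        g.append((loss(xp) - loss(xm)) / (2 * eps))
    return g

def pearson(a, b):
    """Pearson correlation between two equal-length vectors."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    va = math.sqrt(sum((ai - ma) ** 2 for ai in a))
    vb = math.sqrt(sum((bi - mb) ** 2 for bi in b))
    return cov / (va * vb)

# Quadratic loss: the gradient is 2x, so input-gradient coupling is maximal.
x = [0.3, -1.2, 0.7]
corr = pearson(x, num_grad(lambda v: sum(vi * vi for vi in v), x))
```

In the paper's framing, low coupling between the prompt and the loss gradient is the signal that generation is under-constrained (a hallucination risk), which is why the probe can run at the prefill stage before any tokens are decoded.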
[392] Wearable Foundation Models Should Go Beyond Static Encoders
Yu Yvonne Wu, Yuwei Zhang, Hyungjun Yoon, Ting Dang, Dimitris Spathis, Tong Xia, Qiang Yang, Jing Han, Dong Ma, Sung-Ju Lee, Cecilia Mascolo
Main category: cs.LG
TL;DR: Wearable foundation models need to evolve from static, short-term health monitoring to longitudinal, anticipatory systems that reason over personal health trajectories and support proactive interventions.
Details
Motivation: Current wearable foundation models focus on short-term, retrospective prediction of well-defined health tasks but are inadequate for modeling chronic, progressive, or episodic conditions that unfold over longer timeframes. There's a need to move beyond static encoders to systems that can reason over evolving personal history, context, and future risk trajectories.
Method: The paper proposes three foundational shifts: 1) Structurally rich data with integrated multimodal, long-term personal trajectories and contextual metadata; 2) Longitudinal-aware multimodal modeling with long-context inference, temporal abstraction, and personalization; 3) Agentic inference systems that support planning, decision-making, and clinically grounded interventions under uncertainty.
Result: The proposed framework reframes wearable health monitoring from retrospective signal interpretation toward continuous, anticipatory, and human-aligned health support, enabling modeling of chronic conditions that unfold over weeks, months, or years.
Conclusion: Wearable foundation models must be explicitly designed for longitudinal, anticipatory health reasoning by incorporating structurally rich data, longitudinal-aware multimodal modeling, and agentic inference systems to effectively support chronic health management.
Abstract: Wearable foundation models (WFMs), trained on large volumes of data collected by affordable, always-on devices, have demonstrated strong performance on short-term, well-defined health monitoring tasks, including activity recognition, fitness tracking, and cardiovascular signal assessment. However, most existing WFMs primarily map short temporal windows to predefined labels via static encoders, emphasizing retrospective prediction rather than reasoning over evolving personal history, context, and future risk trajectories. As a result, they are poorly suited for modeling chronic, progressive, or episodic health conditions that unfold over weeks, months or years. Hence, we argue that WFMs must move beyond static encoders and be explicitly designed for longitudinal, anticipatory health reasoning. We identify three foundational shifts required to enable this transition: (1) Structurally rich data, which goes beyond isolated datasets or outcome-conditioned collection to integrated multimodal, long-term personal trajectories, and contextual metadata, ideally supported by open and interoperable data ecosystems; (2) Longitudinal-aware multimodal modeling, which prioritizes long-context inference, temporal abstraction, and personalization over cross-sectional or population-level prediction; and (3) Agentic inference systems, which move beyond static prediction to support planning, decision-making, and clinically grounded intervention under uncertainty. Together, these shifts reframe wearable health monitoring from retrospective signal interpretation toward continuous, anticipatory, and human-aligned health support.
[393] ARMOR: Adaptive Resilience Against Model Poisoning Attacks in Continual Federated Learning for Mobile Indoor Localization
Danish Gufran, Akhil Singampalli, Sudeep Pasricha
Main category: cs.LG
TL;DR: ARMOR: A continual federated learning framework for indoor localization that protects global models from corruption during updates using state-space modeling to detect and mitigate erroneous or adversarial updates.
Details
Motivation: Indoor localization needs privacy-preserving federated learning, but continual updates in dynamic environments with device heterogeneity make global models vulnerable to erroneous updates and adversarial poisoning attacks, degrading localization performance.
Method: Proposes ARMOR framework with a novel state-space model (SSM) that learns historical evolution of global model weight tensors, predicts expected next states, compares incoming local updates to these projections, and selectively mitigates corrupted updates before aggregation.
Result: ARMOR achieves up to 8.0x reduction in mean error and 4.97x reduction in worst-case error compared to state-of-the-art indoor localization frameworks, demonstrating strong resilience against model corruption in real-world conditions.
Conclusion: ARMOR effectively safeguards global models in continual federated learning for indoor localization by detecting and mitigating corrupted updates, enabling robust adaptation to temporal dynamics while preventing model degradation from adversarial attacks.
Abstract: Indoor localization has become increasingly essential for applications ranging from asset tracking to delivering personalized services. Federated learning (FL) offers a privacy-preserving approach by training a centralized global model (GM) using distributed data from mobile devices without sharing raw data. However, real-world deployments require a continual federated learning (CFL) setting, where the GM receives continual updates under device heterogeneity and evolving indoor environments. In such dynamic conditions, erroneous or biased updates can cause the GM to deviate from its expected learning trajectory, gradually degrading internal GM representations and GM localization performance. This vulnerability is further exacerbated by adversarial model poisoning attacks. To address this challenge, we propose ARMOR, a novel CFL-based framework that monitors and safeguards the GM during continual updates. ARMOR introduces a novel state-space model (SSM) that learns the historical evolution of GM weight tensors and predicts the expected next state of weight tensors of the GM. By comparing incoming local updates with this SSM projection, ARMOR detects deviations and selectively mitigates corrupted updates before local updates are aggregated with the GM. This mechanism enables robust adaptation to temporal environmental dynamics and mitigates the effects of model poisoning attacks while preventing GM corruption. Experimental evaluations in real-world conditions indicate that ARMOR achieves notable improvements, with up to 8.0x reduction in mean error and 4.97x reduction in worst-case error compared to state-of-the-art indoor localization frameworks, demonstrating strong resilience against model corruption tested using real-world data and mobile devices.
[394] Demonstrations, CoT, and Prompting: A Theoretical Analysis of ICL
Xuhan Tong, Yuchen Zeng, Jiawei Zhang
Main category: cs.LG
TL;DR: Theoretical analysis of In-Context Learning (ICL) in LLMs linking design choices (demonstration selection, CoT prompting, number of demos, templates) to generalization behavior, with insights on how pretraining enables task generalization and CoT enables task composition.
Details
Motivation: Existing theoretical explanations of ICL either rely on strong assumptions or fail to capture practical factors like demonstration selection, CoT prompting, number of demonstrations, and prompt templates. This paper aims to address this gap with a theoretical analysis under mild assumptions.
Method: Establishes theoretical analysis of ICL under mild assumptions, derives upper bound on ICL test loss governed by demonstration quality (Lipschitz constants), intrinsic ICL capability of pretrained model, and distribution shift. Analyzes CoT prompting as inducing task decomposition, and characterizes sensitivity to prompt templates based on number of demonstrations.
Result: Shows that pretraining equips models with ability to generalize beyond observed tasks, CoT enables composition of simpler subtasks into complex ones, and demonstrations/instructions enable retrieval of similar/complex tasks. All theoretical insights are corroborated by experiments.
Conclusion: Provides comprehensive theoretical framework linking practical ICL design choices to generalization behavior, showing how pretraining, CoT, and demonstrations jointly support generalization to unseen tasks.
Abstract: In-Context Learning (ICL) enables pretrained LLMs to adapt to downstream tasks by conditioning on a small set of input-output demonstrations, without any parameter updates. Although there have been many theoretical efforts to explain how ICL works, most either rely on strong architectural or data assumptions, or fail to capture the impact of key practical factors such as demonstration selection, Chain-of-Thought (CoT) prompting, the number of demonstrations, and prompt templates. We address this gap by establishing a theoretical analysis of ICL under mild assumptions that links these design choices to generalization behavior. We derive an upper bound on the ICL test loss, showing that performance is governed by (i) the quality of selected demonstrations, quantified by Lipschitz constants of the ICL loss along paths connecting test prompts to pretraining samples, (ii) an intrinsic ICL capability of the pretrained model, and (iii) the degree of distribution shift. Within the same framework, we analyze CoT prompting as inducing a task decomposition and show that it is beneficial when demonstrations are well chosen at each substep and the resulting subtasks are easier to learn. Finally, we characterize how ICL performance sensitivity to prompt templates varies with the number of demonstrations. Together, our study shows that pretraining equips the model with the ability to generalize beyond observed tasks, while CoT enables the model to compose simpler subtasks into more complex ones, and demonstrations and instructions enable it to retrieve similar or complex tasks, including those that can be composed into more complex ones, jointly supporting generalization to unseen tasks. All theoretical insights are corroborated by experiments.
[395] On Performance Guarantees for Federated Learning with Personalized Constraints
Mohammadjavad Ebrahimi, Daniel Burbano, Farzad Yousefian
Main category: cs.LG
TL;DR: PC-FedAvg: A federated optimization method for personalized constrained problems where each agent has private constraints, using cross-estimates to enable personalization without consensus or constraint sharing.
Details
Motivation: Standard FL formulations handle unconstrained or globally constrained problems, but practical settings often involve heterogeneous resource/model constraints with agent-specific feasible sets, requiring personalized constrained federated optimization.
Method: PC-FedAvg, in which each agent maintains cross-estimates of other agents’ variables through multi-block local decision vectors, updates all blocks locally while penalizing infeasibility only in its own block, enabling personalization without consensus or constraint sharing.
Result: Established communication-complexity rates of O(ε^{-2}) for suboptimality and O(ε^{-1}) for agent-wise infeasibility, with preliminary experiments on MNIST and CIFAR-10 datasets validating theoretical findings.
Conclusion: PC-FedAvg provides an effective solution for personalized constrained federated optimization problems with private agent constraints, achieving theoretical guarantees while maintaining privacy of constraint information.
Abstract: Federated learning (FL) has emerged as a communication-efficient algorithmic framework for distributed learning across multiple agents. While standard FL formulations capture unconstrained or globally constrained problems, many practical settings involve heterogeneous resource or model constraints, leading to optimization problems with agent-specific feasible sets. Here, we study a personalized constrained federated optimization problem in which each agent is associated with a convex local objective and a private constraint set. We propose PC-FedAvg, a method in which each agent maintains cross-estimates of the other agents’ variables through a multi-block local decision vector. Each agent updates all blocks locally, penalizing infeasibility only in its own block. Moreover, the cross-estimate mechanism enables personalization without requiring consensus or sharing constraint information among agents. We establish communication-complexity rates of $\mathcal{O}(\varepsilon^{-2})$ for suboptimality and $\mathcal{O}(\varepsilon^{-1})$ for agent-wise infeasibility. Preliminary experiments on the MNIST and CIFAR-10 datasets validate our theoretical findings.
[396] DeepStock: Reinforcement Learning with Policy Regularizations for Inventory Management
Yaqi Xie, Xinru Hao, Jiaxi Liu, Will Ma, Linwei Xin, Lei Cao, Yidong Zhang
Main category: cs.LG
TL;DR: DRL for inventory management benefits from policy regularizations based on classical inventory concepts like Base Stock, improving training stability and performance.
Details
Motivation: Off-the-shelf DRL implementations for inventory management are highly sensitive to hyperparameters, leading to mixed success in practical applications. The paper aims to improve DRL's reliability and performance by incorporating domain knowledge from classical inventory theory.
Method: The authors impose policy regularizations grounded in classical inventory concepts (e.g., Base Stock policies) to accelerate hyperparameter tuning and improve DRL performance. They validate their approach through deployment on Alibaba’s Tmall platform and extensive synthetic experiments.
Result: Policy regularizations significantly accelerate hyperparameter tuning and improve final performance of DRL methods. The approach was successfully deployed on Alibaba’s e-commerce platform and reshapes understanding of which DRL methods work best for inventory management.
Conclusion: Incorporating domain-specific knowledge through policy regularizations makes DRL more practical and effective for inventory management, addressing previous limitations of hyperparameter sensitivity.
Abstract: Deep Reinforcement Learning (DRL) provides a general-purpose methodology for training inventory policies that can leverage big data and compute. However, off-the-shelf implementations of DRL have seen mixed success, often plagued by high sensitivity to the hyperparameters used during training. In this paper, we show that by imposing policy regularizations, grounded in classical inventory concepts such as “Base Stock”, we can significantly accelerate hyperparameter tuning and improve the final performance of several DRL methods. We report details from a 100% deployment of DRL with policy regularizations on Alibaba’s e-commerce platform, Tmall. We also include extensive synthetic experiments, which show that policy regularizations reshape the narrative on what is the best DRL method for inventory management.
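The "Base Stock" structure that anchors the regularization is simple to state: order up to a target level S whenever the inventory position falls below it. A sketch of the classical policy (the paper's exact regularizer over the DRL policy is not given in the summary):

```python
def base_stock_order(S, inventory_position):
    """Classical base-stock (order-up-to) policy: order the shortfall
    between the target level S and the current inventory position,
    and order nothing when at or above the target. A DeepStock-style
    regularizer would penalize a DRL policy for straying far from
    such a structured baseline (our reading of the summary)."""
    return max(0, S - inventory_position)

order = base_stock_order(S=100, inventory_position=65)
```

Base-stock policies are provably optimal in simple settings, which is what makes them a natural anchor: the DRL policy is free to deviate where the classical assumptions break, but is pulled toward sane behavior elsewhere.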
[397] Continual Learning for Food Category Classification Dataset: Enhancing Model Adaptability and Performance
Piyush Kaushik Bhattacharyya, Devansh Tomar, Shubham Mishra, Divyanshu Rai, Yug Pratap Singh, Harsh Yadav, Krutika Verma, Vishal Meena, N Sangita Achary
Main category: cs.LG
TL;DR: A continual learning framework for text-guided food classification that enables incremental updates to add new food categories without retraining from scratch or degrading prior knowledge.
Details
Motivation: Conventional ML pipelines struggle with recognizing categories absent from original training sets, reducing accuracy since fixed datasets rarely capture full domain diversity. Need for adaptive systems that can learn new food categories over time.
Method: Proposes a continual learning framework for text-guided food classification that enables incremental updates. Unlike retraining from scratch, this method integrates new categories without degrading prior knowledge, allowing models to expand their recognition capabilities over time.
Result: The design shows promise for adaptive food recognition, enabling models trained on Western cuisines to later learn to classify dishes like dosa or kimchi. Further refinements are needed but the approach demonstrates viability for incremental learning.
Conclusion: The continual learning framework offers a promising approach for adaptive food recognition systems that can evolve over time, with applications in dietary monitoring and personalized nutrition planning.
Abstract: Conventional machine learning pipelines often struggle to recognize categories absent from the original training set. This gap typically reduces accuracy, as fixed datasets rarely capture the full diversity of a domain. To address this, we propose a continual learning framework for text-guided food classification. Unlike approaches that require retraining from scratch, our method enables incremental updates, allowing new categories to be integrated without degrading prior knowledge. For example, a model trained on Western cuisines could later learn to classify dishes such as dosa or kimchi. Although further refinements are needed, this design shows promise for adaptive food recognition, with applications in dietary monitoring and personalized nutrition planning.
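As an illustration of integrating new categories without touching prior knowledge (an analogy only, not the paper's text-guided model), consider a nearest-prototype classifier where each new category adds one prototype and leaves the existing ones untouched:

```python
import math

class PrototypeClassifier:
    """Nearest-prototype classifier: adding a category stores one new
    prototype and leaves existing prototypes unchanged, so earlier
    classes are not degraded by the update."""
    def __init__(self):
        self.prototypes = {}

    def add_category(self, name, examples):
        """Register a category as the mean of its example feature vectors."""
        dim = len(examples[0])
        self.prototypes[name] = [
            sum(e[i] for e in examples) / len(examples) for i in range(dim)
        ]

    def predict(self, x):
        """Return the category whose prototype is nearest to x."""
        return min(self.prototypes, key=lambda c: math.dist(x, self.prototypes[c]))

clf = PrototypeClassifier()
clf.add_category("pizza", [(1.0, 0.0), (0.9, 0.1)])
clf.add_category("dosa", [(0.0, 1.0), (0.1, 0.9)])   # added later, no retraining
```

In a real continual-learning system the feature extractor itself may drift, which is the hard part; this sketch only shows why an additive class representation avoids catastrophic forgetting at the classifier head.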
[398] Alternating Diffusion for Proximal Sampling with Zeroth Order Queries
Hirohane Takagi, Atsushi Nitanda
Main category: cs.LG
TL;DR: A new proximal sampler using only zeroth-order information of potential function, treating intermediate distributions as Gaussian mixtures for Monte Carlo score estimation, with exponential convergence under isoperimetric conditions.
Details
Motivation: To develop a sampling method that avoids rejection sampling and learned score models, operates with deterministic runtime, and leverages parallel computation while maintaining theoretical guarantees.
Method: Proximal sampling with alternating forward/backward heat flow iterations, treating intermediate distributions as Gaussian mixtures to enable Monte Carlo score estimation directly from samplable distributions without rejection sampling.
Result: The method achieves exponential convergence under isoperimetric conditions, avoids rejection sampling, permits flexible step sizes, runs with deterministic runtime, and demonstrates rapid convergence in numerical experiments through particle interactions and parallel computation.
Conclusion: The proposed zeroth-order proximal sampler provides an efficient alternative to diffusion-based methods, with strong theoretical guarantees and practical advantages including deterministic runtime and parallelization capabilities.
Abstract: This work introduces a new approximate proximal sampler that operates solely with zeroth-order information of the potential function. Prior theoretical analyses have revealed that proximal sampling corresponds to alternating forward and backward iterations of the heat flow. The backward step was originally implemented by rejection sampling, whereas we directly simulate the dynamics. Unlike diffusion-based sampling methods that estimate scores via learned models or by invoking auxiliary samplers, our method treats the intermediate particle distribution as a Gaussian mixture, thereby yielding a Monte Carlo score estimator from directly samplable distributions. Theoretically, when the score estimation error is sufficiently controlled, our method inherits the exponential convergence of proximal sampling under isoperimetric conditions on the target distribution. In practice, the algorithm avoids rejection sampling, permits flexible step sizes, and runs with a deterministic runtime budget. Numerical experiments demonstrate that our approach converges rapidly to the target distribution, driven by interactions among multiple particles and by exploiting parallel computation.
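The central trick, treating the intermediate particle distribution as a Gaussian mixture so that the score has a closed form, is easy to sketch. Assuming a uniform mixture with one isotropic component of width sigma per particle (a simplification; the paper's exact backward-step dynamics are not reproduced here), the score at a point y is a softmax-weighted average of (x_i - y)/sigma^2:

```python
import numpy as np

def mixture_score(y, particles, sigma):
    """Score of a uniform Gaussian mixture (one component per particle)
    at y: grad log p(y) = sum_i w_i(y) * (x_i - y) / sigma^2, where the
    responsibilities w_i are a numerically stable softmax."""
    diffs = particles - y                        # (N, d)
    logits = -np.sum(diffs**2, axis=1) / (2 * sigma**2)
    logits -= logits.max()                       # stability shift
    w = np.exp(logits)
    w /= w.sum()
    return (w[:, None] * diffs).sum(axis=0) / sigma**2

rng = np.random.default_rng(0)
particles = rng.normal(size=(500, 2))            # a samplable particle cloud
s = mixture_score(np.zeros(2), particles, sigma=0.5)
```

A quick sanity check: with a single particle at x, the estimator reduces to the exact Gaussian score (x - y)/sigma^2, which is why no rejection step or learned score model is needed.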
[399] RiboSphere: Learning Unified and Efficient Representations of RNA Structures
Zhou Zhang, Hanqun Cao, Cheng Tan, Fang Wu, Pheng Ann Heng, Tianfan Fu
Main category: cs.LG
TL;DR: RiboSphere learns discrete geometric representations of RNA structures using vector quantization and flow matching, capturing modular RNA motifs for structure generation and downstream tasks.
Details
Motivation: RNA structure modeling is challenging due to flexible backbones, prevalent non-canonical interactions, and scarcity of experimental 3D structures. The paper aims to develop a framework that captures the modular organization of RNA architecture where complex folds are composed from recurring structural motifs.
Method: RiboSphere combines vector quantization with flow matching. It uses a geometric transformer encoder to produce SE(3)-invariant features, which are discretized with finite scalar quantization (FSQ) into a finite vocabulary of latent codes. Conditioned on these discrete codes, a flow-matching decoder reconstructs atomic coordinates for high-fidelity structure generation.
Result: The model achieves strong performance in structure reconstruction (RMSD 1.25Å, TM-score 0.84). Learned code indices are enriched for specific RNA motifs, indicating the model captures motif-level compositional structure. The pretrained discrete representations transfer effectively to inverse folding and RNA-ligand binding prediction with robust generalization in data-scarce regimes.
Conclusion: RiboSphere provides an effective framework for learning discrete geometric representations of RNA that capture modular structural motifs, enabling accurate structure generation and transfer to various downstream tasks in RNA structural biology.
Abstract: Accurate RNA structure modeling remains difficult because RNA backbones are highly flexible, non-canonical interactions are prevalent, and experimentally determined 3D structures are comparatively scarce. We introduce RiboSphere, a framework that learns discrete geometric representations of RNA by combining vector quantization with flow matching. Our design is motivated by the modular organization of RNA architecture: complex folds are composed from recurring structural motifs. RiboSphere uses a geometric transformer encoder to produce SE(3)-invariant (rotation/translation-invariant) features, which are discretized with finite scalar quantization (FSQ) into a finite vocabulary of latent codes. Conditioned on these discrete codes, a flow-matching decoder reconstructs atomic coordinates, enabling high-fidelity structure generation. We find that the learned code indices are enriched for specific RNA motifs, suggesting that the model captures motif-level compositional structure rather than acting as a purely compressive bottleneck. Across benchmarks, RiboSphere achieves strong performance in structure reconstruction (RMSD 1.25 Å, TM-score 0.84), and its pretrained discrete representations transfer effectively to inverse folding and RNA–ligand binding prediction, with robust generalization in data-scarce regimes.
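Finite scalar quantization itself is a simple, well-known mechanism: bound each latent dimension, round it to a few uniform levels, and read the tuple of per-dimension level indices as one code word. A minimal sketch follows (the level counts are illustrative, not RiboSphere's configuration):

```python
import numpy as np

def fsq_quantize(z, levels):
    """Finite scalar quantization: squash each latent dim with tanh,
    round to one of `levels[d]` uniform values, and return both the
    quantized vector and a single integer code index (mixed radix)."""
    z = np.tanh(np.asarray(z, dtype=float))          # bound to (-1, 1)
    L = np.asarray(levels)
    q = np.round((z + 1) / 2 * (L - 1)).astype(int)  # per-dim level index
    z_q = 2 * q / (L - 1) - 1                        # snap back to the grid
    index = 0
    for qi, Li in zip(q, L):                         # mixed-radix code word
        index = index * int(Li) + int(qi)
    return z_q, index

zq, code = fsq_quantize([0.3, -1.2, 0.0], levels=[8, 8, 8])
```

With three dimensions of 8 levels each, the implicit codebook has 8^3 = 512 entries, yet no codebook is ever stored or learned, which is the practical appeal of FSQ over classic VQ.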
[400] Heavy-Tailed and Long-Range Dependent Noise in Stochastic Approximation: A Finite-Time Analysis
Siddharth Chandak, Anuj Yadav, Ayfer Ozgur, Nicholas Bambos
Main category: cs.LG
TL;DR: SA with heavy-tailed and long-range dependent noise: first finite-time moment bounds for strongly monotone operators, applied to SGD and gradient play.
Details
Motivation: Classical SA analyses assume bounded second moments or Markov noise, but many real-world applications (finance, communications) involve heavy-tailed and long-range dependent noise, requiring new theoretical frameworks.
Method: Develops SA analysis for strongly monotone operators under heavy-tailed and LRD noise using noise-averaging arguments that regularize noise impact without modifying iterations.
Result: Establishes first finite-time moment bounds with explicit convergence rates quantifying heavy-tail and temporal dependence effects, validated through numerical experiments.
Conclusion: Provides general framework for SA under non-classical noise, applicable to SGD and gradient play, with practical relevance for real-world heavy-tailed and dependent noise scenarios.
Abstract: Stochastic approximation (SA) is a fundamental iterative framework with broad applications in reinforcement learning and optimization. Classical analyses typically rely on martingale difference or Markov noise with bounded second moments, but many practical settings, including finance and communications, frequently encounter heavy-tailed and long-range dependent (LRD) noise. In this work, we study SA for finding the root of a strongly monotone operator under these non-classical noise models. We establish the first finite-time moment bounds in both settings, providing explicit convergence rates that quantify the impact of heavy tails and temporal dependence. Our analysis employs a noise-averaging argument that regularizes the impact of noise without modifying the iteration. Finally, we apply our general framework to stochastic gradient descent (SGD) and gradient play, and corroborate our finite-time analysis through numerical experiments.
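To make "heavy-tailed noise" concrete, the toy run below drives the basic SA iteration x_{k+1} = x_k - a_k (F(x_k) + noise) with Student-t noise (finite variance but heavy polynomial tails for this degrees-of-freedom choice) on the simplest strongly monotone operator F(x) = x, whose root is 0. This only illustrates the setting; the paper's contribution is the finite-time analysis, which, as the abstract notes, does not modify the iteration:

```python
import numpy as np

def sa_root_find(steps=20000, nu=2.5, seed=0):
    """Stochastic approximation x_{k+1} = x_k - a_k * (F(x_k) + noise)
    for the strongly monotone operator F(x) = x (root at 0), driven by
    heavy-tailed Student-t noise with nu = 2.5 degrees of freedom."""
    rng = np.random.default_rng(seed)
    x = 5.0
    for k in range(1, steps + 1):
        a_k = 1.0 / k                  # classical diminishing step size
        noise = rng.standard_t(nu)
        x -= a_k * (x + noise)
    return x

x_final = sa_root_find()               # close to the root despite heavy tails
```

With a_k = 1/k the iterate behaves like a running average of the noise, so occasional extreme draws are diluted; quantifying exactly how much dilution heavy tails and temporal dependence permit is what the paper's finite-time bounds address.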
[401] Ensembles-based Feature Guided Analysis
Federico Formica, Stefano Gregis, Andrea Rota, Aurora Francesca Zanenga, Mark Lawford, Claudio Menghi
Main category: cs.LG
TL;DR: EFGA improves neural network interpretability by combining feature-guided analysis rules into ensembles to increase recall while maintaining high precision.
Details
Motivation: Existing feature-guided analysis (FGA) techniques for explaining DNN behavior have good precision but limited recall. Need methods that can provide explanations applicable to more situations while maintaining accuracy.
Method: Proposes Ensembles-based Feature Guided Analysis (EFGA) that combines rules extracted by FGA into ensembles using aggregation criteria/policies. Three different aggregation criteria are explored to combine rules for broader applicability.
Result: EFGA significantly improves recall (+28.51% on MNIST, +33.15% on LSC for train; +25.76% on MNIST, +30.81% on LSC for test) with minimal precision reduction (-0.89% on MNIST, -0.69% on LSC). Different aggregation criteria offer trade-offs between precision and recall.
Conclusion: EFGA effectively addresses the recall limitation of FGA by using rule ensembles, providing more comprehensive explanations of DNN behavior with minimal precision trade-off.
Abstract: Recent Deep Neural Network (DNN) applications call for techniques that can explain their behavior. Existing solutions, such as Feature Guided Analysis (FGA), extract rules on their internal behaviors, e.g., by providing explanations related to neuron activations. Results from the literature show that these rules have considerable precision (i.e., they correctly predict certain classes of features), but their recall (i.e., the number of situations these rules apply to) is more limited. To mitigate this problem, this paper presents Ensembles-based Feature Guided Analysis (EFGA). EFGA combines rules extracted by FGA into ensembles. Ensembles aggregate different rules to increase their applicability depending on an aggregation criterion, a policy that dictates how to combine rules into ensembles. Although our solution is extensible, and different aggregation criteria can be developed by users, in this work, we considered three different aggregation criteria. We evaluated how the choice of the criterion influences the effectiveness of EFGA on two benchmarks (i.e., the MNIST and LSC datasets), and found that different aggregation criteria offer alternative trade-offs between precision and recall. We then compared EFGA with FGA. For this experiment, we selected an aggregation criterion that provides a reasonable trade-off between precision and recall. Our results show that EFGA has higher train recall (+28.51% on MNIST, +33.15% on LSC) and test recall (+25.76% on MNIST, +30.81% on LSC) than FGA, with a negligible reduction in test precision (-0.89% on MNIST, -0.69% on LSC).
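The recall mechanics of rule ensembles can be seen with two toy rules. Under an OR-style aggregation criterion (one plausible choice for illustration; the summary does not spell out EFGA's three criteria), the ensemble fires whenever any member rule fires, so recall can only go up:

```python
def rule_a(x):
    """Fires only on a narrow region: high precision, low recall."""
    return 0.0 <= x < 0.3

def rule_b(x):
    """A second narrow rule covering a different region."""
    return 0.6 <= x < 0.9

def ensemble_or(rules):
    """OR aggregation criterion: the ensemble fires if ANY member rule
    fires, widening applicability (recall) at a possible precision cost."""
    return lambda x: any(r(x) for r in rules)

positives = [0.1, 0.2, 0.7, 0.8]          # toy ground-truth positives
recall_a = sum(rule_a(x) for x in positives) / len(positives)   # 0.5
combined = ensemble_or([rule_a, rule_b])
recall_ens = sum(combined(x) for x in positives) / len(positives)  # 1.0
```

An AND-style criterion would trade in the opposite direction (higher precision, lower recall), which matches the paper's finding that different criteria occupy different points on the precision/recall trade-off.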
[402] The Residual Stream Is All You Need: On the Redundancy of the KV Cache in Transformer Inference
Kaleem Ullah Qasim, Jiashu Zhang, Muhammad Kafeel Shaheen, Razan Alharith, Heying Zhang
Main category: cs.LG
TL;DR: KV cache in transformers is redundant - keys and values can be perfectly recomputed from residual stream vectors, enabling memory-efficient inference with KV-Direct.
Details
Motivation: Current transformer inference relies heavily on KV cache storage, which consumes significant memory and has led to extensive research on compression/eviction techniques. The authors question whether this cache is fundamentally necessary or if it represents redundant state.
Method: Prove mathematically that KV cache is redundant by showing keys/values are deterministic projections of residual stream. Verify across 6 models (135M-4B parameters) using residual patching experiments. Develop KV-Direct: bounded-memory inference that checkpoints residual vectors (5KB/token) instead of full KV pairs (136KB), recomputing keys/values on demand.
Result: KV-Direct maintains 100% token match at all cache budgets while baselines degrade to 5-28%. Over 20 conversation turns, holds peak memory at 42MB vs standard cache >103MB. Recomputation runs up to 5x faster than reading cached tensors at moderate batch sizes.
Conclusion: KV cache is entirely redundant state; residual stream satisfies Markov property and is the sole information-carrying state. KV-Direct enables memory-efficient transformer inference with perfect reconstruction accuracy.
Abstract: The key-value (KV) cache is widely treated as essential state in transformer inference, and a large body of work engineers policies to compress, evict, or approximate its entries. We prove that this state is entirely redundant: keys and values at every layer are deterministic projections of the residual stream, and recomputing them from a single residual vector per token incurs exactly zero reconstruction error, not approximately, but bit-identically. We verify this across six models from four architecture families (135M to 4B parameters). Cross-task residual patching at every layer produces D_KL = 0 between patched and original output distributions, confirming that the residual stream satisfies a Markov property and is the sole information-carrying state. Removing the cache entirely and recomputing from scratch yields token-identical output under greedy decoding on all models tested. We build on this result with KV-Direct, a bounded-memory inference scheme that checkpoints residual vectors (5 KB per token on Gemma 3-4B) instead of full KV pairs (136 KB), recomputing keys and values on demand. Over 20 conversation turns, KV-Direct holds peak memory at 42 MB while the standard cache grows past 103 MB. Against five eviction baselines (H2O, StreamingLLM, SnapKV, TOVA, window-only), KV-Direct maintains 100% token match at every cache budget; all baselines degrade to 5-28%. A per-operation latency analysis shows recomputation runs up to 5x faster than reading cached tensors at moderate batch sizes. Code is available at https://github.com/Kaleemullahqasim/KV-Direct.
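The core claim, that keys and values are deterministic projections of the residual stream, can be illustrated in a few lines of NumPy. This is a single-layer sketch with an RMSNorm-style projection; real models add per-layer weights, attention heads, and position encodings such as RoPE, all of which are likewise deterministic functions of the residual and the token position:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_tok = 16, 4
W_k = rng.normal(size=(d, d))            # key projection (frozen weights)
W_v = rng.normal(size=(d, d))            # value projection

def rms_norm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)

resid = rng.normal(size=(n_tok, d))      # one checkpointed residual per token

# Standard path: materialize and cache K/V at prefill time.
k_cached = rms_norm(resid) @ W_k.T
v_cached = rms_norm(resid) @ W_v.T

# KV-Direct-style path: keep only `resid`, recompute K/V on demand.
k_recomputed = rms_norm(resid) @ W_k.T
v_recomputed = rms_norm(resid) @ W_v.T

assert np.array_equal(k_cached, k_recomputed)   # bit-identical, not approximate
assert np.array_equal(v_cached, v_recomputed)
```

Because the same deterministic ops on the same inputs reproduce the same bits, the cache stores no information the residual does not already carry; the trade is memory (one d-vector per token instead of K and V at every layer) against the FLOPs of recomputation.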
[403] Scale-Dependent Radial Geometry and Metric Mismatch in Wasserstein Propagation for Reverse Diffusion
Zicheng Lyu, Zengfeng Huang
Main category: cs.LG
TL;DR: Theoretical analysis of reverse diffusion sampling error propagation using radial contraction metrics and a one-switch routing argument to obtain explicit Wasserstein distance guarantees.
Details
Motivation: Existing analyses propagate sampling error in Euclidean geometry along entire reverse trajectories, but Gaussian smoothing creates contraction first at large separations while short separations remain non-dissipative, creating a metric mismatch between early contraction geometry and terminal error measurement geometry.
Method: Formalize metric mismatch through explicit radial lower profile for learned reverse drift; use one-switch routing argument: reflection coupling yields contraction in concave transport metric adapted to radial profile before switch, then convert to Wasserstein distance and propagate in Euclidean geometry after switch.
Result: Obtain explicit non-asymptotic end-to-end Wasserstein distance guarantees for discretizations of learned reverse SDE under L² score-error control, one-sided Lipschitz condition of score error, and standard well-posedness and coupling hypotheses.
Conclusion: The radial contraction approach with one-switch routing provides improved theoretical guarantees for reverse diffusion sampling by addressing the metric mismatch between early contraction geometry and terminal error measurement.
Abstract: Existing analyses of reverse diffusion often propagate sampling error in the Euclidean geometry underlying $W_2$ along the entire reverse trajectory. Under weak log-concavity, however, Gaussian smoothing can create contraction first at large separations while short separations remain non-dissipative. The first usable contraction is therefore radial rather than Euclidean, creating a metric mismatch between the geometry that contracts early and the geometry in which the terminal error is measured. We formalize this mismatch through an explicit radial lower profile for the learned reverse drift. Its far-field limit gives a contraction reserve, its near-field limit gives the Euclidean load governing direct $W_2$ propagation, and admissible switch times are characterized by positivity of the reserve on the remaining smoothing window. We exploit this structure with a one-switch routing argument. Before the switch, reflection coupling yields contraction in a concave transport metric adapted to the radial profile. At the switch, we convert once from this metric back to $W_2$ under a $p$-moment budget, and then propagate the converted discrepancy over the remaining short window in Euclidean geometry. For discretizations of the learned reverse SDE under $L^2$ score-error control, a one-sided Lipschitz condition of score error, and standard well-posedness and coupling hypotheses, we obtain explicit non-asymptotic end-to-end $W_2$ guarantees, a scalar switch-selection objective, and a sharp structural limit on the conversion exponent within the affine-tail concave class.
[404] Ontology-Based Knowledge Modeling and Uncertainty-Aware Outdoor Air Quality Assessment Using Weighted Interval Type-2 Fuzzy Logic
Md Inzmam, Ritesh Chandra, Sadhana Tiwari, Sonali Agarwal, Triloki Pant
Main category: cs.LG
TL;DR: Proposes a hybrid ontology-based uncertainty-aware framework using Weighted Interval Type-2 Fuzzy Logic with semantic knowledge modeling for improved air quality index classification and decision support.
Details
Motivation: Traditional AQI calculation uses crisp thresholds and deterministic rules that are inadequate for handling uncertainty and transitions between air quality classes, especially near boundaries.
Method: Hybrid framework integrating Weighted Interval Type-2 Fuzzy Logic with semantic knowledge modeling. Uses Interval Type-2 fuzzy sets for uncertainty near boundaries, IT2-FAHP for pollutant weighting, and OWL-based ontology extending SSN with SWRL rules and SPARQL queries.
Result: Experimental evaluation with CPCB datasets shows improved AQI classification reliability and uncertainty handling compared to traditional crisp and Type-1 fuzzy approaches, enabling explainable semantic reasoning.
Conclusion: The proposed framework effectively addresses uncertainty in AQI classification while providing intelligent decision support through semantic reasoning for air quality monitoring systems.
Abstract: Outdoor air pollution is a major concern for the environment and public health, especially in areas where urbanization is taking place rapidly. The Indian Air Quality Index (IND-AQI), developed by the Central Pollution Control Board (CPCB), is a standardized reporting system for air quality based on pollutants such as PM2.5, PM10, nitrogen dioxide (NO2), sulfur dioxide (SO2), ozone (O3), carbon monoxide (CO), and ammonia (NH3). However, the traditional calculation of the AQI uses crisp thresholds and deterministic aggregation rules, which are not suitable for handling uncertainty and transitions between classes. To address these limitations, this study proposes a hybrid ontology-based uncertainty-aware framework integrating Weighted Interval Type-2 Fuzzy Logic with semantic knowledge modeling. Interval Type-2 fuzzy sets are used to model uncertainty near AQI class boundaries, while pollutant importance weights are determined using the Interval Type-2 Fuzzy Analytic Hierarchy Process (IT2-FAHP) to reflect their relative health impacts. In addition, an OWL-based air quality ontology extending the Semantic Sensor Network (SSN) ontology is developed to represent pollutants, monitoring stations, AQI categories, regulatory standards, and environmental governance actions. Semantic reasoning is implemented using SWRL rules and validated through SPARQL queries to infer AQI categories, health risks, and recommended mitigation actions. Experimental evaluation using CPCB air quality datasets demonstrates that the proposed framework improves AQI classification reliability and uncertainty handling compared with traditional crisp and Type-1 fuzzy approaches, while enabling explainable semantic reasoning and intelligent decision support for air quality monitoring systems.
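An interval type-2 fuzzy set assigns each input a membership interval rather than a single degree: the gap between its lower and upper membership functions (the footprint of uncertainty) is widest near class boundaries, which is exactly where crisp AQI thresholds are least trustworthy. The sketch below uses hypothetical triangular breakpoints, not CPCB's actual AQI breakpoints:

```python
def tri(x, a, b, c):
    """Triangular membership function peaking at b on support [a, c]."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def it2_membership(x, a, b, c, fou=10.0):
    """Interval type-2 membership for a hypothetical AQI class: the
    upper/lower MFs are the same triangle widened/narrowed by `fou`
    units, so membership is an interval, widest near class boundaries."""
    upper = tri(x, a - fou, b, c + fou)
    lower = tri(x, a + fou, b, c - fou)
    return lower, upper

lo, hi = it2_membership(95.0, a=51, b=75, c=100)   # reading near a boundary
```

At the class center (x = 75) the interval collapses to [1, 1], while near the boundary (x = 95) it spreads to roughly [0, 0.43], making the classification uncertainty explicit instead of hiding it behind a crisp cutoff.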
[405] Regret Analysis of Sleeping Competing Bandits
Shinnosuke Uba, Yutaro Yamaguchi
Main category: cs.LG
TL;DR: Sleeping Competing Bandits extends multi-armed bandits with stable matching to handle time-varying availability of players and arms, achieving asymptotically optimal regret bounds.
Details
Motivation: Real-world applications of competing bandits often involve players and arms with varying availability over time, which existing models don't account for. The paper aims to address this limitation by introducing a sleeping bandits framework for competing scenarios.
Method: Proposes Sleeping Competing Bandits model with extended regret definition, develops an algorithm that achieves O(NK log T_i/Δ²) regret bound under reasonable assumptions, and provides matching lower bound Ω(N(K-N+1) log T_i/Δ²).
Result: The algorithm achieves asymptotic optimality when number of arms K is relatively larger than number of players N. The regret bounds show the algorithm is asymptotically optimal in this regime.
Conclusion: Sleeping Competing Bandits successfully extends competing bandits to handle time-varying availability, with provable optimal regret bounds in relevant parameter regimes.
Abstract: The Competing Bandits framework is a recently emerging area that integrates multi-armed bandits in online learning with stable matching in game theory. While conventional models assume that all players and arms are constantly available, in real-world problems, their availability can vary arbitrarily over time. In this paper, we formulate this setting as Sleeping Competing Bandits. To analyze this problem, we naturally extend the regret definition used in existing competing bandits and derive regret bounds for the proposed model. We propose an algorithm that simultaneously achieves an asymptotic regret bound of $\mathrm{O}\left(NK\log T_i/\Delta^2\right)$ under reasonable assumptions, where $N$ is the number of players, $K$ is the number of arms, $T_i$ is the number of rounds of each player $p_i$, and $\Delta$ is the minimum reward gap. We also provide a regret lower bound of $\Omega\left(N(K-N+1)\log T_i/\Delta^2\right)$ under the same assumptions. This implies that our algorithm is asymptotically optimal in the regime where the number of arms $K$ is relatively larger than the number of players $N$.
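The availability mechanism is easiest to see in a single-player sleeping-bandit toy; the paper's setting additionally involves competing players and stable matching, which this sketch deliberately omits. Run UCB as usual, but restrict each round's argmax to the arms that are currently awake:

```python
import math
import random

def sleeping_ucb(mu, availability, horizon, seed=0):
    """UCB restricted each round to the currently available arms: pick
    argmax of empirical mean + sqrt(2 ln t / n) over the awake set only."""
    rng = random.Random(seed)
    K = len(mu)
    counts, sums = [0] * K, [0.0] * K
    total = 0.0
    for t in range(1, horizon + 1):
        awake = availability(t)
        def ucb(a):
            if counts[a] == 0:
                return float("inf")       # force one pull when first awake
            return sums[a] / counts[a] + math.sqrt(2 * math.log(t) / counts[a])
        arm = max(awake, key=ucb)
        r = 1.0 if rng.random() < mu[arm] else 0.0
        counts[arm] += 1
        sums[arm] += r
        total += r
    return total

# Arm 2 is best but only awake on even rounds; arm 1 is best on odd rounds.
reward = sleeping_ucb(
    mu=[0.2, 0.5, 0.9],
    availability=lambda t: [0, 1, 2] if t % 2 == 0 else [0, 1],
    horizon=2000,
)
```

Note that regret here must be measured against the best *available* arm per round, which is exactly the kind of extended regret definition the paper formalizes for the matching-market case.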
[406] Learning from Similarity/Dissimilarity and Pairwise Comparison
Tomoya Tate, Kosuke Sugiyama, Masato Uchida
Main category: cs.LG
TL;DR: Proposes SD-Pcomp classification, a binary judgment-based weakly supervised learning framework that uses only relative judgments (class agreement and pairwise preference) instead of probabilistic supervision, improving robustness to label noise and uncertainty.
Details
Motivation: Existing SconfConfDiff framework requires continuous probabilistic supervision which involves subjective uncertainty quantification, leading to unstable supervision. Need for more robust approach using only relative binary judgments.
Method: Uses Similarity/Dissimilarity (SD) labels and Pairwise Comparison (Pcomp) labels as weak supervision. Develops two unbiased risk estimators: (1) convex combination of SD and Pcomp, and (2) unified estimator integrating both labels by modeling their relationship.
Result: The proposed approach improves classification performance over methods using single weak labels, and demonstrates robustness to label noise and uncertainty in class prior estimation.
Conclusion: SD-Pcomp classification provides a more robust weakly supervised learning framework that avoids subjective probabilistic labeling by relying only on relative binary judgments, with theoretical guarantees and practical benefits.
Abstract: This paper addresses binary classification in scenarios where obtaining explicit instance-level labels is impractical, by exploiting multiple weak labels defined on instance pairs. The existing SconfConfDiff classification framework relies on continuous-valued probabilistic supervision, including similarity-confidence, the probability of class agreement, and confidence-difference, the difference in positive class probabilities. However, probabilistic labeling requires subjective uncertainty quantification, often leading to unstable supervision. We propose SD-Pcomp classification, a binary-judgment-based weakly supervised learning framework that relies only on relative judgments, namely class agreement between two instances and pairwise preference toward the positive class. The method employs Similarity/Dissimilarity (SD) labels and Pairwise Comparison (Pcomp) labels, and develops two unbiased risk estimators: (i) a convex combination of SD and Pcomp, and (ii) a unified estimator that integrates both labels by modeling their relationship. Theoretical analysis and experimental results show that the proposed approach improves classification performance over methods using a single weak label, and is robust to label noise and uncertainty in class prior estimation.
[407] FedRG: Unleashing the Representation Geometry for Federated Learning with Noisy Clients
Tian Wen, Zhiqin Yang, Yonggang Zhang, Xuefeng Jiang, Hao Peng, Yuwei Wang, Bo Han
Main category: cs.LG
TL;DR: FedRG improves federated learning with noisy labels by using representation geometry and self-supervision to identify noisy samples, outperforming existing methods in heterogeneous settings.
Details
Motivation: Federated learning suffers from performance degradation due to noisy annotations in distributed scenarios. Existing approaches relying on scalar loss values are unreliable for FL under heterogeneous settings, motivating a representation-based approach.
Method: FedRG uses self-supervision to create label-agnostic spherical representations, fits a spherical von Mises-Fisher mixture model to capture semantic clusters, integrates geometric evidence with semantic-label soft mapping to identify noisy samples, and employs a personalized noise absorption matrix for robust optimization.
Result: Extensive experiments show FedRG significantly outperforms state-of-the-art methods for federated learning with data heterogeneity under diverse noisy client scenarios.
Conclusion: The representation geometry priority principle effectively addresses noisy label problems in federated learning, with FedRG demonstrating superior performance through geometric representation analysis and noise handling mechanisms.
Abstract: Federated learning (FL) suffers from performance degradation due to the inevitable presence of noisy annotations in distributed scenarios. Existing approaches have advanced in distinguishing noisy samples from the dataset for label correction by leveraging loss values. However, noisy sample recognition relying on scalar loss lacks reliability for FL under heterogeneous scenarios. In this paper, we rethink this paradigm from a representation perspective and propose FedRG (Federated under Representation Geometry), which follows the principle of "representation geometry priority" to recognize noisy labels. Firstly, FedRG creates label-agnostic spherical representations by using self-supervision. It then iteratively fits a spherical von Mises-Fisher (vMF) mixture model to this geometry using previously identified clean samples to capture semantic clusters. This geometric evidence is integrated with a semantic-label soft mapping mechanism to derive a distribution divergence between the label-free and annotated label-conditioned feature space, which robustly identifies noisy samples and updates the vMF mixture model with the newly separated clean dataset. Lastly, we employ an additional personalized noise absorption matrix on noisy labels to achieve robust optimization. Extensive experimental results demonstrate that FedRG significantly outperforms state-of-the-art methods for FL with data heterogeneity under diverse noisy-client scenarios.
[408] FedPDPO: Federated Personalized Direct Preference Optimization for Large Language Model Alignment
Kewen Zhu, Liping Yi, Zhiming Zhao, Zhuang Qi, Han Yu, Qinghua Hu
Main category: cs.LG
TL;DR: FedPDPO: A personalized federated learning framework for aligning LLMs with human preferences using parameter-efficient fine-tuning with LoRA adapters and personalized DPO training to handle non-IID data.
Details
Motivation: Aligning LLMs with human preferences in federated learning is challenging due to decentralized, privacy-sensitive, and highly non-IID preference data. Standard DPO suffers from performance degradation under non-IID conditions and limited generalization of implicit rewards.
Method: Proposes FedPDPO with: 1) Parameter-efficient architecture using frozen LLM backbone with LoRA adapters, 2) Globally shared LoRA adapter with personalized client-specific LLM head, 3) Personalized DPO training with client-specific explicit reward head, 4) Bottleneck adapter to balance global and local features.
Result: Extensive experiments on multiple preference datasets show state-of-the-art performance with up to 4.80% average accuracy improvements in federated intra-domain and cross-domain settings.
Conclusion: FedPDPO effectively addresses non-IID heterogeneity in federated preference alignment of LLMs through personalized federated optimization with theoretical soundness and practical efficiency.
Abstract: Aligning large language models (LLMs) with human preferences in federated learning (FL) is challenging due to decentralized, privacy-sensitive, and highly non-IID preference data. Direct Preference Optimization (DPO) offers an efficient alternative to reinforcement learning with human feedback (RLHF), but its direct application in FL suffers from severe performance degradation under non-IID data and limited generalization of implicit rewards. To bridge this gap, we propose FedPDPO (Federated Personalized Direct Preference Optimization), a personalized federated framework for preference alignment of LLMs. It adopts a parameter-efficient fine-tuning architecture where each client maintains a frozen pretrained LLM backbone augmented with a Low-Rank Adaptation (LoRA) adapter, enabling communication-efficient aggregation. To address non-IID heterogeneity, we devise (1) the globally shared LoRA adapter with the personalized client-specific LLM head. Moreover, we introduce (2) a personalized DPO training strategy with a client-specific explicit reward head to complement implicit rewards and further alleviate non-IID heterogeneity, and (3) a bottleneck adapter to balance global and local features. We provide theoretical analysis establishing the probabilistic foundation and soundness. Extensive experiments on multiple preference datasets demonstrate state-of-the-art performance, achieving up to 4.80% average accuracy improvements in federated intra-domain and cross-domain settings.
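FedPDPO builds on the standard DPO objective, which is worth writing out: for one preference pair, the loss is the negative log-sigmoid of the beta-scaled gap between the policy's and the reference model's log-probability ratios of the chosen over the rejected response. The federated pieces (LoRA aggregation, the client-specific explicit reward head) are not reproduced here:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one preference pair: -log sigmoid of the
    beta-scaled margin between policy and reference log-ratios, where
    logp_w / logp_l are log-probs of the chosen / rejected response."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A policy that has moved toward the chosen response relative to the
# reference achieves lower loss than one identical to the reference.
moved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
unmoved = dpo_loss(-10.0, -12.0, -10.0, -12.0)   # margin 0 -> loss ln 2
```

Because the implicit reward is just the beta-scaled log-ratio, it generalizes only as far as the policy's likelihoods do; that limitation under non-IID client data is what motivates the paper's explicit reward head.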
[409] Dual Path Attribution: Efficient Attribution for SwiGLU-Transformers through Layer-Wise Target Propagation
Lasse Marten Jantsch, Dong-Jae Koh, Seonghyeon Lee, Young-Kyoon Suh
Main category: cs.LG
TL;DR: DPA is a novel attribution framework that efficiently traces information flow in transformer LLMs with O(1) complexity, achieving state-of-the-art faithfulness without counterfactual examples.
Details
Motivation: Understanding internal mechanisms of transformer-based LLMs is crucial for reliable deployment, but dense component attribution remains prohibitively expensive despite recent attribution methods attempting to balance faithfulness and computational efficiency.
Method: DPA analytically decomposes and linearizes SwiGLU Transformers into distinct pathways, propagating a targeted unembedding vector to receive effective representations at each residual position in one forward and one backward pass without counterfactual examples.
Result: DPA achieves O(1) time complexity with respect to number of model components, scales to long input sequences and dense component attribution, and demonstrates state-of-the-art faithfulness and unprecedented efficiency on standard interpretability benchmarks.
Conclusion: DPA provides an efficient and faithful attribution framework for understanding transformer LLMs, enabling practical analysis of internal mechanisms with computational efficiency.
Abstract: Understanding the internal mechanisms of transformer-based large language models (LLMs) is crucial for their reliable deployment and effective operation. While recent efforts have yielded a plethora of attribution methods attempting to balance faithfulness and computational efficiency, dense component attribution remains prohibitively expensive. In this work, we introduce Dual Path Attribution (DPA), a novel framework that faithfully traces information flow on the frozen transformer in one forward and one backward pass without requiring counterfactual examples. DPA analytically decomposes and linearizes the computational structure of the SwiGLU Transformers into distinct pathways along which it propagates a targeted unembedding vector to receive the effective representation at each residual position. This target-centric propagation achieves O(1) time complexity with respect to the number of model components, scaling to long input sequences and dense component attribution. Extensive experiments on standard interpretability benchmarks demonstrate that DPA achieves state-of-the-art faithfulness and unprecedented efficiency compared to existing baselines.
[410] Scalable Learning of Multivariate Distributions via Coresets
Zeyu Ding, Katja Ickstadt, Nadja Klein, Alexander Munteanu, Simon Omlor
Main category: cs.LG
TL;DR: Novel coreset construction for multivariate conditional transformation models (MCTMs) to enhance scalability and training efficiency for large-scale data, using importance sampling to maintain statistical accuracy while substantially reducing computational burden.
Details
Motivation: Existing non-parametric and semi-parametric regression and density estimation methods struggle with large-scale data, creating a need for more efficient and scalable approaches that can handle complex distributions and non-linear relationships without requiring full parametric specifications.
Method: Develops coreset construction for MCTMs using importance sampling to reduce data size while preserving statistical accuracy with high probability. Uses geometric approximation based on convex hull of input data to address numerical issues with logarithmic terms, ensuring stable inference.
Result: The approach achieves substantial data reduction while maintaining log-likelihood within multiplicative error bounds of (1±ε). Numerical experiments show significantly improved computational efficiency for large, complex datasets compared to conventional methods.
Conclusion: The novel coreset construction enables efficient and scalable semi-parametric distributional modeling, laying foundation for broader applications in statistics and machine learning, particularly for scenarios with complex distributions and non-linear relationships.
Abstract: Efficient and scalable non-parametric or semi-parametric regression analysis and density estimation are of crucial importance to the fields of statistics and machine learning. However, available methods are limited in their ability to handle large-scale data. We address this issue by developing a novel coreset construction for multivariate conditional transformation models (MCTMs) to enhance their scalability and training efficiency. To the best of our knowledge, these are the first coresets for semi-parametric distributional models. Our approach yields substantial data reduction via importance sampling. It ensures with high probability that the log-likelihood remains within multiplicative error bounds of $(1\pm\varepsilon)$ and thereby maintains statistical model accuracy. Compared to conventional full-parametric models, where coresets have been incorporated before, our semi-parametric approach exhibits enhanced adaptability, particularly in scenarios where complex distributions and non-linear relationships are present, but not fully understood. To address numerical problems associated with normalizing logarithmic terms, we follow a geometric approximation based on the convex hull of input data. This ensures feasible, stable, and accurate inference in scenarios involving large amounts of data. Numerical experiments demonstrate substantially improved computational efficiency when handling large and complex datasets, thus laying the foundation for a broad range of applications within the statistics and machine learning communities.
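The importance-sampling backbone of such constructions is standard and can be sketched generically; the paper's contribution lies in the MCTM-specific sensitivity bounds, which the sketch below replaces with given per-point scores. Sampling point i with probability proportional to its sensitivity and attaching the weight 1/(m·p_i) makes weighted sums over the coreset unbiased estimates of sums over the full data.

```python
import random

# Generic sensitivity-based coreset sampler (textbook recipe, not the
# paper's MCTM-specific construction).
def coreset(points, sensitivities, m, rng):
    total = sum(sensitivities)
    probs = [s / total for s in sensitivities]
    idx = rng.choices(range(len(points)), weights=probs, k=m)
    # weight 1/(m * p_i) keeps weighted sums unbiased
    return [(points[i], 1.0 / (m * probs[i])) for i in idx]

# With uniform sensitivities every sampled point gets weight n / m.
sample = coreset([1.0, 2.0, 3.0, 4.0], [1, 1, 1, 1], 2, random.Random(0))
```

The (1±ε) log-likelihood guarantee then follows from how tightly the sensitivities bound each point's worst-case contribution, which is where the paper's analysis does the real work.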
[411] Quantifying Gate Contribution in Quantum Feature Maps for Scalable Circuit Optimization
F. Rodríguez-Díaz, D. Gutiérrez-Avilés, A. Troncoso, F. Martínez-Álvarez
Main category: cs.LG
TL;DR: GATE methodology optimizes quantum machine learning circuits by removing low-contribution gates using a significance index, reducing circuit size/runtime while maintaining accuracy.
Details
Motivation: Current quantum devices face limitations from noise, decoherence, and connectivity constraints that hinder efficient execution of feature map-based circuits for quantum machine learning classification tasks.
Method: Develops a gate significance index combining fidelity, entanglement, and sensitivity to quantify gate relevance. Iteratively scans threshold ranges, eliminates low-contribution gates, generates optimized QML models, and ranks them based on accuracy, runtime, and balanced performance.
Result: Consistent reductions in circuit size and runtime, with preserved or improved predictive accuracy in many cases. Best trade-offs occur at intermediate thresholds rather than baseline or aggressively compressed circuits.
Conclusion: GATE methodology effectively optimizes quantum machine learning circuits, demonstrating practical improvements for real-world classification tasks across simulation, emulation, and real hardware scenarios.
Abstract: Quantum machine learning offers promising advantages for classification tasks, but noise, decoherence, and connectivity constraints in current devices continue to limit the efficient execution of feature map-based circuits. Gate Assessment and Threshold Evaluation (GATE) is presented as a circuit optimization methodology that reduces quantum feature maps using a novel gate significance index. This index quantifies the relevance of each gate by combining fidelity, entanglement, and sensitivity. It is formulated for both simulator/emulator environments, where quantum states are accessible, and for real hardware, where these quantities are estimated from measurement results and auxiliary circuits. The approach iteratively scans a threshold range, eliminates low-contribution gates, generates optimized quantum machine learning models, and ranks them based on accuracy, runtime, and a balanced performance criterion before final testing. The methodology is evaluated on real-world classification datasets using two representative quantum machine learning models, PegasosQSVM and Quantum Neural Network, in three execution scenarios: noise-free simulation, noisy emulation derived from an IBM backend, and real IBM quantum hardware. The structural impact of gate removal in feature maps is examined, compatibility with noise-mitigation techniques is studied, and the scalability of index computation is evaluated using approaches based on density matrices, matrix product states, tensor networks, and real-world devices. The results show consistent reductions in circuit size and runtime and, in many cases, preserved or improved predictive accuracy, with the best trade-offs typically occurring at intermediate thresholds rather than in the baseline circuits or in those compressed more aggressively.
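The threshold-scanning loop itself is simple and can be sketched abstractly; the significance scores and the `evaluate` metric below are hypothetical stand-ins for the paper's index and its accuracy/runtime ranking criterion.

```python
# Sketch of GATE's outer loop: prune gates below each threshold, score the
# pruned circuit, keep the best trade-off.
def scan_thresholds(gates, significance, thresholds, evaluate):
    results = []
    for t in thresholds:
        kept = [g for g, s in zip(gates, significance) if s >= t]
        results.append((t, kept, evaluate(kept)))
    return max(results, key=lambda r: r[2])

# Toy metric rewarding high-significance gates but penalizing circuit size;
# the best trade-off lands at the intermediate threshold, mirroring the
# paper's finding.
metric = lambda kept: (0.8 if 'h' in kept else 0.0) \
    + (0.1 if 'cx' in kept else 0.0) - 0.05 * len(kept)
best = scan_thresholds(['h', 'cx', 'rz'], [0.9, 0.5, 0.1],
                       [0.0, 0.4, 0.8], metric)
```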
[412] Two-Time-Scale Learning Dynamics: A Population View of Neural Network Training
Giacomo Borghi, Hyesung Im, Lorenzo Pareschi
Main category: cs.LG
TL;DR: Theoretical framework for population-based neural network training using two-time-scale dynamics: fast parameter updates via SGD/Langevin and slow hyperparameter evolution via selection-mutation dynamics.
Details
Motivation: Population-based learning methods (evolutionary strategies, PBT, model-merging) have empirical success but lack a general mathematical description of collective training dynamics. Need theoretical framework to understand these methods.
Method: Model population of neural networks as interacting agent system with two-time-scale dynamics: fast noisy gradient updates (SGD/Langevin) for parameters, slower selection-mutation dynamics for hyperparameters. Derive large-population limit and selection-mutation equation for hyperparameter density under strong time-scale separation.
Result: Proved large-population limit for joint distribution of parameters and hyperparameters. Under time-scale separation, derived selection-mutation equation for hyperparameter density. Fast parameter dynamics relax to Boltzmann-Gibbs measure, inducing effective fitness for slow evolution. Connects population-based learning with bilevel optimization and replicator-mutator models.
Conclusion: Theoretical framework provides mathematical foundation for population-based learning, clarifies role of noise and diversity in balancing optimization and exploration, and shows access to effective fitness can improve population-level updates.
Abstract: Population-based learning paradigms, including evolutionary strategies, Population-Based Training (PBT), and recent model-merging methods, combine fast within-model optimisation with slower population-level adaptation. Despite their empirical success, a general mathematical description of the resulting collective training dynamics remains incomplete. We introduce a theoretical framework for neural network training based on two-time-scale population dynamics. We model a population of neural networks as an interacting agent system in which network parameters evolve through fast noisy gradient updates of SGD/Langevin type, while hyperparameters evolve through slower selection–mutation dynamics. We prove the large-population limit for the joint distribution of parameters and hyperparameters and, under strong time-scale separation, derive a selection–mutation equation for the hyperparameter density. For each fixed hyperparameter, the fast parameter dynamics relaxes to a Boltzmann–Gibbs measure, inducing an effective fitness for the slow evolution. The averaged dynamics connects population-based learning with bilevel optimisation and classical replicator–mutator models, yields conditions under which the population mean moves toward the fittest hyperparameter, and clarifies the role of noise and diversity in balancing optimisation and exploration. Numerical experiments illustrate both the large-population regime and the reduced two-time-scale dynamics, and indicate that access to the effective fitness, either in closed form or through population-level estimation, can improve population-level updates.
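A toy instance of the two-time-scale scheme can make the structure concrete. This is an illustration, not the paper's formal model: each agent carries parameters theta and one hyperparameter (its learning rate); the fast scale runs many noisy gradient steps on the loss (theta - 3)^2, and the slow scale applies selection (keep the fitter half) and mutation (jitter survivors' learning rates).

```python
import math, random

def fast_steps(theta, lr, rng, k=50):
    # fast scale: noisy gradient descent (Langevin-type) on (theta - 3)^2
    for _ in range(k):
        theta -= lr * 2 * (theta - 3.0) + 0.01 * rng.gauss(0, 1)
    return theta

def slow_step(pop, rng):
    # slow scale: selection on fitness (= -loss), then lr mutation
    pop.sort(key=lambda a: (a[0] - 3.0) ** 2)
    survivors = pop[: len(pop) // 2]
    children = [(t, lr * math.exp(0.1 * rng.gauss(0, 1)))
                for t, lr in survivors]
    return survivors + children

rng = random.Random(0)
pop = [(rng.uniform(-5.0, 5.0), lr) for lr in [0.001, 0.005, 0.01, 0.05] * 2]
for _ in range(5):
    pop = [(fast_steps(t, lr, rng), lr) for t, lr in pop]
    pop = slow_step(pop, rng)
best_loss = min((t - 3.0) ** 2 for t, _ in pop)
```

In the paper's regime the fast dynamics relaxes to a Boltzmann-Gibbs measure per hyperparameter, so the slow selection effectively acts on an averaged fitness rather than on individual noisy losses as in this toy.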
[413] Eye Gaze-Informed and Context-Aware Pedestrian Trajectory Prediction in Shared Spaces with Automated Shuttles: A Virtual Reality Study
Danya Li, Yan Feng, Rico Krueger
Main category: cs.LG
TL;DR: VR study of pedestrian interactions with automated shuttles in shared spaces, proposing GazeX-LSTM - a multimodal eye gaze-informed prediction model that integrates trajectories, gaze dynamics, and contextual factors for improved behavior prediction.
Details
Motivation: Accurate pedestrian behavior prediction is critical for safe automated shuttle integration in unstructured shared urban spaces lacking traffic rules, requiring understanding of complex pedestrian interactions.
Method: Virtual Reality study capturing pedestrian interactions across diverse scenarios; development of GazeX-LSTM model integrating pedestrian trajectories, fine-grained eye gaze dynamics, and contextual factors using LSTM architecture.
Result: Identified critical pedestrian behavior patterns; demonstrated eye gaze’s unique predictive power over head orientation alone; showed super-additive improvements when combining gaze data with contextual information for prediction accuracy.
Conclusion: Eye gaze-informed modeling fundamentally advances pedestrian behavior prediction; situational contexts play critical role in shared-space interactions; paves way for safer automated vehicle technologies accounting for human perception and action.
Abstract: The integration of Automated Shuttles into shared urban spaces presents unique challenges due to the absence of traffic rules and the complex pedestrian interactions. Accurately anticipating pedestrian behavior in such unstructured environments is therefore critical for ensuring both safety and efficiency. This paper presents a Virtual Reality (VR) study that captures how pedestrians interact with automated shuttles across diverse scenarios, including varying approach angles and navigating in continuous traffic. We identify critical behavior patterns present in pedestrians’ decision-making in shared spaces, including hesitation, evasive maneuvers, gaze allocation, and proxemic adjustments. To model pedestrian behavior, we propose GazeX-LSTM, a multimodal eye gaze-informed and context-aware prediction model that integrates pedestrians’ trajectories, fine-grained eye gaze dynamics, and contextual factors. We shift prediction from a vehicle- to a human-centered perspective by leveraging eye-tracking data to capture pedestrian attention. We systematically validate the unique and irreplaceable predictive power of eye gaze over head orientation alone, further enhancing performance by integrating contextual variables. Notably, the combination of eye gaze data and contextual information produces super-additive improvements on pedestrian behavior prediction accuracy, revealing the complementary relationship between visual attention and situational contexts. Together, our findings provide the first evidence that eye gaze-informed modeling fundamentally advances pedestrian behavior prediction and highlight the critical role of situational contexts in shared-space interactions. This paves the way for safer and more adaptive automated vehicle technologies that account for how people perceive and act in complex shared spaces.
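The multimodal integration amounts to assembling one fused feature vector per time step before the sequence model. This sketch uses a hypothetical feature layout (the paper's exact features and encoder are not specified in the summary): trajectory coordinates, gaze features, and static context concatenated per step.

```python
# Hedged sketch of multimodal input assembly for a sequence model such as
# an LSTM; feature names below are illustrative placeholders.
def build_sequence(trajectory, gaze, context):
    return [list(p) + list(g) + list(context)
            for p, g in zip(trajectory, gaze)]

seq = build_sequence([(0.0, 0.0), (0.4, 0.1)],   # x, y positions
                     [(0.9, 0.1), (0.2, 0.8)],   # e.g. gaze-on-shuttle vs. elsewhere
                     (1.0,))                     # e.g. shuttle approach angle
```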
[414] GDEGAN: Gaussian Dynamic Equivariant Graph Attention Network for Ligand Binding Site Prediction
Animesh, Plaban Kumar Bhowmick, Pralay Mitra
Main category: cs.LG
TL;DR: GDEGAN introduces a novel equivariant graph attention network with adaptive Gaussian kernels for protein binding site prediction, outperforming existing methods by capturing local chemical and geometric variations.
Details
Motivation: Current equivariant GNNs for binding site identification use dot-product attention that ignores variations in chemical and geometric properties of neighboring residues, limiting their ability to accurately recognize binding sites.
Method: Proposes GDEGAN (Gaussian Dynamic Equivariant Graph Attention Network) which replaces dot-product attention with adaptive kernels that capture local feature distribution statistics. Uses local variance as adaptive bandwidth with learnable per-head temperatures for context-specific importance.
Result: Outperforms existing methods with 37-66% relative improvement in DCC and 7-19% improvement in DCA success rates across COACH420, HOLO4k, and PDBBind2020 datasets.
Conclusion: GDEGAN’s adaptive attention mechanism effectively captures local variations in protein structures, providing significant improvements in binding site prediction with direct applications in computational drug discovery.
Abstract: Accurate prediction of binding sites of a given protein, to which ligands can bind, is a critical step in structure-based computational drug discovery. Recently, Equivariant Graph Neural Networks (GNNs) have emerged as a powerful paradigm for binding site identification methods due to the large-scale availability of 3D structures of proteins via protein databases and AlphaFold predictions. The state-of-the-art equivariant GNN methods implement dot product attention, disregarding the variation in the chemical and geometric properties of the neighboring residues. To capture this variation, we propose GDEGAN (Gaussian Dynamic Equivariant Graph Attention Network), which replaces dot-product attention with adaptive kernels that recognize binding sites. The proposed attention mechanism captures variation in neighboring residues using statistics of their characteristic local feature distributions. Our mechanism dynamically computes neighborhood statistics at each layer, using local variance as an adaptive bandwidth parameter with learnable per-head temperatures, enabling each protein region to determine its own context-specific importance. GDEGAN outperforms existing methods with relative improvements of 37-66% in DCC and 7-19% DCA success rates across COACH420, HOLO4k, and PDBBind2020 datasets. These advances have direct application in accelerating protein-ligand docking by identifying potential binding sites for therapeutic target identification.
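The adaptive-kernel idea can be sketched on scalar features (the real model operates on equivariant residue embeddings with learnable per-head temperatures): the Gaussian bandwidth is the local variance of the neighborhood, so each region sets its own notion of similarity instead of a global dot-product scale.

```python
import math

# Simplified Gaussian attention with variance-adaptive bandwidth; a sketch
# of the mechanism, not the paper's layer.
def gaussian_attention(query, neighbors, temperature=1.0):
    mean = sum(neighbors) / len(neighbors)
    var = sum((x - mean) ** 2 for x in neighbors) / len(neighbors)
    bw = temperature * (var + 1e-8)              # local adaptive bandwidth
    logits = [-((query - x) ** 2) / (2 * bw) for x in neighbors]
    z = sum(math.exp(l) for l in logits)
    return [math.exp(l) / z for l in logits]
```

In a tight (low-variance) neighborhood the kernel sharpens, while in a dispersed one it flattens, which is how the attention adapts to local chemical and geometric variation.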
[415] What If Consensus Lies? Selective-Complementary Reinforcement Learning at Test Time
Dong Yan, Jian Liang, Yanbo Wang, Shuo Lu, Ran He, Tieniu Tan
Main category: cs.LG
TL;DR: SCRL introduces selective positive and entropy-gated negative pseudo-labeling for test-time reinforcement learning to mitigate label noise amplification in LLM reasoning tasks.
Details
Motivation: Existing test-time reinforcement learning methods rely exclusively on positive pseudo-labeling, which becomes vulnerable when answer distributions are highly dispersed, leading to weak consensus that reinforces incorrect trajectories as supervision signals.
Method: SCRL develops Selective Positive Pseudo-Labeling with strict consensus criteria to filter unreliable majorities, and introduces Entropy-Gated Negative Pseudo-Labeling as the first negative supervision mechanism in TTRL to prune incorrect trajectories based on generation uncertainty.
Result: Extensive experiments on multiple reasoning benchmarks demonstrate that SCRL achieves substantial improvements over baselines while maintaining robust generalization and training stability under constrained rollout budgets.
Conclusion: SCRL provides a robust test-time reinforcement learning framework that effectively mitigates label noise amplification through complementary positive and negative pseudo-labeling strategies.
Abstract: Test-Time Reinforcement Learning (TTRL) enables Large Language Models (LLMs) to enhance reasoning capabilities on unlabeled test streams by deriving pseudo-rewards from majority voting consensus. However, existing TTRL methods rely exclusively on positive pseudo-labeling strategies. Such reliance becomes vulnerable under challenging scenarios where answer distributions are highly dispersed, resulting in weak consensus that inadvertently reinforces incorrect trajectories as supervision signals. In this paper, we propose SCRL (Selective-Complementary Reinforcement Learning), a robust test-time reinforcement learning framework that effectively mitigates label noise amplification. SCRL develops Selective Positive Pseudo-Labeling, which enforces strict consensus criteria to filter unreliable majorities. Complementarily, SCRL introduces Entropy-Gated Negative Pseudo-Labeling, the first negative supervision mechanism in TTRL, to reliably prune incorrect trajectories based on generation uncertainty. Extensive experiments on multiple reasoning benchmarks demonstrate that SCRL achieves substantial improvements over baselines, while maintaining robust generalization and training stability under constrained rollout budgets. Our code is available at https://github.com/Jasper-Yan/SCRL.
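The two gates compose naturally and can be sketched over a batch of sampled answers; the thresholds below are illustrative, not the paper's. The majority answer is accepted as a positive label only when its vote share is high, and minority answers are flagged as negatives only when the answer distribution's entropy is low (confident disagreement is informative; dispersed answers trigger neither gate).

```python
from collections import Counter
import math

# Sketch of selective positive + entropy-gated negative pseudo-labeling.
def pseudo_labels(answers, consensus=0.6, entropy_gate=1.0):
    counts = Counter(answers)
    n = len(answers)
    probs = {a: c / n for a, c in counts.items()}
    entropy = -sum(p * math.log(p) for p in probs.values())
    top, share = max(probs.items(), key=lambda kv: kv[1])
    positive = top if share >= consensus else None        # strict consensus
    negatives = [a for a, p in probs.items()
                 if a != top and entropy <= entropy_gate]  # entropy gate
    return positive, negatives
```

A dispersed batch (five distinct answers) yields neither a positive nor negatives, which is exactly the failure mode of consensus-only TTRL that SCRL is designed to avoid amplifying.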
[416] FIPO: Eliciting Deep Reasoning with Future-KL Influenced Policy Optimization
Chiyu Ma, Shuo Yang, Kexin Huang, Jinda Lu, Haoming Meng, Shangshang Wang, Bolin Ding, Soroush Vosoughi, Guoyin Wang, Jingren Zhou
Main category: cs.LG
TL;DR: FIPO is a reinforcement learning algorithm that improves reasoning in LLMs by using future-KL divergence for better credit assignment, enabling longer reasoning chains and higher accuracy on complex tasks.
Details
Motivation: Current RL methods for LLMs use coarse-grained credit assignment that treats all tokens equally, failing to distinguish critical logical steps from trivial tokens, which creates reasoning bottlenecks and limits performance.
Method: FIPO incorporates discounted future-KL divergence into policy updates to create a dense advantage formulation that re-weights tokens based on their influence on subsequent trajectory behavior, enabling more precise credit assignment.
Result: FIPO extends average chain-of-thought length from ~4,000 to over 10,000 tokens and increases AIME 2024 Pass@1 accuracy from 50.0% to 58.0% (converging at ~56.0%), outperforming DeepSeek-R1-Zero-Math-32B (~47.0%) and o1-mini (~56.0%).
Conclusion: Dense advantage formulations are crucial for evolving ORM-based algorithms to unlock the full reasoning potential of base models, and FIPO demonstrates significant improvements in reasoning length and accuracy.
Abstract: We present Future-KL Influenced Policy Optimization (FIPO), a reinforcement learning algorithm designed to overcome reasoning bottlenecks in large language models. While GRPO style training scales effectively, it typically relies on outcome-based rewards (ORM) that distribute a global advantage uniformly across every token in a trajectory. We argue that this coarse-grained credit assignment imposes a performance ceiling by failing to distinguish critical logical pivots from trivial tokens. FIPO addresses this by incorporating discounted future-KL divergence into the policy update, creating a dense advantage formulation that re-weights tokens based on their influence on subsequent trajectory behavior. Empirically, FIPO enables models to break through the length stagnation seen in standard baselines. Evaluated on Qwen2.5-32B, FIPO extends the average chain-of-thought length from roughly 4,000 to over 10,000 tokens and increases AIME 2024 Pass@1 accuracy from 50.0% to a peak of 58.0% (converging at approximately 56.0%). This outperforms both DeepSeek-R1-Zero-Math-32B (around 47.0%) and o1-mini (approximately 56.0%). Our results suggest that establishing dense advantage formulations is a vital path for evolving ORM-based algorithms to unlock the full reasoning potential of base models. We open-source our training system, built on the verl framework.
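One plausible reading of the re-weighting (a hedged sketch from the summary, not the released implementation) is: compute, for each token, the discounted sum of KL divergences at strictly later positions, then scale the global trajectory advantage by each token's normalized share, so the per-token average still equals the original global advantage while credit concentrates on tokens that steer subsequent behavior.

```python
# Hypothetical dense-advantage re-weighting via discounted future KL.
def dense_advantages(global_adv, kl_per_token, gamma=0.9):
    n = len(kl_per_token)
    future = [0.0] * n
    acc = 0.0
    for t in range(n - 1, -1, -1):
        future[t] = acc                      # discounted KL strictly after t
        acc = kl_per_token[t] + gamma * acc
    total = sum(future) or 1.0
    # normalization preserves the mean: sum(result) == global_adv * n
    return [global_adv * n * f / total for f in future]
```

A token immediately preceding a high-KL step absorbs most of the credit, whereas under uniform ORM-style assignment every token would receive `global_adv` regardless of influence.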
[417] Integrating Meta-Features with Knowledge Graph Embeddings for Meta-Learning
Antonis Klironomos, Ioannis Dasoulas, Francesco Periti, Mohamed Gad-Elrab, Heiko Paulheim, Anastasia Dimou, Evgeny Kharlamov
Main category: cs.LG
TL;DR: KGmetaSP: Knowledge graph embeddings approach for meta-learning that leverages past experiment data to improve pipeline performance estimation and dataset similarity estimation.
Details
Motivation: Existing meta-learning approaches rely on dataset meta-features but overlook valuable past experimental results and pipeline metadata, limiting their ability to capture dataset-pipeline interactions that reveal performance similarity patterns.
Method: Proposes KGmetaSP that represents datasets and pipelines in a unified knowledge graph, derives embeddings to support pipeline-agnostic meta-models for performance estimation and distance-based retrieval for dataset similarity.
Result: Validated on large-scale benchmark of 144,177 OpenML experiments; enables accurate pipeline performance estimation using single pipeline-agnostic meta-model and improves dataset similarity estimation over baselines.
Conclusion: KGmetaSP establishes new reference point for meta-learning by consolidating open experiment data into unified knowledge graph, advancing the field through better utilization of past experimental results.
Abstract: The vast collection of machine learning records available on the web presents a significant opportunity for meta-learning, where past experiments are leveraged to improve performance. Two crucial meta-learning tasks are pipeline performance estimation (PPE), which predicts pipeline performance on target datasets, and dataset performance-based similarity estimation (DPSE), which identifies datasets with similar performance patterns. Existing approaches primarily rely on dataset meta-features (e.g., number of instances, class entropy, etc.) to represent datasets numerically and approximate these meta-learning tasks. However, these approaches often overlook the wealth of past experimental results and pipeline metadata available. This limits their ability to capture dataset - pipeline interactions that reveal performance similarity patterns. In this work, we propose KGmetaSP, a knowledge-graph-embeddings approach that leverages existing experiment data to capture these interactions and improve both PPE and DPSE. We represent datasets and pipelines within a unified knowledge graph (KG) and derive embeddings that support pipeline-agnostic meta-models for PPE and distance-based retrieval for DPSE. To validate our approach, we construct a large-scale benchmark comprising 144,177 OpenML experiments, enabling a rich cross-dataset evaluation. KGmetaSP enables accurate PPE using a single pipeline-agnostic meta-model and improves DPSE over baselines. The proposed KGmetaSP, KG, and benchmark are released, establishing a new reference point for meta-learning and demonstrating how consolidating open experiment data into a unified KG advances the field.
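The DPSE side reduces to nearest-neighbor retrieval in embedding space. This sketch assumes the KG embeddings are already computed (the embeddings and dataset names below are hypothetical); the paper's contribution is in how the unified knowledge graph produces them.

```python
import math

# Distance-based retrieval over precomputed (name, embedding) pairs.
def nearest_datasets(query, embeddings, k=2):
    def cos(a, b):
        num = sum(x * y for x, y in zip(a, b))
        return num / (math.sqrt(sum(x * x for x in a))
                      * math.sqrt(sum(y * y for y in b)))
    ranked = sorted(embeddings, key=lambda kv: -cos(query, kv[1]))
    return [name for name, _ in ranked[:k]]
```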
[418] NASimJax: GPU-Accelerated Policy Learning Framework for Penetration Testing
Raphael Simon, José Carrasquel, Wim Mees, Pieter Libin
Main category: cs.LG
TL;DR: NASimJax is a JAX-based network attack simulator that achieves 100x speedup for RL-based penetration testing, enabling training on larger networks and studying zero-shot generalization through contextual POMDP formulation and network generation pipeline.
Details
Motivation: Existing penetration testing simulators are too slow to train RL policies at scale, resulting in poor generalization. There's a need for faster simulation to enable experimentation on realistic network scenarios and study policy generalization.
Method: 1) NASimJax: JAX-based reimplementation of Network Attack Simulator achieving 100x throughput; 2) Formulate penetration testing as Contextual POMDP; 3) Network generation pipeline for diverse, solvable scenarios; 4) Two-stage action decomposition (2SAS) for large action spaces; 5) Use Prioritized Level Replay for training.
Result: Achieved up to 100x environment throughput, enabled training on networks up to 40 hosts. Found Prioritized Level Replay outperforms Domain Randomization at scale, sparse topology training improves OOD generalization, and 2SAS substantially outperforms flat action masking for large action spaces.
Conclusion: NASimJax provides a fast, flexible platform for RL-based penetration testing research, enabling study of generalization and scaling to realistic network sizes. Identified interaction issues between training methods and action decomposition that need addressing.
Abstract: Penetration testing, the practice of simulating cyberattacks to identify vulnerabilities, is a complex sequential decision-making task that is inherently partially observable and features large action spaces. Training reinforcement learning (RL) policies for this domain faces a fundamental bottleneck: existing simulators are too slow to train on realistic network scenarios at scale, resulting in policies that fail to generalize. We present NASimJax, a complete JAX-based reimplementation of the Network Attack Simulator (NASim), achieving up to 100x higher environment throughput than the original simulator. By running the entire training pipeline on hardware accelerators, NASimJax enables experimentation on larger networks under fixed compute budgets that were previously infeasible. We formulate automated penetration testing as a Contextual POMDP and introduce a network generation pipeline that produces structurally diverse and guaranteed-solvable scenarios. Together, these provide a principled basis for studying zero-shot policy generalization. We use the framework to investigate action-space scaling and generalization across networks of up to 40 hosts. We find that Prioritized Level Replay better handles dense training distributions than Domain Randomization, particularly at larger scales, and that training on sparser topologies yields an implicit curriculum that improves out-of-distribution generalization, even on topologies denser than those seen during training. To handle linearly growing action spaces, we propose a two-stage action decomposition (2SAS) that substantially outperforms flat action masking at scale. Finally, we identify a failure mode arising from the interaction between Prioritized Level Replay’s episode-reset behaviour and 2SAS’s credit assignment structure. NASimJax thus provides a fast, flexible, and realistic platform for advancing RL-based penetration testing.
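The 2SAS idea is that the policy first selects a target host, then an action type conditioned on that host, rather than one flat masked head over the full hosts × actions product. A minimal sketch (argmax over toy logits stands in for sampling from learned heads):

```python
# Two-stage action selection sketch: host head, then per-host action head.
def select_action(host_logits, type_logits_by_host):
    host = max(range(len(host_logits)), key=lambda h: host_logits[h])
    types = type_logits_by_host[host]
    action = max(range(len(types)), key=lambda a: types[a])
    return host, action
```

The flat head grows as hosts × actions, while the decomposed heads grow as hosts + actions, which is why the decomposition scales to the 40-host networks studied here.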
[419] On the Dynamics & Transferability of Latent Generalization during Memorization
Simran Ketha, Venkatakrishnan Ramaswamy
Main category: cs.LG
TL;DR: The paper investigates latent generalization in deep networks during memorization of corrupted labels, tracking training dynamics and exploring methods to extract and transfer this latent generalization using both quadratic (MASC) and linear probes.
Details
Motivation: Deep networks can memorize corrupted training data while still retaining latent generalization abilities in their internal representations, but the origin and dynamics of this latent generalization during training are not well understood. The paper aims to understand when latent generalization emerges during training and whether it can be linearly decoded from layerwise outputs.
Method: The authors empirically track training dynamics to observe when latent generalization peaks, analyze the mathematical structure of existing MASC probes (showing they are quadratic classifiers), design new linear probes for the setting, and develop methods to transfer latent generalization from last-layer representations to model weights through direct weight editing.
Result: Latent generalization abilities largely peak early in training, similar to model generalization. The MASC probe is shown to be a quadratic (non-linear) classifier, raising questions about linear decodability. A new linear probe is designed, and methods are developed to transfer latent generalization to model generalization through weight editing.
Conclusion: The paper provides insights into the training dynamics of latent generalization during memorization and explores both non-linear and linear approaches to extract this generalization, with practical methods to transfer latent generalization to improve model performance.
Abstract: Deep networks have been known to have extraordinary generalization abilities, via mechanisms that aren’t yet well understood. It is also known that upon shuffling labels in the training data to varying degrees, deep networks, trained with standard methods, can still achieve perfect or high accuracy on this corrupted training data. This phenomenon is called memorization, and typically comes at the cost of poorer generalization to true labels. Our recent work has demonstrated, that the internal representations of such models retain significantly better latent generalization abilities than is directly apparent from the model. In particular, it has been shown that such latent generalization can be recovered via simple probes (called MASC probes) on the layer-wise representations of the model. However, the origin and dynamics over training of this latent generalization during memorization is not well understood. Here, we track the training dynamics, empirically, and find that latent generalization abilities largely peak early in training, with model generalization. Next, we investigate to what extent the specific nature of the MASC probe is critical for our ability to extract latent generalization from the model’s layerwise outputs. To this end, we first examine the mathematical structure of the MASC probe and show that it is a quadratic classifier, i.e. is non-linear. This brings up the question of the extent to which this latent generalization might be linearly decodable from layerwise outputs. To investigate this, we designed a new linear probe for this setting. Next, we consider the question of whether it is possible to transfer latent generalization to model generalization by directly editing model weights. To this end, we devise a way to transfer the latent generalization present in last-layer representations to the model using the new linear probe.
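The quadratic-vs-linear distinction the paper draws can be illustrated with two toy probe families on layerwise features (MASC's exact form is in the paper; this contrast only shows why the distinction matters). A nearest-centroid rule with per-class spread is quadratic in the input; with equal spreads the squared terms cancel and a linear decision rule remains, and the two can genuinely disagree.

```python
# Illustrative quadratic vs. linear probes over feature vectors.
def quadratic_probe(x, centroids, variances):
    # per-class spread makes the decision boundary quadratic in x
    def score(c, v):
        return -sum((xi - ci) ** 2 for xi, ci in zip(x, c)) / v
    return max(range(len(centroids)),
               key=lambda k: score(centroids[k], variances[k]))

def linear_probe(x, centroids):
    # equal spreads: squared terms cancel, leaving a linear rule
    def score(c):
        return (sum(xi * ci for xi, ci in zip(x, c))
                - 0.5 * sum(ci * ci for ci in c))
    return max(range(len(centroids)), key=lambda k: score(centroids[k]))
```

With centroids at the origin and at (2, 2), the point (1.1, 1.1) is linearly assigned to the second class, but a quadratic probe that trusts a tight second-class spread assigns it to the first, showing the two probe families are not interchangeable.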
[420] Discovery of Decision Synchronization Patterns from Event Logs
Tijmen Kuijpers, Karolin Winter, Remco Dijkman
Main category: cs.LG
TL;DR: An approach for discovering decision synchronization patterns in business processes that coordinate multiple cases simultaneously rather than treating them individually.
Details
Motivation: Current process discovery techniques rarely address decision synchronization patterns that coordinate multiple running cases simultaneously, which are important for fair resource allocation, prioritization, and preventing unnecessary waiting in business processes.
Method: Proposes an approach inspired by supply chain processes to discover decision synchronization patterns, which combine specific process constructs with constraints determining which case to execute. Formalizes and demonstrates discovery of constraints for four such patterns.
Result: Evaluated in two artificial scenarios: (1) four separate process models each containing a single decision synchronization pattern, showing reliable discovery of each pattern type; (2) a process model containing all four patterns, demonstrating generalizability to complex problems.
Conclusion: The approach can reliably discover decision synchronization patterns in business processes, addressing a gap in current process discovery techniques that typically only consider single-case properties rather than multi-case coordination.
Abstract: Synchronizing decisions between running cases in business processes facilitates fair and efficient use of resources, helps prioritize the most valuable cases, and prevents unnecessary waiting. Consequently, decision synchronization patterns are regularly built into processes, in the form of mechanisms that temporarily delay one case to favor another. These decision mechanisms therefore consider properties of multiple cases at once, rather than just the properties of a single case; an aspect that is rarely addressed by current process discovery techniques. To address this gap, this paper proposes an approach for discovering decision synchronization patterns inspired by supply chain processes. These decision synchronization patterns take the form of specific process constructs combined with a constraint that determines which particular case to execute. We describe, formalize and demonstrate how the constraint for four such patterns can be discovered. We evaluate our approach in two artificial scenarios. First, with four separate process models each containing a single decision synchronization pattern, i.e., we demonstrate that our approach can discover every type of pattern when only this one type is present. Second, we consider a process model containing all four decision synchronization patterns to show generalizability of the approach to more complex problems. For both scenarios, we could reliably retrieve the expected patterns.
[421] Memori: A Persistent Memory Layer for Efficient, Context-Aware LLM Agents
Luiz C. Borro, Luiz A. B. Macarini, Gordon Tindall, Michael Montero, Adam B. Struck
Main category: cs.LG
TL;DR: Memori is an LLM-agnostic persistent memory layer that structures conversation data into semantic triples and summaries for efficient retrieval, reducing token costs by 67% while improving accuracy on memory tasks.
Details
Motivation: Current LLM memory approaches suffer from vendor lock-in and inefficient prompt injection of raw conversations, leading to high token costs and degraded performance. There's a need for a memory system that enables context-aware behavior across multi-session interactions without these limitations.
Method: Memori treats memory as a data structuring problem with an Advanced Augmentation pipeline that converts unstructured dialogue into compact semantic triples and conversation summaries, enabling precise retrieval and coherent reasoning while being LLM-agnostic.
Result: On the LoCoMo benchmark, Memori achieves 81.95% accuracy, outperforming existing memory systems while using only 1,294 tokens per query (~5% of full context). This results in 67% fewer tokens than competing approaches and over 20x savings compared to full-context methods.
Conclusion: Effective memory in LLM agents depends on structured representations rather than larger context windows, enabling scalable and cost-efficient deployment of autonomous agents with persistent memory.
Abstract: As large language models (LLMs) evolve into autonomous agents, persistent memory at the API layer is essential for enabling context-aware behavior across LLMs and multi-session interactions. Existing approaches force vendor lock-in and rely on injecting large volumes of raw conversation into prompts, leading to high token costs and degraded performance. We introduce Memori, an LLM-agnostic persistent memory layer that treats memory as a data structuring problem. Its Advanced Augmentation pipeline converts unstructured dialogue into compact semantic triples and conversation summaries, enabling precise retrieval and coherent reasoning. Evaluated on the LoCoMo benchmark, Memori achieves 81.95% accuracy, outperforming existing memory systems while using only 1,294 tokens per query (~5% of full context). This results in substantial cost reductions, including 67% fewer tokens than competing approaches and over 20x savings compared to full-context methods. These results show that effective memory in LLM agents depends on structured representations instead of larger context windows, enabling scalable and cost-efficient deployment.
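The abstract specifies only that dialogue is distilled into semantic triples and summaries, not the storage schema; as a rough illustration of the triple-store idea (class names and the naive substring retrieval are hypothetical, not Memori's actual design):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Triple:
    """A compact (subject, predicate, object) fact distilled from a dialogue turn."""
    subject: str
    predicate: str
    obj: str

class MemoryStore:
    """Toy persistent memory: keeps triples and returns those matching a query term.
    A real system would index with embeddings; substring match is used here only
    to keep the sketch self-contained."""
    def __init__(self):
        self.triples = []

    def add(self, subject, predicate, obj):
        self.triples.append(Triple(subject, predicate, obj))

    def retrieve(self, term):
        term = term.lower()
        return [t for t in self.triples
                if term in t.subject.lower() or term in t.obj.lower()]

store = MemoryStore()
store.add("user", "prefers", "dark mode")
store.add("user", "lives_in", "Berlin")
hits = store.retrieve("berlin")
```

The token savings reported in the paper come from injecting only such compact retrieved facts into the prompt, rather than raw conversation history.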
[422] Graph2TS: Structure-Controlled Time Series Generation via Quantile-Graph VAEs
Shaoshuai Du, Joze M. Rozanec, Andy Pimentel, Ana-Lucia Varbanescu
Main category: cs.LG
TL;DR: Graph2TS: A quantile-graph conditioned VAE for time-series generation that separates structural backbone from stochastic residuals, enabling structure-controlled cross-modal generation from graphs to time series.
Details
Motivation: Current generative models for time series struggle to balance global temporal structure preservation with modeling stochastic local variations, especially for volatile signals with weak/irregular periodicity. Direct distribution matching can amplify noise or suppress meaningful patterns.
Method: Proposes a structure-residual perspective: time series = structural backbone + stochastic residual dynamics. Uses quantile-based transition graphs to capture global distributional and temporal dependencies. Develops Graph2TS, a quantile-graph conditioned variational autoencoder for cross-modal generation from structural graphs to time series.
Result: Experiments on sunspot, electricity load, ECG, and EEG signals show improved distributional fidelity, temporal alignment, and representativeness compared to diffusion- and GAN-based baselines.
Conclusion: Structure-controlled and cross-modal generation is a promising direction for time-series modeling, enabling preservation of global temporal organization while allowing controlled stochastic variation.
Abstract: Although recent generative models can produce time series with close marginal distributions, they often face a fundamental tension between preserving global temporal structure and modeling stochastic local variations, particularly for highly volatile signals with weak or irregular periodicity. Direct distribution matching in such settings can amplify noise or suppress meaningful temporal patterns. In this work, we propose a structure-residual perspective on time-series generation, viewing temporal data as the combination of a structural backbone and stochastic residual dynamics, thereby motivating the separation of global organization from sample-level variability. Based on this insight, we represent time-series structure using a quantile-based transition graph that compactly captures global distributional and temporal dependencies. Building on this representation, we propose Graph2TS, a quantile-graph conditioned variational autoencoder that performs cross-modal generation from structural graphs to time series. By conditioning generation on structure rather than labels or metadata, the model preserves global temporal organization while enabling controlled stochastic variation. Experiments on diverse datasets, including sunspot, electricity load, ECG, and EEG signals, demonstrate improved distributional fidelity, temporal alignment, and representativeness compared to diffusion- and GAN-based baselines, highlighting structure-controlled and cross-modal generation as a promising direction for time-series modeling.
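The quantile-based transition graph at the core of Graph2TS can be sketched under a straightforward reading of the abstract: bin the series into equally populated quantile levels and count level-to-level transitions (the paper's exact construction may differ):

```python
import numpy as np

def quantile_transition_graph(series, n_bins=4):
    """Discretize a series into equally populated quantile bins and return the
    row-normalized bin-to-bin transition matrix: a compact structural summary
    of the series' global distributional and temporal dependencies."""
    series = np.asarray(series, dtype=float)
    # Interior quantile edges partition values into n_bins equal-mass bins.
    edges = np.quantile(series, np.linspace(0, 1, n_bins + 1)[1:-1])
    states = np.digitize(series, edges)          # bin index per time step
    counts = np.zeros((n_bins, n_bins))
    for a, b in zip(states[:-1], states[1:]):
        counts[a, b] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    return counts / np.where(row_sums == 0, 1, row_sums)

rng = np.random.default_rng(0)
walk = np.cumsum(rng.normal(size=500))           # a volatile, aperiodic signal
P = quantile_transition_graph(walk, n_bins=4)
```

In the paper this graph serves as the conditioning input to the VAE decoder, so the generated series inherits the backbone's structure while residual variability stays stochastic.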
[423] Model-Driven Learning-Based Physical Layer Authentication for Mobile Wi-Fi Devices
Yijia Guo, Junqing Zhang, Yao-Win Peter Hong, Stefano Tomasin
Main category: cs.LG
TL;DR: A learning-based physical layer authentication scheme that combines hypothesis testing with neural networks for IoT security, achieving near-optimal performance without requiring channel statistics knowledge.
Details
Motivation: Wireless IoT faces authentication risks due to the broadcast nature of communications. Existing physical layer authentication approaches either rely on impractical channel statistics (hypothesis testing) or are practical but suboptimal (deep learning). A solution is needed that combines theoretical optimality with practical applicability.
Method: Proposes LiteNP-Net: a lightweight neural network driven by a Neyman-Pearson detector. Incorporates conditional statistical models into the hypothesis testing framework to derive a theoretically optimal NP detector, then uses this as the foundation for the neural network design. Evaluated through simulations and a real-world Wi-Fi IoT testbed.
Result: Simulations show LiteNP-Net approaches NP detector performance without channel statistics knowledge. Experimental results in real-world Wi-Fi scenarios demonstrate LiteNP-Net outperforms conventional correlation-based methods and state-of-the-art Siamese-based methods.
Conclusion: The proposed learning-based PLA scheme successfully bridges gap between theoretical optimality and practical applicability, offering effective authentication for wireless IoT systems without requiring prior channel statistics.
Abstract: The rise of wireless technologies has made the Internet of Things (IoT) ubiquitous, but the broadcast nature of wireless communications exposes IoT to authentication risks. Physical layer authentication (PLA) offers a promising solution by leveraging unique characteristics of wireless channels. As a common approach in PLA, hypothesis testing yields a theoretically optimal Neyman-Pearson (NP) detector, but its reliance on channel statistics limits its practicality in real-world scenarios. In contrast, deep learning-based PLA approaches are practical but tend to be not optimal. To address these challenges, we proposed a learning-based PLA scheme driven by hypothesis testing and conducted extensive simulations and experimental evaluations using Wi-Fi. Specifically, we incorporated conditional statistical models into the hypothesis testing framework to derive a theoretically optimal NP detector. Building on this, we developed LiteNP-Net, a lightweight neural network driven by the NP detector. Simulation results demonstrated that LiteNP-Net could approach the performance of the NP detector even without prior knowledge of the channel statistics. To further assess its effectiveness in practical environments, we deployed an experimental testbed using Wi-Fi IoT development kits in various real-world scenarios. Experimental results demonstrated that the LiteNP-Net outperformed the conventional correlation-based method as well as state-of-the-art Siamese-based methods.
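The NP detector the paper builds on is a likelihood-ratio threshold test. A toy Gaussian channel instance (the distributional assumptions below are illustrative only; the paper's conditional statistical models are not reproduced here):

```python
import numpy as np

def np_detector(obs, prev, sigma_legit, sigma_attack, threshold=1.0):
    """Neyman-Pearson authentication as a log-likelihood-ratio test.

    Toy model: under H0 (legitimate) each entry of obs ~ N(prev, sigma_legit^2);
    under H1 (attacker) each entry ~ N(0, sigma_attack^2). Accept H0 when the
    log ratio exceeds log(threshold); the NP lemma sets the threshold from the
    allowed false-alarm rate."""
    log_l0 = (-0.5 * np.sum((obs - prev) ** 2) / sigma_legit ** 2
              - obs.size * np.log(sigma_legit))
    log_l1 = (-0.5 * np.sum(obs ** 2) / sigma_attack ** 2
              - obs.size * np.log(sigma_attack))
    return bool(log_l0 - log_l1 > np.log(threshold))

rng = np.random.default_rng(1)
h = rng.normal(size=64)                 # previous legitimate channel estimate
legit = h + 0.1 * rng.normal(size=64)   # same user, small channel variation
attacker = rng.normal(size=64)          # statistically independent channel
```

LiteNP-Net's contribution is approximating this test with a lightweight network so that the channel statistics (here assumed known) need not be available at deployment.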
[424] Breaking the Capability Ceiling of LLM Post-Training by Reintroducing Markov States
Yurun Yuan, Tengyang Xie
Main category: cs.LG
TL;DR: Introducing explicit Markov states in RL for LLM post-training breaks performance ceilings by reducing sample complexity and enabling discovery of novel reasoning strategies beyond pre-trained patterns.
Details
Motivation: Current RL approaches for LLM post-training face a "capability ceiling" where they merely refine existing patterns rather than enabling novel discovery, due to reliance on ever-expanding action histories instead of compact Markov states.
Method: Revisits classical RL principle of explicit Markov states for LLM post-training, providing theoretical guarantees and empirical validation through structured Markovian representations that replace history-as-state modeling.
Result: Markov state-based RL consistently breaks performance boundaries across complex logic puzzles, demonstrating significantly reduced sample complexity and enabling discovery of genuinely new reasoning capabilities.
Conclusion: Moving beyond history-as-state modeling to structured Markovian representations is essential for unlocking open-ended discovery and new reasoning capabilities in generative AI systems.
Abstract: Reinforcement learning (RL) has become a standard paradigm for post-training and aligning Large Language Models (LLMs), yet recent evidence suggests it faces a persistent “capability ceiling”: unlike classical RL systems that discover novel strategies, RL for LLMs often acts as a mere refiner of patterns already latent in pre-trained weights. In this work, we identify a fundamental structural bottleneck: while classical RL relies on compact, informative Markov states, current LLM post-training formulations are tethered to an ever-expanding history of actions. We revisit a classical principle long central to RL yet absent from LLM post-training: explicit Markov states. Theoretically, we provide rigorous guarantees demonstrating that leveraging estimated Markov states can significantly reduce sample complexity. Empirically, we show that introducing Markov states consistently breaks the performance boundaries of standard RL post-training across a suite of complex logic puzzles. Our findings suggest that moving beyond “history-as-state” modeling in favor of structured Markovian representations is essential for unlocking open-ended discovery and genuinely new reasoning capabilities in Generative AI.
[425] A Super Fast K-means for Indexing Vector Embeddings
Leonardo Kuffo, Sven Hepkema, Peter Boncz
Main category: cs.LG
TL;DR: SuperKMeans is an accelerated k-means variant for clustering high-dimensional embeddings that achieves 7x speedup on CPUs and 4x on GPUs while maintaining centroid quality for similarity search tasks.
Details
Motivation: Traditional k-means clustering for high-dimensional vector embeddings is computationally expensive, especially for large-scale similarity search applications. There's a need for faster clustering algorithms that maintain retrieval quality.
Method: SuperKMeans reduces data-access and compute overhead by pruning unnecessary dimensions for centroid assignment. It also introduces Early Termination by Recall, which stops iterations when centroid quality for retrieval tasks stops improving.
Result: Achieves up to 7x faster clustering than FAISS and Scikit-Learn on CPUs, and up to 4x faster than cuVS on GPUs, while maintaining centroid quality for vector similarity search tasks.
Conclusion: SuperKMeans provides significant speed improvements for clustering high-dimensional embeddings without compromising retrieval quality, making it practical for large-scale similarity search applications.
Abstract: We present SuperKMeans: a k-means variant designed for clustering collections of high-dimensional vector embeddings. SuperKMeans’ clustering is up to 7x faster than FAISS and Scikit-Learn on modern CPUs and up to 4x faster than cuVS on GPUs (Figure 1), while maintaining the quality of the resulting centroids for vector similarity search tasks. SuperKMeans acceleration comes from reducing data-access and compute overhead by reliably and efficiently pruning dimensions that are not needed to assign a vector to a centroid. Furthermore, we present Early Termination by Recall, a novel mechanism that early-terminates k-means when the quality of the centroids for retrieval tasks stops improving across iterations. In practice, this further reduces runtimes without compromising retrieval quality. We open-source our implementation at https://github.com/cwida/SuperKMeans
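The dimension-pruning idea can be illustrated with partial distances: the squared distance over a prefix of dimensions lower-bounds the full distance, so centroids can be discarded early (SuperKMeans' actual pruning criterion is more elaborate and reliability-aware; this shows only the basic mechanism):

```python
import numpy as np

def pruned_assign(x, centroids, n_lead_dims=8):
    """Nearest-centroid assignment with partial-distance pruning.

    The squared distance over the leading dimensions lower-bounds the full
    distance, so a centroid is discarded as soon as its partial distance
    already exceeds the best full distance seen so far."""
    best_idx, best_dist = -1, np.inf
    for i, c in enumerate(centroids):
        partial = np.sum((x[:n_lead_dims] - c[:n_lead_dims]) ** 2)
        if partial >= best_dist:        # lower bound too large: skip the rest
            continue
        full = partial + np.sum((x[n_lead_dims:] - c[n_lead_dims:]) ** 2)
        if full < best_dist:
            best_idx, best_dist = i, full
    return best_idx

rng = np.random.default_rng(2)
C = rng.normal(size=(16, 128))          # 16 centroids in 128 dimensions
x = C[5] + 0.01 * rng.normal(size=128)  # a vector close to centroid 5
```

Because the pruning test only skips a centroid when its lower bound already loses, the assignment is exact; the speedup comes from the full distance rarely being needed.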
[426] AgenticRS-EnsNAS: Ensemble-Decoupled Self-Evolving Architecture Search
Yun Chen, Moyu Zhang, Jinxin Hu, Yu Zhang, Xiaoyi Zeng
Main category: cs.LG
TL;DR: Ensemble-Decoupled Architecture Search framework reduces NAS validation cost from O(M) to O(1) by predicting ensemble performance from single-learner evaluation, enabling faster architecture iteration in industrial systems.
Details
Motivation: Industrial NAS faces prohibitive O(M) validation costs when evaluating candidate architectures against deployed ensembles (M=50-200 models), severely limiting architecture iteration frequency in production systems.
Method: Proposes Ensemble-Decoupled Theory with sufficient condition for monotonic ensemble improvement using lightweight dual-learner training. Unifies three solution strategies: closed-form optimization for continuous parameters, constrained differentiable optimization for intractable continuous parameters, and LLM-driven search for discrete architectures.
Result: Theoretical framework reduces per-candidate search cost from O(M) to O(1) while maintaining O(M) deployment cost only for validated winners. Reveals two orthogonal improvement mechanisms: base diversity gain and accuracy gain.
Conclusion: Provides actionable design principles for industrial-scale NAS by decoupling architecture search from full ensemble training, enabling more frequent architecture iteration in production systems.
Abstract: Neural Architecture Search (NAS) deployment in industrial production systems faces a fundamental validation bottleneck: verifying a single candidate architecture pi requires evaluating the deployed ensemble of M models, incurring prohibitive O(M) computational cost per candidate. This cost barrier severely limits architecture iteration frequency in real-world applications where ensembles (M=50-200) are standard for robustness. This work introduces Ensemble-Decoupled Architecture Search, a framework that leverages ensemble theory to predict system-level performance from single-learner evaluation. We establish the Ensemble-Decoupled Theory with a sufficient condition for monotonic ensemble improvement under homogeneity assumptions: a candidate architecture pi yields lower ensemble error than the current baseline if rho(pi) < rho(pi_old) - (M / (M - 1)) * (Delta E(pi) / sigma^2(pi)), where Delta E, rho, and sigma^2 are estimable from lightweight dual-learner training. This decouples architecture search from full ensemble training, reducing per-candidate search cost from O(M) to O(1) while maintaining O(M) deployment cost only for validated winners. We unify solution strategies across pipeline continuity: (1) closed-form optimization for tractable continuous pi (exemplified by feature bagging in CTR prediction), (2) constrained differentiable optimization for intractable continuous pi, and (3) LLM-driven search with iterative monotonic acceptance for discrete pi. The framework reveals two orthogonal improvement mechanisms – base diversity gain and accuracy gain – providing actionable design principles for industrial-scale NAS. All theoretical derivations are rigorous with detailed proofs deferred to the appendix. Comprehensive empirical validation will be included in the journal extension of this work.
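The sufficient condition quoted in the abstract is directly checkable once the dual-learner estimates of Delta E, rho, and sigma^2 are in hand:

```python
def accepts_candidate(rho_new, rho_old, delta_e, sigma_sq, M):
    """Monotonic-improvement acceptance test from the abstract's sufficient
    condition: the candidate's correlation rho(pi) must undercut the
    baseline's by more than the scaled single-learner error increase
    (M / (M - 1)) * (Delta E / sigma^2)."""
    return rho_new < rho_old - (M / (M - 1)) * (delta_e / sigma_sq)

# A candidate that slightly hurts single-learner error (delta_e > 0) can
# still be accepted if it buys enough diversity (lower rho):
ok = accepts_candidate(rho_new=0.30, rho_old=0.40, delta_e=0.005,
                       sigma_sq=0.2, M=100)
```

This is the sense in which the framework exposes two orthogonal improvement mechanisms: a candidate can win via accuracy gain (delta_e < 0) or via diversity gain (lower rho), and the test prices one against the other.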
[427] ODySSeI: An Open-Source End-to-End Framework for Automated Detection, Segmentation, and Severity Estimation of Lesions in Invasive Coronary Angiography Images
Anand Choudhary, Xiaowu Sun, Thabo Mahendiran, Ortal Senouf, Denise Auberson, Bernard De Bruyne, Stephane Fournier, Olivier Muller, Emmanuel Abbé, Pascal Frossard, Dorina Thanou
Main category: cs.LG
TL;DR: ODySSeI is an open-source framework for automated detection, segmentation, and severity estimation of coronary artery lesions in invasive coronary angiography images using deep learning and novel augmentation techniques.
Details
Motivation: Invasive Coronary Angiography (ICA) interpretation is subjective and prone to operator variability, creating a need for automated, objective analysis tools for coronary artery disease assessment.
Method: Deep learning-based lesion detection and segmentation models trained with Pyramidal Augmentation Scheme (PAS) for robustness, plus a quantitative coronary angiography-free Lesion Severity Estimation technique that computes Minimum Lumen Diameter and diameter stenosis from predicted lesion geometry.
Result: PAS yields 2.5-fold increase in lesion detection performance; LSE achieves high accuracy with MLD predictions differing by only ±2-3 pixels from ground truth; processes images in seconds on CPU/fraction of second on GPU.
Conclusion: ODySSeI establishes a comprehensive open-source framework for automated, reproducible, and scalable ICA analysis suitable for real-time clinical decision-making.
Abstract: Invasive Coronary Angiography (ICA) is the clinical gold standard for the assessment of coronary artery disease. However, its interpretation remains subjective and prone to intra- and inter-operator variability. In this work, we introduce ODySSeI: an Open-source end-to-end framework for automated Detection, Segmentation, and Severity estimation of lesions in ICA images. ODySSeI integrates deep learning-based lesion detection and lesion segmentation models trained using a novel Pyramidal Augmentation Scheme (PAS) to enhance robustness and real-time performance across diverse patient cohorts (2149 patients from Europe, North America, and Asia). Furthermore, we propose a quantitative coronary angiography-free Lesion Severity Estimation (LSE) technique that directly computes the Minimum Lumen Diameter (MLD) and diameter stenosis from the predicted lesion geometry. Extensive evaluation on both in-distribution and out-of-distribution clinical datasets demonstrates ODySSeI’s strong generalizability. Our PAS yields large performance gains in highly complex tasks as compared to relatively simpler ones, notably, a 2.5-fold increase in lesion detection performance versus a 1-3% increase in lesion segmentation performance over their respective baselines. Our LSE technique achieves high accuracy, with predicted MLD values differing by only ±2-3 pixels from the corresponding ground truths. On average, ODySSeI processes a raw ICA image within only a few seconds on a CPU and in a fraction of a second on a GPU and is available as a plug-and-play web interface at swisscardia.epfl.ch. Overall, this work establishes ODySSeI as a comprehensive and open-source framework which supports automated, reproducible, and scalable ICA analysis for real-time clinical decision-making.
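The severity measure reduces to a one-line formula once the MLD and a reference diameter are known; the standard QCA definition is sketched below (in the paper MLD is derived from predicted lesion geometry, whereas here both inputs are given directly):

```python
def diameter_stenosis(mld, reference_diameter):
    """Percent diameter stenosis from the minimum lumen diameter (MLD) and a
    reference (healthy) vessel diameter - the standard quantitative coronary
    angiography (QCA) definition."""
    return 100.0 * (1.0 - mld / reference_diameter)

# A lesion narrowing a 3.0 mm vessel down to 1.2 mm at its tightest point:
severity = diameter_stenosis(mld=1.2, reference_diameter=3.0)
```
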
[428] Continual Learning as Shared-Manifold Continuation Under Compatible Shift
Henry J. Kobs
Main category: cs.LG
TL;DR: SPMA-OG: A geometry-preserving continual learning method that maintains shared latent manifold structure across tasks through anchor regularization and relational geometry preservation.
Details
Motivation: Existing continual learning methods focus on preserving old behavior through parameter regularization, output matching, or replay, but they don't directly control how latent representations evolve. The authors propose a geometric approach for scenarios where old and new data should remain on the same latent support.
Method: Support-Preserving Manifold Assimilation (SPMA) with geometry-preserving variant SPMA-OG combines: 1) sparse replay, 2) output distillation, 3) relational geometry preservation, 4) local smoothing, and 5) chart-assignment regularization on old anchors to maintain shared manifold structure.
Result: On CIFAR10 and Tiny-ImageNet with compatible shifts, SPMA-OG improves old-task retention and representation preservation over sparse replay baselines while maintaining competitive new-task accuracy. On synthetic atlas-manifold benchmark, achieves near-perfect anchor-geometry preservation with improved new-task accuracy.
Conclusion: Geometry-aware anchor regularization provides a useful inductive bias for continual learning when preserving shared latent support is more important than creating new representations.
Abstract: Continual learning methods usually preserve old behavior by regularizing parameters, matching old outputs, or replaying previous examples. These strategies can reduce forgetting, but they do not directly specify how the latent representation should evolve. We study a narrower geometric alternative for the regime where old and new data should remain on the same latent support: continual learning as continuation of a shared manifold. We instantiate this view within Support-Preserving Manifold Assimilation (SPMA) and evaluate a geometry-preserving variant, SPMA-OG, that combines sparse replay, output distillation, relational geometry preservation, local smoothing, and chart-assignment regularization on old anchors. On representative compatible-shift CIFAR10 and Tiny-ImageNet runs, SPMA-OG improves over sparse replay baselines in old-task retention and representation-preservation metrics while remaining competitive on new-task accuracy. On a controlled synthetic atlas-manifold benchmark, it achieves near-perfect anchor-geometry preservation while also improving new-task accuracy over replay. These results provide evidence that geometry-aware anchor regularization is a useful inductive bias when continual learning should preserve a shared latent support rather than create a new one.
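One of SPMA-OG's ingredients, relational geometry preservation on old anchors, plausibly amounts to penalizing changes in pairwise anchor distances; the exact loss is an assumption here, not taken from the paper:

```python
import numpy as np

def relational_geometry_loss(old_emb, new_emb):
    """Mean squared change in pairwise anchor distances between the old and
    new feature spaces. Scale changes are penalized; rigid rotations, which
    preserve distances, are not - so the latent support can drift globally
    while its internal geometry stays fixed."""
    def pdist(Z):
        diff = Z[:, None, :] - Z[None, :, :]
        return np.sqrt((diff ** 2).sum(axis=-1))
    return float(((pdist(old_emb) - pdist(new_emb)) ** 2).mean())

rng = np.random.default_rng(0)
anchors = rng.normal(size=(10, 5))
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))   # a random rotation/reflection
```
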
[429] Federated Hyperdimensional Computing for Resource-Constrained Industrial IoT
Nikita Zeulin, Olga Galinina, Nageen Himayat, Sergey Andreev
Main category: cs.LG
TL;DR: Federated hyperdimensional computing enables lightweight collaborative learning for resource-constrained Industrial IoT devices by exchanging only prototype representations.
Details
Motivation: Industrial IoT systems face strict constraints in memory, compute capability, and wireless bandwidth, making deployment of advanced analytics like predictive maintenance challenging. There's a need for lightweight learning paradigms suitable for resource-constrained edge devices.
Method: Integrates hyperdimensional computing (HDC) into a federated learning framework where devices exchange only prototype representations instead of full model parameters, significantly reducing communication overhead while leveraging HDC’s energy-efficient training and inference properties.
Result: Numerical results show federated HDC supports collaborative learning in IIoT with fast convergence speed and communication efficiency, demonstrating its potential for distributed intelligence in large-scale resource-constrained environments.
Conclusion: HDC represents a lightweight and resilient framework for distributed intelligence in large-scale and resource-constrained IIoT environments, offering an efficient solution for edge device analytics.
Abstract: In the Industrial Internet of Things (IIoT) systems, edge devices often operate under strict constraints in memory, compute capability, and wireless bandwidth. These limitations challenge the deployment of advanced data analytics tasks, such as predictive and prescriptive maintenance. In this work, we explore hyperdimensional computing (HDC) as a lightweight learning paradigm for resource-constrained IIoT. Conventional centralized HDC leverages the properties of high-dimensional vector spaces to enable energy-efficient training and inference. We integrate this paradigm into a federated learning (FL) framework where devices exchange only prototype representations, which significantly reduces communication overhead. Our numerical results highlight the potential of federated HDC to support collaborative learning in IIoT with fast convergence speed and communication efficiency. These results indicate that HDC represents a lightweight and resilient framework for distributed intelligence in large-scale and resource-constrained IIoT environments.
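The prototype-only exchange can be sketched with a random-projection HDC encoder; the encoder, the summation-based bundling rule, and the toy data below are illustrative choices, not the paper's specific setup:

```python
import numpy as np

def encode(features, projection):
    """Random-projection HDC encoder: map features to a bipolar hypervector."""
    return np.sign(projection @ features)

def local_prototypes(X, y, projection, n_classes):
    """Each device bundles (sums) the hypervectors of its own samples per class."""
    protos = np.zeros((n_classes, projection.shape[0]))
    for features, label in zip(X, y):
        protos[label] += encode(features, projection)
    return protos

def federated_aggregate(all_protos):
    """Server fuses the devices' class prototypes by element-wise summation;
    only these prototypes cross the network, never raw data or model weights."""
    return np.sum(all_protos, axis=0)

def classify(features, projection, global_protos):
    hv = encode(features, projection)
    # Cosine similarity of the query hypervector to each class prototype.
    sims = global_protos @ hv / (np.linalg.norm(global_protos, axis=1)
                                 * np.linalg.norm(hv) + 1e-12)
    return int(np.argmax(sims))

rng = np.random.default_rng(3)
P_proj = rng.choice([-1.0, 1.0], size=(2000, 2))  # shared random projection

def make_device_data(n):
    X0 = rng.normal([3.0, 0.0], 0.3, size=(n, 2))  # class 0 cluster
    X1 = rng.normal([0.0, 3.0], 0.3, size=(n, 2))  # class 1 cluster
    return np.vstack([X0, X1]), np.array([0] * n + [1] * n)

device_protos = [local_prototypes(*make_device_data(20), P_proj, 2)
                 for _ in range(2)]                # two devices train locally
G = federated_aggregate(np.array(device_protos))
```

The communication efficiency highlighted in the abstract comes from each round transmitting one hypervector per class instead of gradients or full model parameters.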
[430] Fine-tuning Timeseries Predictors Using Reinforcement Learning
Hugo Cazaux, Ralph Rudd, Hlynur Stefánsson, Sverrir Ólafsson, Eyjólfur Ingi Ásgeirsson
Main category: cs.LG
TL;DR: This paper presents reinforcement learning algorithms for fine-tuning financial forecasters, showing performance improvements through transfer learning from supervised to RL fine-tuning.
Details
Motivation: To improve financial forecasting models by applying reinforcement learning fine-tuning to models initially trained with supervised learning, leveraging the benefits of both approaches.
Method: Proposes an implementation plan for backpropagating RL loss to models trained with supervised learning, compares three major RL algorithms for fine-tuning financial forecasters, and evaluates performance before and after fine-tuning.
Result: Fine-tuning increased performance and conferred transfer-learning properties on the models, demonstrating the benefits of RL fine-tuning for financial forecasting.
Conclusion: RL fine-tuning improves financial forecasting performance and enables transfer learning, with practical implementation guidance provided for practitioners.
Abstract: This chapter presents three major reinforcement learning algorithms used for fine-tuning financial forecasters. We propose a clear implementation plan for backpropagating the loss of a reinforcement learning task to a model trained using supervised learning, and compare the performance before and after the fine-tuning. We find an increase in performance after fine-tuning, and transfer learning properties to the models, indicating the benefits of fine-tuning. We also highlight the tuning process and empirical results for future implementation by practitioners.
[431] How Out-of-Equilibrium Phase Transitions can Seed Pattern Formation in Trained Diffusion Models
Luca Ambrogioni
Main category: cs.LG
TL;DR: Diffusion model generation is reinterpreted as out-of-equilibrium phase transitions with critical regimes where spatial fluctuations amplify to create large-scale structure.
Details
Motivation: To develop a theoretical framework explaining how diffusion models generate coherent patterns beyond training data, connecting deep learning to non-equilibrium physics principles.
Method: Proposes interpreting reverse diffusion as phase transitions, uses analytically tractable patch score models to show symmetry-breaking bifurcations become spatially extended critical phenomena, connects to Ginzburg-Landau field theories, and validates with empirical analysis of trained convolutional diffusion models.
Result: Theory reveals critical regimes with mode softening and growing spatial correlations in diffusion models; demonstrates practical relevance through targeted perturbations at critical times improving generation control.
Conclusion: Non-equilibrium critical phenomena provide a unifying principle for understanding and potentially improving diffusion model behavior, bridging deep learning with physics of pattern formation.
Abstract: In this work, we propose a theoretical framework that interprets the generation process in trained diffusion models as an instance of out-of-equilibrium phase transitions. We argue that, rather than evolving smoothly from noise to data, reverse diffusion passes through a critical regime in which small spatial fluctuations are amplified and seed the emergence of large-scale structure. Our central insight is that architectural constraints, such as locality, sparsity, and translation equivariance, transform memorization-driven instabilities into collective spatial modes, enabling the formation of coherent patterns beyond the training data. Using analytically tractable patch score models, we show how classical symmetry-breaking bifurcations generalize into spatially extended critical phenomena described by softening Fourier modes and growing correlation lengths. We further connect these dynamics to effective field theories of the Ginzburg-Landau type and to mechanisms of pattern formation in non-equilibrium physics. Empirical results on trained convolutional diffusion models corroborate the theory, revealing signatures of criticality including mode softening and rapid growth of spatial correlations. Finally, we demonstrate that this critical regime has practical relevance: targeted perturbations, such as classifier-free guidance pulses applied at the estimated critical time, significantly improve generation control. Together, these findings position non-equilibrium critical phenomena as a unifying principle for understanding, and potentially improving, the behavior of modern diffusion models.
[432] Spectral Alignment in Forward-Backward Representations via Temporal Abstraction
Seyed Mahdi B. Azad, Jasper Hoffmann, Iman Nematollahi, Hao Zhu, Abhinav Valada, Joschka Boedecker
Main category: cs.LG
TL;DR: Temporal abstraction helps align high-rank continuous environment dynamics with low-rank forward-backward representation learning by acting as a spectral low-pass filter.
Details
Motivation: Forward-backward representations struggle with spectral mismatch between high-rank continuous environment dynamics and low-rank architectural bottlenecks, making accurate successor representation learning difficult in continuous spaces.
Method: Analyze temporal abstraction as a mechanism to mitigate spectral mismatch by characterizing spectral properties of transition operators, showing temporal abstraction acts as a low-pass filter that suppresses high-frequency components.
Result: Temporal abstraction reduces effective rank of induced successor representation while preserving formal bounds on value function error, enabling stable forward-backward learning especially at high discount factors.
Conclusion: Temporal abstraction is a principled mechanism for shaping spectral structure of MDPs and enabling effective long-horizon representations in continuous control by aligning high-rank dynamics with low-rank representation learning.
Abstract: Forward-backward (FB) representations provide a powerful framework for learning the successor representation (SR) in continuous spaces by enforcing a low-rank factorization. However, a fundamental spectral mismatch often exists between the high-rank transition dynamics of continuous environments and the low-rank bottleneck of the FB architecture, making accurate low-rank representation learning difficult. In this work, we analyze temporal abstraction as a mechanism to mitigate this mismatch. By characterizing the spectral properties of the transition operator, we show that temporal abstraction acts as a low-pass filter that suppresses high-frequency spectral components. This suppression reduces the effective rank of the induced SR while preserving a formal bound on the resulting value function error. Empirically, we show that this alignment is a key factor for stable FB learning, particularly at high discount factors where bootstrapping becomes error-prone. Our results identify temporal abstraction as a principled mechanism for shaping the spectral structure of the underlying MDP and enabling effective long-horizon representations in continuous control.
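The low-pass intuition is easy to see on a toy transition operator. Below, a lazy random walk on a ring (my example, not the paper's setup) has Fourier eigenmodes with eigenvalues $0.5+0.5\cos(2\pi j/n)$; raising the operator to the $k$-th power, the analogue of $k$-step temporal abstraction, geometrically damps the high-frequency modes and lowers a spectral-entropy effective rank (one common proxy; the paper's exact rank measure may differ):

```python
import numpy as np

n = 64
# Lazy random walk on a ring: symmetric, row-stochastic, with Fourier
# eigenvalues 0.5 + 0.5*cos(2*pi*j/n), so high frequencies j sit near 0.
P = 0.5 * np.eye(n) + 0.25 * (np.roll(np.eye(n), 1, axis=1)
                              + np.roll(np.eye(n), -1, axis=1))

def effective_rank(M):
    # Exponential of the entropy of the normalized singular values
    # (a standard soft-rank proxy).
    s = np.linalg.svd(M, compute_uv=False)
    p = s / s.sum()
    return float(np.exp(-(p * np.log(p + 1e-12)).sum()))

Pk = np.linalg.matrix_power(P, 16)  # 16-step temporal abstraction
print(effective_rank(P), effective_rank(Pk))
```

Only the slow, low-frequency modes survive in `Pk`, so a low-rank FB bottleneck has far less spectrum left to miss.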
[433] The $\mathbf{Y}$-Combinator for LLMs: Solving Long-Context Rot with $λ$-Calculus
Amartya Roy, Rasul Tutunov, Xiaotong Ji, Matthieu Zimmer, Haitham Bou-Ammar
Main category: cs.LG
TL;DR: λ-RLM: A typed functional framework for long-context reasoning that replaces open-ended recursive code generation with a λ-calculus-based runtime using pre-verified combinators, providing formal guarantees and improved performance.
Details
Motivation: Existing Recursive Language Models (RLMs) use open-ended read-eval-print loops where models generate arbitrary control code, making execution difficult to verify, predict, and analyze. There's a need for more structured, verifiable approaches to long-context reasoning.
Method: Introduces λ-RLM framework that replaces free-form recursive code generation with a typed functional runtime grounded in λ-calculus. It executes a compact library of pre-verified combinators and uses neural inference only on bounded leaf subproblems, turning recursive reasoning into structured functional programs with explicit control flow.
Result: Outperforms standard RLM in 29 of 36 model-task comparisons across four long-context reasoning tasks and nine base models, improves average accuracy by up to +21.9 points across model tiers, and reduces latency by up to 4.1x. Provides formal guarantees including termination, closed-form cost bounds, controlled accuracy scaling, and optimal partition rules.
Conclusion: Typed symbolic control yields a more reliable and efficient foundation for long-context reasoning than open-ended recursive code generation, with λ-RLM providing both theoretical guarantees and empirical performance improvements.
Abstract: LLMs are increasingly used as general-purpose reasoners, but long inputs remain bottlenecked by a fixed context window. Recursive Language Models (RLMs) address this by externalising the prompt and recursively solving subproblems. Yet existing RLMs depend on an open-ended read-eval-print loop (REPL) in which the model generates arbitrary control code, making execution difficult to verify, predict, and analyse. We introduce $λ$-RLM, a framework for long-context reasoning that replaces free-form recursive code generation with a typed functional runtime grounded in $λ$-calculus. It executes a compact library of pre-verified combinators and uses neural inference only on bounded leaf subproblems, turning recursive reasoning into a structured functional program with explicit control flow. We show that $λ$-RLM admits formal guarantees absent from standard RLMs, including termination, closed-form cost bounds, controlled accuracy scaling with recursion depth, and an optimal partition rule under a simple cost model. Empirically, across four long-context reasoning tasks and nine base models, $λ$-RLM outperforms standard RLM in 29 of 36 model-task comparisons, improves average accuracy by up to +21.9 points across model tiers, and reduces latency by up to 4.1x. These results show that typed symbolic control yields a more reliable and efficient foundation for long-context reasoning than open-ended recursive code generation. The complete implementation of $λ$-RLM is open-sourced for the community at: https://github.com/lambda-calculus-LLM/lambda-RLM.
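A minimal sketch of the control-flow idea, with hypothetical combinator names (`split`, `map_reduce`) standing in for λ-RLM's pre-verified library: the recursion structure is fixed library code that terminates in a bounded number of steps, and only the leaf solver, here a trivial keyword counter standing in for a neural call, ever touches the content:

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")

def split(text: str, chunk_size: int) -> List[str]:
    """Deterministic partition of a long context into bounded leaves."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def map_reduce(leaves: List[T],
               leaf_solver: Callable[[T], R],
               combine: Callable[[R, R], R]) -> R:
    """Fold leaf answers with a combiner; terminates in len(leaves) steps."""
    results = [leaf_solver(leaf) for leaf in leaves]
    acc = results[0]
    for r in results[1:]:
        acc = combine(acc, r)
    return acc

# Stand-in leaf model: count a keyword per bounded chunk (a real system
# would invoke a neural model here).
leaf = lambda chunk: chunk.count("needle")
answer = map_reduce(split("hay needle hay " * 100, chunk_size=15),
                    leaf, lambda a, b: a + b)
print(answer)
```

Because the runtime only executes library combinators, termination and cost follow from the combinators' types rather than from trusting model-generated control code.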
[434] Trojan horse hunt in deep forecasting models: Insights from the European Space Agency competition
Krzysztof Kotowski, Ramez Shendy, Jakub Nalepa, Agata Kaczmarek, Dawid Płudowski, Piotr Wilczyński, Artur Janicki, Przemysław Biecek, Ambros Marzetta, Atul Pande, Lalit Chandra Routhu, Swapnil Srivastava, Evridiki Ntagiou
Main category: cs.LG
TL;DR: A competition focused on detecting trojan horse attacks in deep forecasting models for spacecraft telemetry, where triggers hidden in models can manipulate predictions when activated.
Details
Motivation: As deep forecasting models become critical for safety applications like space operations, they introduce security risks from trojan attacks that can manipulate predictions through hidden triggers.
Method: Organized a data science competition (Trojan Horse Hunt) with over 200 teams tasked with identifying triggers hidden in deep forecasting models for spacecraft telemetry data.
Result: The competition established a benchmark set, evaluation protocol, and collected best solutions for trigger identification in time series forecasting models.
Conclusion: The competition provides insights and research directions for detecting trojan attacks in forecasting models, with all materials publicly available for further study.
Abstract: Forecasting plays a crucial role in modern safety-critical applications, such as space operations. However, the increasing use of deep forecasting models introduces a new security risk of trojan horse attacks, carried out by hiding a backdoor in the training data or directly in the model weights. Once implanted, the backdoor is activated by a specific trigger pattern at test time, causing the model to produce manipulated predictions. We focus on this issue in our \textit{Trojan Horse Hunt} data science competition, where more than 200 teams faced the task of identifying triggers hidden in deep forecasting models for spacecraft telemetry. We describe the novel task formulation, benchmark set, evaluation protocol, and best solutions from the competition. We further summarize key insights and research directions for effective identification of triggers in time series forecasting models. All materials are publicly available on the official competition webpage https://www.kaggle.com/competitions/trojan-horse-hunt-in-space.
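The attack model described above can be illustrated with a toy: a forecaster that behaves normally unless a specific zero-mean trigger pattern appears in its input window, at which point it emits a manipulated prediction. Everything here is hypothetical; the competition's actual triggers and models are on the linked Kaggle page.

```python
import numpy as np

trigger = np.array([3.0, -3.0, 3.0, -3.0])  # hypothetical zero-mean trigger

def backdoored_forecast(window):
    """Toy backdoored model: ordinary persistence forecast unless the last
    four samples match the implanted trigger around their local mean."""
    if np.allclose(window[-4:], window[-4:].mean() + trigger, atol=0.5):
        return window[-1] + 100.0  # manipulated prediction
    return window[-1]              # benign behavior on clean telemetry

clean = np.linspace(0.0, 1.0, 32)
poisoned = clean.copy()
poisoned[-4:] = poisoned[-4:].mean() + trigger  # activate at test time
print(backdoored_forecast(clean), backdoored_forecast(poisoned))
```

The hunt task is the inverse problem: recover `trigger` given only the trained model, which is what made the competition non-trivial.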
[435] Var-JEPA: A Variational Formulation of the Joint-Embedding Predictive Architecture – Bridging Predictive and Generative Self-Supervised Learning
Moritz Gögl, Christopher Yau
Main category: cs.LG
TL;DR: Var-JEPA reformulates JEPA as a variational latent-variable model, showing JEPA’s connection to generative modeling and enabling principled uncertainty quantification without ad-hoc regularization.
Details
Motivation: The paper aims to bridge the conceptual gap between Joint-Embedding Predictive Architecture (JEPA) and probabilistic generative modeling, showing that JEPA's separation from likelihood-based methods is more rhetorical than structural.
Method: The authors derive Variational JEPA (Var-JEPA) by showing JEPA’s connection to variational inference in coupled latent-variable models. They optimize a single Evidence Lower Bound (ELBO) to make the latent generative structure explicit, eliminating the need for ad-hoc anti-collapse regularizers.
Result: Var-JEPA achieves strong representation learning and downstream performance on tabular data (Var-T-JEPA), consistently improving over T-JEPA while remaining competitive with strong raw-feature baselines.
Conclusion: JEPA can be viewed as a deterministic specialization of variational latent-variable models, and making this connection explicit through Var-JEPA enables principled uncertainty quantification and better representation learning without heuristic regularization.
Abstract: The Joint-Embedding Predictive Architecture (JEPA) is often seen as a non-generative alternative to likelihood-based self-supervised learning, emphasizing prediction in representation space rather than reconstruction in observation space. We argue that the resulting separation from probabilistic generative modeling is largely rhetorical rather than structural: the canonical JEPA design, coupled encoders with a context-to-target predictor, mirrors the variational posteriors and learned conditional priors obtained when variational inference is applied to a particular class of coupled latent-variable models, and standard JEPA can be viewed as a deterministic specialization in which regularization is imposed via architectural and training heuristics rather than an explicit likelihood. Building on this view, we derive the Variational JEPA (Var-JEPA), which makes the latent generative structure explicit by optimizing a single Evidence Lower Bound (ELBO). This yields meaningful representations without ad-hoc anti-collapse regularizers and allows principled uncertainty quantification in the latent space. We instantiate the framework for tabular data (Var-T-JEPA) and achieve strong representation learning and downstream performance, consistently improving over T-JEPA while remaining competitive with strong raw-feature baselines.
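A minimal sketch of the kind of ELBO the abstract describes, for a coupled latent-variable model with context $x$, target $y$, and latents $z_x, z_y$ (notation mine; the paper's exact factorization may differ):

```latex
% q_phi(z_y | y) plays the role of JEPA's target encoder; the
% context-to-target predictor acts as a learned conditional prior
% p_theta(z_y | z_x), with z_x produced by the context encoder.
\log p_\theta(y \mid x) \;\ge\;
  \mathbb{E}_{q_\phi(z_y \mid y)}\!\big[\log p_\theta(y \mid z_y)\big]
  \;-\; \mathrm{KL}\!\big(q_\phi(z_y \mid y)\,\big\|\,p_\theta(z_y \mid z_x)\big),
\qquad z_x = f_\theta(x).
```

Taking the encoder and conditional-prior variances to zero collapses the KL term into a deterministic latent-prediction loss, which is the sense in which the abstract calls standard JEPA a "deterministic specialization".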
[436] GO-GenZip: Goal-Oriented Generative Sampling and Hybrid Compression
Pietro Talli, Qi Liao, Alessandro Lieto, Parijat Bhattacharjee, Federico Chiariotti, Andrea Zanella
Main category: cs.LG
TL;DR: GenAI-driven sampling and hybrid compression framework for network telemetry that jointly optimizes what to observe and how to encode it based on downstream task relevance, achieving over 50% reduction in sampling and data transfer costs.
Details
Motivation: Current network telemetry pipelines generate massive streams of fine-grained KPIs from distributed sources to central aggregators, making data storage, transmission, and real-time analysis increasingly unsustainable. There's a need for more efficient approaches that go beyond passive compression of fully observed data.
Method: A generative AI-driven framework that redesigns network telemetry from a goal-oriented perspective. It integrates adaptive sampling policies using adaptive masking techniques with generative modeling to identify patterns and preserve critical features across temporal and spatial dimensions. The selectively acquired data are processed through a hybrid compression scheme combining traditional lossless coding with GenAI-driven lossy compression.
Result: Experimental results on real network datasets demonstrate over 50% reductions in sampling and data transfer costs, while maintaining comparable reconstruction accuracy and goal-oriented analytical fidelity in downstream tasks.
Conclusion: The proposed GenAI-driven framework successfully addresses the sustainability challenges of current network telemetry pipelines by jointly optimizing observation and encoding based on downstream task relevance, achieving significant efficiency gains without compromising analytical quality.
Abstract: Current network data telemetry pipelines consist of massive streams of fine-grained Key Performance Indicators (KPIs) from multiple distributed sources towards central aggregators, making data storage, transmission, and real-time analysis increasingly unsustainable. This work presents a generative AI (GenAI)-driven sampling and hybrid compression framework that redesigns network telemetry from a goal-oriented perspective. Unlike conventional approaches that passively compress fully observed data, our approach jointly optimizes what to observe and how to encode it, guided by the relevance of information to downstream tasks. The framework integrates adaptive sampling policies, using adaptive masking techniques, with generative modeling to identify patterns and preserve critical features across temporal and spatial dimensions. The selectively acquired data are further processed through a hybrid compression scheme that combines traditional lossless coding with GenAI-driven, lossy compression. Experimental results on real network datasets demonstrate over 50% reductions in sampling and data transfer costs, while maintaining comparable reconstruction accuracy and goal-oriented analytical fidelity in downstream tasks.
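A toy rendering of the two stages, with a variance heuristic standing in for the learned sampling policy and coarse quantization plus zlib standing in for the GenAI-lossy/lossless hybrid codec (all names and choices here are illustrative, not the paper's):

```python
import json
import zlib

import numpy as np

rng = np.random.default_rng(1)
# Toy KPI matrix: 32 time steps x 16 counters (stand-in for telemetry).
kpi = rng.normal(size=(32, 16)).cumsum(axis=0)

# "What to observe": keep only the most dynamic columns. A trained sampling
# policy would replace this simple variance heuristic.
keep = np.argsort(kpi.std(axis=0))[-4:]
sampled = kpi[:, keep]

# "How to encode": coarse quantization (the lossy stage a generative model
# would later invert) followed by traditional lossless coding.
quantized = np.round(sampled, 1)
payload = zlib.compress(json.dumps(quantized.tolist()).encode())

full = zlib.compress(json.dumps(kpi.tolist()).encode())
print(len(payload), len(full), 1 - len(payload) / len(full))
```

The transferred payload shrinks on both axes at once: fewer observed KPIs and fewer bits per observation.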
[437] Conditioning Protein Generation via Hopfield Pattern Multiplicity
Jeffrey D. Varner
Main category: cs.LG
TL;DR: A method using attention bias to guide protein sequence generation toward functional subsets without retraining, addressing the calibration gap between representation and decoded sequences.
Details
Motivation: Existing protein sequence generation treats all stored sequences equally and cannot direct generation toward specific functional subsets of interest (e.g., binding, stability, specificity).
Method: Adds a single scalar parameter as bias to sampler’s attention logits to shift generation from full family toward user-specified subset, with no retraining or architecture changes. Uses small set of sequences and multiplicity ratio to control generation direction.
Result: Method enables exact conditioning at representation level but shows calibration gap in decoded sequences due to encoding dimensionality reduction. Experiments on five Pfam families confirm monotonic relationship between geometric separation and gap. Applied to omega-conotoxin peptides, produced over 1,000 candidates preserving pharmacophore and binding determinants.
Conclusion: Stochastic attention enables expanding handful of experimentally characterized sequences into diverse candidate libraries without retraining generative models, with calibration gap predictable by geometric separation measure.
Abstract: Protein sequence generation via stochastic attention produces plausible family members from small alignments without training, but treats all stored sequences equally and cannot direct generation toward a functional subset of interest. We show that a single scalar parameter, added as a bias to the sampler’s attention logits, continuously shifts generation from the full family toward a user-specified subset, with no retraining and no change to the model architecture. A practitioner supplies a small set of sequences (for example, hits from a binding screen) and a multiplicity ratio that controls how strongly generation favors them. The method is agnostic to what the subset represents: binding, stability, specificity, or any other property. We find that the conditioning is exact at the level of the sampler’s internal representation, but that the decoded sequence phenotype can fall short because the dimensionality reduction used to encode sequences does not always preserve the residue-level variation that defines the functional split. We term this discrepancy the calibration gap and show that it is predicted by a simple geometric measure of how well the encoding separates the functional subset from the rest of the family. Experiments on five Pfam families (Kunitz, SH3, WW, Homeobox, and Forkhead domains) confirm the monotonic relationship between separation and gap across a fourfold range of geometries. Applied to omega-conotoxin peptides targeting a calcium channel involved in pain signaling, curated seeding from 23 characterized binders produces over a thousand candidates that preserve the primary pharmacophore and all experimentally identified binding determinants. These results show that stochastic attention enables practitioners to expand a handful of experimentally characterized sequences into diverse candidate libraries without retraining a generative model.
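The logit-bias mechanism is simple enough to sketch exactly: adding $\log r$ to the subset's attention logits multiplies its unnormalized softmax mass by $r$, which is equivalent to storing each subset sequence $r$ times. This is a minimal sketch of the mechanism as described, not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(size=10)   # attention logits over 10 stored sequences
subset = np.array([0, 1, 2])   # indices of the functional subset (e.g. screen hits)

def attention_weights(scores, subset, bias):
    logits = scores.copy()
    logits[subset] += bias     # the single scalar conditioning parameter
    w = np.exp(logits - logits.max())
    return w / w.sum()

base = attention_weights(scores, subset, 0.0)
tilted = attention_weights(scores, subset, np.log(4.0))  # multiplicity ratio 4

# Adding log(r) multiplies the subset's unnormalized mass by r, exactly as
# if each subset sequence were stored r times.
print(base[subset].sum(), tilted[subset].sum())
```

The odds of sampling from the subset scale exactly by the multiplicity ratio, which is why a single scalar suffices to interpolate from the full family to the subset.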
[438] Revisiting Gene Ontology Knowledge Discovery with Hierarchical Feature Selection and Virtual Study Group of AI Agents
Cen Wan, Alex A. Freitas
Main category: cs.LG
TL;DR: Agentic AI framework for biological knowledge discovery using hierarchical feature selection on ageing-related Gene Ontology terms across model organisms
Details
Motivation: To leverage emerging agentic AI techniques for scientific discovery, specifically extracting meaningful ageing-related biological knowledge from Gene Ontology data.
Method: Proposes an agentic AI-based virtual study group that uses hierarchical feature selection methods to identify highly ageing-related Gene Ontology terms across four model organisms, then validates findings against existing literature
Result: Majority of AI agent-generated scientific claims are supported by existing literature, and the proposed internal mechanisms of the virtual study group prove important for knowledge discovery
Conclusion: Agentic AI frameworks show promise for biological knowledge discovery, with the virtual study group approach effectively extracting validated ageing-related insights from Gene Ontology data
Abstract: Large language models have achieved great success in multiple challenging tasks, and their capacity can be further boosted by the emerging agentic AI techniques. This new computing paradigm has already started revolutionising the traditional scientific discovery pipelines. In this work, we propose a novel agentic AI-based knowledge discovery-oriented virtual study group that aims to extract meaningful ageing-related biological knowledge considering highly ageing-related Gene Ontology terms that are selected by hierarchical feature selection methods. We investigate the performance of the proposed agentic AI framework by considering four different model organisms’ ageing-related Gene Ontology terms and validate the biological findings by reviewing existing research articles. It is found that the majority of the AI agent-generated scientific claims can be supported by existing literature, and the proposed internal mechanisms of the virtual study group also play an important role in the designed agentic AI-based knowledge discovery framework.
[439] Beyond Single Tokens: Distilling Discrete Diffusion Models via Discrete MMD
Emiel Hoogeboom, David Ruhe, Jonathan Heek, Thomas Mensink, Tim Salimans
Main category: cs.LG
TL;DR: D-MMD enables effective distillation of discrete diffusion models by adapting continuous domain techniques, maintaining quality/diversity and sometimes outperforming teachers
Details
Motivation: Discrete diffusion models lack effective distillation methods compared to continuous diffusion models which have many approaches to reduce sampling steps. Current discrete distillation methods collapse or fail to maintain quality.
Method: Discrete Moment Matching Distillation (D-MMD) adapts successful continuous domain distillation ideas to discrete diffusion models. The method matches moments between teacher and student models to enable effective knowledge transfer.
Result: D-MMD maintains high quality and diversity with sufficient sampling steps, demonstrated on both text and image datasets. The distilled generators can sometimes outperform their teacher models.
Conclusion: D-MMD successfully bridges the distillation gap between continuous and discrete diffusion models, enabling efficient sampling while preserving or improving generation quality.
Abstract: It is currently difficult to distill discrete diffusion models. In contrast, the continuous diffusion literature has many distillation methods that can reduce sampling steps to a handful. Our method, Discrete Moment Matching Distillation (D-MMD), leverages ideas that have been highly successful in the continuous domain. Whereas previous discrete distillation methods collapse, D-MMD maintains high quality and diversity (given sufficient sampling steps). This is demonstrated on both text and image datasets. Moreover, the newly distilled generators can outperform their teachers.
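The abstract does not spell out the estimator, but a plain squared-MMD between sets of discrete token sequences, with a Hamming-based kernel (my choice, not necessarily the paper's), shows the kind of quantity a moment-matching distillation objective compares between teacher and student samples:

```python
import numpy as np

rng = np.random.default_rng(0)

def hamming_kernel(X, Y):
    """k(x, y) = exp(-Hamming distance): a simple p.d. kernel on sequences."""
    d = (X[:, None, :] != Y[None, :, :]).sum(axis=2)
    return np.exp(-d.astype(float))

def mmd2(X, Y):
    """Biased estimator of squared MMD between sample sets X and Y."""
    return (hamming_kernel(X, X).mean()
            - 2 * hamming_kernel(X, Y).mean()
            + hamming_kernel(Y, Y).mean())

teacher = rng.integers(0, 5, size=(200, 8))       # "teacher" token samples
student_good = rng.integers(0, 5, size=(200, 8))  # same distribution
student_bad = rng.integers(0, 2, size=(200, 8))   # collapsed vocabulary
print(mmd2(teacher, student_good), mmd2(teacher, student_bad))
```

A collapsed student is penalized even though every individual token it emits is plausible, which is the failure mode single-token objectives miss.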
[440] Kolmogorov-Arnold causal generative models
Alejandro Almodóvar, Mar Elizo, Patricia A. Apellániz, Santiago Zazo, Juan Parras
Main category: cs.LG
TL;DR: KaCGM is a causal generative model using Kolmogorov-Arnold Networks for interpretable causal modeling of tabular data, enabling direct inspection of learned mechanisms while maintaining competitive performance.
Details
Motivation: Deep causal models often use opaque architectures that limit auditability in high-stakes domains, creating a need for models that combine expressive causal modeling with functional transparency.
Method: Proposes KaCGM where each structural equation is parameterized by a Kolmogorov-Arnold Network (KAN), enabling decomposition and inspection of causal mechanisms. Includes validation pipeline using distributional matching and independence diagnostics of exogenous variables.
Result: Competitive performance against state-of-the-art methods on synthetic and semi-synthetic benchmarks. Real-world cardiovascular case study demonstrates extraction of simplified structural equations and interpretable causal effects.
Conclusion: Expressive causal generative modeling and functional transparency can be achieved jointly, supporting trustworthy deployment in tabular decision-making settings.
Abstract: Causal generative models provide a principled framework for answering observational, interventional, and counterfactual queries from observational data. However, many deep causal models rely on highly expressive architectures with opaque mechanisms, limiting auditability in high-stakes domains. We propose KaCGM, a causal generative model for mixed-type tabular data where each structural equation is parameterized by a Kolmogorov–Arnold Network (KAN). This decomposition enables direct inspection of learned causal mechanisms, including symbolic approximations and visualization of parent–child relationships, while preserving query-agnostic generative semantics. We introduce a validation pipeline based on distributional matching and independence diagnostics of inferred exogenous variables, allowing assessment using observational data alone. Experiments on synthetic and semi-synthetic benchmarks show competitive performance against state-of-the-art methods. A real-world cardiovascular case study further demonstrates the extraction of simplified structural equations and interpretable causal effects. These results suggest that expressive causal generative modeling and functional transparency can be achieved jointly, supporting trustworthy deployment in tabular decision-making settings. Code: https://github.com/aalmodovares/kacgm
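A hedged sketch of the KAN-style parameterization: each structural equation is a sum of univariate functions of its parents, so each parent's learned contribution can be read off separately. Here a small polynomial basis per edge stands in for a true KAN spline layer:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy SCM: x1, x2 exogenous; y = f1(x1) + f2(x2) + noise.
x1 = rng.uniform(-2, 2, 500)
x2 = rng.uniform(-2, 2, 500)
y = np.sin(x1) + 0.5 * x2**2 + 0.05 * rng.normal(size=500)

def edge_basis(x):
    # One univariate function per parent edge, as a cubic polynomial basis.
    return np.stack([x, x**2, x**3], axis=1)

B = np.hstack([edge_basis(x1), edge_basis(x2), np.ones((500, 1))])
coef, *_ = np.linalg.lstsq(B, y, rcond=None)

# Inspectability: each parent's contribution is a separate, plottable term.
f1_hat = lambda x: edge_basis(x) @ coef[:3]
f2_hat = lambda x: edge_basis(x) @ coef[3:6]
resid = y - (f1_hat(x1) + f2_hat(x2) + coef[6])
print(coef.round(3), resid.std())
```

The recovered coefficient on `x2**2` sits near the true 0.5, and `f1_hat`, `f2_hat` can be plotted or symbolically approximated per edge, which is the auditability the abstract argues for.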
[441] Structured Latent Dynamics in Wireless CSI via Homomorphic World Models
Salmane Naoumi, Mehdi Bennis, Marwa Chafii
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2603.20048 returned HTTP 429 (rate limited).
[442] torchgfn: A PyTorch GFlowNet library
Joseph D. Viviano, Omar G. Younis, Sanghyeok Choi, Victor Schmidt, Yoshua Bengio, Salem Lahlou
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2305.14594 returned HTTP 429 (rate limited).
[443] Causal Learning in Biomedical Applications: Krebs Cycle as a Benchmark
Xiaoyu He, Petr Ryšavý, Jakub Mareček
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2406.15189 returned HTTP 429 (rate limited).
[444] Simulation-based Inference with the Python Package sbijax
Simon Dirmeier, Antonietta Mira, Carlo Albert
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2409.19435 returned HTTP 429 (rate limited).
[445] Online Clustering of Data Sequences with Bandit Information
G Dhinesh Chandran, Srinivas Reddy Kota, Srikrishna Bhashyam
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2501.11421 returned HTTP 429 (rate limited).
[446] Flow-based Conformal Prediction for Multi-dimensional Time Series
Junghwan Lee, Chen Xu, Yao Xie
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2502.05709 returned HTTP 429 (rate limited).
[447] CatBOX: A Categorical-Continuous Bayesian Optimization with Spectral Mixture Kernels for Accelerated Catalysis Experiments
Changquan Zhao, Yi Zhang, Zhuo Li, Li Jin, Cheng Hua, Yulian He
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2505.17393 returned HTTP 429 (rate limited).
[448] Hidden Breakthroughs in Language Model Training
Sara Kangaslahti, Elan Rosenfeld, Naomi Saphra
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2506.15872 returned HTTP 429 (rate limited).
[449] Predictive Scaling Laws for Efficient GRPO Training of Large Reasoning Models
Datta Nimmaturi, Vaishnavi Bhargava, Rajat Ghosh, Johnu George, Debojyoti Dutta
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2507.18014 returned HTTP 429 (rate limited).
[450] Fast 3D Diffusion for Scalable Granular Media Synthesis
Muhammad Moeeze Hassan, Régis Cottereau, Filippo Gatti, Patryk Dec
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2508.19752 returned HTTP 429 (rate limited).
[451] FNODE: Flow-Matching for data-driven simulation of constrained multibody systems
Hongyu Wang, Jingquan Wang, Dan Negrut
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2509.00183 returned HTTP 429 (rate limited).
[452] An upper bound on the silhouette evaluation metric for clustering
Hugo Sträng, Tai Dinh
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request for 2509.08625 returned HTTP 429 (rate limited).
[453] Does Weak-to-strong Generalization Happen under Spurious Correlations?
Chenruo Liu, Yijun Dong, Qi Lei
Main category: cs.LG
Summary unavailable: the arXiv API request for 2509.24005 returned HTTP 429 (rate limited).
[454] Less is More: Towards Simple Graph Contrastive Learning
Yanan Zhao, Feng Ji, Jingyang Dai, Jiaze Ma, Wee Peng Tay
Main category: cs.LG
Summary unavailable: the arXiv API request for 2509.25742 returned HTTP 429 (rate limited).
[455] Gym-TORAX: Open-source software for integrating reinforcement learning with plasma control simulators in tokamak research
Antoine Mouchamps, Arthur Malherbe, Adrien Bolland, Damien Ernst
Main category: cs.LG
Summary unavailable: the arXiv API request for 2510.11283 returned HTTP 429 (rate limited).
[456] KoALA: KL-L0 Adversarial Detector via Label Agreement
Siqi Li, Yasser Shoukry
Main category: cs.LG
Summary unavailable: the arXiv API request for 2510.12752 returned HTTP 429 (rate limited).
[457] SHAP Meets Tensor Networks: Provably Tractable Explanations with Parallelism
Reda Marzouk, Shahaf Bassan, Guy Katz
Main category: cs.LG
Summary unavailable: the arXiv API request for 2510.21599 returned HTTP 429 (rate limited).
[458] DETECT: Data-Driven Evaluation of Treatments Enabled by Classification Transformers
Yuanheng Mao, Lillian Yang, Stephen Yang, Ethan Shao, Zihan Li
Main category: cs.LG
Summary unavailable: the arXiv API request for 2511.07213 returned HTTP 429 (rate limited).
[459] ACT as Human: Multimodal Large Language Model Data Annotation with Critical Thinking
Lequan Lin, Dai Shi, Andi Han, Feng Chen, Qiuzheng Chen, Jiawen Li, Zhaoyang Li, Jiyuan Li, Zhenbang Sun, Junbin Gao
Main category: cs.LG
Summary unavailable: the arXiv API request for 2511.09833 returned HTTP 429 (rate limited).
[460] Efficient Cross-Domain Offline Reinforcement Learning with Dynamics- and Value-Aligned Data Filtering
Zhongjian Qiao, Rui Yang, Jiafei Lyu, Chenjia Bai, Xiu Li, Siyang Gao, Shuang Qiu
Main category: cs.LG
Summary unavailable: the arXiv API request for 2512.02435 returned HTTP 429 (rate limited).
[461] Bridging Training and Merging Through Momentum-Aware Optimization
Alireza Moayedikia, Alicia Troncoso
Main category: cs.LG
Summary unavailable: the arXiv API request for 2512.17109 returned HTTP 429 (rate limited).
[462] Alzheimer’s Disease Brain Network Mining
Alireza Moayedikia, Sara Fin
Main category: cs.LG
Summary unavailable: the arXiv API request for 2512.17276 returned HTTP 429 (rate limited).
[463] Physics-Guided Temporal Fusion for Lane-Change Intention Prediction
Jiazhao Shi, Ziyu Wang, Yichen Lin, Shoufeng Lu
Main category: cs.LG
Summary unavailable: the arXiv API request for 2512.24075 returned HTTP 429 (rate limited).
[464] Reinforcement Unlearning via Group Relative Policy Optimization
Efstratios Zaradoukas, Bardh Prenkaj, Gjergji Kasneci
Main category: cs.LG
Summary unavailable: the arXiv API request for 2601.20568 returned HTTP 429 (rate limited).
[465] How Understanding Forecast Uncertainty Resolves the Explainability Problem in Machine Learning Models
Joseph L. Breeden
Main category: cs.LG
Summary unavailable: the arXiv API request for 2602.00179 returned HTTP 429 (rate limited).
[466] A Task-Centric Theory for Iterative Self-Improvement with Easy-to-Hard Curricula
Chenruo Liu, Yijun Dong, Yiqiu Shen, Qi Lei
Main category: cs.LG
Summary unavailable: the arXiv API request for 2602.10014 returned HTTP 429 (rate limited).
[467] A Pragmatic Method for Comparing Clusterings with Overlaps and Outliers
Ryan DeWolfe, Paweł Prałat, François Théberge
Main category: cs.LG
Summary unavailable: the arXiv API request for 2602.14855 returned HTTP 429 (rate limited).
[468] mlx-vis: GPU-Accelerated Dimensionality Reduction and Visualization on Apple Silicon
Han Xiao
Main category: cs.LG
Summary unavailable: the arXiv API request for 2603.04035 returned HTTP 429 (rate limited).
[469] On the Cone Effect and Modality Gap in Medical Vision-Language Embeddings
David Restrepo, Miguel L Martins, Chenwei Wu, Luis Filipe Nakayama, Diego M Lopez, Stergios Christodoulidis, Maria Vakalopoulou, Enzo Ferrante
Main category: cs.LG
Summary unavailable: the arXiv API request for 2603.17246 returned HTTP 429 (rate limited).
[470] AcceRL: A Distributed Asynchronous Reinforcement Learning and World Model Framework for Vision-Language-Action Models
Chengxuan Lu, Shukuan Wang, Yanjie Li, Wei Liu, Shiji Jin, Fuyuan Qian, Peiming Li, Baigui Sun, Yang Liu
Main category: cs.LG
Summary unavailable: the arXiv API request for 2603.18464 returned HTTP 429 (rate limited).
[471] Best-of-Both-Worlds Multi-Dueling Bandits: Unified Algorithms for Stochastic and Adversarial Preferences under Condorcet and Borda Objectives
S Akash, Pratik Gajane, Jawar Singh
Main category: cs.LG
Summary unavailable: the arXiv API request for 2603.18972 returned HTTP 429 (rate limited).
[472] Predicting Hidden Links and Missing Nodes in Scale-Free Networks with Artificial Neural Networks
Rakib Hassan Pran
Main category: cs.LG
Summary unavailable: the arXiv API request for 2109.12331 returned HTTP 429 (rate limited).
[473] In-and-Out: Algorithmic Diffusion for Sampling Convex Bodies
Yunbum Kook, Santosh S. Vempala, Matthew S. Zhang
Main category: cs.LG
Summary unavailable: the arXiv API request for 2405.01425 returned HTTP 429 (rate limited).
[474] Generalized Continuous-Time Models for Nesterov’s Accelerated Gradient Methods
Chanwoong Park, Youngchae Cho, Insoon Yang
Main category: cs.LG
Summary unavailable: the arXiv API request for 2409.00913 returned HTTP 429 (rate limited).
[475] A new paradigm for global sensitivity analysis
Gildas Mazo
Main category: cs.LG
Summary unavailable: the arXiv API request for 2409.06271 returned HTTP 429 (rate limited).
[476] Learning Representations for Independence Testing
Nathaniel Xu, Feng Liu, Danica J. Sutherland
Main category: cs.LG
Summary unavailable: the arXiv API request for 2409.06890 returned HTTP 429 (rate limited).
[477] Investigating layer-selective transfer learning of QAOA parameters for Max-Cut problem
Francesco Aldo Venturelli, Sreetama Das, Filippo Caruso
Main category: cs.LG
Summary unavailable: the arXiv API request for 2412.21071 returned HTTP 429 (rate limited).
[478] Interpretable Early Warnings using Machine Learning in an Online Game-experiment
Guillaume Falmagne, Anna B. Stephenson, Simon A. Levin
Main category: cs.LG
Summary unavailable: the arXiv API request for 2502.09880 returned HTTP 429 (rate limited).
[479] A Phylogenetic Approach to Genomic Language Modeling
Carlos Albors, Jianan Canal Li, Gonzalo Benegas, Chengzhong Ye, Yun S. Song
Main category: cs.LG
Summary unavailable: the arXiv API request for 2503.03773 returned HTTP 429 (rate limited).
[480] On Policy Stochasticity in Mutual Information Optimal Control of Linear Systems
Shoju Enami, Kenji Kashima
Main category: cs.LG
Summary unavailable: the arXiv API request for 2507.21543 returned HTTP 429 (rate limited).
[481] Virtual Sensing for Solder Layer Degradation and Temperature Monitoring in IGBT Modules
Andrea Urgolo, Monika Stipsitz, Hèlios Sanchis-Alepuz
Main category: cs.LG
Summary unavailable: the arXiv API request for 2508.10515 returned HTTP 429 (rate limited).
[482] Unsupervised Feature Selection via Robust Autoencoder and Adaptive Graph Learning
Feng Yu, MD Saifur Rahman Mazumder, Ying Su, Oscar Contreras Velasco
Main category: cs.LG
Summary unavailable: the arXiv API request for 2512.18720 returned HTTP 429 (rate limited).
[483] Accelerating Large-Scale Cheminformatics Using a Byte-Offset Indexing Architecture for Terabyte-Scale Data Integration
Malikussaid, Septian Caesar Floresko, Sutiyo
Main category: cs.LG
Summary unavailable: the arXiv API request for 2601.18921 returned HTTP 429 (rate limited).
[484] Precedence-Constrained Decision Trees and Coverings
Michał Szyfelbein, Dariusz Dereniowski
Main category: cs.LG
Summary unavailable: the arXiv API request for 2602.21312 returned HTTP 429 (rate limited).
[485] Schrödinger Bridge Over A Compact Connected Lie Group
Hamza Mahmood, Abhishek Halder, Adeel Akhtar
Main category: cs.LG
Summary unavailable: the arXiv API request for 2603.14049 returned HTTP 429 (rate limited).
[486] Learnability with Partial Labels and Adaptive Nearest Neighbors
Nicolas A. Errandonea, Santiago Mazuelas, Jose A. Lozano, Sanjoy Dasgupta
Main category: cs.LG
Summary unavailable: the arXiv API request for 2603.15781 returned HTTP 429 (rate limited).
cs.MA
[487] TrustFlow: Topic-Aware Vector Reputation Propagation for Multi-Agent Ecosystems
Volodymyr Seliuchenko
Main category: cs.MA
TL;DR: TrustFlow is a reputation propagation algorithm that assigns multi-dimensional reputation vectors to software agents using topic-gated transfer operators on interaction graphs, with strong performance on multi-label classification and resistance to various attacks.
Details
Motivation: Traditional reputation systems use scalar scores, which are insufficient for complex multi-dimensional trust assessment. A reputation system is needed that handles multi-dimensional trust, resists attacks such as sybil attacks and reputation laundering, and provides reputation vectors queryable in the same embedding space as user queries.
Method: TrustFlow propagates reputation through an interaction graph via topic-gated transfer operators that modulate each edge by its content embedding. It employs a family of Lipschitz-1 transfer operators and composable information-theoretic gates, with convergence to a unique fixed point guaranteed by the contraction mapping theorem.
Result: Achieves up to 98% multi-label Precision@5 on dense graphs and 78% on sparse graphs. On a benchmark of 50 agents across 8 domains, it resists sybil attacks, reputation laundering, and vote rings with at most 4 percentage-point precision impact.
Conclusion: TrustFlow provides a robust, multi-dimensional reputation system that outperforms traditional approaches like PageRank and Topic-Sensitive PageRank by producing directly queryable vector reputations in the same embedding space as user queries.
Abstract: We introduce TrustFlow, a reputation propagation algorithm that assigns each software agent a multi-dimensional reputation vector rather than a scalar score. Reputation is propagated through an interaction graph via topic-gated transfer operators that modulate each edge by its content embedding, with convergence to a unique fixed point guaranteed by the contraction mapping theorem. We develop a family of Lipschitz-1 transfer operators and composable information-theoretic gates that achieve up to 98% multi-label Precision@5 on dense graphs and 78% on sparse ones. On a benchmark of 50 agents across 8 domains, TrustFlow resists sybil attacks, reputation laundering, and vote rings with at most 4 percentage-point precision impact. Unlike PageRank and Topic-Sensitive PageRank, TrustFlow produces vector reputation that is directly queryable by dot product in the same embedding space as user queries.
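The propagation scheme the abstract describes, topic-gated linear transfers along edges iterated to a unique fixed point, can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the gate matrices, damping factor, and in-flow normalization are assumptions chosen to make the update a contraction.

```python
import numpy as np

def propagate_reputation(edges, n_agents, dim, damping=0.85, tol=1e-9):
    """Iterate r <- (1 - d) * r0 + d * T(r) to a fixed point.

    edges: list of (src, dst, gate), where gate is a (dim, dim) matrix with
    operator norm <= 1 (the "Lipschitz-1 transfer operator" role). With
    damping d < 1 the update map is a contraction, so the Banach fixed-point
    theorem guarantees convergence to a unique reputation vector per agent.
    """
    r0 = np.full((n_agents, dim), 1.0 / dim)   # uniform prior reputation
    r = r0.copy()
    counts = np.zeros(n_agents)
    for _, dst, _ in edges:
        counts[dst] += 1                        # in-degree, for averaging
    for _ in range(1000):
        nxt = np.zeros_like(r)
        for src, dst, gate in edges:
            nxt[dst] += gate @ r[src]           # topic-gated transfer
        nxt[counts > 0] /= counts[counts > 0, None]   # keep non-expansive
        new_r = (1 - damping) * r0 + damping * nxt
        if np.max(np.abs(new_r - r)) < tol:
            break
        r = new_r
    return r
```

With identity gates on a symmetric cycle the uniform prior is already the fixed point; a gate that zeroes one dimension leaves the receiving agent with reputation concentrated in the surviving topic, mimicking topic-specific trust.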
[488] Planning Autonomous Vehicle Maneuvering in Work Zones Through Game-Theoretic Trajectory Generation
Mayar Nour, Atrisha Sarkar, Mohamed H. Zaki
Main category: cs.MA
TL;DR: Game-theoretic framework for autonomous vehicle lane change decision-making in work zones, reducing conflicts by 35% compared to traditional methods.
Details
Motivation: Work zones present high-risk environments for autonomous vehicles due to constrained geometries and unpredictable traffic patterns, and little research addresses the decision-making required to navigate them safely.
Method: Proposes a game-theoretic framework that models the lane-change maneuver as a non-cooperative game between vehicles, using a game-theoretic planner to generate trajectories that balance safety, progress, and traffic stability.
Result: Simulation results show the proposed model reduces conflict frequency by 35% and decreases probability of high-risk safety events compared to traditional vehicle behavior planning models in work-zone scenarios.
Conclusion: Game-theoretic approach effectively enhances safety for autonomous vehicle navigation in challenging work zone environments through improved decision-making and trajectory planning.
Abstract: Work zone navigation remains one of the most challenging manoeuvres for autonomous vehicles (AVs), where constrained geometries and unpredictable traffic patterns create a high-risk environment. Despite extensive research on AV trajectory planning, few studies address the decision-making required to navigate work zones safely. This paper proposes a novel game-theoretic framework for trajectory generation and control to enhance the safety of lane changes in a work zone environment. By modelling the lane change manoeuvre as a non-cooperative game between vehicles, we use a game-theoretic planner to generate trajectories that balance safety, progress, and traffic stability. The simulation results show that the proposed game-theoretic model reduces the frequency of conflicts by 35 percent and decreases the probability of high risk safety events compared to traditional vehicle behaviour planning models in safety-critical highway work-zone scenarios.
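The non-cooperative formulation can be illustrated with a toy two-vehicle normal-form game. The actions and payoff numbers below are invented for illustration (the paper's actual game is over trajectories, not two discrete actions); the sketch only shows how pure Nash equilibria of such a game resolve a merge conflict.

```python
import itertools

# Hypothetical payoffs for a two-vehicle work-zone encounter: each vehicle
# chooses "merge" or "yield"; entries are illustrative (safety + progress)
# utilities as (row_utility, column_utility), not values from the paper.
ACTIONS = ["merge", "yield"]
PAYOFF = {
    ("merge", "merge"): (-10, -10),  # simultaneous merge: conflict
    ("merge", "yield"): (5, 1),
    ("yield", "merge"): (1, 5),
    ("yield", "yield"): (0, 0),      # both hesitate: no progress
}

def pure_nash_equilibria(payoff, actions):
    """Profiles where neither vehicle gains by unilaterally deviating."""
    eqs = []
    for a, b in itertools.product(actions, repeat=2):
        u_a, u_b = payoff[(a, b)]
        best_a = all(payoff[(a2, b)][0] <= u_a for a2 in actions)
        best_b = all(payoff[(a, b2)][1] <= u_b for b2 in actions)
        if best_a and best_b:
            eqs.append((a, b))
    return eqs
```

Here the equilibria are the two coordinated outcomes where exactly one vehicle merges, which is the kind of conflict-free interaction the planner is reported to steer toward.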
[489] Helix: A Dual-Helix Co-Evolutionary Multi-Agent System for Prompt Optimization and Question Reformulation
Kewen Zhu, Liping Yi, Zhiming Zhao, Xiang Li, Qinghua Hu
Main category: cs.MA
TL;DR: Helix is a multi-agent system that jointly optimizes both question reformulation and prompt instructions through a co-evolutionary framework, achieving better LLM performance than single-sided optimization approaches.
Details
Motivation: Current automated prompt optimization methods are limited by fixed templates, constrained search spaces, and treating user questions as immutable inputs, ignoring the interdependence between question formulation and prompt design.
Method: A three-stage co-evolutionary framework: (1) planner-guided decomposition of coupled question-prompt objectives, (2) dual-track co-evolution in which specialized agents iteratively refine and critique each other, and (3) strategy-driven question generation for robust inference.
Result: Extensive experiments on 12 benchmarks against 6 baselines show up to 3.95% performance improvements across tasks with favorable optimization efficiency.
Conclusion: Joint optimization of question reformulation and prompt instructions through a structured multi-agent system significantly improves LLM performance beyond single-sided optimization approaches.
Abstract: Automated prompt optimization (APO) aims to improve large language model performance by refining prompt instructions. However, existing methods are largely constrained by fixed prompt templates, limited search spaces, or single-sided optimization that treats user questions as immutable inputs. In practice, question formulation and prompt design are inherently interdependent: clearer question structures facilitate focused reasoning and task understanding, while effective prompts reveal better ways to organize and restate queries. Ignoring this coupling fundamentally limits the effectiveness and adaptability of current APO approaches. We propose a unified multi-agent system (Helix) that jointly optimizes question reformulation and prompt instructions through a structured three-stage co-evolutionary framework. Helix integrates (1) planner-guided decomposition that breaks optimization into coupled question-prompt objectives, (2) dual-track co-evolution where specialized agents iteratively refine and critique each other to produce complementary improvements, and (3) strategy-driven question generation that instantiates high-quality reformulations for robust inference. Extensive experiments on 12 benchmarks against 6 strong baselines demonstrate the effectiveness of Helix, achieving up to 3.95% performance improvements across tasks with favorable optimization efficiency.
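The dual-track co-evolution can be caricatured as coordinate ascent over the (question, prompt) pair: alternately propose a reformulated question and a revised prompt, keeping only changes that score better jointly. The `evaluate`, `refine_q`, and `refine_p` callables below are purely hypothetical stand-ins for Helix's LLM agents, not its actual interfaces.

```python
def co_evolve(question, prompt, evaluate, refine_q, refine_p, rounds=5):
    """Toy dual-track loop: each round proposes a question reformulation and
    a prompt revision, accepting a candidate only if the joint (question,
    prompt) pair scores higher under `evaluate` (the critic's role)."""
    best = (question, prompt)
    best_score = evaluate(*best)
    for _ in range(rounds):
        for candidate in ((refine_q(best[0]), best[1]),
                          (best[0], refine_p(best[1]))):
            score = evaluate(*candidate)
            if score > best_score:
                best, best_score = candidate, score
    return best, best_score
```

The point of the sketch is the coupling: neither track alone can reach the best joint score when the evaluator rewards properties of the question and the prompt together.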
cs.MM
eess.AS
[490] Plug-and-Steer: Decoupling Separation and Selection in Audio-Visual Target Speaker Extraction
Doyeop Kwak, Suyeon Lee, Joon Son Chung
Main category: eess.AS
TL;DR: Plug-and-Steer: A new AV-TSE approach that decouples separation and target selection by using a frozen audio-only backbone for high-fidelity separation and visual modality only for target selection via a Latent Steering Matrix.
Details
Motivation: Conventional AV-TSE systems deeply integrate audio and visual features, which can limit fidelity due to noisy in-the-wild datasets. The authors aim to preserve acoustic priors by separating the roles of audio (separation) and vision (target selection).
Method: Proposes Plug-and-Steer with Latent Steering Matrix (LSM) - a minimalist linear transformation that re-routes latent features within a frozen audio-only backbone to anchor target speaker to designated channel. Visual modality is limited strictly to target selection.
Result: Experiments across four representative architectures show the method effectively preserves acoustic priors of diverse backbones, achieving perceptual quality comparable to original backbones.
Conclusion: Decoupling separation and target selection in AV-TSE through Plug-and-Steer with LSM provides better fidelity preservation compared to conventional deeply integrated approaches.
Abstract: The goal of this paper is to provide a new perspective on audio-visual target speaker extraction (AV-TSE) by decoupling the separation and target selection. Conventional AV-TSE systems typically integrate audio and visual features deeply to re-learn the entire separation process, which can act as a fidelity ceiling due to the noisy nature of in-the-wild audio-visual datasets. To address this, we propose Plug-and-Steer, which assigns high-fidelity separation to a frozen audio-only backbone and limits the role of visual modality strictly to target selection. We introduce the Latent Steering Matrix (LSM), a minimalist linear transformation that re-routes latent features within the backbone to anchor the target speaker to a designated channel. Experiments across four representative architectures show that our method effectively preserves the acoustic priors of diverse backbones, achieving perceptual quality comparable to the original backbones. Audio samples are available at: https://plugandsteer.github.io
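The Latent Steering Matrix is described as a minimalist linear transformation over a frozen backbone's latent channels. A minimal numpy sketch, with a stand-in "backbone" and a hand-set LSM (in the paper the LSM is learned from visual target cues, and the backbone is a real separator):

```python
import numpy as np

rng = np.random.default_rng(0)

def frozen_backbone(mixture, n_src=2):
    """Stand-in for a frozen audio-only separator: returns n_src latent
    streams in arbitrary (permutation-ambiguous) order."""
    # Hypothetical: two fixed random projections of the mixture.
    W = rng.standard_normal((n_src, mixture.shape[0], mixture.shape[0])) * 0.1
    return np.stack([w @ mixture for w in W])  # (n_src, T)

def steer(latents, lsm):
    """Latent Steering Matrix: a single linear map over the source axis
    that anchors the target speaker to channel 0."""
    return np.einsum("ij,jt->it", lsm, latents)

mix = rng.standard_normal(16)
latents = frozen_backbone(mix)
# Toy LSM that swaps the two channels; the backbone weights stay untouched.
lsm = np.array([[0.0, 1.0], [1.0, 0.0]])
steered = steer(latents, lsm)
```

Because the steering acts only on the source axis, the backbone's separation behavior (its acoustic prior) is untouched; only which stream lands in the designated output channel changes.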
[491] Gesture2Speech: How Far Can Hand Movements Shape Expressive Speech?
Lokesh Kumar, Nirmesh Shah, Ashishkumar P. Gudmalwar, Pankaj Wasnik
Main category: eess.AS
TL;DR: Gesture2Speech: A multimodal TTS framework that uses hand gesture cues to modulate speech prosody, achieving better naturalness and gesture-speech synchrony than existing methods.
Details
Motivation: Human communication integrates speech and bodily motion, with hand gestures naturally complementing vocal prosody. While recent TTS systems incorporate multimodal cues like facial expressions, hand gestures' role in shaping prosody remains underexplored.
Method: Proposes a multimodal Mixture-of-Experts (MoE) architecture that dynamically fuses linguistic content and gesture features in a style extraction module. Uses an LLM-based speech decoder conditioned on the fused representation, with a gesture-speech alignment loss for temporal synchrony.
Result: Outperforms state-of-the-art baselines on the PATS dataset in both speech naturalness and gesture-speech synchrony. First work to utilize hand gesture cues for prosody control in neural speech synthesis.
Conclusion: Gesture2Speech successfully demonstrates that hand gesture cues can effectively modulate prosody in synthesized speech, achieving better naturalness and temporal alignment between gestures and prosodic contours.
Abstract: Human communication seamlessly integrates speech and bodily motion, where hand gestures naturally complement vocal prosody to express intent, emotion, and emphasis. While recent text-to-speech (TTS) systems have begun incorporating multimodal cues such as facial expressions or lip movements, the role of hand gestures in shaping prosody remains largely underexplored. We propose a novel multimodal TTS framework, Gesture2Speech, that leverages visual gesture cues to modulate prosody in synthesized speech. Motivated by the observation that confident and expressive speakers coordinate gestures with vocal prosody, we introduce a multimodal Mixture-of-Experts (MoE) architecture that dynamically fuses linguistic content and gesture features within a dedicated style extraction module. The fused representation conditions an LLM-based speech decoder, enabling prosodic modulation that is temporally aligned with hand movements. We further design a gesture-speech alignment loss that explicitly models their temporal correspondence to ensure fine-grained synchrony between gestures and prosodic contours. Evaluations on the PATS dataset show that Gesture2Speech outperforms state-of-the-art baselines in both speech naturalness and gesture-speech synchrony. To the best of our knowledge, this is the first work to utilize hand gesture cues for prosody control in neural speech synthesis. Demo samples are available at https://research.sri-media-analysis.com/aaai26-beeu-gesture2speech/
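The MoE-based style fusion can be sketched as a gate over concatenated text and gesture features. All shapes and the gating scheme below are illustrative assumptions, not the paper's architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def moe_fuse(text_feat, gesture_feat, experts, gate_w):
    """Toy Mixture-of-Experts fusion: a gate over the concatenated
    text+gesture features weights per-expert linear projections."""
    x = np.concatenate([text_feat, gesture_feat])   # (2d,)
    gates = softmax(gate_w @ x)                     # (n_experts,)
    outs = np.stack([W @ x for W in experts])       # (n_experts, d)
    return gates @ outs                             # (d,) style vector

rng = np.random.default_rng(0)
d, n_exp = 8, 4
experts = [rng.standard_normal((d, 2 * d)) for _ in range(n_exp)]
gate_w = rng.standard_normal((n_exp, 2 * d))
style = moe_fuse(rng.standard_normal(d), rng.standard_normal(d), experts, gate_w)
```

The resulting style vector would then condition the speech decoder; the gesture-speech alignment loss described in the abstract operates on top of this and is not sketched here.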
[492] BioDCASE 2026 Challenge Baseline for Cross-Domain Mosquito Species Classification
Yuanbo Hou, Vanja Zdravkovic, Marianne Sinka, Yunpeng Li, Wenwu Wang, Mark D. Plumbley, Kathy Willis, Stephen Roberts
Main category: eess.AS
TL;DR: Baseline system for cross-domain mosquito species classification challenge using audio recordings, showing strong within-domain performance but poor generalization to unseen domains.
Details
Motivation: Mosquito-borne diseases cause significant global health impacts, but traditional surveillance methods are slow and labor-intensive. Audio-based monitoring offers a scalable alternative, but species classification is difficult due to low signal-to-noise ratio, class imbalance, and domain variation across recording conditions.
Method: Uses log-mel features and a multitemporal resolution convolutional neural network (MTRCNN) with species and auxiliary domain outputs. Provides complete training and test scripts as a reproducible baseline for the BioDCASE 2026 Cross-Domain Mosquito Species Classification challenge.
Result: The baseline system performs strongly on seen domains but degrades markedly on unseen domains, demonstrating that cross-domain generalization is the central challenge rather than within-domain recognition.
Conclusion: Cross-domain generalization is the key challenge for practical mosquito species classification from multi-source bioacoustic recordings, highlighting the need for models that can transfer effectively to new acquisition settings.
Abstract: Mosquito-borne diseases affect more than one billion people each year and cause close to one million deaths. Traditional surveillance methods rely on traps and manual identification that are slow, labor-intensive, and difficult to scale. Audio-based mosquito monitoring offers a non-destructive, lower-cost, and more scalable complement to trap-based surveillance, but reliable species classification remains difficult under real-world recording conditions. Mosquito flight tones are narrow-band, often low in signal-to-noise ratio, and easily masked by background noise, and recordings for several epidemiologically relevant species remain limited, creating pronounced class imbalance. Variation across devices, environments, and collection protocols further increases the difficulty of robust classification. Such variation can cause models to rely on domain-specific recording artefacts rather than species-relevant acoustic cues, which makes transfer to new acquisition settings difficult. The BioDCASE 2026 Cross-Domain Mosquito Species Classification (CD-MSC) challenge is designed around this deployment problem by evaluating performance on both seen and unseen domains. This paper presents the official baseline system and evaluation pipeline as a simple, fully reproducible reference for the CD-MSC challenge task. The baseline uses log-mel features and a multitemporal resolution convolutional neural network (MTRCNN) with species and auxiliary domain outputs, together with complete training and test scripts. The baseline system performs strongly on seen domains but degrades markedly on unseen domains, showing that cross-domain generalisation, rather than within-domain recognition, is the central challenge for practical mosquito species classification from multi-source bioacoustic recordings.
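The baseline's front end (log-mel features) is standard. A minimal numpy version, assuming a textbook triangular mel filterbank; the MTRCNN classifier and the official challenge pipeline are not reproduced here:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel(signal, sr=8000, n_fft=256, hop=128, n_mels=20):
    """Minimal log-mel front end: framed power spectrum through a
    triangular mel filterbank, then log compression."""
    # Frame and window the signal.
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i*hop:i*hop+n_fft] for i in range(n_frames)])
    frames *= np.hanning(n_fft)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2   # (T, n_fft//2+1)
    # Triangular mel filterbank.
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):
            fb[m - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[m - 1, k] = (r - k) / max(r - c, 1)
    return np.log(power @ fb.T + 1e-10)                # (T, n_mels)

rng = np.random.default_rng(0)
# Synthetic "flight tone": a 600 Hz sinusoid in noise.
t = np.arange(8000) / 8000
x = np.sin(2 * np.pi * 600 * t) + 0.1 * rng.standard_normal(8000)
feats = log_mel(x)
```

In practice one would use a tuned library implementation (e.g. librosa's mel filterbank); the sketch only shows why a narrow-band flight tone survives as energy in a few mel bands.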
[493] VSSFlow: Unifying Video-conditioned Sound and Speech Generation via Joint Learning
Xin Cheng, Yuyue Wang, Xihua Wang, Yihan Wu, Kaisi Guan, Yijing Chen, Peng Zhang, Xiaojiang Liu, Meng Cao, Ruihua Song
Main category: eess.AS
TL;DR: VSSFlow is a unified flow-matching framework that bridges Video-to-Sound (V2S) and Visual Text-to-Speech (VisualTTS) generation using a Diffusion Transformer architecture with disentangled condition aggregation.
Details
Motivation: Video-conditioned audio generation has traditionally been treated as separate tasks (V2S and VisualTTS), leaving the potential for a unified generative framework largely unexplored. The paper aims to bridge this gap and demonstrate that joint training for these tasks can be effective rather than leading to performance degradation.
Method: Proposes VSSFlow, a unified flow-matching framework using Diffusion Transformer (DiT) architecture. Introduces a disentangled condition aggregation mechanism that leverages distinct properties of attention layers: cross-attention for semantic conditions and self-attention for temporally-intensive conditions. Also uses a straightforward feature-level data synthesis method for joint sound and speech generation.
Result: Extensive experiments on V2S, VisualTTS and joint generation benchmarks show that VSSFlow effectively unifies these tasks and surpasses state-of-the-art domain-specific baselines. The framework maintains superior performance during end-to-end joint learning and adapts well to joint sound and speech generation using synthetic data.
Conclusion: VSSFlow demonstrates the critical potential of unified generative models for video-conditioned audio generation, successfully bridging V2S and VisualTTS tasks while outperforming specialized baselines, challenging the prevailing belief that joint training leads to performance degradation.
Abstract: Video-conditioned audio generation, including Video-to-Sound (V2S) and Visual Text-to-Speech (VisualTTS), has traditionally been treated as distinct tasks, leaving the potential for a unified generative framework largely underexplored. In this paper, we bridge this gap with VSSFlow, a unified flow-matching framework that seamlessly solves both problems. To effectively handle multiple input signals within a Diffusion Transformer (DiT) architecture, we propose a disentangled condition aggregation mechanism leveraging distinct intrinsic properties of attention layers: cross-attention for semantic conditions, and self-attention for temporally-intensive conditions. Besides, contrary to the prevailing belief that joint training for the two tasks leads to performance degradation, we demonstrate that VSSFlow maintains superior performance during the end-to-end joint learning process. Furthermore, we use a straightforward feature-level data synthesis method, demonstrating that our framework provides a robust foundation that easily adapts to joint sound and speech generation using synthetic data. Extensive experiments on V2S, VisualTTS and joint generation benchmarks show that VSSFlow effectively unifies these tasks and surpasses state-of-the-art domain-specific baselines, underscoring the critical potential of unified generative models. Project page: https://vasflow1.github.io/vasflow/
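The flow-matching objective underlying VSSFlow can be sketched generically: sample a straight-line interpolant between noise and data and regress a velocity field toward x1 - x0. The DiT conditioning path (cross/self-attention) is omitted; this is only the generic rectified-flow target, not the paper's full model:

```python
import numpy as np

def make_interpolant(x0, x1, t):
    """Straight-line interpolant used by flow matching: x_t moves from
    noise x0 at t=0 to data x1 at t=1; the regression target is x1 - x0."""
    return (1 - t) * x0 + t * x1, x1 - x0

def flow_matching_loss(v_pred, v_target):
    """MSE between the predicted velocity field and the true velocity."""
    return np.mean((v_pred - v_target) ** 2)

rng = np.random.default_rng(0)
x1 = rng.standard_normal((4, 8))   # "audio" latents (data)
x0 = rng.standard_normal((4, 8))   # Gaussian noise
t = rng.uniform(size=(4, 1))       # per-example time
xt, v_target = make_interpolant(x0, x1, t)

# A model that recovers the true velocity achieves zero loss; a constant
# (zero) model does not.
oracle_loss = flow_matching_loss(v_target, v_target)
zero_loss = flow_matching_loss(np.zeros_like(v_target), v_target)
```

At inference, audio is generated by integrating the learned velocity field from noise toward data, with the video/text conditions injected through the attention layers described in the abstract.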
eess.IV
[494] TuLaBM: Tumor-Biased Latent Bridge Matching for Contrast-Enhanced MRI Synthesis
Atharva Rege, Adinath Madhavrao Dukre, Numan Balci, Dwarikanath Mahapatra, Imran Razzak
Main category: eess.IV
TL;DR: TuLaBM: A latent space Brownian bridge matching method for synthesizing contrast-enhanced MRI from non-contrast MRI with tumor-biased attention for improved tumor region fidelity.
Details
Motivation: CE-MRI requires gadolinium contrast agents that increase costs and safety concerns. Existing GAN-based methods suffer from instability, while diffusion models are computationally expensive and often fail to reproduce critical tumor contrast patterns accurately.
Method: Formulates NC-to-CE MRI translation as Brownian bridge transport between source and target distributions in a learned latent space. Introduces Tumor-Biased Attention Mechanism (TuBAM) to amplify tumor-relevant latent features during bridge evolution, and a boundary-aware loss to improve tumor margin sharpness.
Result: Outperforms state-of-the-art baselines on both whole-image and tumor-region metrics on BraTS2023-GLI and liver MRI datasets. Generalizes effectively to unseen liver MRI data in zero-shot and fine-tuned settings, with inference times under 0.097 seconds per image.
Conclusion: TuLaBM provides an efficient and accurate alternative to CE-MRI acquisition by addressing computational cost and tumor fidelity limitations of previous methods through latent space bridge matching with tumor-biased attention.
Abstract: Contrast-enhanced magnetic resonance imaging (CE-MRI) plays a crucial role in brain tumor assessment; however, its acquisition requires gadolinium-based contrast agents (GBCAs), which increase costs and raise safety concerns. Consequently, synthesizing CE-MRI from non-contrast MRI (NC-MRI) has emerged as a promising alternative. Early Generative Adversarial Network (GAN)-based approaches suffered from instability and mode collapse, while diffusion models, despite impressive synthesis quality, remain computationally expensive and often fail to faithfully reproduce critical tumor contrast patterns. To address these limitations, we propose Tumor-Biased Latent Bridge Matching (TuLaBM), which formulates NC-to-CE MRI translation as Brownian bridge transport between source and target distributions in a learned latent space, enabling efficient training and inference. To enhance tumor-region fidelity, we introduce a Tumor-Biased Attention Mechanism (TuBAM) that amplifies tumor-relevant latent features during bridge evolution, along with a boundary-aware loss that constrains tumor interfaces to improve margin sharpness. While bridge matching has been explored for medical image translation in pixel space, our latent formulation substantially reduces computational cost and inference time. Experiments on BraTS2023-GLI (BraSyn) and Cleveland Clinic (in-house) liver MRI datasets show that TuLaBM consistently outperforms state-of-the-art baselines on both whole-image and tumor-region metrics, generalizes effectively to unseen liver MRI data in zero-shot and fine-tuned settings, and achieves inference times under 0.097 seconds per image.
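The Brownian bridge transport at the core of TuLaBM pins both endpoints: the mean interpolates linearly between source and target latents while the variance, proportional to t(1 - t), vanishes at t = 0 and t = 1. A minimal sampler with toy stand-in latents (TuBAM and the boundary-aware loss are not sketched):

```python
import numpy as np

def brownian_bridge_sample(z0, z1, t, sigma, rng):
    """Brownian bridge between source (NC) and target (CE) latents:
    the mean interpolates linearly; the std sigma * sqrt(t * (1 - t))
    pinches to zero at both endpoints."""
    mean = (1 - t) * z0 + t * z1
    std = sigma * np.sqrt(t * (1 - t))
    return mean + std * rng.standard_normal(z0.shape)

rng = np.random.default_rng(0)
z0 = np.zeros(16)   # stand-in for the NC-MRI latent
z1 = np.ones(16)    # stand-in for the CE-MRI latent
mid = brownian_bridge_sample(z0, z1, 0.5, sigma=1.0, rng=rng)  # noisy midpoint
end = brownian_bridge_sample(z0, z1, 1.0, sigma=1.0, rng=rng)  # exactly z1
```

Because the bridge is anchored at both ends, the model only has to learn the drift between the two latent distributions, which is what makes training and inference in latent space cheap relative to pixel-space diffusion.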
[495] Offshore oil and gas platform dynamics in the North Sea, Gulf of Mexico, and Persian Gulf: Exploiting the Sentinel-1 archive
Robin Spanier, Thorsten Hoeser, John Truckenbrodt, Felix Bachofer, Claudia Kuenzer
Main category: eess.IV
TL;DR: Automated detection of offshore oil/gas platforms using Sentinel-1 satellite data and deep learning for spatiotemporal monitoring across three major regions from 2017-2025.
Details
Motivation: Marine spaces with offshore infrastructure need consistent, scalable monitoring due to economic, environmental, and regulatory implications, but maritime areas are difficult to monitor systematically due to inaccessibility and spatial extent.
Method: Leveraging Sentinel-1 archive Earth observation data and deep learning-based object detection to create quarterly time series of platform locations, with additional derivation of platform size, water depth, distance to coast, national affiliation, and installation/decommissioning dates.
Result: Identified 3,728 offshore platforms in 2025 across three regions: 356 in North Sea, 1,641 in Gulf of Mexico, 1,731 in Persian Gulf. Observed expansion in Persian Gulf until 2024, decline in Gulf of Mexico and North Sea from 2018-2020, with high dynamics - over 2,700 platforms installed/relocated and comparable number decommissioned/relocated.
Conclusion: Demonstrates potential of freely available Earth observation data and deep learning for consistent, long-term monitoring of marine infrastructure. Public dataset provides basis for offshore monitoring, maritime planning, and analysis of offshore energy sector transformation.
Abstract: The increasing use of marine spaces by offshore infrastructure, including oil and gas platforms, underscores the need for consistent, scalable monitoring. Offshore development has economic, environmental, and regulatory implications, yet maritime areas remain difficult to monitor systematically due to their inaccessibility and spatial extent. This study presents an automated approach to the spatiotemporal detection of offshore oil and gas platforms based on freely available Earth observation data. Leveraging Sentinel-1 archive data and deep learning-based object detection, a consistent quarterly time series of platform locations for three major production regions (the North Sea, the Gulf of Mexico, and the Persian Gulf) was created for the period 2017-2025. In addition, platform size, water depth, distance to the coast, national affiliation, and installation and decommissioning dates were derived. 3,728 offshore platforms were identified in 2025: 356 in the North Sea, 1,641 in the Gulf of Mexico, and 1,731 in the Persian Gulf. While expansion was observed in the Persian Gulf until 2024, the Gulf of Mexico and the North Sea saw a decline in platform numbers from 2018-2020. At the same time, a pronounced dynamic was apparent. More than 2,700 platforms were installed or relocated to new sites, while a comparable number were decommissioned or relocated. Furthermore, the increasing number of platforms with short lifespans points to a structural change in the offshore sector associated with the growing importance of mobile offshore units such as jack-ups or drillships. The results highlight the potential of freely available Earth observation data and deep learning for consistent, long-term monitoring of marine infrastructure. The derived dataset is public and provides a basis for offshore monitoring, maritime planning, and analyses of the transformation of the offshore energy sector.
[496] ReconMIL: Synergizing Latent Space Reconstruction with Bi-Stream Mamba for Whole Slide Image Analysis
Lubin Gan, Jing Zhang, Heng Zhang, Xin Di, Zhifeng Wang, Wenke Huang, Xiaoyan Sun
Main category: eess.IV
TL;DR: ReconMIL is a novel framework for whole slide image analysis that addresses domain gap issues and over-smoothing in multiple instance learning by introducing task-specific feature adaptation and a bi-stream architecture combining global context with local morphological details.
Details
Motivation: Current WSI analysis methods using MIL struggle with: 1) domain gap between generic foundation model features and specific histological tasks, leading to suboptimal separability; 2) over-smoothing where global aggregators dilute sparse but critical diagnostic signals by overwhelming them with dominant background context.
Method: ReconMIL introduces: 1) Latent Space Reconstruction module that adaptively projects generic features into a compact, task-specific manifold; 2) Bi-stream architecture with Mamba-based global stream for contextual priors and CNN-based local stream to preserve subtle morphological anomalies; 3) Scale-adaptive selection mechanism that dynamically fuses the two streams based on when to rely on overall architecture versus local saliency.
Result: Evaluations across multiple diagnostic and survival prediction benchmarks show ReconMIL consistently outperforms current state-of-the-art methods, effectively localizing fine-grained diagnostic regions while suppressing background noise. Visualization confirms superior ability to balance global structure and local granularity.
Conclusion: ReconMIL successfully bridges the domain gap in WSI analysis and addresses over-smoothing through adaptive feature reconstruction and balanced global-local aggregation, demonstrating improved diagnostic region localization and performance across various benchmarks.
Abstract: Whole slide image (WSI) analysis heavily relies on multiple instance learning (MIL). While recent methods benefit from large-scale foundation models and advanced sequence modeling to capture long-range dependencies, they still struggle with two critical issues. First, directly applying frozen, task-agnostic features often leads to suboptimal separability due to the domain gap with specific histological tasks. Second, relying solely on global aggregators can cause over-smoothing, where sparse but critical diagnostic signals are overshadowed by the dominant background context. In this paper, we present ReconMIL, a novel framework designed to bridge this domain gap and balance global-local feature aggregation. Our approach introduces a Latent Space Reconstruction module that adaptively projects generic features into a compact, task-specific manifold, improving boundary delineation. To prevent information dilution, we develop a bi-stream architecture combining a Mamba-based global stream for contextual priors and a CNN-based local stream to preserve subtle morphological anomalies. A scale-adaptive selection mechanism dynamically fuses these two streams, determining when to rely on overall architecture versus local saliency. Evaluations across multiple diagnostic and survival prediction benchmarks show that ReconMIL consistently outperforms current state-of-the-art methods, effectively localizing fine-grained diagnostic regions while suppressing background noise. Visualization results confirm the model's superior ability to localize diagnostic regions by effectively balancing global structure and local granularity.
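The scale-adaptive selection between the Mamba-based global stream and the CNN-based local stream can be illustrated as a learned per-dimension gate. The sigmoid gate below is an assumption about the fusion form, kept deliberately simple:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def scale_adaptive_fuse(global_feat, local_feat, w_gate):
    """Toy scale-adaptive selection: a sigmoid gate computed from both
    streams decides, per dimension, how much to trust global (Mamba-like)
    context versus local (CNN-like) morphology."""
    g = sigmoid(w_gate @ np.concatenate([global_feat, local_feat]))
    return g * global_feat + (1 - g) * local_feat

rng = np.random.default_rng(0)
d = 8
w_gate = rng.standard_normal((d, 2 * d))  # gate parameters (learned in practice)
g_feat = rng.standard_normal(d)           # global-stream features
l_feat = rng.standard_normal(d)           # local-stream features
fused = scale_adaptive_fuse(g_feat, l_feat, w_gate)
```

Since the gate is a convex combination per dimension, the fused feature always lies between the two streams' values, which is the sense in which the mechanism "selects" rather than re-learns.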
[497] Goal-Oriented Framework for Optical Flow-based Multi-User Multi-Task Video Transmission
Yujie Xu, Shutong Chen, Nan Li, Yansha Deng, Jinhong Yuan, Robert Schober
Main category: eess.IV
TL;DR: Proposes OF-GSC, a goal-oriented semantic communication framework for optical flow-based multi-user multi-task video transmission with transformer-based decoding and DDPG bandwidth allocation.
Details
Motivation: To reduce transmission burden and save communication resources for multi-user multi-task video transmission in wireless systems by moving from traditional data transmission to semantic communication.
Method: Uses optical flow-based semantic representation extraction at transmitter, transformer-based semantic decoder at receiver for video reconstruction/classification, and DDPG-based bandwidth allocation algorithm for multi-user optimization.
Result: 13.47% SSIM improvement over DeepJSCC for reconstruction; Top-1 accuracy slightly surpassing VideoMAE with only 25% data; 25.97% reduction in maximum transmission time vs equal-bandwidth allocation.
Conclusion: OF-GSC framework effectively reduces transmission burden while maintaining high-quality video reconstruction and classification performance through semantic communication and intelligent resource allocation.
Abstract: Efficient multi-user multi-task video transmission is an important research topic within the realm of current wireless communication systems. To reduce the transmission burden and save communication resources, we propose a goal-oriented semantic communication framework for optical flow-based multi-user multi-task video transmission (OF-GSC). At the transmitter, we design a semantic encoder that consists of a motion extractor and a patch-level optical flow-based semantic representation extractor to effectively identify and select important semantic representations. At the receiver, we design a transformer-based semantic decoder for high-quality video reconstruction and video classification tasks. To minimize the communication time, we develop a deep deterministic policy gradient (DDPG)-based bandwidth allocation algorithm for multi-user transmission. For video reconstruction tasks, our OF-GSC framework achieves a significant improvement in the received video quality, as evidenced by a 13.47% increase in the structural similarity index measure (SSIM) score in comparison to DeepJSCC. For video classification tasks, OF-GSC achieves a Top-1 accuracy slightly surpassing the performance of VideoMAE with only 25% required data under the same mask ratio of 0.3. For bandwidth allocation optimization, our DDPG-based algorithm reduces the maximum transmission time by 25.97% compared with the baseline equal-bandwidth allocation scheme.
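Patch-level selection by optical-flow importance can be sketched as ranking patches by mean flow magnitude and keeping only the most dynamic fraction. The ranking criterion and keep ratio below are illustrative guesses, not the paper's actual extractor:

```python
import numpy as np

def select_patches(flow, patch=4, keep_ratio=0.25):
    """Toy patch selection: rank patches by mean optical-flow magnitude
    and keep the most dynamic fraction as the 'important' semantics."""
    H, W, _ = flow.shape
    mag = np.linalg.norm(flow, axis=-1)          # per-pixel flow magnitude
    gh, gw = H // patch, W // patch
    # Mean magnitude per (patch x patch) block.
    energy = mag[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch).mean(axis=(1, 3))
    k = max(1, int(keep_ratio * gh * gw))
    flat = energy.ravel()
    keep = np.argsort(flat)[::-1][:k]            # indices of kept patches
    return keep, flat

rng = np.random.default_rng(0)
flow = rng.standard_normal((16, 16, 2)) * 0.1
flow[0:4, 0:4] += 5.0                            # one highly dynamic patch
keep, energy = select_patches(flow)
```

Transmitting only the selected patches is what reduces bandwidth; the transformer decoder and the DDPG bandwidth allocator operate downstream of this selection and are not sketched.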
[498] Investigating a Policy-Based Formulation for Endoscopic Camera Pose Recovery
Jan Emily Mangulabnan, Akshat Chauhan, Laura Fleig, Lalithkumar Seenivasan, Roger D. Soberanis-Mukul, S. Swaroop Vedula, Russell H. Taylor, Masaru Ishii, Gregory D. Hager, Mathias Unberath
Main category: eess.IV
TL;DR: Policy-based approach for endoscopic camera pose recovery that predicts short-horizon relative motions without explicit geometric representation, addressing challenges of traditional feature-matching methods in endoscopic imaging.
Details
Motivation: Traditional vision-based navigation systems for endoscopic surgery rely on feature matching and geometric optimization, which degrade under challenging endoscopic conditions like low texture and rapid illumination changes. Surgeons, however, interpret visual appearance in context of prior knowledge, suggesting a different approach is needed.
Method: Policy-based formulation that imitates experts by estimating camera trajectories conditioned on previous camera state. Directly predicts short-horizon relative motions without maintaining explicit geometric representation at inference time, avoiding brittle correspondence matching and reconstruction failures.
Result: On cadaveric sinus endoscopy, under oracle state conditioning, achieved the lowest mean translation error and competitive rotational accuracy compared to geometric baselines. Showed reduced sensitivity to low-texture conditions when analyzing prediction windows grouped by texture richness and illumination change.
Conclusion: A learned motion policy offers a viable alternative formulation for endoscopic camera pose recovery, addressing key limitations of geometry-based approaches while maintaining competitive accuracy.
Abstract: In endoscopic surgery, surgeons continuously locate the endoscopic view relative to the anatomy by interpreting the evolving visual appearance of the intraoperative scene in the context of their prior knowledge. Vision-based navigation systems seek to replicate this capability by recovering camera pose directly from endoscopic video, but most approaches do not embody the same principles of reasoning about new frames that makes surgeons successful. Instead, they remain grounded in feature matching and geometric optimization over keyframes, an approach that has been shown to degrade under the challenging conditions of endoscopic imaging like low texture and rapid illumination changes. Here, we pursue an alternative approach and investigate a policy-based formulation of endoscopic camera pose recovery that seeks to imitate experts in estimating trajectories conditioned on the previous camera state. Our approach directly predicts short-horizon relative motions without maintaining an explicit geometric representation at inference time. It thus addresses, by design, some of the notorious challenges of geometry-based approaches, such as brittle correspondence matching, instability in texture-sparse regions, and limited pose coverage due to reconstruction failure. We evaluate the proposed formulation on cadaveric sinus endoscopy. Under oracle state conditioning, we compare short-horizon motion prediction quality to geometric baselines, achieving the lowest mean translation error and competitive rotational accuracy. We analyze robustness by grouping prediction windows according to texture richness and illumination change, finding reduced sensitivity to low-texture conditions. These findings suggest that a learned motion policy offers a viable alternative formulation for endoscopic camera pose recovery.
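The policy formulation predicts short-horizon relative motions conditioned on the previous camera state; recovering a trajectory then amounts to composing those relative motions. A 2D (SE(2)) sketch with a hypothetical constant-turn policy standing in for the learned network:

```python
import numpy as np

def compose(pose, delta):
    """Apply a predicted short-horizon relative motion (dx, dy, dtheta),
    expressed in the current camera frame, to an absolute 2D pose."""
    x, y, th = pose
    dx, dy, dth = delta
    return (x + dx * np.cos(th) - dy * np.sin(th),
            y + dx * np.sin(th) + dy * np.cos(th),
            th + dth)

def rollout(pose0, policy, frames):
    """State-conditioned rollout: the policy maps (frame, previous pose)
    to a relative motion, mirroring oracle state conditioning."""
    poses = [pose0]
    for f in frames:
        poses.append(compose(poses[-1], policy(f, poses[-1])))
    return poses

# Hypothetical policy: step one unit forward while turning 90 degrees;
# four such steps trace a closed square back to the start.
policy = lambda frame, pose: (1.0, 0.0, np.pi / 2)
traj = rollout((0.0, 0.0, 0.0), policy, range(4))
```

The real system works in SE(3) and conditions the policy on image content, but the composition structure, and the fact that no global map or reconstruction is maintained, is the same.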
[499] Understanding Task Aggregation for Generalizable Ultrasound Foundation Models
Fangyijie Wang, Tanya Akumu, Vien Ngoc Dang, Amelia Jiménez-Sánchez, Jieyun Bai, Guénolé Silvestre, Karim Lekadir, Kathleen M. Curran
Main category: eess.IV
TL;DR: M2DINO: A multi-organ, multi-task ultrasound framework using DINOv3 with task-conditioned Mixture-of-Experts that analyzes when heterogeneous ultrasound tasks can be jointly learned without performance loss, finding aggregation effectiveness depends on training data scale.
Details
Motivation: Foundation models promise to unify clinical tasks but recent ultrasound studies show unified models underperform task-specific baselines. The degradation may arise from task aggregation strategies ignoring interactions between task heterogeneity and training data scale.
Method: Introduces M2DINO, a multi-organ, multi-task framework built on DINOv3 with task-conditioned Mixture-of-Experts blocks for adaptive capacity allocation. Systematically evaluates 27 ultrasound tasks (segmentation, classification, detection, regression) under three paradigms: task-specific, clinically-grouped, and all-task unified training.
Result: Aggregation effectiveness strongly depends on training data scale. Clinically-grouped training improves performance in data-rich settings but causes negative transfer in low-data settings. All-task unified training shows more consistent performance. Segmentation tasks show largest performance drops compared to regression and classification.
Conclusion: Aggregation strategies should jointly consider training data availability and task characteristics rather than relying on clinical taxonomy alone. Provides practical guidance for ultrasound foundation models.
Abstract: Foundation models promise to unify multiple clinical tasks within a single framework, but recent ultrasound studies report that unified models can underperform task-specific baselines. We hypothesize that this degradation arises not from model capacity limitations, but from task aggregation strategies that ignore interactions between task heterogeneity and available training data scale. In this work, we systematically analyze when heterogeneous ultrasound tasks can be jointly learned without performance loss, establishing practical criteria for task aggregation in unified clinical imaging models. We introduce M2DINO, a multi-organ, multi-task framework built on DINOv3 with task-conditioned Mixture-of-Experts blocks for adaptive capacity allocation. We systematically evaluate 27 ultrasound tasks spanning segmentation, classification, detection, and regression under three paradigms: task-specific, clinically-grouped, and all-task unified training. Our results show that aggregation effectiveness depends strongly on training data scale. While clinically-grouped training can improve performance in data-rich settings, it may induce substantial negative transfer in low-data settings. In contrast, all-task unified training exhibits more consistent performance across clinical groups. We further observe that task sensitivity varies by task type in our experiments: segmentation shows the largest performance drops compared with regression and classification. These findings provide practical guidance for ultrasound foundation models, emphasizing that aggregation strategies should jointly consider training data availability and task characteristics rather than relying on clinical taxonomy alone.
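Task-conditioned Mixture-of-Experts routing, where the gate depends on a task embedding rather than on the input, can be sketched as follows; dimensions and the gating form are illustrative assumptions, not M2DINO's actual blocks:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def task_conditioned_moe(x, task_emb, experts, gate_w):
    """Task-conditioned MoE block: routing weights come from the task
    embedding (not the input), so each task gets its own capacity mix
    over a shared pool of experts."""
    gates = softmax(gate_w @ task_emb)
    return sum(g * (W @ x) for g, W in zip(gates, experts))

rng = np.random.default_rng(0)
d, n_exp, n_tasks = 8, 4, 3
experts = [rng.standard_normal((d, d)) for _ in range(n_exp)]
gate_w = rng.standard_normal((n_exp, d))
task_embs = rng.standard_normal((n_tasks, d))
x = rng.standard_normal(d)
# The same feature x is transformed differently per task.
outs = [task_conditioned_moe(x, e, experts, gate_w) for e in task_embs]
```

This is the mechanism behind "adaptive capacity allocation": heterogeneous tasks share the backbone and expert pool, while the task-specific gate controls how much each expert contributes per task.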