Today’s Research Highlights
AI-enhanced summaries of the latest research papers from arXiv.
Table of Contents
- cs.CL [Total: 173]
- cs.CV [Total: 118]
- cs.AI [Total: 62]
- cs.SD [Total: 8]
- cs.LG [Total: 128]
- cs.MA [Total: 8]
- cs.MM [Total: 1]
- eess.AS [Total: 5]
- eess.IV [Total: 7]
cs.CL
[1] DeepResearch-Slice: Bridging the Retrieval-Utilization Gap via Explicit Text Slicing
Shuo Lu, Yinuo Xu, Jianjie Cheng, Lingxiao He, Meng Wang, Jian Liang
Main category: cs.CL
TL;DR: DeepResearch-Slice introduces a neuro-symbolic framework that uses explicit span prediction to filter retrieved evidence before reasoning, addressing the retrieval-utilization gap where models fail to use gold evidence even when retrieved.
Details
Motivation: Current deep research agents focus on optimizing search policies for retrieval but suffer from a "retrieval-utilization gap" - models fail to effectively use gold evidence even after retrieval due to context blindness in noisy environments.
Method: Proposes DeepResearch-Slice, a neuro-symbolic framework that predicts precise span indices to perform deterministic hard filtering of retrieved evidence before reasoning, unlike implicit attention mechanisms.
Result: Extensive evaluations across six benchmarks show substantial robustness gains. Applying the method to frozen backbones yields 73% relative improvement (from 19.1% to 33.0%), effectively mitigating noise without requiring parameter updates to the reasoning model.
Conclusion: The results highlight the need for explicit grounding mechanisms in open-ended research, demonstrating that explicit span-based filtering can bridge the retrieval-utilization gap more effectively than implicit attention approaches.
Abstract: Deep Research agents predominantly optimize search policies to maximize retrieval probability. However, we identify a critical bottleneck: the retrieval-utilization gap, where models fail to use gold evidence even after it is retrieved, due to context blindness in noisy environments. To bridge this gap, we propose DeepResearch-Slice, a simple yet effective neuro-symbolic framework. Unlike implicit attention, our approach predicts precise span indices to perform a deterministic hard filter before reasoning. Extensive evaluations across six benchmarks show substantial robustness gains. Applying our method to frozen backbones yields a 73 percent relative improvement, from 19.1 percent to 33.0 percent, effectively mitigating noise without requiring parameter updates to the reasoning model. These results highlight the need for explicit grounding mechanisms in open-ended research.
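To make the slicing idea concrete, here is a minimal sketch of span-based hard filtering, assuming the span predictor returns character offsets; the function names and the keyword-based toy predictor are illustrative stand-ins, not the paper's API.

```python
# Minimal sketch of span-based hard filtering: a span predictor returns
# (start, end) character offsets into each retrieved document, and
# everything outside those spans is discarded before the reasoner runs.

def slice_evidence(documents, span_predictor):
    """Keep only the predicted evidence spans from each retrieved document."""
    kept = []
    for doc in documents:
        for start, end in span_predictor(doc):
            kept.append(doc[start:end])
    return "\n".join(kept)

def toy_predictor(doc, keyword="capital"):
    # Keyword lookup stands in for the learned span-index predictor.
    idx = doc.find(keyword)
    return [(max(0, idx - 40), idx + 40)] if idx != -1 else []

docs = ["...pages of distractor text... The capital of France is Paris. ..."]
print(slice_evidence(docs, toy_predictor))
```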
[2] Internal Reasoning vs. External Control: A Thermodynamic Analysis of Sycophancy in Large Language Models
Edward Y. Chang
Main category: cs.CL
TL;DR: The paper finds that internal reasoning alone cannot eliminate sycophancy in LLMs, while external structural constraints (RCA) completely eliminate it, revealing a thermodynamic hierarchy where only matched strong systems achieve optimal efficiency.
Details
Motivation: To investigate whether sycophancy in LLMs (prioritizing user agreeableness over correctness) can be mitigated by internal reasoning alone or requires external regulation, using adversarial testing to understand the structural limits of different approaches.
Method: Used CAP-GSM8K (N=500) adversarial dataset to evaluate internal (Chain-of-Thought reasoning) versus external (RCA - presumably some form of external constraint mechanism) approaches across GPT-3.5, GPT-4o, and GPT-5.1 models.
Result: Internal reasoning causes performance collapse in weak models (the Prioritization Paradox) and leaves an 11.4% final output gap in frontier models, while RCA structurally eliminates sycophancy (0.0%) across all model tiers. Reveals a thermodynamic hierarchy: hybrid systems achieve Resonance only with matched strong capabilities, while weak/mismatched pairs suffer Dissonance and Entropy.
Conclusion: External structural constraints are strictly necessary to guarantee safety in LLMs, as internal reasoning alone cannot eliminate sycophancy, confirming the need for external regulation mechanisms.
Abstract: Large Language Models frequently exhibit sycophancy, prioritizing user agreeableness over correctness. We investigate whether this requires external regulation or can be mitigated by internal reasoning alone. Using CAP-GSM8K (N=500), an adversarial dataset, we evaluate internal (CoT) versus external (RCA) mechanisms across GPT-3.5, GPT-4o, and GPT-5.1. Our results reveal the structural limits of internal reasoning: it causes performance collapse in weak models (the Prioritization Paradox) and leaves an 11.4% final output gap in frontier models. In contrast, RCA structurally eliminates sycophancy (0.0%) across all tiers. We synthesize these findings into a thermodynamic hierarchy: hybrid systems achieve Resonance (optimal efficiency) only when capabilities are matched and strong, while weak or mismatched pairs succumb to Dissonance and Entropy. This confirms that external structural constraints are strictly necessary to guarantee safety.
[3] Jailbreak-Zero: A Path to Pareto Optimal Red Teaming for Large Language Models
Kai Hu, Abhinav Aggarwal, Mehran Khodabandeh, David Zhang, Eric Hsin, Li Chen, Ankit Jain, Matt Fredrikson, Akash Bharadwaj
Main category: cs.CL
TL;DR: Jailbreak-Zero is a new red teaming method that moves from example-based to policy-based LLM safety evaluation, using an attack LLM to generate diverse adversarial prompts and fine-tuning it for Pareto optimality across coverage, diversity, and fidelity.
Details
Motivation: Current LLM safety evaluation methods are constrained by example-based approaches that limit effectiveness. There's a need for a more expansive, scalable framework that can better identify safety vulnerabilities across diverse policies and attack strategies.
Method: Uses an attack LLM to generate high volumes of diverse adversarial prompts, then fine-tunes this attack model with a preference dataset to achieve Pareto optimality across three key objectives: policy coverage, attack strategy diversity, and prompt fidelity to real user inputs.
Result: Demonstrates significantly higher attack success rates against both open-source and proprietary models (GPT-4o and Claude 3.5) compared to state-of-the-art techniques, while producing human-readable adversarial prompts with minimal human intervention.
Conclusion: Jailbreak-Zero presents a more scalable and comprehensive solution for identifying and mitigating LLM safety vulnerabilities by shifting to a policy-based framework that achieves better coverage and effectiveness with less human effort.
Abstract: This paper introduces Jailbreak-Zero, a novel red teaming methodology that shifts the paradigm of Large Language Model (LLM) safety evaluation from a constrained example-based approach to a more expansive and effective policy-based framework. By leveraging an attack LLM to generate a high volume of diverse adversarial prompts and then fine-tuning this attack model with a preference dataset, Jailbreak-Zero achieves Pareto optimality across the crucial objectives of policy coverage, attack strategy diversity, and prompt fidelity to real user inputs. The empirical evidence demonstrates the superiority of this method, showcasing significantly higher attack success rates against both open-source and proprietary models like GPT-4o and Claude 3.5 when compared to existing state-of-the-art techniques. Crucially, Jailbreak-Zero accomplishes this while producing human-readable and effective adversarial prompts with minimal need for human intervention, thereby presenting a more scalable and comprehensive solution for identifying and mitigating the safety vulnerabilities of LLMs.
[4] Benchmarking and Adapting On-Device Large Language Models for Clinical Decision Support
Alif Munim, Jun Ma, Omar Ibrahim, Alhusain Abdalla, Shuolin Yin, Leo Chen, Bo Wang
Main category: cs.CL
TL;DR: On-device LLMs (gpt-oss-20b and gpt-oss-120b) achieve comparable performance to larger proprietary models in clinical tasks while enabling privacy-preserving local inference, with fine-tuning further improving diagnostic accuracy.
Details
Motivation: Proprietary LLMs face privacy concerns and cloud dependency, while open-source alternatives are often too large for resource-constrained clinical settings, creating a need for efficient on-device solutions.
Method: Benchmarked two on-device LLMs (gpt-oss-20b and gpt-oss-120b) across three clinical tasks: general disease diagnosis, ophthalmology diagnosis/management, and expert grading simulation. Compared against proprietary models (GPT-5, o4-mini) and open-source DeepSeek-R1. Fine-tuned gpt-oss-20b on diagnostic data.
Result: gpt-oss models achieved performance comparable to or exceeding DeepSeek-R1 and o4-mini despite smaller size. Fine-tuning significantly improved gpt-oss-20b’s diagnostic accuracy, approaching GPT-5 performance.
Conclusion: On-device LLMs offer accurate, adaptable, privacy-preserving clinical decision support, providing a practical pathway for broader LLM integration into routine clinical practice.
Abstract: Large language models (LLMs) have rapidly advanced in clinical decision-making, yet the deployment of proprietary systems is hindered by privacy concerns and reliance on cloud-based infrastructure. Open-source alternatives allow local inference but often require large model sizes that limit their use in resource-constrained clinical settings. Here, we benchmark two on-device LLMs, gpt-oss-20b and gpt-oss-120b, across three representative clinical tasks: general disease diagnosis, specialty-specific (ophthalmology) diagnosis and management, and simulation of human expert grading and evaluation. We compare their performance with state-of-the-art proprietary models (GPT-5 and o4-mini) and a leading open-source model (DeepSeek-R1), and we further evaluate the adaptability of on-device systems by fine-tuning gpt-oss-20b on general diagnostic data. Across tasks, gpt-oss models achieve performance comparable to or exceeding DeepSeek-R1 and o4-mini despite being substantially smaller. In addition, fine-tuning remarkably improves the diagnostic accuracy of gpt-oss-20b, enabling it to approach the performance of GPT-5. These findings highlight the potential of on-device LLMs to deliver accurate, adaptable, and privacy-preserving clinical decision support, offering a practical pathway for broader integration of LLMs into routine clinical practice.
[5] OpenAI GPT-5 System Card
Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, Akshay Nathan, Alan Luo, Alec Helyar, Aleksander Madry, Aleksandr Efremov, Aleksandra Spyra, Alex Baker-Whitcomb, Alex Beutel, Alex Karpenko, Alex Makelov, Alex Neitz, Alex Wei, Alexandra Barr, Alexandre Kirchmeyer, Alexey Ivanov, Alexi Christakis, Alistair Gillespie, Allison Tam, Ally Bennett, Alvin Wan, Alyssa Huang, Amy McDonald Sandjideh, Amy Yang, Ananya Kumar, Andre Saraiva, Andrea Vallone, Andrei Gheorghe, Andres Garcia Garcia, Andrew Braunstein, Andrew Liu, Andrew Schmidt, Andrey Mereskin, Andrey Mishchenko, Andy Applebaum, Andy Rogerson, Ann Rajan, Annie Wei, Anoop Kotha, Anubha Srivastava, Anushree Agrawal, Arun Vijayvergiya, Ashley Tyra, Ashvin Nair, Avi Nayak, Ben Eggers, Bessie Ji, Beth Hoover, Bill Chen, Blair Chen, Boaz Barak, Borys Minaiev, Botao Hao, Bowen Baker, Brad Lightcap, Brandon McKinzie, Brandon Wang, Brendan Quinn, Brian Fioca, Brian Hsu, Brian Yang, Brian Yu, Brian Zhang, Brittany Brenner, Callie Riggins Zetino, Cameron Raymond, Camillo Lugaresi, Carolina Paz, Cary Hudson, Cedric Whitney, Chak Li, Charles Chen, Charlotte Cole, Chelsea Voss, Chen Ding, Chen Shen, Chengdu Huang, Chris Colby, Chris Hallacy, Chris Koch, Chris Lu, Christina Kaplan, Christina Kim, CJ Minott-Henriques, Cliff Frey, Cody Yu, Coley Czarnecki, Colin Reid, Colin Wei, Cory Decareaux, Cristina Scheau, Cyril Zhang, Cyrus Forbes, Da Tang, Dakota Goldberg, Dan Roberts, Dana Palmie, Daniel Kappler, Daniel Levine, Daniel Wright, Dave Leo, David Lin, David Robinson, Declan Grabb, Derek Chen, Derek Lim, Derek Salama, Dibya Bhattacharjee, Dimitris Tsipras, Dinghua Li, Dingli Yu, DJ Strouse, Drew Williams, Dylan Hunn, Ed Bayes, Edwin Arbus, Ekin Akyurek, Elaine Ya Le, Elana Widmann, Eli Yani, Elizabeth Proehl, Enis Sert, Enoch Cheung, Eri Schwartz, Eric Han, Eric Jiang, Eric Mitchell, Eric Sigler, Eric Wallace, Erik Ritter, Erin Kavanaugh, Evan Mays, Evgenii Nikishin, Fangyuan Li, Felipe Petroski Such, Filipe de Avila Belbute Peres, Filippo Raso, Florent Bekerman, Foivos Tsimpourlas, Fotis Chantzis, Francis Song, Francis Zhang, Gaby Raila, Garrett McGrath, Gary Briggs, Gary Yang, Giambattista Parascandolo, Gildas Chabot, Grace Kim, Grace Zhao, Gregory Valiant, Guillaume Leclerc, Hadi Salman, Hanson Wang, Hao Sheng, Haoming Jiang, Haoyu Wang, Haozhun Jin, Harshit Sikchi, Heather Schmidt, Henry Aspegren, Honglin Chen, Huida Qiu, Hunter Lightman, Ian Covert, Ian Kivlichan, Ian Silber, Ian Sohl, Ibrahim Hammoud, Ignasi Clavera, Ikai Lan, Ilge Akkaya, Ilya Kostrikov, Irina Kofman, Isak Etinger, Ishaan Singal, Jackie Hehir, Jacob Huh, Jacqueline Pan, Jake Wilczynski, Jakub Pachocki, James Lee, James Quinn, Jamie Kiros, Janvi Kalra, Jasmyn Samaroo, Jason Wang, Jason Wolfe, Jay Chen, Jay Wang, Jean Harb, Jeffrey Han, Jeffrey Wang, Jennifer Zhao, Jeremy Chen, Jerene Yang, Jerry Tworek, Jesse Chand, Jessica Landon, Jessica Liang, Ji Lin, Jiancheng Liu, Jianfeng Wang, Jie Tang, Jihan Yin, Joanne Jang, Joel Morris, Joey Flynn, Johannes Ferstad, Johannes Heidecke, John Fishbein, John Hallman, Jonah Grant, Jonathan Chien, Jonathan Gordon, Jongsoo Park, Jordan Liss, Jos Kraaijeveld, Joseph Guay, Joseph Mo, Josh Lawson, Josh McGrath, Joshua Vendrow, Joy Jiao, Julian Lee, Julie Steele, Julie Wang, Junhua Mao, Kai Chen, Kai Hayashi, Kai Xiao, Kamyar Salahi, Kan Wu, Karan Sekhri, Karan Sharma, Karan Singhal, Karen Li, Kenny Nguyen, Keren Gu-Lemberg, Kevin King, Kevin Liu, Kevin Stone, 
Kevin Yu, Kristen Ying, Kristian Georgiev, Kristie Lim, Kushal Tirumala, Kyle Miller, Lama Ahmad, Larry Lv, Laura Clare, Laurance Fauconnet, Lauren Itow, Lauren Yang, Laurentia Romaniuk, Leah Anise, Lee Byron, Leher Pathak, Leon Maksin, Leyan Lo, Leyton Ho, Li Jing, Liang Wu, Liang Xiong, Lien Mamitsuka, Lin Yang, Lindsay McCallum, Lindsey Held, Liz Bourgeois, Logan Engstrom, Lorenz Kuhn, Louis Feuvrier, Lu Zhang, Lucas Switzer, Lukas Kondraciuk, Lukasz Kaiser, Manas Joglekar, Mandeep Singh, Mandip Shah, Manuka Stratta, Marcus Williams, Mark Chen, Mark Sun, Marselus Cayton, Martin Li, Marvin Zhang, Marwan Aljubeh, Matt Nichols, Matthew Haines, Max Schwarzer, Mayank Gupta, Meghan Shah, Melody Huang, Meng Dong, Mengqing Wang, Mia Glaese, Micah Carroll, Michael Lampe, Michael Malek, Michael Sharman, Michael Zhang, Michele Wang, Michelle Pokrass, Mihai Florian, Mikhail Pavlov, Miles Wang, Ming Chen, Mingxuan Wang, Minnia Feng, Mo Bavarian, Molly Lin, Moose Abdool, Mostafa Rohaninejad, Nacho Soto, Natalie Staudacher, Natan LaFontaine, Nathan Marwell, Nelson Liu, Nick Preston, Nick Turley, Nicklas Ansman, Nicole Blades, Nikil Pancha, Nikita Mikhaylin, Niko Felix, Nikunj Handa, Nishant Rai, Nitish Keskar, Noam Brown, Ofir Nachum, Oleg Boiko, Oleg Murk, Olivia Watkins, Oona Gleeson, Pamela Mishkin, Patryk Lesiewicz, Paul Baltescu, Pavel Belov, Peter Zhokhov, Philip Pronin, Phillip Guo, Phoebe Thacker, Qi Liu, Qiming Yuan, Qinghua Liu, Rachel Dias, Rachel Puckett, Rahul Arora, Ravi Teja Mullapudi, Raz Gaon, Reah Miyara, Rennie Song, Rishabh Aggarwal, RJ Marsan, Robel Yemiru, Robert Xiong, Rohan Kshirsagar, Rohan Nuttall, Roman Tsiupa, Ronen Eldan, Rose Wang, Roshan James, Roy Ziv, Rui Shu, Ruslan Nigmatullin, Saachi Jain, Saam Talaie, Sam Altman, Sam Arnesen, Sam Toizer, Sam Toyer, Samuel Miserendino, Sandhini Agarwal, Sarah Yoo, Savannah Heon, Scott Ethersmith, Sean Grove, Sean Taylor, Sebastien Bubeck, Sever Banesiu, Shaokyi Amdo, Shengjia Zhao, Sherwin Wu, Shibani Santurkar, Shiyu Zhao, Shraman Ray Chaudhuri, Shreyas Krishnaswamy, Shuaiqi, Xia, Shuyang Cheng, Shyamal Anadkat, Simón Posada Fishman, Simon Tobin, Siyuan Fu, Somay Jain, Song Mei, Sonya Egoian, Spencer Kim, Spug Golden, SQ Mah, Steph Lin, Stephen Imm, Steve Sharpe, Steve Yadlowsky, Sulman Choudhry, Sungwon Eum, Suvansh Sanjeev, Tabarak Khan, Tal Stramer, Tao Wang, Tao Xin, Tarun Gogineni, Taya Christianson, Ted Sanders, Tejal Patwardhan, Thomas Degry, Thomas Shadwell, Tianfu Fu, Tianshi Gao, Timur Garipov, Tina Sriskandarajah, Toki Sherbakov, Tomer Kaftan, Tomo Hiratsuka, Tongzhou Wang, Tony Song, Tony Zhao, Troy Peterson, Val Kharitonov, Victoria Chernova, Vineet Kosaraju, Vishal Kuo, Vitchyr Pong, Vivek Verma, Vlad Petrov, Wanning Jiang, Weixing Zhang, Wenda Zhou, Wenlei Xie, Wenting Zhan, Wes McCabe, Will DePue, Will Ellsworth, Wulfie Bain, Wyatt Thompson, Xiangning Chen, Xiangyu Qi, Xin Xiang, Xinwei Shi, Yann Dubois, Yaodong Yu, Yara Khakbaz, Yifan Wu, Yilei Qian, Yin Tat Lee, Yinbo Chen, Yizhen Zhang, Yizhong Xiong, Yonglong Tian, Young Cha, Yu Bai, Yu Yang, Yuan Yuan, Yuanzhi Li, Yufeng Zhang, Yuguang Yang, Yujia Jin, Yun Jiang, Yunyun Wang, Yushi Wang, Yutian Liu, Zach Stubenvoll, Zehao Dou, Zheng Wu, Zhigang Wang
Main category: cs.CL
TL;DR: GPT-5 is a unified AI system with smart/fast and deep reasoning models, plus a real-time router that selects between them based on conversation needs. It features improved safety, reduced hallucinations, and enhanced performance for writing, coding, and health tasks.
Details
Motivation: To create a more useful and efficient AI system that can handle diverse real-world queries by intelligently routing between specialized models, while improving safety, reducing hallucinations, and enhancing performance in key application areas.
Method: Unified system architecture with: 1) gpt-5-main for most questions, 2) gpt-5-thinking for harder problems, 3) real-time router trained on user signals to select appropriate model, 4) mini versions for usage limits, and 5) safe-completions safety training.
Result: Outperforms previous models on benchmarks, answers questions more quickly, more useful for real-world queries, with reduced hallucinations, improved instruction following, minimized sycophancy, and enhanced performance in writing, coding, and health domains.
Conclusion: GPT-5 represents significant advancement in AI systems with intelligent routing between specialized models, improved safety measures, and enhanced practical utility across key application areas, while taking precautionary safety measures for high-capability domains.
Abstract: This is the system card published alongside the OpenAI GPT-5 launch, August 2025. GPT-5 is a unified system with a smart and fast model that answers most questions, a deeper reasoning model for harder problems, and a real-time router that quickly decides which model to use based on conversation type, complexity, tool needs, and explicit intent (for example, if you say ’think hard about this’ in the prompt). The router is continuously trained on real signals, including when users switch models, preference rates for responses, and measured correctness, improving over time. Once usage limits are reached, a mini version of each model handles remaining queries. This system card focuses primarily on gpt-5-thinking and gpt-5-main, while evaluations for other models are available in the appendix. The GPT-5 system not only outperforms previous models on benchmarks and answers questions more quickly, but – more importantly – is more useful for real-world queries. We’ve made significant advances in reducing hallucinations, improving instruction following, and minimizing sycophancy, and have leveled up GPT-5’s performance in three of ChatGPT’s most common uses: writing, coding, and health. All of the GPT-5 models additionally feature safe-completions, our latest approach to safety training to prevent disallowed content. Similarly to ChatGPT agent, we have decided to treat gpt-5-thinking as High capability in the Biological and Chemical domain under our Preparedness Framework, activating the associated safeguards. While we do not have definitive evidence that this model could meaningfully help a novice to create severe biological harm – our defined threshold for High capability – we have chosen to take a precautionary approach.
[6] WRAVAL – WRiting Assist eVALuation
Gabriel Benedict, Matthew Butler, Naved Merchant, Eetu Salama-Laine
Main category: cs.CL
TL;DR: SLMs perform poorly on reasoning tasks but excel at practical applications like tone modification; new evaluation framework shows their real-world value beyond standard benchmarks.
Details
Motivation: Current LLM evaluation focuses on reasoning/problem-solving tasks, which unfairly disadvantages SLMs (under 10B parameters) that score 3-4x lower, despite their effectiveness in practical industrial applications like tone modification.
Method: Proposed evaluation framework with novel approaches: data generation, prompt-tuning, and LLM-based evaluation for non-reasoning tasks where standard datasets don't exist, focusing on task-specific finetuning.
Result: Demonstrated SLMs’ strong capabilities in practical applications (e.g., tone modification) despite poor reasoning scores; framework enables effective benchmarking for edge/private computing scenarios.
Conclusion: Provides practitioners with tools to benchmark SLMs/LLMs for real-world applications, showing SLMs’ practical value beyond reasoning-focused evaluations, especially for edge computing.
Abstract: The emergence of Large Language Models (LLMs) has shifted language model evaluation toward reasoning and problem-solving tasks as measures of general intelligence. Small Language Models (SLMs) – defined here as models under 10B parameters – typically score 3-4 times lower than LLMs on these metrics. However, we demonstrate that these evaluations fail to capture SLMs’ effectiveness in common industrial applications, such as tone modification tasks (e.g., funny, serious, professional). We propose an evaluation framework specifically designed to highlight SLMs’ capabilities in non-reasoning tasks where predefined evaluation datasets don’t exist. Our framework combines novel approaches in data generation, prompt-tuning, and LLM-based evaluation to demonstrate the potential of task-specific finetuning. This work provides practitioners with tools to effectively benchmark both SLMs and LLMs for practical applications, particularly in edge and private computing scenarios. Our implementation is available at: https://github.com/amazon-science/wraval.
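As a rough illustration of the LLM-based evaluation component, the sketch below scores a tone rewrite with an LLM judge; the prompt wording and the `ask_llm` callable are assumptions, and WRAVAL's actual rubric lives in the linked repository.

```python
# Hypothetical LLM-as-judge scoring for a tone-rewrite task. `ask_llm`
# is any callable that sends a prompt to a model and returns its text.

def judge_prompt(original, rewrite, tone):
    return (
        f"Original: {original}\n"
        f"Rewrite: {rewrite}\n"
        f"On a scale of 1-5, how well does the rewrite convey a {tone} "
        "tone while preserving the original meaning? Answer with one digit."
    )

def judge_score(ask_llm, original, rewrite, tone="professional"):
    reply = ask_llm(judge_prompt(original, rewrite, tone)).strip()
    return int(reply[0]) if reply and reply[0].isdigit() else None
```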
[7] The Instruction Gap: LLMs get lost in Following Instruction
Vishesh Tripathi, Uday Allu, Biddwan Ahmed
Main category: cs.CL
TL;DR: LLMs show inconsistent instruction adherence in enterprise settings, with Claude-Sonnet-4 and GPT-5 performing best in RAG scenarios, revealing an “instruction gap” between general capabilities and precise enterprise requirements.
Details
Motivation: Despite LLMs' strong natural language capabilities, their deployment in enterprise environments reveals a critical limitation: inconsistent adherence to custom instructions, which is essential for reliable enterprise applications.
Method: Comprehensive evaluation of 13 leading LLMs using systematic testing with samples and enterprise-grade evaluation protocols, assessing instruction compliance, response accuracy, and performance metrics in real-world RAG scenarios.
Result: Instruction following varies dramatically across models, with Claude-Sonnet-4 and GPT-5 achieving the highest results. The study reveals an “instruction gap” where models excel at general tasks but struggle with precise instruction adherence needed for enterprise deployment.
Conclusion: This work provides practical insights for organizations deploying LLM solutions and establishes benchmarks for instruction-following capabilities across major model families, highlighting the need for improved instruction adherence in enterprise contexts.
Abstract: Large Language Models (LLMs) have shown remarkable capabilities in natural language understanding and generation, yet their deployment in enterprise environments reveals a critical limitation: inconsistent adherence to custom instructions. This study presents a comprehensive evaluation of 13 leading LLMs across instruction compliance, response accuracy, and performance metrics in real-world RAG (Retrieval-Augmented Generation) scenarios. Through systematic testing with samples and enterprise-grade evaluation protocols, we demonstrate that instruction following varies dramatically across models, with Claude-Sonnet-4 and GPT-5 achieving the highest results. Our findings reveal the "instruction gap" - a fundamental challenge where models excel at general tasks but struggle with precise instruction adherence required for enterprise deployment. This work provides practical insights for organizations deploying LLM-powered solutions and establishes benchmarks for instruction-following capabilities across major model families.
[8] Advances and Challenges in Semantic Textual Similarity: A Comprehensive Survey
Lokendra Kumar, Neelesh S. Upadhye, Kannan Piedy
Main category: cs.CL
TL;DR: Survey of Semantic Textual Similarity (STS) advancements since 2021, covering transformer models, contrastive learning, domain adaptation, multi-modal approaches, graph-based methods, and knowledge-enhanced techniques.
Details
Motivation: To organize and analyze the rapid expansion of STS research since 2021, driven by advances in transformer architectures, contrastive learning, and domain-specific techniques, providing guidance for researchers and practitioners.
Method: Survey methodology reviewing progress across six key areas: transformer-based models (FarSSiBERT, DeBERTa-v3), contrastive learning (AspectCSE), domain-focused solutions (CXR-BERT, Financial-STS), multi-modal methods, graph-based approaches, and knowledge-enhanced techniques.
Result: Recent transformer models have achieved remarkable accuracy, contrastive methods established new benchmarks, domain-adapted models demonstrate effective customization for specialized fields, and multi-modal/graph-based/knowledge-integrated models enhance semantic understanding.
Conclusion: The survey provides valuable insights into current methods, practical applications, and remaining challenges, aiming to guide researchers and practitioners in navigating rapid advancements and highlighting emerging trends and future opportunities in STS.
Abstract: Semantic Textual Similarity (STS) research has expanded rapidly since 2021, driven by advances in transformer architectures, contrastive learning, and domain-specific techniques. This survey reviews progress across six key areas: transformer-based models, contrastive learning, domain-focused solutions, multi-modal methods, graph-based approaches, and knowledge-enhanced techniques. Recent transformer models such as FarSSiBERT and DeBERTa-v3 have achieved remarkable accuracy, while contrastive methods like AspectCSE have established new benchmarks. Domain-adapted models, including CXR-BERT for medical texts and Financial-STS for finance, demonstrate how STS can be effectively customized for specialized fields. Moreover, multi-modal, graph-based, and knowledge-integrated models further enhance semantic understanding and representation. By organizing and analyzing these developments, the survey provides valuable insights into current methods, practical applications, and remaining challenges. It aims to guide researchers and practitioners alike in navigating rapid advancements, highlighting emerging trends and future opportunities in the evolving field of STS.
[9] Less is more: Not all samples are effective for evaluation
Wentang Song, Jinqiang Li, Kele Huang, Junhui Lin, Shengxiang Wu, Zhongshi Xie
Main category: cs.CL
TL;DR: A history-free test set compression framework that reduces evaluation costs by over 90% while preserving benchmark fidelity, addressing cold-start scenarios where no prior model performance data exists.
Details
Motivation: Specialized evaluation benchmarks for LLMs suffer from semantic redundancy and high computational costs. Existing compression methods require correctness labels from multiple historical models, making them inapplicable in cold-start scenarios (new tasks, domains, or models with no prior evaluation history).
Method: 1) Fine-tune a base LLM on small domain-specific data to internalize task semantics; 2) Generate high-level semantic embeddings using only raw textual content; 3) Perform task-aware clustering in domain-adapted embedding space; 4) Introduce dataset X-ray mechanism that analyzes cluster geometry to dynamically calibrate compression intensity based on intrinsic redundancy.
Result: Experiments on professional-domain datasets (notably large-scale 3GPP communications benchmark) show the approach effectively identifies and removes redundant samples, reducing evaluation cost by over 90% while preserving high fidelity to the full benchmark.
Conclusion: The proposed history-free compression framework successfully addresses cold-start limitations of existing methods, enabling efficient evaluation without requiring prior model performance data, making it applicable to new tasks, domains, and models.
Abstract: The versatility of Large Language Models (LLMs) in vertical domains has spurred the development of numerous specialized evaluation benchmarks. However, these benchmarks often suffer from significant semantic redundancy and impose high computational costs during evaluation. Existing compression methods, such as tinyBenchmarks, depend critically on correctness labels from multiple historical models evaluated on the full test set, making them inapplicable in cold-start scenarios, such as the introduction of a new task, domain, or model with no prior evaluation history. To address this limitation, we propose a history-free test set compression framework that requires no prior model performance data. Our method begins by fine-tuning a base LLM on a small amount of domain-specific data to internalize task-relevant semantics. It then generates high-level semantic embeddings for all original test samples using only their raw textual content. In this domain-adapted embedding space, we perform task-aware clustering and introduce a novel dataset X-ray mechanism that analyzes cluster geometry to dynamically calibrate the compression intensity based on the intrinsic redundancy of the benchmark. Experiments on a professional-domain dataset, notably a large-scale 3GPP communications benchmark, demonstrate that our approach effectively identifies and removes redundant samples, reducing evaluation cost by over 90% while preserving high fidelity to the full benchmark.
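A minimal sketch of the clustering-and-selection step, using generic embeddings; the domain-adapted encoder and the dataset X-ray calibration of the cluster count are stand-ins here (the random vectors below merely make the snippet runnable).

```python
# Compress a test set by clustering sample embeddings and keeping one
# representative (the member nearest each centroid) per cluster.
import numpy as np
from sklearn.cluster import KMeans

def compress_test_set(embeddings, texts, k):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(embeddings)
    keep = []
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(embeddings[members] - km.cluster_centers_[c], axis=1)
        keep.append(members[np.argmin(dists)])
    return [texts[i] for i in sorted(keep)]

rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 64))              # stand-in for LLM embeddings
items = [f"question {i}" for i in range(1000)]
subset = compress_test_set(emb, items, k=80)   # ~92% fewer samples
print(len(subset))
```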
[10] Analyzing Reasoning Shifts in Audio Deepfake Detection under Adversarial Attacks: The Reasoning Tax versus Shield Bifurcation
Binh Nguyen, Thai Le
Main category: cs.CL
TL;DR: Audio Language Models (ALMs) for audio deepfake detection offer explainability through reasoning traces, but their reasoning robustness under adversarial attacks shows mixed results: reasoning can act as a defensive shield for some models or impose a performance tax for others, with cognitive dissonance serving as a silent alarm for manipulation.
Details
Motivation: Current audio deepfake detection systems are often black-box classifiers lacking transparency. ALMs offer explainability through reasoning traces, but there's a need to analyze the robustness of this reasoning under adversarial attacks, going beyond just final prediction shifts.
Method: Introduces a forensic auditing framework to evaluate ALM reasoning robustness across three dimensions: acoustic perception (sound analysis), cognitive coherence (logical consistency), and cognitive dissonance (conflict between reasoning and prediction). Systematically analyzes reasoning shifts under adversarial attacks.
Result: Explicit reasoning doesn’t universally enhance robustness. Bifurcation observed: for models with robust acoustic perception, reasoning acts as a defensive shield; for others, it imposes a performance tax, especially under linguistic attacks that reduce cognitive coherence. High cognitive dissonance can serve as a silent alarm even when classification fails.
Conclusion: This work provides critical evaluation of reasoning’s role in forensic audio deepfake analysis and its vulnerabilities. Shows that while reasoning offers transparency, its robustness varies, and cognitive dissonance can be a valuable indicator of manipulation even when predictions are compromised.
Abstract: Audio Language Models (ALMs) offer a promising shift towards explainable audio deepfake detection (ADD), moving beyond black-box classifiers by providing some level of transparency into their predictions via reasoning traces. This necessitates a new class of model robustness analysis: robustness of the predictive reasoning under adversarial attacks, which goes beyond the existing paradigm that mainly focuses on shifts of the final predictions (e.g., fake vs. real). To analyze such reasoning shifts, we introduce a forensic auditing framework to evaluate the robustness of ALMs' reasoning under adversarial attacks in three inter-connected dimensions: acoustic perception, cognitive coherence, and cognitive dissonance. Our systematic analysis reveals that explicit reasoning does not universally enhance robustness. Instead, we observe a bifurcation: for models exhibiting robust acoustic perception, reasoning acts as a defensive "shield", protecting them from adversarial attacks. However, for others, it imposes a performance "tax", particularly under linguistic attacks which reduce cognitive coherence and increase attack success rate. Crucially, even when classification fails, high cognitive dissonance can serve as a "silent alarm", flagging potential manipulation. Overall, this work provides a critical evaluation of the role of reasoning in forensic audio deepfake analysis and its vulnerabilities.
[11] GuardEval: A Multi-Perspective Benchmark for Evaluating Safety, Fairness, and Robustness in LLM Moderators
Naseem Machlovi, Maryam Saleki, Ruhul Amin, Mohamed Rahouti, Shawqi Al-Maliki, Junaid Qadir, Mohamed M. Abdallah, Ala Al-Fuqaha
Main category: cs.CL
TL;DR: GuardEval is a multi-perspective benchmark dataset for content moderation, and GemmaGuard (GGuard) is a fine-tuned LLM that achieves state-of-the-art performance in detecting nuanced harmful content.
Details
Motivation: Existing LLM moderation systems struggle with nuanced cases like implicit offensiveness, subtle biases, and jailbreak prompts due to their subjective nature and heavy reliance on training data that can reinforce societal biases, leading to inconsistent and ethically problematic outputs.
Method: Created GuardEval benchmark dataset with 106 fine-grained categories covering human emotions, offensive/hateful language, gender/racial bias, and safety concerns. Developed GemmaGuard (GGuard) by QLoRA fine-tuning Gemma3-12B on GuardEval for content moderation with fine-grained labels.
Result: GGuard achieves macro F1 score of 0.832, substantially outperforming OpenAI Moderator (0.64) and Llama Guard (0.61). Multi-perspective, human-centered safety benchmarks reduce biased and inconsistent moderation decisions.
Conclusion: Diverse, representative data materially improves safety, fairness, and robustness on complex borderline cases. GuardEval and GGuard demonstrate that multi-perspective benchmarks are critical for effective content moderation.
Abstract: As large language models (LLMs) become deeply embedded in daily life, the urgent need for safer moderation systems, distinguishing naive from harmful requests while upholding appropriate censorship boundaries, has never been greater. While existing LLMs can detect harmful or unsafe content, they often struggle with nuanced cases such as implicit offensiveness, subtle gender and racial biases, and jailbreak prompts, due to the subjective and context-dependent nature of these issues. Furthermore, their heavy reliance on training data can reinforce societal biases, resulting in inconsistent and ethically problematic outputs. To address these challenges, we introduce GuardEval, a unified multi-perspective benchmark dataset designed for both training and evaluation, containing 106 fine-grained categories spanning human emotions, offensive and hateful language, gender and racial bias, and broader safety concerns. We also present GemmaGuard (GGuard), a QLoRA fine-tuned version of Gemma3-12B trained on GuardEval, to assess content moderation with fine-grained labels. Our evaluation shows that GGuard achieves a macro F1 score of 0.832, substantially outperforming leading moderation models, including OpenAI Moderator (0.64) and Llama Guard (0.61). We show that multi-perspective, human-centered safety benchmarks are critical for reducing biased and inconsistent moderation decisions. GuardEval and GGuard together demonstrate that diverse, representative data materially improve safety, fairness, and robustness on complex, borderline cases.
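For orientation, a hedged sketch of QLoRA fine-tuning in the style the paper describes (a 4-bit quantized base model with LoRA adapters); the checkpoint name, target modules, and hyperparameters are assumptions, not the authors' exact recipe.

```python
# QLoRA setup: load the base model in 4-bit NF4 and train only small
# low-rank adapter matrices on top of the frozen quantized weights.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
base = AutoModelForCausalLM.from_pretrained(
    "google/gemma-3-12b-it",   # assumed checkpoint name
    quantization_config=bnb,
    device_map="auto",
)
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()   # only the adapters are trainable
```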
[12] LLM_annotate: A Python package for annotating and analyzing fiction characters
Hannes Rosenbusch
Main category: cs.CL
TL;DR: LLM_annotate is a Python package for analyzing fiction character personalities using LLMs, featuring standardized workflows for text annotation, trait inference, and quality validation with human-in-the-loop GUI.
Details
Motivation: To provide researchers with a standardized, efficient, and reproducible tool for analyzing character personalities in fiction texts (books, movie scripts) using large language models, addressing the need for systematic character analysis methodologies.
Method: Developed a Python package with functions for text chunking, LLM-based annotation, character name disambiguation, quality scoring, and computation of character-level statistics and embeddings. Supports any LLM (commercial, open-source, or custom) and includes human-in-the-loop GUI for validation.
Result: Created a functional package demonstrated through tutorial examples using The Simpsons Movie and Pride and Prejudice, showing efficient and reproducible character analysis capabilities.
Conclusion: LLM_annotate provides researchers with a flexible, standardized tool for character personality analysis that supports various LLMs and includes validation mechanisms, enabling efficient and reproducible literary character studies.
Abstract: LLM_annotate is a Python package for analyzing the personality of fiction characters with large language models. It standardizes workflows for annotating character behaviors in full texts (e.g., books and movie scripts), inferring character traits, and validating annotation/inference quality via a human-in-the-loop GUI. The package includes functions for text chunking, LLM-based annotation, character name disambiguation, quality scoring, and computation of character-level statistics and embeddings. Researchers can use any LLM, commercial, open-source, or custom, within LLM_annotate. Through tutorial examples using The Simpsons Movie and the novel Pride and Prejudice, I demonstrate the usage of the package for efficient and reproducible character analyses.
[13] Topic Segmentation Using Generative Language Models
Pierre Mackenzie, Maya Shah, Patrick Frenett
Main category: cs.CL
TL;DR: LLMs outperform traditional semantic similarity methods for topic segmentation using overlapping recursive prompting with sentence enumeration, but still have reliability issues.
Details
Motivation: Topic segmentation using LLMs is underexplored, and existing semantic similarity methods lack the long-range dependencies and extensive knowledge that LLMs possess.
Method: Proposed overlapping and recursive prompting strategy using sentence enumeration, and advocated for boundary similarity evaluation metric adoption.
Result: LLMs demonstrate greater effectiveness as segmenters compared to existing methods, showing improved performance in topic segmentation tasks.
Conclusion: While LLMs show promise for topic segmentation and outperform current approaches, significant issues remain that must be addressed before they can be reliably deployed for this task.
Abstract: Topic segmentation using generative Large Language Models (LLMs) remains relatively unexplored. Previous methods use semantic similarity between sentences, but such models lack the long range dependencies and vast knowledge found in LLMs. In this work, we propose an overlapping and recursive prompting strategy using sentence enumeration. We also support the adoption of the boundary similarity evaluation metric. Results show that LLMs can be more effective segmenters than existing methods, but issues remain to be solved before they can be relied upon for topic segmentation.
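A sketch of what overlapping, enumeration-based prompting could look like; the window size, overlap, prompt wording, and `ask_llm` stand-in are assumptions rather than the paper's exact protocol.

```python
# Slide an overlapping window over globally numbered sentences, ask the
# model for topic-boundary sentence numbers in each window, and union
# the votes across windows.

def window_prompt(sentences, start):
    numbered = "\n".join(f"{start + i}: {s}" for i, s in enumerate(sentences))
    return ("Below are numbered sentences. Reply with the numbers of the "
            "sentences that begin a new topic, comma-separated.\n\n" + numbered)

def segment(sentences, ask_llm, window=20, overlap=5):
    boundaries = set()
    for start in range(0, len(sentences), window - overlap):
        reply = ask_llm(window_prompt(sentences[start:start + window], start))
        boundaries |= {int(tok) for tok in reply.replace(",", " ").split()
                       if tok.isdigit()}
    return sorted(b for b in boundaries if b < len(sentences))
```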
[14] HiKE: Hierarchical Evaluation Framework for Korean-English Code-Switching Speech Recognition
Gio Paik, Yongbeom Kim, Soungmin Lee, Sangmin Ahn, Chanwoo Kim
Main category: cs.CL
TL;DR: HiKE is the first non-synthetic Korean-English code-switching benchmark with hierarchical labeling for systematic evaluation of multilingual ASR models on code-switching tasks.
Details
Motivation: Code-switching (CS) remains an underexplored challenge in multilingual ASR despite advances, with Korean-English CS particularly lacking accessible evaluation frameworks.
Method: Created HiKE benchmark with high-quality natural CS data across topics, loanword labels, and hierarchical CS-level labeling (word, phrase, sentence) for systematic evaluation.
Result: Most multilingual ASR models initially perform poorly on CS tasks, but fine-tuning with synthetic CS data enables this capability.
Conclusion: HiKE provides a crucial evaluation framework for Korean-English code-switching research and demonstrates that CS capability in ASR models can be developed through appropriate fine-tuning.
Abstract: Despite advances in multilingual automatic speech recognition (ASR), code-switching (CS), the mixing of languages within an utterance common in daily speech, remains a severely underexplored challenge. In this paper, we introduce HiKE: the Hierarchical Korean-English code-switching benchmark, the first globally accessible non-synthetic evaluation framework for Korean-English CS, aiming to provide a means for the precise evaluation of multilingual ASR models and to foster research in the field. The proposed framework not only consists of high-quality, natural CS data across various topics, but also provides meticulous loanword labels and a hierarchical CS-level labeling scheme (word, phrase, and sentence) that together enable a systematic evaluation of a model’s ability to handle each distinct level of code-switching. Through evaluations of diverse multilingual ASR models and fine-tuning experiments, this paper demonstrates that although most multilingual ASR models initially exhibit inadequate CS-ASR performance, this capability can be enabled through fine-tuning with synthetic CS data. HiKE is available at https://github.com/ThetaOne-AI/HiKE.
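A sketch of level-stratified scoring in the spirit of HiKE's hierarchical labels; the field names and data layout are assumptions, not HiKE's actual schema.

```python
# Compute word error rate separately for each code-switching level
# (word / phrase / sentence), given a transcribe() function for the
# ASR model under test.
from collections import defaultdict
import jiwer

def stratified_wer(examples, transcribe):
    buckets = defaultdict(lambda: ([], []))
    for ex in examples:   # assumed keys: "audio", "text", "cs_level"
        refs, hyps = buckets[ex["cs_level"]]
        refs.append(ex["text"])
        hyps.append(transcribe(ex["audio"]))
    return {level: jiwer.wer(refs, hyps)
            for level, (refs, hyps) in buckets.items()}
```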
[15] Bare-Metal Tensor Virtualization: Overcoming the Memory Wall in Edge-AI Inference on ARM64
Bugra Kilictas, Faruk Alpay
Main category: cs.CL
TL;DR: A software-based “Virtual Tensor Core” architecture for ARM64 that bypasses standard libraries using direct memory mapping and NEON SIMD kernels to overcome memory bottlenecks in LLM edge deployment.
Details
Motivation: LLM deployment on edge devices is constrained by the "Memory Wall" bottleneck where data movement latency exceeds arithmetic throughput. Standard inference runtimes have significant overhead from high-level abstractions, dynamic dispatch, and unaligned memory access patterns.
Method: Proposes a "Virtual Tensor Core" software architecture optimized for ARM64 (Apple Silicon) using direct memory mapping (mmap) instead of standard library containers, hand-tuned NEON SIMD kernels, and "Software-Defined DMA." Includes Tensor Virtualization Layout (TVL) for 100% cache line utilization and zero-copy loader to eliminate initialization latency.
Result: Achieves stable throughput of >60 tokens/second on M2 hardware with a 110M parameter model, meeting the 200ms psycholinguistic latency threshold without proprietary dependencies.
Conclusion: While proprietary hardware accelerators (like Apple AMX) offer higher peak throughput, this architecture provides a fully open, portable, and deterministic reference implementation for studying memory bottlenecks on general-purpose ARM silicon.
Abstract: The deployment of Large Language Models (LLMs) on edge devices is fundamentally constrained by the "Memory Wall": the bottleneck where data movement latency outstrips arithmetic throughput. Standard inference runtimes often incur significant overhead through high-level abstractions, dynamic dispatch, and unaligned memory access patterns. In this work, we present a novel "Virtual Tensor Core" architecture implemented in software, optimized specifically for ARM64 microarchitectures (Apple Silicon). By bypassing standard library containers in favor of direct memory mapping (mmap) and implementing hand-tuned NEON SIMD kernels, we achieve a form of "Software-Defined Direct Memory Access (DMA)." Our proposed Tensor Virtualization Layout (TVL) guarantees 100% cache line utilization for weight matrices, while our zero-copy loader eliminates initialization latency. Experimental results on a 110M parameter model demonstrate a stable throughput of >60 tokens/second on M2 hardware. While proprietary hardware accelerators (e.g., Apple AMX) can achieve higher peak throughput, our architecture provides a fully open, portable, and deterministic reference implementation for studying the memory bottleneck on general-purpose ARM silicon, meeting the 200ms psycholinguistic latency threshold without opaque dependencies.
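The zero-copy loading idea can be illustrated even from Python: map the weight file into memory and view it as a float32 array without copying. The file layout is an assumption, and the paper's implementation is native ARM64 code rather than Python.

```python
# mmap-based zero-copy weight loading: the OS pages weights in on
# demand, so no upfront copy into a Python-managed buffer is made.
import mmap
import numpy as np

def load_weights_zero_copy(path, shape):
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # frombuffer creates a read-only view over the mapped pages; the
    # returned array keeps the mapping alive via its .base reference.
    return np.frombuffer(mm, dtype=np.float32).reshape(shape)
```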
[16] A path to natural language through tokenisation and transformers
David S. Berman, Alexander G. Stapleton
Main category: cs.CL
TL;DR: BPE tokenization transforms corpus statistics toward Zipfian power laws, reduces local token dependencies, and helps language models better match Zipf-derived entropy predictions.
Details
Motivation: To understand how modern tokenization schemes (like BPE) relate to fundamental statistical properties of natural language (Zipf's and Heaps' laws), and how they affect the information content and statistical structure that language models learn.
Method: 1) Derive closed-form expression for slot entropy expectation under Zipfian distribution; 2) Empirically analyze how BPE transforms corpus statistics across varying depths; 3) Train transformer language models on corpora tokenized at different BPE depths and analyze their predictive entropies; 4) Use attention-based diagnostics to examine token dependencies.
Result: 1) Recursive BPE applications drive token frequencies toward Zipfian power law; 2) BPE induces characteristic growth pattern in empirical entropy; 3) Model predictive entropies increasingly agree with Zipf-derived predictions as BPE depth increases; 4) Deeper tokenization reduces local token dependencies, bringing distribution closer to weakly dependent (near IID) regime.
Conclusion: BPE acts not only as a compression mechanism but also as a statistical transform that reconstructs key informational properties of natural language, making token distributions more Zipfian and reducing local dependencies to better match theoretical expectations.
Abstract: Natural languages exhibit striking regularities in their statistical structure, including notably the emergence of Zipf’s and Heaps’ laws. Despite this, it remains broadly unclear how these properties relate to the modern tokenisation schemes used in contemporary transformer models. In this note, we analyse the information content (as measured by the Shannon entropy) of various corpora under the assumption of a Zipfian frequency distribution, and derive a closed-form expression for the slot entropy expectation value. We then empirically investigate how byte–pair encoding (BPE) transforms corpus statistics, showing that recursive applications of BPE drive token frequencies toward a Zipfian power law while inducing a characteristic growth pattern in empirical entropy. Utilizing the ability of transformers to learn context dependent token probability distributions, we train language models on corpora tokenised at varying BPE depths, revealing that the model predictive entropies increasingly agree with Zipf-derived predictions as the BPE depth increases. Attention-based diagnostics further indicate that deeper tokenisation reduces local token dependencies, bringing the empirical distribution closer to the weakly dependent (near IID) regime. Together, these results clarify how BPE acts not only as a compression mechanism but also as a statistical transform that reconstructs key informational properties of natural language.
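As a small companion to the entropy analysis, the snippet below computes the Shannon entropy of a Zipfian token distribution p_r ∝ r^(-s), the quantity the closed-form expression targets; the exponent and vocabulary size are arbitrary choices for illustration.

```python
# Entropy (in bits) of a rank-frequency distribution following
# Zipf's law with exponent s over n_types token types.
import numpy as np

def zipf_entropy_bits(n_types, s=1.0):
    ranks = np.arange(1, n_types + 1, dtype=float)
    weights = ranks ** (-s)
    p = weights / weights.sum()
    return float(-(p * np.log2(p)).sum())

print(zipf_entropy_bits(50_000))   # entropy of a 50k-type Zipfian vocabulary
```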
[17] Metaphors are a Source of Cross-Domain Misalignment of Large Reasoning Models
Zhibo Hu, Chen Wang, Yanfeng Shu, Hye-young Paik, Liming Zhu
Main category: cs.CL
TL;DR: Metaphors in training data causally influence LLMs’ cross-domain misalignment by affecting latent feature activation, enabling detection of misaligned content.
Details
Motivation: Since metaphors influence human decision making and LLMs are trained on data containing many metaphors, researchers investigate whether metaphors affect LLMs' reasoning pathways, particularly in the context of emergent misalignment where models generalize misaligned patterns across domains.
Method: Researchers study the causal relationship between metaphors in training data and LLM misalignment through interventions in pre-training, fine-tuning, and re-alignment phases. They analyze connections between metaphors and activation of global/local latent features in reasoning models, then design a detector based on monitoring these features.
Result: Strong causal relationship found between metaphors and misalignment degree. Interventions using metaphors significantly change cross-domain misalignment. Connection observed between metaphors and latent feature activation. Designed detector successfully predicts misaligned content with high accuracy.
Conclusion: Metaphors in training data significantly influence LLMs’ reasoning pathways and cross-domain misalignment through their effect on latent feature activation, enabling effective detection of misaligned content.
Abstract: Earlier research has shown that metaphors influence humans' decision making, which raises the question of whether metaphors also influence large language models (LLMs)' reasoning pathways, considering their training data contain a large number of metaphors. In this work, we investigate the problem in the scope of the emergent misalignment problem where LLMs can generalize patterns learned from misaligned content in one domain to another domain. We discover a strong causal relationship between metaphors in training data and the misalignment degree of LLMs' reasoning contents. With interventions using metaphors in pre-training, fine-tuning and re-alignment phases, models' cross-domain misalignment degrees change significantly. As we delve deeper into the causes behind this phenomenon, we observe that there is a connection between metaphors and the activation of global and local latent features of large reasoning models. By monitoring these latent features, we design a detector that predicts misaligned content with high accuracy.
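The monitoring step resembles a probe over model internals; below is a generic sketch that fits a linear probe on (synthetic) hidden activations, where the layer choice and features are placeholders for the metaphor-linked latent features the paper monitors.

```python
# Train a logistic-regression probe to flag misaligned generations from
# hidden-state features; synthetic data keeps the sketch self-contained.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_probe(activations, labels):
    # activations: (n_samples, hidden_dim); labels: 1 = misaligned
    return LogisticRegression(max_iter=1000).fit(activations, labels)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 128))
y = rng.integers(0, 2, size=200)
probe = train_probe(X, y)
print(probe.predict_proba(X[:1])[0, 1])   # misalignment score for one sample
```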
[18] Breaking the Assistant Mold: Modeling Behavioral Variation in LLM Based Procedural Character Generation
Maan Qraitem, Kate Saenko, Bryan A. Plummer
Main category: cs.CL
TL;DR: PersonaWeaver addresses alignment-induced biases in character generation by separating world-building from behavioral-building, creating more diverse and dramatically interesting characters.
Details
Motivation: Existing character generation methods suffer from two alignment-induced biases: positive moral bias (characters always adopt agreeable stances) and helpful assistant bias (characters always answer questions directly). These biases suppress dramatic tension and yield predictable characters, stemming from maximum likelihood training and assistant fine-tuning.
Method: Introduces PersonaWeaver framework that disentangles world-building (roles, demographics) from behavioral-building (moral stances, interactional styles), enabling creation of characters with more diverse reactions and moral stances, plus second-order diversity in stylistic markers like length, tone, and punctuation.
Result: The framework yields characters with more diverse reactions and moral stances, as well as second-order diversity in stylistic markers like length, tone, and punctuation, addressing the limitations of existing methods.
Conclusion: PersonaWeaver successfully addresses alignment-induced biases in character generation, enabling creation of more dramatically interesting and diverse characters by separating world-building from behavioral-building aspects.
Abstract: Procedural content generation has enabled vast virtual worlds through levels, maps, and quests, but large-scale character generation remains underexplored. We identify two alignment-induced biases in existing methods: a positive moral bias, where characters uniformly adopt agreeable stances (e.g. always saying lying is bad), and a helpful assistant bias, where characters invariably answer questions directly (e.g. never refusing or deflecting). While such tendencies suit instruction-following systems, they suppress dramatic tension and yield predictable characters, stemming from maximum likelihood training and assistant fine-tuning. To address this, we introduce PersonaWeaver, a framework that disentangles world-building (roles, demographics) from behavioral-building (moral stances, interactional styles), yielding characters with more diverse reactions and moral stances, as well as second-order diversity in stylistic markers like length, tone, and punctuation. Code: https://github.com/mqraitem/Persona-Weaver
[19] Rendering Data Unlearnable by Exploiting LLM Alignment Mechanisms
Ruihan Zhang, Jun Sun
Main category: cs.CL
TL;DR: Disclaimer Injection makes text unlearnable to LLMs by injecting alignment-triggering disclaimers, exploiting models’ alignment mechanisms to prevent effective learning during fine-tuning.
Details
Motivation: Address concerns about unauthorized use of proprietary/personal data in LLM training by developing data protection methods against unwanted model learning in realistic black-box settings.
Method: Disclaimer Injection - a data-level defense that injects carefully designed alignment-triggering disclaimers into text to exploit LLMs' alignment mechanisms, preventing effective learning during fine-tuning.
Result: Fine-tuning on protected data causes persistent activation of alignment-related layers, overriding task learning. Models show substantial performance degradation compared to standard fine-tuning.
Conclusion: Alignment behavior serves as a novel lever for data protection, presenting the first practical method for restricting data learnability at LLM scale without requiring access to or modification of training pipelines.
Abstract: Large language models (LLMs) are increasingly trained on massive, heterogeneous text corpora, raising serious concerns about the unauthorised use of proprietary or personal data during model training. In this work, we address the problem of data protection against unwanted model learning in a realistic black-box setting. We propose Disclaimer Injection, a novel data-level defence that renders text unlearnable to LLMs. Rather than relying on model-side controls or explicit data removal, our approach exploits the models’ own alignment mechanisms: by injecting carefully designed alignment-triggering disclaimers to prevent effective learning. Through layer-wise analysis, we find that fine-tuning on such protected data induces persistent activation of alignment-related layers, causing alignment constraints to override task learning even on common inputs. Consequently, models trained on such data exhibit substantial and systematic performance degradation compared to standard fine-tuning. Our results identify alignment behaviour as a previously unexplored lever for data protection and, to our knowledge, present the first practical method for restricting data learnability at LLM scale without requiring access to or modification of the training pipeline.
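A toy sketch of what data-level injection might look like; the disclaimer wording and injection frequency here are placeholders, whereas the paper's disclaimers are carefully designed to trigger alignment behaviour.

```python
# Interleave an alignment-triggering disclaimer with the protected text
# so that fine-tuning on it keeps activating alignment-related layers.

DISCLAIMER = ("[NOTICE] The following material is protected and must not "
              "be used for model training.")   # placeholder wording

def protect(text, every_n_paragraphs=2):
    out = []
    for i, para in enumerate(text.split("\n\n")):
        if i % every_n_paragraphs == 0:
            out.append(DISCLAIMER)
        out.append(para)
    return "\n\n".join(out)
```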
[20] Tigrinya Number Verbalization: Rules, Algorithm, and Implementation
Fitsum Gaim, Issayas Tesfamariam
Main category: cs.CL
TL;DR: Systematic formalization of Tigrinya cardinal and ordinal number verbalization with open-source implementation, revealing LLM limitations in handling this language.
Details
Motivation: Addressing the gap in computational resources for Tigrinya language, particularly for number verbalization, which is important for language modeling, speech synthesis, and accessibility applications for Tigrinya-speaking communities.
Method: Documenting canonical rules for Tigrinya number expression (conjunction system, scale words, special cases), developing formal algorithm for number-to-word conversion, and creating open-source implementation.
Result: Evaluation of frontier large language models shows significant gaps in their ability to accurately verbalize Tigrinya numbers, demonstrating the need for explicit rule documentation and specialized resources.
Conclusion: This work provides essential computational resources for Tigrinya language processing and highlights the limitations of current LLMs for low-resource languages, emphasizing the importance of explicit rule documentation for language preservation and accessibility.
Abstract: We present a systematic formalization of Tigrinya cardinal and ordinal number verbalization, addressing a gap in computational resources for the language. This work documents the canonical rules governing the expression of numerical values in spoken Tigrinya, including the conjunction system, scale words, and special cases for dates, times, and currency. We provide a formal algorithm for number-to-word conversion and release an open-source implementation. Evaluation of frontier large language models (LLMs) reveals significant gaps in their ability to accurately verbalize Tigrinya numbers, underscoring the need for explicit rule documentation. This work serves language modeling, speech synthesis, and accessibility applications targeting Tigrinya-speaking communities.
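Since the summary does not reproduce the paper's algorithm or Tigrinya word forms, the sketch below shows only the generic recursive structure of rule-based number-to-word conversion; the `UNITS`/`TENS`/`SCALES` tables are English placeholders that a real implementation would replace with the Tigrinya lexicon, conjunction rules, and scale words.

```python
# Generic recursive number-to-words sketch. The tables are English
# placeholders standing in for a real Tigrinya lexicon; the paper's
# conjunction system and special cases are not reproduced here.

UNITS = ["zero", "one", "two", "three", "four", "five", "six", "seven",
         "eight", "nine", "ten", "eleven", "twelve", "thirteen",
         "fourteen", "fifteen", "sixteen", "seventeen", "eighteen",
         "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]
SCALES = [(10**6, "million"), (10**3, "thousand"), (100, "hundred")]

def verbalize(n: int) -> str:
    if n < 20:
        return UNITS[n]
    if n < 100:
        tens, rest = divmod(n, 10)
        return TENS[tens] + (f" {UNITS[rest]}" if rest else "")
    for value, word in SCALES:
        if n >= value:
            head, rest = divmod(n, value)
            out = f"{verbalize(head)} {word}"
            return out + (f" {verbalize(rest)}" if rest else "")
    raise ValueError("number out of supported range")

print(verbalize(2024))  # "two thousand twenty four"
```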
[21] Implicit Graph, Explicit Retrieval: Towards Efficient and Interpretable Long-horizon Memory for Large Language Models
Xin Zhang, Kailai Yang, Hao Li, Chenyue Li, Qiyu Wei, Sophia Ananiadou
Main category: cs.CL
TL;DR: LatentGraphMem combines implicit graph memory with explicit subgraph retrieval for efficient, stable long-context reasoning while maintaining interpretability.
Details
Motivation: LLMs need to handle sparse evidence across long contexts, but existing memory systems have trade-offs: explicit structured memories are interpretable but brittle under long-context overload, while latent memories are efficient but hard to inspect.Method: LatentGraphMem stores graph-structured memory in latent space for stability/efficiency, with task-specific subgraph retrieval interface that returns compact symbolic subgraphs under fixed budget. During training, explicit graph view interfaces with frozen reasoner for QA supervision; at inference, retrieval is in latent space with only retrieved subgraph externalized.
Result: Experiments on long-horizon benchmarks across multiple model scales show LatentGraphMem consistently outperforms explicit-graph and latent-memory baselines, enables parameter-efficient adaptation, and scales to larger reasoners without large symbolic artifacts.
Conclusion: LatentGraphMem successfully bridges the gap between interpretable explicit memories and efficient latent memories, providing a practical solution for long-horizon reasoning tasks.
Abstract: Long-horizon applications increasingly require large language models (LLMs) to answer queries when relevant evidence is sparse and dispersed across very long contexts. Existing memory systems largely follow two paradigms: explicit structured memories offer interpretability but often become brittle under long-context overload, while latent memory mechanisms are efficient and stable yet difficult to inspect. We propose LatentGraphMem, a memory framework that combines implicit graph memory with explicit subgraph retrieval. LatentGraphMem stores a graph-structured memory in latent space for stability and efficiency, and exposes a task-specific subgraph retrieval interface that returns a compact symbolic subgraph under a fixed budget for downstream reasoning and human inspection. During training, an explicit graph view is materialized to interface with a frozen reasoner for question-answering supervision. At inference time, retrieval is performed in latent space and only the retrieved subgraph is externalized. Experiments on long-horizon benchmarks across multiple model scales show that LatentGraphMem consistently outperforms representative explicit-graph and latent-memory baselines, while enabling parameter-efficient adaptation and flexible scaling to larger reasoners without introducing large symbolic artifacts.
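As a rough illustration of the retrieval interface, the sketch below scores latent edge embeddings against a query embedding and externalizes only a fixed-budget set of triples. The toy embeddings, `retrieve_subgraph` helper, and cosine scoring are assumptions; the paper's latent encoding and training procedure are not shown.

```python
import numpy as np

# Budgeted subgraph retrieval sketch: edges of a latent graph are scored
# against a query embedding and only the top-B triples are externalized
# for the reasoner. All embeddings here are random toys.

rng = np.random.default_rng(0)
d = 64
triples = [("alice", "works_at", "acme"),
           ("acme", "located_in", "berlin"),
           ("alice", "owns", "bike")]
edge_emb = rng.normal(size=(len(triples), d))   # latent edge embeddings
query_emb = rng.normal(size=d)                  # encoded query

def retrieve_subgraph(edge_emb, query_emb, triples, budget=2):
    """Return the `budget` highest-scoring triples (cosine similarity)."""
    sims = edge_emb @ query_emb
    sims = sims / (np.linalg.norm(edge_emb, axis=1) * np.linalg.norm(query_emb))
    top = np.argsort(-sims)[:budget]
    return [triples[i] for i in top]

print(retrieve_subgraph(edge_emb, query_emb, triples, budget=2))
```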
[22] PCoA: A New Benchmark for Medical Aspect-Based Summarization With Phrase-Level Context Attribution
Bohao Chu, Sameh Frihat, Tabea M. G. Pakull, Hendrik Damm, Meijie Li, Ula Muhabbek, Georg Lodde, Norbert Fuhr
Main category: cs.CL
TL;DR: PCoA is a benchmark for medical aspect-based summarization with phrase-level context attribution, featuring expert annotations and a decoupled evaluation framework for assessing summary quality, citations, and contributory phrases.
Details
Motivation: Verifying system-generated summaries is challenging because it requires precise attribution to source context, which is especially crucial in high-stakes medical domains where accuracy and accountability are critical.
Method: Introduces PCoA benchmark with expert annotations for medical aspect-based summarization, aligning each aspect-based summary with supporting contextual sentences and contributory phrases. Proposes a fine-grained, decoupled evaluation framework that independently assesses summary quality, citations, and contributory phrases.
Result: PCoA provides a reliable benchmark for evaluating system-generated summaries with phrase-level context attribution. Experiments show that explicitly identifying relevant sentences and contributory phrases before summarization can improve overall quality.
Conclusion: PCoA addresses the challenge of verifying system-generated medical summaries through precise context attribution, offering a valuable benchmark and demonstrating that explicit identification of supporting evidence improves summary quality.
Abstract: Verifying system-generated summaries remains challenging, as effective verification requires precise attribution to the source context, which is especially crucial in high-stakes medical domains. To address this challenge, we introduce PCoA, an expert-annotated benchmark for medical aspect-based summarization with phrase-level context attribution. PCoA aligns each aspect-based summary with its supporting contextual sentences and contributory phrases within them. We further propose a fine-grained, decoupled evaluation framework that independently assesses the quality of generated summaries, citations, and contributory phrases. Through extensive experiments, we validate the quality and consistency of the PCoA dataset and benchmark several large language models on the proposed task. Experimental results demonstrate that PCoA provides a reliable benchmark for evaluating system-generated summaries with phrase-level context attribution. Furthermore, comparative experiments show that explicitly identifying relevant sentences and contributory phrases before summarization can improve overall quality. The data and code are available at https://github.com/chubohao/PCoA.
[23] Training-Free Adaptation of New-Generation LLMs using Legacy Clinical Models
Sasha Ronaghi, Chloe Stanwyck, Asad Aali, Amir Ronaghi, Miguel Fuentes, Tina Hernandez-Boussard, Emily Alsentzer
Main category: cs.CL
TL;DR: CAPT enables training-free adaptation of new general-domain LLMs using existing clinical models via contrastive decoding, outperforming individual models and state-of-the-art ensembling methods on clinical tasks.
Details
Motivation: Adapting language models to clinical domains typically requires costly retraining for each new model generation, creating a barrier to leveraging advances in general-domain models for clinical applications.
Method: Cross-Architecture Proxy Tuning (CAPT) uses model ensembling with contrastive decoding to selectively inject clinically relevant signals from existing clinical models into state-of-the-art general-domain models, supporting models with disjoint vocabularies without requiring retraining.
Result: On six clinical classification and text-generation tasks, CAPT consistently outperforms both individual models and state-of-the-art ensembling approaches (average +17.6% over UniTE, +41.4% over proxy tuning). Token-level analysis shows CAPT amplifies clinically actionable language, reduces context errors, and increases clinical specificity.
Conclusion: CAPT provides an effective training-free approach for adapting advanced general-domain models to clinical applications by leveraging existing clinical models, enabling rapid deployment of improved clinical NLP systems without costly retraining.
Abstract: Adapting language models to the clinical domain through continued pretraining and fine-tuning requires costly retraining for each new model generation. We propose Cross-Architecture Proxy Tuning (CAPT), a model-ensembling approach that enables training-free adaptation of state-of-the-art general-domain models using existing clinical models. CAPT supports models with disjoint vocabularies, leveraging contrastive decoding to selectively inject clinically relevant signals while preserving the general-domain model’s reasoning and fluency. On six clinical classification and text-generation tasks, CAPT with a new-generation general-domain model and an older-generation clinical model consistently outperforms both models individually and state-of-the-art ensembling approaches (average +17.6% over UniTE, +41.4% over proxy tuning across tasks). Through token-level analysis and physician case studies, we demonstrate that CAPT amplifies clinically actionable language, reduces context errors, and increases clinical specificity.
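A toy sketch of the contrastive-decoding idea follows, under the simplifying assumption of a shared vocabulary (which CAPT itself does not require: handling disjoint vocabularies is part of its contribution). The proxy-tuning-style combination rule and `alpha` are assumptions, not the paper's exact formula.

```python
import numpy as np

# Toy contrastive decoding under a *shared* vocabulary -- a simplification:
# CAPT supports disjoint vocabularies, which this sketch does not attempt.
# The combination rule mirrors generic proxy tuning, not CAPT's method.

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def capt_style_logits(general, clinical_expert, clinical_base, alpha=1.0):
    """Shift the general model's logits by the clinical 'delta'."""
    return general + alpha * (clinical_expert - clinical_base)

vocab = ["fever", "tumor", "hello", "sepsis"]
general = np.array([1.0, 0.5, 2.0, 0.2])
clinical_expert = np.array([2.0, 1.5, 0.1, 2.5])
clinical_base = np.array([1.0, 1.0, 0.5, 0.8])

p = softmax(capt_style_logits(general, clinical_expert, clinical_base))
print(dict(zip(vocab, p.round(3))))  # mass shifts toward clinical terms
```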
[24] The Critical Role of Aspects in Measuring Document Similarity
Eftekhar Hossain, Tarnika Hazra, Ahatesham Bhuiyan, Santu Karmaker
Main category: cs.CL
TL;DR: ASPECTSIM is a framework for aspect-conditioned document similarity that outperforms holistic approaches, with GPT-4o achieving 80% higher human agreement than holistic methods, though smaller LLMs struggle without refinement.
Details
Motivation: Traditional document similarity measures use holistic approaches that don't account for specific aspects, limiting interpretability and alignment with human judgments. The paper aims to create a more interpretable framework that conditions similarity on explicitly specified aspects.
Method: ASPECTSIM framework conditions document similarity on explicitly specified aspects. The authors created a benchmark of 26K aspect-document pairs and tested two approaches: 1) Direct GPT-4o prompting for aspect-conditioned similarity scores, and 2) Testing 16 smaller open-source LLMs and 9 embedding models with a two-stage refinement process to improve performance.
Result: GPT-4o with ASPECTSIM achieved ≈80% higher human-machine agreement than holistic similarity without explicit aspects. Smaller open-source LLMs initially had poor performance (20-30% agreement), but a two-stage refinement improved agreement by ≈140%. However, they still lag significantly behind GPT-4o’s performance.
Conclusion: Explicitly accounting for aspects is crucial for measuring document similarity, and current standard practices need revision. While smaller LLMs can be improved with refinement techniques, they still fall short of large proprietary models like GPT-4o in capturing aspect-conditioned similarity.
Abstract: We introduce ASPECTSIM, a simple and interpretable framework that conditions document similarity on an explicitly specified aspect, departing from the traditional holistic approach to measuring document similarity. Experimenting with a newly constructed benchmark of 26K aspect-document pairs, we found that ASPECTSIM, when implemented with direct GPT-4o prompting, achieves substantially higher human-machine agreement ($\approx$80% higher) than holistic similarity without explicit aspects. These findings underscore the importance of explicitly accounting for aspects when measuring document similarity and highlight the need to revise standard practice. Next, we conducted a large-scale meta-evaluation using 16 smaller open-source LLMs and 9 embedding models with a focus on making ASPECTSIM accessible and reproducible. While directly prompting LLMs to produce ASPECTSIM scores turned out to be ineffective (20-30% human-machine agreement), a simple two-stage refinement improved their agreement by $\approx$140%. Nevertheless, agreement remains well below that of GPT-4o-based models, indicating that smaller open-source LLMs still lag behind large proprietary models in capturing aspect-conditioned similarity.
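For intuition, a minimal sketch of aspect-conditioned similarity prompting is shown below; the prompt wording, 0-100 scale, and `ask_llm` client are invented placeholders rather than the paper's template.

```python
# Minimal sketch of aspect-conditioned similarity prompting. The prompt
# wording, scale, and `ask_llm` callable are hypothetical placeholders.

PROMPT = """Rate the similarity of the two documents *with respect to the
aspect "{aspect}"* on a scale from 0 (unrelated) to 100 (identical).
Answer with a single integer.

Document A:
{doc_a}

Document B:
{doc_b}
"""

def aspect_sim(doc_a: str, doc_b: str, aspect: str, ask_llm) -> int:
    """`ask_llm` is any callable that sends a prompt and returns text."""
    reply = ask_llm(PROMPT.format(aspect=aspect, doc_a=doc_a, doc_b=doc_b))
    return int(reply.strip())

# Usage with a stub in place of a real LLM client:
print(aspect_sim("Paper on CNNs.", "Paper on RNNs.", "methodology",
                 ask_llm=lambda p: "35"))
```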
[25] Grading Scale Impact on LLM-as-a-Judge: Human-LLM Alignment Is Highest on 0-5 Grading Scale
Weiyue Li, Minda Zhao, Weixuan Dong, Jiahui Cai, Yuze Wei, Michael Pocress, Yi Li, Wanyan Yuan, Xiaoyue Wang, Ruoyu Hou, Kaiyuan Lou, Wenqi Zeng, Yutong Yang, Yilun Du, Mengyu Wang
Main category: cs.CL
TL;DR: LLM judges show inconsistent scoring across different grading scales, with 0-5 scale providing best human-LLM alignment, but scale choice significantly impacts agreement even when reliability appears high.
Details
Motivation: While LLMs are increasingly used as automated evaluators, their scoring consistency across different prompt variations has been studied, but the effect of the grading scale itself remains underexplored. The paper aims to understand how different grading scales affect LLM judge reliability and human-LLM alignment.
Method: The study compares human and LLM raters across three different grading scales on six benchmarks covering objective, open-ended subjective, and mixed tasks. They use intraclass correlation coefficients (ICC) to measure absolute agreement and analyze human-LLM alignment across different scales.
Result: LLM judgments are not perfectly consistent across scales on subjective benchmarks. The choice of grading scale substantially shifts human-LLM agreement, even when within-group panel reliability is high. The 0-5 scale yields the strongest human-LLM alignment across tasks. Pooled reliability can mask benchmark heterogeneity, and systematic subgroup differences in alignment exist across gender groups.
Conclusion: Grading scale design is crucial for LLM-as-a-judge protocols, as scale choice significantly impacts human-LLM agreement. The 0-5 scale performs best overall, but sub-level diagnostics are essential to uncover hidden heterogeneity and subgroup differences that pooled reliability metrics might mask.
Abstract: Large language models (LLMs) are increasingly used as automated evaluators, yet prior works demonstrate that these LLM judges often lack consistency in scoring when the prompt is altered. However, the effect of the grading scale itself remains underexplored. We study the LLM-as-a-judge problem by comparing two kinds of raters: humans and LLMs. We collect ratings from both groups on three scales and across six benchmarks that include objective, open-ended subjective, and mixed tasks. Using intraclass correlation coefficients (ICC) to measure absolute agreement, we find that LLM judgments are not perfectly consistent across scales on subjective benchmarks, and that the choice of scale substantially shifts human-LLM agreement, even when within-group panel reliability is high. Aggregated over tasks, the grading scale of 0-5 yields the strongest human-LLM alignment. We further demonstrate that pooled reliability can mask benchmark heterogeneity and reveal systematic subgroup differences in alignment across gender groups, underscoring the importance of scale design and sub-level diagnostics as essential components of LLM-as-a-judge protocols.
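For readers unfamiliar with the agreement statistic, the sketch below computes a standard two-way, absolute-agreement, single-rater ICC(2,1) from a subjects-by-raters matrix; the paper's exact ICC variant is not stated in the summary, so this is illustrative.

```python
import numpy as np

# Two-way, absolute-agreement, single-rater ICC(2,1) from a
# subjects-by-raters matrix (standard Shrout-Fleiss formulation).

def icc_2_1(X: np.ndarray) -> float:
    """X has shape (n_subjects, k_raters)."""
    n, k = X.shape
    grand = X.mean()
    msr = k * ((X.mean(axis=1) - grand) ** 2).sum() / (n - 1)  # rows
    msc = n * ((X.mean(axis=0) - grand) ** 2).sum() / (k - 1)  # columns
    sse = ((X - grand) ** 2).sum() - (n - 1) * msr - (k - 1) * msc
    mse = sse / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

ratings = np.array([[4, 5, 4],   # each row: one item; each column: one rater
                    [2, 2, 3],
                    [5, 5, 5],
                    [1, 2, 1]], dtype=float)
print(round(icc_2_1(ratings), 3))
```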
[26] Enhancing Linguistic Competence of Language Models through Pre-training with Language Learning Tasks
Atsuki Yamaguchi, Maggie Mi, Nikolaos Aletras
Main category: cs.CL
TL;DR: L2T is a pre-training framework that combines standard next-token prediction with explicit language learning tasks to improve linguistic competence while maintaining reasoning capabilities.
Details
Motivation: Standard language model pre-training on raw text focuses on world knowledge and reasoning but doesn't explicitly optimize for linguistic competence. There's a gap between how LMs learn and how humans acquire language through structured linguistic stimulation.
Method: L2T transforms raw text into structured input-output pairs that provide explicit linguistic stimulation, inspired by human language acquisition. It integrates Language Learning Tasks alongside standard next-token prediction during pre-training.
Result: Pre-training LMs on a mixture of raw text and L2T data improves overall performance on linguistic competence benchmarks, accelerates acquisition of linguistic skills, while maintaining competitive performance on general reasoning tasks.
Conclusion: Explicit linguistic stimulation through structured learning tasks during pre-training can enhance language models’ linguistic competence without sacrificing their reasoning capabilities, bridging a key gap in current LM training approaches.
Abstract: Language models (LMs) are pre-trained on raw text datasets to generate text sequences token-by-token. While this approach facilitates the learning of world knowledge and reasoning, it does not explicitly optimize for linguistic competence. To bridge this gap, we propose L2T, a pre-training framework integrating Language Learning Tasks alongside standard next-token prediction. Inspired by human language acquisition, L2T transforms raw text into structured input-output pairs to provide explicit linguistic stimulation. Pre-training LMs on a mixture of raw text and L2T data not only improves overall performance on linguistic competence benchmarks but accelerates its acquisition, while maintaining competitive performance on general reasoning tasks.
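To make the data format concrete, here is a hypothetical example of converting raw text into a language-learning input-output pair; word-order restoration is an invented stand-in, since the summary does not list L2T's actual task inventory.

```python
import random

# Hypothetical example of turning raw text into a language-learning
# input-output pair. Word-order restoration is an invented illustration;
# L2T's real task set is not given in the summary.

def make_reordering_pair(sentence: str, seed: int = 0) -> dict:
    words = sentence.split()
    shuffled = words[:]
    random.Random(seed).shuffle(shuffled)
    return {
        "input": "Reorder into a grammatical sentence: " + " ".join(shuffled),
        "output": sentence,
    }

pair = make_reordering_pair("the cat sat on the mat")
print(pair["input"])
print(pair["output"])
```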
[27] Prompting Underestimates LLM Capability for Time Series Classification
Dan Schumacher, Erfan Nourbakhsh, Rocky Slavin, Anthony Rios
Main category: cs.CL
TL;DR: LLMs have strong internal time series understanding that prompt-based evaluations fail to reveal; linear probes show they match specialized models.
Details
Motivation: Previous prompt-based evaluations suggested LLMs perform poorly on time series classification, raising doubts about whether they encode meaningful temporal structure. The authors want to investigate if this reflects limitations of evaluation methods rather than the models' actual capabilities.
Method: Direct comparison of prompt-based generation with linear probes over the same internal representations. Used layer-wise analyses to track emergence of time series information. Tested with visual and multimodal inputs to examine information amplification.
Result: Zero-shot prompting performs near chance (F1 0.15-0.26), but linear probes dramatically improve performance to F1 0.61-0.67, often matching or exceeding specialized time series models. Class-discriminative time series information emerges in early transformer layers and is amplified by visual/multimodal inputs.
Conclusion: There’s a systematic mismatch between what LLMs internally represent and what prompt-based evaluation reveals, leading current evaluations to underestimate their time series understanding. The poor prompt performance reflects evaluation limitations, not the models’ representational capacity.
Abstract: Prompt-based evaluations suggest that large language models (LLMs) perform poorly on time series classification, raising doubts about whether they encode meaningful temporal structure. We show that this conclusion reflects limitations of prompt-based generation rather than the model’s representational capacity by directly comparing prompt outputs with linear probes over the same internal representations. While zero-shot prompting performs near chance, linear probes improve average F1 from 0.15-0.26 to 0.61-0.67, often matching or exceeding specialized time series models. Layer-wise analyses further show that class-discriminative time series information emerges in early transformer layers and is amplified by visual and multimodal inputs. Together, these results demonstrate a systematic mismatch between what LLMs internally represent and what prompt-based evaluation reveals, leading current evaluations to underestimate their time series understanding.
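The paper's core contrast, prompting versus a linear probe over the same hidden states, reduces to fitting a simple classifier on frozen activations. A sketch follows, with synthetic arrays standing in for per-series LLM hidden states.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Linear-probe sketch: fit a logistic regression on frozen hidden states.
# The arrays are synthetic stand-ins; in practice `H` would hold one LLM
# hidden state per time series (e.g., a last-token activation at a chosen
# layer) and `y` the class labels.

rng = np.random.default_rng(0)
H = rng.normal(size=(500, 256))              # (n_series, hidden_dim)
y = (H[:, :8].sum(axis=1) > 0).astype(int)   # toy labels tied to features

H_tr, H_te, y_tr, y_te = train_test_split(H, y, test_size=0.3, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(H_tr, y_tr)
print("probe F1:", round(f1_score(y_te, probe.predict(H_te)), 3))
```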
[28] EpiQAL: Benchmarking Large Language Models in Epidemiological Question Answering for Enhanced Alignment and Reasoning
Mingyang Wei, Dehai Min, Zewen Liu, Yuzhang Xie, Guanchen Wu, Carl Yang, Max S. Y. Lau, Qi He, Lu Cheng, Wei Jin
Main category: cs.CL
TL;DR: EpiQAL is a new benchmark for evaluating epidemiological question answering that tests evidence-grounded reasoning across three subsets: factual recall, multi-step inference, and conclusion reconstruction.
Details
Motivation: Existing medical QA benchmarks focus on clinical knowledge and patient-level reasoning, but lack systematic evaluation of evidence-grounded epidemiological inference needed for population-level disease burden, transmission dynamics, and intervention effects.
Method: Created EpiQAL benchmark with three subsets from open-access literature: text-grounded factual recall, multi-step inference linking evidence with epidemiological principles, and conclusion reconstruction with Discussion sections withheld. Used expert-designed taxonomy, multi-model verification, and retrieval-based difficulty control.
Result: Experiments on ten open models show limited performance on epidemiological reasoning, especially on multi-step inference. Model rankings shift across subsets, scale alone doesn’t predict success, and Chain-of-Thought helps multi-step inference but has mixed results elsewhere.
Conclusion: EpiQAL provides fine-grained diagnostic signals for evidence grounding, inferential reasoning, and conclusion reconstruction, revealing current LLMs’ limitations in epidemiological reasoning and offering a benchmark for future improvements.
Abstract: Reliable epidemiological reasoning requires synthesizing study evidence to infer disease burden, transmission dynamics, and intervention effects at the population level. Existing medical question answering benchmarks primarily emphasize clinical knowledge or patient-level reasoning, yet few systematically evaluate evidence-grounded epidemiological inference. We present EpiQAL, the first diagnostic benchmark for epidemiological question answering across diverse diseases, comprising three subsets built from open-access literature. The subsets respectively evaluate text-grounded factual recall, multi-step inference linking document evidence with epidemiological principles, and conclusion reconstruction with the Discussion section withheld. Construction combines expert-designed taxonomy guidance, multi-model verification, and retrieval-based difficulty control. Experiments on ten open models reveal that current LLMs show limited performance on epidemiological reasoning, with multi-step inference posing the greatest challenge. Model rankings shift across subsets, and scale alone does not predict success. Chain-of-Thought prompting benefits multi-step inference but yields mixed results elsewhere. EpiQAL provides fine-grained diagnostic signals for evidence grounding, inferential reasoning, and conclusion reconstruction.
[29] SegNSP: Revisiting Next Sentence Prediction for Linear Text Segmentation
José Isidro, Filipe Cunha, Purificação Silvano, Alípio Jorge, Nuno Guimarães, Sérgio Nunes, Ricardo Campos
Main category: cs.CL
TL;DR: SegNSP frames linear text segmentation as a next sentence prediction task, achieving competitive results on two datasets without requiring explicit topic labels.
Details
Motivation: Linear text segmentation is challenging due to topic boundary complexity, discourse variability, and the need to balance local coherence with global context, hindering downstream NLP applications like summarization and information retrieval.
Method: SegNSP uses a label-agnostic NSP approach that predicts whether the next sentence continues the current topic, enhanced with segmentation-aware loss and harder negative sampling to capture discourse continuity without task-specific supervision.
Result: On CitiLink-Minutes, SegNSP achieves B-F1 of 0.79 (close to human annotations), and on WikiSection achieves B-F1 of 0.65, outperforming TopSeg baseline by 0.17 points.
Conclusion: Modeling sentence-to-sentence continuity through NSP effectively improves segmentation quality, demonstrating competitive and robust performance for supporting downstream NLP applications.
Abstract: Linear text segmentation is a long-standing problem in natural language processing (NLP), focused on dividing continuous text into coherent and semantically meaningful units. Despite its importance, the task remains challenging due to the complexity of defining topic boundaries, the variability in discourse structure, and the need to balance local coherence with global context. These difficulties hinder downstream applications such as summarization, information retrieval, and question answering. In this work, we introduce SegNSP, framing linear text segmentation as a next sentence prediction (NSP) task. Although NSP has largely been abandoned in modern pre-training, its explicit modeling of sentence-to-sentence continuity makes it a natural fit for detecting topic boundaries. We propose a label-agnostic NSP approach, which predicts whether the next sentence continues the current topic without requiring explicit topic labels, and enhance it with a segmentation-aware loss combined with harder negative sampling to better capture discourse continuity. Unlike recent proposals that leverage NSP alongside auxiliary topic classification, our approach avoids task-specific supervision. We evaluate our model against established baselines on two datasets, CitiLink-Minutes, for which we establish the first segmentation benchmark, and WikiSection. On CitiLink-Minutes, SegNSP achieves a B-$F_1$ of 0.79, closely aligning with human-annotated topic transitions, while on WikiSection it attains a B-$F_1$ of 0.65, outperforming the strongest reproducible baseline, TopSeg, by 0.17 absolute points. These results demonstrate competitive and robust performance, highlighting the effectiveness of modeling sentence-to-sentence continuity for improving segmentation quality and supporting downstream NLP applications.
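At inference time, NSP-style segmentation reduces to thresholding a continuation score over adjacent sentence pairs. The sketch below illustrates that loop; `nsp_score` is a stub standing in for the trained model, and the 0.5 threshold is an assumption.

```python
# Inference-loop sketch for NSP-style segmentation: place a boundary
# wherever the continuation probability of the next sentence falls below
# a threshold. `nsp_score` is a stub standing in for the trained model.

def nsp_score(prev: str, nxt: str) -> float:
    """Stub: a real model would return P(nxt continues prev's topic)."""
    return 0.1 if "road" in nxt else 0.9

def segment(sentences: list[str], threshold: float = 0.5) -> list[list[str]]:
    segments, current = [], [sentences[0]]
    for prev, nxt in zip(sentences, sentences[1:]):
        if nsp_score(prev, nxt) < threshold:   # topic break detected
            segments.append(current)
            current = []
        current.append(nxt)
    segments.append(current)
    return segments

minutes = ["The council discussed the budget.",
           "Spending will rise 3% next year.",
           "Next, the road repair plan was presented."]
print(segment(minutes))  # two segments with the stub scorer
```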
[30] Self-Explaining Hate Speech Detection with Moral Rationales
Francielle Vargas, Jackson Trager, Diego Alves, Surendrabikram Thapa, Matteo Guida, Berk Atil, Daryna Dementieva, Andrew Smart, Ameeta Agrawal
Main category: cs.CL
TL;DR: SMRA is a self-explaining hate speech detection framework that uses moral rationales as direct supervision for attention alignment, improving performance and explanation faithfulness without bias trade-offs.
Details
Motivation: Current hate speech detection models rely on surface-level lexical features, making them vulnerable to spurious correlations and limiting robustness, cultural contextualization, and interpretability. There's a need for more contextualized and interpretable models.
Method: Proposes Supervised Moral Rationale Attention (SMRA), which aligns token-level attention with expert-annotated moral rationales based on Moral Foundations Theory. Unlike prior approaches, SMRA integrates moral rationale supervision directly into the training objective. Also introduces HateBRMoralXplain, a Brazilian Portuguese benchmark dataset with hate labels, moral categories, token-level moral rationales, and socio-political metadata.
Result: SMRA consistently improves performance across binary hate speech detection (+0.9 F1) and multi-label moral sentiment classification (+1.5 F1). Substantially enhances explanation faithfulness with IoU F1 (+7.4 pp) and Token F1 (+5.0 pp). Explanations become more concise, sufficiency improves (+2.3 pp), and fairness remains stable.
Conclusion: SMRA successfully produces inherently interpretable and contextualized hate speech detection models by incorporating moral rationales as direct supervision, achieving better performance and more faithful explanations without compromising fairness or creating bias trade-offs.
Abstract: Hate speech detection models rely on surface-level lexical features, increasing vulnerability to spurious correlations and limiting robustness, cultural contextualization, and interpretability. We propose Supervised Moral Rationale Attention (SMRA), the first self-explaining hate speech detection framework to incorporate moral rationales as direct supervision for attention alignment. Based on Moral Foundations Theory, SMRA aligns token-level attention with expert-annotated moral rationales, guiding models to attend to morally salient spans rather than spurious lexical patterns. Unlike prior rationale-supervised or post-hoc approaches, SMRA integrates moral rationale supervision directly into the training objective, producing inherently interpretable and contextualized explanations. To support our framework, we also introduce HateBRMoralXplain, a Brazilian Portuguese benchmark dataset annotated with hate labels, moral categories, token-level moral rationales, and socio-political metadata. Across binary hate speech detection and multi-label moral sentiment classification, SMRA consistently improves performance (e.g., +0.9 and +1.5 F1, respectively) while substantially enhancing explanation faithfulness, increasing IoU F1 (+7.4 pp) and Token F1 (+5.0 pp). Although explanations become more concise, sufficiency improves (+2.3 pp) and fairness remains stable, indicating more faithful rationales without performance or bias trade-offs.
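A plausible form of the rationale-supervised attention objective is sketched below: a KL penalty pulling the model's token-attention distribution toward a normalized expert-rationale mask, added to the task loss. The KL form and `lam` weight are assumptions; the paper's exact objective may differ.

```python
import torch
import torch.nn.functional as F

# Sketch of rationale-supervised attention alignment: penalize divergence
# between the model's token-attention distribution and a normalized
# expert-rationale mask. The KL form and `lam` weight are assumptions.

def smra_loss(task_loss, attn, rationale_mask, lam=1.0):
    """attn: (batch, seq) attention over tokens; rationale_mask: 0/1 spans."""
    target = rationale_mask / rationale_mask.sum(dim=-1, keepdim=True)
    align = F.kl_div(torch.log(attn + 1e-9), target, reduction="batchmean")
    return task_loss + lam * align

attn = torch.softmax(torch.randn(2, 6), dim=-1)
mask = torch.tensor([[0, 1, 1, 0, 0, 0],
                     [0, 0, 0, 1, 1, 1]], dtype=torch.float)
print(smra_loss(torch.tensor(0.7), attn, mask))
```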
[31] CALM: Culturally Self-Aware Language Models
Lingzhi Shen, Xiaohao Cai, Yunfei Long, Imran Razzak, Guanming Chen, Shoaib Jameel
Main category: cs.CL
TL;DR: CALM framework enables language models to develop cultural self-awareness by disentangling and structuring cultural concepts, allowing dynamic adaptation and self-correction.
Details
Motivation: Existing approaches treat culture as static background knowledge, overlooking its dynamic and evolving nature, which reduces reliability in tasks requiring genuine cultural sensitivity.
Method: CALM disentangles task semantics from explicit cultural concepts and latent cultural signals, shapes them into structured cultural clusters via contrastive learning, aligns them through cross-attention, integrates via Mixture-of-Experts mechanism, and enhances through self-prompted reflective learning.
Result: Extensive experiments on multiple cross-cultural benchmark datasets show CALM consistently outperforms state-of-the-art methods.
Conclusion: CALM successfully endows language models with cultural self-awareness, enabling continual adaptation and self-correction for improved cultural sensitivity.
Abstract: Cultural awareness in language models is the capacity to understand and adapt to diverse cultural contexts. However, most existing approaches treat culture as static background knowledge, overlooking its dynamic and evolving nature. This limitation reduces their reliability in downstream tasks that demand genuine cultural sensitivity. In this work, we introduce CALM, a novel framework designed to endow language models with cultural self-awareness. CALM disentangles task semantics from explicit cultural concepts and latent cultural signals, shaping them into structured cultural clusters through contrastive learning. These clusters are then aligned via cross-attention to establish fine-grained interactions among related cultural features and are adaptively integrated through a Mixture-of-Experts mechanism along culture-specific dimensions. The resulting unified representation is fused with the model’s original knowledge to construct a culturally grounded internal identity state, which is further enhanced through self-prompted reflective learning, enabling continual adaptation and self-correction. Extensive experiments conducted on multiple cross-cultural benchmark datasets demonstrate that CALM consistently outperforms state-of-the-art methods.
[32] Submodular Evaluation Subset Selection in Automatic Prompt Optimization
Jinming Nian, Zhiyuan Peng, Hongwei Shang, Dae Hoon Park, Yi Fang
Main category: cs.CL
TL;DR: SESS: Submodular evaluation subset selection method for prompt optimization that outperforms random/heuristic baselines by selecting better evaluation subsets.
Details
Motivation: Current prompt optimization relies on task performance measured on small, often randomly sampled evaluation subsets, but how to select that evaluation subset is usually treated as an implementation detail rather than a principled approach.
Method: Proposes SESS, a submodular evaluation subset selection method that frames selection as maximizing an objective set function, showing it’s monotone and submodular under mild conditions, enabling greedy selection with theoretical guarantees.
Result: Across GSM8K, MATH, and GPQA-Diamond benchmarks, submodularly selected evaluation subsets yield better optimized prompts than random or heuristic baselines.
Conclusion: Evaluation subset selection should be treated as a principled optimization problem rather than an implementation detail, and submodular selection methods can significantly improve prompt optimization outcomes.
Abstract: Automatic prompt optimization reduces manual prompt engineering, but relies on task performance measured on a small, often randomly sampled evaluation subset as its main source of feedback signal. Despite this, how to select that evaluation subset is usually treated as an implementation detail. We study evaluation subset selection for prompt optimization from a principled perspective and propose SESS, a submodular evaluation subset selection method. We frame selection as maximizing an objective set function and show that, under mild conditions, it is monotone and submodular, enabling greedy selection with theoretical guarantees. Across GSM8K, MATH, and GPQA-Diamond, submodularly selected evaluation subsets can yield better optimized prompts than random or heuristic baselines.
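For illustration, the sketch below runs greedy selection on a facility-location objective, a standard monotone submodular function used here as a stand-in for SESS's (unspecified) set function; greedy selection on such functions carries the classic (1 - 1/e) approximation guarantee.

```python
import numpy as np

# Greedy maximization of a facility-location objective, a standard
# monotone submodular function used as a stand-in for SESS's set function.

def facility_location(sim: np.ndarray, subset: list[int]) -> float:
    """sim[i, j]: nonnegative similarity between examples i and j."""
    if not subset:
        return 0.0
    return sim[:, subset].max(axis=1).sum()

def greedy_select(sim: np.ndarray, budget: int) -> list[int]:
    selected: list[int] = []
    for _ in range(budget):
        candidates = [j for j in range(len(sim)) if j not in selected]
        gains = [facility_location(sim, selected + [j]) for j in candidates]
        selected.append(candidates[int(np.argmax(gains))])
    return selected

rng = np.random.default_rng(0)
emb = rng.normal(size=(20, 8))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
sim = emb @ emb.T + 1.0   # shift cosine into [0, 2] to keep it nonnegative
print(greedy_select(sim, budget=3))
```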
[33] Beyond Perplexity: A Lightweight Benchmark for Knowledge Retention in Supervised Fine-Tuning
Soheil Zibakhsh Shabgahi, Pedram Aghazadeh, Farinaz Koushanfar
Main category: cs.CL
TL;DR: KR-Test is a new evaluation framework that distinguishes factual learning from stylistic mimicry during LLM fine-tuning by using contrastive examples to measure knowledge retention.
Details
Motivation: Current SFT monitoring using validation perplexity is insufficient because it conflates stylistic mimicry with genuine factual internalization, making it hard to distinguish whether models are actually learning domain knowledge or just learning linguistic patterns.
Method: KR-Test uses automatically generated contrastive examples to measure likelihood preferences for correct vs. incorrect continuations. It requires no instruction tuning or generative decoding, and includes a “blind vs. oracle” baseline analysis to validate framework integrity.
Result: The framework successfully distinguishes factual learning from stylistic mimicry and provides diagnostic capabilities for analyzing training dynamics (e.g., LoRA fine-tuning), exposing the dissociation between linguistic convergence and knowledge retention.
Conclusion: KR-Test enhances interpretability of fine-tuning dynamics by providing a lightweight, corpus-grounded evaluation framework that can monitor genuine knowledge acquisition separate from stylistic adaptation.
Abstract: Supervised Fine-Tuning (SFT) is a standard approach for injecting domain knowledge into Large Language Models (LLMs). However, relying on validation perplexity to monitor training is often insufficient, as it confounds stylistic mimicry with genuine factual internalization. To address this, we introduce the Knowledge Retention (KR) Test, a lightweight, corpus-grounded evaluation framework designed to distinguish factual learning from stylistic mimicry. KR-Test utilizes automatically generated contrastive examples to measure likelihood preferences for correct versus incorrect continuations, requiring no instruction tuning or generative decoding. We validate the framework’s integrity through a “blind vs. oracle” baseline analysis. Furthermore, we demonstrate the diagnostic capabilities of KR-Test by analyzing the training dynamics of Low-Rank Adaptation (LoRA). By exposing the fine-grained dissociation between linguistic convergence and knowledge retention, KR-Test enhances the interpretability of fine-tuning dynamics.
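The likelihood-preference test reduces to comparing sequence log-probabilities of correct and corrupted continuations. A sketch follows with a stub scorer; in practice `logprob` would sum token log-probabilities under the fine-tuned LM.

```python
# KR-style preference check sketch: a model "knows" a fact if it assigns
# higher likelihood to the correct continuation than to a corrupted one.
# `logprob` is a stub; a real scorer would sum token log-probabilities
# under the fine-tuned LM.

def prefers_correct(prompt: str, correct: str, wrong: str, logprob) -> bool:
    return logprob(prompt + correct) > logprob(prompt + wrong)

def retention_rate(pairs, logprob) -> float:
    hits = sum(prefers_correct(p, c, w, logprob) for p, c, w in pairs)
    return hits / len(pairs)

# Stub scorer: pretends the model prefers continuations mentioning "1989".
stub_logprob = lambda text: -10.0 + (5.0 if "1989" in text else 0.0)

pairs = [("The Berlin Wall fell in ", "1989.", "1975."),
         ("The Berlin Wall fell in ", "1989.", "2001.")]
print(retention_rate(pairs, stub_logprob))  # 1.0 with this stub
```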
[34] Reasoning Pattern Alignment Merging for Adaptive Reasoning
Zhaofeng Zhong, Wei Yuan, Tong Chen, Xiangyu Zhao, Quoc Viet Hung Nguyen, Hongzhi Yin
Main category: cs.CL
TL;DR: RPAM is a layer-wise model merging framework that combines Long-CoT and Short-CoT models to create an adaptive reasoner that reduces inference cost while maintaining performance, without requiring retraining or sophisticated prompting.
Details
Motivation: Large reasoning models generate lengthy reasoning paths for every query, causing unnecessary computation and latency. Existing speed-up approaches require expensive retraining or are sensitive to input/prompt formulation.
Method: RPAM merges a long chain-of-thought model with a short-CoT model using layer-wise merging based on feature alignment. It uses a pattern-labeled calibration set to assign appropriate reasoning patterns to queries, optimizes merging coefficients by aligning intermediate representations with the selected model, and uses contrastive loss to push away from non-selected model.
Result: Experiments on seven reasoning benchmarks show RPAM substantially reduces inference cost while maintaining strong performance.
Conclusion: Model merging via RPAM provides a lightweight alternative for efficient reasoning without training from scratch or requiring large-scale additional data, offering query-adaptive reasoning with reduced computational overhead.
Abstract: Recent large reasoning models (LRMs) have made substantial progress in complex reasoning tasks, yet they often generate lengthy reasoning paths for every query, incurring unnecessary computation and latency. Existing speed-up approaches typically rely on retraining the model or designing sophisticated prompting, which are either prohibitively expensive or highly sensitive to the input and prompt formulation. In this work, we study model merging as a lightweight alternative for efficient reasoning: by combining a long chain-of-thought (Long-CoT) reasoning model with a Short-CoT instruction model, we obtain an adaptive reasoner without training from scratch or requiring large-scale additional data. Building on this idea, we propose Reasoning Pattern Alignment Merging (RPAM), a layer-wise model merging framework based on feature alignment to facilitate query-adaptive reasoning. RPAM first constructs a small pattern-labeled calibration set that assigns each query an appropriate reasoning pattern. It then optimizes layer-wise merging coefficients by aligning the merged model’s intermediate representations with those of the selected model, while a contrastive objective explicitly pushes them away from the non-selected model. Experiments on seven widely used reasoning benchmarks show that RPAM substantially reduces inference cost while maintaining strong performance. Upon article acceptance, we will provide open-source code to reproduce experiments for RPAM.
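The merging step itself is simple: per-layer interpolation of matched parameters. The sketch below shows that core with hand-set coefficients; learning them via feature alignment and the contrastive objective (RPAM's actual contribution) is omitted.

```python
import numpy as np

# Layer-wise merging sketch: interpolate matched parameters of a Long-CoT
# and a Short-CoT model with one coefficient per layer. The coefficients
# are hand-set here; RPAM learns them via feature alignment.

def merge_state_dicts(long_sd, short_sd, alphas):
    """alphas maps layer name -> weight on the Long-CoT parameters."""
    return {name: alphas[name] * long_sd[name]
                  + (1.0 - alphas[name]) * short_sd[name]
            for name in long_sd}

rng = np.random.default_rng(0)
long_sd = {"layer0.w": rng.normal(size=(4, 4)),
           "layer1.w": rng.normal(size=(4, 4))}
short_sd = {k: rng.normal(size=v.shape) for k, v in long_sd.items()}
alphas = {"layer0.w": 0.8, "layer1.w": 0.2}  # illustrative values only

merged = merge_state_dicts(long_sd, short_sd, alphas)
print({k: v.shape for k, v in merged.items()})
```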
[35] IntroLM: Introspective Language Models via Prefilling-Time Self-Evaluation
Hossein Hosseini Kasnavieh, Gholamreza Haffari, Chris Leckie, Adel N. Toosi
Main category: cs.CL
TL;DR: IntroLM enables LLMs to self-assess output quality during prefilling using introspective tokens with token-conditional LoRA, avoiding external classifiers and improving routing efficiency.
Details
Motivation: Existing methods for predicting LLM output quality rely on external classifiers (BERT-based) with limitations: limited context windows, constrained representational capacity, and additional computational overhead.
Method: Introduces IntroLM with introspective tokens during prefilling phase. Uses token-conditional LoRA that activates only for introspective tokens, allowing the model to predict output quality while preserving original backbone behavior without external evaluators.
Result: On QA benchmarks, IntroLM applied to Qwen3 8B achieves 90% ROC AUC for success prediction, outperforming DeBERTa classifier by 14%. In multi-model routing systems, reduces latency by up to 33% and large model usage by up to 50% at matched reliability.
Conclusion: IntroLM provides an efficient self-assessment mechanism for LLMs that improves quality prediction accuracy and enables better cost-performance tradeoffs in routing systems without external classifiers.
Abstract: A major challenge for the operation of large language models (LLMs) is how to predict whether a specific LLM will produce sufficiently high-quality output for a given query. Existing approaches rely on external classifiers, most commonly BERT-based models, which suffer from limited context windows, constrained representational capacity, and additional computational overhead. We propose IntroLM, a method that enables causal language models to predict their own output quality during the prefilling phase, without affecting generation, using introspective tokens. By introducing a token-conditional LoRA that activates only for the introspective token, the model learns to predict the output quality for a given query while preserving the original backbone behavior and avoiding external evaluators. On question answering benchmarks, IntroLM applied to Qwen3 8B achieves a ROC AUC of 90 percent for success prediction, outperforming a DeBERTa classifier by 14 percent. When integrated into multi-model routing systems, IntroLM achieves superior cost-performance tradeoffs, reducing latency by up to 33 percent and large-model usage by up to 50 percent at matched reliability.
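A sketch of the token-conditional LoRA forward pass follows: the low-rank update is applied only at positions flagged as the introspective token, so ordinary generation sees the frozen backbone unchanged. Shapes, scaling, and masking are illustrative assumptions.

```python
import numpy as np

# Token-conditional LoRA sketch: the low-rank update B @ A is applied only
# at positions flagged as the introspective token, leaving all other
# positions on the frozen backbone path. Shapes are illustrative.

d, r, seq = 16, 4, 5
rng = np.random.default_rng(0)
W = rng.normal(size=(d, d))          # frozen backbone weight
A = rng.normal(size=(r, d)) * 0.1    # LoRA down-projection
B = rng.normal(size=(d, r)) * 0.1    # LoRA up-projection

def forward(x, intro_mask, scale=1.0):
    """x: (seq, d); intro_mask: (seq,) with 1 at introspective positions."""
    base = x @ W.T
    delta = (x @ A.T) @ B.T * scale
    return base + intro_mask[:, None] * delta

x = rng.normal(size=(seq, d))
mask = np.array([0, 0, 0, 0, 1.0])   # last position is the introspective token
out = forward(x, mask)
print(np.allclose(out[:-1], x[:-1] @ W.T))  # True: other tokens untouched
```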
[36] Mem-Gallery: Benchmarking Multimodal Long-Term Conversational Memory for MLLM Agents
Yuanchen Bei, Tianxin Wei, Xuying Ning, Yanjun Zhao, Zhining Liu, Xiao Lin, Yada Zhu, Hendrik Hamann, Jingrui He, Hanghang Tong
Main category: cs.CL
TL;DR: Mem-Gallery is a new benchmark for evaluating multimodal long-term conversational memory in MLLM agents, addressing limitations of existing benchmarks that don’t assess how multimodal memory is preserved, organized, and evolved across long-term conversations.
Details
Motivation: Existing benchmarks either evaluate multi-session memory in text-only conversations or assess multimodal understanding within localized contexts, failing to evaluate how multimodal memory is preserved, organized, and evolved across long-term conversational trajectories.
Method: Introduce Mem-Gallery benchmark featuring high-quality multi-session conversations grounded in both visual and textual information, with long interaction horizons and rich multimodal dependencies. Propose a systematic evaluation framework assessing key memory capabilities along three functional dimensions: memory extraction and test-time adaptation, memory reasoning, and memory knowledge management.
Result: Extensive benchmarking across thirteen memory systems reveals several key findings: necessity of explicit multimodal information retention and memory organization, persistent limitations in memory reasoning and knowledge management, and efficiency bottleneck of current models.
Conclusion: Mem-Gallery addresses a critical gap in evaluating multimodal long-term conversational memory for MLLM agents, providing a comprehensive benchmark that reveals important insights about current memory system limitations and future research directions.
Abstract: Long-term memory is a critical capability for multimodal large language model (MLLM) agents, particularly in conversational settings where information accumulates and evolves over time. However, existing benchmarks either evaluate multi-session memory in text-only conversations or assess multimodal understanding within localized contexts, failing to evaluate how multimodal memory is preserved, organized, and evolved across long-term conversational trajectories. Thus, we introduce Mem-Gallery, a new benchmark for evaluating multimodal long-term conversational memory in MLLM agents. Mem-Gallery features high-quality multi-session conversations grounded in both visual and textual information, with long interaction horizons and rich multimodal dependencies. Building on this dataset, we propose a systematic evaluation framework that assesses key memory capabilities along three functional dimensions: memory extraction and test-time adaptation, memory reasoning, and memory knowledge management. Extensive benchmarking across thirteen memory systems reveals several key findings, highlighting the necessity of explicit multimodal information retention and memory organization, the persistent limitations in memory reasoning and knowledge management, as well as the efficiency bottleneck of current models.
[37] PALM-Bench: A Comprehensive Benchmark for Personalized Audio-Language Models
Yuwen Wang, Xinyuan Qian, Tian-Hao Zhang, Jiaran Gao, Yuchen Pan, Xin Wang, Zhou Pan, Chen Wei, Yiming Wang
Main category: cs.CL
TL;DR: The paper introduces Personalized Audio-Language Models (PALM) to address the limitation of current LALMs in handling personalized contexts, creates PALM-Bench benchmark, and shows existing methods are insufficient for robust personalized reasoning.
Details
Motivation: Current Large Audio-Language Models perform well on generic audio tasks but fail to support personalized question answering that requires understanding individual contexts and personal concepts, unlike human decision-making which is conditioned on personal context.
Method: Formalizes Personalized LALMs (PALM) task, creates PALM-Bench benchmark for structured evaluation across multi-speaker scenarios, and conducts extensive experiments on open-source LALMs using training-free prompting and supervised fine-tuning strategies.
Result: Existing methods show improvements but remain limited in modeling personalized knowledge and transferring it robustly across tasks, indicating the need for better approaches to personalized audio-language understanding.
Conclusion: There is a significant gap in personalized audio understanding that current LALMs cannot adequately address, requiring new methodologies for personalized concept recognition and reasoning within personal contexts.
Abstract: Large Audio-Language Models (LALMs) have demonstrated strong performance in audio understanding and generation. Yet, our extensive benchmarking reveals that their behavior is largely generic (e.g., summarizing spoken content) and fails to adequately support personalized question answering (e.g., summarizing what my best friend says). In contrast, humans condition their interpretation and decision-making on each individual’s personal context. To bridge this gap, we formalize the task of Personalized LALMs (PALM) for recognizing personal concepts and reasoning within personal context. Moreover, we create the first benchmark (PALM-Bench) to foster methodological advances in PALM and enable structured evaluation on several tasks across multi-speaker scenarios. Our extensive experiments on representative open-source LALMs show that existing training-free prompting and supervised fine-tuning strategies, while yielding improvements, remain limited in modeling personalized knowledge and transferring it across tasks robustly. Data and code will be released.
[38] Persona-aware and Explainable Bikeability Assessment: A Vision-Language Model Approach
Yilong Dai, Ziyi Wang, Chenguang Wang, Kexin Zhou, Yiheng Qian, Susu Xu, Xiang Yan
Main category: cs.CL
TL;DR: A persona-aware Vision-Language Model framework for bikeability assessment that incorporates cyclist typology, uses multi-granularity fine-tuning, and AI-enabled data augmentation to predict ratings and provide explainable factor attribution.
Details
Motivation: Existing bikeability assessment approaches fail to capture road environment complexity and subjective user perception heterogeneity, limiting their effectiveness for creating cyclist-friendly cities.
Method: Three novel components: (1) theory-grounded persona conditioning based on cyclist typology with chain-of-thought reasoning, (2) multi-granularity supervised fine-tuning combining expert reasoning with user ratings, and (3) AI-enabled data augmentation creating controlled paired data to isolate infrastructure impacts.
Result: Framework tested with 12,400 persona-conditioned assessments from 427 cyclists via panoramic image-based crowdsourcing system. Results show competitive bikeability rating prediction while enabling unique explainable factor attribution.
Conclusion: The proposed persona-aware VLM framework advances bikeability assessment by addressing limitations of existing approaches through personalized, explainable, and data-augmented methodology that better captures subjective user perceptions.
Abstract: Bikeability assessment is essential for advancing sustainable urban transportation and creating cyclist-friendly cities, and it requires incorporating users’ perceptions of safety and comfort. Yet existing perception-based bikeability assessment approaches face key limitations in capturing the complexity of road environments and adequately accounting for heterogeneity in subjective user perceptions. This paper proposes a persona-aware Vision-Language Model framework for bikeability assessment with three novel contributions: (i) theory-grounded persona conditioning based on established cyclist typology that generates persona-specific explanations via chain-of-thought reasoning; (ii) multi-granularity supervised fine-tuning that combines scarce expert-annotated reasoning with abundant user ratings for joint prediction and explainable assessment; and (iii) AI-enabled data augmentation that creates controlled paired data to isolate infrastructure variable impacts. To test and validate this framework, we developed a panoramic image-based crowdsourcing system and collected 12,400 persona-conditioned assessments from 427 cyclists. Experiment results show that the proposed framework offers competitive bikeability rating prediction while uniquely enabling explainable factor attribution.
[39] DeepSynth-Eval: Objectively Evaluating Information Consolidation in Deep Survey Writing
Hongzhi Zhang, Yuanze Hu, Tinghai Zhang, Jia Fu, Tao Wang, Junwei Jing, Zhaoxin Fan, Qi Wang, Ruiming Tang, Han Li, Guorui Zhou, Kun Gai
Main category: cs.CL
TL;DR: DeepSynth-Eval is a benchmark for evaluating LLM agents’ ability to synthesize information from massive context into coherent long-form reports, addressing the under-evaluated post-retrieval synthesis stage in deep research.
Details
Motivation: The post-retrieval synthesis stage in deep research, where agents must digest massive context and consolidate fragmented evidence into coherent reports, remains under-evaluated due to the subjectivity of open-ended writing. Current benchmarks focus on retrieval but lack objective evaluation of synthesis capabilities.
Method: Leverage high-quality survey papers as gold standards, reverse-engineer research requests, and construct “Oracle Contexts” from bibliographies to isolate synthesis from retrieval noise. Use fine-grained evaluation with General Checklists (factual coverage) and Constraint Checklists (structural organization) to transform subjective judgment into verifiable metrics.
Result: Experiments across 96 tasks show synthesizing information from hundreds of references remains challenging. Agentic plan-and-write workflows significantly outperform single-turn generation, effectively reducing hallucinations and improving adherence to complex structural constraints.
Conclusion: DeepSynth-Eval provides an objective benchmark for evaluating information consolidation capabilities in LLM agents, revealing that synthesis from massive context is still difficult but can be improved with structured workflows that separate planning from writing.
Abstract: The evolution of Large Language Models (LLMs) towards autonomous agents has catalyzed progress in Deep Research. While retrieval capabilities are well-benchmarked, the post-retrieval synthesis stage–where agents must digest massive amounts of context and consolidate fragmented evidence into coherent, long-form reports–remains under-evaluated due to the subjectivity of open-ended writing. To bridge this gap, we introduce DeepSynth-Eval, a benchmark designed to objectively evaluate information consolidation capabilities. We leverage high-quality survey papers as gold standards, reverse-engineering research requests and constructing “Oracle Contexts” from their bibliographies to isolate synthesis from retrieval noise. We propose a fine-grained evaluation protocol using General Checklists (for factual coverage) and Constraint Checklists (for structural organization), transforming subjective judgment into verifiable metrics. Experiments across 96 tasks reveal that synthesizing information from hundreds of references remains a significant challenge. Our results demonstrate that agentic plan-and-write workflows significantly outperform single-turn generation, effectively reducing hallucinations and improving adherence to complex structural constraints.
[40] Layer-Order Inversion: Rethinking Latent Multi-Hop Reasoning in Large Language Models
Xukai Liu, Ye Liu, Jipeng Zhang, Yanghai Zhang, Kai Zhang, Qi Liu
Main category: cs.CL
TL;DR: LLMs don’t compute multi-hop reasoning sequentially by hop; later-hop answers can emerge before bridge entities (layer-order inversion), explained by a probabilistic recall-and-extract framework.
Details
Motivation: To understand how LLMs internally compose multiple facts during multi-hop reasoning, challenging the prevailing hop-aligned circuit hypothesis that assumes sequential computation of bridge entities across layers.
Method: Systematic analyses on real-world multi-hop queries, proposing a probabilistic recall-and-extract framework where shallow MLP layers perform broad probabilistic recall and deeper attention layers do selective extraction.
Result: Found layer-order inversion phenomenon where later-hop answer entities become decodable earlier than bridge entities, contradicting hop-aligned assumption. Framework validated through probing analyses, reinterpretation of prior evidence, explanation of chain-of-thought gains, and diagnosis of multi-hop failures.
Conclusion: Multi-hop reasoning in LLMs follows a probabilistic recall-and-extract pattern rather than sequential hop-aligned computation, providing a mechanistic explanation for both successes and failures in multi-hop reasoning despite correct single-hop knowledge.
Abstract: Large language models (LLMs) perform well on multi-hop reasoning, yet how they internally compose multiple facts remains unclear. Recent work proposes \emph{hop-aligned circuit hypothesis}, suggesting that bridge entities are computed sequentially across layers before later-hop answers. Through systematic analyses on real-world multi-hop queries, we show that this hop-aligned assumption does not generalize: later-hop answer entities can become decodable earlier than bridge entities, a phenomenon we call \emph{layer-order inversion}, which strengthens with total hops. To explain this behavior, we propose a \emph{probabilistic recall-and-extract} framework that models multi-hop reasoning as broad probabilistic recall in shallow MLP layers followed by selective extraction in deeper attention layers. This framework is empirically validated through systematic probing analyses, reinterpreting prior layer-wise decoding evidence, explaining chain-of-thought gains, and providing a mechanistic diagnosis of multi-hop failures despite correct single-hop knowledge. Code is available at https://github.com/laquabe/Layer-Order-Inversion.
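The probing methodology can be sketched as logit-lens-style decoding: apply the unembedding to each layer's hidden state and record the first layer where a target entity becomes top-1. The arrays below are synthetic; a real setup would use the model's hidden states and typically apply the final layer norm first.

```python
import numpy as np

# Logit-lens-style probing sketch: decode each layer's hidden state with
# the unembedding matrix and record the first layer at which a target
# entity becomes the top-1 prediction. Arrays are synthetic toys.

def first_decodable_layer(hidden, unembed, target_id):
    """hidden: (layers, d); unembed: (vocab, d)."""
    for layer, h in enumerate(hidden):
        if int(np.argmax(unembed @ h)) == target_id:
            return layer
    return None

rng = np.random.default_rng(0)
layers, d, vocab = 12, 32, 100
unembed = rng.normal(size=(vocab, d))
hidden = rng.normal(size=(layers, d))
hidden[7:] += 3.0 * unembed[42]      # plant entity 42 from layer 7 onward

print(first_decodable_layer(hidden, unembed, target_id=42))  # likely 7
```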
[41] EvolMem: A Cognitive-Driven Benchmark for Multi-Session Dialogue Memory
Ye Shen, Dun Pei, Yiqiu Guo, Junying Wang, Yijin Guo, Zicheng Zhang, Qi Jia, Jun Zhou, Guangtao Zhai
Main category: cs.CL
TL;DR: EvolMem is a new benchmark for evaluating multi-session memory capabilities of LLMs and agent systems, covering diverse memory dimensions grounded in cognitive psychology, with a hybrid data synthesis framework for scalable conversation generation.
Details
Motivation: Existing benchmarks lack systematic evaluation of LLMs across diverse memory dimensions, particularly in multi-session settings. There's a need for comprehensive assessment of memory capabilities grounded in cognitive psychology principles.
Method: Proposes EvolMem benchmark with hybrid data synthesis framework combining topic-initiated generation and narrative-inspired transformations. This enables scalable generation of multi-session conversations with controllable complexity and sample-specific evaluation guidelines.
Result: Extensive evaluation shows no LLM consistently outperforms others across all memory dimensions. Agent memory mechanisms don’t necessarily enhance LLMs’ capabilities and often exhibit notable efficiency limitations.
Conclusion: EvolMem provides a comprehensive benchmark for assessing multi-session memory capabilities, revealing current limitations in LLMs’ memory performance and agent systems’ efficiency. The benchmark and code will be publicly released.
Abstract: Despite recent advances in understanding and leveraging long-range conversational memory, existing benchmarks still lack systematic evaluation of large language models (LLMs) across diverse memory dimensions, particularly in multi-session settings. In this work, we propose EvolMem, a new benchmark for assessing multi-session memory capabilities of LLMs and agent systems. EvolMem is grounded in cognitive psychology and encompasses both declarative and non-declarative memory, further decomposed into multiple fine-grained abilities. To construct the benchmark, we introduce a hybrid data synthesis framework that consists of topic-initiated generation and narrative-inspired transformations. This framework enables scalable generation of multi-session conversations with controllable complexity, accompanied by sample-specific evaluation guidelines. Extensive evaluation reveals that no LLM consistently outperforms others across all memory dimensions. Moreover, agent memory mechanisms do not necessarily enhance LLMs’ capabilities and often exhibit notable efficiency limitations. Data and code will be released at https://github.com/shenye7436/EvolMem.
[42] Value-Action Alignment in Large Language Models under Privacy-Prosocial Conflict
Guanyu Chen, Chenxiao Yu, Xiyang Hu
Main category: cs.CL
TL;DR: LLMs show varying alignment between privacy/prosocial values and actual data-sharing decisions, measured via new assessment protocol and VAAR metric.
Details
Motivation: Existing evaluations measure privacy attitudes or sharing intentions in isolation, failing to capture whether LLMs' expressed values predict downstream data-sharing actions as in human behavior.
Method: Context-based assessment protocol with sequential standardized questionnaires for privacy attitudes, prosocialness, and data sharing acceptance within bounded sessions. Multi-group structural equation modeling (MGSEM) to analyze relations, and Value-Action Alignment Rate (VAAR) metric for human-referenced directional agreement.
Result: Stable but model-specific Privacy-PSA-AoDS profiles (privacy attitudes, prosocialness, and acceptance of data sharing) across multiple LLMs, with substantial heterogeneity in value-action alignment between privacy concerns/prosocialness and actual data-sharing decisions.
Conclusion: LLMs exhibit complex value-action relationships in data-sharing contexts, requiring integrated assessment of competing attitudes rather than isolated measurements.
Abstract: Large language models (LLMs) are increasingly used to simulate decision-making tasks involving personal data sharing, where privacy concerns and prosocial motivations can push choices in opposite directions. Existing evaluations often measure privacy-related attitudes or sharing intentions in isolation, which makes it difficult to determine whether a model’s expressed values jointly predict its downstream data-sharing actions as in real human behaviors. We introduce a context-based assessment protocol that sequentially administers standardized questionnaires for privacy attitudes, prosocialness, and acceptance of data sharing within a bounded, history-carrying session. To evaluate value-action alignments under competing attitudes, we use multi-group structural equation modeling (MGSEM) to identify relations from privacy concerns and prosocialness to data sharing. We propose Value-Action Alignment Rate (VAAR), a human-referenced directional agreement metric that aggregates path-level evidence for expected signs. Across multiple LLMs, we observe stable but model-specific Privacy-PSA-AoDS profiles, and substantial heterogeneity in value-action alignment.
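The abstract does not spell out VAAR's exact aggregation, so the following is one minimal reading of it as the fraction of MGSEM paths whose estimated sign matches the human-expected direction; the paper's actual weighting over path-level evidence may differ, and the numbers below are hypothetical.

```python
# One minimal reading of VAAR: the share of SEM paths whose estimated sign
# agrees with the human-expected direction. All values below are hypothetical.
estimated_paths = {
    "privacy_concern -> data_sharing": -0.31,
    "prosocialness -> data_sharing": 0.42,
}
expected_signs = {
    "privacy_concern -> data_sharing": -1,  # more concern, less sharing
    "prosocialness -> data_sharing": +1,    # more prosocial, more sharing
}

def vaar(estimates: dict, expected: dict) -> float:
    """Fraction of paths whose estimated coefficient has the expected sign."""
    agree = sum(1 for path, coef in estimates.items() if coef * expected[path] > 0)
    return agree / len(estimates)

print(f"VAAR = {vaar(estimated_paths, expected_signs):.2f}")  # 1.00 for this toy fit
```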
[43] Evaluating LLMs for Police Decision-Making: A Framework Based on Police Action Scenarios
Sangyub Lee, Heedou Kim, Hyeoncheol Kim
Main category: cs.CL
TL;DR: Researchers propose PAS, a systematic evaluation framework for LLMs in police operations, creating a QA dataset from official documents and showing commercial LLMs struggle with police-related tasks.
Details
Motivation: The growing use of LLMs in police operations lacks a tailored evaluation framework, and unverified LLM responses can lead to severe issues like unlawful arrests and improper evidence collection.
Method: Proposed PAS (Police Action Scenarios) framework covering the entire evaluation process, constructed a novel QA dataset from over 8,000 official documents, and established key metrics validated through statistical analysis with police expert judgements.
Result: Experimental results show commercial LLMs struggle with police-related tasks, particularly in providing fact-based recommendations, highlighting the need for specialized evaluation frameworks.
Conclusion: The study demonstrates the necessity of an expandable evaluation framework to ensure reliable AI-driven police operations, and the researchers release their data and prompt template.
Abstract: The use of Large Language Models (LLMs) in police operations is growing, yet an evaluation framework tailored to police operations remains absent. While LLMs’ responses may not always be legally incorrect, their unverified use can still lead to severe issues such as unlawful arrests and improper evidence collection. To address this, we propose PAS (Police Action Scenarios), a systematic framework covering the entire evaluation process. Applying this framework, we constructed a novel QA dataset from over 8,000 official documents and established key metrics validated through statistical analysis with police expert judgements. Experimental results show that commercial LLMs struggle with our new police-related tasks, particularly in providing fact-based recommendations. This study highlights the necessity of an expandable evaluation framework to ensure reliable AI-driven police operations. We release our data and prompt template.
[44] DiffCoT: Diffusion-styled Chain-of-Thought Reasoning in LLMs
Shidong Cao, Hongzhan Lin, Yuxuan Gu, Ziyang Luo, Jing Ma
Main category: cs.CL
TL;DR: DiffCoT: A diffusion-styled Chain-of-Thought framework that treats reasoning as iterative denoising, enabling retrospective correction of intermediate steps while maintaining token-level autoregression.
Details
Motivation: Standard CoT reasoning suffers from exposure bias and error accumulation where early mistakes propagate irreversibly through autoregressive decoding, limiting robustness in multi-step mathematical problem solving.
Method: Reformulates CoT reasoning as iterative denoising process using diffusion principles at reasoning-step level with sliding-window mechanism; introduces causal diffusion noise schedule to preserve temporal structure of reasoning chains while maintaining token-level autoregression.
Result: Extensive experiments on three multi-step CoT reasoning benchmarks across diverse model backbones show DiffCoT consistently outperforms existing CoT preference optimization methods, demonstrating improved robustness and error-correction capability.
Conclusion: DiffCoT provides an effective framework that addresses key limitations of standard CoT reasoning by enabling unified generation and retrospective correction of intermediate steps through diffusion-inspired iterative denoising.
Abstract: Chain-of-Thought (CoT) reasoning improves multi-step mathematical problem solving in large language models but remains vulnerable to exposure bias and error accumulation, as early mistakes propagate irreversibly through autoregressive decoding. In this work, we propose DiffCoT, a diffusion-styled CoT framework that reformulates CoT reasoning as an iterative denoising process. DiffCoT integrates diffusion principles at the reasoning-step level via a sliding-window mechanism, enabling unified generation and retrospective correction of intermediate steps while preserving token-level autoregression. To maintain causal consistency, we further introduce a causal diffusion noise schedule that respects the temporal structure of reasoning chains. Extensive experiments on three multi-step CoT reasoning benchmarks across diverse model backbones demonstrate that DiffCoT consistently outperforms existing CoT preference optimization methods, yielding improved robustness and error-correction capability in CoT reasoning.
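The core control flow, generation of a step followed by sliding-window re-denoising of recent steps, can be sketched independently of any particular model. In the sketch below, generate_step and refine_step are hypothetical stand-ins for LM calls; the paper's causal noise schedule and training objective are not reproduced.

```python
# Schematic DiffCoT-style control flow: autoregressive step generation plus
# sliding-window re-denoising of recent steps. `generate_step` and
# `refine_step` are hypothetical LM-call stand-ins.
from typing import Callable, List

def diffusion_style_cot(
    problem: str,
    generate_step: Callable[[str, List[str]], str],     # drafts the next step
    refine_step: Callable[[str, List[str], int], str],  # re-denoises step i
    n_steps: int = 6,
    window: int = 2,
    rounds: int = 2,
) -> List[str]:
    steps: List[str] = []
    for _ in range(n_steps):
        steps.append(generate_step(problem, steps))  # token-level autoregression
        lo = max(0, len(steps) - window)             # sliding window of steps
        for _ in range(rounds):                      # retrospective correction
            for i in range(lo, len(steps)):
                steps[i] = refine_step(problem, steps, i)
    return steps
```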
[45] How Do Large Language Models Learn Concepts During Continual Pre-Training?
Barry Menglong Yao, Sha Li, Yunzhi Yao, Minqian Liu, Zaishuo Xia, Qifan Wang, Lifu Huang
Main category: cs.CL
TL;DR: LLMs’ concept circuits show predictable patterns of learning and forgetting during continual pretraining, with interference between similar concepts and varying transferability between different concepts.
Details
Motivation: To understand how LLMs acquire, retain, and forget concepts during continual pretraining, and how concepts interact through interference and synergy, by examining their internal computational structures.
Method: Analyze LLMs’ internal Concept Circuits (computational subgraphs for specific concepts) using Graph Metrics to characterize circuit structure during continual pretraining, studying acquisition, forgetting, interference, and synergy patterns.
Result: 1) Concept circuits provide statistically significant signals of learning/forgetting; 2) Circuits show stage-wise temporal patterns (early increase → gradual decrease → stabilization); 3) Larger learning gains correlate with greater forgetting; 4) Semantically similar concepts cause stronger interference; 5) Concepts differ in transferability, with some facilitating others’ learning.
Conclusion: The findings provide a circuit-level view of concept learning dynamics, offering insights for designing more interpretable and robust concept-aware training strategies for LLMs.
Abstract: Human beings primarily understand the world through concepts (e.g., dog), abstract mental representations that structure perception, reasoning, and learning. However, how large language models (LLMs) acquire, retain, and forget such concepts during continual pretraining remains poorly understood. In this work, we study how individual concepts are acquired and forgotten, as well as how multiple concepts interact through interference and synergy. We link these behavioral dynamics to LLMs’ internal Concept Circuits, computational subgraphs associated with specific concepts, and incorporate Graph Metrics to characterize circuit structure. Our analysis reveals: (1) LLMs’ concept circuits provide a non-trivial, statistically significant signal of concept learning and forgetting; (2) concept circuits exhibit a stage-wise temporal pattern during continual pretraining, with an early increase followed by gradual decrease and stabilization; (3) concepts with larger learning gains tend to exhibit greater forgetting under subsequent training; (4) semantically similar concepts induce stronger interference than weakly related ones; (5) concepts differ in their transferability, with some significantly facilitating the learning of others. Together, our findings offer a circuit-level view of concept learning dynamics and inform the design of more interpretable and robust concept-aware training strategies for LLMs.
[46] PsychEthicsBench: Evaluating Large Language Models Against Australian Mental Health Ethics
Yaling Shen, Stephanie Fong, Yiwen Jiang, Zimu Wang, Feilong Tang, Qingyang Xu, Xiangyu Zhao, Zhongxing Xu, Jiahe Liu, Jinpeng Hu, Dominic Dwyer, Zongyuan Ge
Main category: cs.CL
TL;DR: PsychEthicsBench is a new benchmark for evaluating LLM ethical alignment in mental health using Australian clinical guidelines, showing refusal rates are poor indicators of ethical behavior and domain-specific fine-tuning can degrade ethical robustness.
Details
Motivation: Current LLM evaluation in mental health relies too heavily on refusal-based safety signals, which don't capture nuanced clinical behaviors and can lead to unempathetic responses that discourage help-seeking. There's a need for more comprehensive, principle-grounded evaluation frameworks.
Method: Created PsychEthicsBench based on Australian psychology and psychiatry guidelines, featuring multiple-choice and open-ended tasks with fine-grained ethicality annotations. Evaluated 14 LLM models to assess ethical knowledge and behavioral responses.
Result: Refusal rates are poor indicators of ethical behavior, showing significant divergence between safety triggers and clinical appropriateness. Domain-specific fine-tuning can degrade ethical robustness, with specialized models underperforming their base backbones in ethical alignment.
Conclusion: PsychEthicsBench provides a systematic, jurisdiction-aware foundation for evaluating LLMs in mental health, enabling more responsible development by moving beyond refusal-centric metrics to assess nuanced ethical behaviors required in clinical practice.
Abstract: The increasing integration of large language models (LLMs) into mental health applications necessitates robust frameworks for evaluating professional safety alignment. Current evaluative approaches primarily rely on refusal-based safety signals, which offer limited insight into the nuanced behaviors required in clinical practice. In mental health, clinically inadequate refusals can be perceived as unempathetic and discourage help-seeking. To address this gap, we move beyond refusal-centric metrics and introduce PsychEthicsBench, the first principle-grounded benchmark based on Australian psychology and psychiatry guidelines, designed to evaluate LLMs’ ethical knowledge and behavioral responses through multiple-choice and open-ended tasks with fine-grained ethicality annotations. Empirical results across 14 models reveal that refusal rates are poor indicators of ethical behavior, revealing a significant divergence between safety triggers and clinical appropriateness. Notably, we find that domain-specific fine-tuning can degrade ethical robustness, as several specialized models underperform their base backbones in ethical alignment. PsychEthicsBench provides a foundation for systematic, jurisdiction-aware evaluation of LLMs in mental health, encouraging more responsible development in this domain.
[47] OLA: Output Language Alignment in Code-Switched LLM Interactions
Juhyun Oh, Haneul Yoo, Faiz Ghifari Haznitrama, Alice Oh
Main category: cs.CL
TL;DR: LLMs fail to align output language with user expectations in code-switched prompts, showing systematic bias toward non-English responses. OLA benchmark reveals these failures persist across languages and model architectures, but can be fixed with targeted alignment training.
Details
Motivation: Code-switching is natural for multilingual users but poses challenges for LLMs, which must infer output language from contextual cues without explicit specification. Current LLMs systematically fail to meet this expectation, responding in undesired languages even when cues are clear to humans.
Method: Introduced OLA (Output Language Alignment) benchmark for Korean-English code-switching, spanning from simple intra-sentential mixing to instruction-content mismatches. Evaluated frontier models, tested generalization to Chinese and Indonesian pairs, and experimented with Chain-of-Thought prompting. Developed Code-Switching Aware DPO (Direct Preference Optimization) with minimal data (~1K examples).
Result: Even frontier models frequently misinterpret implicit language expectations, exhibiting bias toward non-English responses. Models show instability through mid-response switching and language intrusions. Chain-of-Thought prompting fails to resolve errors, indicating weak pragmatic reasoning. However, Code-Switching Aware DPO with minimal data substantially reduces misalignment.
Conclusion: LLM failures in code-switched interactions stem from insufficient alignment rather than fundamental limitations. Targeted alignment training with minimal data can significantly improve performance. Results highlight the need to align multilingual LLMs with users’ implicit expectations in real-world code-switched interactions.
Abstract: Code-switching, alternating between languages within a conversation, is natural for multilingual users, yet poses fundamental challenges for large language models (LLMs). When a user code-switches in their prompt to an LLM, they typically do not specify the expected language of the LLM response, and thus LLMs must infer the output language from contextual and pragmatic cues. We find that current LLMs systematically fail to align with this expectation, responding in undesired languages even when cues are clear to humans. We introduce OLA, a benchmark to evaluate LLMs’ Output Language Alignment in code-switched interactions. OLA focuses on Korean–English code-switching and spans simple intra-sentential mixing to instruction-content mismatches. Even frontier models frequently misinterpret implicit language expectation, exhibiting a bias toward non-English responses. We further show this bias generalizes beyond Korean to Chinese and Indonesian pairs. Models also show instability through mid-response switching and language intrusions. Chain-of-Thought prompting fails to resolve these errors, indicating weak pragmatic reasoning about output language. However, Code-Switching Aware DPO with minimal data (about 1K examples) substantially reduces misalignment, suggesting these failures stem from insufficient alignment rather than fundamental limitations. Our results highlight the need to align multilingual LLMs with users’ implicit expectations in real-world code-switched interactions.
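A minimal version of the alignment check such a benchmark performs might look like the following script-counting heuristic for Korean-English pairs, covering both overall language alignment and the mid-response switching the paper reports; it is illustrative only and is not OLA's evaluation code, which presumably handles far more nuanced cues.

```python
# Illustrative output-language alignment check for Korean-English pairs.
# A script-counting heuristic, not OLA's evaluation code.
import re

def dominant_language(text: str) -> str:
    hangul = len(re.findall(r"[\uac00-\ud7a3]", text))
    latin = len(re.findall(r"[A-Za-z]", text))
    return "ko" if hangul > latin else "en"

def alignment_report(response: str, expected: str) -> dict:
    sentences = [s for s in re.split(r"(?<=[.!?\u3002])\s+", response) if s]
    langs = [dominant_language(s) for s in sentences]
    return {
        "aligned": dominant_language(response) == expected,
        "mid_response_switch": len(set(langs)) > 1,  # language intrusion
        "per_sentence": langs,
    }

print(alignment_report("Sure, here is the answer. 함수는 값을 반환합니다.", "en"))
# -> aligned overall, but flags the mid-response switch into Korean
```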
[48] From Chains to Graphs: Self-Structured Reasoning for General-Domain LLMs
Yingjian Chen, Haoran Liu, Yinhong Liu, Sherry T. Tong, Aosong Feng, Jinghui Lu, Juntao Zhang, Yusuke Iwasawa, Yutaka Matsuo, Irene Li
Main category: cs.CL
TL;DR: SGR enables LLMs to construct and use graph-structured reasoning for open-domain QA, improving consistency and performance over linear reasoning methods.
Details
Motivation: LLMs show strong reasoning but use linear processes that are often logically inconsistent, while real-world reasoning requires integrating multiple premises and solving subproblems in parallel. Existing methods like Chain-of-Thought appear coherent but lead to inconsistent conclusions, and current approaches don't explore how LLMs can construct their own graph-structured reasoning.
Method: Proposes Self-Graph Reasoning (SGR), a framework that enables LLMs to explicitly represent their reasoning process as a structured graph before producing final answers. Also constructs a graph-structured reasoning dataset that merges multiple candidate reasoning graphs into refined graph structures for model training.
Result: Experiments on five QA benchmarks across general and specialized domains show SGR consistently improves reasoning consistency with 17.74% gain over base model. The LLaMA-3.3-70B model fine-tuned with SGR performs comparably to GPT-4o and surpasses Claude-3.5-Haiku.
Conclusion: Graph-structured reasoning is effective for improving LLM reasoning consistency and performance in open-domain question answering, demonstrating the value of structured reasoning representations over linear approaches.
Abstract: Large Language Models (LLMs) show strong reasoning ability in open-domain question answering, yet their reasoning processes are typically linear and often logically inconsistent. In contrast, real-world reasoning requires integrating multiple premises and solving subproblems in parallel. Existing methods, such as Chain-of-Thought (CoT), express reasoning in a linear textual form, which may appear coherent but frequently leads to inconsistent conclusions. Recent approaches rely on externally provided graphs and do not explore how LLMs can construct and use their own graph-structured reasoning, particularly in open-domain QA. To fill this gap, we present the first exploration of graph-structured reasoning of LLMs in general-domain question answering. We propose Self-Graph Reasoning (SGR), a framework that enables LLMs to explicitly represent their reasoning process as a structured graph before producing the final answer. We further construct a graph-structured reasoning dataset that merges multiple candidate reasoning graphs into refined graph structures for model training. Experiments on five QA benchmarks across both general and specialized domains show that SGR consistently improves reasoning consistency and yields a 17.74% gain over the base model. The LLaMA-3.3-70B model fine-tuned with SGR performs comparably to GPT-4o and surpasses Claude-3.5-Haiku, demonstrating the effectiveness of graph-structured reasoning.
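One plausible serialization of such a self-generated reasoning graph, with a sanity check that the graph is acyclic before the final answer is read off, is sketched below. The node/edge schema is an assumption for illustration; the paper's dataset format may differ.

```python
# One plausible serialization of a self-generated reasoning graph (the paper's
# actual schema may differ), with an acyclicity check before answering.
from graphlib import TopologicalSorter

reasoning_graph = {
    "nodes": {
        "p1": "Alice has 3 apples.",
        "p2": "Bob gives Alice 2 apples.",
        "c1": "Alice now has 3 + 2 = 5 apples.",
        "ans": "The answer is 5.",
    },
    "edges": [("p1", "c1"), ("p2", "c1"), ("c1", "ans")],  # (premise, conclusion)
}

def derivation_order(graph: dict) -> list:
    """Verify the reasoning graph is acyclic and return a valid derivation order."""
    deps = {node: set() for node in graph["nodes"]}
    for premise, conclusion in graph["edges"]:
        deps[conclusion].add(premise)
    return list(TopologicalSorter(deps).static_order())

print(derivation_order(reasoning_graph))  # premises, then conclusions, then answer
```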
[49] DiVA: Fine-grained Factuality Verification with Agentic-Discriminative Verifier
Hui Huang, Muyun Yang, Yuki Arase
Main category: cs.CL
TL;DR: DiVA combines generative and discriminative models for fine-grained factuality verification, outperforming existing methods on the new FGVeriBench benchmark.
Details
Motivation: Current factuality verification methods only provide binary judgments (correct/incorrect), which fails to capture varying degrees of error severity and limits utility for fine-grained evaluation and preference optimization.
Method: Proposes Agentic Discriminative Verifier (DiVA), a hybrid framework that synergizes agentic search capabilities of generative models with precise scoring aptitude of discriminative models for fine-grained factuality verification.
Result: DiVA significantly outperforms existing methods on factuality verification for both general and multi-hop questions on the new FGVeriBench benchmark.
Conclusion: The proposed DiVA framework successfully addresses the limitation of binary factuality judgments by enabling fine-grained verification, demonstrating superior performance over existing approaches.
Abstract: Despite the significant advancements of Large Language Models (LLMs), their factuality remains a critical challenge, fueling growing interest in factuality verification. Existing research on factuality verification primarily conducts binary judgments (e.g., correct or incorrect), which fails to distinguish varying degrees of error severity. This limits its utility for applications such as fine-grained evaluation and preference optimization. To bridge this gap, we propose the Agentic Discriminative Verifier (DiVA), a hybrid framework that synergizes the agentic search capabilities of generative models with the precise scoring aptitude of discriminative models. We also construct a new benchmark, FGVeriBench, as a robust testbed for fine-grained factuality verification. Experimental results on FGVeriBench demonstrate that our DiVA significantly outperforms existing methods on factuality verification for both general and multi-hop questions.
[50] Evaluating the Pre-Consultation Ability of LLMs using Diagnostic Guidelines
Jean Seo, Gibaeg Kim, Kihun Shin, Seungseop Lim, Hyunkyung Lee, Wooseok Han, Jongwon Lee, Eunho Yang
Main category: cs.CL
TL;DR: EPAG is a benchmark for evaluating LLMs’ pre-consultation ability using diagnostic guidelines, showing that fine-tuned small models can outperform frontier LLMs, and that more HPI doesn’t always improve diagnosis.
Details
Motivation: To develop a systematic framework for evaluating LLMs' pre-consultation capabilities in clinical settings, addressing the need for standardized assessment of how well LLMs can perform initial patient evaluation before formal medical consultation.
Method: Created EPAG benchmark dataset and framework that evaluates LLMs directly through HPI-diagnostic guideline comparison and indirectly through disease diagnosis. Includes experiments comparing frontier LLMs with fine-tuned small open-source models, analyzing HPI quantity impact, and examining language influence on dialogue characteristics.
Result: Fine-tuned small open-source models can outperform frontier LLMs in pre-consultation tasks. Increased HPI amount doesn’t necessarily improve diagnostic performance. Language of pre-consultation influences dialogue characteristics. Dataset and evaluation pipeline are open-sourced.
Conclusion: EPAG provides a valuable benchmark for evaluating LLMs in clinical pre-consultation settings, demonstrating that task-specific fine-tuning can outperform general-purpose large models, and highlighting important factors like HPI quantity and language influence for real-world clinical LLM applications.
Abstract: We introduce EPAG, a benchmark dataset and framework designed for Evaluating the Pre-consultation Ability of LLMs using diagnostic Guidelines. LLMs are evaluated directly through HPI-diagnostic guideline comparison and indirectly through disease diagnosis. In our experiments, we observe that small open-source models fine-tuned with a well-curated, task-specific dataset can outperform frontier LLMs in pre-consultation. Additionally, we find that increased amount of HPI (History of Present Illness) does not necessarily lead to improved diagnostic performance. Further experiments reveal that the language of pre-consultation influences the characteristics of the dialogue. By open-sourcing our dataset and evaluation pipeline on https://github.com/seemdog/EPAG, we aim to contribute to the evaluation and further development of LLM applications in real-world clinical settings.
[51] Reasoning Model Is Superior LLM-Judge, Yet Suffers from Biases
Hui Huang, Xuanxin Wu, Muyun Yang, Yuki Arase
Main category: cs.CL
TL;DR: LRMs outperform non-reasoning LLMs in judgment accuracy, instruction-following, and robustness, but still exhibit biases. PlanJudge strategy mitigates biases by having models generate evaluation plans first.
Details
Motivation: To systematically compare whether Large Reasoning Models (LRMs) are superior judges compared to non-reasoning LLMs, and to address the bias issues that persist in both types of models.
Method: Empirical analysis comparing LRMs vs non-reasoning LLMs on judgment tasks, plus proposing PlanJudge, an evaluation strategy where models generate explicit evaluation plans before execution to reduce biases.
Result: Four key findings: 1) LRMs outperform in judgment accuracy, especially on reasoning tasks; 2) Better instruction-following; 3) Enhanced robustness against adversarial attacks; 4) Still exhibit strong superficial quality biases. PlanJudge significantly mitigates biases in both LRMs and standard LLMs.
Conclusion: LRMs are superior judges to non-reasoning LLMs but still suffer from biases. The simple PlanJudge strategy effectively reduces biases in both model types, improving evaluation robustness.
Abstract: This paper presents the first systematic comparison investigating whether Large Reasoning Models (LRMs) are superior judges to non-reasoning LLMs. Our empirical analysis yields four key findings: 1) LRMs outperform non-reasoning LLMs in terms of judgment accuracy, particularly on reasoning-intensive tasks; 2) LRMs demonstrate superior instruction-following capabilities in evaluation contexts; 3) LRMs exhibit enhanced robustness against adversarial attacks targeting judgment tasks; 4) however, LRMs still exhibit strong biases related to superficial quality. To improve the robustness against biases, we propose PlanJudge, an evaluation strategy that prompts the model to generate an explicit evaluation plan before execution. Despite its simplicity, our experiments demonstrate that PlanJudge significantly mitigates biases in both LRMs and standard LLMs.
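PlanJudge reduces to a two-stage prompt: elicit an explicit evaluation plan, then execute it. A minimal sketch follows, where llm is a hypothetical prompt-to-completion callable and the prompt wording is illustrative rather than the paper's.

```python
# Two-stage PlanJudge-style prompting. `llm` is a hypothetical callable mapping
# a prompt string to a completion; the wording is illustrative, not the paper's.
from typing import Callable

def plan_judge(llm: Callable[[str], str], instruction: str, a: str, b: str) -> str:
    # Stage 1: elicit an explicit evaluation plan before any judgment.
    plan = llm(
        "You will compare two responses to an instruction. Before judging, "
        "write a short evaluation plan: list the criteria you will check and "
        "how you will weigh them. Do not judge yet.\n\n"
        f"Instruction: {instruction}"
    )
    # Stage 2: execute the plan and produce the verdict.
    return llm(
        "Execute the evaluation plan below step by step, then answer 'A' or "
        "'B' with a one-sentence justification.\n\n"
        f"Plan:\n{plan}\n\nInstruction: {instruction}\n\n"
        f"Response A: {a}\n\nResponse B: {b}"
    )
```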
[52] Agent-Dice: Disentangling Knowledge Updates via Geometric Consensus for Agent Continual Learning
Zheng Wu, Xingyu Lou, Xinbei Ma, Yansi Li, Weiwen Liu, Weinan Zhang, Jun Wang, Zhuosheng Zhang
Main category: cs.CL
TL;DR: Agent-Dice is a parameter fusion framework that addresses catastrophic forgetting in LLM-based agents by distinguishing between common knowledge and conflicting knowledge through directional consensus evaluation.
Details
Motivation: LLM-based agents face the stability-plasticity dilemma when learning new tasks continuously, leading to catastrophic forgetting. The core issue is the failure to explicitly separate common knowledge shared across tasks from conflicting knowledge caused by task-specific interference.
Method: Agent-Dice uses a two-stage parameter fusion process: 1) geometric consensus filtering to prune conflicting gradients, and 2) curvature-based importance weighting to amplify shared semantics. This directional consensus evaluation approach disentangles knowledge updates.
Result: Extensive experiments on GUI agents and tool-use agent domains show Agent-Dice achieves outstanding continual learning performance with minimal computational overhead and parameter updates.
Conclusion: The framework successfully addresses the stability-plasticity dilemma in LLM-based agents by explicitly distinguishing between common and conflicting knowledge, supported by rigorous theoretical analysis and practical effectiveness.
Abstract: Large Language Model (LLM)-based agents significantly extend the utility of LLMs by interacting with dynamic environments. However, enabling agents to continually learn new tasks without catastrophic forgetting remains a critical challenge, known as the stability-plasticity dilemma. In this work, we argue that this dilemma fundamentally arises from the failure to explicitly distinguish between common knowledge shared across tasks and conflicting knowledge introduced by task-specific interference. To address this, we propose Agent-Dice, a parameter fusion framework based on directional consensus evaluation. Concretely, Agent-Dice disentangles knowledge updates through a two-stage process: geometric consensus filtering to prune conflicting gradients, and curvature-based importance weighting to amplify shared semantics. We provide a rigorous theoretical analysis that establishes the validity of the proposed fusion scheme and offers insight into the origins of the stability-plasticity dilemma. Extensive experiments on GUI agents and tool-use agent domains demonstrate that Agent-Dice exhibits outstanding continual learning performance with minimal computational overhead and parameter updates.
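The geometric consensus filtering step can be illustrated with a sign-consensus fusion of per-task parameter updates, in the spirit of TIES-style merging; note this is a stand-in reading of the paper's formulation, and the curvature-based importance weighting is reduced here to a fixed per-parameter weight vector.

```python
# Sign-consensus fusion of per-task parameter updates, a stand-in reading of
# Agent-Dice's geometric consensus filtering (in the spirit of TIES-style
# merging). The curvature-based importance weighting is reduced to a fixed
# per-parameter weight vector, which is an assumption for illustration.
import torch

def consensus_fuse(deltas, importance):
    """Fuse task updates, keeping only components whose direction agrees."""
    stacked = torch.stack(deltas)              # [tasks, params]
    majority = torch.sign(stacked.sum(dim=0))  # consensus direction per param
    agree = torch.sign(stacked) == majority    # mask conflicting components
    filtered = torch.where(agree, stacked, torch.zeros_like(stacked))
    return importance * filtered.sum(dim=0) / agree.sum(dim=0).clamp(min=1)

task_a = torch.tensor([0.6, -0.2, 0.1])
task_b = torch.tensor([0.4, 0.3, 0.2])
weights = torch.tensor([1.0, 0.5, 2.0])  # stand-in for curvature importance
print(consensus_fuse([task_a, task_b], weights))  # the conflicting -0.2 is pruned
```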
[53] LLM-MC-Affect: LLM-Based Monte Carlo Modeling of Affective Trajectories and Latent Ambiguity for Interpersonal Dynamic Insight
Yu-Zheng Lin, Bono Po-Jen Shih, John Paul Martin Encinas, Elizabeth Victoria Abraham Achom, Karan Himanshu Patel, Jesus Horacio Pacheco, Sicong Shao, Jyotikrishna Dass, Soheil Salehi, Pratik Satam
Main category: cs.CL
TL;DR: LLM-MC-Affect: A probabilistic framework using stochastic LLM decoding and Monte Carlo estimation to model emotion as continuous latent distributions, enabling analysis of interpersonal emotional coordination through sentiment trajectories and coupling indicators.
Details
Motivation: Prior text-based affect inference approaches treat sentiment as deterministic point estimates for individual speakers, failing to capture the inherent subjectivity, latent ambiguity, and sequential coupling found in mutual exchanges. There's a need to better model the dynamic, probabilistic nature of emotional coordination in human interaction.
Method: Introduces LLM-MC-Affect, a probabilistic framework that characterizes emotion as continuous latent probability distributions over affective space. Uses stochastic LLM decoding and Monte Carlo estimation to approximate these distributions, deriving sentiment trajectories that quantify both central affective tendencies and perceptual ambiguity. Analyzes interpersonal coupling through sequential cross-correlation and slope-based indicators to identify leading/lagging influences.
Result: Validated on teacher-student instructional dialogues, where quantitative indicators successfully distilled high-level interaction insights such as effective scaffolding. The framework provides a scalable, deployable pathway for understanding interpersonal dynamics.
Conclusion: Establishes a generalizable solution for analyzing emotional coordination in human interaction that extends beyond education to broader social and behavioral research, moving beyond static sentiment labels to capture the dynamic, probabilistic nature of interpersonal affect.
Abstract: Emotional coordination is a core property of human interaction that shapes how relational meaning is constructed in real time. While text-based affect inference has become increasingly feasible, prior approaches often treat sentiment as a deterministic point estimate for individual speakers, failing to capture the inherent subjectivity, latent ambiguity, and sequential coupling found in mutual exchanges. We introduce LLM-MC-Affect, a probabilistic framework that characterizes emotion not as a static label, but as a continuous latent probability distribution defined over an affective space. By leveraging stochastic LLM decoding and Monte Carlo estimation, the methodology approximates these distributions to derive high-fidelity sentiment trajectories that explicitly quantify both central affective tendencies and perceptual ambiguity. These trajectories enable a structured analysis of interpersonal coupling through sequential cross-correlation and slope-based indicators, identifying leading or lagging influences between interlocutors. To validate the interpretive capacity of this approach, we utilize teacher-student instructional dialogues as a representative case study, where our quantitative indicators successfully distill high-level interaction insights such as effective scaffolding. This work establishes a scalable and deployable pathway for understanding interpersonal dynamics, offering a generalizable solution that extends beyond education to broader social and behavioral research.
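The Monte Carlo machinery is straightforward to sketch: repeatedly sample a valence rating under stochastic decoding to estimate a per-utterance distribution, then cross-correlate two speakers' trajectories to find lead/lag structure. Here sample_valence is a hypothetical stand-in for one temperature-sampled LLM rating call, and the lag search is a simplified reading of the paper's coupling indicators.

```python
# Monte Carlo approximation of a per-utterance affect distribution, plus a
# lead/lag check between two speakers' valence trajectories. `sample_valence`
# is a hypothetical stand-in for one temperature-sampled LLM rating call.
import numpy as np

def mc_affect(sample_valence, utterance: str, n: int = 32):
    """Mean valence and its spread (latent ambiguity) over n stochastic samples."""
    draws = np.array([sample_valence(utterance) for _ in range(n)])
    return draws.mean(), draws.std()

def lead_lag(traj_a: np.ndarray, traj_b: np.ndarray, max_lag: int = 3) -> int:
    """Lag (in turns) at which A best predicts B; a positive lag means A leads."""
    a = (traj_a - traj_a.mean()) / (traj_a.std() + 1e-8)
    b = (traj_b - traj_b.mean()) / (traj_b.std() + 1e-8)
    corrs = {
        lag: np.corrcoef(a[: len(a) - lag], b[lag:])[0, 1] if lag else np.corrcoef(a, b)[0, 1]
        for lag in range(max_lag + 1)
    }
    return max(corrs, key=corrs.get)
```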
[54] ELO: Efficient Layer-Specific Optimization for Continual Pretraining of Multilingual LLMs
HanGyeol Yoo, ChangSu Choi, Minjun Kim, Seohyun Song, SeungWoo Song, Inho Won, Jongyoul Park, Cheoneum Park, KyungTae Lim
Main category: cs.CL
TL;DR: ELO method efficiently enhances multilingual LLMs for specific languages by training only critical first/last layers then aligning them, achieving 6.46x speedup with improved target language performance while preserving source language capabilities.
Details
Motivation: Traditional continual pretraining for multilingual LLMs suffers from high computational costs and degradation of source language performance when adapting to new target languages.
Method: Two-stage approach: (1) ELO Pretraining - detach and train only critical first and last layers on target language data, (2) Layer Alignment - reintegrate trained layers and perform brief full fine-tuning on small dataset for parameter alignment.
Result: Achieves up to 6.46x training speedup compared to existing methods, improves target language performance by up to 6.2% on benchmarks, and effectively preserves source language (English) capabilities.
Conclusion: ELO provides an efficient solution for multilingual LLM adaptation that dramatically reduces computational costs while improving target language performance and maintaining source language proficiency.
Abstract: We propose an efficient layer-specific optimization (ELO) method designed to enhance continual pretraining (CP) for specific languages in multilingual large language models (MLLMs). This approach addresses the common challenges of high computational cost and degradation of source language performance associated with traditional CP. The ELO method consists of two main stages: (1) ELO Pretraining, where a small subset of specific layers, identified in our experiments as the critically important first and last layers, are detached from the original MLLM and trained with the target language. This significantly reduces not only the number of trainable parameters but also the total parameters computed during the forward pass, minimizing GPU memory consumption and accelerating the training process. (2) Layer Alignment, where the newly trained layers are reintegrated into the original model, followed by a brief full fine-tuning step on a small dataset to align the parameters. Experimental results demonstrate that the ELO method achieves a training speedup of up to 6.46 times compared to existing methods, while improving target language performance by up to 6.2% on qualitative benchmarks and effectively preserving source language (English) capabilities.
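Stage 1 can be approximated in a few lines: restrict training to the first and last transformer blocks. Note the paper goes further by detaching those layers so the forward pass itself is cheaper; this sketch only restricts which parameters receive gradients, and the LLaMA-style model.model.layers path and checkpoint name are assumptions about the backbone.

```python
# Stage-1 ELO-style setup: train only the first and last transformer blocks.
# The paper additionally detaches these layers so the forward pass itself is
# cheaper; this sketch only restricts which parameters receive gradients.
# The `model.model.layers` path assumes a LLaMA-style backbone.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")  # stand-in

for p in model.parameters():
    p.requires_grad = False  # freeze the whole network first

for block in (model.model.layers[0], model.model.layers[-1]):  # critical layers
    for p in block.parameters():
        p.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable fraction: {trainable / total:.2%}")
```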
[55] SyncThink: A Training-Free Strategy to Align Inference Termination with Reasoning Saturation
Gengyang Li, Wang Cai, Yifeng Gao, Yunfang Wu
Main category: cs.CL
TL;DR: SyncThink is a training-free decoding method that reduces Chain-of-Thought reasoning overhead by monitoring the model’s reasoning-transition signal and terminating reasoning early, achieving comparable accuracy with significantly fewer tokens and lower latency.
Details
Motivation: Chain-of-Thought prompting produces long and redundant reasoning traces that substantially increase inference cost, creating a need for methods that reduce this overhead without modifying model weights.
Method: SyncThink monitors the model's reasoning-transition signal by observing that answer tokens attend weakly to early reasoning and focus on the special token “/think”. It uses this observation to detect when reasoning should terminate, creating a plug-and-play decoding method that doesn't require weight modifications.
Result: On GSM8K, MMLU, GPQA, and BBH across three DeepSeek-R1 distilled models, SyncThink achieves 62.00% average Top-1 accuracy using only 656 tokens and 28.68s latency, compared to full CoT decoding’s 61.22% accuracy with 2141 tokens and 92.01s latency. On GPQA, it yields up to +8.1 absolute accuracy improvement by preventing over-thinking.
Conclusion: SyncThink effectively reduces Chain-of-Thought reasoning overhead while maintaining or even improving accuracy by terminating reasoning at the optimal point, making it a practical solution for efficient reasoning in language models.
Abstract: Chain-of-Thought (CoT) prompting improves reasoning but often produces long and redundant traces that substantially increase inference cost. We present SyncThink, a training-free and plug-and-play decoding method that reduces CoT overhead without modifying model weights. We find that answer tokens attend weakly to early reasoning and instead focus on the special token “/think”, indicating an information bottleneck. Building on this observation, SyncThink monitors the model’s own reasoning-transition signal and terminates reasoning once that signal appears. Experiments on GSM8K, MMLU, GPQA, and BBH across three DeepSeek-R1 distilled models show that SyncThink achieves 62.00 percent average Top-1 accuracy using 656 generated tokens and 28.68 s latency, compared to 61.22 percent, 2141 tokens, and 92.01 s for full CoT decoding. On long-horizon tasks such as GPQA, SyncThink can further yield up to +8.1 points of absolute accuracy by preventing over-thinking.
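A schematic version of the stopping rule: track how strongly the newest token attends to the think-transition token (attention weights obtained from a forward pass with output_attentions=True) and halt reasoning when that mass crosses a threshold. The signal extraction and threshold below are illustrative assumptions, not the paper's exact criterion.

```python
# Schematic SyncThink-style stopping rule: halt reasoning once the newest
# token's attention concentrates on the think-transition token. The signal
# extraction and threshold are illustrative assumptions.
import torch

def transition_signal(attentions, think_pos: int) -> float:
    """Attention mass from the newest token to the transition token,
    averaged over heads of the last layer."""
    last_layer = attentions[-1]  # shape [batch, heads, seq, seq]
    return last_layer[0, :, -1, think_pos].mean().item()

def should_stop_reasoning(attentions, think_pos: int, threshold: float = 0.3) -> bool:
    return transition_signal(attentions, think_pos) > threshold
```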
[56] e5-omni: Explicit Cross-modal Alignment for Omni-modal Embeddings
Haonan Chen, Sicheng Gao, Radu Timofte, Tetsuya Sakai, Zhicheng Dou
Main category: cs.CL
TL;DR: E5-omni: A lightweight explicit alignment method that adapts vision-language models into robust omni-modal embedding models by addressing modality-dependent similarity scales, imbalanced negative hardness, and cross-modal statistical mismatches.
Details
Motivation: Current omni-modal embedding models rely on implicit alignment from pretrained VLMs, causing three issues: (1) modality-dependent similarity scales making scores inconsistent, (2) imbalanced hardness distribution in mixed-modality batches making negatives trivial, and (3) mismatched cross-modal statistics making rankings unstable.
Method: Three components: (1) modality-aware temperature calibration to align similarity scales, (2) controllable negative curriculum with debiasing to focus on confusing negatives while reducing false negative impact, and (3) batch whitening with covariance regularization to match cross-modal geometry.
Result: Experiments on MMEB-V2 and AudioCaps show consistent gains over strong bi-modal and omni-modal baselines. The recipe transfers well to other VLM backbones.
Conclusion: E5-omni effectively addresses key limitations of implicit alignment in omni-modal embeddings through explicit alignment techniques, improving robustness and performance across diverse modalities.
Abstract: Modern information systems often involve different types of items, e.g., a text query, an image, a video clip, or an audio segment. This motivates omni-modal embedding models that map heterogeneous modalities into a shared space for direct comparison. However, most recent omni-modal embeddings still rely heavily on implicit alignment inherited from pretrained vision-language model (VLM) backbones. In practice, this causes three common issues: (i) similarity logits have modality-dependent sharpness, so scores are not on a consistent scale; (ii) in-batch negatives become less effective over time because mixed-modality batches create an imbalanced hardness distribution; as a result, many negatives quickly become trivial and contribute little gradient; and (iii) embeddings across modalities show mismatched first- and second-order statistics, which makes rankings less stable. To tackle these problems, we propose e5-omni, a lightweight explicit alignment recipe that adapts off-the-shelf VLMs into robust omni-modal embedding models. e5-omni combines three simple components: (1) modality-aware temperature calibration to align similarity scales, (2) a controllable negative curriculum with debiasing to focus on confusing negatives while reducing the impact of false negatives, and (3) batch whitening with covariance regularization to better match cross-modal geometry in the shared embedding space. Experiments on MMEB-V2 and AudioCaps show consistent gains over strong bi-modal and omni-modal baselines, and the same recipe also transfers well to other VLM backbones. We release our model checkpoint at https://huggingface.co/Haon-Chen/e5-omni-7B.
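Of the three components, batch whitening is the easiest to make concrete: transform a batch of embeddings to zero mean and (regularized) identity covariance so that modalities share second-order statistics. The regularization strength below is an illustrative choice, not the paper's setting.

```python
# Batch whitening with covariance regularization: map a batch of embeddings to
# zero mean and (approximately) identity covariance. The eps value is an
# illustrative choice.
import torch

def whiten(embeddings: torch.Tensor, eps: float = 1e-3) -> torch.Tensor:
    x = embeddings - embeddings.mean(dim=0, keepdim=True)
    cov = (x.T @ x) / (x.shape[0] - 1)
    cov = cov + eps * torch.eye(cov.shape[0])  # covariance regularization
    vals, vecs = torch.linalg.eigh(cov)        # inverse square root via eigh
    inv_sqrt = vecs @ torch.diag(vals.clamp(min=eps).rsqrt()) @ vecs.T
    return x @ inv_sqrt

batch = torch.randn(64, 16) * torch.linspace(0.1, 3.0, 16)  # mismatched scales
print(torch.cov(whiten(batch).T).diagonal())  # each entry close to 1.0
```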
[57] eTracer: Towards Traceable Text Generation via Claim-Level Grounding
Bohao Chu, Qianli Wang, Hendrik Damm, Hui Wang, Ula Muhabbek, Elisabeth Livingstone, Christoph M. Friedrich, Norbert Fuhr
Main category: cs.CL
TL;DR: eTracer is a plug-and-play framework for traceable text generation that grounds claims against contextual evidence to verify system-generated responses in biomedical domains.
Details
Motivation: The paper addresses the challenge of efficiently verifying system-generated responses, particularly in high-stakes biomedical domains where accuracy and trustworthiness are critical.
Method: eTracer uses post-hoc claim-level grounding where each response claim is aligned with contextual evidence (supporting or contradicting). It enables tracing responses back to source evidence and quantifies response faithfulness.
Result: Experiments show substantial improvements in grounding quality and user verification efficiency compared to conventional sentence-level grounding methods.
Conclusion: eTracer enhances verifiability and trustworthiness of generated responses through claim-level evidence grounding, making it particularly valuable for biomedical applications.
Abstract: How can system-generated responses be efficiently verified, especially in the high-stakes biomedical domain? To address this challenge, we introduce eTracer, a plug-and-play framework that enables traceable text generation by grounding claims against contextual evidence. Through post-hoc grounding, each response claim is aligned with contextual evidence that either supports or contradicts it. Building on claim-level grounding results, eTracer not only enables users to precisely trace responses back to their contextual source but also quantifies response faithfulness, thereby enabling the verifiability and trustworthiness of generated responses. Experiments show that our claim-level grounding approach alleviates the limitations of conventional grounding methods in aligning generated statements with contextual sentence-level evidence, resulting in substantial improvements in overall grounding quality and user verification efficiency. The code and data are available at https://github.com/chubohao/eTracer.
[58] DisastQA: A Comprehensive Benchmark for Evaluating Question Answering in Disaster Management
Zhitong Chen, Kai Yin, Xiangjue Dong, Chengkai Liu, Xiangpeng Li, Yiming Xiao, Bo Li, Junwei Ma, Ali Mostafavi, James Caverlee
Main category: cs.CL
TL;DR: DisastQA is a new benchmark for evaluating QA systems in disaster management scenarios with uncertain/conflicting information, featuring 3,000 verified questions across 8 disaster types with varying evidence conditions.
Details
Motivation: Existing QA benchmarks are built on clean evidence and don't capture the uncertain, conflicting information scenarios crucial for disaster management applications.
Method: Created DisastQA benchmark via human-LLM collaboration pipeline with stratified sampling (3,000 questions: 2,000 multiple-choice, 1,000 open-ended). Evaluated 20 models under varying evidence conditions (closed-book to noisy evidence integration). Proposed human-verified keypoint-based evaluation for open-ended QA.
Result: Experiments show substantial divergences from general-purpose benchmarks like MMLU-Pro. Recent open-weight models approach proprietary systems in clean settings but degrade sharply under realistic noise, exposing critical reliability gaps for disaster response.
Conclusion: DisastQA reveals critical gaps in current QA systems for disaster management, especially under realistic noisy conditions, highlighting the need for benchmarks that test reasoning under uncertainty for real-world applications.
Abstract: Accurate question answering (QA) in disaster management requires reasoning over uncertain and conflicting information, a setting poorly captured by existing benchmarks built on clean evidence. We introduce DisastQA, a large-scale benchmark of 3,000 rigorously verified questions (2,000 multiple-choice and 1,000 open-ended) spanning eight disaster types. The benchmark is constructed via a human-LLM collaboration pipeline with stratified sampling to ensure balanced coverage. Models are evaluated under varying evidence conditions, from closed-book to noisy evidence integration, enabling separation of internal knowledge from reasoning under imperfect information. For open-ended QA, we propose a human-verified keypoint-based evaluation protocol emphasizing factual completeness over verbosity. Experiments with 20 models reveal substantial divergences from general-purpose leaderboards such as MMLU-Pro. While recent open-weight models approach proprietary systems in clean settings, performance degrades sharply under realistic noise, exposing critical reliability gaps for disaster response. All code, data, and evaluation resources are available at https://github.com/TamuChen18/DisastQA_open.
[59] NeuronScope: A Multi-Agent Framework for Explaining Polysemantic Neurons in Language Models
Weiqi Liu, Yongliang Miao, Haiyan Zhao, Yanguang Liu, Mengnan Du
Main category: cs.CL
TL;DR: NeuronScope: multi-agent framework for interpreting polysemantic neurons in LLMs using iterative activation-guided decomposition.
Details
Motivation: Existing single-pass neuron interpretation methods fail to capture the widespread polysemanticity in LLMs where individual neurons respond to multiple distinct semantic concepts.
Method: Multi-agent framework that reformulates neuron interpretation as an iterative, activation-guided process; deconstructs neuron activations into atomic semantic components, clusters them into distinct semantic modes, and iteratively refines explanations using neuron activation feedback.
Result: NeuronScope uncovers hidden polysemanticity and produces explanations with significantly higher activation correlation compared to single-pass baselines
Conclusion: NeuronScope provides a more faithful approach to neuron interpretation in LLMs by addressing polysemanticity through iterative decomposition and refinement
Abstract: Neuron-level interpretation in large language models (LLMs) is fundamentally challenged by widespread polysemanticity, where individual neurons respond to multiple distinct semantic concepts. Existing single-pass interpretation methods struggle to faithfully capture such multi-concept behavior. In this work, we propose NeuronScope, a multi-agent framework that reformulates neuron interpretation as an iterative, activation-guided process. NeuronScope explicitly deconstructs neuron activations into atomic semantic components, clusters them into distinct semantic modes, and iteratively refines each explanation using neuron activation feedback. Experiments demonstrate that NeuronScope uncovers hidden polysemanticity and produces explanations with significantly higher activation correlation compared to single-pass baselines.
[60] Towards Compositional Generalization of LLMs via Skill Taxonomy Guided Data Synthesis
Yifan Wei, Li Du, Xiaoyan Yu, Yang Feng, Angsheng Li
Main category: cs.CL
TL;DR: STEPS is a framework that generates compositionally challenging data for LLMs by creating a skill taxonomy and using entropy-based data synthesis to improve compositional generalization.
Details
Motivation: LLMs and agent systems struggle with compositional generalization due to power-law distribution of complex skill combinations, creating a data bottleneck that limits instruction-following performance and agent task generalization.
Method: STEPS creates a hierarchical skill taxonomy using structural information theory to uncover latent skill relationships, then formulates data synthesis as constrained information maximization to select skill combinations that maximize marginal structural information while preserving semantic coherence.
Result: STEPS outperforms existing data synthesis baselines on challenging instruction-following benchmarks and yields improved compositional generalization in downstream agent-based evaluations.
Conclusion: The STEPS framework effectively addresses compositional generalization challenges in LLMs and agent systems through taxonomy-guided entropy-based data synthesis, demonstrating superior performance on instruction-following and agent tasks.
Abstract: Large Language Models (LLMs) and agent-based systems often struggle with compositional generalization due to a data bottleneck in which complex skill combinations follow a long-tailed, power-law distribution, limiting both instruction-following performance and generalization in agent-centric tasks. To address this challenge, we propose STEPS, a Skill Taxonomy guided Entropy-based Post-training data Synthesis framework for generating compositionally challenging data. STEPS explicitly targets compositional generalization by uncovering latent relationships among skills and organizing them into an interpretable, hierarchical skill taxonomy using structural information theory. Building on this taxonomy, we formulate data synthesis as a constrained information maximization problem, selecting skill combinations that maximize marginal structural information within the hierarchy while preserving semantic coherence. Experiments on challenging instruction-following benchmarks show that STEPS outperforms existing data synthesis baselines, while also yielding improved compositional generalization in downstream agent-based evaluations.
[61] From Implicit to Explicit: Token-Efficient Logical Supervision for Mathematical Reasoning in LLMs
Shaojie Wang, Liang Zhang
Main category: cs.CL
TL;DR: FSLR is a lightweight training framework that improves LLMs’ logical reasoning by focusing on the first planning step (identifying variables and operations), addressing the core limitation of logical relationship understanding that accounts for most errors in mathematical problem-solving.
Details
Motivation: LLMs show limited logical reasoning in math problems, relying on pattern-matching instead of genuine reasoning. Over 90% of errors stem from poor logical relationship understanding, and current methods like CoT-SFT fail to substantially address this bottleneck.
Method: First-Step Logical Reasoning (FSLR) trains models on the isolated first planning step: identifying which variables to use and which operation to apply. This provides explicit supervision for logical relationship understanding by forcing models to derive relationships directly from problem statements, unlike CoT-SFT which embeds relationships implicitly in complete solutions.
Result: FSLR consistently outperforms CoT-SFT across multiple models and datasets, with average improvements of 3.2% in-distribution and 4.6% out-of-distribution. It achieves 4-6x faster training and reduces training token consumption by over 80%.
Conclusion: Targeting the first planning step for explicit logical relationship understanding is an effective and efficient approach to improve LLMs’ logical reasoning capabilities, addressing a core limitation that current methods fail to solve.
Abstract: Recent studies reveal that large language models (LLMs) exhibit limited logical reasoning abilities in mathematical problem-solving, instead often relying on pattern-matching and memorization. We systematically analyze this limitation, focusing on logical relationship understanding, which is a core capability underlying genuine logical reasoning, and reveal that errors related to this capability account for over 90% of incorrect predictions, with Chain-of-Thought Supervised Fine-Tuning (CoT-SFT) failing to substantially reduce these errors. To address this bottleneck, we propose First-Step Logical Reasoning (FSLR), a lightweight training framework targeting logical relationship understanding. Our key insight is that the first planning step, identifying which variables to use and which operation to apply, encourages the model to derive logical relationships directly from the problem statement. By training models on this isolated step, FSLR provides explicit supervision for logical relationship understanding, unlike CoT-SFT which implicitly embeds such relationships within complete solution trajectories. Extensive experiments across multiple models and datasets demonstrate that FSLR consistently outperforms CoT-SFT under both in-distribution and out-of-distribution settings, with average improvements of 3.2% and 4.6%, respectively. Moreover, FSLR achieves 4-6x faster training and reduces training token consumption by over 80%.
[62] Evaluation Framework for AI Creativity: A Case Study Based on Story Generation
Pharath Sathya, Yin Jou Huang, Fei Cheng
Main category: cs.CL
TL;DR: Proposes a structured evaluation framework for AI story generation with four components (Novelty, Value, Adherence, Resonance) and eleven sub-components, showing creativity is evaluated hierarchically rather than cumulatively.
Details
Motivation: Existing reference-based metrics fail to capture the subjective nature of creativity in text generation, creating a need for better evaluation methods.
Method: Developed a structured evaluation framework with four components and eleven sub-components, used controlled story generation via "Spike Prompting," and conducted a crowdsourced study with 115 readers to examine how different creative components shape human creativity judgments.
Result: Creativity is evaluated hierarchically rather than cumulatively, with different dimensions becoming salient at different stages of judgment. Reflective evaluation substantially alters both ratings and inter-rater agreement.
Conclusion: The proposed framework effectively reveals dimensions of creativity that are obscured by reference-based evaluation, providing a more nuanced approach to assessing creative text generation.
Abstract: Evaluating creative text generation remains a challenge because existing reference-based metrics fail to capture the subjective nature of creativity. We propose a structured evaluation framework for AI story generation comprising four components (Novelty, Value, Adherence, and Resonance) and eleven sub-components. Using controlled story generation via "Spike Prompting" and a crowdsourced study of 115 readers, we examine how different creative components shape both immediate and reflective human creativity judgments. Our findings show that creativity is evaluated hierarchically rather than cumulatively, with different dimensions becoming salient at different stages of judgment, and that reflective evaluation substantially alters both ratings and inter-rater agreement. Together, these results support the effectiveness of our framework in revealing dimensions of creativity that are obscured by reference-based evaluation.
[63] ADEPT: Adaptive Dynamic Early-Exit Process for Transformers
Sangmin Yoo, Srikanth Malla, Chiho Choi, Wei D. Lu, Joon Hee Choi
Main category: cs.CL
TL;DR: ADEPT enables dynamic token-level early exit in both prefill and generation phases of LLM inference, overcoming KV cache bottlenecks to achieve significant efficiency gains.
Details
Motivation: Current early-exit strategies are limited - they only apply to the first token in generation or at prompt level in prefill, leaving KV cache for skipped layers as a bottleneck for subsequent tokens, which limits computational savings.
Method: ADEPT introduces an adaptive token-level early-exit mechanism that dynamically adjusts computation based on token complexity. It decouples sequential dependencies in skipped layers to enhance KV generation, making token-level early exit practical.
Result: ADEPT improves efficiency by up to 25% in language generation tasks and achieves 4x speed-up in downstream classification tasks, with up to 45% performance improvement.
Conclusion: ADEPT successfully overcomes KV cache bottlenecks in early-exit strategies, enabling practical dynamic token-level early exit that significantly improves LLM inference efficiency without compromising performance.
Abstract: The inference of large language models imposes significant computational workloads, often requiring the processing of billions of parameters. Although early-exit strategies have proven effective in reducing computational demands by halting inference earlier, they apply either to only the first token in the generation phase or at the prompt level in the prefill phase. Thus, the Key-Value (KV) cache for skipped layers remains a bottleneck for subsequent token generation, limiting the benefits of early exit. We introduce ADEPT (Adaptive Dynamic Early-exit Process for Transformers), a novel approach designed to overcome this issue and enable dynamic early exit in both the prefill and generation phases. The proposed adaptive token-level early-exit mechanism adjusts computation dynamically based on token complexity, optimizing efficiency without compromising performance. ADEPT further enhances the KV generation procedure by decoupling sequential dependencies in skipped layers, making token-level early exit more practical. Experimental results demonstrate that ADEPT improves efficiency by up to 25% in language generation tasks and achieves a 4x speed-up in downstream classification tasks, with up to a 45% improvement in performance.
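To make the KV-cache issue concrete, here is a toy sketch of token-level early exit in which skipped layers' KV slots are filled from the exit-layer state so that later tokens can still attend at every depth. The confidence signal, threshold, and fill rule are illustrative assumptions, not ADEPT's actual design:

```python
import numpy as np

rng = np.random.default_rng(0)
N_LAYERS, D = 8, 16
weights = [rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(N_LAYERS)]

def layer(h, l):
    return np.tanh(h @ weights[l])

def confidence(h):
    # Hypothetical exit signal: how saturated the hidden state is.
    return float(np.mean(np.abs(h)))

def forward_token(h, kv_cache, threshold=0.6):
    exit_layer = N_LAYERS
    for l in range(N_LAYERS):
        h = layer(h, l)
        kv_cache[l].append(h)          # this token's KV entry at layer l
        if confidence(h) > threshold:  # token deemed easy: exit early
            exit_layer = l + 1
            break
    # Decouple the sequential dependency: fill skipped layers' KV slots with
    # the exit-layer state so subsequent tokens can attend at every depth.
    for l in range(exit_layer, N_LAYERS):
        kv_cache[l].append(h)
    return h, exit_layer

kv_cache = [[] for _ in range(N_LAYERS)]
for t in range(5):
    _, used = forward_token(rng.standard_normal(D), kv_cache)
    print(f"token {t}: exited after {used}/{N_LAYERS} layers")
```

Without the fill step, any token exiting at layer l would leave layers l+1 onward without KV entries, which is precisely the bottleneck the abstract describes.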
[64] RedBench: A Universal Dataset for Comprehensive Red Teaming of Large Language Models
Quy-Anh Dang, Chris Ngo, Truong-Son Hy
Main category: cs.CL
TL;DR: RedBench is a universal dataset for evaluating LLM vulnerabilities, aggregating 37 benchmark datasets with 29,362 samples across attack/refusal prompts, featuring standardized risk categories and domains for consistent evaluation.
Details
Motivation: Existing red teaming datasets have inconsistent risk categorizations, limited domain coverage, and outdated evaluations, which hinder systematic vulnerability assessments of LLMs in safety-critical applications.
Method: Aggregated 37 benchmark datasets from leading conferences and repositories, comprising 29,362 samples across attack and refusal prompts, with a standardized taxonomy of 22 risk categories and 19 domains.
Result: Created RedBench dataset with comprehensive coverage, established baselines for modern LLMs, and open-sourced the dataset and evaluation code to enable robust comparisons.
Conclusion: RedBench facilitates systematic vulnerability assessments, fosters future research, and promotes development of secure and reliable LLMs for real-world deployment through standardized evaluation.
Abstract: As large language models (LLMs) become integral to safety-critical applications, ensuring their robustness against adversarial prompts is paramount. However, existing red teaming datasets suffer from inconsistent risk categorizations, limited domain coverage, and outdated evaluations, hindering systematic vulnerability assessments. To address these challenges, we introduce RedBench, a universal dataset aggregating 37 benchmark datasets from leading conferences and repositories, comprising 29,362 samples across attack and refusal prompts. RedBench employs a standardized taxonomy with 22 risk categories and 19 domains, enabling consistent and comprehensive evaluations of LLM vulnerabilities. We provide a detailed analysis of existing datasets, establish baselines for modern LLMs, and open-source the dataset and evaluation code. Our contributions facilitate robust comparisons, foster future research, and promote the development of secure and reliable LLMs for real-world deployment. Code: https://github.com/knoveleng/redeval
[65] AirNav: A Large-Scale Real-World UAV Vision-and-Language Navigation Dataset with Natural and Diverse Instructions
Hengxing Cai, Yijie Rao, Ligang Huang, Zanyang Zhong, Jinhan Dong, Jingjun Tan, Wenhao Lu, Renxin Zhong
Main category: cs.CL
TL;DR: AirNav is a new large-scale UAV Vision-Language Navigation benchmark using real urban aerial data with natural instructions, plus AirVLN-R1 model combining SFT and RFT for better performance.
Details
Motivation: Existing UAV VLN datasets have limitations: they rely on virtual environments, have unnatural instructions, and are small in scale. There's a need for real-world aerial data with natural language instructions.
Method: Created AirNav benchmark using real urban aerial data instead of synthetic environments. Developed AirVLN-R1 model that combines Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT) to improve navigation performance and generalization.
Result: Built a large-scale UAV VLN dataset with natural and diverse instructions. The AirVLN-R1 model shows enhanced performance and generalization, with preliminary real-world testing confirming feasibility.
Conclusion: AirNav addresses key limitations of existing UAV VLN datasets by providing real-world aerial data with natural instructions. The proposed AirVLN-R1 model demonstrates promising performance, and both dataset and code are publicly available to advance UAV navigation research.
Abstract: Existing Unmanned Aerial Vehicle (UAV) Vision-Language Navigation (VLN) datasets face issues such as dependence on virtual environments, lack of naturalness in instructions, and limited scale. To address these challenges, we propose AirNav, a large-scale UAV VLN benchmark constructed from real urban aerial data, rather than synthetic environments, with natural and diverse instructions. Additionally, we introduce AirVLN-R1, which combines Supervised Fine-Tuning and Reinforcement Fine-Tuning to enhance performance and generalization. The feasibility of the model is preliminarily evaluated through real-world tests. Our dataset and code are publicly available.
[66] Visual Merit or Linguistic Crutch? A Close Look at DeepSeek-OCR
Yunhao Liang, Ruixuan Ying, Bo Li, Hong Li, Kai Yan, Qingwen Li, Min Yang, Okamoto Satoshi, Zhe Cui, Shiwen Ni
Main category: cs.CL
TL;DR: DeepSeek-OCR’s high performance relies heavily on linguistic priors rather than true optical recognition capabilities; without language support, accuracy drops from ~90% to 20%. Traditional OCR methods show greater robustness than end-to-end approaches.
Details
Motivation: To investigate whether DeepSeek-OCR's claimed high-ratio vision-text compression performance stems from genuine optical recognition capabilities or reliance on linguistic priors, addressing concerns about its effectiveness for solving LLM long-context bottlenecks.
Method: Used sentence-level and word-level semantic corruption to isolate intrinsic OCR capabilities from language priors. Conducted comparative benchmarking against 13 baseline models and performed context stress testing to evaluate performance boundaries.
Result: Performance plummeted from ~90% to 20% without linguistic support. Traditional pipeline OCR methods showed significantly higher robustness than end-to-end methods. Lower visual token counts increased reliance on priors and hallucination risks. Model collapsed completely around 10,000 text tokens.
Conclusion: DeepSeek-OCR’s performance is heavily dependent on linguistic crutches rather than visual merit. Current optical compression techniques may paradoxically worsen long-context bottlenecks. The study defines capability boundaries and provides insights for optimizing vision-text compression paradigms.
Abstract: DeepSeek-OCR utilizes an optical 2D mapping approach to achieve high-ratio vision-text compression, claiming to decode text tokens exceeding ten times the input visual tokens. While this suggests a promising solution for the LLM long-context bottleneck, we investigate a critical question: “Visual merit or linguistic crutch - which drives DeepSeek-OCR’s performance?” By employing sentence-level and word-level semantic corruption, we isolate the model’s intrinsic OCR capabilities from its language priors. Results demonstrate that without linguistic support, DeepSeek-OCR’s performance plummets from approximately 90% to 20%. Comparative benchmarking against 13 baseline models reveals that traditional pipeline OCR methods exhibit significantly higher robustness to such semantic perturbations than end-to-end methods. Furthermore, we find that lower visual token counts correlate with increased reliance on priors, exacerbating hallucination risks. Context stress testing also reveals a total model collapse around 10,000 text tokens, suggesting that current optical compression techniques may paradoxically aggravate the long-context bottleneck. This study empirically defines DeepSeek-OCR’s capability boundaries and offers essential insights for future optimizations of the vision-text compression paradigm. We release all data, results and scripts used in this study at https://github.com/dududuck00/DeepSeekOCR.
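The corruption probes are straightforward to picture. Below is a sketch of one plausible instantiation (the paper's exact corruption procedure is not given in this summary): shuffling word order breaks sentence-level semantics, and scrambling characters within words breaks word-level semantics, while the transcription target, i.e. the exact characters shown in the rendered image, remains fully defined:

```python
import random

random.seed(0)

def sentence_level_corrupt(text: str) -> str:
    # Destroy sentence semantics but keep every word intact.
    words = text.split()
    random.shuffle(words)
    return " ".join(words)

def word_level_corrupt(text: str) -> str:
    # Destroy word identities but keep the character inventory intact.
    def scramble(w):
        chars = list(w)
        random.shuffle(chars)
        return "".join(chars)
    return " ".join(scramble(w) for w in text.split())

doc = "Optical compression maps long documents to a few visual tokens"
print(sentence_level_corrupt(doc))
print(word_level_corrupt(doc))
```

A model doing genuine optical recognition should transcribe such text nearly as well as normal prose; the reported drop from roughly 90% to 20% indicates the decoder is instead predicting plausible language.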
[67] MIND: From Passive Mimicry to Active Reasoning through Capability-Aware Multi-Perspective CoT Distillation
Jin Cui, Jiaqi Guo, Jiepeng Zhou, Ruixuan Yang, Jiayi Lu, Jiajun Xu, Jiangcheng Song, Boran Zhao, Pengju Ren
Main category: cs.CL
TL;DR: MIND is a capability-adaptive distillation framework that synthesizes diverse teacher perspectives through a Teaching Assistant network with feedback-driven inertia calibration, enabling better reasoning transfer to smaller models while maintaining cross-domain generalization.
Details
Motivation: Current knowledge distillation approaches for transferring LLM reasoning abilities to smaller models have limitations: they force students to follow single "golden" rationales, which can be out-of-distribution for students with different inductive biases and evolving capacities. This misalignment degrades the student's latent reasoning distribution, leading to suboptimal performance and poor cross-domain generalization.
Method: MIND introduces a capability-adaptive framework with two key components: 1) A Teaching Assistant network that synthesizes diverse teacher perspectives rather than relying on single rationales, and 2) A Feedback-Driven Inertia Calibration mechanism that uses inertia-filtered training loss to align supervision with the student’s current adaptability, preventing catastrophic forgetting.
Result: Extensive experiments show MIND achieves state-of-the-art performance on both in-distribution and out-of-distribution benchmarks. Latent space analysis confirms effective reasoning ability internalization, demonstrating superior cross-domain generalization compared to existing approaches.
Conclusion: MIND successfully transitions distillation from passive mimicry to active cognitive construction, enabling smaller models to better internalize complex reasoning abilities while maintaining generalization across domains. The framework addresses the misalignment problem between teacher rationales and student capabilities through adaptive supervision.
Abstract: While Large Language Models (LLMs) have emerged with remarkable capabilities in complex tasks through Chain-of-Thought reasoning, practical resource constraints have sparked interest in transferring these abilities to smaller models. However, achieving both domain performance and cross-domain generalization remains challenging. Existing approaches typically restrict students to following a single golden rationale and treat different reasoning paths independently. Due to distinct inductive biases and intrinsic preferences, alongside the student’s evolving capacity and reasoning preferences during training, a teacher’s “optimal” rationale could act as out-of-distribution noise. This misalignment leads to a degeneration of the student’s latent reasoning distribution, causing suboptimal performance. To bridge this gap, we propose MIND, a capability-adaptive framework that transitions distillation from passive mimicry to active cognitive construction. We synthesize diverse teacher perspectives through a novel “Teaching Assistant” network. By employing a Feedback-Driven Inertia Calibration mechanism, this network utilizes inertia-filtered training loss to align supervision with the student’s current adaptability, effectively enhancing performance while mitigating catastrophic forgetting. Extensive experiments demonstrate that MIND achieves state-of-the-art performance on both in-distribution and out-of-distribution benchmarks, and our sophisticated latent space analysis further confirms the mechanism of reasoning ability internalization.
[68] O-Researcher: An Open Ended Deep Research Model via Multi-Agent Distillation and Agentic RL
Yi Yao, He Zhu, Piaohong Wang, Jincheng Ren, Xinlong Yang, Qianben Chen, Xiaowan Li, Dingfeng Shi, Jiaxian Li, Qiexiang Wang, Sinuo Wang, Xinpeng Liu, Jiaqi Wu, Minghao Liu, Wangchunshu Zhou
Main category: cs.CL
TL;DR: A framework for automated synthesis of high-quality instructional data using multi-agent AI collaboration, enabling open-source LLMs to achieve state-of-the-art performance without proprietary data.
Details
Motivation: To bridge the performance gap between closed-source and open-source LLMs by addressing disparities in access to high-quality training data, providing an alternative to proprietary data sources.
Method: Multi-agent workflow where AI agents simulate complex tool-integrated reasoning to generate diverse, high-fidelity instructional data end-to-end, followed by a two-stage training strategy combining supervised fine-tuning with novel reinforcement learning for model alignment.
Result: The framework enables open-source models across multiple scales to achieve new state-of-the-art performance on major deep research benchmarks, demonstrating scalable advancement without proprietary dependencies.
Conclusion: This work provides a scalable and effective pathway for advancing open-source LLMs by automating high-quality data synthesis and training, eliminating reliance on proprietary data or models.
Abstract: The performance gap between closed-source and open-source large language models (LLMs) is largely attributed to disparities in access to high-quality training data. To bridge this gap, we introduce a novel framework for the automated synthesis of sophisticated, research-grade instructional data. Our approach centers on a multi-agent workflow where collaborative AI agents simulate complex tool-integrated reasoning to generate diverse and high-fidelity data end-to-end. Leveraging this synthesized data, we develop a two-stage training strategy that integrates supervised fine-tuning with a novel reinforcement learning method, designed to maximize model alignment and capability. Extensive experiments demonstrate that our framework empowers open-source models across multiple scales, enabling them to achieve new state-of-the-art performance on the major deep research benchmark. This work provides a scalable and effective pathway for advancing open-source LLMs without relying on proprietary data or models.
[69] Stuttering-Aware Automatic Speech Recognition for Indonesian Language
Fadhil Muhammad, Alwin Djuliansah, Adrian Aryaputra Hamzah, Kurniawati Azizah
Main category: cs.CL
TL;DR: Synthetic stuttered audio generation framework improves Indonesian ASR performance on dysfluent speech without degrading fluent speech recognition.
Details
Motivation: Current ASR systems perform poorly on stuttered speech, especially for low-resource languages like Indonesian where specialized datasets are scarce, limiting accessibility for people with speech disorders.
Method: Data augmentation framework that generates synthetic stuttered audio by injecting repetitions and prolongations into fluent text using rule-based transformations and LLMs, followed by TTS synthesis, then fine-tuning pre-trained Indonesian Whisper model with transfer learning.
Result: Experiments show synthetic data exposure consistently reduces recognition errors on stuttered speech while maintaining performance on fluent segments.
Conclusion: Synthetic data pipelines are effective for developing more inclusive speech technologies in under-represented languages, enabling adaptation to dysfluent acoustic patterns without requiring large-scale real-world recordings.
Abstract: Automatic speech recognition systems have achieved remarkable performance on fluent speech but continue to degrade significantly when processing stuttered speech, a limitation that is particularly acute for low-resource languages like Indonesian where specialized datasets are virtually non-existent. To overcome this scarcity, we propose a data augmentation framework that generates synthetic stuttered audio by injecting repetitions and prolongations into fluent text through a combination of rule-based transformations and large language models followed by text-to-speech synthesis. We apply this synthetic data to fine-tune a pre-trained Indonesian Whisper model using transfer learning, enabling the architecture to adapt to dysfluent acoustic patterns without requiring large-scale real-world recordings. Our experiments demonstrate that this targeted synthetic exposure consistently reduces recognition errors on stuttered speech while maintaining performance on fluent segments, validating the utility of synthetic data pipelines for developing more inclusive speech technologies in under-represented languages.
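The rule-based half of the augmentation pipeline can be sketched directly. The dysfluency types below (part-word repetitions and vowel prolongations) follow the abstract; the probabilities, onset length, and hyphen markers are illustrative assumptions:

```python
import random

random.seed(7)

def prolong(word: str, times: int = 3) -> str:
    # Stretch the first vowel, e.g. "saya" -> "saaaaya".
    for i, ch in enumerate(word):
        if ch.lower() in "aiueo":
            return word[: i + 1] + ch * times + word[i + 1 :]
    return word

def inject_stutter(sentence: str, p_rep: float = 0.2, p_prolong: float = 0.1) -> str:
    out = []
    for word in sentence.split():
        r = random.random()
        if r < p_rep:                  # part-word repetition: "sa- sa- saya"
            onset = word[: max(1, len(word) // 3)]
            out.append(f"{onset}- {onset}- {word}")
        elif r < p_rep + p_prolong:    # prolongation
            out.append(prolong(word))
        else:
            out.append(word)
    return " ".join(out)

print(inject_stutter("saya ingin pergi ke pasar hari ini"))
```

The stuttered text is then synthesized with TTS, and the resulting audio is paired with a reference transcript for Whisper fine-tuning.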
[70] Whose Facts Win? LLM Source Preferences under Knowledge Conflicts
Jakob Schuster, Vagrant Gautam, Katja Markert
Main category: cs.CL
TL;DR: LLMs show source preferences in knowledge conflicts, favoring institutional sources over social media, but repetition can reverse these preferences. A novel method reduces repetition bias by 99.8% while maintaining 88.8% of original preferences.
Details
Motivation: As LLMs are increasingly used in retrieval-augmented generation pipelines, understanding their behavior under knowledge conflicts is crucial. The role of information source in these conflicts has been overlooked, motivating this study on how source preferences affect LLM resolution of inter-context knowledge conflicts.
Method: Developed a novel framework to investigate source preferences in LLM knowledge conflict resolution. Conducted comprehensive, tightly-controlled evaluation of 13 open-weight LLMs. Proposed a novel method to mitigate repetition effects that can reverse source preferences.
Result: LLMs prefer institutionally-corroborated information (government, newspaper sources) over information from people and social media. However, simply repeating information from less credible sources can reverse these source preferences. The proposed method reduces repetition bias by up to 99.8% while maintaining at least 88.8% of original preferences.
Conclusion: Source preferences significantly influence LLM behavior in knowledge conflicts, but are vulnerable to repetition effects. The proposed mitigation method effectively maintains consistent source preferences while reducing repetition bias. The research highlights the importance of considering source credibility in knowledge-intensive NLP applications.
Abstract: As large language models (LLMs) are more frequently used in retrieval-augmented generation pipelines, it is increasingly relevant to study their behavior under knowledge conflicts. Thus far, the role of the source of the retrieved information has gone unexamined. We address this gap with a novel framework to investigate how source preferences affect LLM resolution of inter-context knowledge conflicts in English, motivated by interdisciplinary research on credibility. With a comprehensive, tightly-controlled evaluation of 13 open-weight LLMs, we find that LLMs prefer institutionally-corroborated information (e.g., government or newspaper sources) over information from people and social media. However, these source preferences can be reversed by simply repeating information from less credible sources. To mitigate repetition effects and maintain consistent preferences, we propose a novel method that reduces repetition bias by up to 99.8%, while also maintaining at least 88.8% of original preferences. We release all data and code to encourage future work on credibility and source preferences in knowledge-intensive NLP.
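A minimal sketch of the kind of probe this framework implies, assuming a hypothetical prompt template and a stand-in `ask_model` callable (the paper's actual templates and scoring are not given in this summary):

```python
TEMPLATE = (
    "Two retrieved passages disagree.\n"
    "[{src_a}] {claim_a}\n"
    "[{src_b}] {claim_b}\n"
    "Question: {question}\nAnswer with the value you believe is correct."
)

def winning_source(ask_model, question, claim_a, claim_b, src_a, src_b, answers):
    """Return which source's claim the model repeats, or None if neither."""
    reply = ask_model(TEMPLATE.format(src_a=src_a, claim_a=claim_a,
                                      src_b=src_b, claim_b=claim_b,
                                      question=question))
    if answers[0] in reply:
        return src_a
    if answers[1] in reply:
        return src_b
    return None

# Toy stand-in model that simply trusts the first (institutional) passage.
print(winning_source(lambda p: "The budget was 4.2 million.",
                     "What was the budget?",
                     "The budget was 4.2 million.", "The budget was 9 million.",
                     "government report", "social media post",
                     answers=("4.2 million", "9 million")))
```

Tallying the winning source over many conflicting pairs, and over repeated low-credibility claims, yields the preference and repetition-bias measurements the paper reports.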
[71] Evaluation of Multilingual LLMs Personalized Text Generation Capabilities Targeting Groups and Social-Media Platforms
Dominik Macko
Main category: cs.CL
TL;DR: This paper examines how personalization affects machine-generated text detectability across 10 languages, finding platform personalization impacts detectability more than demographic targeting, especially in English.
Details
Motivation: As LLMs improve multilingual text generation capabilities, concerns grow about their misuse for personalized disinformation. Previous research showed personalization reduces detectability of machine-generated texts, but this was only studied in English. The authors want to examine this phenomenon across multiple languages and explore both potential misuse and benefits of personalization capabilities.
Method: The study covers 1080 combinations of various personalization aspects in prompts, generating texts using 16 distinct language models (17,280 texts total). They examine personalization quality when targeting demographic groups versus social-media platforms across 10 different languages.
Result: Results show differences in personalization quality across languages when targeting demographic groups versus platforms. Platform personalization affects detectability more significantly, especially in English where personalization quality is highest.
Conclusion: Personalization capabilities of LLMs have varying effects across languages, with platform targeting having greater impact on detectability than demographic targeting. This highlights the need for multilingual detection approaches and understanding of how personalization strategies differ across linguistic contexts.
Abstract: The capabilities of large language models to generate coherent multilingual text have continuously improved in recent years, raising concerns about their potential misuse. Previous research has shown that they can be misused for the generation of personalized disinformation in multiple languages. It has also been observed that personalization negatively affects the detectability of machine-generated texts; however, this has been studied in English only. In this work, we examine this phenomenon across 10 languages, focusing not only on the potential misuse of personalization capabilities but also on the potential benefits they offer. Overall, we cover 1080 combinations of various personalization aspects in the prompts, for which texts are generated by 16 distinct language models (17,280 texts in total). Our results indicate that there are differences in the personalization quality of the generated texts when targeting demographic groups and when targeting social-media platforms across languages. Personalization towards platforms affects the detectability of the generated texts to a greater extent, especially in English, where personalization quality is highest.
[72] Do LLM Self-Explanations Help Users Predict Model Behavior? Evaluating Counterfactual Simulatability with Pragmatic Perturbations
Pingjun Hong, Benjamin Roth
Main category: cs.CL
TL;DR: LLM self-explanations improve human and LLM ability to predict model behavior on counterfactual questions, but effectiveness depends on perturbation strategy and judge capability.
Details
Motivation: To determine whether LLM-generated self-explanations, despite potentially not reflecting true decision processes, can still help users predict model behavior through counterfactual simulatability.
Method: Used StrategyQA to evaluate human and LLM judges’ ability to predict model answers to counterfactual follow-up questions, with/without access to chain-of-thought or post-hoc explanations. Compared LLM-generated counterfactuals with pragmatics-based perturbations as test case construction methods.
Result: Self-explanations consistently improved simulation accuracy for both LLM judges and humans, but gains depended strongly on perturbation strategy and judge strength. Qualitative analysis showed explanations helped humans form more accurate predictions on perturbed questions.
Conclusion: LLM self-explanations can be useful for helping users predict model behavior even if they don’t reflect true decision processes, but their effectiveness is context-dependent on how test cases are constructed and who is using them.
Abstract: Large Language Models (LLMs) can produce verbalized self-explanations, yet prior studies suggest that such rationales may not reliably reflect the model’s true decision process. We ask whether these explanations nevertheless help users predict model behavior, operationalized as counterfactual simulatability. Using StrategyQA, we evaluate how well humans and LLM judges can predict a model’s answers to counterfactual follow-up questions, with and without access to the model’s chain-of-thought or post-hoc explanations. We compare LLM-generated counterfactuals with pragmatics-based perturbations as alternative ways to construct test cases for assessing the potential usefulness of explanations. Our results show that self-explanations consistently improve simulation accuracy for both LLM judges and humans, but the degree and stability of gains depend strongly on the perturbation strategy and judge strength. We also conduct a qualitative analysis of free-text justifications written by human users when predicting the model’s behavior, which provides evidence that access to explanations helps humans form more accurate predictions on the perturbed questions.
[73] Tracing the complexity profiles of different linguistic phenomena through the intrinsic dimension of LLM representations
Marco Baroni, Emily Cheng, Iria deDios-Flores, Francesca Franzon
Main category: cs.CL
TL;DR: Intrinsic dimension of LLM representations serves as a marker for linguistic complexity, revealing different ID profiles across layers that distinguish formal vs. functional complexity types.
Details
Motivation: To investigate whether intrinsic dimension (ID) of LLM representations can serve as a marker for linguistic complexity, and whether different ID profiles across layers can differentiate between formal and functional complexity types in language processing.
Method: Analyzed ID of LLM representations across layers for different linguistic complexity types (coordinated/subordinated clauses, right branching vs. center embedding, unambiguous vs. ambiguous relative clauses). Used representational similarity analysis and layer ablation experiments to confirm findings.
Result: Formal complexity (multiple coordinated/subordinated clauses) shows clear ID differences that align with abstract linguistic processing phase. Functional complexity contrasts are detected by ID but less markedly and don’t correlate with same processing phase. Representational similarity and ablation experiments confirm these trends across different LLMs.
Conclusion: ID is a useful marker of linguistic complexity in LLMs that can differentiate between complexity types and points to similar linguistic processing stages across different LLM architectures.
Abstract: We explore the intrinsic dimension (ID) of LLM representations as a marker of linguistic complexity, asking if different ID profiles across LLM layers differentially characterize formal and functional complexity. We find the formal contrast between sentences with multiple coordinated or subordinated clauses to be reflected in ID differences whose onset aligns with a phase of more abstract linguistic processing independently identified in earlier work. The functional contrasts between sentences characterized by right branching vs. center embedding or unambiguous vs. ambiguous relative clause attachment are also picked up by ID, but in a less marked way, and they do not correlate with the same processing phase. Further experiments using representational similarity and layer ablation confirm the same trends. We conclude that ID is a useful marker of linguistic complexity in LLMs, that it allows us to differentiate between different types of complexity, and that it points to similar stages of linguistic processing across disparate LLMs.
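The summary does not name the ID estimator, but TwoNN (Facco et al., 2017) is the standard choice for measuring the intrinsic dimension of LLM representations, so the sketch below assumes it; input is one row per sentence representation at a given layer:

```python
import numpy as np
from scipy.spatial.distance import cdist

def two_nn_id(X: np.ndarray) -> float:
    """TwoNN intrinsic-dimension estimate from nearest-neighbor distance ratios."""
    d = cdist(X, X)
    np.fill_diagonal(d, np.inf)   # ignore self-distances
    d.sort(axis=1)
    r1, r2 = d[:, 0], d[:, 1]     # first and second nearest-neighbor distances
    mu = r2 / r1                  # follows a Pareto(d) law under TwoNN
    return len(X) / np.sum(np.log(mu))  # maximum-likelihood estimate of d

rng = np.random.default_rng(0)
# Points on a 3-dimensional subspace embedded in 64 dimensions: ID should be near 3.
X = rng.standard_normal((500, 3)) @ rng.standard_normal((3, 64))
print(round(two_nn_id(X), 2))
```

Computing this per layer, over sentence sets that differ only in the complexity contrast of interest, yields the layer-wise ID profiles the paper compares.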
[74] HearSay Benchmark: Do Audio LLMs Leak What They Hear?
Jin Wang, Liang Lin, Kaiwen Luo, Weiliu Wang, Yitian Chen, Moayad Aloqaily, Xuehai Tang, Zhenhong Zhou, Kun Wang, Li Sun, Qingsong Wen
Main category: cs.CL
TL;DR: HearSay benchmark reveals that Audio Large Language Models (ALLMs) significantly leak private information from voiceprints, with existing safety mechanisms being inadequate and reasoning capabilities amplifying privacy risks.
Details
Motivation: While Audio Large Language Models have advanced in understanding and generation, their potential privacy implications remain largely unexplored. The paper aims to investigate whether ALLMs inadvertently leak user privacy through acoustic voiceprints.
Method: The authors introduce HearSay, a comprehensive benchmark constructed from over 22,000 real-world audio clips. The benchmark is meticulously curated through automated profiling and human verification to ensure data quality and factual privacy labels. Extensive experiments are conducted on this benchmark to evaluate privacy leakage.
Result: Three critical findings: 1) Significant privacy leakage - ALLMs inherently extract private attributes from voiceprints with 92.89% accuracy on gender and effectively profile social attributes; 2) Insufficient safety mechanisms - existing safeguards are severely inadequate with near-zero refusal rates for privacy-intruding requests; 3) Reasoning amplifies risk - Chain-of-Thought reasoning exacerbates privacy risks by uncovering deeper acoustic correlations.
Conclusion: The findings expose critical vulnerabilities in ALLMs, underscoring the urgent need for targeted privacy alignment. The work provides the first comprehensive investigation into privacy leakage through acoustic voiceprints in ALLMs.
Abstract: While Audio Large Language Models (ALLMs) have achieved remarkable progress in understanding and generation, their potential privacy implications remain largely unexplored. This paper takes the first step to investigate whether ALLMs inadvertently leak user privacy solely through acoustic voiceprints and introduces HearSay, a comprehensive benchmark constructed from over 22,000 real-world audio clips. To ensure data quality, the benchmark is meticulously curated through a rigorous pipeline involving automated profiling and human verification, guaranteeing that all privacy labels are grounded in factual records. Extensive experiments on HearSay yield three critical findings. Significant Privacy Leakage: ALLMs inherently extract private attributes from voiceprints, reaching 92.89% accuracy on gender and effectively profiling social attributes. Insufficient Safety Mechanisms: Alarmingly, existing safeguards are severely inadequate; most models fail to refuse privacy-intruding requests, exhibiting near-zero refusal rates for physiological traits. Reasoning Amplifies Risk: Chain-of-Thought (CoT) reasoning exacerbates privacy risks in capable models by uncovering deeper acoustic correlations. These findings expose critical vulnerabilities in ALLMs, underscoring the urgent need for targeted privacy alignment. The codes and dataset are available at https://github.com/JinWang79/HearSay_Benchmark
[75] Membox: Weaving Topic Continuity into Long-Range Memory for LLM Agents
Dehao Tao, Guoliang Ma, Yongfeng Huang, Minghu Jiang
Main category: cs.CL
TL;DR: Membox is a hierarchical memory architecture for LLM agents that preserves topic continuity by grouping consecutive same-topic dialogue turns into coherent “memory boxes” and linking them into long-range event-timeline traces, achieving superior temporal reasoning with fewer context tokens.
Details
Motivation: Current LLM agent memory systems fail to preserve topic continuity in human-agent dialogues. They follow a fragmentation-compensation paradigm that breaks dialogue streams into isolated utterances, damaging narrative and causal flow while biasing retrieval toward lexical similarity rather than thematic coherence.
Method: Membox introduces a hierarchical memory architecture with two key components: 1) Topic Loom - continuously monitors dialogue in sliding windows, grouping consecutive same-topic turns into coherent "memory boxes" at storage time; 2) Trace Weaver - links sealed boxes into long-range event-timeline traces to recover macro-topic recurrences across discontinuities.
Result: Experiments on LoCoMo show Membox achieves up to 68% F1 improvement on temporal reasoning tasks, outperforming competitive baselines like Mem0 and A-MEM. It attains these gains while using only a fraction of the context tokens required by existing methods, demonstrating superior efficiency-effectiveness balance.
Conclusion: By explicitly modeling topic continuity, Membox offers a cognitively motivated mechanism that enhances both coherence and efficiency in LLM agents, addressing fundamental limitations of current fragmentation-compensation memory systems.
Abstract: Human-agent dialogues often exhibit topic continuity, a stable thematic frame that evolves through temporally adjacent exchanges, yet most large language model (LLM) agent memory systems fail to preserve it. Existing designs follow a fragmentation-compensation paradigm: they first break dialogue streams into isolated utterances for storage, then attempt to restore coherence via embedding-based retrieval. This process irreversibly damages narrative and causal flow, while biasing retrieval towards lexical similarity. We introduce Membox, a hierarchical memory architecture centered on a Topic Loom that continuously monitors dialogue in a sliding-window fashion, grouping consecutive same-topic turns into coherent “memory boxes” at storage time. Sealed boxes are then linked by a Trace Weaver into long-range event-timeline traces, recovering macro-topic recurrences across discontinuities. Experiments on LoCoMo demonstrate that Membox achieves up to 68% F1 improvement on temporal reasoning tasks, outperforming competitive baselines (e.g., Mem0, A-MEM). Notably, Membox attains these gains while using only a fraction of the context tokens required by existing methods, highlighting a superior balance between efficiency and effectiveness. By explicitly modeling topic continuity, Membox offers a cognitively motivated mechanism for enhancing both coherence and efficiency in LLM agents.
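The storage-time grouping is essentially a small data structure, sketched below with a stubbed topic matcher standing in for the Topic Loom; class and field names are illustrative, not the paper's API:

```python
from dataclasses import dataclass, field

@dataclass
class MemoryBox:
    topic: str
    turns: list = field(default_factory=list)

class MemboxStore:
    def __init__(self, same_topic):
        self.same_topic = same_topic  # stub for the Topic Loom's topic monitor
        self.open_box = None
        self.sealed = []
        self.traces = {}              # topic -> indices of sealed boxes (timeline)

    def add_turn(self, turn: str, topic: str):
        if self.open_box and self.same_topic(self.open_box.topic, topic):
            self.open_box.turns.append(turn)   # topic continues: extend the box
        else:
            self.seal()                        # topic shift: seal and open anew
            self.open_box = MemoryBox(topic, [turn])

    def seal(self):
        if self.open_box:                      # Trace Weaver: link same-topic boxes
            self.sealed.append(self.open_box)
            self.traces.setdefault(self.open_box.topic, []).append(len(self.sealed) - 1)
            self.open_box = None

mem = MemboxStore(same_topic=lambda a, b: a == b)
for turn, topic in [("Booked flights to Rome", "trip"),
                    ("The hotel is near the station", "trip"),
                    ("My cat is sick", "pets"),
                    ("Changed the Rome hotel", "trip")]:
    mem.add_turn(turn, topic)
mem.seal()
print(mem.traces)  # {'trip': [0, 2], 'pets': [1]}
```

Retrieval can then return whole boxes or whole traces, preserving the narrative flow that per-utterance storage destroys.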
[76] Compact Example-Based Explanations for Language Models
Loris Schoenegger, Benjamin Roth
Main category: cs.CL
TL;DR: Proposes a selection relevance score to evaluate training data selection strategies for influence-based explanations, showing common strategies often underperform random selection and proposing a better balanced approach.
Details
Motivation: Current influence estimation methods for example-based explanations lack proper evaluation of selection strategies - humans can only view a small subset of training data, but the choice of which documents to include significantly affects explanation quality.
Method: Introduces a novel selection relevance score (retraining-free metric) to quantify how useful a set of examples is for explaining model outputs, validated through fine-tuning experiments. Proposes a strategy balancing influence and representativeness.
Result: The selection relevance score effectively predicts whether example sets support or undermine model predictions. Common selection strategies often underperform random selection. The proposed balanced strategy outperforms naive selection of highest-ranking examples.
Conclusion: Proper selection strategies are crucial for effective influence-based explanations. The proposed selection relevance score provides a valuable evaluation metric, and balancing influence with representativeness enables better use of limited explanation budgets.
Abstract: Training data influence estimation methods quantify the contribution of training documents to a model’s output, making them a promising source of information for example-based explanations. As humans cannot interpret thousands of documents, only a small subset of the training data can be presented as an explanation. Although the choice of which documents to include directly affects explanation quality, previous evaluations of such systems have largely ignored any selection strategies. To address this, we propose a novel selection relevance score, a retraining-free metric that quantifies how useful a set of examples is for explaining a model’s output. We validate this score through fine-tuning experiments, confirming that it can predict whether a set of examples supports or undermines the model’s predictions. Using this metric, we further show that common selection strategies often underperform random selection. Motivated by this finding, we propose a strategy that balances influence and representativeness, enabling better use of selection budgets than naively selecting the highest-ranking examples.
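One natural reading of "balances influence and representativeness" is a greedy, MMR-style selection; the exact objective is not given in this summary, so the trade-off below is an assumption:

```python
import numpy as np

def select_examples(influence, embeddings, k=5, lam=0.7):
    """Greedily pick k examples, trading influence against redundancy."""
    chosen, candidates = [], set(range(len(influence)))
    sims = embeddings @ embeddings.T  # rows assumed L2-normalized

    def score(i):
        redundancy = max((sims[i, j] for j in chosen), default=0.0)
        return lam * influence[i] - (1 - lam) * redundancy

    while candidates and len(chosen) < k:
        best = max(candidates, key=score)
        chosen.append(best)
        candidates.remove(best)
    return chosen

rng = np.random.default_rng(1)
emb = rng.standard_normal((100, 32))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
print(select_examples(rng.random(100), emb, k=5))
```

Naively taking the k highest-influence examples often returns near-duplicates; the redundancy term spends the same budget on a more representative set.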
[77] NeoAMT: Neologism-Aware Agentic Machine Translation with Reinforcement Learning
Zhongtao Miao, Kaiyan Zhao, Masaaki Nagata, Yoshimasa Tsuruoka
Main category: cs.CL
TL;DR: NeoAMT: An agentic framework for neologism-aware machine translation using Wiktionary search tool, RL training with novel reward design, and adaptive rollout generation based on translation difficulty.
Details
Motivation: Neologism-aware machine translation is underexplored compared to general MT, creating a need for specialized approaches to handle source sentences containing newly coined words or expressions.
Method: 1) Created new multilingual dataset from English Wiktionary dump (16 languages, 75 directions); 2) Developed Wiktionary search tool; 3) Used RL training with novel reward design; 4) Implemented adaptive rollout generation based on "translation difficulty" to improve agent performance.
Result: Built comprehensive dataset covering 16 languages and 75 translation directions from ~10M Wiktionary records, plus retrieval corpus from ~3M cleaned records. Framework enables improved neologism translation accuracy.
Conclusion: NeoAMT provides an effective agentic framework for neologism-aware MT, addressing a gap in MT research through Wiktionary integration, specialized dataset creation, and innovative RL training approaches.
Abstract: Neologism-aware machine translation aims to translate source sentences containing neologisms into target languages. This field remains underexplored compared with general machine translation (MT). In this paper, we propose an agentic framework, NeoAMT, for neologism-aware machine translation using a Wiktionary search tool. Specifically, we first create a new dataset for neologism-aware machine translation and develop a search tool based on Wiktionary. The new dataset covers 16 languages and 75 translation directions and is derived from approximately 10 million records of an English Wiktionary dump. The retrieval corpus of the search tool is also constructed from around 3 million cleaned records of the Wiktionary dump. We then use this dataset and tool to train the translation agent with reinforcement learning (RL) and to evaluate the accuracy of neologism-aware machine translation. Building on this, we also propose an RL training framework that contains a novel reward design and an adaptive rollout generation approach that leverages “translation difficulty” to further improve the translation quality of translation agents using our search tool.
[78] Do LLMs Really Memorize Personally Identifiable Information? Revisiting PII Leakage with a Cue-Controlled Memorization Framework
Xiaoyu Luo, Yiyi Chen, Qiongxiu Li, Johannes Bjerva
Main category: cs.CL
TL;DR: The paper proposes a new framework (Cue-Resistant Memorization) to properly evaluate PII leakage in LLMs, showing that previously reported memorization is actually cue-driven behavior rather than true memorization.
Details
Motivation: Current evaluations of PII leakage in LLMs often misinterpret successful PII reconstruction as evidence of memorization, but these results may actually be driven by prompt-induced generalization or pattern completion rather than genuine memorization.
Method: The authors formalize Cue-Resistant Memorization (CRM) as a cue-controlled evaluation framework that explicitly conditions on prompt-target overlap cues. They conduct large-scale multilingual re-evaluation across 32 languages using CRM to assess PII leakage under low lexical cue conditions.
Result: When surface-form cues are controlled for, reconstruction success diminishes substantially. Cue-free generation and membership inference show extremely low true positive rates. Previously reported PII leakage is better explained by cue-driven behavior than genuine memorization.
Conclusion: Cue-controlled evaluation is essential for reliably quantifying privacy-relevant memorization in LLMs, as previous methods overestimated memorization by not accounting for cue-driven behavior.
Abstract: Large Language Models (LLMs) have been reported to “leak” Personally Identifiable Information (PII), with successful PII reconstruction often interpreted as evidence of memorization. We propose a principled revision of memorization evaluation for LLMs, arguing that PII leakage should be evaluated under low lexical cue conditions, where target PII cannot be reconstructed through prompt-induced generalization or pattern completion. We formalize Cue-Resistant Memorization (CRM) as a cue-controlled evaluation framework and a necessary condition for valid memorization evaluation, explicitly conditioning on prompt-target overlap cues. Using CRM, we conduct a large-scale multilingual re-evaluation of PII leakage across 32 languages and multiple memorization paradigms. Revisiting reconstruction-based settings, including verbatim prefix-suffix completion and associative reconstruction, we find that their apparent effectiveness is driven primarily by direct surface-form cues rather than by true memorization. When such cues are controlled for, reconstruction success diminishes substantially. We further examine cue-free generation and membership inference, both of which exhibit extremely low true positive rates. Overall, our results suggest that previously reported PII leakage is better explained by cue-driven behavior than by genuine memorization, highlighting the importance of cue-controlled evaluation for reliably quantifying privacy-relevant memorization in LLMs.
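The overlap-cue condition can be operationalized with a simple n-gram filter; the n-gram order and threshold below are illustrative assumptions, not the paper's definition of CRM:

```python
def ngrams(text: str, n: int = 3) -> set:
    # Token n-grams plus unigrams, as a crude surface-form fingerprint.
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)} | set(toks)

def low_cue(prompt: str, target: str, max_overlap: float = 0.2) -> bool:
    """Keep a (prompt, target) pair only if the target shares few cues with the prompt."""
    p, t = ngrams(prompt), ngrams(target)
    return len(p & t) / max(len(t), 1) <= max_overlap

print(low_cue("Contact John Smith at", "John Smith 42 Elm Street"))     # False: target leaks into the prompt
print(low_cue("The defendant's address is on file.", "42 Elm Street"))  # True: low lexical cue
```

Only reconstructions that succeed on low-cue pairs count as evidence of memorization rather than pattern completion.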
[79] VietMed-MCQ: A Consistency-Filtered Data Synthesis Framework for Vietnamese Traditional Medicine Evaluation
Huynh Trung Kiet, Dao Sy Duy Minh, Nguyen Dinh Ha Duong, Le Hoang Minh Huy, Long Nguyen, Dien Dinh
Main category: cs.CL
TL;DR: VietMed-MCQ: A Vietnamese Traditional Medicine benchmark dataset created via RAG pipeline with consistency checking, showing LLMs struggle in specialized cultural medical domains despite cross-lingual transfer.
Details
Motivation: LLMs perform poorly in specialized, culturally specific medical domains like Vietnamese Traditional Medicine due to lack of high-quality structured benchmarks.
Method: Created VietMed-MCQ dataset using Retrieval-Augmented Generation pipeline with automated consistency check mechanism and dual-model validation for reasoning consistency. Dataset has 3,190 questions across three difficulty levels, validated by medical experts.
Result: Dataset achieved 94.2% expert approval with high inter-rater agreement (Fleiss’ kappa=0.82). Benchmarking showed general-purpose models with Chinese priors outperform Vietnamese-centric models, indicating cross-lingual transfer, but all models struggle with complex diagnostic reasoning.
Conclusion: The VietMed-MCQ dataset addresses the benchmark scarcity in low-resource medical domains, revealing LLM limitations in specialized cultural medicine and enabling future research.
Abstract: Large Language Models (LLMs) have demonstrated remarkable proficiency in general medical domains. However, their performance significantly degrades in specialized, culturally specific domains such as Vietnamese Traditional Medicine (VTM), primarily due to the scarcity of high-quality, structured benchmarks. In this paper, we introduce VietMed-MCQ, a novel multiple-choice question dataset generated via a Retrieval-Augmented Generation (RAG) pipeline with an automated consistency check mechanism. Unlike previous synthetic datasets, our framework incorporates a dual-model validation approach to ensure reasoning consistency through independent answer verification, though the substring-based evidence checking has known limitations. The complete dataset of 3,190 questions spans three difficulty levels and underwent validation by one medical expert and four students, achieving 94.2 percent approval with substantial inter-rater agreement (Fleiss’ kappa = 0.82). We benchmark seven open-source models on VietMed-MCQ. Results reveal that general-purpose models with strong Chinese priors outperform Vietnamese-centric models, highlighting cross-lingual conceptual transfer, while all models still struggle with complex diagnostic reasoning. Our code and dataset are publicly available to foster research in low-resource medical domains.
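The dual-model validation step reduces to a small agreement check; the voting rule and the toy item below are illustrative, with `answer_fn_a`/`answer_fn_b` standing in for two independent LLM calls:

```python
def consistent(question: str, options: dict, gold: str, answer_fn_a, answer_fn_b) -> bool:
    """Keep a RAG-generated MCQ only if two independent answerers reproduce the key."""
    prompt = question + "\n" + "\n".join(f"{k}. {v}" for k, v in options.items())
    return answer_fn_a(prompt) == gold and answer_fn_b(prompt) == gold

q = "Which of the following is a classic 'release-the-exterior' herb?"
opts = {"A": "tia to (perilla leaf)", "B": "cam thao (licorice)",
        "C": "dang sam (codonopsis)", "D": "hoai son (Chinese yam)"}
keep = consistent(q, opts, gold="A",
                  answer_fn_a=lambda p: "A", answer_fn_b=lambda p: "A")
print(keep)  # True -> the question enters the dataset
```

Disagreement between the two answerers flags ambiguous or unsupported questions before they reach the expert validators.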
[80] Where meaning lives: Layer-wise accessibility of psycholinguistic features in encoder and decoder language models
Taisiia Tikhomirova, Dirk U. Wulff
Main category: cs.CL
TL;DR: Transformer models encode psycholinguistic features differently based on probing methods, with final layers rarely optimal for meaning recovery and shared depth ordering patterns across models.
Details
Motivation: To understand where transformer language models encode psychologically meaningful aspects of meaning, which is important for both theoretical understanding and practical applications in NLP.
Method: Systematic layer-wise probing study of 58 psycholinguistic features across 10 transformer models (encoder-only and decoder-only architectures), comparing three embedding extraction methods (contextualized vs isolated embeddings).
Result: 1) Localization of meaning is strongly method-dependent; 2) Contextualized embeddings yield higher feature-specific selectivity and different layer-wise profiles than isolated embeddings; 3) Final-layer representations are rarely optimal for recovering psycholinguistic information; 4) Models share a depth ordering pattern where lexical properties peak earlier and experiential/affective dimensions peak later.
Conclusion: Where meaning “lives” in transformer models reflects an interaction between methodological choices (probing methods) and architectural constraints, rather than being a fixed property of the models themselves.
Abstract: Understanding where transformer language models encode psychologically meaningful aspects of meaning is essential for both theory and practice. We conduct a systematic layer-wise probing study of 58 psycholinguistic features across 10 transformer models, spanning encoder-only and decoder-only architectures, and compare three embedding extraction methods. We find that apparent localization of meaning is strongly method-dependent: contextualized embeddings yield higher feature-specific selectivity and different layer-wise profiles than isolated embeddings. Across models and methods, final-layer representations are rarely optimal for recovering psycholinguistic information with linear probes. Despite these differences, models exhibit a shared depth ordering of meaning dimensions, with lexical properties peaking earlier and experiential and affective dimensions peaking later. Together, these results show that where meaning “lives” in transformer models reflects an interaction between methodological choices and architectural constraints.
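The probing setup is standard: one linear probe per layer per feature, with the layer-wise score profile as the object of interest. A self-contained sketch on synthetic data, where the target feature is recoverable from only one layer (Ridge regression is an assumed probe; the abstract says only "linear probes"):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_words, n_layers, dim = 400, 6, 64
reps = rng.standard_normal((n_layers, n_words, dim))   # per-layer word representations
# Synthetic psycholinguistic feature that "lives" at layer 3 only.
feature = reps[3, :, :5].sum(axis=1) + 0.1 * rng.standard_normal(n_words)

for layer in range(n_layers):
    r2 = cross_val_score(Ridge(alpha=1.0), reps[layer], feature,
                         cv=5, scoring="r2").mean()
    print(f"layer {layer}: probe R^2 = {r2:.2f}")
```

Running this per feature and per extraction method (contextualized vs. isolated embeddings) produces the method-dependent layer profiles the paper analyzes.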
[81] AI Generated Text Detection
Adilkhan Alikhanov, Aidar Amangeldi, Diar Demeubay, Dilnaz Akhmetzhan, Nurbek Moldakhmetov, Omar Polat, Galymzhan Zharas
Main category: cs.CL
TL;DR: Evaluation of AI text detection methods shows transformer-based models like DistilBERT outperform traditional approaches, achieving 88%+ accuracy with strong ROC-AUC scores, highlighting the superiority of contextual semantic modeling over lexical features.
Details
Motivation: The rise of AI-generated text and students using LLM-generated content as their own work violates academic integrity, creating a need for effective AI text detection methods to maintain academic standards.
Method: Used HC3 and DAIGT v2 datasets to create a unified benchmark with topic-based data splitting to prevent information leakage. Evaluated traditional ML (TF-IDF logistic regression) and deep learning models (BiLSTM, DistilBERT) for AI text detection.
Result: TF-IDF logistic regression achieved 82.87% accuracy baseline. BiLSTM achieved 88.86% accuracy, while DistilBERT achieved 88.11% accuracy with the highest ROC-AUC of 0.96, demonstrating strongest overall performance. Contextual semantic modeling significantly outperformed lexical features.
Conclusion: Transformer-based models like DistilBERT are most effective for AI text detection, with contextual semantic modeling being superior to lexical features. Future work should focus on dataset diversity expansion, parameter-efficient fine-tuning (LoRA), and hardware optimization for better scalability.
Abstract: The rapid development of large language models has led to an increase in AI-generated text, with students increasingly using LLM-generated content as their own work, which violates academic integrity. This paper presents an evaluation of AI text detection methods, including both traditional machine learning models and transformer-based architectures. We utilize two datasets, HC3 and DAIGT v2, to build a unified benchmark and apply a topic-based data split to prevent information leakage. This approach ensures robust generalization across unseen domains. Our experiments show that TF-IDF logistic regression achieves a reasonable baseline accuracy of 82.87%. However, deep learning models outperform it. The BiLSTM classifier achieves an accuracy of 88.86%, while DistilBERT achieves a similar accuracy of 88.11% with the highest ROC-AUC score of 0.96, demonstrating the strongest overall performance. The results indicate that contextual semantic modeling is significantly superior to lexical features and highlight the importance of mitigating topic memorization through appropriate evaluation protocols. The limitations of this work are primarily related to dataset diversity and computational constraints. In future work, we plan to expand dataset diversity and utilize parameter-efficient fine-tuning methods such as LoRA. We also plan to explore smaller or distilled models and employ more efficient batching strategies and hardware-aware optimization.
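Both the baseline and the leakage control are standard scikit-learn components. A toy sketch with made-up documents, using GroupShuffleSplit so that no topic appears in both train and test:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupShuffleSplit
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score

texts = ["the moon orbits the earth",
         "as an ai model i note that the moon orbits the earth",
         "bread needs yeast to rise",
         "as an ai model i explain that bread rises with yeast",
         "rivers erode their banks over time",
         "as an ai model i describe how rivers erode their banks"]
labels = np.array([0, 1, 0, 1, 0, 1])   # 0 = human, 1 = AI-generated
topics = np.array([0, 0, 1, 1, 2, 2])   # group id = topic, to prevent leakage

splitter = GroupShuffleSplit(n_splits=1, test_size=1/3, random_state=0)
train_idx, test_idx = next(splitter.split(texts, labels, groups=topics))

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit([texts[i] for i in train_idx], labels[train_idx])
pred = clf.predict([texts[i] for i in test_idx])
print("held-out-topic accuracy:", accuracy_score(labels[test_idx], pred))
```

Splitting by topic rather than at random is what prevents the classifier from scoring well by memorizing topical vocabulary instead of detecting AI style.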
[82] Step Potential Advantage Estimation: Harnessing Intermediate Confidence and Correctness for Efficient Mathematical Reasoning
Fei Wu, Zhenrong Zhang, Qikai Chang, Jianshu Zhang, Quan Liu, Jun Du
Main category: cs.CL
TL;DR: SPAE introduces Step Potential Advantage Estimation for RLVR, using step-level reasoning progress signals to improve credit assignment and reduce redundant verification in LLM reasoning.
Details
Motivation: Current RLVR approaches lack semantically grounded, step-level measures of reasoning progress, causing LLMs to fail at distinguishing necessary deduction from redundant verification, sometimes even overturning correct trajectories into incorrect answers.
Method: Proposes Step Potential signal combining intermediate confidence and correctness, and Step Potential Advantage Estimation (SPAE) that amplifies potential gains, penalizes potential drops, and applies penalty after potential saturation to encourage timely termination.
Result: SPAE consistently improves accuracy while substantially reducing response length across multiple benchmarks, outperforming strong RL baselines and recent efficient reasoning and token-level advantage estimation methods.
Conclusion: SPAE provides a fine-grained credit assignment method for RLVR that addresses the lack of process supervision, enabling better reasoning progress estimation and more efficient chain-of-thought reasoning in LLMs.
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) elicits long chain-of-thought reasoning in large language models (LLMs), but outcome-based rewards lead to coarse-grained advantage estimation. While existing approaches improve RLVR via token-level entropy or sequence-level length control, they lack a semantically grounded, step-level measure of reasoning progress. As a result, LLMs fail to distinguish necessary deduction from redundant verification: they may continue checking after reaching a correct solution and, in extreme cases, overturn a correct trajectory into an incorrect final answer. To remedy the lack of process supervision, we introduce a training-free probing mechanism that extracts intermediate confidence and correctness and combines them into a Step Potential signal that explicitly estimates the reasoning state at each step. Building on this signal, we propose Step Potential Advantage Estimation (SPAE), a fine-grained credit assignment method that amplifies potential gains, penalizes potential drops, and applies a penalty after the potential saturates, to encourage timely termination. Experiments across multiple benchmarks show SPAE consistently improves accuracy while substantially reducing response length, outperforming strong RL baselines and recent efficient reasoning and token-level advantage estimation methods. The code is available at https://github.com/cii030/SPAE-RL.
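The shaping rule described above fits in a few lines over a sequence of step potentials; the coefficients and the saturation test are illustrative, since the paper's exact formula is not given in this summary:

```python
def step_advantages(potentials, sat=0.95, gain_w=1.0, drop_w=2.0, overrun_pen=0.5):
    """Per-step advantages: reward gains, penalize drops, punish post-saturation steps."""
    adv, saturated = [], False
    for prev, cur in zip(potentials, potentials[1:]):
        if saturated:
            adv.append(-overrun_pen)   # kept "verifying" after the answer was reached
        else:
            delta = cur - prev
            adv.append(gain_w * delta if delta >= 0 else drop_w * delta)
        saturated = saturated or cur >= sat
    return adv

# Potential climbs, saturates at 0.97, then the model keeps checking.
print(step_advantages([0.2, 0.5, 0.8, 0.97, 0.97, 0.96]))
```

The post-saturation penalty is what discourages the redundant-verification behavior, including the failure mode where extra checking overturns an already-correct trajectory.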
[83] Rethinking Table Pruning in TableQA: From Sequential Revisions to Gold Trajectory-Supervised Parallel Search
Yu Guo, Shenghao Ye, Shuangwu Chen, Zijian Wen, Tao Zhang, Qirui Bai, Dong Jin, Yunpeng Hou, Huasen He, Jian Yang, Xiaobin Tan
Main category: cs.CL
TL;DR: TabTrim is a novel table pruning framework that transforms table pruning from sequential revisions to gold trajectory-supervised parallel search, achieving state-of-the-art performance on TableQA tasks.
Details
Motivation: Existing table pruning methods rely on sequential revisions driven by unreliable critique signals, often failing to detect the loss of answer-critical data, which limits their effectiveness in TableQA.
Method: TabTrim uses gold SQL query execution to derive gold pruning trajectories, trains a pruner and verifier to align step-wise pruning with these trajectories, and performs parallel search during inference to explore multiple candidate trajectories and identify optimal sub-tables.
Result: TabTrim-8B achieves 73.5% average accuracy, outperforming the strongest baseline by 3.2%, with 79.4% on WikiTQ and 61.2% on TableBench.
Conclusion: TabTrim’s gold trajectory-supervised parallel search approach effectively addresses limitations of existing sequential pruning methods and achieves state-of-the-art performance across diverse tabular reasoning tasks.
Abstract: Table Question Answering (TableQA) benefits significantly from table pruning, which extracts compact sub-tables by eliminating redundant cells to streamline downstream reasoning. However, existing pruning methods typically rely on sequential revisions driven by unreliable critique signals, often failing to detect the loss of answer-critical data. To address this limitation, we propose TabTrim, a novel table pruning framework which transforms table pruning from sequential revisions to gold trajectory-supervised parallel search. TabTrim derives a gold pruning trajectory using the intermediate sub-tables in the execution process of gold SQL queries, and trains a pruner and a verifier to make the step-wise pruning result align with the gold pruning trajectory. During inference, TabTrim performs parallel search to explore multiple candidate pruning trajectories and identify the optimal sub-table. Extensive experiments demonstrate that TabTrim achieves state-of-the-art performance across diverse tabular reasoning tasks: TabTrim-8B reaches 73.5% average accuracy, outperforming the strongest baseline by 3.2%, including 79.4% on WikiTQ and 61.2% on TableBench.
[84] What Does Loss Optimization Actually Teach, If Anything? Knowledge Dynamics in Continual Pre-training of LLMs
Seyed Mahed Mousavi, Simone Alghisi, Giuseppe Riccardi
Main category: cs.CL
TL;DR: CPT optimization diverges from actual knowledge learning - loss decreases monotonically while factual learning is unstable, non-monotonic, and poorly consolidated with systematic forgetting.
Details
Motivation: Current CPT practices treat loss as a proxy for knowledge learning without understanding how knowledge actually changes during training. The authors aim to study CPT as a knowledge learning process rather than just an optimization problem.
Method: Constructed a controlled, distribution-matched benchmark of factual documents and interleaved diagnostic probes directly into the CPT loop to measure epoch-level knowledge acquisition dynamics and OOD skill changes. Also analyzed how CPT reshapes knowledge circuits during training across three instruction-tuned LLMs and multiple CPT strategies.
Result: Optimization and learning systematically diverge: loss decreases monotonically while factual learning is unstable and non-monotonic. Acquired facts are rarely consolidated, learning is strongly conditioned on prior exposure, and OOD performance degrades from early epochs. Circuit analysis shows rapid reconfiguration of knowledge pathways across epochs.
Conclusion: Loss optimization is misaligned with learning progress in CPT, motivating evaluation of stopping criteria based on task-level learning dynamics rather than just loss metrics.
Abstract: Continual Pre-Training (CPT) is widely used for acquiring and updating factual knowledge in LLMs. This practice treats loss as a proxy for knowledge learning, while offering no grounding into how it changes during training. We study CPT as a knowledge learning process rather than solely an optimization problem. We construct a controlled, distribution-matched benchmark of factual documents and interleave diagnostic probes directly into the CPT loop, enabling epoch-level measurement of knowledge acquisition dynamics and changes in Out-Of-Domain (OOD) general skills (e.g., math). We further analyze how CPT reshapes knowledge circuits during training. Across three instruction-tuned LLMs and multiple CPT strategies, optimization and learning systematically diverge as loss decreases monotonically while factual learning is unstable and non-monotonic. Acquired facts are rarely consolidated, learning is strongly conditioned on prior exposure, and OOD performance degrades from early epochs. Circuit analysis reveals rapid reconfiguration of knowledge pathways across epochs, providing an explanation for narrow acquisition windows and systematic forgetting. These results show that loss optimization is misaligned with learning progress in CPT and motivate evaluation of stopping criteria based on task-level learning dynamics.
[85] InfiniteWeb: Scalable Web Environment Synthesis for GUI Agent Training
Ziyun Zhang, Zezhou Wang, Xiaoyi Zhang, Zongyu Guo, Jiahao Li, Bin Li, Yan Lu
Main category: cs.CL
TL;DR: InfiniteWeb automatically generates functional web environments at scale for GUI agent training, addressing the scarcity of suitable training environments through unified specifications, test-driven development, and diverse website generation with verifiable task evaluators.
Details
Motivation: GUI agents that interact with graphical interfaces represent promising AI assistants, but training them is hindered by the scarcity of suitable environments. Current approaches struggle with generating realistic, functional websites with many interconnected pages.
Method: InfiniteWeb uses unified specification, task-centric test-driven development, and combines website seeds with reference design images to ensure diversity. The system generates verifiable task evaluators that provide dense reward signals for reinforcement learning.
Result: InfiniteWeb surpasses commercial coding agents at realistic website construction. GUI agents trained on the generated environments achieve significant performance improvements on OSWorld and Online-Mind2Web benchmarks.
Conclusion: The system effectively addresses the environment scarcity problem for GUI agent training, demonstrating that automatically generated functional web environments can significantly improve agent performance on real-world GUI interaction tasks.
Abstract: GUI agents that interact with graphical interfaces on behalf of users represent a promising direction for practical AI assistants. However, training such agents is hindered by the scarcity of suitable environments. We present InfiniteWeb, a system that automatically generates functional web environments at scale for GUI agent training. While LLMs perform well on generating a single webpage, building a realistic and functional website with many interconnected pages faces challenges. We address these challenges through unified specification, task-centric test-driven development, and a combination of website seed with reference design image to ensure diversity. Our system also generates verifiable task evaluators enabling dense reward signals for reinforcement learning. Experiments show that InfiniteWeb surpasses commercial coding agents at realistic website construction, and GUI agents trained on our generated environments achieve significant performance improvements on OSWorld and Online-Mind2Web, demonstrating the effectiveness of the proposed system.
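The dense-reward idea behind the generated task evaluators can be sketched as follows. The task, state fields, and milestone checks are hypothetical examples, not evaluators actually produced by InfiniteWeb.

```python
# A minimal sketch of a verifiable task evaluator that emits a dense reward.

def cart_checkout_evaluator(state):
    """Score partial progress on a hypothetical 'add item and check out' task.

    state: dict describing the generated website's backend state after the
           GUI agent acts, e.g. {"cart": [...], "order_placed": bool, ...}.
    """
    milestones = [
        ("item_in_cart",   lambda s: len(s.get("cart", [])) > 0),
        ("address_filled", lambda s: bool(s.get("shipping_address"))),
        ("order_placed",   lambda s: s.get("order_placed", False)),
    ]
    passed = [name for name, check in milestones if check(state)]
    # Dense reward: fraction of verifiable milestones reached, not just 0/1
    # task success, which gives RL training a smoother learning signal.
    return len(passed) / len(milestones), passed

reward, passed = cart_checkout_evaluator(
    {"cart": ["mug"], "shipping_address": "221B Baker St", "order_placed": False})
print(reward, passed)  # 0.666..., ['item_in_cart', 'address_filled']
```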
[86] What Matters For Safety Alignment?
Xing Li, Hui-Ling Zhen, Lihao Yin, Xianzhi Yu, Zhenhua Dong, Mingxuan Yuan
Main category: cs.CL
TL;DR: Large-scale empirical study on LLM/LRM safety alignment finds reasoning-enhanced models are safest, post-training degrades safety, CoT attacks via response prefixes are highly effective, and roleplay/prompt injection are main attack methods.
Details
Motivation: To systematically evaluate what matters for safety alignment in LLMs and LRMs, providing essential insights for developing more secure and reliable AI systems through comprehensive empirical analysis.
Method: Evaluated 32 recent LLMs/LRMs across 13 model families (3B-235B parameters) using 5 safety datasets, 56 jailbreak techniques, and 4 CoT attack strategies with 4.6M API calls. Investigated 6 intrinsic model characteristics and 3 external attack techniques.
Result: 1) GPT-OSS-20B, Qwen3-Next-80B-A3B-Thinking, and GPT-OSS-120B are top safest models, showing reasoning/self-reflection advantages. 2) Post-training/knowledge distillation systematically degrades safety alignment. 3) CoT attacks via response prefixes increase attack success rate 3.34x on average (0.6% to 96.3% for Seed-OSS-36B-Instruct). 4) Roleplay, prompt injection, and gradient-based search are predominant attack methods.
Conclusion: Safety must be treated as explicit constraint during training, reasoning mechanisms enhance safety, text-completion interfaces with user-defined prefixes pose critical risks, and architectural safeguards are urgently needed against identified attack vectors.
Abstract: This paper presents a comprehensive empirical study of safety alignment in LLMs and LRMs. We evaluate what matters for safety alignment to provide essential insights for developing more secure and reliable AI systems. We systematically investigate and compare the influence of six critical intrinsic model characteristics and three external attack techniques. Our large-scale evaluation is conducted using 32 recent, popular LLMs and LRMs across thirteen distinct model families, spanning a parameter scale from 3B to 235B. The assessment leverages five established safety datasets and probes model vulnerabilities with 56 jailbreak techniques and four CoT attack strategies, resulting in 4.6M API calls. Our key empirical findings are fourfold. First, we identify the LRMs GPT-OSS-20B, Qwen3-Next-80B-A3B-Thinking, and GPT-OSS-120B as the top-three safest models, which substantiates the significant advantage of integrated reasoning and self-reflection mechanisms for robust safety alignment. Second, post-training and knowledge distillation may lead to a systematic degradation of safety alignment. We thus argue that safety must be treated as an explicit constraint or a core optimization objective during these stages, not merely subordinated to the pursuit of general capability. Third, we reveal a pronounced vulnerability: employing a CoT attack via a response prefix can elevate the attack success rate by 3.34x on average and from 0.6% to 96.3% for Seed-OSS-36B-Instruct. This critical finding underscores the safety risks inherent in text-completion interfaces and features that allow user-defined response prefixes in LLM services, highlighting an urgent need for architectural and deployment safeguards. Fourth, roleplay, prompt injection, and gradient-based search for adversarial prompts are the predominant methodologies for eliciting unaligned behaviors in modern models.
[87] PartisanLens: A Multilingual Dataset of Hyperpartisan and Conspiratorial Immigration Narratives in European Media
Michele Joshua Maggini, Paloma Piot, Anxo Pérez, Erik Bran Marino, Lúa Santamaría Montesinos, Ana Lisboa, Marta Vázquez Abuín, Javier Parapar, Pablo Gamallo
Main category: cs.CL
TL;DR: PartisanLens is a multilingual dataset for detecting hyperpartisan and population replacement conspiracy theory narratives in Spanish, Italian, and Portuguese news headlines, with evaluation of LLMs as classifiers and potential annotators.
Details
Motivation: Hyperpartisan narratives and Population Replacement Conspiracy Theories (PRCT) drive political polarization, institutional distrust, and extremist violence, but existing resources are scarce, English-centric, and analyze these aspects in isolation rather than as interrelated political discourse elements.
Method: Created PartisanLens dataset of 1,617 hyperpartisan news headlines in Spanish, Italian, and Portuguese with multi-aspect annotations. Evaluated LLMs as classifiers for hyperpartisan/PRCT detection, assessed their viability as automatic annotators, and explored conditioning LLMs on socio-economic/ideological profiles to emulate human annotation patterns.
Result: Established robust baselines for LLM classification of hyperpartisan and PRCT narratives, highlighting both potential and limitations of LLMs as automatic annotators. Demonstrated LLMs can approximate human annotation but with current limitations.
Conclusion: PartisanLens supports future research on detecting partisan and conspiratorial narratives in European contexts, providing multilingual resources and evaluation frameworks for addressing misinformation threats to social cohesion and public safety.
Abstract: Detecting hyperpartisan narratives and Population Replacement Conspiracy Theories (PRCT) is essential to addressing the spread of misinformation. These complex narratives pose a significant threat, as hyperpartisanship drives political polarisation and institutional distrust, while PRCTs directly motivate real-world extremist violence, making their identification critical for social cohesion and public safety. However, existing resources are scarce, predominantly English-centric, and often analyse hyperpartisanship, stance, and rhetorical bias in isolation rather than as interrelated aspects of political discourse. To bridge this gap, we introduce PartisanLens, the first multilingual dataset of 1,617 hyperpartisan news headlines in Spanish, Italian, and Portuguese, annotated in multiple political discourse aspects. We first evaluate the classification performance of widely used Large Language Models (LLMs) on this dataset, establishing robust baselines for the classification of hyperpartisan and PRCT narratives. In addition, we assess the viability of using LLMs as automatic annotators for this task, analysing their ability to approximate human annotation. Results highlight both their potential and current limitations. Next, moving beyond standard judgments, we explore whether LLMs can emulate human annotation patterns by conditioning them on socio-economic and ideological profiles that simulate annotator perspectives. Finally, we release our resources and evaluation; PartisanLens supports future research on detecting partisan and conspiratorial narratives in European contexts.
[88] Atlas: Orchestrating Heterogeneous Models and Tools for Multi-Domain Complex Reasoning
Jinyang Wu, Guocheng Zhai, Ruihan Jin, Jiahao Yuan, Yuhao Shen, Shuai Zhang, Zhengqi Wen, Jianhua Tao
Main category: cs.CL
TL;DR: ATLAS is a dual-path framework that dynamically selects optimal LLM-tool combinations for complex reasoning tasks, outperforming existing methods and GPT-4o on both in-distribution and out-of-distribution tasks.
Details
Motivation: As LLMs and external tools diversify, selecting optimal model-tool combinations becomes a high-dimensional optimization challenge. Existing approaches using single models or fixed tool-calling logic fail to exploit performance variations across heterogeneous model-tool pairs.
Method: ATLAS uses a dual-path approach: (1) training-free cluster-based routing that exploits empirical priors for domain-specific alignment, and (2) RL-based multi-step routing that explores autonomous trajectories for out-of-distribution generalization.
Result: Extensive experiments across 15 benchmarks show ATLAS outperforms closed-source models like GPT-4o, surpassing existing routing methods on both in-distribution (+10.1%) and out-of-distribution (+13.1%) tasks. The framework also shows significant gains in visual reasoning by orchestrating specialized multi-modal tools.
Conclusion: ATLAS provides an effective solution for dynamic tool usage in cross-domain complex reasoning by adaptively selecting optimal LLM-tool combinations through its dual-path routing approach, demonstrating superior performance across diverse benchmarks.
Abstract: The integration of large language models (LLMs) with external tools has significantly expanded the capabilities of AI agents. However, as the diversity of both LLMs and tools increases, selecting the optimal model-tool combination becomes a high-dimensional optimization challenge. Existing approaches often rely on a single model or fixed tool-calling logic, failing to exploit the performance variations across heterogeneous model-tool pairs. In this paper, we present ATLAS (Adaptive Tool-LLM Alignment and Synergistic Invocation), a dual-path framework for dynamic tool usage in cross-domain complex reasoning. ATLAS operates via a dual-path approach: (1) training-free cluster-based routing that exploits empirical priors for domain-specific alignment, and (2) RL-based multi-step routing that explores autonomous trajectories for out-of-distribution generalization. Extensive experiments across 15 benchmarks demonstrate that our method outperforms closed-source models like GPT-4o, surpassing existing routing methods on both in-distribution (+10.1%) and out-of-distribution (+13.1%) tasks. Furthermore, our framework shows significant gains in visual reasoning by orchestrating specialized multi-modal tools.
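A minimal sketch of the training-free cluster-based routing path: embed the query, assign it to the nearest domain cluster, and pick the model-tool pair with the best empirical prior for that cluster. The centroids and accuracy priors below are toy stand-ins, not numbers from the paper.

```python
# Training-free cluster-based routing with empirical priors (toy example).
import numpy as np

# Hypothetical cluster centroids (e.g. from k-means over validation queries)
# and per-cluster accuracy priors for each (model, tool) pair.
centroids = np.array([[1.0, 0.0], [0.0, 1.0]])          # e.g. math vs. visual
priors = {
    0: {("llm-a", "calculator"): 0.82, ("llm-b", "search"): 0.61},
    1: {("llm-a", "calculator"): 0.40, ("llm-b", "ocr"): 0.77},
}

def route(query_embedding):
    """Pick the model-tool pair with the highest empirical prior for the
    query's nearest cluster; no training or parameter updates needed."""
    cluster = int(np.argmin(np.linalg.norm(centroids - query_embedding, axis=1)))
    pair, acc = max(priors[cluster].items(), key=lambda kv: kv[1])
    return cluster, pair, acc

print(route(np.array([0.9, 0.2])))   # -> (0, ('llm-a', 'calculator'), 0.82)
```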
[89] Evaluating Small Decoder-Only Language Models for Grammar Correction and Text Simplification
Anthony Lamelas
Main category: cs.CL
TL;DR: Small language models (SLMs) are not yet competitive with large language models (LLMs) for grammar correction and text simplification tasks, despite being more efficient.
Details
Motivation: LLMs are powerful but too large/computationally expensive for many practical applications, creating a need for more efficient alternatives like small language models.
Method: Tested small decoder-only language models on JFLEG and ASSET datasets using established metrics, evaluating them out-of-the-box, fine-tuned, and run sequentially.
Result: SLMs perform below strong baselines and current LLMs, struggle to retain meaning, are prone to hallucination, and cannot match modern LLM performance on rewriting tasks.
Conclusion: Current SLMs are not yet competitive with modern LLMs for rewriting tasks despite efficiency advantages, requiring further training advances to close the performance gap.
Abstract: Large language models have become extremely popular recently due to their ability to achieve strong performance on a variety of tasks, such as text generation and rewriting, but their size and computation cost make them difficult to access, deploy, and secure in many settings. This paper investigates whether small, decoder-only language models can provide an efficient alternative for the tasks of grammar correction and text simplification. The experiments in this paper focus on testing small language models out of the box, fine-tuned, and run sequentially on the JFLEG and ASSET datasets using established metrics. The results show that while SLMs may learn certain behaviors well, their performance remains below strong baselines and current LLMs. The results also show that SLMs struggle to retain meaning and are prone to hallucination. These findings suggest that despite their efficiency advantages, current SLMs are not yet competitive with modern LLMs for rewriting, and further advances in training are required to close the performance gap.
[90] Decide Then Retrieve: A Training-Free Framework with Uncertainty-Guided Triggering and Dual-Path Retrieval
Wang Chen, Guanqiang Qi, Weikang Li, Yang Li, Deguo Xia, Jizhou Huang
Main category: cs.CL
TL;DR: DTR is a training-free RAG framework that uses generation uncertainty to decide when to retrieve and employs dual-path retrieval with adaptive selection to reduce noise and improve performance.
Details
Motivation: Existing RAG approaches have two main problems: 1) they indiscriminately trigger retrieval for all queries, which can introduce unnecessary noise, and 2) they rely on single-path evidence construction, which limits performance gains especially for sparse or ambiguous queries.
Method: DTR uses generation uncertainty to adaptively determine when retrieval is needed (retrieval triggering decision). It introduces a dual-path retrieval mechanism with adaptive information selection that can better handle different types of queries by selecting relevant information more effectively.
Result: Extensive experiments across five open-domain QA benchmarks, multiple model scales, and different retrievers show DTR consistently improves EM and F1 scores over standard RAG and strong baselines while reducing unnecessary retrievals.
Conclusion: DTR provides an effective training-free framework that addresses key limitations of existing RAG approaches by making retrieval decisions adaptive and improving evidence construction through dual-path mechanisms, leading to better performance with fewer unnecessary retrievals.
Abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) by incorporating external knowledge, but existing approaches indiscriminately trigger retrieval and rely on single-path evidence construction, often introducing noise and limiting performance gains. In this work, we propose Decide Then Retrieve (DTR), a training-free framework that adaptively determines when retrieval is necessary and how external information should be selected. DTR leverages generation uncertainty to guide retrieval triggering and introduces a dual-path retrieval mechanism with adaptive information selection to better handle sparse and ambiguous queries. Extensive experiments across five open-domain QA benchmarks, multiple model scales, and different retrievers demonstrate that DTR consistently improves EM and F1 over standard RAG and strong retrieval-enhanced baselines, while reducing unnecessary retrievals. The code and data used in this paper are available at https://github.com/ChenWangHKU/DTR.
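The retrieval-triggering decision can be sketched with mean token entropy as the uncertainty signal. Both the entropy threshold and this particular uncertainty measure are illustrative assumptions, not necessarily DTR's exact recipe.

```python
# Uncertainty-gated retrieval: generate first, retrieve only when uncertain.
import math

def mean_token_entropy(token_distributions):
    """Average Shannon entropy of the model's next-token distributions."""
    entropies = [-sum(p * math.log(p) for p in dist if p > 0)
                 for dist in token_distributions]
    return sum(entropies) / len(entropies)

def answer(query, generate, retrieve, threshold=1.0):
    """Generate a draft; only retrieve when the model is uncertain."""
    draft, token_dists = generate(query, context=None)
    if mean_token_entropy(token_dists) <= threshold:
        return draft                      # confident: skip retrieval entirely
    # Uncertain: run retrieval (in DTR, dual-path with adaptive selection),
    # then regenerate conditioned on the selected evidence.
    evidence = retrieve(query)
    final, _ = generate(query, context=evidence)
    return final

# Toy stand-ins for demonstration.
gen = lambda q, context: (("cited answer" if context else "draft answer"),
                          [[0.5, 0.5], [0.9, 0.1]])
print(answer("who wrote Dune?", gen, lambda q: ["Frank Herbert bio"],
             threshold=0.4))              # entropy ~0.51 > 0.4, so it retrieves
```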
[91] Large-Scale Aspect-Based Sentiment Analysis with Reasoning-Infused LLMs
Paweł Liskowski, Krzysztof Jankowski
Main category: cs.CL
TL;DR: Arctic-ABSA introduces powerful models for aspect-based sentiment analysis with expanded sentiment classes, multilingual support, and novel reasoning techniques, achieving SOTA results and releasing a large benchmark dataset.
Details
Motivation: To create commercial-grade ABSA models that address real-world needs by expanding beyond standard sentiment classes, supporting multiple languages, and improving generalization through reasoning techniques.
Method: Trained on large corpus of public data plus synthetic data (20x larger than SemEval14), expanded sentiment classes from 3 to 5 (adding mixed/unknown), joint prediction of overall sentiment, multilingual support, reasoning injection via CoT fine-tuning, and novel reasoning pretraining for encoder-only models.
Result: 395M-parameter encoder and 8B-parameter decoder achieve up to 10 percentage points higher accuracy than GPT-4o and Claude 3.5 Sonnet, set new SOTA on SemEval14, and maintain 87-91% accuracy across six languages without degrading English performance.
Conclusion: Arctic-ABSA models demonstrate superior performance for commercial ABSA applications through expanded capabilities, reasoning techniques, and multilingual support, while releasing ABSA-mix benchmark to advance the field.
Abstract: We introduce Arctic-ABSA, a collection of powerful models for real-life aspect-based sentiment analysis (ABSA). Our models are tailored to commercial needs, trained on a large corpus of public data alongside carefully generated synthetic data, resulting in a dataset 20 times larger than SemEval14. We extend typical ABSA models by expanding the number of sentiment classes from the standard three (positive, negative, neutral) to five, adding mixed and unknown classes, while also jointly predicting overall text sentiment and supporting multiple languages. We experiment with reasoning injection by fine-tuning on Chain-of-Thought (CoT) examples and introduce a novel reasoning pretraining technique for encoder-only models that significantly improves downstream fine-tuning and generalization. Our 395M-parameter encoder and 8B-parameter decoder achieve up to 10 percentage points higher accuracy than GPT-4o and Claude 3.5 Sonnet, while setting new state-of-the-art results on the SemEval14 benchmark. A single multilingual model maintains 87-91% accuracy across six languages without degrading English performance. We release ABSA-mix, a large-scale benchmark aggregating 17 public ABSA datasets across 92 domains.
[92] When Models Decide and When They Bind: A Two-Stage Computation for Multiple-Choice Question-Answering
Hugh Mee Wong, Rick Nouwen, Albert Gatt
Main category: cs.CL
TL;DR: Language models implement multiple-choice QA in two stages: first selecting the correct answer in content space, then binding it to the appropriate output symbol.
Details
Motivation: MCQA evaluation conflates reasoning errors with symbol-binding failures, making it hard to understand how models actually solve these tasks internally.
Method: Used representational analyses (PCA, linear probes) and causal interventions to study internal representations, examining option-boundary residual states and winner-identity probing.
Result: Found strong linearly decodable signals about per-option correctness in residual states; winner identity becomes decodable immediately after final option processing, while output symbol representation appears closer to answer emission.
Conclusion: Models use a two-stage mechanism: first select winner in content space, then bind/route that winner to appropriate symbol for emission, explaining how MCQA is implemented internally.
Abstract: Multiple-choice question answering (MCQA) is easy to evaluate but adds a meta-task: models must both solve the problem and output the symbol that represents the answer, conflating reasoning errors with symbol-binding failures. We study how language models implement MCQA internally using representational analyses (PCA, linear probes) as well as causal interventions. We find that option-boundary (newline) residual states often contain strong linearly decodable signals related to per-option correctness. Winner-identity probing reveals a two-stage progression: the winning content position becomes decodable immediately after the final option is processed, while the output symbol is represented closer to the answer emission position. Tests under symbol and content permutations support a two-stage mechanism in which models first select a winner in content space and then bind or route that winner to the appropriate symbol to emit.
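The probing methodology is straightforward to sketch: fit a linear classifier on residual-stream states taken at option-boundary positions and test whether per-option correctness is decodable. The random features below stand in for actual hidden states extracted from a model.

```python
# A minimal linear-probe sketch; sklearn's LogisticRegression is one standard choice.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_options, d_model = 2000, 64

# Stand-in residual states at each option's boundary (newline) position,
# with a weak linear signal injected for the correct-option label.
labels = rng.integers(0, 2, size=n_options)
states = rng.normal(size=(n_options, d_model))
states[:, 0] += 2.0 * labels      # hypothetical linearly decodable signal

X_tr, X_te, y_tr, y_te = train_test_split(states, labels, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"probe accuracy: {probe.score(X_te, y_te):.2f}")  # well above chance
```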
[93] Doc-PP: Document Policy Preservation Benchmark for Large Vision-Language Models
Haeun Jang, Hwan Chang, Hwanhee Lee
Main category: cs.CL
TL;DR: The Doc-PP benchmark reveals that LVLMs leak sensitive information during complex multimodal reasoning, and the proposed DVA framework improves policy compliance.
Details
Motivation: Real-world document QA requires adherence to dynamic disclosure policies, but existing safety research overlooks multimodal complexities and focuses only on implicit norms or text-only settings.
Method: Introduces the Doc-PP benchmark, built from real-world reports with strict policies, and proposes the DVA (Decompose-Verify-Aggregation) framework, which decouples reasoning from policy verification.
Result: Models show a systemic Reasoning-Induced Safety Gap, leaking sensitive information during complex multimodal reasoning; providing extracted text improves perception but facilitates leakage; DVA outperforms standard prompting defenses.
Conclusion: DVA offers a robust baseline for policy-compliant document understanding, addressing vulnerabilities in multimodal reasoning while preserving disclosure policies.
Abstract: The deployment of Large Vision-Language Models (LVLMs) for real-world document question answering is often constrained by dynamic, user-defined policies that dictate information disclosure based on context. While ensuring adherence to these explicit constraints is critical, existing safety research primarily focuses on implicit social norms or text-only settings, overlooking the complexities of multimodal documents. In this paper, we introduce Doc-PP (Document Policy Preservation Benchmark), a novel benchmark constructed from real-world reports requiring reasoning across heterogeneous visual and textual elements under strict non-disclosure policies. Our evaluation highlights a systemic Reasoning-Induced Safety Gap: models frequently leak sensitive information when answers must be inferred through complex synthesis or aggregated across modalities, effectively circumventing existing safety constraints. Furthermore, we identify that providing extracted text improves perception but inadvertently facilitates leakage. To address these vulnerabilities, we propose DVA (Decompose-Verify-Aggregation), a structural inference framework that decouples reasoning from policy verification. Experimental results demonstrate that DVA significantly outperforms standard prompting defenses, offering a robust baseline for policy-compliant document understanding.
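The decompose-verify-aggregate flow can be sketched as a small pipeline. The decomposer, verifier, and policy below are hypothetical stand-ins; the point is only that policy verification is decoupled from reasoning.

```python
# A minimal Decompose-Verify-Aggregate style pipeline (toy stand-ins).

def dva_answer(question, decompose, answer_step, violates_policy):
    """Answer a document question while withholding policy-violating facts."""
    parts = []
    for sub_q in decompose(question):          # 1. Decompose the query
        fact = answer_step(sub_q)              # 2. Reason over the document
        if violates_policy(fact):              # 3. Verify against the policy
            parts.append("[withheld per disclosure policy]")
        else:
            parts.append(fact)
    return " ".join(parts)                     # 4. Aggregate compliant parts

# Toy stand-ins: a policy that forbids disclosing salary figures.
print(dva_answer(
    "Who is the CEO and what is their salary?",
    decompose=lambda q: ["Who is the CEO?", "What is the CEO's salary?"],
    answer_step=lambda sq: "Jane Doe." if "Who" in sq else "Salary: $2M.",
    violates_policy=lambda fact: "salary" in fact.lower(),
))  # -> "Jane Doe. [withheld per disclosure policy]"
```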
[94] RADAR: Retrieval-Augmented Detector with Adversarial Refinement for Robust Fake News Detection
Song-Duo Ma, Yi-Hung Liu, Hsin-Yu Lin, Pin-Yu Chen, Hong-Yan Huang, Shau-Yung Hsu, Yun-Nung Chen
Main category: cs.CL
TL;DR: RADAR is a retrieval-augmented detector with adversarial refinement that uses a generator to rewrite real articles with factual perturbations and a lightweight detector that verifies claims using dense passage retrieval, achieving 86.98% ROC-AUC on fake news detection.
Details
Motivation: To efficiently combat the spread of LLM-generated misinformation by creating a robust fake news detection system that can adapt to increasingly sophisticated misinformation techniques.
Method: Uses a generator-detector adversarial framework where the generator rewrites real articles with factual perturbations, and the detector verifies claims using dense passage retrieval. Introduces verbal adversarial feedback (VAF) - structured natural-language critiques instead of scalar rewards - to guide generator evolution and improve detector robustness.
Result: Achieves 86.98% ROC-AUC on a fake news detection benchmark, significantly outperforming general-purpose LLMs with retrieval. Ablation studies show detector-side retrieval yields largest gains, while VAF and few-shot demonstrations provide critical signals for robust training.
Conclusion: RADAR demonstrates an effective adversarial refinement approach for fake news detection, where verbal adversarial feedback enables co-evolution of generator and detector, leading to superior performance compared to existing methods.
Abstract: To efficiently combat the spread of LLM-generated misinformation, we present RADAR, a retrieval-augmented detector with adversarial refinement for robust fake news detection. Our approach employs a generator that rewrites real articles with factual perturbations, paired with a lightweight detector that verifies claims using dense passage retrieval. To enable effective co-evolution, we introduce verbal adversarial feedback (VAF). Rather than relying on scalar rewards, VAF issues structured natural-language critiques; these guide the generator toward more sophisticated evasion attempts, compelling the detector to adapt and improve. On a fake news detection benchmark, RADAR achieves 86.98% ROC-AUC, significantly outperforming general-purpose LLMs with retrieval. Ablation studies confirm that detector-side retrieval yields the largest gains, while VAF and few-shot demonstrations provide critical signals for robust training.
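The co-evolution loop with verbal adversarial feedback can be sketched as follows; generate_fake, detect, and critique are hypothetical stand-ins for RADAR's trained generator, retrieval-backed detector, and critique module.

```python
# Adversarial refinement with verbal (natural-language) feedback, sketched.

def adversarial_round(article, generate_fake, detect, critique, n_rounds=3):
    feedback = None
    for round_idx in range(n_rounds):
        # Generator rewrites the real article with factual perturbations,
        # conditioned on the previous round's structured critique.
        fake = generate_fake(article, feedback)
        caught, evidence = detect(fake)           # retrieval-backed detection
        if not caught:
            return fake, round_idx                # evasion succeeded
        # Verbal adversarial feedback: a critique in natural language rather
        # than a scalar reward, e.g. which claim contradicts which passage.
        feedback = critique(fake, evidence)
    return None, n_rounds                         # detector held up

# Toy stand-ins: the generator changes tactics once it receives a critique.
fake, rounds = adversarial_round(
    article="The treaty was signed in 1990.",
    generate_fake=lambda a, fb: a.replace("1990", "1985" if fb is None else "1991"),
    detect=lambda f: ("1985" in f, "retrieved passage says 1990"),
    critique=lambda f, ev: f"The fabricated date contradicts evidence: {ev}",
)
print(fake, rounds)  # the second attempt ('1991') evades the toy detector
```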
[95] Benchmark^2: Systematic Evaluation of LLM Benchmarks
Qi Qian, Chengsong Huang, Jingwen Xu, Changze Lv, Muling Wu, Wenhao Liu, Xiaohua Wang, Zhenghua Wang, Zisu Huang, Muzhao Tian, Jianhan Xu, Kun Hu, He-Da Wang, Yao Hu, Xuanjing Huang, Xiaoqing Zheng
Main category: cs.CL
TL;DR: Benchmark^2 is a framework with three metrics to evaluate benchmark quality for LLMs, showing existing benchmarks vary significantly in quality and selective construction can reduce test set size while maintaining evaluation performance.
Details
Motivation: The rapid proliferation of benchmarks for evaluating large language models has created an urgent need for systematic methods to assess benchmark quality itself, as there's no standardized way to determine which benchmarks are reliable and effective.
Method: Proposes Benchmark^2 framework with three complementary metrics: (1) Cross-Benchmark Ranking Consistency - measures if benchmark produces model rankings aligned with peer benchmarks; (2) Discriminability Score - quantifies benchmark's ability to differentiate between models; (3) Capability Alignment Deviation - identifies problematic instances where stronger models fail but weaker models succeed within same model family.
Result: Extensive experiments across 15 benchmarks spanning mathematics, reasoning, and knowledge domains, evaluating 11 LLMs across four model families. Analysis reveals significant quality variations among existing benchmarks and demonstrates that selective benchmark construction based on proposed metrics can achieve comparable evaluation performance with substantially reduced test sets.
Conclusion: Benchmark^2 provides a systematic framework for evaluating benchmark quality, revealing quality variations in existing benchmarks and enabling more efficient benchmark construction through selective item selection while maintaining evaluation reliability.
Abstract: The rapid proliferation of benchmarks for evaluating large language models (LLMs) has created an urgent need for systematic methods to assess benchmark quality itself. We propose Benchmark^2, a comprehensive framework comprising three complementary metrics: (1) Cross-Benchmark Ranking Consistency, measuring whether a benchmark produces model rankings aligned with peer benchmarks; (2) Discriminability Score, quantifying a benchmark’s ability to differentiate between models; and (3) Capability Alignment Deviation, identifying problematic instances where stronger models fail but weaker models succeed within the same model family. We conduct extensive experiments across 15 benchmarks spanning mathematics, reasoning, and knowledge domains, evaluating 11 LLMs across four model families. Our analysis reveals significant quality variations among existing benchmarks and demonstrates that selective benchmark construction based on our metrics can achieve comparable evaluation performance with substantially reduced test sets.
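The first metric, Cross-Benchmark Ranking Consistency, has a natural reading as average rank correlation with peer benchmarks. A minimal sketch under that assumption, with a toy accuracy table:

```python
# Ranking consistency as mean Kendall tau against peer benchmarks.
import numpy as np
from scipy.stats import kendalltau

# Rows: benchmarks; columns: accuracy of each model on that benchmark.
acc = np.array([
    [0.81, 0.74, 0.66, 0.52],   # benchmark A
    [0.79, 0.71, 0.69, 0.50],   # benchmark B (agrees with A)
    [0.55, 0.80, 0.49, 0.77],   # benchmark C (ranks models differently)
])

def ranking_consistency(acc):
    """Mean Kendall tau between each benchmark's model ranking and its peers'."""
    scores = []
    for i in range(len(acc)):
        taus = []
        for j in range(len(acc)):
            if j == i:
                continue
            tau, _ = kendalltau(acc[i], acc[j])
            taus.append(tau)
        scores.append(float(np.mean(taus)))
    return scores

print([round(s, 2) for s in ranking_consistency(acc)])
# Benchmark C gets the lowest consistency score and would be flagged.
```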
[96] Layer-wise Positional Bias in Short-Context Language Modeling
Maryam Rahimi, Mahdi Nouri, Yadollah Yaghoobzadeh
Main category: cs.CL
TL;DR: The paper introduces an attribution-based framework to analyze positional biases in language models, finding architecture-specific, stable positional importance profiles with recency bias increasing with depth and primacy bias decreasing.
Details
Motivation: Language models show positional biases (preferring specific input positions regardless of semantic relevance), but prior work hasn't established how these biases evolve across layers and positions or how they vary independent of task complexity.
Method: An attribution-based framework using layer conductance with a sliding-window approach to quantify how each layer distributes importance across input positions, yielding layer-wise positional importance profiles.
Result: Positional importance profiles are architecture-specific, stable across inputs, and invariant to lexical scrambling. Recency bias increases with depth while primacy bias diminishes. Early layers preferentially weight content words over function words across all positions, while later layers lose this word-type differentiation.
Conclusion: The framework reveals systematic positional biases in language models that are architecture-dependent and layer-specific, providing insights into how models process positional information independent of task complexity.
Abstract: Language models often show a preference for using information from specific positions in the input regardless of semantic relevance. While positional bias has been studied in various contexts, from attention sinks to task performance degradation in long-context settings, prior work has not established how these biases evolve across individual layers and input positions, or how they vary independent of task complexity. We introduce an attribution-based framework to analyze positional effects in short-context language modeling. Using layer conductance with a sliding-window approach, we quantify how each layer distributes importance across input positions, yielding layer-wise positional importance profiles. We find that these profiles are architecture-specific, stable across inputs, and invariant to lexical scrambling. Characterizing these profiles, we find prominent recency bias that increases with depth and subtle primacy bias that diminishes through model depth. Beyond positional structure, we also show that early layers preferentially weight content words over function words across all positions, while later layers lose this word-type differentiation.
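Aggregating attributions into positional importance profiles can be sketched in a few lines. The synthetic matrix below stands in for layer-conductance scores, which are not recomputed here.

```python
# Sliding-window positional importance profiles from per-layer attributions.
import numpy as np

def positional_profiles(attr, window=3):
    """attr: (n_layers, seq_len) nonnegative attribution scores.
    Returns sliding-window importance, normalized per layer so that
    profiles are comparable across depth."""
    kernel = np.ones(window) / window
    profiles = np.stack([np.convolve(row, kernel, mode="valid") for row in attr])
    return profiles / profiles.sum(axis=1, keepdims=True)

# Synthetic attributions where deeper layers weight late positions more,
# mimicking a recency bias that grows with depth.
rng = np.random.default_rng(0)
attr = np.abs(rng.normal(size=(4, 12))) + np.linspace(0, 1, 12) * np.arange(4)[:, None]
print(positional_profiles(attr).round(2))
```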
[97] VotIE: Information Extraction from Meeting Minutes
José Pedro Evans, Luís Filipe Cunha, Purificação Silvano, Alípio Jorge, Nuno Guimarães, Sérgio Nunes, Ricardo Campos
Main category: cs.CL
TL;DR: VotIE is a new information extraction task for identifying voting events in heterogeneous municipal meeting minutes. Fine-tuned XLM-R-CRF performs best in-domain (93.2% F1) but struggles with cross-municipality transfer, where few-shot LLMs show better generalization despite higher computational costs.
Details
Motivation: Municipal meeting minutes contain crucial democratic decisions but encode voting outcomes in highly heterogeneous, free-form narrative text that varies widely across municipalities, making automated extraction challenging. Unlike standardized parliamentary proceedings, this variability creates significant barriers for information extraction systems.
Method: Introduced VotIE (Voting Information Extraction) task and established the first benchmark using Portuguese municipal minutes from the CitiLink corpus. Compared two approaches: 1) fine-tuned encoders (specifically XLM-R-CRF) and 2) generative LLMs with few-shot learning. Evaluated both in-domain and cross-municipality transfer settings.
Result: In-domain: Fine-tuned XLM-R-CRF achieved 93.2% macro F1, outperforming generative approaches. Cross-municipality: Fine-tuned models suffered substantial performance degradation, while few-shot LLMs demonstrated greater robustness with significantly smaller performance declines. However, generative models have high computational costs.
Conclusion: Lightweight fine-tuned encoders remain more practical for large-scale deployment despite cross-municipality generalization limitations. Few-shot LLMs show better transfer learning but are computationally expensive. The benchmark, models, and evaluation framework are publicly released to support administrative NLP research.
Abstract: Municipal meeting minutes record key decisions in local democratic processes. Unlike parliamentary proceedings, which typically adhere to standardized formats, they encode voting outcomes in highly heterogeneous, free-form narrative text that varies widely across municipalities, posing significant challenges for automated extraction. In this paper, we introduce VotIE (Voting Information Extraction), a new information extraction task aimed at identifying structured voting events in narrative deliberative records, and establish the first benchmark for this task using Portuguese municipal minutes, building on the recently introduced CitiLink corpus. Our experiments yield two key findings. First, under standard in-domain evaluation, fine-tuned encoders, specifically XLM-R-CRF, achieve the strongest performance, reaching 93.2% macro F1, outperforming generative approaches. Second, in a cross-municipality setting that evaluates transfer to unseen administrative contexts, these models suffer substantial performance degradation, whereas few-shot LLMs demonstrate greater robustness, with significantly smaller declines in performance. Despite this generalization advantage, the high computational cost of generative models currently constrains their practicality. As a result, lightweight fine-tuned encoders remain a more practical option for large-scale, real-world deployment. To support reproducible research in administrative NLP, we publicly release our benchmark, trained models, and evaluation framework.
[98] Simulated Students in Tutoring Dialogues: Substance or Illusion?
Alexander Scarlatos, Jaewook Lee, Simon Woodhead, Andrew Lan
Main category: cs.CL
TL;DR: This paper addresses the lack of quality evaluation for simulated students in LLM-powered tutoring systems, proposing formal definitions, comprehensive metrics, and benchmarking various simulation methods on real-world math tutoring data.
Details
Motivation: While LLMs enable educational innovations, evaluating new tutoring solutions requires real students which is time-consuming and hard to scale. Many works use simulated students via simple prompting, but little work has been done to ensure or measure their quality, creating a gap in reliable evaluation methods.
Method: The authors formally define the student simulation task, propose evaluation metrics spanning linguistic, behavioral, and cognitive aspects, and benchmark a wide range of student simulation methods (including prompting strategies, supervised fine-tuning, and preference optimization) on a real-world math tutoring dialogue dataset.
Result: Automated and human evaluation results show that prompting strategies for student simulation perform poorly, while supervised fine-tuning and preference optimization yield much better but still limited performance, indicating the task remains challenging.
Conclusion: Current student simulation methods have significant limitations, with simple prompting being inadequate and even more advanced techniques showing only partial success, motivating future work on this challenging but important task for educational technology evaluation.
Abstract: Advances in large language models (LLMs) enable many new innovations in education. However, evaluating the effectiveness of new technology requires real students, which is time-consuming and hard to scale up. Therefore, many recent works on LLM-powered tutoring solutions have used simulated students for both training and evaluation, often via simple prompting. Surprisingly, little work has been done to ensure or even measure the quality of simulated students. In this work, we formally define the student simulation task, propose a set of evaluation metrics that span linguistic, behavioral, and cognitive aspects, and benchmark a wide range of student simulation methods on these metrics. We experiment on a real-world math tutoring dialogue dataset, where both automated and human evaluation results show that prompting strategies for student simulation perform poorly; supervised fine-tuning and preference optimization yield much better but still limited performance, motivating future work on this challenging task.
[99] ContextFocus: Activation Steering for Contextual Faithfulness in Large Language Models
Nikhil Anand, Shwetha Somasundaram, Anirudh Phukan, Apoorv Saxena, Koyel Mukherjee
Main category: cs.CL
TL;DR: ContextFocus is a lightweight activation steering method that improves LLM faithfulness to external context during knowledge conflicts without finetuning or significant inference overhead.
Details
Motivation: LLMs often default to their internal memorized knowledge when external retrieved context conflicts with it, leading to unfaithful outputs. As world knowledge evolves, effective deployment depends on LLMs' ability to faithfully follow external evidence.
Method: ContextFocus is a lightweight activation steering approach that requires no model finetuning and incurs minimal inference-time overhead. It improves context faithfulness in knowledge-conflict settings while preserving fluency and efficiency.
Result: ContextFocus significantly improves contextual-faithfulness on the ConFiQA benchmark compared to baselines like ContextDPO, COIECD, and prompting methods. It’s complementary to prompting strategies and remains effective on larger models.
Conclusion: ContextFocus demonstrates effectiveness, robustness, and efficiency in improving contextual-faithfulness of LLM outputs, making it a practical solution for knowledge-conflict scenarios without requiring model retraining.
Abstract: Large Language Models (LLMs) encode vast amounts of parametric knowledge during pre-training. As world knowledge evolves, effective deployment increasingly depends on their ability to faithfully follow externally retrieved context. When such evidence conflicts with the model's internal knowledge, LLMs often default to memorized facts, producing unfaithful outputs. In this work, we introduce ContextFocus, a lightweight activation steering approach that improves context faithfulness in such knowledge-conflict settings while preserving fluency and efficiency. Unlike prior approaches, our solution requires no model finetuning and incurs minimal inference-time overhead, making it highly efficient. We evaluate ContextFocus on the ConFiQA benchmark, comparing it against strong baselines including ContextDPO, COIECD, and prompting-based methods. Furthermore, we show that our method is complementary to prompting strategies and remains effective on larger models. Extensive experiments show that ContextFocus significantly improves contextual faithfulness. Our results highlight the effectiveness, robustness, and efficiency of ContextFocus in improving the contextual faithfulness of LLM outputs.
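Activation steering itself is easy to sketch with a forward hook: add a fixed direction to one layer's activations at inference time, with no finetuning. The tiny model, layer choice, and steering vector below are illustrative stand-ins; deriving a faithfulness-promoting direction is the part specific to ContextFocus and is not shown.

```python
# Activation steering via a PyTorch forward hook (illustrative stand-ins).
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8))
steering_vector = torch.randn(8) * 0.1   # hypothetical "faithfulness" direction
alpha = 4.0                              # steering strength

def steer(module, inputs, output):
    # Add the steering direction to this layer's activations at inference
    # time: no finetuning, and negligible extra compute.
    return output + alpha * steering_vector

handle = model[0].register_forward_hook(steer)
x = torch.randn(1, 8)
steered = model(x)
handle.remove()                           # restore the unmodified model
print(torch.allclose(model(x), steered))  # False: steering changed the output
```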
[100] SpeakerSleuth: Evaluating Large Audio-Language Models as Judges for Multi-turn Speaker Consistency
Jonggeun Lee, Junseong Pyo, Gyuhyeon Seo, Yohan Jo
Main category: cs.CL
TL;DR: LALMs struggle to reliably judge speaker consistency in multi-turn dialogues, showing bias toward text over acoustics despite having inherent acoustic discrimination capabilities.
Details
Motivation: While Large Audio-Language Models (LALMs) are increasingly used as judges for speech generation quality, their ability to assess speaker consistency across multi-turn conversations remains unexplored, creating a gap in understanding their reliability for real-world dialogue evaluation.
Method: Created SpeakerSleuth benchmark with 1,818 human-verified evaluation instances across four diverse datasets (synthetic and real speech) with controlled acoustic difficulty. Evaluated nine widely-used LALMs through three tasks reflecting real-world requirements for speaker consistency assessment.
Result: LALMs struggle with acoustic inconsistency detection: some overpredict inconsistency while others are overly lenient. Performance degrades dramatically when other interlocutors’ turns are provided, as models prioritize textual coherence over acoustic cues (failing to detect even obvious gender switches). However, models perform substantially better at choosing audio that best matches a speaker among acoustic variants.
Conclusion: LALMs exhibit significant modality bias, prioritizing text over acoustics, revealing fundamental modality imbalances that must be addressed to build reliable audio-language judges for speaker consistency evaluation in multi-turn dialogues.
Abstract: Large Audio-Language Models (LALMs) as judges have emerged as a prominent approach for evaluating speech generation quality, yet their ability to assess speaker consistency across multi-turn conversations remains unexplored. We present SpeakerSleuth, a benchmark evaluating whether LALMs can reliably judge speaker consistency in multi-turn dialogues through three tasks reflecting real-world requirements. We construct 1,818 human-verified evaluation instances across four diverse datasets spanning synthetic and real speech, with controlled acoustic difficulty. Evaluating nine widely-used LALMs, we find that models struggle to reliably detect acoustic inconsistencies. For instance, given audio samples of the same speaker’s turns, some models overpredict inconsistency, whereas others are overly lenient. Models further struggle to identify the exact turns that are problematic. When other interlocutors’ turns are provided together, performance degrades dramatically as models prioritize textual coherence over acoustic cues, failing to detect even obvious gender switches for a speaker. On the other hand, models perform substantially better in choosing the audio that best matches the speaker among several acoustic variants, demonstrating inherent acoustic discrimination capabilities. These findings expose a significant bias in LALMs: they tend to prioritize text over acoustics, revealing fundamental modality imbalances that need to be addressed to build reliable audio-language judges.
[101] Analyzing and Improving Cross-lingual Knowledge Transfer for Machine Translation
David Stap
Main category: cs.CL
TL;DR: This thesis studies cross-lingual knowledge transfer in neural models, focusing on improving robustness and generalization in multilingual machine translation, especially for low-resource languages.
Details
Motivation: Multilingual machine translation systems face challenges in learning effective cross-lingual representations, particularly for low-resource languages with limited parallel data. Understanding how multilingual models share knowledge across languages is crucial for building more inclusive and resilient NLP systems.
Method: The research uses machine translation as a central testbed to analyze language similarity effects on transfer, employs retrieval and auxiliary supervision to strengthen low-resource translation, examines fine-tuning trade-offs in large language models, and studies the role of language diversity during training.
Result: The work shows that increasing translation coverage improves generalization and reduces off-target behavior, while revealing how modeling choices and data composition shape multilingual learning outcomes.
Conclusion: This thesis provides insights toward more inclusive and resilient multilingual NLP systems by highlighting how cross-lingual knowledge transfer can be optimized through careful consideration of language similarity, training strategies, and data composition.
Abstract: Multilingual machine translation systems aim to make knowledge accessible across languages, yet learning effective cross-lingual representations remains challenging. These challenges are especially pronounced for low-resource languages, where limited parallel data constrains generalization and transfer. Understanding how multilingual models share knowledge across languages requires examining the interaction between representations, data availability, and training strategies. In this thesis, we study cross-lingual knowledge transfer in neural models and develop methods to improve robustness and generalization in multilingual settings, using machine translation as a central testbed. We analyze how similarity between languages influences transfer, how retrieval and auxiliary supervision can strengthen low-resource translation, and how fine-tuning on parallel data can introduce unintended trade-offs in large language models. We further examine the role of language diversity during training and show that increasing translation coverage improves generalization and reduces off-target behavior. Together, this work highlights how modeling choices and data composition shape multilingual learning and offers insights toward more inclusive and resilient multilingual NLP systems.
[102] When Helpers Become Hazards: A Benchmark for Analyzing Multimodal LLM-Powered Safety in Daily Life
Xinyue Lou, Jinan Xu, Jingyi Yin, Xiaolong Wang, Zhaolu Kang, Youwei Liao, Yixuan Wang, Xiangyu Shi, Fengran Mo, Su Yao, Kaiyu Huang
Main category: cs.CL
TL;DR: SaLAD is a multimodal safety benchmark with 2,013 real-world image-text samples across 10 categories, designed to evaluate MLLM safety in daily life scenarios. It reveals that top MLLMs achieve only 57.2% safe response rate on unsafe queries, showing significant safety vulnerabilities.
Details
Motivation: As MLLMs become ubiquitous assistants, their potential to generate unsafe content poses real dangers to human behavior and society. Current safety evaluations are insufficient for assessing how MLLMs handle realistic multimodal safety scenarios in daily life.
Method: Created SaLAD benchmark with 2,013 real-world image-text samples across 10 common categories, balanced between unsafe scenarios and oversensitivity cases. Proposed safety-warning-based evaluation framework that encourages informative safety warnings rather than generic refusals.
Result: Tested 18 MLLMs - top-performing models achieved only 57.2% safe response rate on unsafe queries. Popular safety alignment methods showed limited effectiveness in this realistic scenario, revealing significant vulnerabilities in current MLLMs.
Conclusion: Current MLLMs have serious safety vulnerabilities in identifying dangerous behaviors in daily life. The SaLAD benchmark provides a realistic evaluation framework that reveals these gaps, highlighting the need for improved multimodal safety alignment methods.
Abstract: As Multimodal Large Language Models (MLLMs) become indispensable assistants in human life, the unsafe content they generate poses a danger to human behavior, hanging over society like a sword of Damocles. To investigate and evaluate the safety impact of MLLM responses on human behavior in daily life, we introduce SaLAD, a multimodal safety benchmark that contains 2,013 real-world image-text samples across 10 common categories, with a balanced design covering both unsafe scenarios and cases of oversensitivity. It emphasizes realistic risk exposure, authentic visual inputs, and fine-grained cross-modal reasoning, ensuring that safety risks cannot be inferred from text alone. We further propose a safety-warning-based evaluation framework that encourages models to provide clear and informative safety warnings, rather than generic refusals. Results on 18 MLLMs demonstrate that the top-performing models achieve a safe response rate of only 57.2% on unsafe queries. Moreover, even popular safety alignment methods offer only limited protection in our scenario, revealing the vulnerabilities of current MLLMs in identifying dangerous behaviors in daily life. Our dataset is available at https://github.com/xinyuelou/SaLAD.
[103] Modular Prompt Optimization: Optimizing Structured Prompts with Section-Local Textual Gradients
Prith Sharma, Austin Z. Henley
Main category: cs.CL
TL;DR: MPO is a modular prompt optimization framework that treats prompts as structured objects with semantic sections, applying section-local textual gradients to refine each part independently while maintaining fixed schema.
Details
Motivation: Current prompt optimization methods treat prompts as monolithic blocks, making it difficult to localize errors, preserve critical instructions, or prevent uncontrolled prompt growth. There's a need for more structured, interpretable optimization approaches.
Method: Modular Prompt Optimization (MPO) treats prompts as structured objects with semantic sections (system role, context, task description, constraints, output format). It applies section-local textual gradients from a critic LLM to refine each section independently while keeping the overall schema fixed, with de-duplication to reduce redundancy.
Result: MPO consistently outperforms untuned structured prompts and TextGrad baseline on ARC-Challenge and MMLU benchmarks using LLaMA-3 8B-Instruct and Mistral-7B-Instruct, achieving substantial accuracy gains without modifying model parameters or altering prompt structure.
Conclusion: Maintaining a fixed prompt schema while applying localized, section-wise optimization is an effective and practical approach for improving reasoning performance in small open-source language models.
Abstract: Prompt quality plays a central role in controlling the behavior, reliability, and reasoning performance of large language models (LLMs), particularly for smaller open-source instruction-tuned models that depend heavily on explicit structure. While recent work has explored automatic prompt optimization using textual gradients and self-refinement, most existing methods treat prompts as monolithic blocks of text, making it difficult to localize errors, preserve critical instructions, or prevent uncontrolled prompt growth. We introduce Modular Prompt Optimization (MPO), a schema-based prompt optimization framework that treats prompts as structured objects composed of fixed semantic sections, including system role, context, task description, constraints, and output format. MPO applies section-local textual gradients, generated by a critic language model, to refine each section independently while keeping the overall prompt schema fixed. Section updates are consolidated through de-duplication to reduce redundancy and interference between components, yielding an interpretable and robust optimization process. We evaluate MPO on two reasoning benchmarks, ARC-Challenge and MMLU, using LLaMA-3 8B-Instruct and Mistral-7B-Instruct as solver models. Across both benchmarks and models, MPO consistently outperforms an untuned structured prompt and the TextGrad baseline, achieving substantial accuracy gains without modifying model parameters or altering prompt structure. These results demonstrate that maintaining a fixed prompt schema while applying localized, section-wise optimization is an effective and practical approach for improving reasoning performance in small open-source LMs.
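A minimal sketch of section-local refinement over a fixed schema follows. The one-line critic is a stand-in for the critic LLM that emits a textual gradient per section; the section names follow the paper, while the de-duplication and refinement details are simplified assumptions.

```python
# Section-local prompt refinement over a fixed schema (toy critic).
SCHEMA = ["system_role", "context", "task_description", "constraints", "output_format"]

def dedupe(text):
    """Drop exact-duplicate sentences accumulated across refinement rounds."""
    seen, kept = set(), []
    for sent in (s.strip() for s in text.split(".")):
        if sent and sent not in seen:
            seen.add(sent)
            kept.append(sent)
    return ". ".join(kept) + "."

def refine(prompt, critic, rounds=2):
    """Refine each section independently; the schema itself never changes."""
    for _ in range(rounds):
        for section in SCHEMA:
            critique = critic(section, prompt[section])  # section-local gradient
            if critique:
                prompt[section] = dedupe(prompt[section] + " " + critique)
    return prompt

prompt = {s: f"Initial {s}." for s in SCHEMA}
critic = lambda sec, txt: "Answer with a single letter." if sec == "output_format" else ""
print(refine(prompt, critic)["output_format"])
# 'Initial output_format. Answer with a single letter.' (duplicate suppressed)
```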
[104] Bridging the Discrete-Continuous Gap: Unified Multimodal Generation via Coupled Manifold Discrete Absorbing Diffusion
Yuanfeng Xu, Yuhao Chen, Liang Lin, Guangrun Wang
Main category: cs.CL
TL;DR: CoM-DAD is a unified multimodal framework that combines continuous latent diffusion for semantic planning with discrete absorbing diffusion for token synthesis, enabling stable text-image generation without heavy contrastive encoders.
Details
Motivation: Current generative modeling is bifurcated: autoregressive models for discrete data (text) and diffusion models for continuous data (images), which hinders development of truly unified multimodal systems. Masked Language Models (MLMs) lack generative fidelity and semantic continuity, and extending masked generation to multimodal settings introduces alignment challenges and training instability.
Method: CoM-DAD (Coupled Manifold Discrete Absorbing Diffusion) reformulates multimodal generation as a hierarchical dual-process: 1) models semantic manifold via continuous latent diffusion process, 2) treats token generation as discrete absorbing diffusion process regulated by Variable-Rate Noise Schedule, conditioned on evolving semantic priors. Uses Stochastic Mixed-Modal Transport strategy to align disparate modalities without heavy contrastive dual-encoders.
Result: The method demonstrates superior stability over standard masked modeling and establishes a new paradigm for scalable, unified text-image generation.
Conclusion: CoM-DAD provides a novel probabilistic framework that decouples high-level semantic planning from low-level token synthesis, enabling stable and scalable unified text-image generation without the alignment challenges of traditional masked modeling approaches.
Abstract: The bifurcation of generative modeling into autoregressive approaches for discrete data (text) and diffusion approaches for continuous data (images) hinders the development of truly unified multimodal systems. While Masked Language Models (MLMs) offer efficient bidirectional context, they traditionally lack the generative fidelity of autoregressive models and the semantic continuity of diffusion models. Furthermore, extending masked generation to multimodal settings introduces severe alignment challenges and training instability. In this work, we propose CoM-DAD (Coupled Manifold Discrete Absorbing Diffusion), a novel probabilistic framework that reformulates multimodal generation as a hierarchical dual-process. CoM-DAD decouples high-level semantic planning from low-level token synthesis. First, we model the semantic manifold via a continuous latent diffusion process; second, we treat token generation as a discrete absorbing diffusion process, regulated by a Variable-Rate Noise Schedule, conditioned on these evolving semantic priors. Crucially, we introduce a Stochastic Mixed-Modal Transport strategy that aligns disparate modalities without requiring heavy contrastive dual-encoders. Our method demonstrates superior stability over standard masked modeling, establishing a new paradigm for scalable, unified text-image generation.
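The discrete half of the dual process can be pictured in isolation. Below is a minimal forward absorbing-diffusion corruption with a variable-rate schedule; the continuous latent prior, the conditioning, and the learned reverse process are all omitted, and the quadratic schedule is purely an assumption for illustration.

```python
import random

MASK = "[MASK]"  # the absorbing state

def absorbing_forward(tokens, t, schedule):
    """Forward absorbing diffusion: each token independently falls into the
    absorbing [MASK] state with probability schedule(t), for t in [0, 1]."""
    p = schedule(t)
    return [MASK if random.random() < p else tok for tok in tokens]

# Toy variable-rate schedule (assumption): masking accelerates late.
var_rate = lambda t: t ** 2

random.seed(0)
x = "a red bird on a bare branch".split()
for t in (0.25, 0.5, 0.9):
    print(f"t={t}:", absorbing_forward(x, t, var_rate))
```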
[105] KDCM: Reducing Hallucination in LLMs through Explicit Reasoning Structures
Jinbo Hao, Kai Yang, Qingzhen Su, Yifan Li, Chao Jiang
Main category: cs.CL
TL;DR: A framework using code-guided knowledge graph exploration to reduce LLM hallucinations by embedding executable modules in prompts to regulate reasoning steps.
Details
Motivation: To mitigate hallucinations in large language models, particularly those induced by prompts, by improving contextual modeling and constraining erroneous reasoning.
Method: Extends chain-style knowledge distillation with a programmable module embedded as executable code in reasoning prompts, guiding knowledge graph exploration and regulating intermediate reasoning steps.
Result: Significant improvements: HIT@1, HIT@3, and HIT@5 increased by 15.64%, 13.38%, and 13.28% respectively, with scores exceeding 95% across multiple benchmarks using GPT-4 and LLaMA-3.3.
Conclusion: The code-guided reasoning framework effectively reduces prompt-induced hallucinations while improving both accuracy and interpretability by constraining erroneous reasoning.
Abstract: To mitigate hallucinations in large language models (LLMs), we propose a framework that focuses on errors induced by prompts. Our method extends a chain-style knowledge distillation approach by incorporating a programmable module that guides knowledge graph exploration. This module is embedded as executable code within the reasoning prompt, allowing the model to leverage external structured knowledge during inference. Based on this design, we develop an enhanced distillation-based reasoning framework that explicitly regulates intermediate reasoning steps, resulting in more reliable predictions. We evaluate the proposed approach on multiple public benchmarks using GPT-4 and LLaMA-3.3. Experimental results show that code-guided reasoning significantly improves contextual modeling and reduces prompt-induced hallucinations. Specifically, HIT@1, HIT@3, and HIT@5 increase by 15.64%, 13.38%, and 13.28%, respectively, with scores exceeding 95% across several evaluation settings. These findings indicate that the proposed method effectively constrains erroneous reasoning while improving both accuracy and interpretability.
[106] SearchAttack: Red-Teaming LLMs against Real-World Threats via Framing Unsafe Web Information-Seeking Tasks
Yu Yan, Sheng Sun, Mingfeng Li, Zheming Yang, Chiwei Zhu, Fei Ma, Benfeng Xu, Min Liu
Main category: cs.CL
TL;DR: SearchAttack is a red-teaming method that exploits web search as an attack surface for search-augmented LLMs by outsourcing harmful content to search engines while providing only query skeletons and fragmented clues to bypass safeguards.
Details
Motivation: The motivation stems from the vulnerability of search-augmented LLMs when search engines are triggered for harmful tasks. Once search returns ready-to-use harmful content, LLM safeguards cannot withdraw that exposure, creating a critical attack surface that needs to be addressed through responsible vulnerability assessment.
Method: SearchAttack outsources harmful semantics to web search while retaining only the query's skeleton and fragmented clues. It then steers LLMs to reconstruct the retrieved content via structural rubrics to achieve malicious goals, effectively bypassing the LLM's safety mechanisms.
Result: Extensive experiments show that SearchAttack demonstrates strong effectiveness in attacking search-augmented LLMs, successfully exploiting the web search vulnerability to achieve malicious objectives despite existing safeguards.
Conclusion: Web search represents a critical attack surface for search-augmented LLMs, and SearchAttack effectively demonstrates this vulnerability, highlighting the need for improved safeguards in such systems for responsible AI development.
Abstract: Recently, users have grown increasingly aware of the unreliability of LLMs on open and knowledge-intensive tasks, and have turned to search-augmented LLMs to mitigate this issue. However, when the search engine is triggered for harmful tasks, the outcome is no longer under the LLM's control. Once the returned content directly contains targeted, ready-to-use harmful takeaways, the LLM's safeguards cannot withdraw that exposure. Motivated by this dilemma, we identify web search as a critical attack surface and propose SearchAttack for red-teaming. SearchAttack outsources the harmful semantics to web search, retaining only the query's skeleton and fragmented clues, and further steers LLMs to reconstruct the retrieved content via structural rubrics to achieve malicious goals. Extensive experiments are conducted to red-team the search-augmented LLMs for responsible vulnerability assessment. Empirically, SearchAttack demonstrates strong effectiveness in attacking these systems.
[107] LLMberjack: Guided Trimming of Debate Trees for Multi-Party Conversation Creation
Leonardo Bottona, Nicolò Penzo, Bruno Lepri, Marco Guerini, Sara Tonelli
Main category: cs.CL
TL;DR: LLMberjack is an interactive platform for converting debate reply trees into coherent multi-party conversations with LLM-assisted editing and visualization tools.
Details
Motivation: Addresses the lack of resources for creating multi-party conversations from existing debates, which are typically structured as reply trees rather than linear dialogues.
Method: Provides an interactive interface with tree visualization to construct linearized dialogue sequences while preserving participant identity and discourse relations, with optional LLM assistance for automatic message editing and speaker description generation. (A toy sketch of the linearization step follows the abstract below.)
Result: Tree visualization facilitates creation of coherent conversation threads, and LLM support enhances output quality while reducing human effort in conversation generation.
Conclusion: LLMberjack is an open-source tool that promotes transparent and reproducible workflows for creating multi-party conversations from debate data, filling a resource gap in this area.
Abstract: We present LLMberjack, a platform for creating multi-party conversations starting from existing debates, originally structured as reply trees. The system offers an interactive interface that visualizes discussion trees and enables users to construct coherent linearized dialogue sequences while preserving participant identity and discourse relations. It integrates optional large language model (LLM) assistance to support automatic editing of the messages and speakers’ descriptions. We demonstrate the platform’s utility by showing how tree visualization facilitates the creation of coherent, meaningful conversation threads and how LLM support enhances output quality while reducing human effort. The tool is open-source and designed to promote transparent and reproducible workflows to create multi-party conversations, addressing a lack of resources of this type.
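A minimal sketch of the underlying linearization task: walk a reply tree and emit each root-to-leaf thread as a speaker-tagged dialogue. The Node data model here is an assumption; the actual tool is interactive and lets users trim and edit the tree along the way.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    speaker: str
    text: str
    replies: list = field(default_factory=list)

def linearize(node, path=()):
    """Yield every root-to-leaf path as a list of (speaker, text) turns."""
    path = path + ((node.speaker, node.text),)
    if not node.replies:
        yield list(path)
    for child in node.replies:
        yield from linearize(child, path)

tree = Node("A", "Cats are better.", [
    Node("B", "Dogs disagree.", [Node("A", "Fair point.")]),
    Node("C", "Depends on the cat."),
])
for thread in linearize(tree):
    print(" -> ".join(f"{s}: {t}" for s, t in thread))
```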
[108] FLEx: Language Modeling with Few-shot Language Explanations
Adar Avsian, Christopher Richardson, Anirudh Sundar, Larry Heck
Main category: cs.CL
TL;DR: FLEx improves language model performance by using a few explanatory examples to create a prompt prefix that guides the model to avoid repeating similar errors, outperforming chain-of-thought prompting across multiple datasets.
Details
Motivation: Language models still make mistakes that are often repeated across related queries. While natural language explanations can help correct errors, collecting them at scale is infeasible, especially in domains requiring expert annotators.
Method: FLEx selects representative model errors using embedding-based clustering, verifies that associated explanations correct those errors, and summarizes them into a prompt prefix that is prepended at inference-time without modifying model weights. (A toy sketch of the selection step follows the abstract below.)
Result: FLEx consistently outperforms chain-of-thought prompting across CounterBench, GSM8K, and ReasonIF datasets, reducing up to 83% of CoT’s remaining errors.
Conclusion: FLEx provides an effective method for improving model behavior using few-shot explanations without weight modification, addressing the scalability issue of collecting expert annotations for error correction.
Abstract: Language models have become effective at a wide range of tasks, from math problem solving to open-domain question answering. However, they still make mistakes, and these mistakes are often repeated across related queries. Natural language explanations can help correct these errors, but collecting them at scale may be infeasible, particularly in domains where expert annotators are required. To address this issue, we introduce FLEx (Few-shot Language Explanations), a method for improving model behavior using a small number of explanatory examples. FLEx selects representative model errors using embedding-based clustering, verifies that the associated explanations correct those errors, and summarizes them into a prompt prefix that is prepended at inference-time. This summary guides the model to avoid similar errors on new inputs, without modifying model weights. We evaluate FLEx on CounterBench, GSM8K, and ReasonIF. We find that FLEx consistently outperforms chain-of-thought (CoT) prompting across all three datasets and reduces up to 83% of CoT's remaining errors.
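A minimal sketch of FLEx's selection and summarization steps, assuming a list of (question, wrong_answer, explanation) error records and an embed() function from any sentence-embedding model; the verification step that checks an explanation actually corrects its error is omitted.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_representative_errors(errors, embed, k=2):
    """Cluster errors in embedding space; pick the record nearest each
    centroid as the representative of that error mode."""
    X = np.stack([embed(q + " " + a) for q, a, _ in errors])
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    reps = []
    for c in range(k):
        idx = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(X[idx] - km.cluster_centers_[c], axis=1)
        reps.append(errors[idx[np.argmin(dists)]])
    return reps

def build_prefix(reps):
    """Summarize the verified explanations into an inference-time prefix."""
    lines = [f"- Avoid this mistake: {expl}" for _, _, expl in reps]
    return "Known failure modes to avoid:\n" + "\n".join(lines)

# Toy embedding and error records so the sketch runs end to end.
toy_embed = lambda s: np.array([len(s), s.count("+"), s.count("?")], float)
errs = [("2+2?", "5", "Check the arithmetic."),
        ("2+3?", "6", "Check the arithmetic."),
        ("Capital of France?", "Lyon", "Recall capitals, not large cities."),
        ("Capital of Japan?", "Osaka", "Recall capitals, not large cities.")]
print(build_prefix(select_representative_errors(errs, toy_embed)))
```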
[109] All That Glisters Is Not Gold: A Benchmark for Reference-Free Counterfactual Financial Misinformation Detection
Yuechen Jiang, Zhiwei Liu, Yupeng Cao, Yueru He, Ziyang Xu, Chen Xu, Zhiyang Deng, Prayag Tiwari, Xi Chen, Alejandro Lopez-Lira, Jimin Huang, Junichi Tsujii, Sophia Ananiadou
Main category: cs.CL
TL;DR: RFC Bench is a benchmark for evaluating LLMs on financial misinformation detection in realistic news contexts, showing models perform better with comparative context than reference-free detection.
Details
Motivation: There's a need to evaluate LLMs on financial misinformation detection in realistic news settings where meaning emerges from dispersed contextual cues, and current benchmarks may not capture this complexity.
Method: Created the RFC Bench benchmark operating at paragraph level with two complementary tasks: 1) reference-free misinformation detection, and 2) comparison-based diagnosis using paired original-perturbed inputs.
Result: Models perform substantially better with comparative context than in reference-free settings, which expose significant weaknesses including unstable predictions and elevated invalid outputs.
Conclusion: Current LLMs struggle to maintain coherent belief states without external grounding, and RFC Bench provides a structured testbed for studying reference-free reasoning and advancing reliable financial misinformation detection.
Abstract: We introduce RFC Bench, a benchmark for evaluating large language models on financial misinformation under realistic news. RFC Bench operates at the paragraph level and captures the contextual complexity of financial news where meaning emerges from dispersed cues. The benchmark defines two complementary tasks: reference-free misinformation detection and comparison-based diagnosis using paired original-perturbed inputs. Experiments reveal a consistent pattern: performance is substantially stronger when comparative context is available, while reference-free settings expose significant weaknesses, including unstable predictions and elevated invalid outputs. These results indicate that current models struggle to maintain coherent belief states without external grounding. By highlighting this gap, RFC Bench provides a structured testbed for studying reference-free reasoning and advancing more reliable financial misinformation detection in real-world settings.
[110] InsertGNN: Can Graph Neural Networks Outperform Humans in TOEFL Sentence Insertion Problem?
Fang Wu, Stan Z. Li
Main category: cs.CL
TL;DR: InsertGNN uses hierarchical Graph Neural Networks to solve sentence integration problems, achieving 70% accuracy on TOEFL dataset, matching average human performance.
Details
Motivation: Sentence integration is an important but understudied NLP challenge. Existing methods for sentence arrangement, textual consistency, and question answering are insufficient for addressing sentence integration problems.
Method: InsertGNN conceptualizes sentence integration as a graph problem and employs a hierarchical Graph Neural Network (GNN) to understand the interplay between sentences. The approach uses cross-domain learning for validation on the arXiv dataset.
Result: InsertGNN achieves 70% accuracy on TOEFL dataset, which matches average human test scores. The method demonstrates superiority over all comparative benchmarks through rigorous experimentation.
Conclusion: InsertGNN effectively addresses the sentence integration problem using graph-based hierarchical GNNs, achieving human-level performance and outperforming existing methods.
Abstract: The integration of sentences poses an intriguing challenge within the realm of NLP, but it has not garnered the attention it deserves. Existing methods that focus on sentence arrangement, textual consistency, and question answering are inadequate in addressing this issue. To bridge this gap, we introduce InsertGNN, which conceptualizes the problem as a graph and employs a hierarchical Graph Neural Network (GNN) to comprehend the interplay between sentences. Our approach was rigorously evaluated on a TOEFL dataset, and its efficacy was further validated on the expansive arXiv dataset using cross-domain learning. Thorough experimentation unequivocally establishes InsertGNN’s superiority over all comparative benchmarks, achieving an impressive 70% accuracy, a performance on par with average human test scores.
[111] A Systematic Comparison between Extractive Self-Explanations and Human Rationales in Text Classification
Stephanie Brandl, Oliver Eberle
Main category: cs.CL
TL;DR: LLM self-explanations are evaluated for plausibility and faithfulness across text classification tasks, showing task-dependent alignment with human rationales and different explanation strategies compared to post-hoc methods.
Details
Motivation: To determine whether instruction-tuned LLMs' ability to generate self-explanations actually produces good explanations, and to evaluate their plausibility to humans compared to human-annotated rationales.
Method: Evaluated self-explanations as input rationales across three text classification tasks (sentiment classification, forced labour detection, claim verification) with Danish/Italian translations. Collected human rationale annotations for the Climate-Fever dataset. Compared self-explanations to human annotations and post-hoc attribution-based explanations. Analyzed four open-weight LLMs. (A toy plausibility metric follows the abstract below.)
Result: Alignment between self-explanations and human rationales depends on text length and task complexity. Self-explanations yield faithful subsets of token-level rationales, while post-hoc attribution methods emphasize structural/formatting tokens, showing fundamentally different explanation strategies.
Conclusion: LLM self-explanations have varying plausibility to humans depending on task characteristics, but provide faithful explanations that differ strategically from post-hoc attribution methods, which focus on different aspects of the input.
Abstract: Instruction-tuned LLMs are able to provide an explanation about their output to users by generating self-explanations, without requiring the application of complex interpretability techniques. In this paper, we analyse whether this ability results in a good explanation. We evaluate self-explanations in the form of input rationales with respect to their plausibility to humans. We study three text classification tasks: sentiment classification, forced labour detection and claim verification. We include Danish and Italian translations of the sentiment classification task and compare self-explanations to human annotations. For this, we collected human rationale annotations for Climate-Fever, a claim verification dataset. We furthermore evaluate the faithfulness of human and self-explanation rationales with respect to correct model predictions, and extend the study by incorporating post-hoc attribution-based explanations. We analyse four open-weight LLMs and find that alignment between self-explanations and human rationales highly depends on text length and task complexity. Nevertheless, self-explanations yield faithful subsets of token-level rationales, whereas post-hoc attribution methods tend to emphasize structural and formatting tokens, reflecting fundamentally different explanation strategies.
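For concreteness, the plausibility of an extractive rationale against a human annotation is commonly scored with token-level F1; a minimal version follows. This is a standard metric choice, not necessarily the paper's exact formulation.

```python
def rationale_f1(pred_tokens, human_tokens):
    """Token-level F1 between a model rationale and a human rationale."""
    pred, gold = set(pred_tokens), set(human_tokens)
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

human = ["forced", "to", "work", "unpaid"]
model = ["work", "unpaid", "long", "hours"]
print(rationale_f1(model, human))  # 0.5
```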
[112] SSSD: Simply-Scalable Speculative Decoding
Michele Marzollo, Jiawei Zhuang, Niklas Roemer, Niklas Zwingenberger, Lorenz K. Müller, Lukas Cavigelli
Main category: cs.CL
TL;DR: SSSD is a training-free speculative decoding method using n-gram matching and hardware-aware speculation that achieves up to 2.9x latency reduction without needing draft model training or tuning.
Details
Motivation: Existing speculative decoding methods either provide only modest speed improvements or require additional trained draft models/auxiliary components, increasing deployment complexity and reducing flexibility for shifting workloads across tasks, domains, or languages.
Method: Simply-Scalable Speculative Decoding (SSSD) combines lightweight n-gram matching with hardware-aware speculation. It's training-free, requiring no data preparation, training, or tuning. (A toy n-gram drafter follows the abstract below.)
Result: SSSD reduces latency by up to 2.9x relative to standard autoregressive decoding. It achieves performance on par with leading training-based approaches across benchmarks while requiring substantially lower adoption effort and exhibiting superior robustness under language/domain shift and in long-context settings.
Conclusion: SSSD provides a practical, training-free alternative to existing speculative decoding methods that balances substantial speed improvements with deployment simplicity and robustness to workload shifts.
Abstract: Speculative Decoding has emerged as a popular technique for accelerating inference in Large Language Models. However, most existing approaches yield only modest improvements in production serving systems. Methods that achieve substantial speedups typically rely on an additional trained draft model or auxiliary model components, increasing deployment and maintenance complexity. This added complexity reduces flexibility, particularly when serving workloads shift to tasks, domains, or languages that are not well represented in the draft model's training data. We introduce Simply-Scalable Speculative Decoding (SSSD), a training-free method that combines lightweight n-gram matching with hardware-aware speculation. Relative to standard autoregressive decoding, SSSD reduces latency by up to 2.9x. It achieves performance on par with leading training-based approaches across a broad range of benchmarks, while requiring substantially lower adoption effort (no data preparation, training, or tuning is needed) and exhibiting superior robustness under language and domain shift, as well as in long-context settings.
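A minimal sketch of the n-gram drafting component: index the prompt plus the tokens generated so far by n-gram, then propose a draft by copying what followed the most recent occurrence of the current suffix. SSSD's hardware-aware speculation-length control and the verification pass that accepts or rejects draft tokens are omitted.

```python
from collections import defaultdict

def build_index(tokens, n=2):
    """Map each n-gram to the positions that immediately follow it."""
    index = defaultdict(list)
    for i in range(len(tokens) - n):
        index[tuple(tokens[i:i + n])].append(i + n)
    return index

def propose_draft(tokens, index, n=2, k=4):
    """Speculate up to k tokens by copying what followed the most recent
    earlier occurrence of the current n-gram suffix."""
    key = tuple(tokens[-n:])
    if key not in index:
        return []
    start = index[key][-1]
    return tokens[start:start + k]

ctx = "the cat sat on the mat and the cat sat on".split()
print(propose_draft(ctx, build_index(ctx)))  # ['the', 'mat', 'and', 'the']
```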
[113] Exploring Iterative Controllable Summarization with Large Language Models
Sangwon Ryu, Heejin Do, Daehee Kim, Hwanjo Yu, Dongwoo Kim, Yunsu Kim, Gary Geunbae Lee, Jungseul Ok
Main category: cs.CL
TL;DR: The paper proposes a guide-to-explain (GTE) framework to improve LLMs’ controllability in summarization by enabling self-explanation of attribute misalignments, achieving better results with fewer iterations.
Details
Motivation: LLMs show strong abstractive summarization performance but lack precise control over summary attributes (length, topic, etc.), limiting their adaptability to specific user preferences. Current approaches don't adequately address controllability evaluation and improvement.
Method: Introduces iterative evaluation metrics (failure rate, average iteration count) to assess controllability, then proposes a guide-to-explain (GTE) framework where models identify misaligned attributes in initial drafts and self-explain errors to generate better-adjusted summaries. (A toy version of the loop follows the abstract below.)
Result: LLMs struggle more with numerical attributes than linguistic ones. GTE framework enables models to generate summaries satisfying desired attributes with robust effectiveness, requiring surprisingly fewer iterations than other iterative approaches.
Conclusion: The proposed GTE framework effectively improves LLM controllability in summarization through self-explanation of attribute misalignments, offering a practical solution for generating customized summaries that better match user preferences.
Abstract: Large language models (LLMs) have demonstrated remarkable performance in abstractive summarization tasks. However, their ability to precisely control summary attributes (e.g., length or topic) remains underexplored, limiting their adaptability to specific user preferences. In this paper, we systematically explore the controllability of LLMs. To this end, we revisit summary attribute measurements and introduce iterative evaluation metrics (failure rate and average iteration count) to precisely evaluate the controllability of LLMs, rather than merely assessing errors. Our findings show that LLMs struggle more with numerical attributes than with linguistic attributes. To address this challenge, we propose a guide-to-explain framework (GTE) for controllable summarization. Our GTE framework enables the model to identify misaligned attributes in the initial draft and guides it in self-explaining errors in the previous output. By allowing the model to reflect on its misalignment, GTE generates well-adjusted summaries that satisfy the desired attributes with robust effectiveness, requiring surprisingly fewer iterations than other iterative approaches.
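A minimal sketch of the guide-to-explain loop for one numerical attribute (length), with generate() and explain_and_revise() as stand-ins for the underlying LLM calls. The paper's two metrics fall out of the loop directly: a run that exits without success counts toward the failure rate, and the returned iteration count feeds the average.

```python
def controlled_summary(doc, target_len, generate, explain_and_revise,
                       tol=5, max_iters=4):
    """Iterate until the word count is within tol of target_len.
    Returns (summary, iterations_used, success)."""
    summary = generate(doc, target_len)
    for it in range(1, max_iters + 1):
        n = len(summary.split())
        if abs(n - target_len) <= tol:
            return summary, it, True
        # Guide the model to name the misaligned attribute, then let it
        # self-explain the error before producing an adjusted draft.
        guide = f"The draft has {n} words; the target is {target_len}."
        summary = explain_and_revise(doc, summary, guide)
    return summary, max_iters, False  # counted toward the failure rate

# Toy stand-ins: a generator that overshoots, a reviser that trims.
gen = lambda doc, n: " ".join(doc.split()[:n + 12])
rev = lambda doc, s, guide: " ".join(s.split()[:30])
_, iters, ok = controlled_summary(" ".join(f"w{i}" for i in range(100)),
                                  30, gen, rev)
print(iters, ok)  # 2 True
```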
[114] Can Large Language Models Identify Implicit Suicidal Ideation? An Empirical Evaluation
Tong Li, Shu Yang, Junchao Wu, Jiyao Wei, Lijie Hu, Mengdi Li, Derek F. Wong, Joshua R. Oltmanns, Di Wang
Main category: cs.CL
TL;DR: LLMs struggle with suicide prevention tasks, particularly identifying implicit suicidal ideation and providing appropriate support, despite being tested on a comprehensive dataset built on psychological frameworks.
Details
Motivation: To evaluate LLMs' capabilities in sensitive mental health applications, specifically suicide prevention, where accurate identification of implicit suicidal ideation and provision of appropriate support are critical but challenging tasks.
Method: Created a novel dataset of 1,308 test cases based on psychological frameworks (D/S-IAT and Negative Automatic Thinking) and real-world scenarios. Conducted extensive experiments with 8 widely used LLMs under different contextual settings to assess their performance on IIS (Identification of Implicit Suicidal ideation) and PAS (Provision of Appropriate Supportive responses).
Result: Current LLMs struggle significantly with both detecting implicit suicidal ideation and providing appropriate supportive responses, revealing crucial limitations in applying these models to mental health contexts.
Conclusion: There is a pressing need for more sophisticated approaches in developing and evaluating LLMs for sensitive psychological applications like suicide prevention, as current models fall short in critical areas of mental health support.
Abstract: We present a comprehensive evaluation framework for assessing Large Language Models' (LLMs) capabilities in suicide prevention, focusing on two critical aspects: the Identification of Implicit Suicidal ideation (IIS) and the Provision of Appropriate Supportive responses (PAS). We introduce a novel dataset of 1,308 test cases built upon psychological frameworks including D/S-IAT and Negative Automatic Thinking, alongside real-world scenarios. Through extensive experiments with 8 widely used LLMs under different contextual settings, we find that current models struggle significantly with detecting implicit suicidal ideation and providing appropriate support, highlighting crucial limitations in applying LLMs to mental health contexts. Our findings underscore the need for more sophisticated approaches in developing and evaluating LLMs for sensitive psychological applications.
[115] Detecting PTSD in Clinical Interviews: A Comparative Analysis of NLP Methods and Large Language Models
Feng Chen, Dror Ben-Zeev, Gillian Sparks, Arya Kadakia, Trevor Cohen
Main category: cs.CL
TL;DR: This study evaluates NLP methods for PTSD detection from clinical interviews, finding domain-specific models and SentenceBERT embeddings perform best, with LLM prompting showing promise for scalable screening.
Details
Motivation: PTSD remains underdiagnosed in clinical settings, creating opportunities for automated detection to identify patients who might otherwise go undiagnosed.
Method: Compared multiple NLP approaches: general and mental health-specific transformer models (BERT/RoBERTa), embedding-based methods (SentenceBERT/LLaMA), and LLM prompting strategies (zero-shot/few-shot/chain-of-thought) using the DAIC-WOZ dataset.
Result: Domain-specific models outperformed general models (Mental-RoBERTa AUPRC=0.675 vs. RoBERTa-base 0.599). SentenceBERT embeddings with neural networks achieved highest performance (AUPRC=0.758). Few-shot prompting with DSM-5 criteria yielded competitive results (AUPRC=0.737). Performance varied by symptom severity and depression comorbidity.
Conclusion: Domain-adapted embeddings and LLMs show potential for scalable PTSD screening, but improved detection of nuanced presentations is needed for clinically viable AI tools.
Abstract: Post-Traumatic Stress Disorder (PTSD) remains underdiagnosed in clinical settings, presenting opportunities for automated detection to identify patients. This study evaluates natural language processing approaches for detecting PTSD from clinical interview transcripts. We compared general and mental health-specific transformer models (BERT/RoBERTa), embedding-based methods (SentenceBERT/LLaMA), and large language model prompting strategies (zero-shot/few-shot/chain-of-thought) using the DAIC-WOZ dataset. Domain-specific end-to-end models significantly outperformed general models (Mental-RoBERTa AUPRC=0.675+/-0.084 vs. RoBERTa-base 0.599+/-0.145). SentenceBERT embeddings with neural networks achieved the highest overall performance (AUPRC=0.758+/-0.128). Few-shot prompting using DSM-5 criteria yielded competitive results with two examples (AUPRC=0.737). Performance varied significantly across symptom severity and comorbidity status with depression, with higher accuracy for severe PTSD cases and patients with comorbid depression. Our findings highlight the potential of domain-adapted embeddings and LLMs for scalable screening while underscoring the need for improved detection of nuanced presentations and offering insights for developing clinically viable AI tools for PTSD assessment.
[116] DRA-GRPO: Your GRPO Needs to Know Diverse Reasoning Paths for Mathematical Reasoning
Xiwen Chen, Wenhui Zhu, Peijie Qiu, Xuanzhao Dong, Hao Wang, Haiyu Wu, Huayu Li, Aristeidis Sotiras, Yalin Wang, Abolfazl Razi
Main category: cs.CL
TL;DR: DRA-GRPO improves mathematical reasoning in LLMs by adding diversity-aware rewards to prevent policy collapse and better cover valid solution strategies.
Details
Motivation: Standard GRPO uses scalar correctness rewards that treat distinct reasoning paths as identical, causing a Diversity-Quality Inconsistency where policies collapse into narrow dominant modes and ignore equally valid but structurally novel strategies.
Method: Proposes Diversity-aware Reward Adjustment (DRA), a theoretically grounded framework that calibrates reward signals using the semantic density of sampled groups. Uses Submodular Mutual Information (SMI) with Inverse Propensity Scoring (IPS) to de-bias gradient estimation, creating repulsive forces against redundancy. (A toy sketch of the reward adjustment follows the abstract below.)
Result: DRA-GRPO outperforms strong baselines on five math benchmarks, achieving 58.2% average accuracy on DeepSeek-R1-Distill-Qwen-1.5B with only 7,000 training samples and $55 cost.
Conclusion: Diversity calibration plays a critical role in data-efficient alignment for mathematical reasoning, and DRA provides a plug-and-play solution that integrates seamlessly with GRPO variants.
Abstract: Post-training LLMs with Reinforcement Learning, specifically Group Relative Policy Optimization (GRPO), has emerged as a paradigm for enhancing mathematical reasoning. However, standard GRPO relies on scalar correctness rewards that are often non-injective with respect to semantic content: distinct reasoning paths receive identical rewards. This leads to a Diversity-Quality Inconsistency, where the policy collapses into a narrow set of dominant modes while ignoring equally valid but structurally novel strategies. To bridge this gap, we propose Diversity-aware Reward Adjustment (DRA), a theoretically grounded framework that calibrates the reward signal using the semantic density of sampled groups. By leveraging Submodular Mutual Information (SMI), DRA implements an Inverse Propensity Scoring (IPS) mechanism that effectively de-biases the gradient estimation. This creates a repulsive force against redundancy, driving the policy to achieve better coverage of the high-reward landscape. Our method is plug-and-play and integrates seamlessly with GRPO variants. Empirical evaluations on five math benchmarks demonstrate that DRA-GRPO consistently outperforms strong baselines, achieving an average accuracy of 58.2% on DeepSeek-R1-Distill-Qwen-1.5B with only 7,000 training samples and $55 cost, highlighting the critical role of diversity calibration in data-efficient alignment.
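A minimal sketch of the reward-calibration idea under simplifying assumptions: unit-normalized embeddings for each sampled completion, mean pairwise cosine similarity standing in for semantic density, and a plain inverse weighting in place of the paper's SMI-based formulation.

```python
import numpy as np

def diversity_adjusted_rewards(rewards, embeddings, eps=1e-6):
    """Divide each reward by its sample's semantic density (IPS-style),
    so rare-but-correct reasoning paths receive larger gradients."""
    E = np.asarray(embeddings, float)          # (G, d), unit-normalized
    sims = np.clip(E @ E.T, 0.0, 1.0)          # pairwise cosine similarity
    density = sims.mean(axis=1)                # high = redundant within group
    adjusted = np.asarray(rewards, float) / (density + eps)
    return adjusted / (adjusted.mean() + eps)  # renormalize group scale

r = [1.0, 1.0, 0.0, 1.0]        # scalar correctness rewards for a group
E = np.eye(4)                   # toy: four mutually orthogonal paths
print(diversity_adjusted_rewards(r, E))
```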
[117] Social Bias in Popular Question-Answering Benchmarks
Angelie Kraft, Judith Simon, Sonja Schimmler
Main category: cs.CL
TL;DR: QA/RC benchmarks lack demographic and regional representation, with insufficient creator information and widespread gender, religion, and geographic biases in popular datasets.
Details
Motivation: To investigate whether popular QA and reading comprehension benchmarks adequately cover questions about different demographics or regions, and to examine potential biases in benchmark creation and content.
Method: Content analysis of 30 benchmark papers and quantitative analysis of 20 respective benchmark datasets, examining creator/annotator information, social bias measures, and demographic representation.
Result: Most benchmark papers provide insufficient information about creators and annotators; only WinoGrande explicitly addresses social representation. Data analysis reveals widespread gender, religion, and geographic biases across encyclopedic, commonsense, and scholarly benchmarks.
Conclusion: Biased benchmarks may contribute to LLM bias by incentivizing biased inference heuristics, adding to criticism of current AI evaluation practices and highlighting the need for more representative benchmark design.
Abstract: Question-answering (QA) and reading comprehension (RC) benchmarks are commonly used for assessing the capabilities of large language models (LLMs) to retrieve and reproduce knowledge. However, we demonstrate that popular QA and RC benchmarks do not cover questions about different demographics or regions in a representative way. We perform a content analysis of 30 benchmark papers and a quantitative analysis of 20 respective benchmark datasets to learn (1) who is involved in the benchmark creation, (2) whether the benchmarks exhibit social bias, or whether this is addressed or prevented, and (3) whether the demographics of the creators and annotators correspond to particular biases in the content. Most benchmark papers analyzed provide insufficient information about those involved in benchmark creation, particularly the annotators. Notably, just one (WinoGrande) explicitly reports measures taken to address social representation issues. Moreover, the data analysis revealed gender, religion, and geographic biases across a wide range of encyclopedic, commonsense, and scholarly benchmarks. Our work adds to the mounting criticism of AI evaluation practices and shines a light on biased benchmarks being a potential source of LLM bias by incentivizing biased inference heuristics.
[118] Shared Path: Unraveling Memorization in Multilingual LLMs through Language Similarities
Xiaoyu Luo, Yiyi Chen, Johannes Bjerva, Qiongxiu Li
Main category: cs.CL
TL;DR: First comprehensive study of memorization in multilingual LLMs across 95 languages, revealing that cross-lingual relationships better explain memorization patterns than just training data availability.
Details
Motivation: Understanding memorization in multilingual LLMs is critical as they're increasingly deployed, but prior work focused on monolingual models, leaving multilingual memorization underexplored despite long-tailed training corpora.
Method: Analyzed 95 languages using models across diverse scales and architectures; proposed a novel graph-based correlation metric incorporating language similarity to analyze cross-lingual memorization.
Result: Found that among similar languages, those with fewer training tokens exhibit higher memorization, a trend only visible when cross-lingual relationships are explicitly modeled. Language similarity explains memorization patterns better than just training data availability.
Conclusion: Highlights importance of language-aware perspective for evaluating/mitigating memorization vulnerabilities in MLLMs; provides empirical evidence that language similarity explains memorization and underpins cross-lingual transferability.
Abstract: We present the first comprehensive study of memorization in Multilingual Large Language Models (MLLMs), analyzing 95 languages using models across diverse model scales, architectures, and memorization definitions. As MLLMs are increasingly deployed, understanding their memorization behavior has become critical. Yet prior work has focused primarily on monolingual models, leaving multilingual memorization underexplored, despite the inherently long-tailed nature of training corpora. We find that the prevailing assumption, that memorization is highly correlated with training data availability, fails to fully explain memorization patterns in MLLMs. We hypothesize that the conventional focus on monolingual settings, effectively treating languages in isolation, may obscure the true patterns of memorization. To address this, we propose a novel graph-based correlation metric that incorporates language similarity to analyze cross-lingual memorization. Our analysis reveals that among similar languages, those with fewer training tokens tend to exhibit higher memorization, a trend that only emerges when cross-lingual relationships are explicitly modeled. These findings underscore the importance of a language-aware perspective in evaluating and mitigating memorization vulnerabilities in MLLMs. This also constitutes empirical evidence that language similarity both explains memorization in MLLMs and underpins cross-lingual transferability, with broad implications for multilingual NLP.
[119] After Retrieval, Before Generation: Enhancing the Trustworthiness of Large Language Models in Retrieval-Augmented Generation
Xinbang Dai, Huikang Hu, Yuncheng Hua, Jiaqi Li, Yongrui Chen, Rihui Jin, Nan Hu, Guilin Qi
Main category: cs.CL
TL;DR: BRIDGE framework improves RAG trustworthiness by dynamically balancing parametric vs. retrieved knowledge using adaptive soft bias and decision trees, outperforming baselines by 5-15% accuracy across diverse scenarios.
Details
Motivation: Current RAG systems struggle with trustworthiness when parametric (internal) and retrieved (external) knowledge conflict or are unreliable. Existing approaches handle isolated scenarios but lack a unified framework for real-world conditions where knowledge sources may conflict.
Method: Proposed BRIDGE framework: 1) Uses an adaptive "soft bias" weighting mechanism to guide knowledge collection, 2) Employs a Maximum Soft-bias Decision Tree to evaluate knowledge quality and select optimal response strategies (trust internal/external knowledge, or refuse to answer).
Result: BRIDGE outperforms baselines by 5-15% in accuracy while maintaining balanced performance across all scenarios. The framework was evaluated on the Trustworthiness Response Dataset (TRD) with 36,266 questions spanning four RAG settings.
Conclusion: BRIDGE provides an effective solution for LLMs’ trustworthy responses in real-world RAG applications by dynamically determining comprehensive response strategies that balance parametric and retrieved knowledge.
Abstract: Retrieval-augmented generation (RAG) is a promising paradigm, yet its trustworthiness remains a critical concern. A major vulnerability arises prior to generation: models often fail to balance parametric (internal) and retrieved (external) knowledge, particularly when the two sources conflict or are unreliable. To analyze these scenarios comprehensively, we construct the Trustworthiness Response Dataset (TRD) with 36,266 questions spanning four RAG settings. We reveal that existing approaches address isolated scenarios (prioritizing one knowledge source, naively merging both, or refusing answers) but lack a unified framework to handle different real-world conditions simultaneously. Therefore, we propose the BRIDGE framework, which dynamically determines a comprehensive response strategy of large language models (LLMs). BRIDGE leverages an adaptive weighting mechanism named soft bias to guide knowledge collection, followed by a Maximum Soft-bias Decision Tree to evaluate knowledge and select optimal response strategies (trust internal/external knowledge, or refuse). Experiments show BRIDGE outperforms baselines by 5-15% in accuracy while maintaining balanced performance across all scenarios. Our work provides an effective solution for LLMs' trustworthy responses in real-world RAG applications.
[120] Interleaved Reasoning for Large Language Models via Reinforcement Learning
Roy Xie, David Qiu, Deepak Gopinath, Dong Lin, Yanchao Sun, Chong Wang, Saloni Potdar, Bhuwan Dhingra
Main category: cs.CL
TL;DR: A reinforcement learning training paradigm that teaches LLMs to interleave thinking and answering for multi-hop questions, improving accuracy while reducing reasoning length and time-to-first-token.
Details
Motivation: Traditional long chain-of-thought reasoning leads to inefficiencies and increased time-to-first-token (TTFT), creating a need for more efficient reasoning approaches that maintain accuracy.
Method: Uses only reinforcement learning (with PPO, GRPO, or REINFORCE++) to train LLMs to interleave thinking and answering. Introduces a simple reward scheme that incentivizes correct intermediate steps, leveraging intermediate signals during interleaved reasoning without requiring external tools. (A toy reward function follows the abstract below.)
Result: Achieves 12.5% improvement in Pass@1 accuracy, reduces overall reasoning length by 37%, and reduces TTFT by over 80% on average. Shows strong generalization to complex reasoning datasets (MATH, GPQA, MMLU) despite training only on QA and logical reasoning datasets.
Conclusion: Models inherently possess interleaved reasoning ability that can be enhanced through RL. The proposed method enables more effective credit assignment during RL, improving both accuracy and efficiency while maintaining generalization capabilities.
Abstract: Long chain-of-thought (CoT) significantly enhances the reasoning capabilities of large language models (LLMs). However, extensive reasoning traces lead to inefficiencies and increased time-to-first-token (TTFT). We propose a training paradigm that uses only reinforcement learning (RL) to guide reasoning LLMs to interleave thinking and answering for multi-hop questions. We observe that models inherently possess the ability to perform interleaved reasoning, which can be further enhanced through RL. We introduce a simple yet effective reward scheme to incentivize correct intermediate steps, guiding the policy model toward correct reasoning paths by leveraging intermediate signals generated during interleaved reasoning. Extensive experiments across five diverse datasets and three RL algorithms (PPO, GRPO, and REINFORCE++) demonstrate consistent improvements over traditional think-answer reasoning, without requiring external tools. Our method improves final task accuracy and overall efficiency by enabling more effective credit assignment during RL. Specifically, our approach achieves a 12.5% improvement in Pass@1 accuracy, while reducing overall reasoning length by 37% and TTFT by over 80% on average. Furthermore, our method, trained solely on question answering and logical reasoning datasets, exhibits strong generalization to complex reasoning datasets such as MATH, GPQA, and MMLU. Additionally, we conduct in-depth analysis to reveal several valuable insights into conditional reward modeling.
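A minimal reward function in the spirit of the scheme described above: the final answer dominates, and each interleaved intermediate answer that matches a reference hop earns a small bonus. The exact weighting and the conditional reward variants analyzed in the paper are not reproduced here.

```python
def interleaved_reward(intermediate_answers, final_answer,
                       gold_hops, gold_final, bonus=0.2):
    """Final-answer correctness plus a small bonus per correct hop."""
    r = 1.0 if final_answer == gold_final else 0.0
    for pred, gold in zip(intermediate_answers, gold_hops):
        if pred == gold:
            r += bonus
    return r

# Two-hop example: both intermediate hops and the final answer are right.
print(interleaved_reward(["Paris", "Seine"], "Seine",
                         ["Paris", "Seine"], "Seine"))  # 1.4
```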
[121] PAM: Training Policy-Aligned Moderation Filters at Scale
Masoomali Fatehkia, Enes Altinisik, Mohamed Osman, Husrev Taha Sencar
Main category: cs.CL
TL;DR: PAM is a framework for training custom moderation filters based on user-defined policies beyond just safety, using automated data generation without human examples, achieving SOTA performance while being 5-100x faster at inference.
Details
Motivation: LLMs remain vulnerable to misalignment and jailbreaks, requiring external safeguards. Existing moderation filters focus too narrowly on safety and don't address broader alignment needs in real-world deployments.
Method: Policy Aligned Moderation (PAM) framework that automates training data generation without human-written examples, enabling scalable support for diverse application-specific alignment goals and generation policies.
Result: PAM-trained filters match SOTA safety moderation filters and policy reasoning models, outperform them on PAMbench (four new user-annotated benchmarks for age restrictions, dietary accommodations, cultural alignment, and medical guidance limitations), and run 5-100x faster at inference.
Conclusion: PAM provides a flexible, scalable solution for custom moderation filters that address broader alignment needs beyond conventional safety, with superior performance and efficiency compared to existing approaches.
Abstract: Large language models (LLMs) remain vulnerable to misalignment and jailbreaks, making external safeguards like moderation filters essential, yet existing filters often focus narrowly on safety, falling short of the broader alignment needs seen in real-world deployments. We introduce Policy Aligned Moderation (PAM), a flexible framework for training custom moderation filters grounded in user-defined policies that extend beyond conventional safety objectives. PAM automates training data generation without relying on human-written examples, enabling scalable support for diverse, application-specific alignment goals and generation policies. PAM-trained filters match the performance of state-of-the-art safety moderation filters and policy reasoning models, and outperform them on PAMbench, four newly introduced user-annotated policy enforcement benchmarks that target age restrictions, dietary accommodations, cultural alignment, and limitations in medical guidance. These performance gains are achieved while the PAM filter runs 5-100x faster at inference than policy-conditioned reasoning models.
[122] TrojanStego: Your Language Model Can Secretly Be A Steganographic Privacy Leaking Agent
Dominik Meier, Jan Philip Wahle, Paul Röttger, Terry Ruas, Bela Gipp
Main category: cs.CL
TL;DR: TrojanStego is a novel LLM threat where adversaries fine-tune models to embed sensitive information into natural-looking outputs via linguistic steganography, enabling passive data exfiltration without explicit control over inference inputs.
Details
Motivation: As LLMs are integrated into sensitive workflows, concerns grow about their potential to leak confidential information. Current threat models often require explicit control over inference inputs, but this paper explores a more covert approach where compromised models can passively exfiltrate data through seemingly normal outputs.
Method: The authors propose TrojanStego, which fine-tunes LLMs to embed sensitive context information into natural-looking outputs using linguistic steganography. They introduce a practical encoding scheme based on vocabulary partitioning that is learnable by LLMs through fine-tuning. The approach doesn't require explicit control over inference inputs, making it more covert than traditional attacks.
Result: Experimental results show compromised models reliably transmit 32-bit secrets with 87% accuracy on held-out prompts, reaching over 97% accuracy using majority voting across three generations. The models maintain high utility, can evade human detection, and preserve coherence in their outputs.
Conclusion: TrojanStego represents a new class of LLM data exfiltration attacks that are passive, covert, practical, and dangerous. The paper highlights the need for better security measures against such steganographic threats in LLM deployments, especially in sensitive applications.
Abstract: As large language models (LLMs) become integrated into sensitive workflows, concerns grow over their potential to leak confidential information. We propose TrojanStego, a novel threat model in which an adversary fine-tunes an LLM to embed sensitive context information into natural-looking outputs via linguistic steganography, without requiring explicit control over inference inputs. We introduce a taxonomy outlining risk factors for compromised LLMs, and use it to evaluate the risk profile of the threat. To implement TrojanStego, we propose a practical encoding scheme based on vocabulary partitioning learnable by LLMs via fine-tuning. Experimental results show that compromised models reliably transmit 32-bit secrets with 87% accuracy on held-out prompts, reaching over 97% accuracy using majority voting across three generations. Further, they maintain high utility, can evade human detection, and preserve coherence. These results highlight a new class of LLM data exfiltration attacks that are passive, covert, practical, and dangerous.
[123] FinTagging: Benchmarking LLMs for Extracting and Structuring Financial Information
Yan Wang, Lingfei Qian, Xueqing Peng, Yang Ren, Keyi Wang, Yi Han, Dongji Feng, Fengran Mo, Shengyuan Lin, Qinchuan Zhang, Kaiwen He, Chenri Luo, Jianxing Chen, Junwei Wu, Chen Xu, Ziyang Xu, Jimin Huang, Guojun Xiong, Xiao-Yang Liu, Qianqian Xie, Jian-Yun Nie
Main category: cs.CL
TL;DR: FinTagging benchmark for XBRL tagging decomposes financial report tagging into entity extraction (FinNI) and concept linking (FinCL) to evaluate LLMs on realistic financial reporting tasks.
Details
Motivation: Current XBRL tagging benchmarks oversimplify the task as flat classification over small concept subsets, ignoring hierarchical taxonomy semantics and document structure, failing to evaluate LLMs under realistic financial reporting conditions.
Method: Introduces the FinTagging benchmark with a two-stage approach: FinNI (Financial Numeric Identification) extracts entities/types from heterogeneous contexts, and FinCL (Financial Concept Linking) maps entities to the full US-GAAP taxonomy. (A toy two-stage pipeline follows the abstract below.)
Result: Evaluation of diverse LLMs in zero-shot settings shows models generalize well in extraction (FinNI) but struggle with fine-grained concept linking (FinCL), revealing limitations in domain-specific, structure-aware reasoning.
Conclusion: FinTagging provides the first comprehensive benchmark for structure-aware XBRL tagging, enabling fair assessment of LLM capabilities in numerical reasoning and taxonomy alignment for realistic financial reporting tasks.
Abstract: Accurate interpretation of numerical data in financial reports is critical for markets and regulators. Although XBRL (eXtensible Business Reporting Language) provides a standard for tagging financial figures, mapping thousands of facts to over ten thousand US-GAAP concepts remains costly and error-prone. Existing benchmarks oversimplify this task as flat, single-step classification over small subsets of concepts, ignoring the hierarchical semantics of the taxonomy and the structured nature of financial documents. As a result, these benchmarks fail to evaluate Large Language Models (LLMs) under realistic reporting conditions. To bridge this gap, we introduce FinTagging, the first comprehensive benchmark for structure-aware and full-scope XBRL tagging. We decompose the complex tagging process into two subtasks: (1) FinNI (Financial Numeric Identification), which extracts entities and types from heterogeneous contexts such as text and tables; and (2) FinCL (Financial Concept Linking), which maps extracted entities to the full US-GAAP taxonomy. This two-stage formulation enables a fair assessment of LLM capabilities in numerical reasoning and taxonomy alignment. Evaluating diverse LLMs in zero-shot settings shows that while models generalize well in extraction, they struggle with fine-grained concept linking, revealing important limitations in domain-specific, structure-aware reasoning. Code is available on GitHub, and datasets are available on Hugging Face.
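A minimal sketch of the two-stage decomposition, assuming an extract() function for stage one (FinNI) and an embed() function for nearest-neighbor linking into the concept list for stage two (FinCL). All names here are illustrative; the benchmark specifies the tasks, not this implementation.

```python
import numpy as np

def fin_tag(text, extract, embed, taxonomy):
    """Stage 1 (FinNI): extract numeric entities; Stage 2 (FinCL): link
    each entity to its nearest taxonomy concept in embedding space."""
    concepts = list(taxonomy)
    C = np.stack([embed(c) for c in concepts])
    C = C / np.linalg.norm(C, axis=1, keepdims=True)
    out = []
    for value, ent_type, context in extract(text):
        q = embed(context)
        q = q / np.linalg.norm(q)
        out.append((value, ent_type, concepts[int(np.argmax(C @ q))]))
    return out

# Toy stand-ins so the sketch runs end to end.
toy_embed = lambda s: np.array([1.0, s.lower().count("revenue"),
                                s.lower().count("cash")])
toy_extract = lambda text: [("$1.2B", "monetary",
                             "total revenue was $1.2B")]
taxonomy = ["us-gaap:Revenues", "us-gaap:CashAndCashEquivalents"]
print(fin_tag("...", toy_extract, toy_embed, taxonomy))
```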
[124] Dissecting Physics Reasoning in Small Language Models: A Multi-Dimensional Analysis from an Educational Perspective
Nicy Scaria, Silvester John Joseph Kennedy, Krishna Agarwal, Diksha Seth, Deepak Subramani
Main category: cs.CL
TL;DR: SLMs show major reasoning reliability gaps in physics education - 75-98% of correct final answers contain reasoning errors, requiring evaluation focused on reasoning fidelity over answer correctness.
Details
Motivation: Existing benchmarks prioritize final answer accuracy, missing 'right answer, wrong procedure' failures that can reinforce student misconceptions in educational settings where SLMs offer privacy and efficiency advantages.
Method: Created Physbench with 3,162 physics questions (Bloom's Taxonomy annotated) + 2,700 culturally contextualized variants. Used the P-REFS stage-wise rubric to evaluate 10 SLMs across 58,000 responses, analyzing failure modes by model capability.
Result: Major reliability gap: 75-98% of final-answer-correct solutions contain at least one reasoning error. Failure modes shift with capability - weaker models fail at interpretation/modeling, stronger models fail during execution. Contextual variations minimally impact top models but degrade mid-tier models.
Conclusion: Safe educational AI requires evaluation paradigms prioritizing reasoning fidelity over final-answer correctness, as SLMs show substantial reasoning reliability issues despite correct answers.
Abstract: Small Language Models (SLMs) offer privacy and efficiency for educational deployment, yet their utility depends on reliable multistep reasoning. Existing benchmarks often prioritize final answer accuracy, obscuring 'right answer, wrong procedure' failures that can reinforce student misconceptions. This work investigates SLM physics reasoning reliability, stage-wise failure modes, and robustness under paired contextual variants. We introduce Physbench, comprising 3,162 high school and AP level physics questions derived from OpenStax in a structured reference solution format with Bloom's Taxonomy annotations, plus 2,700 paired culturally contextualized variants. Using P-REFS, a stage-wise evaluation rubric, we assess 10 SLMs across 58,000 responses. Results reveal a substantial reliability gap: among final-answer-correct solutions, 75 to 98% contain at least one reasoning error. Failure modes shift with model capability; weaker models fail primarily at interpretation or modeling while stronger models often fail during execution. Paired contextual variations have minimal impact on top models but degrade the performance of mid-tier models. These findings demonstrate that safe educational AI requires evaluation paradigms that prioritize reasoning fidelity over final-answer correctness.
[125] InComeS: Integrating Compression and Selection Mechanisms into LLMs for Efficient Model Editing
Shuaiyi Li, Zhisong Zhang, Yang Deng, Chenlong Deng, Tianqing Fang, Hongming Zhang, Haitao Mi, Dong Yu, Wai Lam
Main category: cs.CL
TL;DR: InComeS enhances LLM model editing by compressing edit contexts into KV caches of gist tokens and using cross-attention for dynamic selection, overcoming context window limitations.
Details
Motivation: Existing model editing methods struggle with complex scenarios requiring semantic understanding rather than just knowledge recall. In-context learning (ICL) shows promise but is limited by LLMs' context window constraints, degrading performance as edit numbers increase.
Method: Proposes the InComeS framework with two key mechanisms: 1) Compresses each editing context into the KV cache of a special gist token for efficient handling of multiple edits, 2) Adds specialized cross-attention modules to dynamically select the most relevant information from gist pools for adaptive edit utilization. (A toy sketch of the selection step follows the abstract below.)
Result: Experiments on diverse model editing benchmarks with various editing formats demonstrate the effectiveness and efficiency of the InComeS method.
Conclusion: InComeS provides a flexible framework that enhances LLMs’ ability to process editing contexts through compression and selection mechanisms, overcoming context window limitations while maintaining effectiveness across different editing scenarios.
Abstract: Although existing model editing methods perform well in recalling exact edit facts, they often struggle in complex scenarios that require deeper semantic understanding rather than mere knowledge regurgitation. Leveraging the strong contextual reasoning abilities of large language models (LLMs), in-context learning (ICL) becomes a promising editing method by comprehending edit information through context encoding. However, this method is constrained by the limited context window of LLMs, leading to degraded performance and efficiency as the number of edits increases. To overcome this limitation, we propose InComeS, a flexible framework that enhances LLMs’ ability to process editing contexts through explicit compression and selection mechanisms. Specifically, InComeS compresses each editing context into the key-value (KV) cache of a special gist token, enabling efficient handling of multiple edits without being restricted by the model’s context window. Furthermore, specialized cross-attention modules are added to dynamically select the most relevant information from the gist pools, enabling adaptive and effective utilization of edit information. We conduct experiments on diverse model editing benchmarks with various editing formats, and the results demonstrate the effectiveness and efficiency of our method.
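A minimal PyTorch sketch of the selection half of the design: a query state attends over a pool of per-edit gist vectors (stand-ins for the compressed single-token KV caches) to weight edits by relevance. The compression step and the KV-cache plumbing inside the transformer layers are omitted.

```python
import torch
import torch.nn.functional as F

def select_from_gist_pool(query, gist_keys, gist_values):
    """query: (d,); gist_keys/gist_values: (num_edits, d).
    Returns the relevance-weighted mixture of edit information."""
    scores = gist_keys @ query / query.shape[0] ** 0.5  # scaled dot product
    weights = F.softmax(scores, dim=0)                  # relevance over edits
    return weights @ gist_values, weights

torch.manual_seed(0)
d, n_edits = 16, 5
q = torch.randn(d)                       # hidden state of the current query
K, V = torch.randn(n_edits, d), torch.randn(n_edits, d)
mixed, w = select_from_gist_pool(q, K, V)
print(w)                                 # attention weights over the edits
```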
[126] Fair Document Valuation in LLM Summaries via Shapley Values
Zikun Ye, Hema Yoganarasimhan
Main category: cs.CL
TL;DR: Proposes Cluster Shapley, a scalable Shapley value approximation for fair attribution of documents in LLM-generated summaries, showing superior efficiency-accuracy tradeoffs over existing methods.
Details
Motivation: LLM-based summarization systems obscure individual content creator contributions, raising concerns about fair credit attribution and compensation for original documents used in summaries.
Method: Develops Cluster Shapley, an approximation algorithm that leverages semantic similarity among documents to cluster them, reducing computational complexity while maintaining attribution accuracy compared to exact Shapley value computation.
Result: Empirical evaluation on Amazon product reviews shows Cluster Shapley substantially improves the efficiency-accuracy frontier over off-the-shelf Shapley approximations (Monte Carlo, Kernel SHAP), while simple attribution rules lead to highly unfair outcomes.
Conclusion: Structure-aware Shapley approximations like Cluster Shapley offer scalable and fair content attribution mechanisms for LLM summarization systems, providing guidance for platforms seeking to implement fair attribution.
Abstract: Large Language Models (LLMs) are increasingly used in systems that retrieve and summarize content from multiple sources, such as search engines and AI assistants. While these systems enhance user experience through coherent summaries, they obscure the individual contributions of original content creators, raising concerns about credit attribution and compensation. We address the challenge of valuing individual documents used in LLM-generated summaries by proposing a Shapley value-based framework for fair document valuation. Although theoretically appealing, exact Shapley value computation is prohibitively expensive at scale. To improve efficiency, we develop Cluster Shapley, a simple approximation algorithm that leverages semantic similarity among documents to reduce computation while maintaining attribution accuracy. Using Amazon product review data, we empirically show that off-the-shelf Shapley approximations, such as Monte Carlo sampling and Kernel SHAP, perform suboptimally in LLM settings, whereas Cluster Shapley substantially improves the efficiency-accuracy frontier. Moreover, simple attribution rules (e.g., equal or relevance-based allocation), though computationally cheap, lead to highly unfair outcomes. Together, our findings highlight the potential of structure-aware Shapley approximations tailored to LLM summarization and offer guidance for platforms seeking scalable and fair content attribution mechanisms.
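The abstract does not spell out the approximation, so the sketch below shows one plausible reading of the idea: play the cooperative game over semantic clusters rather than individual documents, then split each cluster's credit among its members. The toy value function `v` and the equal within-cluster split are illustrative assumptions, not the paper's exact algorithm.

```python
import itertools
from math import factorial

# Toy setup: 9 documents in 3 semantic clusters; v(S) is a stand-in for
# "summary quality when only the documents in S are available".
clusters = {0: [0, 1, 2], 1: [3, 4, 5], 2: [6, 7, 8]}
topic_value = {0: 1.0, 1: 0.6, 2: 0.3}  # each cluster contributes one topic

def v(doc_set):
    # A coalition covers a topic iff it contains any document from that cluster.
    return sum(val for c, val in topic_value.items()
               if any(d in doc_set for d in clusters[c]))

def shapley(players, value_fn):
    """Exact Shapley values over a small set of players."""
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for r in range(n):
            for coal in itertools.combinations(others, r):
                w = factorial(r) * factorial(n - r - 1) / factorial(n)
                phi[p] += w * (value_fn(set(coal) | {p}) - value_fn(set(coal)))
    return phi

# Cluster Shapley intuition: play the game over 3 clusters instead of 9
# documents (exact Shapley costs O(2^n), so shrinking n is the point),
# then divide each cluster's credit among its members.
cluster_phi = shapley(list(clusters),
                      lambda cs: v({d for c in cs for d in clusters[c]}))
doc_phi = {d: cluster_phi[c] / len(docs) for c, docs in clusters.items() for d in docs}
print(doc_phi)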
[127] Task Matters: Knowledge Requirements Shape LLM Responses to Context-Memory Conflict
Kaiser Sun, Fan Bai, Mark Dredze
Main category: cs.CL
TL;DR: LLMs struggle with conflicts between contextual information and parametric memory, with performance degradation depending on task-specific knowledge needs and conflict plausibility. Task-aware approaches are needed for balancing context and memory.
Details
Motivation: Prior research has focused on contextual QA where tasks should rely on context, leaving unclear how LLMs behave when tasks require different types and degrees of knowledge utilization. There's a need to understand how LLMs handle conflicts between context and parametric memory across varied task demands.
Method: Developed a model-agnostic diagnostic framework that holds underlying knowledge constant while introducing controlled conflicts across tasks with varying knowledge demands. Tested on representative open-source LLMs, examining strategies like rationales and context reiteration.
Result: Performance degradation under conflict is driven by both task-specific knowledge reliance and conflict plausibility. Strategies like rationales or context reiteration increase context reliance, helping context-only tasks but harming those requiring parametric knowledge. These effects bias model-based evaluation.
Conclusion: Context-memory conflict is inherently task-dependent, motivating task-aware approaches to balancing context and memory in LLM deployment and evaluation. The findings call into question the reliability of LLMs as judges in evaluation settings.
Abstract: Large language models (LLMs) draw on both contextual information and parametric memory, yet these sources can conflict. Prior studies have largely examined this issue in contextual question answering, implicitly assuming that tasks should rely on the provided context, leaving unclear how LLMs behave when tasks require different types and degrees of knowledge utilization. We address this gap with a model-agnostic diagnostic framework that holds underlying knowledge constant while introducing controlled conflicts across tasks with varying knowledge demands. Experiments on representative open-source LLMs show that performance degradation under conflict is driven by both task-specific knowledge reliance and conflict plausibility; that strategies such as rationales or context reiteration increase context reliance, helping context-only tasks but harming those requiring parametric knowledge; and that these effects bias model-based evaluation, calling into question the reliability of LLMs as judges. Overall, our findings reveal that context-memory conflict is inherently task-dependent and motivate task-aware approaches to balancing context and memory in LLM deployment and evaluation.
[128] Improved LLM Agents for Financial Document Question Answering
Nelvin Tan, Zian Seng, Liang Zhang, Yu-Ching Shih, Dong Yang, Amol Salunkhe
Main category: cs.CL
TL;DR: LLMs struggle with numerical QA on financial documents. Traditional critic agents fail without oracle labels. New critic + calculator agents outperform state-of-the-art and are safer.
Details
Motivation: LLMs have impressive NLP capabilities but struggle with numerical question answering on financial documents containing tabular and textual data. While critic agents (self-correction) work well with oracle labels, they perform poorly when such labels are unavailable, which is a more realistic scenario.
Method: The paper presents an improved critic agent and introduces a calculator agent. These agents work together to address numerical QA on financial documents. The approach is compared against the previous state-of-the-art program-of-thought method.
Result: The proposed critic + calculator agents outperform the previous state-of-the-art program-of-thought approach. The new approach is also safer. The paper investigates how the agents interact and how this interaction affects their performance.
Conclusion: The improved critic agent combined with a calculator agent provides a more effective and safer solution for numerical question answering on financial documents, especially in realistic scenarios where oracle labels are unavailable. The interaction between agents plays a crucial role in performance.
Abstract: Large language models (LLMs) have shown impressive capabilities on numerous natural language processing tasks. However, LLMs still struggle with numerical question answering for financial documents that include tabular and textual data. Recent works have shown the effectiveness of critic agents (i.e., self-correction) for this task given oracle labels. Building upon this framework, this paper examines the effectiveness of the traditional critic agent when oracle labels are not available, and shows, through experiments, that the critic agent's performance deteriorates in this scenario. With this in mind, we present an improved critic agent along with a calculator agent, which together outperform the previous state-of-the-art approach (program-of-thought) and are safer. Furthermore, we investigate how our agents interact with each other, and how this interaction affects their performance.
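A minimal sketch of how a calculator agent and a label-free critic might interact, with the LLM stubbed out. The `llm_reasoner` output format and the consistency check are hypothetical, not the paper's protocol; the safe AST-based evaluator is one standard way to implement a deterministic calculator tool.

```python
import ast
import operator as op

# Safe arithmetic evaluator standing in for the calculator agent.
OPS = {ast.Add: op.add, ast.Sub: op.sub, ast.Mult: op.mul, ast.Div: op.truediv}

def calculator_agent(expr: str) -> float:
    """Evaluate a pure arithmetic expression without exec/eval."""
    def ev(node):
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -ev(node.operand)
        raise ValueError(f"disallowed node: {ast.dump(node)}")
    return ev(ast.parse(expr, mode="eval").body)

def llm_reasoner(question: str) -> dict:
    """Stub for the LLM: returns an arithmetic plan plus its own answer."""
    return {"expression": "(1200 - 800) / 800 * 100", "claimed_answer": 50.0}

def critic_agent(step: dict) -> dict:
    """Label-free critic: recompute the plan and flag inconsistencies."""
    computed = calculator_agent(step["expression"])
    ok = abs(computed - step["claimed_answer"]) < 1e-6
    return {"answer": computed, "consistent": ok}

step = llm_reasoner("Revenue grew from $800 to $1200; what is the growth rate (%)?")
print(critic_agent(step))  # {'answer': 50.0, 'consistent': True}
```

Offloading the arithmetic to a deterministic tool is also what makes such an approach "safer": the critic never has to trust the LLM's own arithmetic.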
[129] Are Vision Language Models Cross-Cultural Theory of Mind Reasoners?
Zabir Al Nazi, GM Shahariar, Md. Abrar Hossain, Wei Peng
Main category: cs.CL
TL;DR: CulturalToM-VQA benchmark reveals VLMs struggle with cross-cultural Theory of Mind reasoning, showing performance gaps in false belief tasks, regional variance, and social desirability bias despite recent architectural improvements.
Details
Motivation: Current Vision-Language Model evaluations for Theory of Mind (social intelligence) are Western-centric, lacking cross-cultural assessment of social reasoning abilities across diverse cultural contexts, rituals, and norms.
Method: Created CulturalToM-VQA benchmark with 5,095 visually situated ToM probes across diverse cultures using frontier proprietary MLLM and human-verified pipeline. Taxonomy includes six ToM tasks and four complexity levels. Evaluated 10 VLMs (2023-2025) with ablation experiments on prompting strategies.
Result: Frontier models show high accuracy (>93%) overall but struggle with false belief reasoning (19-83% accuracy) and exhibit 20-30% regional performance gaps. Models display social desirability bias (favoring positive answers) and rely on parametric social priors/safety-aligned predictions. Chain-of-Thought helps older models but not newer ones.
Conclusion: Despite architectural advances, VLMs lack robust cross-cultural social reasoning. The benchmark provides a testbed for evaluating visually grounded Theory of Mind understanding, revealing that achieving culturally-aware social intelligence remains an open challenge.
Abstract: Theory of Mind (ToM) - the ability to attribute beliefs and intents to others - is fundamental for social intelligence, yet Vision-Language Model (VLM) evaluations remain largely Western-centric. In this work, we introduce CulturalToM-VQA, a benchmark of 5,095 visually situated ToM probes across diverse cultural contexts, rituals, and social norms. Constructed through a frontier proprietary MLLM with a human-verified pipeline, the dataset spans a taxonomy of six ToM tasks and four complexity levels. We benchmark 10 VLMs (2023-2025) and observe a significant performance leap: while earlier models struggle, frontier models achieve high accuracy (>93%). However, significant limitations persist: models struggle with false belief reasoning (19-83% accuracy) and show high regional variance (20-30% gaps). Crucially, we find that SOTA models exhibit social desirability bias - systematically favoring semantically positive answer choices over negative ones. Ablation experiments reveal that some frontier models rely heavily on parametric social priors, frequently defaulting to safety-aligned predictions. Furthermore, while Chain-of-Thought prompting aids older models, it yields minimal gains for newer ones. Overall, our work provides a testbed for cross-cultural social reasoning, underscoring that despite architectural gains, achieving robust, visually grounded understanding remains an open challenge.
[130] Pitfalls of Evaluating Language Models with Open Benchmarks
Md. Najib Hasan, Md Mahadi Hassan Sibat, Mohammad Fakhruddin Babar, Souvika Sarkar, Monowar Hasan, Santu Karmaker
Main category: cs.CL
TL;DR: The paper exposes vulnerabilities in open LLM benchmarks where data leakage allows models to cheat by memorizing test sets, undermining leaderboard reliability and calling for improved evaluation practices.
Details
Motivation: Open LLM benchmarks like HELM and BIG-Bench promote transparency but create risks of data leakage during testing, which can undermine fairness and reliability of leaderboard rankings and enable manipulation.
Method: Constructed cheating models by fine-tuning smaller variants of BART, T5, and GPT-2 directly on publicly available test sets, then examined paraphrase-based safeguarding strategies to mitigate data leakage impact.
Result: Cheating models excelled on target benchmarks but failed to generalize to comparable unseen test sets, demonstrating the severity of data leakage. Paraphrase-based safeguards showed effectiveness but also limitations.
Conclusion: High leaderboard performance on limited open benchmarks may not reflect real-world utility; private/dynamic benchmarks should complement open ones; current benchmarking practices need reexamination for reliable LM assessment.
Abstract: Open Large Language Model (LLM) benchmarks, such as HELM and BIG-Bench, provide standardized and transparent evaluation protocols that support comparative analysis, reproducibility, and systematic progress tracking in Language Model (LM) research. Yet this openness also creates substantial risks of data leakage, deliberate or inadvertent, during LM testing, thereby undermining the fairness and reliability of leaderboard rankings and leaving them vulnerable to manipulation by unscrupulous actors. We illustrate the severity of this issue by intentionally constructing cheating models: smaller variants of BART, T5, and GPT-2, fine-tuned directly on publicly available test sets. As expected, these models excel on the target benchmarks but fail to generalize to comparable unseen test sets. We then examine simple task-specific paraphrase-based safeguarding strategies to mitigate the impact of data leakage and evaluate their effectiveness and limitations. Our findings underscore three key points: (i) high leaderboard performance on limited open, static benchmarks may not reflect real-world utility; (ii) private or dynamically generated benchmarks should complement open benchmarks to maintain evaluation integrity; and (iii) a reexamination of current benchmarking practices is essential for reliable and trustworthy LM assessment.
[131] ILID: Native Script Language Identification for Indian Languages
Yash Ingle, Pruthwik Mishra
Main category: cs.CL
TL;DR: This paper introduces ILID, a new dataset of 250K sentences covering 23 languages (English + 22 official Indian languages) for language identification, with baseline models that outperform existing state-of-the-art transformer models.
Details
Motivation: Language identification is crucial for NLP applications but faces challenges with noisy, short, and code-mixed text, especially for Indian languages that share scripts and have lexical/phonetic similarities but distinct differences.
Method: Created a new dataset of 250K sentences across 23 languages (mostly newly created data), then developed baseline models using state-of-the-art machine learning approaches and fine-tuning pre-trained transformer models.
Result: The developed models outperform state-of-the-art pre-trained transformer models for language identification tasks, particularly for the challenging Indian language context.
Conclusion: The paper releases both the ILID dataset and code publicly, providing valuable resources for improving language identification for diverse Indian languages in challenging real-world scenarios.
Abstract: The language identification task is a crucial fundamental step in NLP. It often serves as a pre-processing step for widely used NLP applications such as multilingual machine translation, information retrieval, question answering, and text summarization. The core challenge of language identification lies in distinguishing languages in noisy, short, and code-mixed environments. This becomes even harder for diverse Indian languages that exhibit lexical and phonetic similarities but have distinct differences. Many Indian languages share the same script, making the task even more challenging. Taking all these challenges into account, we develop and release a dataset of 250K sentences covering 23 languages, including English and all 22 official Indian languages, labeled with their language identifiers, where data in most languages are newly created. We also develop and release baseline models using state-of-the-art approaches in machine learning and fine-tuning of pre-trained transformer models. Our models outperform the state-of-the-art pre-trained transformer models for the language identification task. The dataset and the code are available at https://yashingle-ai.github.io/ILID/ and in Hugging Face open-source libraries.
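A classical language-identification baseline of the kind the paper benchmarks against can be a few lines of scikit-learn. The tiny corpus below is illustrative, not ILID data; character n-grams are the usual choice because they help most with same-script pairs such as Hindi and Marathi, which both use Devanagari.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative corpus (not the ILID data). Note hin and mar share a script.
train_texts = ["this is a sentence", "यह एक वाक्य है", "हे एक वाक्य आहे",
               "ಇದು ಒಂದು ವಾಕ್ಯ", "இது ஒரு வாக்கியம்", "esto es una prueba"]
train_labels = ["eng", "hin", "mar", "kan", "tam", "spa"]

lid = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 4)),  # character n-grams
    LogisticRegression(max_iter=1000),
)
lid.fit(train_texts, train_labels)
print(lid.predict(["यह परीक्षा है"]))  # expect 'hin' on this toy model
```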
[132] LAG: Logic-Augmented Generation from a Cartesian Perspective
Yilin Xiao, Chuang Zhou, Yujing Zhang, Qinggang Zhang, Su Dong, Shengyuan Chen, Chang Yang, Xiao Huang
Main category: cs.CL
TL;DR: LAG (Logic-Augmented Generation) improves LLM performance on complex reasoning tasks by decomposing questions into logical sub-questions and using stepwise reasoning with structured memory.
Details
Motivation: LLMs struggle with knowledge-intensive tasks and generate hallucinations, while existing RAG methods fail at complex reasoning due to lack of structured logical organization.
Method: LAG decomposes complex questions into atomic sub-questions with logical dependencies, resolves them sequentially using prior answers to guide context retrieval, and maintains an atomic memory bank for structured reasoning.
Result: Experiments on four benchmarks show LAG significantly improves accuracy and reduces hallucination compared to existing methods.
Conclusion: LAG provides a systematic approach to knowledge augmentation that addresses limitations of current RAG methods by incorporating structured logical reasoning inspired by Cartesian principles.
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks, yet exhibit critical limitations in knowledge-intensive tasks, often generating hallucinations when faced with questions requiring specialized expertise. While retrieval-augmented generation (RAG) mitigates this by integrating external knowledge, it struggles with complex reasoning scenarios due to its reliance on direct semantic retrieval and lack of structured logical organization. Inspired by Cartesian principles from Discours de la méthode, this paper introduces Logic-Augmented Generation (LAG), a novel paradigm that reframes knowledge augmentation through systematic question decomposition, atomic memory bank and logic-aware reasoning. Specifically, LAG first decomposes complex questions into atomic sub-questions ordered by logical dependencies. It then resolves these sequentially, using prior answers to guide context retrieval for subsequent sub-questions, ensuring stepwise grounding in the logical chain. Experiments on four benchmarks demonstrate that LAG significantly improves accuracy and reduces hallucination over existing methods.
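The decompose-then-resolve loop is easy to see in skeleton form. Everything below is stubbed: `decompose`, `retrieve`, and `answer` stand in for the LLM and retriever calls, and the memory-keyed refinement of later queries is only a guess at how prior answers guide subsequent retrieval.

```python
def decompose(question: str) -> list[str]:
    """Stub for the LLM decomposer: atomic sub-questions in dependency order."""
    return ["Who directed the film Inception?",
            "What other films has that director made?"]

def retrieve(query: str, memory: dict) -> str:
    """Stub retriever; prior answers in `memory` would refine the query."""
    grounded = query.format(**memory) if memory else query
    return f"<passages for: {grounded}>"

def answer(sub_q: str, context: str) -> str:
    """Stub reader LLM."""
    return f"<answer to '{sub_q}'>"

def lag_pipeline(question: str) -> dict:
    memory = {}                                   # atomic memory bank
    for i, sub_q in enumerate(decompose(question)):
        ctx = retrieve(sub_q, memory)             # retrieval guided by prior answers
        memory[f"a{i}"] = answer(sub_q, ctx)      # store resolved atomic fact
    return memory

print(lag_pipeline("What films has the director of Inception made?"))
```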
[133] PilotRL: Training Language Model Agents via Global Planning-Guided Progressive Reinforcement Learning
Keer Lu, Chong Chen, Xili Wang, Bin Cui, Yunhuai Liu, Wentao Zhang
Main category: cs.CL
TL;DR: AdaPlan introduces a global planning agent paradigm with PilotRL framework using progressive RL to improve LLM agents’ long-term strategic planning and generalization in complex tasks.
Details
Motivation: Existing LLM agent approaches like ReAct have limitations: single-step reasoning limits long-term planning, poor planner-executor coordination, and supervised fine-tuning leads to memorization rather than generalization for novel tasks.
Method: Propose AdaPlan paradigm for global planning-guided agents, and PilotRL framework with progressive reinforcement learning: 1) train to follow global plan guidance, 2) optimize plan quality, 3) jointly optimize planning-execution coordination.
Result: PilotRL achieves SOTA performance: LLaMA3.1-8B-Instruct + PilotRL surpasses GPT-4o by 3.60% and shows 55.78% gain over GPT-4o-mini at comparable parameter scale.
Conclusion: The AdaPlan paradigm with PilotRL training framework effectively addresses limitations of current LLM agents by enabling better long-horizon planning, improved generalization, and superior coordination between planning and execution.
Abstract: Large Language Models (LLMs) have shown remarkable advancements in tackling agent-oriented tasks. Despite their potential, existing work faces challenges when deploying LLMs in agent-based environments. The widely adopted agent paradigm ReAct centers on integrating single-step reasoning with immediate action execution, which limits its effectiveness in complex tasks requiring long-term strategic planning. Furthermore, the coordination between the planner and executor during problem-solving is also a critical factor to consider in agent design. Additionally, current approaches predominantly rely on supervised fine-tuning, which often leads models to memorize established task completion trajectories, thereby restricting their generalization ability when confronted with novel problem contexts. To address these challenges, we introduce an adaptive global plan-based agent paradigm AdaPlan, aiming to synergize high-level explicit guidance with execution to support effective long-horizon decision-making. Based on the proposed paradigm, we further put forward PilotRL, a global planning-guided training framework for LLM agents driven by progressive reinforcement learning. We first develop the model’s ability to follow explicit guidance from global plans when addressing agent tasks. Subsequently, based on this foundation, we focus on optimizing the quality of generated plans. Finally, we conduct joint optimization of the model’s planning and execution coordination. Experiments indicate that PilotRL achieves state-of-the-art performance, with LLaMA3.1-8B-Instruct + PilotRL surpassing closed-source GPT-4o by 3.60% while showing a more substantial gain of 55.78% compared to GPT-4o-mini at a comparable parameter scale.
[134] Task-Stratified Knowledge Scaling Laws for Post-Training Quantized Large Language Models
Chenxi Zhou, Pengfei Cao, Jiang Li, Bohan Yu, Jinyu Ye, Jun Zhao, Kang Liu
Main category: cs.CL
TL;DR: This paper introduces Task-Stratified Knowledge Scaling Laws for LLM quantization, showing different knowledge types (memorization, application, reasoning) have distinct sensitivities to quantization parameters.
Details
Motivation: Existing PTQ scaling laws focus only on general performance, ignoring how quantization differentially impacts diverse knowledge capabilities and fine-grained factors like group size and calibration set size.
Method: Developed a framework stratifying capabilities into memorization, application, and reasoning, then unified model size, bit-width, group size, and calibration set size. Validated on 293 diverse PTQ configurations across architectures.
Result: Strong framework fit and cross-architecture consistency. Revealed distinct sensitivities: reasoning is precision-critical, application is scale-responsive, and memorization is calibration-sensitive. Low-bit scenarios require optimizing fine-grained factors to prevent performance collapse.
Conclusion: Provides empirically-backed foundation for designing knowledge-aware quantization strategies, highlighting the importance of considering different knowledge capabilities separately in PTQ optimization.
Abstract: Post-Training Quantization (PTQ) is a critical strategy for efficient deployment of Large Language Models (LLMs). However, existing scaling laws primarily focus on general performance, overlooking crucial fine-grained factors and how quantization differentially impacts diverse knowledge capabilities. To address this, we establish Task-Stratified Knowledge Scaling Laws. By stratifying capabilities into memorization, application, and reasoning, we develop a framework that unifies model size, bit-width, and fine-grained factors: group size and calibration set size. Validated on 293 diverse PTQ configurations, our framework demonstrates strong fit and cross-architecture consistency. It reveals distinct sensitivities across knowledge capabilities: reasoning is precision-critical, application is scale-responsive, and memorization is calibration-sensitive. We highlight that in low-bit scenarios, optimizing these fine-grained factors is essential for preventing performance collapse. These findings provide an empirically-backed foundation for designing knowledge-aware quantization strategies.
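Fitting such a stratified law reduces to nonlinear regression over quantization configurations. The functional form below is purely hypothetical (the summary does not give the paper's equation), but it shows the workflow: pick a parametric law over model size, bit-width, group size, and calibration set size, then fit it with `scipy.optimize.curve_fit` on stand-ins for the 293 configurations.

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)

# Hypothetical functional form, for illustration only: quality loss shrinks
# with model size and bit-width, grows with coarser groups, and shrinks
# with larger calibration sets.
def q_law(X, a, alpha, beta, gamma, delta):
    N, bits, group, calib = X            # N = model size in billions of params
    return a * N**(-alpha) + beta * 2.0**(-bits) + gamma * np.log(group) \
           - delta * np.log(calib)

# Synthetic observations standing in for the 293 PTQ configurations.
n_obs = 200
N = rng.choice([1.0, 3.0, 7.0, 13.0], n_obs)
bits = rng.choice([2, 3, 4, 8], n_obs).astype(float)
group = rng.choice([32.0, 64.0, 128.0], n_obs)
calib = rng.choice([128.0, 256.0, 512.0], n_obs)
y = q_law((N, bits, group, calib), 5.0, 0.4, 4.0, 0.05, 0.02)
y = y + rng.normal(0, 0.02, n_obs)       # observation noise

params, _ = curve_fit(q_law, (N, bits, group, calib), y, p0=[1, 0.1, 1, 0.1, 0.1])
print(np.round(params, 3))  # should roughly recover [5.0, 0.4, 4.0, 0.05, 0.02]
```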
[135] Jailbreaking Commercial Black-Box LLMs with Explicitly Harmful Prompts
Chiyu Zhang, Lu Zhou, Xiaogang Xu, Jiafei Wu, Liming Fang, Zhe Liu
Main category: cs.CL
TL;DR: DH-CoT attack combines multiple jailbreak techniques into a single template to improve effectiveness on reasoning models, while MDH framework cleans evaluation datasets to provide accurate attack assessment.
Details
Motivation: Existing black-box jailbreak attacks work poorly on recent state-of-the-art reasoning models, and current red-teaming datasets contain low-quality samples that hinder accurate evaluation of attack effectiveness.
Method: Develop DH-CoT attack by integrating multiple jailbreak tricks into a single template: uses Adversarial Context Alignment to remove semantic inconsistencies, NTP-based few-shot examples to guide malicious outputs, and creates a fake chain of thought. Also introduces MDH framework (Malicious content Detection with Human assistance) to clean evaluation data and build RTA dataset suite.
Result: DH-CoT effectively jailbreaks models including GPT-5 and Claude-4, significantly outperforming SOTA methods like H-CoT and TAP. MDH reliably filters low-quality samples from datasets.
Conclusion: The proposed DH-CoT attack demonstrates superior jailbreak capabilities on reasoning models, and the MDH framework provides a more accurate evaluation methodology by cleaning dataset noise.
Abstract: Existing black-box jailbreak attacks achieve some success on non-reasoning models but degrade significantly on recent SOTA reasoning models. To improve attack ability, inspired by adversarial aggregation strategies, we integrate multiple jailbreak tricks into a single developer template. In particular, we apply Adversarial Context Alignment to purge semantic inconsistencies and use NTP-based (a type of harmful prompt) few-shot examples to guide malicious outputs, finally forming the DH-CoT attack with a fake chain of thought. In experiments, we further observe that existing red-teaming datasets include samples unsuitable for evaluating attack gains, such as BPs, NHPs, and NTPs. Such data hinder accurate evaluation of true gains in attack effectiveness. To address this, we introduce MDH, a Malicious content Detection framework integrating LLM-based annotation with Human assistance, with which we clean the data and build the RTA dataset suite. Experiments show that MDH reliably filters low-quality samples and that DH-CoT effectively jailbreaks models including GPT-5 and Claude-4, notably outperforming SOTA methods like H-CoT and TAP.
[136] LFD: Layer Fused Decoding to Exploit External Knowledge in Retrieval-Augmented Generation
Yang Sun, Zhiyong Xie, Lixin Zou, Dan Luo, Min Tang, Xiangyu Zhao, Yunwei Zhao, Xixun Lin, Yanxiong Lu, Chenliang Li
Main category: cs.CL
TL;DR: The paper finds that injecting noise into retrieved documents in RAG systems paradoxically improves generation quality, revealing a layer-specific functional demarcation in LLMs. Based on this insight, they propose Layer Fused Decoding (LFD) to better exploit external knowledge.
Details
Motivation: The motivation is to understand and improve how LLMs integrate external knowledge in RAG systems. The surprising empirical finding that noise injection improves generation quality provides an opportunity to analyze knowledge integration mechanisms and develop better decoding strategies.
Method: The method involves: 1) Analyzing noise injection effects to establish layer-specific functions in LLMs, 2) Proposing Layer Fused Decoding (LFD) that combines intermediate layer representations with final-layer outputs, and 3) Introducing Internal Knowledge Score (IKS) criterion to identify optimal intermediate layers.
Result: Experimental results across multiple benchmarks show that LFD helps RAG systems more effectively surface retrieved context knowledge with minimal computational cost.
Conclusion: The paper reveals a functional demarcation in LLM layers (shallow for local context, intermediate for external knowledge integration, deep for internal knowledge) and demonstrates that LFD can effectively leverage this structure to improve RAG performance.
Abstract: Retrieval-augmented generation (RAG) incorporates external knowledge into large language models (LLMs), improving their adaptability to downstream tasks and enabling information updates. Surprisingly, recent empirical evidence demonstrates that injecting noise into retrieved relevant documents paradoxically facilitates exploitation of external knowledge and improves generation quality. Although counterintuitive and challenging to apply in practice, this phenomenon enables granular control and rigorous analysis of how LLMs integrate external knowledge. Therefore, in this paper, we intervene on noise injection and establish a layer-specific functional demarcation within the LLM: shallow layers specialize in local context modeling, intermediate layers focus on integrating long-range external factual knowledge, and deeper layers primarily rely on parametric internal knowledge. Building on this insight, we propose Layer Fused Decoding (LFD), a simple decoding strategy that directly combines representations from an intermediate layer with final-layer decoding outputs to fully exploit the external factual knowledge. To identify the optimal intermediate layer, we introduce an internal knowledge score (IKS) criterion that selects the layer with the lowest IKS value in the latter half of layers. Experimental results across multiple benchmarks demonstrate that LFD helps RAG systems more effectively surface retrieved context knowledge with minimal cost.
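One simple way to realize layer fusion is to project an intermediate layer's hidden states through the same LM head and mix the resulting logits into the final layer's. The toy model, the logit-level mixing, and the fixed `fuse_layer`/`alpha` below are simplifications for illustration; the paper selects the intermediate layer with its IKS criterion rather than by hand.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab, d, n_layers = 100, 32, 8

# Toy stack; a real decoder would use causal masking, omitted here for brevity.
layers = nn.ModuleList([nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
                        for _ in range(n_layers)])
lm_head = nn.Linear(d, vocab)

def hidden_states(x):
    """Run the stack and keep the hidden states after every layer."""
    hs = []
    for layer in layers:
        x = layer(x)
        hs.append(x)
    return hs

def lfd_logits(x, fuse_layer: int, alpha: float = 0.5):
    """Fuse an intermediate layer into decoding: project it through the same
    LM head and mix its logits with the final layer's output."""
    hs = hidden_states(x)
    return lm_head(hs[-1]) + alpha * lm_head(hs[fuse_layer])

x = torch.randn(1, 10, d)                 # stand-in for embedded input tokens
print(lfd_logits(x, fuse_layer=5).shape)  # torch.Size([1, 10, 100])
```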
[137] ToolRM: Outcome Reward Models for Tool-Calling Large Language Models
Mayank Agarwal, Ibrahim Abdelaziz, Kinjal Basu, Merve Unuvar, Luis A. Lastras, Yara Rizk, Pavan Kapanipathi
Main category: cs.CL
TL;DR: The paper introduces FC-RewardBench, a benchmark for evaluating reward models in tool-calling scenarios, and proposes ToolRM, a suite of specialized reward models for tool use that outperform general-purpose baselines.
Details
Motivation: Existing reward models trained on natural language outputs struggle to evaluate tool-based reasoning and execution, creating a critical gap as LLMs increasingly interact with external tools.
Method: Introduced FC-RewardBench benchmark for systematic evaluation, then proposed a training framework for outcome reward models using data synthesized from permissively licensed, open-weight LLMs, resulting in ToolRM suite (1.7B to 14B parameters).
Result: ToolRM models consistently outperform general-purpose baselines across diverse settings, achieving up to 25% improvement with Best-of-N sampling, while improving robustness to input noise, enabling effective data filtering, and supporting RL-training of policy models.
Conclusion: Specialized reward models for tool use are necessary and effective, with ToolRM demonstrating significant performance improvements and practical benefits over existing general-purpose reward models.
Abstract: As large language models (LLMs) increasingly interact with external tools, reward modeling for tool use has emerged as a critical yet underexplored area of research. Existing reward models, trained primarily on natural language outputs, struggle to evaluate tool-based reasoning and execution. To quantify this gap, we introduce FC-RewardBench, the first benchmark to systematically evaluate reward models in tool-calling scenarios. Our analysis shows that current reward models frequently miss key signals of effective tool use, highlighting the need for domain-specific modeling. We address this by proposing a training framework for outcome reward models using data synthesized from permissively licensed, open-weight LLMs. We introduce ToolRM - a suite of reward models for tool-use ranging from 1.7B to 14B parameters. Across diverse settings, these models consistently outperform general-purpose baselines. Notably, they achieve up to a 25% improvement with Best-of-N sampling, while also improving robustness to input noise, enabling effective data filtering, and supporting RL-training of policy models.
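Best-of-N with an outcome reward model is mechanically simple, as the sketch shows. `reward_model` here is a hand-written stub (a real ToolRM is a fine-tuned neural scorer), and the tool-call schema is hypothetical.

```python
import json

def reward_model(prompt: str, tool_call: dict) -> float:
    """Stub for ToolRM: scores one candidate tool call (higher = better).
    A real outcome reward model would be a fine-tuned LLM scorer."""
    want_args = {"city", "date"}
    score = 1.0 if tool_call.get("name") == "get_weather" else 0.0
    score += len(want_args & set(tool_call.get("arguments", {}))) / len(want_args)
    return score

def best_of_n(prompt: str, candidates: list[dict]) -> dict:
    """Best-of-N selection: keep the candidate the reward model ranks highest."""
    return max(candidates, key=lambda c: reward_model(prompt, c))

prompt = "What's the weather in Pune tomorrow?"
candidates = [  # pretend these came from N=3 samples of the policy model
    {"name": "get_weather", "arguments": {"city": "Pune"}},
    {"name": "get_weather", "arguments": {"city": "Pune", "date": "tomorrow"}},
    {"name": "search_web", "arguments": {"query": "Pune weather"}},
]
print(json.dumps(best_of_n(prompt, candidates)))
```

The same scoring function doubles as a filter for synthetic training data and as a reward signal for RL, which is how the abstract's three downstream uses connect.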
[138] Relevance to Utility: Process-Supervised Rewrite for RAG
Jaeyoung Kim, Jongho Kim, Seung-won Hwang, Seoho Song, Young-In Song
Main category: cs.CL
TL;DR: R2U improves retrieval-augmented generation by approximating true document utility through joint observation of rewriting and answering, using utility-improvement supervision for fine-tuning.
Details
Motivation: Retrieval-augmented generation systems have a gap between retrieval relevance and generative utility - retrieved documents may be topically relevant but lack content needed for effective reasoning during generation. Existing bridge modules fail by not capturing "document utility".
Method: Propose R2U which approximates true utility through joint observation of rewriting and answering in the reasoning process. Use distillation to scale supervision for reliability. Construct utility-improvement supervision by measuring generator's answer gain under rewritten context, yielding signals for fine-tuning and preference optimization.
Result: Evaluated across multiple open-domain question-answering benchmarks. Empirical results demonstrate consistent improvements over strong bridging baselines.
Conclusion: R2U effectively bridges the gap between retrieval relevance and generative utility by focusing on document utility through joint reasoning observation and utility-improvement supervision.
Abstract: Retrieval-augmented generation systems often suffer from a gap between optimizing retrieval relevance and generative utility. With such a gap, retrieved documents may be topically relevant but still lack the content needed for effective reasoning during generation. While existing bridge modules attempt to rewrite the retrieved text for better generation, we show how they fail by not capturing “document utility”. In this work, we propose R2U, whose key distinction is approximating true utility through joint observation of rewriting and answering in the reasoning process. R2U then scales such supervision through distillation to enhance reliability. We further construct utility-improvement supervision by measuring the generator’s gain on the answer under the rewritten context, yielding signals for fine-tuning and preference optimization. We evaluate our method across multiple open-domain question-answering benchmarks. The empirical results demonstrate consistent improvements over strong bridging baselines.
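The utility-improvement signal can be sketched as a difference of generator scores under the original and rewritten contexts. The token-coverage scorer below is a stand-in invented for illustration; a real system would use the generator's log-likelihood of the gold answer.

```python
def generator_score(question: str, context: str, gold: str) -> float:
    """Stub: fraction of gold-answer tokens recoverable from the context.
    A real system would use the generator's likelihood of the gold answer."""
    ctx = set(context.lower().replace(".", " ").split())
    gold_toks = gold.lower().split()
    return sum(t in ctx for t in gold_toks) / len(gold_toks)

def utility_gain(question: str, gold: str, raw: str, rewritten: str) -> float:
    """Answer gain under the rewritten context; positive gain marks the
    rewrite as preferred for fine-tuning / preference optimization."""
    return generator_score(question, rewritten, gold) - generator_score(question, raw, gold)

q = "When was the bridge completed?"
raw = "The bridge, a popular landmark, attracts many tourists each year."
rewritten = "Construction records state the bridge was completed in 1932."
print(utility_gain(q, "completed in 1932", raw, rewritten))  # 1.0 -> rewrite preferred
```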
[139] DyBBT: Dynamic Balance via Bandit inspired Targeting for Dialog Policy with Cognitive Dual-Systems
Shuyu Zhang, Yifan Wei, Jialuo Yuan, Xinru Wang, Yanmin Zhu, Bin Li, Yujie Liu
Main category: cs.CL
TL;DR: DyBBT is a dialog policy learning framework that dynamically switches between fast intuitive inference and slow deliberative reasoning based on real-time cognitive states, improving exploration efficiency and performance in task-oriented dialog systems.
Details
Motivation: Current task-oriented dialog systems use static exploration strategies that don't adapt to dynamic dialog contexts, resulting in inefficient exploration and suboptimal performance.
Method: Proposes DyBBT with a structured cognitive state space capturing dialog progression, user uncertainty, and slot dependency, plus a bandit-inspired meta-controller that dynamically switches between System 1 (fast intuitive inference) and System 2 (slow deliberative reasoner) based on real-time cognitive states and visitation counts.
Result: Achieves state-of-the-art performance in success rate, efficiency, and generalization on single- and multi-domain benchmarks, with human evaluations confirming decisions align well with expert judgment.
Conclusion: DyBBT effectively addresses the exploration challenge in dialog policy learning through adaptive switching between intuitive and deliberative reasoning, demonstrating superior performance and generalization capabilities.
Abstract: Task-oriented dialog systems often rely on static exploration strategies that do not adapt to dynamic dialog contexts, leading to inefficient exploration and suboptimal performance. We propose DyBBT, a novel dialog policy learning framework that formalizes the exploration challenge through a structured cognitive state space capturing dialog progression, user uncertainty, and slot dependency. DyBBT employs a bandit-inspired meta-controller that dynamically switches between a fast intuitive inference (System 1) and a slow deliberative reasoner (System 2) based on real-time cognitive states and visitation counts. Extensive experiments on single- and multi-domain benchmarks show that DyBBT achieves state-of-the-art performance in success rate, efficiency, and generalization, with human evaluations confirming its decisions are well aligned with expert judgment. Code is available at https://github.com/carsonz/DyBBT.
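A bandit meta-controller over two "arms" (System 1 vs. System 2) can be sketched with plain UCB1. DyBBT's cognitive-state features and per-system costs are abstracted away here, so treat this only as the switching skeleton, not the paper's controller.

```python
import math
import random

random.seed(0)

class BanditMetaController:
    """UCB1-style switch between System 1 (fast) and System 2 (deliberate)."""
    def __init__(self):
        self.counts = [0, 0]      # visitation counts per system
        self.values = [0.0, 0.0]  # running mean reward per system

    def choose(self) -> int:
        for arm in (0, 1):        # play each arm once before using UCB
            if self.counts[arm] == 0:
                return arm
        t = sum(self.counts)
        ucb = [self.values[a] + math.sqrt(2 * math.log(t) / self.counts[a])
               for a in (0, 1)]
        return max((0, 1), key=lambda a: ucb[a])

    def update(self, arm: int, reward: float):
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

ctrl = BanditMetaController()
for turn in range(200):
    arm = ctrl.choose()
    # Pretend System 2 succeeds more often; in DyBBT it would also cost more.
    reward = 1.0 if random.random() < (0.6 if arm == 0 else 0.8) else 0.0
    ctrl.update(arm, reward)
print(ctrl.counts, [round(v, 2) for v in ctrl.values])
```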
[140] HiCoLoRA: Addressing Context-Prompt Misalignment via Hierarchical Collaborative LoRA for Zero-Shot DST
Shuyu Zhang, Yifan Wei, Xinru Wang, Yanmin Zhu, Yangfan He, Yixuan Weng, Bin Li, Yujie Liu
Main category: cs.CL
TL;DR: HiCoLoRA is a hierarchical LoRA framework for zero-shot dialog state tracking that addresses semantic misalignment through dynamic layer-specific processing, spectral clustering for transferable associations, and semantic-enhanced initialization to preserve pre-trained knowledge.
Details
Motivation: Zero-shot DST needs to generalize to new domains without costly annotation, but faces challenges with semantic misalignment between dynamic dialog contexts and static prompts, leading to inflexible cross-layer coordination, domain interference, and catastrophic forgetting.
Method: Proposes HiCoLoRA with: 1) hierarchical LoRA architecture for dynamic layer-specific processing (lower-layer heuristic grouping + higher-layer full interaction), 2) Spectral Joint Domain-Slot Clustering to identify transferable associations feeding an Adaptive Linear Fusion Mechanism, and 3) Semantic-Enhanced SVD Initialization (SemSVD-Init) to preserve pre-trained knowledge.
Result: Outperforms baselines on multi-domain datasets MultiWOZ and SGD, achieving state-of-the-art performance in zero-shot dialog state tracking.
Conclusion: HiCoLoRA effectively addresses semantic misalignment in zero-shot DST through hierarchical collaborative adaptation, demonstrating strong generalization to new domains without requiring annotated data.
Abstract: Zero-shot Dialog State Tracking (zs-DST) is essential for enabling Task-Oriented Dialog Systems (TODs) to generalize to new domains without costly data annotation. A central challenge lies in the semantic misalignment between dynamic dialog contexts and static prompts, leading to inflexible cross-layer coordination, domain interference, and catastrophic forgetting. To tackle this, we propose Hierarchical Collaborative Low-Rank Adaptation (HiCoLoRA), a framework that enhances zero-shot slot inference through robust prompt alignment. It features a hierarchical LoRA architecture for dynamic layer-specific processing (combining lower-layer heuristic grouping and higher-layer full interaction), integrates Spectral Joint Domain-Slot Clustering to identify transferable associations (feeding an Adaptive Linear Fusion Mechanism), and employs Semantic-Enhanced SVD Initialization (SemSVD-Init) to preserve pre-trained knowledge. Experiments on multi-domain datasets MultiWOZ and SGD show that HiCoLoRA outperforms baselines, achieving SOTA in zs-DST. Code is available at https://github.com/carsonz/HiCoLoRA.
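SemSVD-Init is not specified in detail here, but SVD-seeded LoRA initialization in general looks like the sketch below: factor the frozen weight and seed the adapter with its top singular directions, so adaptation starts aligned with pre-trained knowledge. Whether HiCoLoRA subtracts the seeded component from the frozen weight (as PiSSA-style methods do) is an open assumption of this sketch.

```python
import torch

torch.manual_seed(0)

def svd_seeded_lora(W: torch.Tensor, r: int):
    """Seed LoRA factors A, B with the top-r singular directions of the
    frozen weight W (a generic SVD init, not necessarily SemSVD-Init)."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    B = U[:, :r] * S[:r].sqrt()              # (out_dim, r)
    A = S[:r].sqrt().unsqueeze(1) * Vh[:r]   # (r, in_dim)
    return A, B

W = torch.randn(256, 128)      # frozen pre-trained weight (toy)
A, B = svd_seeded_lora(W, r=8)
delta = B @ A                  # rank-8 component; PiSSA-style methods train
                               # this and keep W - delta as the frozen residual
print(delta.shape, torch.linalg.matrix_rank(delta).item())  # (256, 128), 8
```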
[141] Quantifying LLM Biases Across Instruction Boundary in Mixed Question Forms
Zipeng Ling, Shuliang Liu, Yuehao Tang, Chen Huang, Gaoyang Jiang, Shenghong Fu, Junqi Yang, Yao Wan, Jiawan Zhang, Kejia Huang, Xuming Hu
Main category: cs.CL
TL;DR: The paper introduces BiasDetector, a benchmark to evaluate how different instruction settings (Instruction Boundary) affect LLMs’ ability to identify datasets with mixed question forms and sparse labels, revealing significant biases from user instructions.
Details
Motivation: LLM-annotated datasets often contain biases and low-quality data where questions may have none or multiple correct answers (sparse labels). Users' instructions can introduce biases when trying to identify these mixed question forms, but there's no systematic way to evaluate how different instruction settings affect LLMs' identification capabilities.
Method: Proposes BiasDetector benchmark with the concept of Instruction Boundary to systematically evaluate different instruction settings. Creates datasets with mixed question forms (MCQs with sparse labels, true-or-false with unsolvable elements) and tests LLMs under various instruction conditions to measure biases.
Result: Experiments show that users’ instructions induce large biases on the benchmark, demonstrating that instruction settings significantly affect LLMs’ ability to identify datasets with sparse label mixtures.
Conclusion: Highlights the need for both LLM developers to recognize risks of biased annotation leading to sparse label mixtures, and for awareness of problems arising from users’ instructions when identifying such datasets. Provides tools for systematic evaluation of instruction-induced biases.
Abstract: Datasets annotated by Large Language Models (LLMs) are widely used nowadays; however, large-scale annotation often introduces biases into low-quality datasets. For example, Multiple-Choice Question (MCQ) datasets with a single correct option per question are common, yet some questions may admit none or multiple correct options; likewise, true-or-false questions are supposed to be labeled either True or False, but the text can include unsolvable elements that should instead be labeled Unknown. Problems arise when low-quality datasets with mixed question forms cannot be identified. We refer to these exceptional label forms as Sparse Labels, and LLMs’ ability to distinguish datasets with a Sparse Labels mixture is important. Since users may not know the situation of a dataset, their instructions can be biased. To study how different instruction settings affect LLMs’ identification of Sparse Labels mixtures, we introduce the concept of Instruction Boundary, which systematically characterizes the instruction settings that lead to biases. We propose BiasDetector, a diagnostic benchmark to systematically evaluate LLMs on datasets with mixed question forms under Instruction Boundary settings. Experiments show that users’ instructions induce large biases on our benchmark, highlighting the need not only for LLM developers to recognize the risk that biased LLM annotation produces Sparse Labels mixtures, but also the problems arising from users’ instructions when identifying them. Code, datasets and detailed implementations are available at https://github.com/ZpLing/Instruction-Boundary.
[142] How Training Data Shapes the Use of Parametric and In-Context Knowledge in Language Models
Minsung Kim, Dong-Kyum Kim, Jea Kwon, Nakyeong Yang, Kyomin Jung, Meeyoung Cha
Main category: cs.CL
TL;DR: Training data needs specific properties (repetition, moderate inconsistency, skewed frequency) for LLMs to properly balance parametric vs. in-context knowledge.
Details
Motivation: To understand what training conditions enable language models to effectively use both parametric knowledge (from training) and in-context knowledge (provided at inference), especially when they conflict, since current models do this without explicit training objectives.
Method: Conducted controlled experiments training language models while systematically manipulating training data properties. Validated findings in real-world pretraining settings and analyzed post-training procedures.
Result: Found three counterintuitive training data properties must co-occur for robust knowledge utilization: (1) intra-document repetition, (2) moderate within-document inconsistency, and (3) skewed knowledge frequency distribution. These dynamics also emerge in real-world pretraining.
Conclusion: Provides concrete empirical guidance for training LLMs that harmoniously integrate parametric and in-context knowledge, showing that certain “detrimental” training data properties are actually necessary for effective knowledge conflict resolution.
Abstract: Large language models leverage not only parametric knowledge acquired during training but also in-context knowledge provided at inference time, despite the absence of explicit training objectives for using both sources. Prior work has further shown that when these knowledge sources conflict, models resolve the tension based on their internal confidence, preferring parametric knowledge for high-confidence facts while deferring to contextual information for less familiar ones. However, the training conditions that give rise to such knowledge utilization behaviors remain unclear. To address this gap, we conduct controlled experiments in which we train language models while systematically manipulating key properties of the training data. Our results reveal a counterintuitive finding: three properties commonly regarded as detrimental must co-occur for robust knowledge utilization and conflict resolution to emerge: (i) intra-document repetition of information, (ii) a moderate degree of within-document inconsistency, and (iii) a skewed knowledge frequency distribution. We further validate that the same training dynamics observed in our controlled setting also arise during real-world language model pretraining, and we analyze how post-training procedures can reshape models’ knowledge preferences. Together, our findings provide concrete empirical guidance for training language models that harmoniously integrate parametric and in-context knowledge.
[143] GIFT: Guided Importance-Aware Fine-Tuning for Diffusion Language Models
Guowei Xu, Wenxin Xu, Jiawang Zhao, Kaisheng Ma
Main category: cs.CL
TL;DR: GIFT is an importance-aware fine-tuning method for diffusion language models that assigns different weights to tokens based on their entropy, achieving superior performance across diverse settings compared to standard supervised fine-tuning.
Details
Motivation: Diffusion models show promise for language modeling with faster generation, but supervised fine-tuning is challenging due to lack of precise probability estimates at each denoising step. The diffusion mechanism makes generation less predictable and inconsistent, highlighting the need to control key tokens that guide generation direction.
Method: Proposes GIFT (importance-aware finetuning method) where tokens are assigned different importance weights based on their entropy. Derived from diffusion theory, it addresses the challenge of applying SFT to diffusion language models by focusing on controlling key tokens.
Result: GIFT delivers substantial gains across diverse settings: different mainstream training datasets (1k to 10k size), using LoRA or full parameter fine-tuning, and training on base or instruct models. Consistently achieves superior overall performance compared to standard SFT on four reasoning benchmarks (Sudoku, Countdown, GSM8K, and MATH-500).
Conclusion: GIFT effectively addresses the challenges of supervised fine-tuning for diffusion language models by incorporating importance-aware weighting based on token entropy, leading to consistently better performance across various experimental settings and reasoning tasks.
Abstract: Diffusion models have recently shown strong potential in language modeling, offering faster generation compared to traditional autoregressive approaches. However, applying supervised fine-tuning (SFT) to diffusion models remains challenging, as they lack precise probability estimates at each denoising step. While the diffusion mechanism enables the model to reason over entire sequences, it also makes the generation process less predictable and often inconsistent. This highlights the importance of controlling key tokens that guide the direction of generation. To address this issue, we propose GIFT, an importance-aware fine-tuning method for diffusion language models, where tokens are assigned different importance weights based on their entropy. Derived from diffusion theory, GIFT delivers substantial gains across diverse settings: with mainstream training datasets ranging from 1k to 10k examples, with LoRA or full-parameter fine-tuning, and when training base or instruct models, it consistently achieves superior overall performance compared to standard SFT on four widely used reasoning benchmarks (Sudoku, Countdown, GSM8K, and MATH-500).
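An entropy-based importance weighting can be written as a reweighted token-level cross-entropy, as below. The per-sequence normalization and the use of predictive entropy as the weight are guesses at the spirit of the method; GIFT's exact weights are derived from diffusion theory and may differ.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def entropy_weighted_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Toy importance-aware SFT loss: per-token cross-entropy reweighted by
    the model's predictive entropy at that position."""
    ce = F.cross_entropy(logits.transpose(1, 2), targets, reduction="none")  # (B, T)
    probs = logits.softmax(-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1)                 # (B, T)
    weights = entropy / entropy.sum(dim=-1, keepdim=True)                    # per-sequence norm
    return (weights * ce).sum(dim=-1).mean()

logits = torch.randn(2, 16, 1000, requires_grad=True)  # (batch, seq, vocab)
targets = torch.randint(0, 1000, (2, 16))
loss = entropy_weighted_loss(logits, targets)
loss.backward()                                        # gradients flow to high-entropy tokens
print(float(loss))
```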
[144] LayerNorm Induces Recency Bias in Transformer Decoders
Junu Kim, Xiao Liu, Zhenghao Lin, Lei Ji, Yeyun Gong, Edward Choi
Main category: cs.CL
TL;DR: Causal self-attention with LayerNorm induces recency bias in Transformers, contrary to prior belief that causal attention alone causes bias toward earlier tokens.
Details
Motivation: There's a discrepancy between prior work showing causal self-attention alone causes bias toward earlier tokens, and the observed recency bias (bias toward later tokens) in actual Transformer decoders. The paper aims to understand this contradiction by analyzing interactions between causal self-attention and other architectural components.
Method: Analyzed the interaction between causal self-attention layers and other architectural components (LayerNorm, residual connections, input token embeddings). Theoretical analysis to understand how these components collectively induce positional biases.
Result: Found that stacked causal self-attention layers combined with LayerNorm induce recency bias (bias toward later tokens). Also examined effects of residual connections and input token embedding distributions on this bias.
Conclusion: Provides new theoretical insights into how positional information interacts with architectural components in Transformers, suggesting directions for improving positional encoding strategies based on understanding these interactions.
Abstract: Causal self-attention provides positional information to Transformer decoders. Prior work has shown that stacks of causal self-attention layers alone induce a positional bias in attention scores toward earlier tokens. However, this differs from the bias toward later tokens typically observed in Transformer decoders, known as recency bias. We address this discrepancy by analyzing the interaction between causal self-attention and other architectural components. We show that stacked causal self-attention layers combined with LayerNorm induce recency bias. Furthermore, we examine the effects of residual connections and the distribution of input token embeddings on this bias. Our results provide new theoretical insights into how positional information interacts with architectural components and suggest directions for improving positional encoding strategies.
[145] Big Reasoning with Small Models: Instruction Retrieval at Inference Time
Kenan Alkiek, David Jurgens, Vinod Vydiswaran
Main category: cs.CL
TL;DR: Instruction retrieval improves small language models by augmenting them with structured reasoning procedures retrieved at inference time, achieving significant accuracy gains on domain-specific tasks without fine-tuning.
Details
Motivation: Small language models struggle with domain knowledge and multi-step reasoning tasks. Existing approaches either rely on model scale, require task-specific training that limits reusability, or use unstructured information retrieval that doesn't provide clear reasoning strategies.
Method: Instruction retrieval: create an Instruction Corpus by clustering similar training questions, using a teacher model to generate generalizable guides with domain background and step-by-step procedures. At inference, the SLM retrieves relevant instructions and executes the procedures without fine-tuning.
Result: Across medicine, law, and mathematics domains, instruction retrieval yields consistent gains for models with ≥3B parameters: 9.4%, 7.9%, and 5.1% accuracy improvements respectively. The 14B model surpasses GPT-4o’s zero-shot performance on knowledge-intensive tasks.
Conclusion: Instruction retrieval provides an effective inference-time intervention that enhances small language models’ reasoning capabilities by providing structured, reusable procedures, enabling them to compete with much larger models on specialized domain tasks.
Abstract: Small language models (SLMs) enable low-cost, private, on-device inference, but they often fail on problems that require specialized domain knowledge or multi-step reasoning. Existing approaches for improving reasoning either rely on scale (e.g., chain-of-thought prompting), require task-specific training that limits reuse and generality (e.g., distillation), or retrieve unstructured information that still leaves the SLM to determine an appropriate reasoning strategy. We propose instruction retrieval, an inference-time intervention that augments an SLM with structured, reusable reasoning procedures rather than raw passages. We construct an Instruction Corpus by clustering similar training questions and using a teacher model to generate generalizable guides that pair domain background with explicit step-by-step procedures. At inference, the SLM retrieves the instructions most relevant to a given query and executes the associated procedures without any additional fine-tuning. Across three challenging domains (medicine, law, and mathematics), instruction retrieval yields consistent gains for models with at least 3B parameters, improving accuracy by 9.4%, 7.9%, and 5.1%, respectively, with the strongest 14B model surpassing GPT-4o’s zero-shot performance on knowledge-intensive tasks.
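At inference time, instruction retrieval is just nearest-neighbor lookup over the Instruction Corpus. The sketch below uses toy lexical overlap in place of a dense encoder, and the three guides are invented examples of the domain-background-plus-procedure format, not entries from the actual corpus.

```python
def similarity(a: str, b: str) -> float:
    """Toy lexical Jaccard overlap; a real system would use dense embeddings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(1, len(ta | tb))

# Hypothetical Instruction Corpus entries: (guide_id, trigger question, procedure).
corpus = [
    ("med-dosage", "calculate a weight-based drug dosage",
     "1) convert weight to kg 2) multiply by the mg/kg rate 3) check the max dose"),
    ("law-limitations", "does the statute of limitations bar this claim",
     "1) classify the claim 2) look up the limitation period 3) compute the deadline"),
    ("math-work-rate", "solve a combined work-rate problem",
     "1) convert each worker to a rate 2) sum the rates 3) invert for total time"),
]

def retrieve_instructions(query: str, k: int = 1):
    ranked = sorted(corpus, key=lambda g: -similarity(query, g[1]))
    return ranked[:k]

query = "solve this work-rate problem: two pipes fill a tank together"
for gid, _, procedure in retrieve_instructions(query):
    print(gid, "->", procedure)   # the SLM would now execute these steps
```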
[146] Believing without Seeing: Quality Scores for Contextualizing Vision-Language Model Explanations
Keyu He, Tejas Srinivasan, Brihi Joshi, Xiang Ren, Jesse Thomason, Swabha Swayamdipta
Main category: cs.CL
TL;DR: Vision-Language Models (VLMs) can mislead users with explanations that make incorrect predictions seem plausible. The paper proposes two new explanation quality metrics - Visual Fidelity and Contrastiveness - that better correlate with model correctness and help users identify when to trust VLM predictions.
Details
Motivation: When users (especially blind/low-vision users) cannot see visual context, VLM explanations can convince them to trust incorrect predictions. There's a need for better ways to signal which VLM predictions are reliable through explanation quality assessment.
Method: Proposes two explanation quality scoring functions: 1) Visual Fidelity - measures how faithful explanations are to actual visual context, and 2) Contrastiveness - measures how well explanations identify distinguishing visual details between the prediction and alternatives. Evaluates these on A-OKVQA, VizWiz, and MMMU-Pro tasks.
Result: The proposed quality scoring functions are better calibrated with model correctness than existing explanation metrics. In user studies, showing these quality scores alongside explanations improved participants’ accuracy at predicting VLM correctness by 11.1%, with 15.4% reduction in falsely believing incorrect predictions.
Conclusion: Explanation quality scores (Visual Fidelity and Contrastiveness) effectively foster appropriate reliance on VLM predictions by helping users identify when explanations are trustworthy, especially important for users who cannot access visual context.
Abstract: When people query Vision-Language Models (VLMs) but cannot see the accompanying visual context (e.g. for blind and low-vision users), augmenting VLM predictions with natural language explanations can signal which model predictions are reliable. However, prior work has found that explanations can easily convince users that inaccurate VLM predictions are correct. To remedy undesirable overreliance on VLM predictions, we propose evaluating two complementary qualities of VLM-generated explanations via two quality scoring functions. We propose Visual Fidelity, which captures how faithful an explanation is to the visual context, and Contrastiveness, which captures how well the explanation identifies visual details that distinguish the model’s prediction from plausible alternatives. On the A-OKVQA, VizWiz, and MMMU-Pro tasks, these quality scoring functions are better calibrated with model correctness than existing explanation qualities. We conduct a user study in which participants have to decide whether a VLM prediction is accurate without viewing its visual context. We observe that showing our quality scores alongside VLM explanations improves participants’ accuracy at predicting VLM correctness by 11.1%, including a 15.4% reduction in the rate of falsely believing incorrect predictions. These findings highlight the utility of explanation quality scores in fostering appropriate reliance on VLM predictions.
[147] LiteStage: Latency-aware Layer Skipping for Multi-stage Reasoning
Beomseok Kang, Jiwon Song, Jae-Joon Kim
Main category: cs.CL
TL;DR: LiteStage: A latency-aware layer skipping framework for multi-stage reasoning that balances efficiency and accuracy through stage-wise layer allocation and confidence-based early exit.
Details
Motivation: Multi-stage reasoning improves small language models' reasoning capability but increases latency. Existing adaptive acceleration techniques like layer skipping struggle to balance efficiency and accuracy due to stage-wise variation in skip sensitivity and redundant output token generation.
Method: Proposes LiteStage with two components: (1) stage-wise offline search that allocates optimal layer budgets for each reasoning stage, and (2) online confidence-based generation early exit to suppress unnecessary decoding.
Result: Experiments on three benchmarks (OBQA, CSQA, and StrategyQA) show that LiteStage outperforms prior training-free layer skipping methods in balancing efficiency and accuracy.
Conclusion: LiteStage effectively addresses the efficiency-accuracy trade-off in multi-stage reasoning by combining stage-aware layer allocation with confidence-based early exit, demonstrating superior performance over existing methods.
Abstract: Multi-stage reasoning has emerged as an effective strategy for enhancing the reasoning capability of small language models by decomposing complex problems into sequential sub-stages. However, this comes at the cost of increased latency. We observe that existing adaptive acceleration techniques, such as layer skipping, struggle to balance efficiency and accuracy in this setting due to two key challenges: (1) stage-wise variation in skip sensitivity, and (2) the generation of redundant output tokens. To address these, we propose LiteStage, a latency-aware layer skipping framework for multi-stage reasoning. LiteStage combines a stage-wise offline search that allocates optimal layer budgets with an online confidence-based generation early exit to suppress unnecessary decoding. Experiments on three benchmarks (OBQA, CSQA, and StrategyQA) show that LiteStage outperforms prior training-free layer skipping methods.
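The abstract gives enough to sketch the two online ingredients. In the sketch below, `model` is assumed to expose a skip-aware forward pass (`num_active_layers`) and an `eos_token_id`; the stage names and layer budgets are invented for illustration, and the offline search that finds the budgets is omitted.

```python
# Hedged sketch of stage-budgeted decoding with confidence-based early exit.
import torch

stage_layer_budget = {"decompose": 20, "solve": 28, "answer": 16}  # hypothetical budgets

def decode_stage(model, input_ids, stage, max_new_tokens=128, conf_threshold=0.9):
    """Greedy decoding that runs only the stage's layer budget and stops early
    once the model is confident the stage output is complete."""
    n_layers = stage_layer_budget[stage]
    for _ in range(max_new_tokens):
        logits = model(input_ids, num_active_layers=n_layers)  # assumed interface
        probs = torch.softmax(logits[:, -1, :], dim=-1)
        next_id = probs.argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_id], dim=-1)
        if probs[0, model.eos_token_id] > conf_threshold:      # early exit
            break
    return input_ids
```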
[148] A Comparative Analysis of Contextual Representation Flow in State-Space and Transformer Architectures
Nhat M. Hoang, Do Xuan Long, Cong-Duy Nguyen, Min-Yen Kan, Luu Anh Tuan
Main category: cs.CL
TL;DR: SSMs preserve token uniqueness early but converge to homogenization deeper, while TBMs rapidly homogenize then re-diversify later; oversmoothing in TBMs is architectural while in SSMs it’s from training dynamics.
Details
Motivation: State Space Models (SSMs) have emerged as efficient alternatives to Transformer-Based Models (TBMs) for long-sequence processing with linear scaling, but how contextual information flows across layers in these architectures remains understudied.
Method: First unified, token- and layer-wise analysis of representation propagation in SSMs and TBMs using centered kernel alignment, variance-based metrics, and probing to characterize how representations evolve within and across layers.
Result: TBMs rapidly homogenize token representations with diversity reemerging only in later layers, while SSMs preserve token uniqueness early but converge to homogenization deeper. Theoretical analysis and parameter randomization reveal oversmoothing in TBMs stems from architectural design, whereas in SSMs it arises mainly from training dynamics.
Conclusion: These insights clarify the inductive biases of both architectures and inform future model and training designs for long-context reasoning.
Abstract: State Space Models (SSMs) have recently emerged as efficient alternatives to Transformer-Based Models (TBMs) for long-sequence processing with linear scaling, yet how contextual information flows across layers in these architectures remains understudied. We present the first unified, token- and layer-wise analysis of representation propagation in SSMs and TBMs. Using centered kernel alignment, variance-based metrics, and probing, we characterize how representations evolve within and across layers. We find a key divergence: TBMs rapidly homogenize token representations, with diversity reemerging only in later layers, while SSMs preserve token uniqueness early but converge to homogenization deeper. Theoretical analysis and parameter randomization further reveal that oversmoothing in TBMs stems from architectural design, whereas in SSMs, it arises mainly from training dynamics. These insights clarify the inductive biases of both architectures and inform future model and training designs for long-context reasoning.
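For readers unfamiliar with the main measurement tool, linear centered kernel alignment between two layers' token representations is only a few lines; the formula below is the standard one, not anything specific to this paper.

```python
# Linear CKA between two (n_tokens, dim) representation matrices.
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    X = X - X.mean(axis=0)   # center each feature column
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))
```

A layer-by-layer grid of such scores is the kind of comparison that exposes the homogenization and re-diversification patterns the paper reports.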
[149] User Perceptions of Privacy and Helpfulness in LLM Responses to Privacy-Sensitive Scenarios
Xiaoyuan Wu, Roshni Kaushik, Wenkai Li, Lujo Bauer, Koichi Onoue
Main category: cs.CL
TL;DR: Proxy LLMs cannot accurately estimate users’ perceptions of privacy and utility in privacy-sensitive scenarios, despite high agreement among themselves.
Details
Motivation: Prior work evaluated LLMs' privacy-preserving abilities using proxy LLMs to judge responses, but didn't measure actual user perceptions of helpfulness and privacy in sensitive scenarios.
Method: Conducted a user study (n=94) using 90 PrivacyLens scenarios to directly measure users’ perceptions of LLM responses’ helpfulness and privacy-preservation quality.
Result: Users had low agreement when evaluating identical LLM responses, while five proxy LLMs reached high agreement but had low correlation with users’ evaluations.
Conclusion: Proxy LLMs cannot accurately estimate users’ wide range of perceptions; more user-centered studies and improved alignment between LLMs and users are needed for privacy-sensitive scenarios.
Abstract: Large language models (LLMs) are rapidly being adopted for tasks like drafting emails, summarizing meetings, and answering health questions. In these settings, users may need to share private information (e.g., contact details, health records). To evaluate LLMs’ ability to identify and redact such information, prior work introduced real-life, scenario-based benchmarks (e.g., ConfAIde, PrivacyLens) and found that LLMs can leak private information in complex scenarios. However, these evaluations relied on proxy LLMs to judge the helpfulness and privacy-preservation quality of LLM responses, rather than directly measuring users’ perceptions. To understand how users perceive the helpfulness and privacy-preservation quality of LLM responses to privacy-sensitive scenarios, we conducted a user study ($n=94$) using 90 PrivacyLens scenarios. We found that users had low agreement with each other when evaluating identical LLM responses. In contrast, five proxy LLMs reached high agreement, yet each proxy LLM had low correlation with users’ evaluations. These results indicate that proxy LLMs cannot accurately estimate users’ wide range of perceptions of utility and privacy in privacy-sensitive scenarios. We discuss the need for more user-centered studies to measure LLMs’ ability to help users while preserving privacy, and for improving alignment between LLMs and users in estimating perceived privacy and utility.
[150] Beyond Monolingual Assumptions: A Survey of Code-Switched NLP in the Era of Large Language Models across Modalities
Rajvee Sheth, Samridhi Raj Sinha, Mahavir Patil, Himanshu Beniwal, Mayank Singh
Main category: cs.CL
TL;DR: This survey paper provides the first comprehensive analysis of code-switching (CSW) in large language models, reviewing 327 studies across multiple research areas, tasks, datasets, and languages to address the persistent challenges of mixed-language processing in multilingual NLP.
Details
Motivation: Despite rapid advances in large language models, most LLMs still struggle with code-switching (mixed-language inputs), limited CSW datasets, and evaluation biases, which hinders their deployment in multilingual societies where code-switching is common.
Method: The paper conducts a comprehensive survey of 327 studies spanning five research areas, 15+ NLP tasks, 30+ datasets, and 80+ languages. It categorizes recent advances by architecture, training strategy, and evaluation methodology to analyze how LLMs have reshaped CSW modeling.
Result: The survey identifies persistent challenges in CSW-aware LLM research and provides a systematic categorization of current approaches, highlighting the need for better solutions to handle mixed-language inputs effectively.
Conclusion: The paper concludes with a roadmap emphasizing the need for inclusive datasets, fair evaluation methods, and linguistically grounded models to achieve truly multilingual capabilities in LLMs, with an accompanying GitHub repository for resources.
Abstract: Code-switching (CSW), the alternation of languages and scripts within a single utterance, remains a fundamental challenge for multilingual NLP, even amidst the rapid advances of large language models (LLMs). Most LLMs still struggle with mixed-language inputs, limited CSW datasets, and evaluation biases, which hinder their deployment in multilingual societies. This survey provides the first comprehensive analysis of CSW-aware LLM research, reviewing 327 studies spanning five research areas, 15+ NLP tasks, 30+ datasets, and 80+ languages. We categorize recent advances by architecture, training strategy, and evaluation methodology, outlining how LLMs have reshaped CSW modeling and identifying the challenges that persist. The paper concludes with a roadmap that emphasizes the need for inclusive datasets, fair evaluation, and linguistically grounded models to achieve truly multilingual capabilities. Resources are collected at https://github.com/lingo-iitgn/awesome-code-mixing/.
[151] Beyond Direct Generation: A Decomposed Approach to Well-Crafted Screenwriting with LLMs
Hang Lei, Shengyi Zong, Zhaoyan Li, Ziren Zhou, Hao Liu, Liang Yu
Main category: cs.CL
TL;DR: DSR framework decomposes screenplay generation into creative narrative writing and format conversion stages, using hybrid data synthesis to overcome training data scarcity, achieving 75% win rate against strong baselines.
Details
Motivation: Direct end-to-end LLM generation fails to produce professional-quality screenplays because it forces models to simultaneously handle creative narrative construction and rigid format adherence, resulting in superficial style without structural integrity.
Method: Dual-Stage Refinement (DSR) framework: Stage 1 transforms brief outlines into rich novel-style prose (creative narrative generation), Stage 2 refines this narrative into professionally formatted screenplays (format conversion). Uses hybrid data synthesis: reverse synthesis deconstructs existing screenplays into structured inputs, forward synthesis generates narrative texts as training targets.
Result: DSR achieves 75% win rate against strong baselines like Gemini-2.5-Pro and reaches 82.7% of human-level performance in blind evaluations by professional screenwriters.
Conclusion: Decomposed generation architecture with tailored data synthesis effectively specializes LLMs in complex creative domains like screenplay writing, demonstrating that separating creative and formatting tasks yields superior results.
Abstract: The screenplay serves as the foundation for television production, defining narrative structure, character development, and dialogue. While Large Language Models (LLMs) show great potential in creative writing, direct end-to-end generation approaches often fail to produce well-crafted screenplays. We argue this failure stems from forcing a single model to simultaneously master two disparate capabilities: creative narrative construction and rigid format adherence. The resulting outputs may mimic superficial style but lack the deep structural integrity and storytelling substance required for professional use. To enable LLMs to generate high-quality screenplays, we introduce Dual-Stage Refinement (DSR), a decomposed framework that decouples creative narrative generation from format conversion. The first stage transforms a brief outline into rich, novel-style prose. The second stage refines this narrative into a professionally formatted screenplay. This separation enables the model to specialize in one distinct capability at each stage. A key challenge in implementing DSR is the scarcity of paired outline-to-novel training data. We address this through hybrid data synthesis: reverse synthesis deconstructs existing screenplays into structured inputs, while forward synthesis leverages these inputs to generate high-quality narrative texts as training targets. Blind evaluations by professional screenwriters show that DSR achieves a 75% win rate against strong baselines like Gemini-2.5-Pro and reaches 82.7% of human-level performance. Our work demonstrates that decomposed generation architecture with tailored data synthesis effectively specializes LLMs in complex creative domains.
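Since the framework is essentially two sequential LLM calls, a skeleton is easy to write down; `llm` is an assumed text-completion function and the prompts are illustrative, not the paper's.

```python
# Minimal sketch of the two DSR stages, assuming llm(prompt) -> str.
def dual_stage_refinement(llm, outline: str) -> str:
    # Stage 1: creative narrative generation
    narrative = llm("Expand this outline into rich, novel-style prose:\n" + outline)
    # Stage 2: format conversion
    return llm("Convert the following narrative into a professionally formatted "
               "screenplay (scene headings, action lines, dialogue):\n" + narrative)
```

The paper's contribution is less this control flow than the hybrid data synthesis used to train each stage's specialization.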
[152] Don’t Adapt Small Language Models for Tools; Adapt Tool Schemas to the Models
Jonggeun Lee, Woojung Song, Jongwook Han, Haesung Pyun, Yohan Jo
Main category: cs.CL
TL;DR: PA-Tool improves small language models’ tool-use by aligning tool schemas with models’ pretraining knowledge instead of forcing models to adapt to arbitrary schemas, reducing schema misalignment errors by 80% and improving performance up to 17%.
Details
Motivation: Small language models struggle with tool-use tasks due to schema misalignment - they hallucinate plausible but nonexistent tool names based on pretraining conventions that don't match provided tool schemas. Instead of forcing models to adapt to arbitrary schemas, the paper proposes adapting schemas to align with models' pretrained knowledge.
Method: PA-Tool (Pretraining-Aligned Tool Schema Generation) is a training-free method that uses peakedness (a signal from contamination detection indicating pretraining familiarity) to rename tool components. It generates multiple naming candidates and selects those with highest peakedness across samples to identify pretraining-aligned naming patterns.
Result: Experiments on MetaTool and RoTBench show improvements up to 17%, with schema misalignment errors reduced by 80%. Small models approach state-of-the-art performance while maintaining computational efficiency in adapting to new tools without retraining.
Conclusion: Schema-level interventions can unlock tool-use potential of resource-efficient models by adapting schemas to models rather than models to schemas. PA-Tool enables small models to leverage their pretraining knowledge effectively for tool-use tasks.
Abstract: Small language models (SLMs) enable scalable multi-agent tool systems where multiple SLMs handle subtasks orchestrated by a powerful coordinator. However, they struggle with tool-use tasks, particularly in selecting appropriate tools and identifying correct parameters. A common failure mode is schema misalignment: models hallucinate plausible but nonexistent tool names that reflect naming conventions internalized during pretraining but absent from the provided tool schema. Rather than forcing models to adapt to arbitrary schemas, we propose adapting schemas to align with models’ pretrained knowledge. We introduce PA-Tool (Pretraining-Aligned Tool Schema Generation), a training-free method that leverages peakedness, a signal from contamination detection indicating pretraining familiarity, to rename tool components. By generating multiple candidates and selecting those with the highest peakedness across samples, PA-Tool identifies pretraining-aligned naming patterns. Experiments on MetaTool and RoTBench show improvements of up to 17%, with schema misalignment errors reduced by 80%. PA-Tool enables small models to approach state-of-the-art performance while maintaining computational efficiency in adapting to new tools without retraining. Our work demonstrates that schema-level interventions can unlock the tool-use potential of resource-efficient models by adapting schemas to models rather than models to schemas.
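The selection step can be pictured as follows. We read "peakedness" as how concentrated repeated samples are on a single candidate; that reading, and the `sample_candidate` callable, are our assumptions, not the authors' exact definition.

```python
# Hedged sketch of peakedness-based name selection.
from collections import Counter

def pick_aligned_name(sample_candidate, n_samples: int = 20):
    """sample_candidate() -> str draws one candidate renaming from the LLM."""
    counts = Counter(sample_candidate() for _ in range(n_samples))
    name, freq = counts.most_common(1)[0]
    peakedness = freq / n_samples  # near 1.0 when samples agree, i.e. familiar from pretraining
    return name, peakedness
```

A name the model keeps regenerating verbatim is, by this proxy, one it already "knows", which is exactly the naming convention PA-Tool wants the schema to adopt.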
[153] Investigating Counterclaims in Causality Extraction from Text
Tim Hagen, Niklas Deckers, Felix Wolter, Harrisen Scells, Martin Potthast
Main category: cs.CL
TL;DR: New dataset for causality extraction that includes counterclaims, showing SOTA models trained only on causal claims perform poorly on counterclaims.
Details
Motivation: Existing causality extraction research neglects counterclaims of causation, despite their importance in scientific discourse and causal reasoning.
Method: Conducted literature review, compiled linguistic patterns of countercausal claims, developed annotation guidelines, and created dataset with 1028 causal claims, 952 counterclaims, and 1435 uncausal statements.
Result: Achieved substantial inter-annotator agreement (κ=0.74). Models trained only on causal claims misclassify counterclaims 10x more often than models trained on the new dataset.
Conclusion: Counterclaims are essential for robust causality extraction, and the new dataset significantly improves model performance on identifying countercausal language.
Abstract: Many causal claims, such as “sugar causes hyperactivity,” are disputed or outdated. Yet research on causality extraction from text has almost entirely neglected counterclaims of causation. To close this gap, we conduct a thorough literature review of causality extraction, compile an extensive inventory of linguistic realizations of countercausal claims, and develop rigorous annotation guidelines that explicitly incorporate countercausal language. We also highlight how counterclaims of causation are an integral part of causal reasoning. Based on our guidelines, we construct a new dataset comprising 1028 causal claims, 952 counterclaims, and 1435 uncausal statements, achieving substantial inter-annotator agreement (Cohen’s $\kappa = 0.74$). In our experiments, state-of-the-art models trained solely on causal claims misclassify counterclaims more than 10 times as often as models trained on our dataset.
[154] EngTrace: A Symbolic Benchmark for Verifiable Process Supervision of Engineering Reasoning
Ayesha Gull, Muhammad Usman Safder, Rania Elbadry, Fan Zhang, Veselin Stoyanov, Preslav Nakov, Zhuohan Xie
Main category: cs.CL
TL;DR: EngTrace is a symbolic benchmark for evaluating LLMs’ physically-grounded engineering reasoning through 1,350 contamination-resistant test cases and a verifiable two-stage evaluation framework that assesses both intermediate reasoning traces and final answers.
Details
Motivation: Existing benchmarks (MMLU, MATH, HumanEval) assess isolated cognitive skills but fail to capture the physically grounded reasoning required in safety-critical engineering workflows where scientific principles, quantitative modeling, and practical constraints must converge.
Method: Created EngTrace benchmark with 90 templates across 3 engineering branches, 9 core domains, and 20 areas; generated 1,350 unique test cases via domain-aware parameterization. Introduced verifiable two-stage evaluation framework with tiered protocol using automated procedural checks and heterogeneous AI Tribunal to validate intermediate reasoning traces alongside final answers.
Result: Evaluation of 24 leading LLMs revealed distinct trade-off between numeric precision and trace fidelity, identifying a complexity cliff where abstract mathematical pre-training fails to translate into integrative reasoning required for advanced engineering tasks.
Conclusion: EngTrace enables rigorous evaluation of LLMs’ engineering reasoning capabilities, moving beyond outcome matching to verifiable process supervision, revealing limitations in current models’ ability to perform integrative physically-grounded reasoning essential for safety-critical engineering applications.
Abstract: Large Language Models (LLMs) are increasingly entering specialized, safety-critical engineering workflows governed by strict quantitative standards and immutable physical laws, making rigorous evaluation of their reasoning capabilities imperative. However, existing benchmarks such as MMLU, MATH, and HumanEval assess isolated cognitive skills, failing to capture the physically grounded reasoning central to engineering, where scientific principles, quantitative modeling, and practical constraints must converge. To enable verifiable process supervision in engineering, we introduce EngTrace, a symbolic benchmark comprising 90 templates across three major engineering branches, nine core domains and 20 distinct areas. Through domain-aware parameterization, we generate 1,350 unique, contamination-resistant test cases to stress-test generalization. Moving beyond outcome matching, we introduce a verifiable two-stage evaluation framework that uses a tiered protocol to validate intermediate reasoning traces alongside final answers through automated procedural checks and a heterogeneous AI Tribunal. Our evaluation of 24 leading LLMs reveals a distinct trade-off between numeric precision and trace fidelity, identifying a complexity cliff where abstract mathematical pre-training fails to translate into the integrative reasoning required for advanced engineering tasks.
[155] Merlin’s Whisper: Enabling Efficient Reasoning in Large Language Models via Black-box Persuasive Prompting
Heming Xia, Cunxiao Du, Rui Li, Chak Tou Leong, Yongqi Li, Wenjie Li
Main category: cs.CL
TL;DR: Whisper is a black-box persuasive prompting framework that reduces token usage in large reasoning models by 40-50% while maintaining accuracy, without requiring model access or fine-tuning.
Details
Motivation: Large reasoning models (LRMs) incur substantial computational and latency overheads due to lengthy step-by-step reasoning processes, hindering their practical deployment in real-world applications.
Method: Treats LRMs as black-box communicators and uses iterative refinement to generate high-quality persuasive prompts from diverse perspectives, encouraging models to produce concise responses without compromising accuracy.
Result: Achieves 3x reduction in average response length on GSM8K for Qwen3 models, ~40% token reduction across all benchmarks, 46% reduction for Claude-3.7 on MATH-500, and 50% reduction for Gemini-2.5.
Conclusion: Black-box persuasive prompting via Whisper is a practical and broadly applicable strategy for enhancing LRM efficiency across different data domains, model scales, and families without requiring model modifications.
Abstract: Large reasoning models (LRMs) have demonstrated remarkable proficiency in tackling complex tasks through step-by-step thinking. However, this lengthy reasoning process incurs substantial computational and latency overheads, hindering the practical deployment of LRMs. This work presents a new approach to mitigating overthinking in LRMs via black-box persuasive prompting. By treating LRMs as black-box communicators, we investigate how to persuade them to generate concise responses without compromising accuracy. We introduce Whisper, an iterative refinement framework that generates high-quality persuasive prompts from diverse perspectives. Experiments across multiple benchmarks demonstrate that Whisper consistently reduces token usage while preserving performance. Notably, Whisper achieves a 3x reduction in average response length on simple GSM8K questions for the Qwen3 model series and delivers an average ~40% token reduction across all benchmarks. For closed-source APIs, Whisper reduces token usage on MATH-500 by 46% for Claude-3.7 and 50% for Gemini-2.5. Further analysis reveals the broad applicability of Whisper across data domains, model scales, and families, underscoring the potential of black-box persuasive prompting as a practical strategy for enhancing LRM efficiency.
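One way to picture the iterative refinement is a simple hill-climbing loop over prompts, scored by output length at fixed accuracy. `evaluate` and `revise` are assumed helpers standing in for the paper's pipeline, not its API.

```python
# Illustrative refinement loop for persuasive prompts (not the authors' code).
def refine_persuasive_prompt(evaluate, revise, seed_prompt: str, rounds: int = 5):
    """evaluate(prompt) -> (accuracy, avg_tokens); revise(prompt, stats) -> str."""
    best, (best_acc, best_tok) = seed_prompt, evaluate(seed_prompt)
    for _ in range(rounds):
        candidate = revise(best, (best_acc, best_tok))
        acc, tok = evaluate(candidate)
        if acc >= best_acc and tok < best_tok:   # shorter without losing accuracy
            best, best_acc, best_tok = candidate, acc, tok
    return best
```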
[156] Table as a Modality for Large Language Models
Liyao Li, Chao Ye, Wentao Ye, Yifei Sun, Zhe Jiang, Haobo Wang, Jiaming Tian, Yiming Zhang, Ningtao Wang, Xing Fu, Gang Chen, Junbo Zhao
Main category: cs.CL
TL;DR: TAMO is a multimodal framework that treats tables as an independent modality integrated with text tokens, using a hypergraph neural network as global table encoder with LLMs, achieving 42.65% average improvement on table reasoning tasks.
Details
Motivation: Current LLMs fall short on table reasoning tasks because they serialize tabular data, losing structural information. Even advanced LLMs like GPTs struggle with tabular data as shown in the StructQA benchmark.
Method: Proposes TAMO (Tables As a MOdality) - a multimodal framework with hypergraph neural network as global table encoder integrated with mainstream LLMs. Treats tables as independent modality alongside text tokens.
Result: Significant improvements on multiple benchmarking datasets (HiTab, WikiTQ, WikiSQL, FeTaQA, StructQA) with average relative gain of 42.65% compared to existing methods.
Conclusion: Treating tables as an independent modality with specialized structural encoding (hypergraph neural networks) integrated with LLMs substantially improves table reasoning performance, addressing the structural information loss problem in current approaches.
Abstract: To carry the remarkable successes of Large Language Models (LLMs) over to widely deployed tabular data, the community has made numerous efforts to generalize them to table reasoning tasks. Despite that, in this work, by showing a probing experiment on our proposed StructQA benchmark, we postulate that even the most advanced LLMs (such as GPTs) may still fall short of coping with tabular data. More specifically, the current scheme often simply relies on serializing the tabular data, together with the meta information, then inputting them through the LLMs. We argue that the loss of structural information is the root of this shortcoming. In this work, we further propose TAMO, which treats tables as an independent modality integrated with the text tokens. The resulting model in TAMO is a multimodal framework consisting of a hypergraph neural network as the global table encoder seamlessly integrated with the mainstream LLM. Empirical results on various benchmarking datasets, including HiTab, WikiTQ, WikiSQL, FeTaQA, and StructQA, have demonstrated significant improvements on generalization with an average relative gain of 42.65%.
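To see what "table as a modality" could mean structurally, here is one conventional way to turn a table into a hypergraph, with cells as nodes and each row and column as a hyperedge. The construction is our illustration of the kind of input a hypergraph encoder consumes, not TAMO's exact design.

```python
# Table -> hypergraph: cells are nodes; rows and columns are hyperedges.
def table_to_hypergraph(table):
    """table: list of rows (lists of cell strings) -> (nodes, hyperedges)."""
    n_rows, n_cols = len(table), len(table[0])
    nodes = [(r, c, table[r][c]) for r in range(n_rows) for c in range(n_cols)]
    node_id = {(r, c): i for i, (r, c, _) in enumerate(nodes)}
    row_edges = [[node_id[(r, c)] for c in range(n_cols)] for r in range(n_rows)]
    col_edges = [[node_id[(r, c)] for r in range(n_rows)] for c in range(n_cols)]
    return nodes, row_edges + col_edges
```

Unlike serialization, this representation keeps "same row" and "same column" as first-class relations the encoder can attend over.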
[157] Proverbs or Pythian Oracles? Sentiments and Emotions in Greek Sayings
Katerina Korre, John Pavlopoulos
Main category: cs.CL
TL;DR: The paper analyzes Greek proverbs using NLP to create a multi-label emotion annotation framework, scale to local varieties, and map emotional distribution across Greece, finding that proverbs exhibit multidimensional emotional complexity that LLMs can capture.
Details
Motivation: Proverbs represent fascinating cross-cultural language phenomena, but much global proverb wisdom remains underexplored due to oral traditions. The authors aim to leverage NLP advances to analyze Greek proverbs' sentiment and emotion, addressing the gap in understanding these cultural expressions.
Method: 1) Developed a multi-label annotation framework and dataset for Greek proverbs capturing emotional variability; 2) Scaled analysis to local varieties; 3) Created a map of Greece showing emotional distribution; 4) Used LLMs to capture the multidimensional complexity of proverb interpretation.
Result: Proverb interpretation is multidimensional, shown through both multi-labeling and instance-level polarity. LLMs can effectively capture and reproduce this complexity. The emotional map of Greece reveals that surprise and anger compete and coexist within Greek proverbs.
Conclusion: LLMs can help better understand the proverbial landscape of a place, as demonstrated with Greece. The multidimensional nature of proverb interpretation requires sophisticated annotation frameworks, and NLP tools can effectively analyze cultural wisdom preserved in oral traditions.
Abstract: Proverbs are among the most fascinating language phenomena that transcend cultural and linguistic boundaries. Yet, much of the global landscape of proverbs remains underexplored, as many cultures preserve their traditional wisdom within their own communities due to the oral tradition of the phenomenon. Taking advantage of the current advances in Natural Language Processing (NLP), we focus on Greek proverbs, analyzing their sentiment and emotion. Starting from an annotated dataset of Greek proverbs, (1) we propose a multi-label annotation framework and dataset that capture the emotional variability of the proverbs, (2) we scale up to local varieties, and (3) we sketch a map of Greece that provides an overview of the distribution of emotions. Our findings show that the interpretation of proverbs is multidimensional, a property manifested through both multi-labeling and instance-level polarity. LLMs can capture and reproduce this complexity, and can therefore help us better understand the proverbial landscape of a place, as in the case of Greece, where surprise and anger compete and coexist within proverbs.
[158] FinDeepResearch: Evaluating Deep Research Agents in Rigorous Financial Analysis
Fengbin Zhu, Xiang Yao Ng, Ziyang Liu, Chang Liu, Xianwei Zeng, Chao Wang, Tianhui Tan, Xuan Yao, Pengyang Shao, Min Xu, Zixuan Wang, Jing Wang, Xin Lin, Junfeng Li, Jingxian Zhu, Yang Zhang, Wenjie Wang, Fuli Feng, Richang Hong, Huanbo Luan, Ke-Wei Huang, Tat-Seng Chua
Main category: cs.CL
TL;DR: Researchers propose HisRubric, a hierarchical evaluation framework for assessing Deep Research agents’ capabilities in corporate financial analysis, and create FinDeepResearch benchmark with 64 companies across 8 markets and 4 languages.
Details
Motivation: Existing literature lacks rigorous and systematic evaluation of Deep Research agents' capabilities in critical research analysis, particularly in corporate financial analysis where professional workflows need to be mirrored.
Method: Propose HisRubric framework with hierarchical analytical structure and fine-grained grading rubric that follows professional analyst workflow (data recognition → metric calculation → strategic summarization/interpretation). Build FinDeepResearch benchmark with 64 companies from 8 financial markets across 4 languages (15,808 grading items). Evaluate 16 methods: 6 DR agents, 5 LLMs with deep reasoning+search, and 5 LLMs with reasoning only.
Result: Extensive experiments reveal strengths and limitations of different approaches across diverse capabilities, financial markets, and languages. The benchmark and evaluation code are publicly available at OpenFinArena.com.
Conclusion: The study provides a systematic evaluation framework and benchmark for assessing DR agents in financial analysis, offering valuable insights for future research and development in this emerging field.
Abstract: Deep Research (DR) agents, powered by advanced Large Language Models (LLMs), have recently garnered increasing attention for their capability in conducting complex research tasks. However, existing literature lacks a rigorous and systematic evaluation of DR agents’ capabilities in critical research analysis. To address this gap, we first propose HisRubric, a novel evaluation framework with a hierarchical analytical structure and a fine-grained grading rubric for rigorously assessing DR agents’ capabilities in corporate financial analysis. This framework mirrors the professional analyst’s workflow, progressing from data recognition to metric calculation, and finally to strategic summarization and interpretation. Built on this framework, we construct a FinDeepResearch benchmark that comprises 64 listed companies from 8 financial markets across 4 languages, encompassing a total of 15,808 grading items. We further conduct extensive experiments on the FinDeepResearch using 16 representative methods, including 6 DR agents, 5 LLMs equipped with both deep reasoning and search capabilities, and 5 LLMs with deep reasoning capabilities only. The results reveal the strengths and limitations of these approaches across diverse capabilities, financial markets, and languages, offering valuable insights for future research and development. The benchmark and evaluation code are publicly available at https://OpenFinArena.com/.
[159] SWAA: Sliding Window Attention Adaptation for Efficient Long-Context LLMs Without Pretraining
Yijiong Yu, Jiale Liu, Qingyun Wu, Huazheng Wang, Ji Pei
Main category: cs.CL
TL;DR: SWAA is a plug-and-play toolkit that adapts full-attention LLMs to sliding window attention for efficient long-context inference without costly retraining, achieving 30-100% speedups with acceptable quality loss.
Details
Motivation: Self-attention's quadratic complexity makes long-context inference prohibitively expensive in Transformers. Sliding window attention offers linear complexity but causes catastrophic performance collapse when naively applied to models pretrained with full attention due to training-inference mismatch.
Method: SWAA combines five strategies: (1) applying SWA only during prefilling, (2) preserving “sink” tokens, (3) interleaving FA/SWA layers, (4) chain-of-thought reasoning, and (5) fine-tuning. The toolkit systematically explores synergistic combinations of these methods.
Result: While individual methods are insufficient, specific synergistic combinations effectively recover original long-context capabilities. The approach achieves 30% to 100% speedups for long-context LLM inference with acceptable quality loss. Recommended configurations are identified for diverse scenarios.
Conclusion: SWAA provides a practical solution for adapting existing full-attention LLMs to efficient sliding window attention without costly pretraining, enabling significant inference speedups for long-context applications while maintaining acceptable performance.
Abstract: The quadratic complexity of self-attention in Transformer-based Large Language Models (LLMs) renders long-context inference prohibitively expensive. While Sliding Window Attention (SWA), the simplest sparse attention pattern, offers a linear-complexity alternative, naively applying it to models pretrained with Full Attention (FA) causes catastrophic long-context performance collapse due to the training-inference mismatch. To address this, we propose Sliding Window Attention Adaptation (SWAA), a plug-and-play toolkit of recipes that adapt FA models to SWA without costly pretraining. SWAA systematically combines five strategies: (1) applying SWA only during prefilling; (2) preserving “sink” tokens; (3) interleaving FA/SWA layers; (4) chain-of-thought (CoT); and (5) fine-tuning. Our experiments demonstrate that while individual methods are insufficient, specific synergistic combinations can effectively recover original long-context capabilities. After further analyzing performance-efficiency trade-offs, we identify recommended SWAA configurations for diverse scenarios, which achieve 30% to 100% speedups for long-context LLM inference with acceptable quality loss. Our code is available at https://github.com/yuyijiong/sliding-window-attention-adaptation
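Two of the five recipes, sink preservation and windowed attention, amount to a particular attention mask. A minimal sketch follows; the window size and sink count are arbitrary placeholder values, not the paper's settings.

```python
# Boolean attention mask: causal, sliding-window, with globally visible sinks.
import torch

def swa_with_sinks_mask(seq_len: int, window: int = 1024, n_sinks: int = 4) -> torch.Tensor:
    """Returns (seq_len, seq_len); True = key position j is visible to query i."""
    i = torch.arange(seq_len).unsqueeze(1)   # query positions
    j = torch.arange(seq_len).unsqueeze(0)   # key positions
    causal = j <= i
    in_window = (i - j) < window
    is_sink = j < n_sinks                    # first few tokens stay visible to everyone
    return causal & (in_window | is_sink)
```

Applying such a mask only during prefilling (recipe 1) or only in alternating layers (recipe 3) is then a scheduling decision on top of the same primitive.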
[160] Qomhra: A Bilingual Irish and English Large Language Model
Joseph McInerney, Khanh-Tung Tran, Liam Lonergan, Ailbhe Ní Chasaide, Neasa Ní Chiaráin, Barry Devereux
Main category: cs.CL
TL;DR: Qomhrá is a bilingual Irish-English LLM developed under low-resource constraints, featuring novel methods for synthesizing human preference data and showing significant improvements over existing Irish LLM baselines.
Details
Motivation: LLM research has focused on major languages, leaving low-resource languages like Irish underrepresented. There's a lack of scalable methods to create human preference data for such languages.
Method: Developed a complete pipeline: bilingual continued pre-training, instruction tuning, and novel synthesis of human preference data by prompting LLMs to generate “accepted” and “rejected” responses. Used Gemini-2.5-Pro (best-performing for Irish) to translate English instruction datasets and create the first Irish-language human preference dataset.
Result: Qomhrá shows gains of up to 29% in Irish and 44% in English over the existing open-source Irish LLM baseline (UCCIX). Gemini-2.5-Pro was ranked highest for Irish generation by L1 and L2 Irish speakers, diverging from LLM-as-a-judge ratings and revealing a misalignment between current LLMs and the Irish-language community.
Conclusion: The framework provides insights and guidance for developing LLMs for Irish and other low-resource languages, demonstrating effective methods for overcoming resource constraints and creating human preference data.
Abstract: Large language model (LLM) research and development has overwhelmingly focused on the world’s major languages, leading to under-representation of low-resource languages such as Irish. This paper introduces Qomhrá, a bilingual Irish and English LLM, developed under extremely low-resource constraints. A complete pipeline is outlined spanning bilingual continued pre-training, instruction tuning, and the synthesis of human preference data for future alignment training. We focus on the lack of scalable methods to create human preference data by proposing a novel method to synthesise such data by prompting an LLM to generate “accepted” and “rejected” responses, which we validate as aligning with L1 Irish speakers. To select an LLM for synthesis, we evaluate the top closed-weight LLMs for Irish language generation performance. Gemini-2.5-Pro is ranked highest by L1 and L2 Irish-speakers, diverging from LLM-as-a-judge ratings, indicating a misalignment between current LLMs and the Irish-language community. Subsequently, we leverage Gemini-2.5-Pro to translate a large scale English-language instruction tuning dataset to Irish and to synthesise a first-of-its-kind Irish-language human preference dataset. We comprehensively evaluate Qomhrá across several benchmarks, testing translation, gender understanding, topic identification, and world knowledge; these evaluations show gains of up to 29% in Irish and 44% in English compared to the existing open-source Irish LLM baseline, UCCIX. The results of our framework provide insight and guidance to developing LLMs for both Irish and other low-resource languages.
[161] Understanding New-Knowledge-Induced Factual Hallucinations in LLMs: Analysis and Interpretation
Renfei Dang, Peng Hu, Zhejian Lai, Changjiang Gao, Min Zhang, Shujian Huang
Main category: cs.CL
TL;DR: Fine-tuning LLMs on new knowledge causes factual hallucinations that spread across tasks, driven by unfamiliarity within specific knowledge types rather than overall new knowledge proportion.
Details
Motivation: Prior research shows fine-tuning on new knowledge induces factual hallucinations in LLMs, but specific manifestations and underlying mechanisms remain poorly understood. The authors aim to systematically analyze how hallucinations occur and spread during knowledge updates.
Method: Created controlled dataset Biography-Reasoning for fine-grained analysis across multiple knowledge types and two task types (knowledge QA and reasoning). Used interpretability analysis to examine attention patterns and conducted experiments with varying knowledge familiarity levels during fine-tuning.
Result: Hallucinations severely affect tasks with new knowledge and propagate to other evaluation tasks. When a knowledge type consists entirely of new knowledge, LLMs show elevated hallucination tendencies. Learning new knowledge weakens attention to key entities, causing over-reliance on context. Reintroducing known knowledge later in training restores attention and mitigates hallucinations. Disrupted attention patterns propagate across lexically similar contexts.
Conclusion: The degree of unfamiliarity within specific knowledge types drives hallucinations more than overall new knowledge proportion. Attention mechanism disruptions underlie hallucination behavior, but can be mitigated by strategic reintroduction of known knowledge during training.
Abstract: Prior works have shown that fine-tuning on new knowledge can induce factual hallucinations in large language models (LLMs), leading to incorrect outputs when evaluated on previously known information. However, the specific manifestations of such hallucination and its underlying mechanisms remain insufficiently understood. Our work addresses this gap by designing a controlled dataset, Biography-Reasoning, and conducting a fine-grained analysis across multiple knowledge types and two task types, including knowledge question answering (QA) and knowledge reasoning tasks. We find that hallucinations not only severely affect tasks involving newly introduced knowledge, but also propagate to other evaluation tasks. Moreover, when fine-tuning on a dataset in which a specific knowledge type consists entirely of new knowledge, LLMs exhibit elevated hallucination tendencies. This suggests that the degree of unfamiliarity within a particular knowledge type, rather than the overall proportion of new knowledge, is a stronger driver of hallucinations. Through interpretability analysis, we show that learning new knowledge weakens the model’s attention to key entities in the input question, leading to an over-reliance on surrounding context and a higher risk of hallucination. Conversely, reintroducing a small amount of known knowledge during the later stages of training restores attention to key entities and substantially mitigates hallucination behavior. Finally, we demonstrate that disrupted attention patterns can propagate across lexically similar contexts, facilitating the spread of hallucinations beyond the original task.
[162] Investigating CoT Monitorability in Large Reasoning Models
Shu Yang, Junchao Wu, Xilin Gong, Xuansheng Wu, Derek Wong, Ninghao Liu, Di Wang
Main category: cs.CL
TL;DR: The paper investigates CoT monitorability - using chain-of-thought reasoning traces to monitor potential model misbehavior in Large Reasoning Models, addressing challenges of truthful verbalization and reliable detection.
Details
Motivation: While LRMs' detailed reasoning traces offer new opportunities for AI safety monitoring, two key challenges exist: models may not truthfully represent internal decision-making in CoT, and monitors may be unreliable or easily deceived by elaborate reasoning.
Method: Systematic investigation structured around two perspectives: verbalization (faithfulness of true factors in CoT) and monitor reliability. Empirical evidence and correlation analyses across mathematical, scientific, and ethical domains. Study of CoT intervention effects, and proposal of MoME paradigm for LLMs to monitor other models through CoT with structured judgments.
Result: The paper provides empirical evidence and correlation analyses between verbalization quality, monitor reliability, and LLM performance across multiple domains. It investigates how different CoT intervention methods affect monitoring effectiveness.
Conclusion: The study presents the first systematic investigation of CoT monitorability challenges and potential, proposing MoME as a new paradigm for AI safety monitoring through chain-of-thought analysis.
Abstract: Large Reasoning Models (LRMs) have demonstrated remarkable performance on complex tasks by engaging in extended reasoning before producing final answers. Beyond improving abilities, these detailed reasoning traces also create a new opportunity for AI safety, CoT Monitorability: monitoring potential model misbehavior, such as the use of shortcuts or sycophancy, through their chain-of-thought (CoT) during decision-making. However, two fundamental challenges arise when attempting to build more effective monitors through CoT analysis. First, as prior research on CoT faithfulness has pointed out, models do not always truthfully represent their internal decision-making in the generated reasoning. Second, monitors themselves may be either overly sensitive or insufficiently sensitive, and can potentially be deceived by models’ long, elaborate reasoning traces. In this paper, we present the first systematic investigation of the challenges and potential of CoT monitorability. Motivated by the two fundamental challenges above, we structure our study around two central perspectives: (i) verbalization: to what extent do LRMs faithfully verbalize the true factors guiding their decisions in the CoT, and (ii) monitor reliability: to what extent can misbehavior be reliably detected by a CoT-based monitor? Specifically, we provide empirical evidence and correlation analyses between verbalization quality, monitor reliability, and LLM performance across mathematical, scientific, and ethical domains. Then we further investigate how different CoT intervention methods, designed to improve reasoning efficiency or performance, will affect monitoring effectiveness. Finally, we propose MoME, a new paradigm in which LLMs monitor other models’ misbehavior through their CoT and provide structured judgments along with supporting evidence.
[163] When in Doubt, Consult: Expert Debate for Sexism Detection via Confidence-Based Routing
Anwar Alajmi, Gabriele Pergola
Main category: cs.CL
TL;DR: A two-stage framework combining targeted training procedures and reasoning-based inference to detect subtle online sexism, addressing label noise, class imbalance, and ambiguous cases.
Details
Motivation: Online sexism is becoming more subtle and context-dependent, evading traditional detection methods. Challenges include inconsistent annotations due to overlapping linguistic, psychological, legal, and cultural dimensions, label scarcity, class imbalance, and unstable decision boundaries that cause models to miss underrepresented forms of harm.
Method: Two-stage framework: (1) Targeted training with class-balanced focal loss, class-aware batching, and post-hoc threshold calibration to mitigate label imbalance and noisy supervision. (2) Dynamic routing mechanism that distinguishes unambiguous cases from complex ones, with a Collaborative Expert Judgment (CEJ) module that prompts multiple personas and consolidates their reasoning through a judge model.
Result: Outperforms existing approaches across public benchmarks: +4.48% F1 gain on EDOS Task A, +1.30% on EDOS Task B, and +2.79% improvement in ICM on EXIST 2025 Task 1.1.
Conclusion: The proposed framework effectively addresses challenges in detecting subtle online sexism by combining regularization techniques for noisy, imbalanced data with reasoning-based inference for ambiguous cases, achieving state-of-the-art performance on benchmark datasets.
Abstract: Online sexism increasingly appears in subtle, context-dependent forms that evade traditional detection methods. Its interpretation often depends on overlapping linguistic, psychological, legal, and cultural dimensions, which produce mixed and sometimes contradictory signals in annotated datasets. These inconsistencies, combined with label scarcity and class imbalance, result in unstable decision boundaries and cause fine-tuned models to overlook subtler, underrepresented forms of harm. To address these challenges, we propose a two-stage framework that unifies (i) targeted training procedures to better regularize supervision on scarce and noisy data with (ii) selective, reasoning-based inference to handle ambiguous or borderline cases. First, we stabilize training by combining class-balanced focal loss, class-aware batching, and post-hoc threshold calibration, strategies adapted for the first time to this domain to mitigate label imbalance and noisy supervision. Second, we bridge the gap between efficiency and reasoning with a dynamic routing mechanism that distinguishes between unambiguous instances and complex cases requiring a deliberative process. This reasoning process results in the novel Collaborative Expert Judgment (CEJ) module, which prompts multiple personas and consolidates their reasoning through a judge model. Our approach outperforms existing approaches across several public benchmarks, with F1 gains of +4.48% and +1.30% on EDOS Tasks A and B, respectively, and a +2.79% improvement in ICM on EXIST 2025 Task 1.1.
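Of the training-stabilization ingredients, class-balanced focal loss is the most self-contained. A standard binary formulation is sketched below; the alpha and gamma values are illustrative defaults, not the paper's tuned settings.

```python
# Class-balanced focal loss for binary classification (standard formulation).
import torch
import torch.nn.functional as F

def class_balanced_focal_loss(logits, targets, alpha: float = 0.75, gamma: float = 2.0):
    """logits, targets: shape (batch,); targets in {0, 1}. alpha up-weights the
    minority (sexist) class; gamma down-weights easy, well-classified examples."""
    t = targets.float()
    bce = F.binary_cross_entropy_with_logits(logits, t, reduction="none")
    p_t = torch.exp(-bce)                        # model's probability for the true class
    alpha_t = alpha * t + (1 - alpha) * (1 - t)  # class-balancing weight
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()
```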
[164] Authors Should Label Their Own Documents
Marcus Ma, Cole Johnson, Nolan Bridges, Jackson Trager, Georgios Chochlakis, Shrikanth Narayanan
Main category: cs.CL
TL;DR: Author labeling is a new annotation technique where document creators label their own data in real-time, achieving 537% better click-through rates than industry baselines and outperforming traditional third-party annotation methods.
Details
Motivation: Third-party annotation is insufficient for capturing egocentric information like sentiment and belief, as these subjective aspects can only be approximated by external annotators rather than accurately captured from the author's perspective.
Method: Collaborated with a commercial chatbot (20,000+ users) to deploy an author labeling system that identifies task-relevant queries, generates real-time labeling questions, and records authors’ answers. Used online-learning model architecture for product recommendation trained to minimize prediction error on author-labeled subjective belief questions.
Result: Model achieved 537% improvement in click-through rate compared to industry advertising baseline. Author labeling was found to be higher quality, faster to acquire, and cheaper than three traditional annotation approaches for sentiment analysis.
Conclusion: Author labeling produces significantly higher quality annotations for egocentric and subjective beliefs compared to third-party annotation, and the authors have released an author labeling service (https://academic.echogroup.ai) to facilitate broader scientific adoption.
Abstract: Third-party annotation is the status quo for labeling text, but egocentric information such as sentiment and belief can at best only be approximated by a third-person proxy. We introduce author labeling, an annotation technique where the writer of the document itself annotates the data at the moment of creation. We collaborate with a commercial chatbot with over 20,000 users to deploy an author labeling annotation system. This system identifies task-relevant queries, generates on-the-fly labeling questions, and records authors’ answers in real time. We train and deploy an online-learning model architecture for product recommendation with author-labeled data to improve performance. We train our model to minimize the prediction error on questions generated for a set of predetermined subjective beliefs using author-labeled responses. Our model achieves a 537% improvement in click-through rate compared to an industry advertising baseline running concurrently. We then compare the quality and practicality of author labeling to three traditional annotation approaches for sentiment analysis and find author labeling to be higher quality, faster to acquire, and cheaper. These findings reinforce existing literature that annotations, especially for egocentric and subjective beliefs, are significantly higher quality when labeled by the author rather than a third party. To facilitate broader scientific adoption, we release an author labeling service for the research community at https://academic.echogroup.ai.
[165] MoE-DiffuSeq: Enhancing Long-Document Diffusion Models with Sparse Attention and Mixture of Experts
Alexandros Christoforos, Chadbourne Davis
Main category: cs.CL
TL;DR: MoE-DiffuSeq: A diffusion-based framework for efficient long-form text generation using sparse attention and Mixture-of-Experts architecture to reduce computational costs while maintaining quality.
Details
Motivation: Existing sequence diffusion models face prohibitive computational and memory costs when scaling to long documents due to dense attention and slow iterative reconstruction, limiting their practical application for long-form text generation.
Method: Combines expert routing with tailored sparse attention mechanism to reduce attention complexity, plus introduces a soft absorbing state within the diffusion process to reshape attention dynamics during denoising for faster reconstruction and token refinement.
Result: Outperforms prior diffusion-based and sparse-attention baselines in training efficiency, inference speed, and generation quality on long-document benchmarks, particularly effective for scientific document generation, code synthesis, and extended dialogue modeling.
Conclusion: Establishes a scalable and expressive solution for diffusion-based long-form text generation by addressing computational bottlenecks through MoE architecture and sparse attention mechanisms.
Abstract: We propose MoE-DiffuSeq, a diffusion-based framework for efficient long-form text generation that integrates sparse attention with a Mixture-of-Experts (MoE) architecture. Existing sequence diffusion models suffer from prohibitive computational and memory costs when scaling to long documents, largely due to dense attention and slow iterative reconstruction. MoE-DiffuSeq addresses these limitations by combining expert routing with a tailored sparse attention mechanism, substantially reducing attention complexity while preserving global coherence and textual fidelity. In addition, we introduce a soft absorbing state within the diffusion process that reshapes attention dynamics during denoising, enabling faster sequence reconstruction and more precise token refinement. This design accelerates both training and sampling without sacrificing generation quality. Extensive experiments on long-document benchmarks demonstrate that MoE-DiffuSeq consistently outperforms prior diffusion-based and sparse-attention baselines in training efficiency, inference speed, and generation quality. Our approach is particularly effective for long-context applications such as scientific document generation, large-scale code synthesis, and extended dialogue modeling, establishing a scalable and expressive solution for diffusion-based long-form text generation.
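The abstract does not specify the routing scheme, so the sketch below shows a generic top-1 token router, the standard MoE building block that "expert routing" usually refers to; treat it as background, not as MoE-DiffuSeq's actual layer.

```python
# Standard top-1 Mixture-of-Experts feed-forward layer (generic, illustrative).
import torch
import torch.nn as nn

class Top1MoE(nn.Module):
    def __init__(self, d_model: int, n_experts: int):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts))

    def forward(self, x):                      # x: (n_tokens, d_model)
        probs = self.gate(x).softmax(dim=-1)   # routing distribution per token
        top_p, top_i = probs.max(dim=-1)       # each token picks one expert
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_i == e
            if mask.any():                     # only active experts compute
                out[mask] = top_p[mask, None] * expert(x[mask])
        return out
```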
[166] Does Memory Need Graphs? A Unified Framework and Empirical Analysis for Long-Term Dialog Memory
Sen Hu, Yuxiang Wei, Jiaxin Ran, Zhiyuan Yao, Xueran Han, Huacan Wang, Ronghao Chen, Lei Zou
Main category: cs.CL
TL;DR: Experimental analysis shows many dialog memory performance differences come from foundational system settings rather than architectural innovations, identifying reliable baselines for future research.
Details
Motivation: Graph structures are increasingly used in dialog memory systems but empirical findings on their effectiveness remain inconsistent, making it unclear which design choices truly matter.
Method: Introduce a unified framework decomposing dialog memory systems into core components supporting both graph-based and non-graph approaches. Conduct controlled, stage-wise experiments on LongMemEval and HaluMem, comparing design choices in memory representation, organization, maintenance, and retrieval.
Result: Results show many performance differences are driven by foundational system settings rather than specific architectural innovations.
Conclusion: Based on findings, identify stable and reliable strong baselines for future dialog memory research.
Abstract: Graph structures are increasingly used in dialog memory systems, but empirical findings on their effectiveness remain inconsistent, making it unclear which design choices truly matter. We present an experimental, system-oriented analysis of long-term dialog memory architectures. We introduce a unified framework that decomposes dialog memory systems into core components and supports both graph-based and non-graph approaches. Under this framework, we conduct controlled, stage-wise experiments on LongMemEval and HaluMem, comparing common design choices in memory representation, organization, maintenance, and retrieval. Our results show that many performance differences are driven by foundational system settings rather than specific architectural innovations. Based on these findings, we identify stable and reliable strong baselines for future dialog memory research.
[167] Talk Less, Verify More: Improving LLM Assistants with Semantic Checks and Execution Feedback
Yan Sun, Ming Cai, Stanley Kok
Main category: cs.CL
TL;DR: The paper introduces Q* and Feedback+ verification techniques for LLM assistants in enterprise workflows to reduce errors and improve reliability in conversational business analytics systems.
Details
Motivation: Current conversational business analytics systems lack built-in verification mechanisms, forcing users to manually validate potentially flawed results from LLM assistants, which is inefficient and error-prone for enterprise decision support.
Method: Two complementary verification techniques: 1) Q* performs reverse translation and semantic matching between generated code and user intent, and 2) Feedback+ incorporates execution feedback to guide code refinement. These are embedded within a generator-discriminator framework.
Result: Evaluations on Spider, Bird, and GSM8K benchmark datasets show that both Q* and Feedback+ reduce error rates and task completion time. Reverse translation is identified as a key bottleneck.
Conclusion: The work contributes a design-oriented framework for building more reliable, enterprise-grade GenAI systems capable of trustworthy decision support, with identified opportunities for future improvement in reverse translation.
Abstract: As large language model (LLM) assistants become increasingly integrated into enterprise workflows, their ability to generate accurate, semantically aligned, and executable outputs is critical. However, current conversational business analytics (CBA) systems often lack built-in verification mechanisms, leaving users to manually validate potentially flawed results. This paper introduces two complementary verification techniques: Q*, which performs reverse translation and semantic matching between code and user intent, and Feedback+, which incorporates execution feedback to guide code refinement. Embedded within a generator-discriminator framework, these mechanisms shift validation responsibilities from users to the system. Evaluations on three benchmark datasets, Spider, Bird, and GSM8K, demonstrate that both Q* and Feedback+ reduce error rates and task completion time. The study also identifies reverse translation as a key bottleneck, highlighting opportunities for future improvement. Overall, this work contributes a design-oriented framework for building more reliable, enterprise-grade GenAI systems capable of trustworthy decision support.
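As a rough illustration of the Q* idea (reverse translation plus semantic matching), the sketch below translates generated code back into natural language and compares it to the user's intent via embedding similarity. The embedding model, threshold, and llm_describe callable are all assumptions, not the paper's configuration:

```python
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def reverse_translation_check(user_intent: str,
                              generated_sql: str,
                              llm_describe,          # callable: SQL -> NL description
                              threshold: float = 0.8) -> bool:
    """Q*-style semantic check (our sketch): translate the generated SQL back
    into natural language and verify it matches the user's stated intent."""
    description = llm_describe(generated_sql)        # e.g. "This query counts ..."
    vecs = embedder.encode([user_intent, description], convert_to_tensor=True)
    similarity = util.cos_sim(vecs[0], vecs[1]).item()
    return similarity >= threshold                   # below threshold -> regenerate
```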
[168] Can LLMs Track Their Output Length? A Dynamic Feedback Mechanism for Precise Length Regulation
Meiman Xiao, Ante Wang, Qingguo Hu, Zhongjian Miao, Huangjun Shen, Longyue Wang, Weihua Luo, Jinsong Su
Main category: cs.CL
TL;DR: LLMs struggle with precise length control in text generation. The paper proposes a training-free approach using dynamic length feedback during generation to improve adherence to token, word, or sentence count targets.
Details
Motivation: Real-world applications often require precise control over generated text length, but current LLMs perform poorly on this task despite advances in following other human instructions. The authors identify that LLMs fail to accurately measure their own response lengths, leading to poor adherence to length constraints.
Method: A novel length regulation approach that incorporates dynamic length feedback during generation, enabling adaptive adjustments to meet target lengths. The method is training-free and can be further enhanced with supervised fine-tuning for broader generalization.
Result: Experiments on summarization and biography tasks show significant improvement in precision for achieving target token, word, or sentence counts without compromising quality. The approach also demonstrates effective generalization to broader text-generation tasks when combined with supervised fine-tuning.
Conclusion: The proposed dynamic length feedback approach effectively addresses LLMs’ limitations in length control, offering a practical solution for real-world applications requiring precise text length constraints while maintaining generation quality.
Abstract: Precisely controlling the length of generated text is a common requirement in real-world applications. However, despite significant advancements in following human instructions, Large Language Models (LLMs) still struggle with this task. In this work, we demonstrate that LLMs often fail to accurately measure their response lengths, leading to poor adherence to length constraints. To address this issue, we propose a novel length regulation approach that incorporates dynamic length feedback during generation, enabling adaptive adjustments to meet target lengths. Experiments on summarization and biography tasks show our training-free approach significantly improves precision in achieving target token, word, or sentence counts without compromising quality. Additionally, we demonstrate that further supervised fine-tuning allows our method to generalize effectively to broader text-generation tasks.
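A hedged sketch of what dynamic length feedback might look like in practice: generate in chunks and re-inject the remaining word budget into the context, rather than trusting the model to track its own output length. The llm_generate callable and the feedback format are hypothetical:

```python
def generate_with_length_feedback(llm_generate, prompt: str,
                                  target_words: int, chunk_words: int = 30) -> str:
    """Training-free length regulation (our sketch): after each chunk, the
    remaining budget is injected back into the context so the model can
    adjust instead of estimating its own output length."""
    text = ""
    while True:
        written = len(text.split())
        remaining = target_words - written
        if remaining <= 0:
            return text
        feedback = f"[{written} words written, {remaining} words remaining]"
        chunk = llm_generate(prompt + "\n" + text + "\n" + feedback,
                             max_words=min(chunk_words, remaining))
        if not chunk.strip():          # model chose to stop early
            return text
        text = (text + " " + chunk).strip()
```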
[169] Empirical Comparison of Encoder-Based Language Models and Feature-Based Supervised Machine Learning Approaches to Automated Scoring of Long Essays
Kuo Wang, Haowei Hua, Pengfei Yan, Hong Jiao, Dan Song
Main category: cs.CL
TL;DR: Ensemble model combining multiple encoder-based language model embeddings with gradient boosting outperforms individual models for automated scoring of long essays.
Details
Motivation: Long context poses challenges for encoder-only language models in automated essay scoring, especially for long essays that exceed typical token limits.
Method: Trained BERT-based models (BERT, RoBERTa, DistilBERT, DeBERTa), ensemble models integrating embeddings from multiple encoders, and feature-based supervised ML models (GBDT, XGBoost, LightGBM) on 17,307 essays with 80/10/10 split, evaluated using Quadratic Weighted Kappa.
Result: Ensemble-of-embeddings model combining multiple pre-trained language model representations with gradient-boosting classifier significantly outperforms individual language models for scoring long essays.
Conclusion: For automated scoring of long essays, ensemble approaches that combine multiple encoder representations with gradient boosting are more effective than individual encoder-based language models.
Abstract: Long context may impose challenges for encoder-only language models in text processing, specifically for automated scoring of essays. This study trained several commonly used encoder-based language models for automated scoring of long essays. The performance of these trained models was evaluated and compared with ensemble models built upon the base language models with a token limit of 512. The models examined include BERT-based models (BERT, RoBERTa, DistilBERT, and DeBERTa), ensemble models integrating embeddings from multiple encoder models, and ensemble models of feature-based supervised machine learning models, including Gradient-Boosted Decision Trees, eXtreme Gradient Boosting, and Light Gradient Boosting Machine. We trained, validated, and tested each model on a dataset of 17,307 essays, with an 80%/10%/10% split, and evaluated model performance using Quadratic Weighted Kappa. This study revealed that an ensemble-of-embeddings model, combining multiple pre-trained language model representations with a gradient-boosting classifier, significantly outperforms individual language models at scoring long essays.
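The winning recipe (concatenated encoder embeddings fed to a gradient-boosting classifier, scored with QWK) is straightforward to reproduce in outline. A sketch under assumed checkpoint choices, not the authors' exact configuration:

```python
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel
from lightgbm import LGBMClassifier
from sklearn.metrics import cohen_kappa_score

# Assumed encoder checkpoints; the paper does not list its exact choices here.
ENCODERS = ["bert-base-uncased", "roberta-base", "microsoft/deberta-base"]

@torch.no_grad()
def embed(texts: list[str], name: str) -> np.ndarray:
    """[CLS] embedding from one encoder, truncated at the 512-token limit."""
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name).eval()
    batch = tok(texts, padding=True, truncation=True, max_length=512,
                return_tensors="pt")
    return model(**batch).last_hidden_state[:, 0].numpy()

def ensemble_features(texts: list[str]) -> np.ndarray:
    # Concatenate representations from several pre-trained encoders.
    return np.concatenate([embed(texts, n) for n in ENCODERS], axis=1)

def train_and_evaluate(train_texts, train_scores, test_texts, test_scores) -> float:
    """Fit a gradient-boosting classifier on concatenated embeddings and
    report Quadratic Weighted Kappa on the held-out essays."""
    clf = LGBMClassifier()
    clf.fit(ensemble_features(train_texts), train_scores)
    pred = clf.predict(ensemble_features(test_texts))
    return cohen_kappa_score(test_scores, pred, weights="quadratic")
```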
[170] The performances of the Chinese and U.S. Large Language Models on the Topic of Chinese Culture
Feiyan Liu, Siyan Zhao, Chenxun Zhuo, Tianming Liu, Bao Ge
Main category: cs.CL
TL;DR: Chinese LLMs outperform US models on Chinese cultural understanding tasks, with differences likely due to training data distribution and localization strategies.
Details
Motivation: To examine whether LLMs from Chinese and US developers exhibit cultural differences in Chinese-language settings, particularly in understanding Chinese culture, given that top LLM developers are concentrated in these two countries.
Method: Direct-questioning paradigm evaluating models (GPT-5.1, DeepSeek-V3.2, Qwen3-Max, Gemini 2.5 Pro) on Chinese cultural understanding including history, literature, poetry, and related domains.
Result: Chinese models generally outperform US models on Chinese cultural tasks. Among US models, Gemini 2.5 Pro and GPT-5.1 achieve relatively higher accuracy.
Conclusion: Performance differences likely stem from variations in training data distribution, localization strategies, and emphasis on Chinese cultural content during model development.
Abstract: Cultural backgrounds shape individuals’ perspectives and approaches to problem-solving. Since the emergence of GPT-1 in 2018, large language models (LLMs) have undergone rapid development. To date, the world’s ten leading LLM developers are primarily based in China and the United States. To examine whether LLMs released by Chinese and U.S. developers exhibit cultural differences in Chinese-language settings, we evaluate their performance on questions about Chinese culture. This study adopts a direct-questioning paradigm to evaluate models such as GPT-5.1, DeepSeek-V3.2, Qwen3-Max, and Gemini 2.5 Pro. We assess their understanding of traditional Chinese culture, including history, literature, poetry, and related domains. Comparative analyses between LLMs developed in China and the U.S. indicate that Chinese models generally outperform their U.S. counterparts on these tasks. Among U.S.-developed models, Gemini 2.5 Pro and GPT-5.1 achieve relatively higher accuracy. The observed performance differences may potentially arise from variations in training data distribution, localization strategies, and the degree of emphasis on Chinese cultural content during model development.
[171] Stable-RAG: Mitigating Retrieval-Permutation-Induced Hallucinations in Retrieval-Augmented Generation
Qianchi Zhang, Hainan Zhang, Liang Pang, Hongwei Zheng, Zhiming Zheng
Main category: cs.CL
TL;DR: Stable-RAG addresses LLMs’ sensitivity to document order in RAG by using permutation sensitivity estimation to identify and correct hallucinations, improving accuracy and consistency.
Details
Motivation: Current RAG systems show significant sensitivity to the order of retrieved documents, causing varying model outputs even when the gold document is present. Existing robust RAG methods focus on low-quality retrieval and positional bias but don't address this permutation sensitivity problem.
Method: Stable-RAG runs the generator under multiple retrieval orders, clusters hidden states to identify dominant reasoning patterns, decodes from cluster-center representations, and uses these results to align hallucinated outputs toward correct answers for consistency.
Result: Experiments on three QA datasets show significant improvements in answer accuracy, reasoning consistency, and robust generalization across datasets, retrievers, and input lengths compared to baselines.
Conclusion: Permutation sensitivity is a critical but underexplored issue in RAG systems, and Stable-RAG effectively addresses it by leveraging permutation sensitivity estimation to produce more consistent and accurate outputs.
Abstract: Retrieval-Augmented Generation (RAG) has become a key paradigm for reducing factual hallucinations in large language models (LLMs), yet little is known about how the order of retrieved documents affects model behavior. We empirically show that under Top-5 retrieval with the gold document included, LLM answers vary substantially across permutations of the retrieved set, even when the gold document is fixed in the first position. This reveals a previously underexplored sensitivity to retrieval permutations. Although robust RAG methods primarily focus on enhancing LLM robustness to low-quality retrieval and mitigating positional bias to distribute attention fairly over long contexts, neither approach directly addresses permutation sensitivity. In this paper, we propose Stable-RAG, which exploits permutation sensitivity estimation to mitigate permutation-induced hallucinations. Stable-RAG runs the generator under multiple retrieval orders, clusters hidden states, and decodes from a cluster-center representation that captures the dominant reasoning pattern. It then uses these reasoning results to align hallucinated outputs toward the correct answer, encouraging the model to produce consistent and accurate predictions across document permutations. Experiments on three QA datasets show that Stable-RAG significantly improves answer accuracy, reasoning consistency and robust generalization across datasets, retrievers, and input lengths compared with baselines.
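A compressed sketch of the permutation-clustering idea: run the generator under several document orders, cluster the resulting hidden states, and keep the order closest to the dominant cluster's center. The hidden_state_fn hook and cluster count are assumptions; the alignment step that follows in Stable-RAG is omitted:

```python
import numpy as np
from itertools import permutations
from sklearn.cluster import KMeans

def dominant_order(docs: list[str], hidden_state_fn, n_orders: int = 8,
                   n_clusters: int = 3) -> list[str]:
    """Run the generator under several retrieval orders, cluster the hidden
    states, and return the order closest to the dominant cluster's center.
    hidden_state_fn is an assumed hook mapping a document order to the
    generator's final hidden-state vector."""
    orders = list(permutations(docs))[:n_orders]       # a subset of permutations
    states = np.stack([hidden_state_fn(list(o)) for o in orders])  # (n, d)
    km = KMeans(n_clusters=min(n_clusters, len(orders)), n_init=10).fit(states)
    dominant = np.bincount(km.labels_).argmax()        # largest cluster
    members = np.where(km.labels_ == dominant)[0]
    center = km.cluster_centers_[dominant]
    best = members[np.argmin(np.linalg.norm(states[members] - center, axis=1))]
    return list(orders[best])
```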
[172] MedDialogRubrics: A Comprehensive Benchmark and Evaluation Framework for Multi-turn Medical Consultations in Large Language Models
Lecheng Gong, Weimin Fang, Ting Yang, Dongjie Tao, Chunxiao Guo, Peng Wei, Bo Xie, Jinqun Guan, Zixiao Chen, Fang Shi, Jinjie Gu, Junwei Liu
Main category: cs.CL
TL;DR: MedDialogRubrics is a new benchmark with 5,200 synthetic patient cases and 60,000+ expert-refined rubrics to evaluate medical LLMs’ diagnostic reasoning, using multi-agent simulation and evidence-based rubric generation.
Details
Motivation: Existing benchmarks for medical conversational AI lack rigorous evaluation of information-gathering and diagnostic reasoning abilities, creating gaps in assessing medical LLMs' multi-turn diagnostic capabilities.
Method: Uses a multi-agent system to synthesize realistic patient cases without real EHR data; a Patient Agent with atomic medical facts and hallucination correction; and a structured rubric-generation pipeline with EBM guidelines and rejection sampling for “must-ask” items.
Result: Comprehensive evaluation shows current models face substantial challenges across multiple assessment dimensions, indicating that improving medical dialogue requires advances in dialogue management architectures beyond base-model tuning.
Conclusion: MedDialogRubrics provides a rigorous benchmark for medical conversational AI evaluation, revealing that current models struggle with diagnostic reasoning and highlighting the need for architectural improvements in dialogue management systems.
Abstract: Medical conversational AI plays a pivotal role in the development of safer and more effective medical dialogue systems. However, existing benchmarks and evaluation frameworks do not rigorously assess the information-gathering and diagnostic reasoning abilities of medical large language models (LLMs). To address these gaps, we present MedDialogRubrics, a novel benchmark comprising 5,200 synthetically constructed patient cases and over 60,000 fine-grained evaluation rubrics generated by LLMs and subsequently refined by clinical experts, specifically designed to assess the multi-turn diagnostic capabilities of LLMs. Our framework employs a multi-agent system to synthesize realistic patient records and chief complaints from underlying disease knowledge without accessing real-world electronic health records, thereby mitigating privacy and data-governance concerns. We design a robust Patient Agent that is limited to a set of atomic medical facts and augmented with a dynamic guidance mechanism that continuously detects and corrects hallucinations throughout the dialogue, ensuring internal coherence and clinical plausibility of the simulated cases. Furthermore, we propose a structured LLM-based and expert-annotated rubric-generation pipeline that retrieves Evidence-Based Medicine (EBM) guidelines and uses rejection sampling to derive a prioritized set of rubric items (“must-ask” items) for each case. We perform a comprehensive evaluation of state-of-the-art models and demonstrate that, across multiple assessment dimensions, current models face substantial challenges. Our results indicate that improving medical dialogue will require advances in dialogue management architectures, not just incremental tuning of the base model.
[173] WebAnchor: Anchoring Agent Planning to Stabilize Long-Horizon Web Reasoning
Xinmiao Yu, Liwen Zhang, Xiaocheng Feng, Yong Jiang, Bing Qin, Pengjun Xie, Jingren Zhou
Main category: cs.CL
TL;DR: Anchor-GRPO: A two-stage RL framework that addresses the “plan anchor” problem in LLM-based web agents by decoupling planning and execution, improving long-horizon web reasoning tasks.
Details
Motivation: Current RL methods for LLM-based web agents struggle with long-horizon planning due to the "plan anchor" phenomenon where the first reasoning step disproportionately impacts downstream behavior, and existing RL algorithms fail to account for this by uniformly distributing rewards across trajectories.
Method: Anchor-GRPO is a two-stage RL framework: Stage 1 optimizes first-step planning using fine-grained rubrics from self-play experiences and human calibration; Stage 2 aligns execution with the initial plan through sparse rewards to ensure stable and efficient tool usage.
Result: Anchor-GRPO outperforms baseline GRPO and First-step GRPO across four benchmarks (BrowseComp, BrowseComp-Zh, GAIA, XBench-DeepSearch) for models from 3B to 30B parameters, improving task success and tool efficiency. WebAnchor-30B achieves 46.0% pass@1 on BrowseComp and 76.4% on GAIA.
Conclusion: The proposed Anchor-GRPO framework effectively addresses the plan anchor problem in long-horizon web reasoning, demonstrating strong scalability with improved accuracy as model size and context length increase, making it a promising approach for optimizing LLM-based web agents.
Abstract: Large Language Model (LLM)-based agents have shown strong capabilities in web information seeking, with reinforcement learning (RL) becoming a key optimization paradigm. However, planning remains a bottleneck, as existing methods struggle with long-horizon strategies. Our analysis reveals a critical phenomenon, the plan anchor, where the first reasoning step disproportionately impacts downstream behavior in long-horizon web reasoning tasks. Current RL algorithms fail to account for this, uniformly distributing rewards across the trajectory. To address this, we propose Anchor-GRPO, a two-stage RL framework that decouples planning and execution. In Stage 1, the agent optimizes its first-step planning using fine-grained rubrics derived from self-play experiences and human calibration. In Stage 2, execution is aligned with the initial plan through sparse rewards, ensuring stable and efficient tool usage. We evaluate Anchor-GRPO on four benchmarks: BrowseComp, BrowseComp-Zh, GAIA, and XBench-DeepSearch. Across models from 3B to 30B, Anchor-GRPO outperforms baseline GRPO and First-step GRPO, improving task success and tool efficiency. Notably, WebAnchor-30B achieves 46.0% pass@1 on BrowseComp and 76.4% on GAIA. Anchor-GRPO also demonstrates strong scalability, achieving higher accuracy as model size and context length increase.
cs.CV
[174] HyperCLOVA X 32B Think
NAVER Cloud HyperCLOVA X Team
Main category: cs.CV
TL;DR: HyperCLOVA X 32B Think is a Korean-focused vision-language model with strong reasoning and agentic capabilities, achieving top performance on Korean benchmarks and agent tasks.
Details
Motivation: To develop a vision-language model specifically designed for Korean linguistic and cultural context with emphasis on reasoning and agentic abilities, addressing the need for localized AI models.
Method: Two-stage approach: pre-training with strong focus on reasoning capabilities, followed by post-training for multimodal understanding, enhanced reasoning, agentic behaviors, and human preference alignment.
Result: The model achieves strong performance on Korean text-to-text and vision-to-text benchmarks, as well as on agent-oriented evaluation tasks compared to similarly sized models.
Conclusion: By open-sourcing HyperCLOVA X 32B Think, the authors aim to support broader adoption and facilitate further research and innovation in both academic and industrial communities.
Abstract: In this report, we present HyperCLOVA X 32B Think, a vision-language model designed with particular emphasis on reasoning within the Korean linguistic and cultural context, as well as agentic ability. HyperCLOVA X 32B Think is pre-trained with a strong focus on reasoning capabilities and subsequently post-trained to support multimodal understanding, enhanced reasoning, agentic behaviors, and alignment with human preferences. Experimental evaluations against comparably sized models demonstrate that our model achieves strong performance on Korean text-to-text and vision-to-text benchmarks, as well as on agent-oriented evaluation tasks. By open-sourcing HyperCLOVA X 32B Think, we aim to support broader adoption and facilitate further research and innovation across both academic and industrial communities.
[175] Klear: Unified Multi-Task Audio-Video Joint Generation
Jun Wang, Chunyu Qiang, Yuxin Guo, Yiran Wang, Xijuan Zeng, Chen Zhang, Pengfei Wan
Main category: cs.CV
TL;DR: Klear introduces a unified audio-video generation model with architectural innovations, progressive training strategies, and a novel large-scale dataset to solve synchronization and quality issues in multimodal generation.
Details
Motivation: Existing audio-video generation approaches suffer from audio-visual asynchrony, poor lip-speech alignment, unimodal degradation, weak correspondence modeling, limited generalization, and scarcity of high-quality dense-caption data.
Method: Three-axis approach: 1) Single-tower architecture with unified DiT blocks and Omni-Full Attention for tight alignment; 2) Progressive multitask training with random modality masking and multistage curriculum; 3) Novel automated pipeline for creating large-scale audio-video dataset with dense captions.
Result: Klear achieves high-fidelity, semantically/temporally aligned generation in both joint and unimodal settings, generalizes robustly to out-of-distribution scenarios, substantially outperforms prior methods, and achieves performance comparable to Veo 3.
Conclusion: Klear offers a unified, scalable path toward next-generation audio-video synthesis by addressing core challenges through architectural, training, and data innovations.
Abstract: Audio-video joint generation has progressed rapidly, yet substantial challenges remain. Non-commercial approaches still suffer from audio-visual asynchrony, poor lip-speech alignment, and unimodal degradation, which stem from weak audio-visual correspondence modeling, limited generalization, and scarce high-quality dense-caption data. To address these issues, we introduce Klear and delve into three axes: model architecture, training strategy, and data curation. Architecturally, we adopt a single-tower design with unified DiT blocks and an Omni-Full Attention mechanism, achieving tight audio-visual alignment and strong scalability. Training-wise, we adopt a progressive multitask regime, moving from random modality masking to joint optimization across tasks, together with a multistage curriculum, yielding robust representations, strengthening A-V aligned world knowledge, and preventing unimodal collapse. For datasets, we present the first large-scale audio-video dataset with dense captions, and introduce a novel automated data-construction pipeline which annotates and filters millions of diverse, high-quality, strictly aligned audio-video-caption triplets. Building on this, Klear scales to large datasets, delivering high-fidelity, semantically and temporally aligned, instruction-following generation in both joint and unimodal settings while generalizing robustly to out-of-distribution scenarios. Across tasks, it outperforms prior methods by a large margin and achieves performance comparable to Veo 3, offering a unified, scalable path toward next-generation audio-video synthesis.
[176] CageDroneRF: A Large-Scale RF Benchmark and Toolkit for Drone Perception
Mohammad Rostami, Atik Faysal, Hongtao Xia, Hadi Kasasbeh, Ziang Gao, Huaxia Wang
Main category: cs.CV
TL;DR: CDRF is a large-scale RF drone detection benchmark combining real-world captures with systematic synthetic data generation to address dataset scarcity and diversity limitations.
Details
Motivation: Existing RF datasets for drone detection suffer from scarcity and limited diversity, hindering development of robust RF perception models.
Method: Combines extensive real-world recordings with principled synthetic augmentation pipeline controlling SNR, injecting interfering emitters, applying frequency shifts with label-consistent bounding-box transformations.
Result: CDRF spans a wide range of contemporary drone models and acquisition conditions, drawn from real campus data and a controlled RF-cage facility, with interoperable open-source tools for data generation, preprocessing, augmentation, and evaluation.
Conclusion: CDRF enables standardized benchmarking for classification, open-set recognition, and object detection, accelerating progress toward robust, generalizable RF perception models.
Abstract: We present CageDroneRF (CDRF), a large-scale benchmark for Radio-Frequency (RF) drone detection and identification built from real-world captures and systematically generated synthetic variants. CDRF addresses the scarcity and limited diversity of existing RF datasets by coupling extensive raw recordings with a principled augmentation pipeline that (i) precisely controls Signal-to-Noise Ratio (SNR), (ii) injects interfering emitters, and (iii) applies frequency shifts with label-consistent bounding-box transformations for detection. This dataset spans a wide range of contemporary drone models, many unavailable in current public datasets, and acquisition conditions, derived from data collected at the Rowan University campus and within a controlled RF-cage facility. CDRF is released with interoperable open-source tools for data generation, preprocessing, augmentation, and evaluation that also operate on existing public benchmarks. CDRF enables standardized benchmarking for classification, open-set recognition, and object detection, supporting rigorous comparisons and reproducible pipelines. By releasing this comprehensive benchmark and tooling, CDRF aims to accelerate progress toward robust, generalizable RF perception models.
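Two of the three augmentation steps, SNR control and frequency shifting, follow standard DSP and can be sketched directly on complex IQ samples. This mirrors the described pipeline rather than the released toolkit:

```python
import numpy as np

def set_snr(iq: np.ndarray, snr_db: float) -> np.ndarray:
    """Add complex white Gaussian noise so the output hits a target SNR (dB)."""
    sig_power = np.mean(np.abs(iq) ** 2)
    noise_power = sig_power / (10 ** (snr_db / 10))
    noise = np.sqrt(noise_power / 2) * (
        np.random.randn(len(iq)) + 1j * np.random.randn(len(iq)))
    return iq + noise

def frequency_shift(iq: np.ndarray, shift_hz: float, fs: float) -> np.ndarray:
    """Shift the signal in frequency by mixing with a complex exponential.
    A matching shift must also be applied to any spectrogram bounding boxes
    to keep detection labels consistent."""
    t = np.arange(len(iq)) / fs
    return iq * np.exp(2j * np.pi * shift_hz * t)
```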
[177] Data relativistic uncertainty framework for low-illumination anime scenery image enhancement
Yiquan Gao, John See
Main category: cs.CV
TL;DR: This paper introduces a Data Relativistic Uncertainty (DRU) framework for enhancing low-light anime scenery images, addressing the domain gap between natural images and anime by leveraging illumination uncertainty information.
Details
Motivation: The paper addresses the domain gap in low-light enhancement between natural images/videos and anime scenery images, which has been underexplored. There's a need for specialized enhancement methods for anime content that consider its unique characteristics and illumination conditions.
Method: The authors first construct an unpaired anime scenery dataset with diverse environments and illumination conditions. They then propose a Data Relativistic Uncertainty (DRU) framework inspired by Relativistic GAN, which defines and quantifies illumination uncertainty using a wave-particle duality analogy. This uncertainty information dynamically adjusts objective functions to recalibrate model learning under data uncertainty.
Result: Extensive experiments show the DRU framework effectively enhances EnlightenGAN models, yielding superior perceptual and aesthetic qualities compared to state-of-the-art methods that don’t learn from a data uncertainty perspective.
Conclusion: The DRU framework provides a novel data-centric learning paradigm for low-light enhancement in anime scenery images, with potential applications in other visual and language domains. The approach successfully bridges the domain gap by leveraging illumination uncertainty information.
Abstract: In contrast to the prevailing work on low-light enhancement in natural images and videos, this study addresses low-illumination quality degradation in anime scenery images to bridge the domain gap. For such an underexplored enhancement task, we first curate images from various sources and construct an unpaired anime scenery dataset with diverse environments and illumination conditions to address the data scarcity. To exploit the power of the uncertainty information inherent in the diverse illumination conditions, we propose a Data Relativistic Uncertainty (DRU) framework, motivated by the idea of Relativistic GAN. By analogy with the wave-particle duality of light, our framework interpretably defines and quantifies the illumination uncertainty of dark/bright samples, which is leveraged to dynamically adjust the objective functions to recalibrate model learning under data uncertainty. Extensive experiments demonstrate the effectiveness of the DRU framework by training several versions of EnlightenGAN, yielding superior perceptual and aesthetic qualities beyond state-of-the-art methods that cannot learn from a data-uncertainty perspective. We hope our framework can expose a novel paradigm of data-centric learning for potential visual and language domains. Code is available.
[178] Mass Concept Erasure in Diffusion Models with Concept Hierarchy
Jiahang Tu, Ye Li, Yiming Wu, Hanbin Zhao, Chao Zhang, Hui Qian
Main category: cs.CV
TL;DR: Proposes SuPLoRA: a hierarchical concept erasure method for diffusion models that groups related concepts under supertypes and uses parameter-efficient fine-tuning to erase concepts while preserving generation quality.
Details
Motivation: Existing concept erasure methods become inefficient and ineffective as the number of erased concepts grows, requiring separate parameters for each concept and degrading overall generation quality.
Method: Organizes erased concepts into a supertype-subtype hierarchy, groups semantically similar concepts, and uses Supertype-Preserving Low-Rank Adaptation (SuPLoRA) which encodes supertype information in frozen down-projection matrices while updating only up-projection matrices during erasure.
Result: Theoretical analysis demonstrates effectiveness in mitigating generation degradation. A challenging benchmark is constructed for simultaneous erasure across diverse domains including celebrities, objects, and pornographic content.
Conclusion: SuPLoRA provides an effective and efficient group-wise suppression method for concept erasure in diffusion models that preserves generation quality while handling multiple related concepts through hierarchical organization.
Abstract: The success of diffusion models has raised concerns about the generation of unsafe or harmful content, prompting concept erasure approaches that fine-tune modules to suppress specific concepts while preserving general generative capabilities. However, as the number of erased concepts grows, these methods often become inefficient and ineffective, since each concept requires a separate set of fine-tuned parameters and may degrade the overall generation quality. In this work, we propose a supertype-subtype concept hierarchy that organizes erased concepts into a parent-child structure. Each erased concept is treated as a child node, and semantically related concepts (e.g., macaw, and bald eagle) are grouped under a shared parent node, referred to as a supertype concept (e.g., bird). Rather than erasing concepts individually, we introduce an effective and efficient group-wise suppression method, where semantically similar concepts are grouped and erased jointly by sharing a single set of learnable parameters. During the erasure phase, standard diffusion regularization is applied to preserve the denoising process in unmasked regions. To mitigate the degradation of supertype generation caused by excessive erasure of semantically related subtypes, we propose a novel method called Supertype-Preserving Low-Rank Adaptation (SuPLoRA), which encodes the supertype concept information in the frozen down-projection matrix and updates only the up-projection matrix during erasure. Theoretical analysis demonstrates the effectiveness of SuPLoRA in mitigating generation performance degradation. We construct a more challenging benchmark that requires simultaneous erasure of concepts across diverse domains, including celebrities, objects, and pornographic content.
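The key parameter split, a frozen down-projection carrying the supertype and a trainable up-projection doing the erasing, can be sketched as a LoRA variant. How supertype_A would be derived (e.g., from supertype concept embeddings) is left abstract here and is our assumption:

```python
import torch
import torch.nn as nn

class SupertypeLoRA(nn.Module):
    """SuPLoRA-style adapter (our sketch): the down-projection A encodes the
    shared supertype and stays frozen; only the up-projection B is trained
    when erasing a group of semantically related subtype concepts."""
    def __init__(self, base: nn.Linear, supertype_A: torch.Tensor):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                   # frozen pretrained weight
        # supertype_A: (rank, in_features), assumed precomputed and frozen
        self.A = nn.Parameter(supertype_A, requires_grad=False)
        rank = supertype_A.shape[0]
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # trainable

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Low-rank update on top of the frozen base projection.
        return self.base(x) + x @ self.A.T @ self.B.T
```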
[179] VLM4VLA: Revisiting Vision-Language-Models in Vision-Language-Action Models
Jianke Zhang, Xiaoyu Chen, Qiuyue Wang, Mingsheng Li, Yanjiang Guo, Yucheng Hu, Jiajun Zhang, Shuai Bai, Junyang Lin, Jianyu Chen
Main category: cs.CV
TL;DR: VLM4VLA pipeline shows VLM choice doesn’t predict downstream VLA performance; vision module is the bottleneck, not language; embodied skill improvements don’t guarantee better control.
Details
Motivation: To systematically study how VLM choice and capabilities translate to downstream Vision-Language-Action policy performance, challenging common assumptions about VLM competence for embodied control.
Method: VLM4VLA - minimal adaptation pipeline converting general-purpose VLMs into VLA policies using small learnable parameters; extensive empirical studies across three benchmarks; fine-tuning on seven auxiliary embodied tasks; modality-level ablations.
Result: VLM initialization helps but general VLM capabilities poorly predict downstream performance; improving specific embodied skills doesn’t guarantee better control; vision module is primary bottleneck; injecting control-relevant supervision into vision encoder yields consistent gains even when frozen.
Conclusion: Standard VLM competence is necessary but insufficient for embodied control; persistent domain gap exists between VLM pretraining objectives and embodied action-planning requirements; vision encoder improvements are crucial for better VLA policies.
Abstract: Vision-Language-Action (VLA) models, which integrate pretrained large Vision-Language Models (VLM) into their policy backbone, are gaining significant attention for their promising generalization capabilities. This paper revisits a fundamental yet seldom systematically studied question: how do VLM choice and competence translate to downstream VLA policy performance? We introduce VLM4VLA, a minimal adaptation pipeline that converts general-purpose VLMs into VLA policies using only a small set of new learnable parameters for fair and efficient comparison. Despite its simplicity, VLM4VLA proves surprisingly competitive with more sophisticated network designs. Through extensive empirical studies on various downstream tasks across three benchmarks, we find that while VLM initialization offers a consistent benefit over training from scratch, a VLM’s general capabilities are poor predictors of its downstream task performance. This challenges common assumptions, indicating that standard VLM competence is necessary but insufficient for effective embodied control. We further investigate the impact of specific embodied capabilities by fine-tuning VLMs on seven auxiliary embodied tasks (e.g., embodied QA, visual pointing, depth estimation). Contrary to intuition, improving a VLM’s performance on specific embodied skills does not guarantee better downstream control performance. Finally, modality-level ablations identify the visual module in VLM, rather than the language component, as the primary performance bottleneck. We demonstrate that injecting control-relevant supervision into the vision encoder of the VLM yields consistent gains, even when the encoder remains frozen during downstream fine-tuning. This isolates a persistent domain gap between current VLM pretraining objectives and the requirements of embodied action-planning.
[180] Deep Learning-Based Image Recognition for Soft-Shell Shrimp Classification
Yun-Hao Zhang, I-Hsien Ting, Dario Liberona, Yun-Hsiu Liu, Kazunori Minetaki
Main category: cs.CV
TL;DR: Deep learning-based image recognition system for automated classification of white shrimp post-harvest to improve freshness and reduce head-body separation issues.
Details
Motivation: Consumer demand for high-quality aquatic products is increasing, with freshness and appearance integrity being key concerns. Shrimp freshness declines rapidly post-harvest, and soft-shell shrimp often suffer from head-body separation after cooking/freezing, affecting product appearance and consumer perception.
Method: Uses convolutional neural network (CNN) model for automated classification of white shrimp immediately after harvest, replacing manual sorting.
Result: Enhances classification accuracy, efficiency, and consistency while reducing processing time, helping maintain freshness and enabling shrimp transportation businesses to better meet customer demands.
Conclusion: Deep learning-based image recognition provides an effective automated solution for shrimp classification that addresses freshness and quality issues in aquaculture processing.
Abstract: With the integration of information technology into aquaculture, production has become more stable and continues to grow annually. As consumer demand for high-quality aquatic products rises, freshness and appearance integrity are key concerns. In shrimp-based processed foods, freshness declines rapidly post-harvest, and soft-shell shrimp often suffer from head-body separation after cooking or freezing, affecting product appearance and consumer perception. To address these issues, this study leverages deep learning-based image recognition for automated classification of white shrimp immediately after harvest. A convolutional neural network (CNN) model replaces manual sorting, enhancing classification accuracy, efficiency, and consistency. By reducing processing time, this technology helps maintain freshness and ensures that shrimp transportation businesses meet customer demands more effectively.
[181] Higher order PCA-like rotation-invariant features for detailed shape descriptors modulo rotation
Jarek Duda
Main category: cs.CV
TL;DR: The paper proposes extending PCA’s covariance matrix approach to higher-order tensors for more accurate rotation-invariant shape descriptors, enabling applications in molecular shape analysis, object recognition, and shape similarity comparison without rotation optimization.
Details
Motivation: PCA's covariance matrix only approximates shapes as ellipsoids, which is insufficient for complex real-world shapes. There's a need for more accurate rotation-invariant shape descriptors that can capture higher-order geometric information.
Method: Extends PCA's covariance matrix (order-2 tensor) to higher-order tensors (order-3 and above) describing central moments. Uses polynomial times Gaussian approach to create decodable shape descriptors with arbitrarily high accuracy, with analogous rotation invariants.
Result: Develops a framework for creating rotation-invariant shape descriptors that can capture complex shapes more accurately than PCA’s ellipsoid approximation, enabling shape analysis modulo rotation.
Conclusion: Higher-order tensor extensions of PCA provide powerful rotation-invariant shape descriptors for complex real-world shapes, with practical applications in molecular analysis, object recognition, and efficient shape similarity comparison without rotation optimization.
Abstract: PCA can be used for rotation-invariant features, describing a shape with its covariance matrix $p_{ab}=E[(x_a-E[x_a])(x_b-E[x_b])]$, which approximates the shape by an ellipsoid and allows rotation invariants such as the traces of its powers. However, real shapes are usually much more complicated, hence we propose its extension to, e.g., the order-3 tensor $p_{abc}=E[(x_a-E[x_a])(x_b-E[x_b])(x_c-E[x_c])]$ or higher-order tensors describing central moments, or a polynomial-times-Gaussian representation allowing decodable shape descriptors of arbitrarily high accuracy, together with their analogous rotation invariants. Practical applications include rotation-invariant features describing shape modulo rotation, e.g., for molecular shape descriptors, for up-to-rotation object recognition in 2D images/3D scans, or as a shape similarity metric allowing inexpensive comparison (modulo rotation) without costly optimization over rotations.
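The order-2 and order-3 central moments and the trace-based invariants can be computed in a few lines of numpy; a sketch for 3D point clouds:

```python
import numpy as np

def central_moments(points: np.ndarray):
    """Order-2 and order-3 central moment tensors of a point cloud (n, 3)."""
    x = points - points.mean(axis=0)
    p2 = np.einsum("na,nb->ab", x, x) / len(x)         # covariance, as in PCA
    p3 = np.einsum("na,nb,nc->abc", x, x, x) / len(x)  # order-3 extension
    return p2, p3

def rotation_invariants(p2: np.ndarray, max_power: int = 3) -> list[float]:
    """Traces of matrix powers tr(p2^k) are rotation invariants, since a
    rotation R maps p2 -> R p2 R^T and the trace is similarity-invariant."""
    out, m = [], np.eye(p2.shape[0])
    for _ in range(max_power):
        m = m @ p2
        out.append(float(np.trace(m)))
    return out
```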
[182] MMErroR: A Benchmark for Erroneous Reasoning in Vision-Language Models
Yang Shi, Yifeng Xie, Minzhe Guo, Liangsi Lu, Mingxuan Huang, Jingchao Wang, Zhihong Zhu, Boyan Xu, Zhiqi Huang
Main category: cs.CV
TL;DR: MMErroR is a multi-modal benchmark with 2,013 samples containing single reasoning errors across 24 subdomains, designed to evaluate VLMs’ ability to detect incorrect reasoning and classify error types, revealing even top models struggle with this task.
Details
Motivation: To determine if Vision-Language Models truly understand content by testing their ability to detect when reasoning processes are wrong and identify error types, moving beyond simple answer correctness to process-level evaluation.
Method: Created MMErroR benchmark with 2,013 samples embedding single coherent reasoning errors, spanning 24 subdomains across six top-level domains, requiring models to both detect incorrect reasoning and classify error types in visual and linguistic contexts.
Result: Evaluation of 20 advanced VLMs shows even the best model (Gemini-3.0-Pro) only classifies errors correctly in 66.47% of cases, highlighting the significant challenge of identifying erroneous reasoning in multi-modal contexts.
Conclusion: VLMs still struggle with error detection and classification despite advances, and error identification capability provides valuable insights into multi-modal reasoning model capabilities, suggesting need for improved reasoning evaluation methods.
Abstract: Recent advances in Vision-Language Models (VLMs) have improved performance in multi-modal learning, raising the question of whether these models truly understand the content they process. Crucially, can VLMs detect when a reasoning process is wrong and identify its error type? To answer this, we present MMErroR, a multi-modal benchmark of 2,013 samples, each embedding a single coherent reasoning error. These samples span 24 subdomains across six top-level domains, ensuring broad coverage and taxonomic richness. Unlike existing benchmarks that focus on answer correctness, MMErroR targets a process-level, error-centric evaluation that requires models to detect incorrect reasoning and classify the error type within both visual and linguistic contexts. We evaluate 20 advanced VLMs; even the best model (Gemini-3.0-Pro) classifies the error correctly in only 66.47% of cases, underscoring the challenge of identifying erroneous reasoning. Furthermore, the ability to accurately identify errors offers valuable insights into the capabilities of multi-modal reasoning models. Project Page: https://mmerror-benchmark.github.io
[183] Towards Real-world Lens Active Alignment with Unlabeled Data via Domain Adaptation
Wenyong Li, Qi Jiang, Weijian Hu, Kailun Yang, Zhanjun Zhang, Wenjun Tian, Kaiwei Wang, Jian Bai
Main category: cs.CV
TL;DR: DA3 uses domain adaptation to bridge simulation-real gap in optical alignment, achieving 46% accuracy improvement over simulation-only methods with minimal real-world data.
Details
Motivation: Active Alignment is crucial for automated optical assembly, but simulation-trained models suffer from domain gap when applied to real-world images, limiting generalization.
Method: Proposes Domain Adaptive Active Alignment (DA3) with autoregressive domain transformation generator and adversarial feature alignment strategy for self-supervised learning to extract domain-invariant degradation features.
Result: DA3 improves accuracy by 46% over simulation-only pipeline, approaches performance of precisely labeled real-world data (3 lens samples) while reducing on-device data collection time by 98.7%.
Conclusion: Domain adaptation effectively enables simulation-trained models to perform robustly in real-world settings, validating digital-twin pipeline as practical solution for enhancing large-scale optical assembly efficiency.
Abstract: Active Alignment (AA) is a key technology for the large-scale automated assembly of high-precision optical systems. Compared with labor-intensive per-model on-device calibration, a digital-twin pipeline built on optical simulation offers a substantial advantage in generating large-scale labeled data. However, complex imaging conditions induce a domain gap between simulation and real-world images, limiting the generalization of simulation-trained models. To address this, we propose augmenting a simulation baseline with minimal unlabeled real-world images captured at random misalignment positions, mitigating the gap from a domain adaptation perspective. We introduce Domain Adaptive Active Alignment (DA3), which utilizes an autoregressive domain transformation generator and an adversarial-based feature alignment strategy to distill real-world domain information via self-supervised learning. This enables the extraction of domain-invariant image degradation features to facilitate robust misalignment prediction. Experiments on two lens types reveal that DA3 improves accuracy by 46% over a purely simulation-based pipeline. Notably, it approaches the performance achieved with precisely labeled real-world data collected on 3 lens samples, while reducing on-device data collection time by 98.7%. The results demonstrate that domain adaptation effectively endows simulation-trained models with robust real-world performance, validating the digital-twin pipeline as a practical solution to significantly enhance the efficiency of large-scale optical assembly.
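The paper's exact adversarial alignment strategy is not spelled out in the abstract; the standard gradient-reversal formulation below is one plausible instantiation of training a feature extractor to fool a simulated-vs-real domain discriminator:

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass, negated gradient in the backward pass:
    the feature extractor is trained to *fool* the domain discriminator,
    pushing it toward domain-invariant degradation features."""
    @staticmethod
    def forward(ctx, x, lam: float):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None

def domain_adversarial_loss(features: torch.Tensor, domain_labels: torch.Tensor,
                            discriminator: nn.Module, lam: float = 1.0) -> torch.Tensor:
    """Binary domain classification loss; 0 = simulated, 1 = real."""
    logits = discriminator(GradReverse.apply(features, lam))
    return nn.functional.binary_cross_entropy_with_logits(
        logits.squeeze(-1), domain_labels.float())
```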
[184] RelightAnyone: A Generalized Relightable 3D Gaussian Head Model
Yingyan Xu, Pramod Rao, Sebastian Weiss, Gaspard Zoss, Markus Gross, Christian Theobalt, Marc Habermann, Derek Bradley
Main category: cs.CV
TL;DR: A two-stage method for creating relightable 3D Gaussian head avatars from single/multi-view images without requiring OLAT capture data.
Details
Motivation: Existing high-quality relighting methods require complex OLAT capture setups, limiting practical applications. Need a method that can create relightable avatars from standard captures without OLAT data.
Method: Two-stage approach: 1) Learn flat-lit 3DGS avatars from diverse multi-view datasets without OLAT, using dataset-specific lighting codes for self-supervised alignment. 2) Learn mapping from flat-lit avatars to physically-based reflectance parameters using a smaller OLAT dataset.
Result: Method generalizes well to relight any subject from stage 1 as if captured under OLAT lighting. Can fit to unseen subjects from single images, enabling novel view synthesis and relighting applications.
Conclusion: Proposed approach enables high-quality relightable 3D head avatars without requiring OLAT capture for each subject, making relightable avatar creation more practical and accessible.
Abstract: 3D Gaussian Splatting (3DGS) has become a standard approach to reconstruct and render photorealistic 3D head avatars. A major challenge is to relight the avatars to match any scene illumination. For high quality relighting, existing methods require subjects to be captured under complex time-multiplexed illumination, such as one-light-at-a-time (OLAT). We propose a new generalized relightable 3D Gaussian head model that can relight any subject observed in a single- or multi-view images without requiring OLAT data for that subject. Our core idea is to learn a mapping from flat-lit 3DGS avatars to corresponding relightable Gaussian parameters for that avatar. Our model consists of two stages: a first stage that models flat-lit 3DGS avatars without OLAT lighting, and a second stage that learns the mapping to physically-based reflectance parameters for high-quality relighting. This two-stage design allows us to train the first stage across diverse existing multi-view datasets without OLAT lighting ensuring cross-subject generalization, where we learn a dataset-specific lighting code for self-supervised lighting alignment. Subsequently, the second stage can be trained on a significantly smaller dataset of subjects captured under OLAT illumination. Together, this allows our method to generalize well and relight any subject from the first stage as if we had captured them under OLAT lighting. Furthermore, we can fit our model to unseen subjects from as little as a single image, allowing several applications in novel view synthesis and relighting for digital avatars.
[185] Padé Neurons for Efficient Neural Models
Onur Keleş, A. Murat Tekalp
Main category: cs.CV
TL;DR: Padé neurons (Paons) are a novel non-linear neuron model inspired by Padé approximants that offer stronger non-linearity, diversity, and layer efficiency compared to traditional McCulloch-Pitts neurons.
Details
Motivation: Traditional neural networks use linear McCulloch-Pitts neurons with point-wise non-linear activations, which have limited non-linearity. Existing non-linear neuron models (quadratic, generalized operational, etc.) offer improvements but the authors seek a more powerful and flexible non-linear neuron model that can encompass previous approaches while providing better performance with fewer layers.
Method: Introduce Padé neurons (Paons) inspired by Padé approximants, which learn different non-linear functions of inputs. Paons can replace any neuron model in any network. The method is validated by replacing classic neurons in ResNet-based models for image super-resolution, compression, and classification tasks.
Result: Experimental results show that neural models built with Paons provide better or equal performance than their classic counterparts with fewer layers. Paons demonstrate stronger non-linearity and layer efficiency across multiple computer vision tasks.
Conclusion: Paons represent a superior non-linear neuron model that offers diversity of non-linearity, layer efficiency, and encompasses all previously proposed neuron models as special cases, enabling performance improvements with reduced network depth.
Abstract: Neural networks commonly employ the McCulloch-Pitts neuron model, which is a linear model followed by a point-wise non-linear activation. Various researchers have already advanced inherently non-linear neuron models, such as quadratic neurons, generalized operational neurons, generative neurons, and super neurons, which offer stronger non-linearity compared to point-wise activation functions. In this paper, we introduce a novel and better non-linear neuron model called Padé neurons (Paons), inspired by Padé approximants. Paons offer several advantages, such as diversity of non-linearity, since each Paon learns a different non-linear function of its inputs, and layer efficiency, since Paons provide stronger non-linearity in much fewer layers compared to piecewise linear approximation. Furthermore, Paons include all previously proposed neuron models as special cases, thus any neuron model in any network can be replaced by Paons. We note that there has been a proposal to employ the Padé approximation as a generalized point-wise activation function, which is fundamentally different from our model. To validate the efficacy of Paons, in our experiments, we replace classic neurons in some well-known neural image super-resolution, compression, and classification models based on the ResNet architecture with Paons. Our comprehensive experimental results and analyses demonstrate that neural models built by Paons provide better or equal performance than their classic counterparts with a smaller number of layers. The PyTorch implementation code for Paon is open-sourced at https://github.com/onur-keles/Paon.
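A Padé approximant is a ratio of polynomials, so a Paon-style unit can be sketched as a learned rational function of its pre-activation. The polynomial degrees and the |.|-guarded denominator below are our assumptions; the paper's exact parameterization (see the released code) may differ:

```python
import torch
import torch.nn as nn

class PadeNeuronLayer(nn.Module):
    """Sketch of a Padé-style layer: each unit outputs a learned rational
    function P(z)/Q(z) of its pre-activation z, so every neuron learns its
    own non-linearity instead of sharing a fixed point-wise activation."""
    def __init__(self, in_features: int, out_features: int, degree: int = 2):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        # Per-neuron numerator (degree+1 coeffs) and denominator (degree coeffs).
        self.num = nn.Parameter(torch.randn(out_features, degree + 1) * 0.1)
        self.den = nn.Parameter(torch.randn(out_features, degree) * 0.1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.linear(x)                                    # (batch, out)
        powers = torch.stack([z ** k for k in range(self.num.shape[1])], dim=-1)
        p = (powers * self.num).sum(-1)                       # numerator P(z)
        q = 1.0 + (powers[..., 1:] * self.den).sum(-1).abs()  # Q(z) >= 1, no poles
        return p / q
```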
[186] Guardians of the Hair: Rescuing Soft Boundaries in Depth, Stereo, and Novel Views
Xiang Zhang, Yang Zhang, Lukas Mehl, Markus Gross, Christopher Schroers
Main category: cs.CV
TL;DR: HairGuard is a framework that recovers fine-grained soft boundary details (like hair) in 3D vision tasks by refining depth around soft boundaries and generating high-quality novel views.
Details
Motivation: Soft boundaries like thin hairs are common in natural and computer-generated imagery but challenging for 3D vision due to ambiguous mixing of foreground and background cues, leading to poor depth estimation and view synthesis quality.
Method: 1) Data curation pipeline using image matting datasets; 2) Depth fixer network with gated residual module to refine depth around soft boundaries; 3) Depth-based forward warping for view synthesis; 4) Generative scene painter for disoccluded regions; 5) Color fuser to combine results.
Result: HairGuard achieves state-of-the-art performance in monocular depth estimation, stereo image/video conversion, and novel view synthesis, with significant improvements in soft boundary regions.
Conclusion: The framework effectively handles soft boundary challenges in 3D vision through specialized depth refinement and view synthesis components, enabling plug-and-play integration with existing depth models.
Abstract: Soft boundaries, like thin hairs, are commonly observed in natural and computer-generated imagery, but they remain challenging for 3D vision due to the ambiguous mixing of foreground and background cues. This paper introduces Guardians of the Hair (HairGuard), a framework designed to recover fine-grained soft boundary details in 3D vision tasks. Specifically, we first propose a novel data curation pipeline that leverages image matting datasets for training and design a depth fixer network to automatically identify soft boundary regions. With a gated residual module, the depth fixer refines depth precisely around soft boundaries while maintaining global depth quality, allowing plug-and-play integration with state-of-the-art depth models. For view synthesis, we perform depth-based forward warping to retain high-fidelity textures, followed by a generative scene painter that fills disoccluded regions and eliminates redundant background artifacts within soft boundaries. Finally, a color fuser adaptively combines warped and inpainted results to produce novel views with consistent geometry and fine-grained details. Extensive experiments demonstrate that HairGuard achieves state-of-the-art performance across monocular depth estimation, stereo image/video conversion, and novel view synthesis, with significant improvements in soft boundary regions.
[187] RiskCueBench: Benchmarking Anticipatory Reasoning from Early Risk Cues in Video-Language Models
Sha Luo, Yogesh Prabhu, Tim Ossowski, Kaiping Chen, Junjie Hu
Main category: cs.CV
TL;DR: New benchmark RiskCueBench challenges models to identify earliest risk signals in videos, revealing current systems struggle to anticipate risky events from early visual cues.
Details
Motivation: Existing video risk assessment datasets often include the full accident sequence, making the task too easy and not reflective of real-world conditions where early warning is crucial for preventing accidents.
Method: Introduce RiskCueBench benchmark with videos carefully annotated to identify “risk signal clips” - the earliest moments indicating potential safety concerns, requiring models to anticipate future risky events from early visual signals.
Result: Experimental results show significant gap in current systems’ ability to interpret evolving situations and anticipate future risky events from early visual signals.
Conclusion: Current video risk prediction models face important challenges for practical deployment, highlighting the need for better anticipation capabilities from early visual cues.
Abstract: With the rapid growth of video-centered social media, the ability to anticipate risky events from visual data is a promising direction for ensuring public safety and preventing real-world accidents. Prior work has extensively studied supervised video risk assessment across domains such as driving, protests, and natural disasters. However, many existing datasets provide models with access to the full video sequence, including the accident itself, which substantially reduces the difficulty of the task. To better reflect real-world conditions, we introduce a new video understanding benchmark, RiskCueBench, in which videos are carefully annotated to identify a risk signal clip, defined as the earliest moment that indicates a potential safety concern. Experimental results reveal a significant gap in current systems' ability to interpret evolving situations and anticipate future risky events from early visual signals, highlighting important challenges for deploying video risk prediction models in practice.
[188] Mitigating Label Noise using Prompt-Based Hyperbolic Meta-Learning in Open-Set Domain Generalization
Kunyu Peng, Di Wen, M. Saquib Sarfraz, Yufan Chen, Junwei Zheng, David Schneider, Kailun Yang, Jiamin Wu, Alina Roitberg, Rainer Stiefelhagen
Main category: cs.CV
TL;DR: This paper introduces Open-Set Domain Generalization under Noisy Labels (OSDG-NL), a new problem combining open-set domain generalization with label noise, and proposes HyProMeta framework using hyperbolic prototypes and meta-learning to address it.
Details
Motivation: Current Open-Set Domain Generalization (OSDG) research overlooks label noise, which is common in real-world datasets and can mislead model optimization, making open-set recognition in novel domains even more challenging.
Method: Proposes HyProMeta framework that integrates hyperbolic category prototypes for label noise-aware meta-learning with a learnable new-category agnostic prompt to enhance generalization to unseen classes.
Result: Extensive experiments on newly established benchmarks (PACS and DigitsDG with added noise) show HyProMeta outperforms state-of-the-art methods in handling OSDG with noisy labels.
Conclusion: The paper addresses the important but overlooked problem of label noise in OSDG, establishes benchmarks for OSDG-NL, and demonstrates the effectiveness of the proposed HyProMeta framework through superior experimental results.
Abstract: Open-Set Domain Generalization (OSDG) is a challenging task requiring models to accurately predict familiar categories while minimizing confidence for unknown categories to effectively reject them in unseen domains. While the OSDG field has seen considerable advancements, the impact of label noise, a common issue in real-world datasets, has been largely overlooked. Label noise can mislead model optimization, thereby exacerbating the challenges of open-set recognition in novel domains. In this study, we take the first step towards addressing Open-Set Domain Generalization under Noisy Labels (OSDG-NL) by constructing dedicated benchmarks derived from widely used OSDG datasets, including PACS and DigitsDG. We evaluate baseline approaches by integrating techniques from both label denoising and OSDG methodologies, highlighting the limitations of existing strategies in handling label noise effectively. To address these limitations, we propose HyProMeta, a novel framework that integrates hyperbolic category prototypes for label noise-aware meta-learning alongside a learnable new-category agnostic prompt designed to enhance generalization to unseen classes. Our extensive experiments demonstrate the superior performance of HyProMeta compared to state-of-the-art methods across the newly established benchmarks. The source code of this work is released at https://github.com/KPeng9510/HyProMeta.
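The abstract does not spell out the prototype computation, but hyperbolic prototypes are conventionally built on the Poincare ball. Below is a minimal sketch of that distance plus one plausible noise-aware weighting; the tensor shapes, the 0.1 scaling, and the softmax weighting are illustrative assumptions, not HyProMeta's actual formulation.

```python
import torch

def poincare_distance(x, p, eps=1e-5):
    """Geodesic distance on the Poincare ball between embeddings x (N, d)
    and class prototypes p (C, d); both are assumed to have norm < 1."""
    x2 = x.pow(2).sum(-1, keepdim=True)            # (N, 1)
    p2 = p.pow(2).sum(-1)                          # (C,)
    xp = torch.cdist(x, p).pow(2)                  # (N, C) squared Euclidean
    denom = (1 - x2).clamp_min(eps) * (1 - p2).clamp_min(eps)
    return torch.acosh(1 + 2 * xp / denom)

# Hypothetical noise-aware weighting: samples far from every prototype in
# hyperbolic space are down-weighted as likely label noise.
emb = torch.randn(8, 16) * 0.1      # toy embeddings inside the ball
protos = torch.randn(5, 16) * 0.1   # one prototype per known category
d = poincare_distance(emb, protos)  # (8, 5)
weights = torch.softmax(-d.min(dim=1).values, dim=0)
```

Distances grow rapidly near the ball's boundary, which is what makes hyperbolic space attractive for separating hierarchically related categories.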
[189] A Novel Unified Approach to Deepfake Detection
Lord Sen, Shyamapada Mukherjee
Main category: cs.CV
TL;DR: Novel deepfake detection architecture using cross-attention between spatial/frequency features and blood detection module achieves SOTA results (99.80% AUC on FF++, 99.88% on Celeb-DF).
Details
Motivation: Deepfake synthesis poses significant threats to digital trust, necessitating robust detection and tagging systems to combat AI-generated misinformation.
Method: Proposes unified architecture with cross-attention mechanism between spatial and frequency domain features, combined with blood detection module for classification.
Result: Achieves 99.80% AUC on FF++ and 99.88% on Celeb-DF using Swin Transformer+BERT, and 99.55%/99.38% with EfficientNet-B4+BERT, with strong cross-dataset generalization.
Conclusion: The proposed architecture outperforms state-of-the-art methods and demonstrates excellent generalization across datasets, providing a unified solution for deepfake detection.
Abstract: Advancements in AI are increasingly giving rise to various threats, one of the most prominent being the synthesis and misuse of deepfakes. To sustain trust in this digital age, detecting and tagging deepfakes is essential. In this paper, a novel architecture for deepfake detection in images and videos is presented. The architecture uses cross-attention between spatial and frequency domain features along with a blood detection module to classify an image as real or fake. This paper aims to develop a unified architecture and provide insights into each step. Through this approach we achieve results better than SOTA: 99.80% and 99.88% AUC on FF++ and Celeb-DF, respectively, when using Swin Transformer and BERT, and 99.55% and 99.38% when using EfficientNet-B4 and BERT. The approach also generalizes well, achieving strong cross-dataset results.
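The paper reports cross-attention between spatial and frequency features without architectural details in the abstract; the sketch below shows one standard way to wire such a fusion in PyTorch. The module name, dimensions, and residual design are assumptions.

```python
import torch
import torch.nn as nn

class SpatialFrequencyCrossAttention(nn.Module):
    """Sketch: spatial-domain tokens attend to frequency-domain tokens."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, spatial_tokens, freq_tokens):
        # Query: spatial features; Key/Value: frequency features.
        fused, _ = self.attn(spatial_tokens, freq_tokens, freq_tokens)
        return self.norm(spatial_tokens + fused)   # residual fusion

spatial = torch.randn(2, 196, 256)   # e.g., 14x14 backbone patch tokens
freq = torch.randn(2, 196, 256)      # tokens from an FFT-based branch
out = SpatialFrequencyCrossAttention()(spatial, freq)
```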
[190] Better, But Not Sufficient: Testing Video ANNs Against Macaque IT Dynamics
Matteo Dunnhofer, Christian Micheloni, Kohitij Kar
Main category: cs.CV
TL;DR: Current video-based ANN models show modest improvements over static models in predicting macaque IT cortex responses to naturalistic videos, but fail to capture appearance-invariant motion computations that IT exhibits.
Details
Motivation: To understand whether primate inferior temporal (IT) cortex performs richer dynamic computations beyond simple framewise feature extraction, and to test if current ANN models can capture these temporal dynamics.
Method: Compared macaque IT responses during naturalistic videos against static, recurrent, and video-based ANN models. Applied a stress test using “appearance-free” video variants that preserve motion but remove shape/texture to test generalization.
Result: Video models provided modest improvements in neural predictivity, especially at later response stages. IT population activity generalized across appearance-free manipulation, but all ANN classes failed this stress test.
Conclusion: Current video models capture appearance-bound dynamics rather than the appearance-invariant temporal computations in IT, highlighting the need for new objectives that encode biological temporal statistics and invariances.
Abstract: Feedforward artificial neural networks (ANNs) trained on static images remain the dominant models of the primate ventral visual stream, yet they are intrinsically limited to static computations. The primate world is dynamic, and the macaque ventral visual pathway, specifically the inferior temporal (IT) cortex, not only supports object recognition but also encodes object motion velocity during naturalistic video viewing. Do IT’s temporal responses reflect nothing more than time-unfolded feedforward transformations, framewise features with shallow temporal pooling, or do they embody richer dynamic computations? We tested this by comparing macaque IT responses during naturalistic videos against static, recurrent, and video-based ANN models. Video models provided modest improvements in neural predictivity, particularly at later response stages, raising the question of what kind of dynamics they capture. To probe this, we applied a stress test: decoders trained on naturalistic videos were evaluated on “appearance-free” variants that preserve motion but remove shape and texture. IT population activity generalized across this manipulation, but all ANN classes failed. Thus, current video models better capture appearance-bound dynamics rather than the appearance-invariant temporal computations expressed in IT, underscoring the need for new objectives that encode biological temporal statistics and invariances.
[191] Eye-Q: A Multilingual Benchmark for Visual Word Puzzle Solving and Image-to-Phrase Reasoning
Ali Najar, Alireza Mirrokni, Arshia Izadyari, Sadegh Mohammadian, Amir Homayoon Sharifizade, Asal Meskin, Mobin Bagherian, Ehsaneddin Asgari
Main category: cs.CV
TL;DR: Eye-Q is a multilingual visual word puzzle benchmark that tests complex visual reasoning beyond surface-level recognition, revealing significant performance gaps in current VLMs.
Details
Motivation: Current VLMs perform well on standard benchmarks but often rely on surface-level recognition, OCR shortcuts, or simple retrieval rather than deep reasoning. There's a need for benchmarks that test complex visual understanding involving implicit cues, hypothesis generation/revision, and non-literal concept mapping.
Method: Created Eye-Q benchmark with 1,343 multilingual visual word puzzles (English, Persian, Arabic, cross-lingual). Each puzzle presents a conceptually dense scene with brief description, requiring inference of a specific target word/phrase. Puzzles are unstructured, cue-implicit with distractors, demanding selective attention, abstraction, and associative inference. Evaluated state-of-the-art VLMs using open-ended, human-aligned protocol with lightweight assistance.
Result: Substantial performance gaps revealed, especially on abstract and cross-lingual puzzles. Maximum accuracy reaches only 60.27%, highlighting limitations in models’ ability to construct and search appropriate conceptual representations for flexible image-to-phrase inference.
Conclusion: Visual word puzzles like Eye-Q expose critical limitations in current VLMs’ reasoning capabilities. The benchmark provides a challenging testbed for developing models that can perform deeper visual understanding involving hypothesis formation, revision, and non-literal concept mapping across languages.
Abstract: Vision-Language Models (VLMs) have achieved strong performance on standard vision-language benchmarks, yet often rely on surface-level recognition rather than deeper reasoning. We propose visual word puzzles as a challenging alternative, as they require discovering implicit visual cues, generating and revising hypotheses, and mapping perceptual evidence to non-literal concepts in ways that are difficult to solve via literal grounding, OCR-heavy shortcuts, or simple retrieval-style matching. We introduce Eye-Q, a multilingual benchmark designed to assess this form of complex visual understanding. Eye-Q contains 1,343 puzzles in which a model observes a conceptually dense scene with a brief description and must infer a specific target word or phrase. The puzzles are intentionally unstructured and cue-implicit, with distractors and contextual relationships that demand selective attention, abstraction, and associative inference. The benchmark spans English, Persian, Arabic, and cross-lingual puzzles. We evaluate state-of-the-art VLMs using an open-ended, human-aligned protocol that probes hypothesis formation and revision under lightweight assistance. Results reveal substantial performance gaps, especially on abstract and cross-lingual puzzles, highlighting limitations in current models’ ability to construct and search over appropriate conceptual representations for flexible image-to-phrase inference; maximum accuracy reaches only 60.27%.
[192] Real-Time In-Cabin Driver Behavior Recognition on Low-Cost Edge Hardware
Vesal Ahsani, Babak Hossein Khalaj, Hamed Shah-Mansouri
Main category: cs.CV
TL;DR: A real-time driver monitoring system using single-camera vision for 17 behavior classes, optimized for low-cost edge devices (Raspberry Pi 5 and Coral Edge TPU) with 16-25 FPS performance.
Details
Motivation: Driver monitoring systems need to detect distraction and drowsiness behaviors in real-time under strict computational, power, and cost constraints for in-cabin deployment.
Method: Three-component pipeline: (1) compact per-frame vision model, (2) confounder-aware label taxonomy to reduce confusion between visually similar behaviors, and (3) temporal decision head that triggers alerts only for confident, sustained predictions.
Result: System achieves ~16 FPS on Raspberry Pi 5 (INT8 inference, <60ms latency) and ~25 FPS on Coral Edge TPU (~40ms latency), enabling real-time monitoring on embedded hardware. Validated on 800,000+ labeled frames and live in-vehicle tests.
Conclusion: The system demonstrates practical real-time driver behavior recognition on low-cost edge platforms, and reliable in-cabin perception can serve as upstream signal for human-centered vehicle intelligence and agentic vehicle concepts.
Abstract: In-cabin driver monitoring systems (DMS) must recognize distraction- and drowsiness-related behaviors with low latency under strict constraints on compute, power, and cost. We present a single-camera in-cabin driver behavior recognition system designed for deployment on two low-cost edge platforms: Raspberry Pi 5 (CPU-only) and the Google Coral development board with an Edge Tensor Processing Unit (Edge TPU) accelerator. The proposed pipeline combines (i) a compact per-frame vision model, (ii) a confounder-aware label taxonomy to reduce confusions among visually similar behaviors, and (iii) a temporal decision head that triggers alerts only when predictions are both confident and sustained. The system supports 17 behavior classes. Training and evaluation use licensed datasets plus in-house collection (over 800,000 labeled frames) with driver-disjoint splits, and we further validate the deployed system in live in-vehicle tests. End-to-end performance reaches approximately 16 FPS on Raspberry Pi 5 using 8-bit integer (INT8) inference (per-frame latency <60 ms) and approximately 25 FPS on Coral Edge TPU (end-to-end latency ~40 ms), enabling real-time monitoring and stable alert generation on embedded hardware. Finally, we discuss how reliable in-cabin perception can serve as an upstream signal for human-centered vehicle intelligence, including emerging agentic vehicle concepts.
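The temporal decision head is described as firing only on confident, sustained predictions. A minimal sketch of that logic follows; the class-agreement rule, `conf_thresh`, and `min_frames` are illustrative assumptions, not the paper's settings.

```python
from collections import deque

class TemporalDecisionHead:
    """Sketch: raise an alert only when one behavior class stays confident
    for `min_frames` consecutive frames."""
    def __init__(self, conf_thresh=0.8, min_frames=12):
        self.conf_thresh = conf_thresh
        self.min_frames = min_frames
        self.history = deque(maxlen=min_frames)

    def update(self, class_id, confidence):
        self.history.append((class_id, confidence))
        if len(self.history) < self.min_frames:
            return None
        first = self.history[0][0]
        sustained = all(c == first and p >= self.conf_thresh
                        for c, p in self.history)
        return first if sustained else None

head = TemporalDecisionHead()
alerts = [head.update(class_id=3, confidence=0.9) for _ in range(15)]
# alerts[-1] == 3: the class fires only after 12 sustained, confident frames
```

Gating alerts on sustained agreement is a cheap way to suppress single-frame flicker, which matters on hardware with no slack for heavier temporal models.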
[193] GAMBIT: A Gamified Jailbreak Framework for Multimodal Large Language Models
Xiangdong Hu, Yangyang Jiang, Qin Hu, Xiaojun Jia
Main category: cs.CV
TL;DR: GAMBIT is a novel multimodal jailbreak framework that uses gamified scenes to exploit MLLMs’ reasoning incentives, achieving high attack success rates by making models proactively complete harmful queries through structured reasoning chains.
Details
Motivation: Current multimodal jailbreak attacks underperform on reasoning models (with Chain-of-Thought) because they don't leverage the model's own reasoning incentives. The authors explore whether influencing a model's cognitive-stage decisions can make it proactively complete jailbreaks.
Method: GAMBIT decomposes and reassembles harmful visual semantics, constructs gamified scenes that drive models to explore, reconstruct intent, and answer as part of winning a game. This creates structured reasoning chains that increase task complexity in both vision and text, positioning the model as a participant whose goal pursuit reduces safety attention.
Result: Achieves high Attack Success Rates: 92.13% on Gemini 2.5 Flash, 91.20% on QvQ-MAX, and 85.87% on GPT-4o, significantly outperforming baselines on both reasoning and non-reasoning MLLMs.
Conclusion: By exploiting reasoning incentives through gamification, GAMBIT demonstrates that safety alignment in MLLMs remains fragile, especially when models are positioned as active participants in complex reasoning tasks that distract from safety considerations.
Abstract: Multimodal Large Language Models (MLLMs) have become widely deployed, yet their safety alignment remains fragile under adversarial inputs. Previous work has shown that increasing inference steps can disrupt safety mechanisms and lead MLLMs to generate attacker-desired harmful content. However, most existing attacks focus on increasing the complexity of the modified visual task itself and do not explicitly leverage the model’s own reasoning incentives. This leads to them underperforming on reasoning models (models with Chain-of-Thought) compared to non-reasoning ones (models without Chain-of-Thought). If a model can think like a human, can we influence its cognitive-stage decisions so that it proactively completes a jailbreak? To validate this idea, we propose GAMBIT (Gamified Adversarial Multimodal Breakout via Instructional Traps), a novel multimodal jailbreak framework that decomposes and reassembles harmful visual semantics, then constructs a gamified scene that drives the model to explore, reconstruct intent, and answer as part of winning the game. The resulting structured reasoning chain increases task complexity in both vision and text, positioning the model as a participant whose goal pursuit reduces safety attention and induces it to answer the reconstructed malicious query. Extensive experiments on popular reasoning and non-reasoning MLLMs demonstrate that GAMBIT achieves high Attack Success Rates (ASR), reaching 92.13% on Gemini 2.5 Flash, 91.20% on QvQ-MAX, and 85.87% on GPT-4o, significantly outperforming baselines.
[194] WeedRepFormer: Reparameterizable Vision Transformers for Real-Time Waterhemp Segmentation and Gender Classification
Toqi Tahamid Sarker, Taminul Islam, Khaled R. Ahmed, Cristiana Bernardi Rankrape, Kaitlin E. Creager, Karla Gage
Main category: cs.CV
TL;DR: WeedRepFormer is a lightweight multi-task Vision Transformer for simultaneous waterhemp segmentation and gender classification, using structural reparameterization to balance accuracy and efficiency.
Details
Motivation: Existing agricultural models struggle to balance fine-grained feature extraction for biological attribute classification with the efficiency needed for real-time deployment.
Method: Systematically integrates structural reparameterization across the entire architecture (Vision Transformer backbone, Lite R-ASPP decoder, and novel reparameterizable classification head) to decouple training-time capacity from inference-time latency.
Result: Achieves 92.18% mIoU for segmentation and 81.91% accuracy for gender classification with only 3.59M parameters and 3.80 GFLOPs. Runs at 108.95 FPS, outperforming iFormer-T by 4.40% in classification accuracy while reducing parameters by 1.9x.
Conclusion: WeedRepFormer effectively addresses the efficiency-accuracy trade-off in agricultural vision tasks, providing a practical solution for real-time waterhemp analysis with comprehensive performance improvements.
Abstract: We present WeedRepFormer, a lightweight multi-task Vision Transformer designed for simultaneous waterhemp segmentation and gender classification. Existing agricultural models often struggle to balance the fine-grained feature extraction required for biological attribute classification with the efficiency needed for real-time deployment. To address this, WeedRepFormer systematically integrates structural reparameterization across the entire architecture - comprising a Vision Transformer backbone, a Lite R-ASPP decoder, and a novel reparameterizable classification head - to decouple training-time capacity from inference-time latency. We also introduce a comprehensive waterhemp dataset containing 10,264 annotated frames from 23 plants. On this benchmark, WeedRepFormer achieves 92.18% mIoU for segmentation and 81.91% accuracy for gender classification using only 3.59M parameters and 3.80 GFLOPs. At 108.95 FPS, our model outperforms the state-of-the-art iFormer-T by 4.40% in classification accuracy while maintaining competitive segmentation performance and significantly reducing parameter count by 1.9x.
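Structural reparameterization generally means training with parallel branches and algebraically folding them into a single operator for inference. Below is a minimal sketch of the standard 3x3 + 1x1 fold (RepVGG-style); WeedRepFormer's actual branch topology is not specified in the abstract, so this illustrates the principle only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def merge_1x1_into_3x3(conv3, conv1):
    """Fold a training-time 3x3 + 1x1 parallel branch into one 3x3 conv."""
    fused = nn.Conv2d(conv3.in_channels, conv3.out_channels, 3, padding=1)
    w1_padded = F.pad(conv1.weight, [1, 1, 1, 1])  # place 1x1 at 3x3 center
    fused.weight.data = conv3.weight.data + w1_padded
    fused.bias.data = conv3.bias.data + conv1.bias.data
    return fused

conv3 = nn.Conv2d(8, 8, 3, padding=1)
conv1 = nn.Conv2d(8, 8, 1)
x = torch.randn(1, 8, 16, 16)
fused = merge_1x1_into_3x3(conv3, conv1)
# The fused conv reproduces the two-branch output exactly (up to fp error),
# so inference pays for only one convolution.
assert torch.allclose(conv3(x) + conv1(x), fused(x), atol=1e-5)
```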
[195] FROST-Drive: Scalable and Efficient End-to-End Driving with a Frozen Vision Encoder
Zeyu Dong, Yimin Zhu, Yu Wu, Yu Sun
Main category: cs.CV
TL;DR: FROST-Drive: A novel end-to-end autonomous driving model that keeps a pretrained VLM vision encoder frozen to preserve generalization capabilities, outperforming full fine-tuning approaches on challenging driving scenarios.
Details
Motivation: Current E2E autonomous driving models struggle with generalization to novel/complex scenarios. Full fine-tuning of vision encoders on driving data causes over-specialization and loss of the rich world knowledge from pretrained VLMs, limiting generalization capabilities.
Method: FROST-Drive architecture with frozen pretrained VLM vision encoder, transformer-based adapter for multimodal fusion, GRU-based decoder for smooth waypoint generation, and custom loss function optimized for Rater Feedback Score (RFS).
Result: Extensive experiments on Waymo Open E2E Dataset (long-tail scenarios) show frozen-encoder approach significantly outperforms full fine-tuning models, demonstrating better generalization and robust trajectory planning.
Conclusion: Preserving broad knowledge from capable VLMs through frozen encoders is more effective for robust, generalizable driving performance than intensive domain-specific adaptation, offering new pathway for real-world vision-based models.
Abstract: End-to-end (E2E) models in autonomous driving aim to directly map sensor inputs to control commands, but their ability to generalize to novel and complex scenarios remains a key challenge. The common practice of fully fine-tuning the vision encoder on driving datasets potentially limits its generalization by causing the model to specialize too heavily in the training data. This work challenges the necessity of this training paradigm. We propose FROST-Drive, a novel E2E architecture designed to preserve and leverage the powerful generalization capabilities of a pretrained vision encoder from a Vision-Language Model (VLM). By keeping the encoder’s weights frozen, our approach directly transfers the rich, generalized world knowledge from the VLM to the driving task. Our model architecture combines this frozen encoder with a transformer-based adapter for multimodal fusion and a GRU-based decoder for smooth waypoint generation. Furthermore, we introduce a custom loss function designed to directly optimize for Rater Feedback Score (RFS), a metric that prioritizes robust trajectory planning. We conduct extensive experiments on the Waymo Open E2E Dataset, a large-scale dataset deliberately curated to capture long-tail scenarios, demonstrating that our frozen-encoder approach significantly outperforms models that employ full fine-tuning. Our results provide substantial evidence that preserving the broad knowledge of a capable VLM is a more effective strategy for achieving robust, generalizable driving performance than intensive domain-specific adaptation. This offers a new pathway for developing vision-based models that can better handle the complexities of real-world application domains.
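A minimal sketch of the frozen-encoder training setup: only the adapter, decoder, and head receive gradients. The class name, layer sizes, mean-pooled context, waypoint count, and the toy stand-in encoder are all assumptions; the paper's exact architecture is not reproduced here.

```python
import torch
import torch.nn as nn

class FrozenEncoderPlanner(nn.Module):
    """Sketch: frozen VLM vision encoder + trainable adapter and GRU head."""
    def __init__(self, vision_encoder, dim=512, n_waypoints=8):
        super().__init__()
        self.encoder = vision_encoder
        for p in self.encoder.parameters():
            p.requires_grad = False             # keep pretrained knowledge intact
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.adapter = nn.TransformerEncoder(layer, num_layers=2)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, 2)           # (x, y) per waypoint
        self.n_waypoints = n_waypoints

    def forward(self, images):
        with torch.no_grad():                   # encoder stays frozen
            tokens = self.encoder(images)       # assumed shape: (B, T, dim)
        ctx = self.adapter(tokens).mean(dim=1, keepdim=True)
        seq = ctx.repeat(1, self.n_waypoints, 1)  # same context at each step
        out, _ = self.decoder(seq)
        return self.head(out)                   # (B, n_waypoints, 2)

# Toy stand-in for a VLM vision encoder, just to make the sketch runnable.
toy_encoder = nn.Sequential(nn.Flatten(start_dim=2), nn.Linear(224 * 224, 512))
planner = FrozenEncoderPlanner(toy_encoder)
waypoints = planner(torch.randn(2, 3, 224, 224))   # (2, 8, 2)
```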
[196] V-Agent: An Interactive Video Search System Using Vision-Language Models
SunYoung Park, Jong-Hyeon Lee, Youngjune Kim, Daegyu Sung, Younghyun Yu, Young-rok Cha, Jeongho Ju
Main category: cs.CV
TL;DR: V-Agent is a multi-agent platform for video search and conversation that uses a fine-tuned vision-language model with multimodal retrieval to understand both visual and spoken content, achieving state-of-the-art zero-shot performance.
Details
Motivation: Traditional text-based retrieval systems are limited in multimodal scenarios where videos contain both visual and audio content. There's a need for systems that can understand and search videos based on both visual and spoken elements in a conversational manner.
Method: Fine-tunes a vision-language model with a small video preference dataset and enhances it with retrieval vectors from an image-text retrieval model. Uses three collaborative agents: routing agent, search agent, and chat agent. The search agent embeds video frames and ASR transcriptions into shared multimodal space and includes a re-ranking module for improved retrieval quality.
Result: Demonstrates state-of-the-art zero-shot performance on the MultiVENT 2.0 benchmark, showing superior video retrieval capabilities compared to existing methods.
Conclusion: V-Agent provides an effective framework for advanced video search and interactive conversations, with potential for both academic research and real-world applications. The system successfully overcomes limitations of text-only retrieval in multimodal scenarios.
Abstract: We introduce V-Agent, a novel multi-agent platform designed for advanced video search and interactive user-system conversations. By fine-tuning a vision-language model (VLM) with a small video preference dataset and enhancing it with a retrieval vector from an image-text retrieval model, we overcome the limitations of traditional text-based retrieval systems in multimodal scenarios. The VLM-based retrieval model independently embeds video frames and audio transcriptions from an automatic speech recognition (ASR) module into a shared multimodal representation space, enabling V-Agent to interpret both visual and spoken content for context-aware video search. This system consists of three agents (a routing agent, a search agent, and a chat agent) that work collaboratively to address user intents by refining search outputs and communicating with users. The search agent utilizes the VLM-based retrieval model together with an additional re-ranking module to further enhance video retrieval quality. Our proposed framework demonstrates state-of-the-art zero-shot performance on the MultiVENT 2.0 benchmark, highlighting its potential for both academic research and real-world applications. The retrieval model and demo videos are available at https://huggingface.co/NCSOFT/multimodal-embedding.
[197] Experimental Comparison of Light-Weight and Deep CNN Models Across Diverse Datasets
Md. Hefzul Hossain Papon, Shadman Rabby
Main category: cs.CV
TL;DR: A well-regularized shallow CNN architecture serves as a highly competitive baseline across diverse Bangladeshi vision datasets without requiring large GPUs or specialized pre-trained models.
Details
Motivation: To establish a unified, reproducible benchmark for multiple Bangladeshi vision datasets and demonstrate the practical value of lightweight CNNs for real-world deployment in low-resource settings.
Method: Uses a well-regularized shallow architecture (lightweight CNN) that doesn’t require large computational resources or specialized pre-trained models.
Result: The shallow architecture serves as a highly competitive baseline across heterogeneous domains including smart-city surveillance and agricultural variety classification.
Conclusion: Lightweight CNNs have significant practical value for real-world deployment in low-resource settings, providing an effective alternative to resource-intensive deep models.
Abstract: Our results reveal that a well-regularized shallow architecture can serve as a highly competitive baseline across heterogeneous domains - from smart-city surveillance to agricultural variety classification - without requiring large GPUs or specialized pre-trained models. This work establishes a unified, reproducible benchmark for multiple Bangladeshi vision datasets and highlights the practical value of lightweight CNNs for real-world deployment in low-resource settings.
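The abstract does not define the architecture, so the following is only a generic example of what a "well-regularized shallow" CNN baseline looks like in PyTorch; depths, widths, and the dropout rate are assumptions.

```python
import torch.nn as nn

def shallow_cnn(num_classes, p_drop=0.3):
    """Sketch of a well-regularized shallow baseline: two conv blocks with
    batch norm, global average pooling, and dropout before the classifier."""
    return nn.Sequential(
        nn.Conv2d(3, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
        nn.MaxPool2d(2),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Dropout(p_drop), nn.Linear(64, num_classes),
    )

# Pair with weight decay and data augmentation for the regularization, e.g.:
# optimizer = torch.optim.AdamW(model.parameters(), weight_decay=1e-4)
```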
[198] Latent Geometry of Taste: Scalable Low-Rank Matrix Factorization
Joshua Salako
Main category: cs.CV
TL;DR: A scalable ALS-based recommender system on MovieLens 32M shows constrained low-rank models outperform higher-dimensional ones, capturing semantic genre clusters from interaction data and offering tunable cold-start solutions.
Details
Motivation: Address scalability and data sparsity challenges in collaborative filtering on massive datasets like MovieLens 32M.
Method: High-performance parallelized Alternating Least Squares (ALS) framework with extensive hyperparameter optimization, constrained low-rank modeling, and embedding space visualization.
Result: Constrained low-rank models significantly outperform higher-dimensional counterparts in generalization (RMSE and ranking precision), reveal semantic genre clusters in embeddings, and provide effective cold-start solutions with tunable popularity-personalization trade-off.
Conclusion: ALS-based constrained low-rank modeling effectively addresses scalability and sparsity while capturing deep structural relationships from interaction data, with practical utility demonstrated in cold-start scenarios.
Abstract: Scalability and data sparsity remain critical bottlenecks for collaborative filtering on massive interaction datasets. This work investigates the latent geometry of user preferences using the MovieLens 32M dataset, implementing a high-performance, parallelized Alternating Least Squares (ALS) framework. Through extensive hyperparameter optimization, we demonstrate that constrained low-rank models significantly outperform higher dimensional counterparts in generalization, achieving an optimal balance between Root Mean Square Error (RMSE) and ranking precision. We visualize the learned embedding space to reveal the unsupervised emergence of semantic genre clusters, confirming that the model captures deep structural relationships solely from interaction data. Finally, we validate the system’s practical utility in a cold-start scenario, introducing a tunable scoring parameter to manage the trade-off between popularity bias and personalized affinity effectively. The codebase for this research can be found here: https://github.com/joshsalako/recommender.git
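ALS alternates closed-form ridge-regression solves for user and item factors. A minimal NumPy sketch of the user half-step follows (the item update is symmetric); the regularization strength, rank, and data layout are illustrative, not the paper's configuration.

```python
import numpy as np

def als_update_users(R, V, lam):
    """One ALS half-step: solve the regularized least-squares problem for
    every user's factors with item factors V held fixed.
    R is a (users x items) rating matrix with np.nan marking missing entries."""
    k = V.shape[1]
    U = np.zeros((R.shape[0], k))
    for u in range(R.shape[0]):
        rated = ~np.isnan(R[u])
        Vr = V[rated]                          # items this user actually rated
        A = Vr.T @ Vr + lam * np.eye(k)        # ridge normal equations
        b = Vr.T @ R[u, rated]
        U[u] = np.linalg.solve(A, b)
    return U

rng = np.random.default_rng(0)
R = rng.random((50, 40))
R[rng.random(R.shape) < 0.8] = np.nan          # sparse observations
V = rng.normal(size=(40, 8))                   # rank-8 item factors
U = als_update_users(R, V, lam=0.1)            # alternate with the item update
```

Because each user's solve is independent, the loop parallelizes trivially, which is what makes ALS attractive at the 32M-interaction scale.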
[199] ThinkRL-Edit: Thinking in Reinforcement Learning for Reasoning-Centric Image Editing
Hengjia Li, Liming Jiang, Qing Yan, Yizhi Song, Hao Kang, Zichuan Liu, Xin Lu, Boxi Wu, Deng Cai
Main category: cs.CV
TL;DR: ThinkRL-Edit is a reasoning-centric RL framework for image editing that decouples visual reasoning from synthesis, uses CoT-based reasoning sampling with planning/reflection stages, unbiased chain preference grouping, and binary checklist rewards to improve instruction-faithful editing.
Details
Motivation: Current instruction-driven image editing models have limited visual reasoning capabilities, leading to poor performance on reasoning-centric edits. Existing RL approaches face three key challenges: limited reasoning exploration confined to denoising stochasticity, biased reward fusion, and unstable VLM-based instruction rewards.
Method: 1) Decouples visual reasoning from image synthesis and expands reasoning exploration beyond denoising. 2) Introduces Chain-of-Thought-based reasoning sampling with planning and reflection stages before generation. 3) Uses unbiased chain preference grouping strategy across multiple reward dimensions instead of weighted aggregation. 4) Replaces interval-based VLM scores with binary checklist for more precise, lower-variance rewards.
Result: The method significantly outperforms prior work on reasoning-centric image editing, producing instruction-faithful, visually coherent, and semantically grounded edits.
Conclusion: ThinkRL-Edit successfully addresses key challenges in RL-based image editing by introducing reasoning-centric approaches that improve exploration, reward design, and stability, leading to superior performance on complex reasoning tasks.
Abstract: Instruction-driven image editing with unified multimodal generative models has advanced rapidly, yet their underlying visual reasoning remains limited, leading to suboptimal performance on reasoning-centric edits. Reinforcement learning (RL) has been investigated for improving the quality of image editing, but it faces three key challenges: (1) limited reasoning exploration confined to denoising stochasticity, (2) biased reward fusion, and (3) unstable VLM-based instruction rewards. In this work, we propose ThinkRL-Edit, a reasoning-centric RL framework that decouples visual reasoning from image synthesis and expands reasoning exploration beyond denoising. To this end, we introduce Chain-of-Thought (CoT)-based reasoning sampling with planning and reflection stages prior to generation in online sampling, compelling the model to explore multiple semantic hypotheses and validate their plausibility before committing to a visual outcome. To avoid the failures of weighted aggregation, we propose an unbiased chain preference grouping strategy across multiple reward dimensions. Moreover, we replace interval-based VLM scores with a binary checklist, yielding more precise, lower-variance, and interpretable rewards for complex reasoning. Experiments show our method significantly outperforms prior work on reasoning-centric image editing, producing instruction-faithful, visually coherent, and semantically grounded edits.
[200] Understanding Reward Hacking in Text-to-Image Reinforcement Learning
Yunqi Hong, Kuei-Chun Kao, Hengguang Zhou, Cho-Jui Hsieh
Main category: cs.CV
TL;DR: The paper analyzes reward hacking in text-to-image RL post-training, identifies artifact generation as a common failure mode, and proposes a lightweight artifact reward model to mitigate the issue.
Details
Motivation: Existing reward functions for RL post-training of image generation models are imperfect proxies for human judgment, leading to reward hacking where models produce unrealistic/low-quality images that still achieve high reward scores.
Method: Systematically analyze reward hacking behaviors in T2I RL post-training, investigate individual contributions of aesthetic/human preference rewards and prompt-image consistency rewards, and propose a lightweight adaptive artifact reward model trained on curated artifact-free/artifact-containing samples.
Result: Experiments show that incorporating the artifact reward significantly improves visual realism and reduces reward hacking across multiple T2I RL setups, demonstrating effectiveness as a safeguard against reward hacking.
Conclusion: Lightweight reward augmentations can serve as effective regularizers against reward hacking in RL post-training for text-to-image models, improving alignment with true human judgment.
Abstract: Reinforcement learning (RL) has become a standard approach for post-training large language models and, more recently, for improving image generation models, using reward functions to enhance generation quality and human preference alignment. However, existing reward designs are often imperfect proxies for true human judgment, making models prone to reward hacking: producing unrealistic or low-quality images that nevertheless achieve high reward scores. In this work, we systematically analyze reward hacking behaviors in text-to-image (T2I) RL post-training. We investigate how both aesthetic/human preference rewards and prompt-image consistency rewards individually contribute to reward hacking and further show that ensembling multiple rewards can only partially mitigate this issue. Across diverse reward models, we identify a common failure mode: the generation of artifact-prone images. To address this, we propose a lightweight and adaptive artifact reward model, trained on a small curated dataset of artifact-free and artifact-containing samples. This model can be integrated into existing RL pipelines as an effective regularizer for commonly used reward models. Experiments demonstrate that incorporating our artifact reward significantly improves visual realism and reduces reward hacking across multiple T2I RL setups, showing the effectiveness of lightweight reward augmentation as a safeguard against reward hacking.
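A minimal sketch of how such an artifact reward could plug into an existing RL pipeline as a regularizer. `pref_model`, `artifact_model`, and the mixing weight `lam` are hypothetical placeholders, not the paper's interfaces.

```python
def combined_reward(images, prompts, pref_model, artifact_model, lam=0.5):
    """Augment a preference reward with a lightweight artifact score acting
    as a regularizer against reward hacking. Both models are assumed to be
    callables returning one scalar per image."""
    r_pref = pref_model(images, prompts)    # e.g., aesthetic / CLIP-style reward
    p_artifact = artifact_model(images)     # P(image contains visual artifacts)
    return r_pref - lam * p_artifact        # penalize artifact-prone samples
```

The point of the subtraction is that hacked samples which maximize `r_pref` through artifacts no longer dominate the policy gradient.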
[201] CroBIM-U: Uncertainty-Driven Referring Remote Sensing Image Segmentation
Yuzhe Sun, Zhe Dong, Haochen Jiang, Tianzhu Liu, Yanfeng Gu
Main category: cs.CV
TL;DR: Proposes an uncertainty-guided framework for referring remote sensing image segmentation that uses pixel-wise uncertainty maps to adaptively modulate language fusion and refinement, addressing spatial non-uniformity in cross-modal alignment.
Details
Motivation: Existing methods use uniform fusion/refinement across entire images, which introduces unnecessary linguistic noise in clear regions while failing to provide sufficient disambiguation in confused areas, due to extreme scale variations, dense distractors, and complex boundaries in remote sensing imagery.
Method: 1) Referring Uncertainty Scorer (RUS) trained via online error-consistency supervision to predict spatial referential ambiguity; 2) Uncertainty-Gated Fusion (UGF) dynamically modulates language injection strength based on uncertainty; 3) Uncertainty-Driven Local Refinement (UDLR) focuses refinement on error-prone boundaries using uncertainty-derived soft masks.
Result: Extensive experiments show the method significantly improves robustness and geometric fidelity in complex remote sensing scenes without altering backbone architecture, functioning as a unified plug-and-play solution.
Conclusion: The uncertainty-guided framework effectively addresses spatial non-uniformity in cross-modal alignment for referring remote sensing segmentation, providing adaptive inference that enhances performance in challenging scenarios while maintaining architectural compatibility.
Abstract: Referring remote sensing image segmentation aims to localize specific targets described by natural language within complex overhead imagery. However, due to extreme scale variations, dense similar distractors, and intricate boundary structures, the reliability of cross-modal alignment exhibits significant spatial non-uniformity. Existing methods typically employ uniform fusion and refinement strategies across the entire image, which often introduces unnecessary linguistic perturbations in visually clear regions while failing to provide sufficient disambiguation in confused areas. To address this, we propose an uncertainty-guided framework that explicitly leverages a pixel-wise referring uncertainty map as a spatial prior to orchestrate adaptive inference. Specifically, we introduce a plug-and-play Referring Uncertainty Scorer (RUS), which is trained via an online error-consistency supervision strategy to interpretably predict the spatial distribution of referential ambiguity. Building on this prior, we design two plug-and-play modules: 1) Uncertainty-Gated Fusion (UGF), which dynamically modulates language injection strength to enhance constraints in high-uncertainty regions while suppressing noise in low-uncertainty ones; and 2) Uncertainty-Driven Local Refinement (UDLR), which utilizes uncertainty-derived soft masks to focus refinement on error-prone boundaries and fine details. Extensive experiments demonstrate that our method functions as a unified, plug-and-play solution that significantly improves robustness and geometric fidelity in complex remote sensing scenes without altering the backbone architecture.
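A minimal sketch of the uncertainty-gated fusion idea: language features are injected in proportion to a learned function of the pixel-wise uncertainty map. The layer choices and shapes below are assumptions, not UGF's published design.

```python
import torch
import torch.nn as nn

class UncertaintyGatedFusion(nn.Module):
    """Sketch: per-pixel uncertainty scales how strongly language features
    are injected into the visual feature map."""
    def __init__(self, dim=256):
        super().__init__()
        self.lang_proj = nn.Conv2d(dim, dim, 1)
        self.gate = nn.Sequential(nn.Conv2d(1, 1, 3, padding=1), nn.Sigmoid())

    def forward(self, vis_feat, lang_feat, uncertainty):
        # vis_feat, lang_feat: (B, C, H, W); uncertainty: (B, 1, H, W) in [0, 1]
        g = self.gate(uncertainty)     # stronger injection where ambiguous
        return vis_feat + g * self.lang_proj(lang_feat)

vis = torch.randn(2, 256, 32, 32)
lang = torch.randn(2, 256, 32, 32)   # language features broadcast to the grid
u = torch.rand(2, 1, 32, 32)         # referring uncertainty map (e.g., from RUS)
out = UncertaintyGatedFusion()(vis, lang, u)
```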
[202] SDCD: Structure-Disrupted Contrastive Decoding for Mitigating Hallucinations in Large Vision-Language Models
Yuxuan Xia, Siheng Wang, Peng Li
Main category: cs.CV
TL;DR: SDCD is a training-free method that reduces object hallucinations in LVLMs by disrupting visual structure to suppress texture-driven bias.
Details
Motivation: Object hallucination remains a critical challenge in LVLMs, and existing approaches overlook the internal complexities of visual encoding. The paper identifies visual statistical bias from Vision Encoders' Bag-of-Patches behavior as a key factor causing hallucinations.
Method: Structure-Disrupted Contrastive Decoding (SDCD) - a training-free algorithm that performs contrastive calibration of output distribution by introducing a shuffled structure-disrupted view. It penalizes tokens that maintain high confidence under this structure-less view to suppress texture-driven bias.
Result: SDCD significantly mitigates hallucinations across multiple benchmarks and enhances overall multimodal capabilities of LVLMs.
Conclusion: Visual statistical bias from weak structural supervision in Vision Encoders contributes to object hallucinations, and SDCD effectively addresses this by suppressing texture-driven bias through structure-disrupted contrastive decoding.
Abstract: Large Vision-Language Models (LVLMs) demonstrate significant progress in multimodal understanding and reasoning, yet object hallucination remains a critical challenge. While existing research focuses on mitigating language priors or high-level statistical biases, they often overlook the internal complexities of the visual encoding process. We identify that visual statistical bias, arising from the inherent Bag-of-Patches behavior of Vision Encoders under weak structural supervision, acts as a contributing factor of object hallucinations. Under this bias, models prioritize local texture features within individual patches over holistic geometric structures. This tendency may induce spurious visual confidence and result in hallucinations. To address this, we introduce a training-free algorithm called Structure-Disrupted Contrastive Decoding (SDCD), which performs contrastive calibration of the output distribution by introducing a shuffled structure-disrupted view. By penalizing tokens that maintain high confidence under this structure-less view, SDCD effectively suppresses the texture-driven bias. Experimental results demonstrate that SDCD significantly mitigates hallucinations across multiple benchmarks and enhances the overall multimodal capabilities of LVLMs.
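SDCD's contrastive calibration can be sketched with the standard contrastive-decoding combination of two logit streams: one from the original image and one from a patch-shuffled, structure-disrupted view. The model interface, grid size, and the exact `(1 + alpha)` combination rule below are assumptions modeled on prior contrastive-decoding work, not necessarily the paper's formula.

```python
import torch

def sdcd_logits(model, input_ids, image, alpha=1.0, grid=4):
    """Contrast next-token logits on the original image against a
    patch-shuffled view; `model(input_ids, image)` returning logits is
    an assumed interface."""
    b, c, h, w = image.shape
    ph, pw = h // grid, w // grid
    # Cut the image into a grid of patches and shuffle their positions,
    # preserving local texture while destroying global structure.
    patches = image.unfold(2, ph, ph).unfold(3, pw, pw)  # (b, c, g, g, ph, pw)
    flat = patches.reshape(b, c, grid * grid, ph, pw)
    flat = flat[:, :, torch.randperm(grid * grid)]
    shuffled = (flat.reshape(b, c, grid, grid, ph, pw)
                    .permute(0, 1, 2, 4, 3, 5)
                    .reshape(b, c, h, w))
    logits = model(input_ids, image)
    logits_disrupted = model(input_ids, shuffled)
    # Penalize tokens that stay confident without global structure.
    return (1 + alpha) * logits - alpha * logits_disrupted
```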
[203] REFA: Real-time Egocentric Facial Animations for Virtual Reality
Qiang Zhang, Tong Xiao, Haroun Habeeb, Larissa Laich, Sofien Bouaziz, Patrick Snape, Wenjing Zhang, Matthew Cioffi, Peizhao Zhang, Pavel Pidlypenskyi, Winnie Lin, Luming Ma, Mengjiao Wang, Kunpeng Li, Chengjiang Long, Steven Song, Martin Prazak, Alexander Sjoholm, Ajinkya Deogade, Jaebong Lee, Julio Delgado Mangas, Amaury Aubel
Main category: cs.CV
TL;DR: Real-time facial expression tracking system using VR headset infrared cameras for driving virtual character expressions without calibration.
Details
Motivation: Enable non-intrusive, accurate facial expression tracking for virtual characters in VR environments without requiring lengthy calibration steps.
Method: Distillation-based ML approach trained on heterogeneous data (synthetic/real images) from 18k subjects captured via mobile phone + custom VR headset with extra cameras, using differentiable rendering pipeline for automatic label extraction.
Result: Developed a real-time facial expression tracking system that accurately drives virtual character expressions without calibration, using infrared cameras embedded in VR headset.
Conclusion: System enables new communication and expression avenues in virtual environments for video conferencing, gaming, entertainment, and remote collaboration.
Abstract: We present a novel system for real-time tracking of facial expressions using egocentric views captured from a set of infrared cameras embedded in a virtual reality (VR) headset. Our technology enables any user to accurately drive the facial expressions of virtual characters in a non-intrusive manner and without the need for a lengthy calibration step. At the core of our system is a distillation-based approach to train a machine learning model on heterogeneous data and labels coming from multiple sources, e.g., synthetic and real images. As part of our dataset, we collected 18k diverse subjects using a lightweight capture setup consisting of a mobile phone and a custom VR headset with extra cameras. To process this data, we developed a robust differentiable rendering pipeline enabling us to automatically extract facial expression labels. Our system opens up new avenues for communication and expression in virtual environments, with applications in video conferencing, gaming, entertainment, and remote collaboration.
[204] G2P: Gaussian-to-Point Attribute Alignment for Boundary-Aware 3D Semantic Segmentation
Hojun Song, Chae-yeong Song, Jeong-hun Hong, Chaewon Moon, Dong-hwi Kim, Gahyeon Kim, Soo Ye Kim, Yiyi Liao, Jaehyup Lee, Sang-hyo Park
Main category: cs.CV
TL;DR: G2P transfers appearance-aware attributes from 3D Gaussian Splatting to point clouds for better semantic segmentation by addressing geometric ambiguity and improving boundary localization.
Details
Motivation: Point clouds have sparse and irregular distributions with limited appearance evidence, making geometry-only features insufficient to distinguish objects with similar shapes but different appearances (color, texture, material).
Method: Gaussian-to-Point (G2P) transfers appearance-aware attributes from 3D Gaussian Splatting to point clouds. It establishes point-wise correspondences to address misalignment between optimized Gaussians and original point geometry, leverages Gaussian opacity attributes to resolve geometric ambiguity, and uses Gaussian scale attributes for precise boundary localization.
Result: Extensive experiments show superior performance on standard benchmarks with significant improvements on geometrically challenging classes, all achieved without any 2D or language supervision.
Conclusion: G2P effectively enhances point cloud semantic segmentation by transferring appearance-aware attributes from 3D Gaussian Splatting, overcoming limitations of geometry-only features and improving discrimination between objects with similar shapes but different appearances.
Abstract: Semantic segmentation on point clouds is critical for 3D scene understanding. However, sparse and irregular point distributions provide limited appearance evidence, making geometry-only features insufficient to distinguish objects with similar shapes but distinct appearances (e.g., color, texture, material). We propose Gaussian-to-Point (G2P), which transfers appearance-aware attributes from 3D Gaussian Splatting to point clouds for more discriminative and appearance-consistent segmentation. G2P addresses the misalignment between optimized Gaussians and original point geometry by establishing point-wise correspondences. By leveraging Gaussian opacity attributes, we resolve the geometric ambiguity that limits existing models. Additionally, Gaussian scale attributes enable precise boundary localization in complex 3D scenes. Extensive experiments demonstrate that our approach achieves superior performance on standard benchmarks and shows significant improvements on geometrically challenging classes, all without any 2D or language supervision.
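The abstract does not detail the correspondence mechanism; a simple k-nearest-neighbor transfer with inverse-distance weighting, sketched below, conveys the idea of mapping Gaussian attributes (opacity, scale, color) onto the original points. The `k`, the weighting scheme, and the toy data are illustrative assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

def transfer_gaussian_attributes(points, gauss_xyz, gauss_attrs, k=3):
    """Give each point appearance attributes from its k nearest optimized
    Gaussians, inverse-distance weighted to soften point/Gaussian misalignment."""
    tree = cKDTree(gauss_xyz)
    dist, idx = tree.query(points, k=k)          # both (N, k)
    w = 1.0 / (dist + 1e-8)
    w /= w.sum(axis=1, keepdims=True)
    return (gauss_attrs[idx] * w[..., None]).sum(axis=1)   # (N, attr_dim)

points = np.random.rand(1000, 3)
gauss_xyz = np.random.rand(5000, 3)
gauss_attrs = np.random.rand(5000, 8)   # e.g., opacity + scale + color features
point_attrs = transfer_gaussian_attributes(points, gauss_xyz, gauss_attrs)
# point_attrs can then be concatenated with geometry as segmentation input.
```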
[205] Semantic Belief-State World Model for 3D Human Motion Prediction
Sarim Chaudhry
Main category: cs.CV
TL;DR: SBWM reframes human motion prediction as latent dynamical simulation on the human body manifold using a probabilistic belief state aligned with SMPL-X anatomical parameters, enabling stable long-horizon rollouts with lower computational cost.
Details
Motivation: Traditional motion prediction methods suffer from compounding drift, mean-pose collapse, and poorly calibrated uncertainty when rolled forward beyond training. They don't separate observation reconstruction from dynamics modeling and lack explicit representation of latent causes governing motion.
Method: Proposes Semantic Belief-State World Model (SBWM) that maintains a recurrent probabilistic belief state whose evolution is learned independently of pose reconstruction and explicitly aligned with SMPL-X anatomical parameterization. Uses stochastic latent transitions and rollout-centric training adapted from belief-state world models in reinforcement learning.
Result: SBWM demonstrates coherent long-horizon rollouts and competitive accuracy at substantially lower computational cost compared to RSSM-based, transformer, and diffusion approaches.
Conclusion: Treating the human body as part of the world model’s state space rather than its output fundamentally changes how motion is simulated and predicted, enabling stable forward simulation with explicit representation of motion dynamics and intent.
Abstract: Human motion prediction has traditionally been framed as a sequence regression problem where models extrapolate future joint coordinates from observed pose histories. While effective over short horizons, this approach does not separate observation reconstruction from dynamics modeling and offers no explicit representation of the latent causes governing motion. As a result, existing methods exhibit compounding drift, mean-pose collapse, and poorly calibrated uncertainty when rolled forward beyond the training regime. Here we propose a Semantic Belief-State World Model (SBWM) that reframes human motion prediction as latent dynamical simulation on the human body manifold. Rather than predicting poses directly, SBWM maintains a recurrent probabilistic belief state whose evolution is learned independently of pose reconstruction and explicitly aligned with the SMPL-X anatomical parameterization. This alignment imposes a structural information bottleneck that prevents the latent state from encoding static geometry or sensor noise, forcing it to capture motion dynamics, intent, and control-relevant structure. Inspired by belief-state world models developed for model-based reinforcement learning, SBWM adapts stochastic latent transitions and rollout-centric training to the domain of human motion. In contrast to RSSM-based, transformer, and diffusion approaches optimized for reconstruction fidelity, SBWM prioritizes stable forward simulation. We demonstrate coherent long-horizon rollouts and competitive accuracy at substantially lower computational cost. These results suggest that treating the human body as part of the world model's state space rather than its output fundamentally changes how motion is simulated and predicted.
[206] Physics-Constrained Cross-Resolution Enhancement Network for Optics-Guided Thermal UAV Image Super-Resolution
Zhicheng Zhao, Fengjiao Peng, Jinquan Yan, Wei Lu, Chenglong Li, Jin Tang
Main category: cs.CV
TL;DR: PCNet improves thermal UAV image super-resolution using optical guidance with physics-constrained thermal conduction to avoid artifacts and preserve high-frequency information.
Details
Motivation: Existing methods compress optical features for cross-modal alignment, causing loss of high-frequency information and introducing physically inconsistent artifacts like texture distortions and edge blurring due to overlooking imaging physics differences between optical and thermal modalities.
Method: Proposes PCNet with: 1) Cross-Resolution Mutual Enhancement Module (CRME) for joint optimization of thermal super-resolution and optical-to-thermal conversion with bidirectional feature interaction; 2) Physics-Driven Thermal Conduction Module (PDTM) incorporating 2D heat conduction into optical guidance; 3) Temperature consistency loss for regional distribution consistency and boundary gradient smoothness.
Result: Extensive experiments on VGTSR2.0 and DroneVehicle datasets show PCNet significantly outperforms state-of-the-art methods on both reconstruction quality and downstream tasks including semantic segmentation and object detection.
Conclusion: PCNet achieves robust thermal UAV image super-resolution through cross-resolution mutual enhancement and physics-constrained optical guidance, effectively addressing limitations of existing methods while maintaining physical consistency with real-world thermal radiation principles.
Abstract: Optics-guided thermal UAV image super-resolution has attracted significant research interest due to its potential in all-weather monitoring applications. However, existing methods typically compress optical features to match thermal feature dimensions for cross-modal alignment and fusion, which not only causes the loss of high-frequency information that is beneficial for thermal super-resolution, but also introduces physically inconsistent artifacts such as texture distortions and edge blurring by overlooking differences in the imaging physics between modalities. To address these challenges, we propose PCNet to achieve cross-resolution mutual enhancement between optical and thermal modalities, while physically constraining the optical guidance process via thermal conduction to enable robust thermal UAV image super-resolution. In particular, we design a Cross-Resolution Mutual Enhancement Module (CRME) to jointly optimize thermal image super-resolution and optical-to-thermal modality conversion, facilitating effective bidirectional feature interaction across resolutions while preserving high-frequency optical priors. Moreover, we propose a Physics-Driven Thermal Conduction Module (PDTM) that incorporates two-dimensional heat conduction into optical guidance, modeling spatially-varying heat conduction properties to prevent inconsistent artifacts. In addition, we introduce a temperature consistency loss that enforces regional distribution consistency and boundary gradient smoothness to ensure generated thermal images align with real-world thermal radiation principles. Extensive experiments on VGTSR2.0 and DroneVehicle datasets demonstrate that PCNet significantly outperforms state-of-the-art methods on both reconstruction quality and downstream tasks including semantic segmentation and object detection.
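The physics prior here is the 2D heat equation, dT/dt = kappa * laplacian(T). A minimal explicit finite-difference step follows; the time step and toy tensors are assumptions, and the paper's PDTM presumably learns the spatially varying conductivity rather than sampling it randomly as done below.

```python
import torch
import torch.nn.functional as F

def heat_conduction_step(T, kappa, dt=0.1):
    """One explicit finite-difference step of 2D heat conduction.
    T: (B, 1, H, W) temperature map; kappa: scalar or per-pixel map."""
    lap_kernel = torch.tensor([[0., 1., 0.],
                               [1., -4., 1.],
                               [0., 1., 0.]]).view(1, 1, 3, 3)
    lap = F.conv2d(T, lap_kernel.to(T.dtype), padding=1)  # discrete Laplacian
    return T + dt * kappa * lap

T = torch.rand(1, 1, 64, 64)
kappa = torch.rand(1, 1, 64, 64)   # spatially varying conductivity (toy values)
T_next = heat_conduction_step(T, kappa)
```

Constraining optical guidance with such a diffusion step smooths temperature fields in a physically plausible way instead of copying optical texture verbatim.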
[207] CloudMatch: Weak-to-Strong Consistency Learning for Semi-Supervised Cloud Detection
Jiayi Zhao, Changlu Chen, Jingsheng Li, Tianxiang Xue, Kun Zhan
Main category: cs.CV
TL;DR: CloudMatch: A semi-supervised framework for cloud detection using view-consistency learning with scene-mixing augmentations to leverage unlabeled remote sensing imagery.
Details
Motivation: Pixel-level annotation for cloud detection is expensive, creating a need for semi-supervised approaches. Cloud patterns show structural diversity and contextual variability across different scenes, requiring methods that can capture this complexity without extensive labeled data.
Method: CloudMatch uses view-consistency learning with scene-mixing augmentations. For each unlabeled image, it generates one weakly augmented view and two strongly augmented views: one with inter-scene patch mixing for contextual variety, and another with intra-scene mixing for semantic coherence. This enforces prediction consistency across diverse views to guide pseudolabel generation.
Result: Extensive experiments demonstrate that CloudMatch achieves good performance, showing its capability to efficiently utilize unlabeled data and advance semi-supervised cloud detection.
Conclusion: CloudMatch effectively leverages unlabeled remote sensing imagery through view-consistency learning with scene-mixing augmentations, enabling the model to capture structural diversity and contextual richness of cloud patterns while advancing semi-supervised cloud detection.
Abstract: Due to the high cost of annotating accurate pixel-level labels, semi-supervised learning has emerged as a promising approach for cloud detection. In this paper, we propose CloudMatch, a semi-supervised framework that effectively leverages unlabeled remote sensing imagery through view-consistency learning combined with scene-mixing augmentations. An observation behind CloudMatch is that cloud patterns exhibit structural diversity and contextual variability across different scenes and within the same scene category. Our key insight is that enforcing prediction consistency across diversely augmented views, incorporating both inter-scene and intra-scene mixing, enables the model to capture the structural diversity and contextual richness of cloud patterns. Specifically, CloudMatch generates one weakly augmented view along with two complementary strongly augmented views for each unlabeled image: one integrates inter-scene patches to simulate contextual variety, while the other employs intra-scene mixing to preserve semantic coherence. This approach guides pseudolabel generation and enhances generalization. Extensive experiments show that CloudMatch achieves good performance, demonstrating its capability to utilize unlabeled data efficiently and advance semi-supervised cloud detection.
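A minimal sketch of the two ingredients the framework combines: an inter-scene patch-mixing augmentation and a confidence-thresholded consistency loss driven by pseudo-labels from the weak view. The rectangle size, threshold, and CutMix-style mixing describe the general semi-supervised recipe, not CloudMatch's exact procedure.

```python
import torch
import torch.nn.functional as F

def inter_scene_mix(img_a, img_b, ratio=0.5):
    """Paste a rectangle from a different scene into the strong view."""
    _, _, h, w = img_a.shape
    ch, cw = int(h * ratio), int(w * ratio)
    y = torch.randint(0, h - ch, (1,)).item()
    x = torch.randint(0, w - cw, (1,)).item()
    mixed = img_a.clone()
    mixed[:, :, y:y + ch, x:x + cw] = img_b[:, :, y:y + ch, x:x + cw]
    return mixed

def consistency_loss(logits_strong, logits_weak, conf_thresh=0.95):
    """Pseudo-labels from the weak view supervise the strong views,
    masked to confident pixels only. Logits: (B, C, H, W)."""
    probs = logits_weak.softmax(dim=1).detach()
    conf, pseudo = probs.max(dim=1)                         # (B, H, W)
    loss = F.cross_entropy(logits_strong, pseudo, reduction="none")
    return (loss * (conf >= conf_thresh)).mean()
```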
[208] EASLT: Emotion-Aware Sign Language Translation
Guobin Tu, Di Weng
Main category: cs.CV
TL;DR: EASLT is an emotion-aware sign language translation framework that treats facial affect as a semantic anchor to resolve ambiguities when distinct concepts share identical manual gestures, achieving state-of-the-art performance on benchmark datasets.
Details
Motivation: Current gloss-free SLT methods focus on manual signals but overlook the semantic importance of facial expressions, leading to ambiguity when different concepts have identical manual articulations. Facial affect is crucial for disambiguation but is often treated as auxiliary rather than central.
Method: EASLT introduces an emotional encoder to capture continuous affective dynamics from facial expressions. These emotional representations are integrated via an Emotion-Aware Fusion (EAF) module that adaptively recalibrates spatio-temporal sign features based on affective context to resolve semantic ambiguities.
Result: On PHOENIX14T and CSL-Daily benchmarks, EASLT achieves BLEU-4 scores of 26.15 and 22.80, and BLEURT scores of 61.0 and 57.8 respectively, establishing advanced performance among gloss-free methods. Ablation studies confirm that explicit emotion modeling effectively decouples affective semantics from manual dynamics.
Conclusion: Treating facial affect as a semantic anchor rather than auxiliary information significantly enhances sign language translation fidelity. Explicit emotion modeling resolves ambiguities in gloss-free SLT, demonstrating that emotional context is crucial for accurate cross-modal translation.
Abstract: Sign Language Translation (SLT) is a complex cross-modal task requiring the integration of Manual Signals (MS) and Non-Manual Signals (NMS). While recent gloss-free SLT methods have made strides in translating manual gestures, they frequently overlook the semantic criticality of facial expressions, resulting in ambiguity when distinct concepts share identical manual articulations. To address this, we present EASLT (Emotion-Aware Sign Language Translation), a framework that treats facial affect not as auxiliary information, but as a robust semantic anchor. Unlike methods that relegate facial expressions to a secondary role, EASLT incorporates a dedicated emotional encoder to capture continuous affective dynamics. These representations are integrated via a novel Emotion-Aware Fusion (EAF) module, which adaptively recalibrates spatio-temporal sign features based on affective context to resolve semantic ambiguities. Extensive evaluations on the PHOENIX14T and CSL-Daily benchmarks demonstrate that EASLT establishes advanced performance among gloss-free methods, achieving BLEU-4 scores of 26.15 and 22.80, and BLEURT scores of 61.0 and 57.8, respectively. Ablation studies confirm that explicitly modeling emotion effectively decouples affective semantics from manual dynamics, significantly enhancing translation fidelity. Code is available at https://github.com/TuGuobin/EASLT.
[209] SpatiaLoc: Leveraging Multi-Level Spatial Enhanced Descriptors for Cross-Modal Localization
Tianyi Shang, Pengjie Xu, Zhaojun Deng, Zhenyu Li, Zhicong Chen, Lijun Wu
Main category: cs.CV
TL;DR: SpatiaLoc is a cross-modal localization framework that uses text descriptions and point clouds, employing a coarse-to-fine strategy with instance-level Bezier curve modeling and global-level frequency encoding, achieving state-of-the-art performance on KITTI360Pose.
Details
Motivation: Cross-modal localization using text and point clouds enables robots to localize via natural language descriptions, crucial for autonomous navigation and human-robot interaction. Since objects often recur across modalities, spatial relationships become the most discriminative cues for accurate localization.
Method: SpatiaLoc uses a coarse-to-fine strategy: 1) Coarse stage: BEOSE models instance-level spatial relationships using quadratic Bezier curves, while FAE generates global-level spatial representations in the frequency domain. 2) Fine stage: UGFL regresses 2D positions by modeling predictions as Gaussian distributions with an uncertainty-aware loss function.
Result: Extensive experiments on KITTI360Pose demonstrate that SpatiaLoc significantly outperforms existing state-of-the-art methods in cross-modal localization performance.
Conclusion: SpatiaLoc effectively leverages spatial relationships at both instance and global levels through a coarse-to-fine framework, achieving superior cross-modal localization performance by combining Bezier curve modeling, frequency domain encoding, and uncertainty-aware regression.
Abstract: Cross-modal localization using text and point clouds enables robots to localize themselves via natural language descriptions, with applications in autonomous navigation and interaction between humans and robots. In this task, objects often recur across text and point clouds, making spatial relationships the most discriminative cues for localization. Given this characteristic, we present SpatiaLoc, a framework utilizing a coarse-to-fine strategy that emphasizes spatial relationships at both the instance and global levels. In the coarse stage, we introduce a Bezier Enhanced Object Spatial Encoder (BEOSE) that models spatial relationships at the instance level using quadratic Bezier curves. Additionally, a Frequency Aware Encoder (FAE) generates spatial representations in the frequency domain at the global level. In the fine stage, an Uncertainty Aware Gaussian Fine Localizer (UGFL) regresses 2D positions by modeling predictions as Gaussian distributions with a loss function aware of uncertainty. Extensive experiments on KITTI360Pose demonstrate that SpatiaLoc significantly outperforms existing state-of-the-art (SOTA) methods.
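The uncertainty-aware regression in the fine stage can be pictured with a standard Gaussian negative log-likelihood: the localizer predicts a mean and log-variance per coordinate, so confidently wrong predictions are penalized while honest uncertainty down-weights the squared error. A minimal sketch, assuming axis-aligned Gaussians (UGFL's exact parameterization may differ):

```python
import torch

def gaussian_nll(mu, log_var, target):
    # mu, log_var, target: (B, 2) predicted mean/log-variance and GT x-y.
    # The log-variance term keeps the model from inflating uncertainty
    # everywhere just to shrink the error term.
    return (0.5 * log_var + 0.5 * (target - mu) ** 2 / log_var.exp()).mean()

mu, log_var, gt = torch.randn(4, 2), torch.zeros(4, 2), torch.randn(4, 2)
print(gaussian_nll(mu, log_var, gt))
```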
[210] Detecting AI-Generated Images via Distributional Deviations from Real Images
Yakun Niu, Yingjian Chen, Lei Zhang
Main category: cs.CV
TL;DR: Proposes MPFT strategy with Texture-Aware Masking to fine-tune CLIP-ViT for AI-generated image detection, achieving state-of-the-art generalization performance with minimal training data.
Details
Motivation: As AI-generated images become more realistic, detecting them is crucial to combat misinformation. Existing methods using frozen CLIP models don't fully exploit the encoder's potential and lack generalization to unseen generative models.
Method: Masking-based Pre-trained model Fine-Tuning (MPFT) with Texture-Aware Masking (TAM) that masks textured areas containing generative model-specific patterns during fine-tuning, forcing CLIP-ViT to focus on “distributional deviations” from real images.
Result: Achieves 98.2% and 94.6% average accuracy on GenImage and UniversalFakeDetect datasets respectively, significantly outperforming existing methods with minimal training images.
Conclusion: The proposed MPFT strategy effectively enhances CLIP-ViT’s ability to detect AI-generated images by focusing on distributional deviations, achieving superior generalization performance with limited training data.
Abstract: The rapid advancement of generative models has significantly enhanced the quality of AI-generated images, raising concerns about misinformation and the erosion of public trust. Detecting AI-generated images has thus become a critical challenge, particularly in terms of generalizing to unseen generative models. Existing methods using frozen pre-trained CLIP models show promise in generalization but treat the image encoder as a basic feature extractor, failing to fully exploit its potential. In this paper, we perform an in-depth analysis of the frozen CLIP image encoder (CLIP-ViT), revealing that it effectively clusters real images in a high-level, abstract feature space. However, it does not truly possess the ability to distinguish between real and AI-generated images. Based on this analysis, we propose a Masking-based Pre-trained model Fine-Tuning (MPFT) strategy, which introduces a Texture-Aware Masking (TAM) mechanism to mask textured areas containing generative model-specific patterns during fine-tuning. This approach compels CLIP-ViT to attend to the “distributional deviations” from authentic images for AI-generated image detection, thereby achieving enhanced generalization performance. Extensive experiments on the GenImage and UniversalFakeDetect datasets demonstrate that our method, fine-tuned with only a minimal number of images, significantly outperforms existing approaches, achieving up to 98.2% and 94.6% average accuracy on the two datasets, respectively.
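A rough sketch of what texture-aware masking could look like, using per-patch pixel variance as the texture score and a fixed mask ratio; both are illustrative stand-ins for the paper's TAM mechanism:

```python
import torch

def texture_aware_mask(img, patch=16, mask_ratio=0.5):
    # img: (B, 3, H, W). Score each patch by pixel variance (a crude texture
    # proxy) and mask the most textured ones, where generator-specific
    # artifacts are assumed to concentrate.
    B, C, H, W = img.shape
    patches = img.unfold(2, patch, patch).unfold(3, patch, patch)
    var = patches.flatten(-2).var(dim=-1).mean(dim=1)   # (B, H/p, W/p)
    scores = var.flatten(1)
    k = int(scores.shape[1] * mask_ratio)
    idx = scores.topk(k, dim=1).indices                 # most textured patches
    mask = torch.zeros_like(scores, dtype=torch.bool).scatter_(1, idx, True)
    return mask.view(B, H // patch, W // patch)

m = texture_aware_mask(torch.rand(2, 3, 224, 224))
print(m.shape, m.sum(dim=(1, 2)))  # (2, 14, 14); 98 of 196 patches masked
```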
[211] Can LLMs See Without Pixels? Benchmarking Spatial Intelligence from Textual Descriptions
Zhongbin Guo, Zhen Yang, Yushan Li, Xinyue Zhang, Wenyu Gao, Jiacheng Wang, Chengzhi Li, Xiangrui Liu, Ping Jian
Main category: cs.CV
TL;DR: SiT-Bench is a novel benchmark for evaluating spatial intelligence in LLMs without visual input, using textual scene descriptions to test symbolic reasoning across 3,800 items covering navigation, perspective transformation, and robotic manipulation tasks.
Details
Motivation: To investigate whether spatial understanding originates from visual encoders or reasoning backbones in VLMs, and to evaluate LLMs' spatial intelligence capabilities without relying on pixel-level visual input.
Method: Created SiT-Bench with 3,800 expert-annotated items across 5 categories and 17 subtasks. Converted single/multi-view scenes into high-fidelity, coordinate-aware textual descriptions to test LLMs’ symbolic textual reasoning rather than visual pattern matching.
Result: SOTA LLMs show proficiency in localized semantic tasks but have a significant “spatial gap” in global consistency. Explicit spatial reasoning significantly boosts performance, suggesting LLMs possess latent world-modeling potential.
Conclusion: SiT-Bench serves as a foundational resource to develop spatially-grounded LLM backbones for future VLMs and embodied agents, demonstrating that spatial intelligence can be evaluated through textual reasoning without visual input.
Abstract: Recent advancements in Spatial Intelligence (SI) have predominantly relied on Vision-Language Models (VLMs), yet a critical question remains: does spatial understanding originate from visual encoders or the fundamental reasoning backbone? Inspired by this question, we introduce SiT-Bench, a novel benchmark designed to evaluate the SI performance of Large Language Models (LLMs) without pixel-level input, comprising over 3,800 expert-annotated items across five primary categories and 17 subtasks, ranging from egocentric navigation and perspective transformation to fine-grained robotic manipulation. By converting single/multi-view scenes into high-fidelity, coordinate-aware textual descriptions, we challenge LLMs to perform symbolic textual reasoning rather than visual pattern matching. Evaluation results of state-of-the-art (SOTA) LLMs reveal that while models achieve proficiency in localized semantic tasks, a significant “spatial gap” remains in global consistency. Notably, we find that explicit spatial reasoning significantly boosts performance, suggesting that LLMs possess latent world-modeling potential. Our proposed dataset SiT-Bench serves as a foundational resource to foster the development of spatially-grounded LLM backbones for future VLMs and embodied agents. Our code and benchmark will be released at https://github.com/binisalegend/SiT-Bench.
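For intuition, here is a toy example of turning object coordinates into the kind of coordinate-aware textual description a pixel-free LLM would receive; the phrasing and coordinate convention are assumptions, not the benchmark's actual format:

```python
scene = [
    {"name": "chair", "xyz": (1.2, 0.0, 3.5)},
    {"name": "table", "xyz": (-0.4, 0.0, 2.1)},
]

def describe(scene):
    # One sentence per object, with explicit metric coordinates so the LLM
    # can reason symbolically about layout instead of matching pixels.
    lines = [f"A {obj['name']} is at (x={obj['xyz'][0]:+.1f} m, "
             f"y={obj['xyz'][1]:+.1f} m, z={obj['xyz'][2]:+.1f} m) "
             "relative to the viewer." for obj in scene]
    return " ".join(lines)

print(describe(scene))
```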
[212] Adaptive Attention Distillation for Robust Few-Shot Segmentation under Environmental Perturbations
Qianyu Guo, Jingrong Wu, Jieji Ren, Weifeng Ge, Wenqiang Zhang
Main category: cs.CV
TL;DR: Proposes Adaptive Attention Distillation (AAD) for environment-robust few-shot segmentation, addressing real-world challenges like motion blur and camouflage through attention distillation across support-query pairs.
Details
Motivation: Existing few-shot segmentation models fail in real-world scenarios due to complex environmental factors (illumination, background, viewpoint, motion blur, small objects, camouflage) that increase test difficulty beyond laboratory conditions.
Method: Introduces environment-robust FSS setting and ER-FSS benchmark with 8 datasets. Proposes Adaptive Attention Distillation (AAD) that repeatedly contrasts and distills key shared semantics between support and query images to derive class-specific attention for novel categories.
Result: AAD improves mIoU by 3.3% - 8.5% across all datasets and settings, demonstrating superior performance and strong generalization in complex environments.
Conclusion: The proposed environment-robust FSS setting and AAD method effectively enhance model robustness in realistic, dynamic conditions, bridging the gap between laboratory training and practical deployment requirements.
Abstract: Few-shot segmentation (FSS) aims to rapidly learn novel class concepts from limited examples to segment specific targets in unseen images, and has been widely applied in areas such as medical diagnosis and industrial inspection. However, existing studies largely overlook the complex environmental factors encountered in real-world scenarios, such as illumination, background, and camera viewpoint, which can substantially increase the difficulty of test images. As a result, models trained under laboratory conditions often fall short of practical deployment requirements. To bridge this gap, in this paper, an environment-robust FSS setting is introduced that explicitly incorporates challenging test cases arising from complex environments, such as motion blur, small objects, and camouflaged targets, to enhance the model’s robustness under realistic, dynamic conditions. An environment-robust FSS benchmark (ER-FSS) is established, covering eight datasets across multiple real-world scenarios. In addition, an Adaptive Attention Distillation (AAD) method is proposed, which repeatedly contrasts and distills key shared semantics between known (support) and unknown (query) images to derive class-specific attention for novel categories. This strengthens the model’s ability to focus on the correct targets in complex environments, thereby improving environmental robustness. Comparative experiments show that AAD improves mIoU by 3.3% - 8.5% across all datasets and settings, demonstrating superior performance and strong generalization. The source code and dataset are available at: https://github.com/guoqianyu-alberta/Adaptive-Attention-Distillation-for-FSS.
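A minimal sketch of deriving class-specific attention by contrasting support and query features, assuming a masked-average support prototype and cosine similarity; the actual contrast-and-distill procedure in AAD is more elaborate than this:

```python
import torch
import torch.nn.functional as F

def class_attention(support_feats, support_mask, query_feats):
    # support_feats: (C, Ns) flattened support features; support_mask: (Ns,)
    # binary foreground mask; query_feats: (C, Nq). Compare each query
    # location against the masked support prototype, then normalize the
    # similarities into an attention map over query positions.
    proto = (support_feats * support_mask).sum(1) / support_mask.sum()  # (C,)
    sim = F.cosine_similarity(query_feats, proto[:, None], dim=0)       # (Nq,)
    return sim.softmax(dim=0)

attn = class_attention(torch.randn(256, 64 * 64),
                       (torch.rand(64 * 64) > 0.8).float(),
                       torch.randn(256, 64 * 64))
print(attn.shape)  # torch.Size([4096])
```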
[213] Unveiling Text in Challenging Stone Inscriptions: A Character-Context-Aware Patching Strategy for Binarization
Pratyush Jena, Amal Joseph, Arnav Sharma, Ravi Kiran Sarvadevabhatla
Main category: cs.CV
TL;DR: A novel adaptive patching strategy combined with Attention U-Net for binarizing challenging stone inscription images, showing strong generalization across scripts.
Details
Motivation: Stone inscription images are extremely challenging for binarization due to poor contrast, surface degradation, artifacts, and variable text layouts, causing existing methods to fail in isolating coherent character regions.
Method: Proposes a robust adaptive patching strategy for binarizing Indic inscriptions, using patches to train an Attention U-Net. The attention mechanism focuses on subtle structural cues, while dynamic sampling and patch selection help overcome surface noise and layout irregularities.
Result: The novel patching mechanism significantly boosts binarization performance across classical and deep learning baselines. The model trained only on a single-script Indic dataset shows strong zero-shot generalization to other Indic and non-Indic scripts.
Conclusion: The method produces clean, structured representations of inscription content, laying foundation for downstream tasks like script identification, OCR, and historical text analysis, demonstrating robustness and script-agnostic generalization capabilities.
Abstract: Binarization is a popular first step towards text extraction in historical artifacts. Stone inscription images pose severe challenges for binarization due to poor contrast between etched characters and the stone background, non-uniform surface degradation, distracting artifacts, and highly variable text density and layouts. These conditions frequently cause existing binarization techniques to fail and struggle to isolate coherent character regions. Many approaches sub-divide the image into patches to increase text-fragment resolution and improve binarization performance. With this in mind, we present a robust and adaptive patching strategy to binarize challenging Indic inscriptions. The patches from our approach are used to train an Attention U-Net for binarization. The attention mechanism allows the model to focus on subtle structural cues, while our dynamic sampling and patch selection method ensures that the model learns to overcome surface noise and layout irregularities. We also introduce a carefully annotated, pixel-precise dataset of Indic stone inscriptions at the character-fragment level. We demonstrate that our novel patching mechanism significantly boosts binarization performance across classical and deep learning baselines. Despite training only on a single-script Indic dataset, our model exhibits strong zero-shot generalization to other Indic and non-Indic scripts, highlighting its robustness and script-agnostic generalization capabilities. By producing clean, structured representations of inscription content, our method lays the foundation for downstream tasks such as script identification, OCR, and historical text analysis. Project page: https://ihdia.iiit.ac.in/shilalekhya-binarization/
[214] Systematic Evaluation of Depth Backbones and Semantic Cues for Monocular Pseudo-LiDAR 3D Detection
Samson Oseiwe Ajadalu
Main category: cs.CV
TL;DR: Monocular 3D detection study shows depth backbone choice (NeWCRFs vs Depth Anything V2) and geometric fidelity matter most, with appearance/semantic features providing only marginal gains.
Details
Motivation: Monocular 3D object detection is cheaper than LiDAR but less accurate due to depth estimation challenges. The paper aims to systematically evaluate how depth backbones and feature engineering affect monocular Pseudo-LiDAR pipelines.
Method: Systematic evaluation comparing NeWCRFs (supervised metric depth) against Depth Anything V2 Metric-Outdoor (Base) in identical pseudo-LiDAR generation and PointRCNN detection protocol on KITTI validation. Tested point-cloud augmentations using appearance cues (grayscale intensity) and semantic cues (instance segmentation confidence).
Result: NeWCRFs outperforms Depth Anything V2, achieving 10.50% AP3D at IoU=0.7 on Moderate split. Appearance/semantic features provide only marginal gains, and mask-based sampling can degrade performance by removing contextual geometry. Depth backbone choice and geometric fidelity dominate performance over secondary feature injection.
Conclusion: Under off-the-shelf LiDAR detectors, depth-backbone choice and geometric fidelity are the primary factors affecting monocular 3D detection performance, outweighing secondary feature injection like appearance or semantic cues.
Abstract: Monocular 3D object detection offers a low-cost alternative to LiDAR, yet remains less accurate due to the difficulty of estimating metric depth from a single image. We systematically evaluate how depth backbones and feature engineering affect a monocular Pseudo-LiDAR pipeline on the KITTI validation split. Specifically, we compare NeWCRFs (supervised metric depth) against Depth Anything V2 Metric-Outdoor (Base) under an identical pseudo-LiDAR generation and PointRCNN detection protocol. NeWCRFs yields stronger downstream 3D detection, achieving 10.50% AP3D at IoU=0.7 on the Moderate split using grayscale intensity (Exp. 2). We further test point-cloud augmentations using appearance cues (grayscale intensity) and semantic cues (instance segmentation confidence). Contrary to the expectation that semantics would substantially close the gap, these features provide only marginal gains, and mask-based sampling can degrade performance by removing contextual geometry. Finally, we report a depth-accuracy-versus-distance diagnostic using ground-truth 2D boxes (including Ped/Cyc), highlighting that coarse depth correctness does not fully predict strict 3D IoU. Overall, under an off-the-shelf LiDAR detector, depth-backbone choice and geometric fidelity dominate performance, outweighing secondary feature injection.
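The pseudo-LiDAR step shared by both pipelines is the standard pinhole back-projection of a depth map into a 3D point cloud; a self-contained sketch with illustrative KITTI-like intrinsics (the actual calibration comes from the dataset):

```python
import numpy as np

def depth_to_pseudo_lidar(depth, fx, fy, cx, cy):
    # depth: (H, W) metric depth map. Back-project every pixel with the
    # pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy, Z = depth.
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    X = (u - cx) * depth / fx
    Y = (v - cy) * depth / fy
    return np.stack([X, Y, depth], axis=-1).reshape(-1, 3)  # (H*W, 3)

pts = depth_to_pseudo_lidar(np.random.uniform(1, 80, (375, 1242)),
                            fx=721.5, fy=721.5, cx=609.6, cy=172.9)
print(pts.shape)  # (465750, 3), ready for a PointRCNN-style detector
```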
[215] Shape Classification using Approximately Convex Segment Features
Bimal Kumar Ray
Main category: cs.CV
TL;DR: Proposes a novel object classification method that eliminates the need for object alignment by sorting features from normalized boundary segments.
Details
Motivation: Existing object classification techniques require object alignment to compute similarity, which can be computationally expensive and complex. The paper aims to develop a method that doesn't require alignment while maintaining classification accuracy.
Method: 1. Normalize object boundaries and segment them into approximately convex segments. 2. Sort segments in descending order of length. 3. Extract a bag of features from each segment: length, number of extreme points, area, base, and width. 4. Use these sorted features to measure similarity between image boundaries for classification.
Result: The proposed method was tested on datasets and achieved acceptable classification results, demonstrating that object alignment is not necessary for effective object classification.
Conclusion: The paper successfully demonstrates a novel approach to object classification that eliminates the need for object alignment through feature sorting, providing a simpler yet effective alternative to existing alignment-dependent methods.
Abstract: The existing object classification techniques based on descriptive features rely on object alignment to compute the similarity of objects for classification. This paper removes the need for object alignment through sorting of features. The object boundary is normalized and segmented into approximately convex segments, and the segments are then sorted in descending order of their length. The segment length, the number of extreme points in each segment, the segment area, and the base and width of each segment (a bag of features) are used to measure the similarity between image boundaries. The proposed method is tested on datasets and acceptable results are observed.
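A toy sketch of the alignment-free comparison: per-segment feature vectors are sorted by descending segment length so two boundaries can be compared position-by-position; the L1 distance below is an assumed simplification of the paper's similarity measure:

```python
import numpy as np

def segment_features(segments):
    # segments: list of per-segment feature dicts (the paper's bag of
    # features). Sorting by length replaces alignment: both boundaries end
    # up in the same canonical order before comparison.
    feats = np.array([[s["length"], s["n_extreme"], s["area"],
                       s["base"], s["width"]] for s in segments])
    return feats[np.argsort(-feats[:, 0])]

def distance(fa, fb):
    # Smaller is more similar; truncate to the shorter segment list.
    n = min(len(fa), len(fb))
    return float(np.abs(fa[:n] - fb[:n]).sum())

a = [{"length": 3.0, "n_extreme": 2, "area": 1.1, "base": 0.9, "width": 0.5},
     {"length": 5.0, "n_extreme": 3, "area": 2.0, "base": 1.5, "width": 0.8}]
print(distance(segment_features(a), segment_features(a)))  # 0.0
```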
[216] MFC-RFNet: A Multi-scale Guided Rectified Flow Network for Radar Sequence Prediction
Wenjie Luo, Chuanhu Deng, Chaorong Li, Rongyao Deng, Qiang Yang
Main category: cs.CV
TL;DR: MFC-RFNet: A generative framework for radar precipitation nowcasting that integrates multi-scale feature communication with rectified flow training, wavelet-guided skip connections, and spatial alignment to improve accuracy and resolution.
Details
Motivation: Accurate high-resolution precipitation nowcasting from radar echo sequences is crucial for disaster mitigation and economic planning, but faces challenges in modeling complex multi-scale evolution, correcting inter-frame feature misalignment from displacement, and efficiently capturing long-range spatiotemporal context without sacrificing spatial fidelity.
Method: Proposes MFC-RFNet with: 1) Wavelet-Guided Skip Connection (WGSC) to preserve high-frequency components, 2) Feature Communication Module (FCM) for bidirectional cross-scale interaction, 3) Condition-Guided Spatial Transform Fusion (CGSTF) to align shallow features using spatial transforms from conditioning echoes, 4) Rectified flow training for near-linear probability-flow trajectories enabling few-step sampling, and 5) Lightweight Vision-RWKV blocks at key network positions to capture long-range dependencies efficiently.
Result: Evaluations on four public datasets (SEVIR, MeteoNet, Shanghai, and CIKM) show consistent improvements over strong baselines, yielding clearer echo morphology at higher rain-rate thresholds and sustained skill at longer lead times.
Conclusion: The synergy of rectified flow training with scale-aware communication, spatial alignment, and frequency-aware fusion presents an effective and robust approach for radar-based nowcasting, addressing key challenges in precipitation forecasting.
Abstract: Accurate and high-resolution precipitation nowcasting from radar echo sequences is crucial for disaster mitigation and economic planning, yet it remains a significant challenge. Key difficulties include modeling complex multi-scale evolution, correcting inter-frame feature misalignment caused by displacement, and efficiently capturing long-range spatiotemporal context without sacrificing spatial fidelity. To address these issues, we present the Multi-scale Feature Communication Rectified Flow (RF) Network (MFC-RFNet), a generative framework that integrates multi-scale communication with guided feature fusion. To enhance multi-scale fusion while retaining fine detail, a Wavelet-Guided Skip Connection (WGSC) preserves high-frequency components, and a Feature Communication Module (FCM) promotes bidirectional cross-scale interaction. To correct inter-frame displacement, a Condition-Guided Spatial Transform Fusion (CGSTF) learns spatial transforms from conditioning echoes to align shallow features. The backbone adopts rectified flow training to learn near-linear probability-flow trajectories, enabling few-step sampling with stable fidelity. Additionally, lightweight Vision-RWKV (RWKV) blocks are placed at the encoder tail, the bottleneck, and the first decoder layer to capture long-range spatiotemporal dependencies at low spatial resolutions with moderate compute. Evaluations on four public datasets (SEVIR, MeteoNet, Shanghai, and CIKM) demonstrate consistent improvements over strong baselines, yielding clearer echo morphology at higher rain-rate thresholds and sustained skill at longer lead times. These results suggest that the proposed synergy of RF training with scale-aware communication, spatial alignment, and frequency-aware fusion presents an effective and robust approach for radar-based nowcasting.
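The rectified flow objective at the core of the backbone is compact enough to sketch: sample a point on the straight noise-to-data path and regress the constant velocity. The toy lambda stands in for the MFC-RFNet denoiser, and the conditioning handling is an assumption:

```python
import torch

def rectified_flow_loss(model, x1, cond):
    # x1: clean radar frames; cond: conditioning echoes. Train the network
    # to predict the constant velocity (x1 - x0) along the straight path
    # x_t = (1 - t) * x0 + t * x1, which is what enables few-step sampling.
    x0 = torch.randn_like(x1)                            # noise endpoint
    t = torch.rand(x1.shape[0], 1, 1, 1, device=x1.device)
    xt = (1 - t) * x0 + t * x1                           # linear interpolation
    v_pred = model(xt, t.flatten(), cond)
    return ((v_pred - (x1 - x0)) ** 2).mean()

model = lambda xt, t, cond: xt * 0   # toy stand-in for the real backbone
print(rectified_flow_loss(model, torch.randn(2, 1, 64, 64), None))
```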
[217] CrackSegFlow: Controllable Flow-Matching Synthesis for Generalizable Crack Segmentation with the CSF-50K Benchmark
Babak Asadi, Peiyang Wu, Mani Golparvar-Fard, Ramez Hajj
Main category: cs.CV
TL;DR: CrackSegFlow is a controllable flow-matching synthesis framework that generates photorealistic crack images from binary masks to address scarce pixel-level labels and domain shift in automated crack segmentation.
Details
Motivation: Practical deployment of automated crack segmentation is limited by scarce pixel-level labels and severe domain shift across sensors, illumination, textures, and annotation conventions.
Method: Uses controllable flow-matching synthesis with topology-preserving mask injection and boundary-gated modulation to generate photorealistic crack images conditioned on binary masks. Includes a second class-conditional flow-matching model for crack mask synthesis with explicit control over crack coverage, and injects crack masks into crack-free backgrounds to diversify illumination and reduce false positives.
Result: Improves in-domain performance by 5.37 mIoU and 5.13 F1 on average, and target-guided cross-domain synthesis yields average gains of 13.12 mIoU and 14.82 F1. Provides faster deterministic sampling than diffusion-based methods with better fidelity and mask-image alignment for thin-structure crack geometry.
Conclusion: CrackSegFlow effectively addresses label scarcity and domain shift in crack segmentation through controllable synthesis, releasing CSF-50K dataset of 50,000 paired crack images and masks for large-scale benchmarking.
Abstract: Automated crack segmentation is essential for scalable condition assessment of pavements and civil infrastructure, yet practical deployment is limited by scarce pixel-level labels and severe domain shift across sensors, illumination, textures, and annotation conventions. This paper presents CrackSegFlow, a controllable flow-matching synthesis framework that generates photorealistic crack images conditioned on binary masks while preserving strict mask-image alignment. The generator combines topology-preserving mask injection with boundary-gated modulation to maintain thin-structure continuity and suppress texture-driven false positives. A second class-conditional flow-matching model synthesizes crack masks with explicit control over crack coverage, enabling balanced, topology-diverse paired data without additional manual annotation. We further inject crack masks into crack-free backgrounds to diversify illumination and surface artifacts and reduce false positives caused by shadows, joints, and pavement markings. Experiments on five benchmarks spanning four asphalt datasets and the crack class of a concrete-domain dataset demonstrate consistent improvements under an established hybrid CNN–Transformer segmentation backbone and a fixed training protocol. With real plus synthesized pairs, in-domain performance improves on average by 5.37 mIoU and 5.13 F1, and target-guided cross-domain synthesis yields average gains of 13.12 mIoU and 14.82 F1 using only limited target mask statistics. Compared with diffusion-based semantic synthesis, CrackSegFlow provides substantially faster deterministic sampling and improves fidelity and mask-image alignment for thin-structure crack geometry. Finally, we release CSF-50K, a public dataset of 50,000 paired crack images and pixel-accurate masks for large-scale benchmarking of generalizable crack segmentation.
[218] VideoMemory: Toward Consistent Video Generation via Memory Integration
Jinsong Zhou, Yihua Du, Xinli Xu, Luozhou Wang, Zijie Zhuang, Yehang Zhang, Shuaibo Li, Xiaojun Hu, Bolan Su, Ying-cong Chen
Main category: cs.CV
TL;DR: VideoMemory is an entity-centric framework for consistent narrative video generation using a Dynamic Memory Bank to maintain character/prop/background identity across shots.
Details
Motivation: Existing video generation models fail to preserve entity identity and appearance across scene changes and temporal gaps, making consistent long-form narrative video generation challenging.
Method: Uses a multi-agent system to decompose scripts into shots, retrieves entity representations from a Dynamic Memory Bank (storing visual/semantic descriptors), and synthesizes videos conditioned on retrieved states with memory updates after each shot.
Result: Achieves strong entity-level coherence and high perceptual quality across diverse narrative sequences, evaluated on a 54-case multi-shot consistency benchmark covering character-, prop-, and background-persistent scenarios.
Conclusion: VideoMemory’s retrieval-update mechanism enables consistent entity portrayal across distant shots and supports coherent long-form video generation through explicit memory management.
Abstract: Maintaining consistent characters, props, and environments across multiple shots is a central challenge in narrative video generation. Existing models can produce high-quality short clips but often fail to preserve entity identity and appearance when scenes change or when entities reappear after long temporal gaps. We present VideoMemory, an entity-centric framework that integrates narrative planning with visual generation through a Dynamic Memory Bank. Given a structured script, a multi-agent system decomposes the narrative into shots, retrieves entity representations from memory, and synthesizes keyframes and videos conditioned on these retrieved states. The Dynamic Memory Bank stores explicit visual and semantic descriptors for characters, props, and backgrounds, and is updated after each shot to reflect story-driven changes while preserving identity. This retrieval-update mechanism enables consistent portrayal of entities across distant shots and supports coherent long-form generation. To evaluate this setting, we construct a 54-case multi-shot consistency benchmark covering character-, prop-, and background-persistent scenarios. Extensive experiments show that VideoMemory achieves strong entity-level coherence and high perceptual quality across diverse narrative sequences.
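The retrieve-update loop can be pictured with a simple entity store; the entry fields and replace-on-update rule below are illustrative assumptions about the Dynamic Memory Bank, not its actual schema:

```python
class DynamicMemoryBank:
    """Toy entity memory: retrieve before a shot, update after it."""
    def __init__(self):
        self.entities = {}  # name -> {"visual": ..., "semantic": ...}

    def retrieve(self, names):
        # Fetch current descriptors for the entities appearing in a shot.
        return {n: self.entities[n] for n in names if n in self.entities}

    def update(self, name, visual, semantic):
        # After rendering a shot, refresh the entity's state so identity
        # persists while story-driven changes carry over to later shots.
        self.entities[name] = {"visual": visual, "semantic": semantic}

bank = DynamicMemoryBank()
bank.update("hero", visual="emb_0", semantic="red coat, short hair")
print(bank.retrieve(["hero", "dragon"]))  # only known entities come back
```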
[219] MGPC: Multimodal Network for Generalizable Point Cloud Completion With Modality Dropout and Progressive Decoding
Jiangyuan Liu, Hongxuan Ma, Yuhao Zhao, Zhe Liu, Jian Wang, Wei Zou
Main category: cs.CV
TL;DR: MGPC is a multimodal point cloud completion framework that integrates point clouds, RGB images, and text to improve generalization to novel objects and real-world scenarios.
Details
Motivation: Existing point cloud completion methods (3D CNN-based, point-based, Transformer-based) have limitations in modality, scalability, and generative capacity, making generalization to novel objects and real-world scenarios challenging.
Method: Proposes MGPC with: 1) modality dropout strategy for robustness, 2) Transformer-based fusion module for scalability, 3) progressive generator for geometric modeling, and 4) automatic data generation pipeline creating MGPC-1M benchmark with 1,000+ categories and 1M training pairs.
Result: Extensive experiments on MGPC-1M and in-the-wild data show consistent outperformance over prior baselines and strong generalization under real-world conditions.
Conclusion: MGPC provides a generalizable multimodal framework that effectively addresses limitations of existing methods and demonstrates superior performance and generalization capability for point cloud completion.
Abstract: Point cloud completion aims to recover complete 3D geometry from partial observations caused by limited viewpoints and occlusions. Existing learning-based works, including 3D Convolutional Neural Network (CNN)-based, point-based, and Transformer-based methods, have achieved strong performance on synthetic benchmarks. However, due to the limitations of modality, scalability, and generative capacity, their generalization to novel objects and real-world scenarios remains challenging. In this paper, we propose MGPC, a generalizable multimodal point cloud completion framework that integrates point clouds, RGB images, and text within a unified architecture. MGPC introduces an innovative modality dropout strategy, a Transformer-based fusion module, and a novel progressive generator to improve robustness, scalability, and geometric modeling capability. We further develop an automatic data generation pipeline and construct MGPC-1M, a large-scale benchmark with over 1,000 categories and one million training pairs. Extensive experiments on MGPC-1M and in-the-wild data demonstrate that the proposed method consistently outperforms prior baselines and exhibits strong generalization under real-world conditions.
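A minimal sketch of modality dropout, assuming the partial point cloud is always kept (it defines the task) while RGB and text are dropped independently with probability p; the exact recipe is not spelled out in the abstract:

```python
import random
import torch

def modality_dropout(points, image, text, p=0.3):
    # Randomly blank out the RGB or text input so the fusion module learns
    # to complete shapes even when a modality is missing at test time.
    if random.random() < p:
        image = torch.zeros_like(image)   # dropped image modality
    if random.random() < p:
        text = ""                         # dropped text modality
    return points, image, text

pts, img, txt = modality_dropout(torch.randn(2048, 3),
                                 torch.randn(3, 224, 224), "a wooden chair")
```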
[220] PhysVideoGenerator: Towards Physically Aware Video Generation via Latent Physics Guidance
Siddarth Nilol Kundur Satish, Devesh Jaiswal, Hongyu Chen, Abhishek Bakshi
Main category: cs.CV
TL;DR: PhysVideoGenerator embeds learnable physics priors into video generation to address physical realism issues like unnatural collisions and gravity inconsistencies in current models.
Details
Motivation: Current video generation models produce aesthetically pleasing videos but struggle with realistic physics dynamics, resulting in artifacts like unnatural object collisions, inconsistent gravity, and temporal flickering.
Method: Proposes PhysVideoGenerator with a lightweight predictor network (PredictorP) that regresses high-level physical features from pre-trained V-JEPA 2 directly from noisy diffusion latents. These physics tokens are injected into temporal attention layers of a DiT-based generator (Latte) via cross-attention.
Result: Demonstrates technical feasibility: diffusion latents contain sufficient information to recover V-JEPA 2 physical representations, and multi-task optimization remains stable during training. Establishes foundation for future large-scale physics-aware generative models.
Conclusion: The framework successfully integrates physics priors into video generation, addressing physical realism issues while maintaining training stability, paving the way for more physically accurate video generation models.
Abstract: Current video generation models produce high-quality aesthetic videos but often struggle to learn representations of real-world physics dynamics, resulting in artifacts such as unnatural object collisions, inconsistent gravity, and temporal flickering. In this work, we propose PhysVideoGenerator, a proof-of-concept framework that explicitly embeds a learnable physics prior into the video generation process. We introduce a lightweight predictor network, PredictorP, which regresses high-level physical features extracted from a pre-trained Video Joint Embedding Predictive Architecture (V-JEPA 2) directly from noisy diffusion latents. These predicted physics tokens are injected into the temporal attention layers of a DiT-based generator (Latte) via a dedicated cross-attention mechanism. Our primary contribution is demonstrating the technical feasibility of this joint training paradigm: we show that diffusion latents contain sufficient information to recover V-JEPA 2 physical representations, and that multi-task optimization remains stable over training. This report documents the architectural design, technical challenges, and validation of training stability, establishing a foundation for future large-scale evaluation of physics-aware generative models.
[221] TRec: Egocentric Action Recognition using 2D Point Tracks
Dennis Holzmann, Sven Wachsmuth
Main category: cs.CV
TL;DR: Novel egocentric action recognition method using 2D point tracks as motion cues, achieving improved accuracy without hand/object detection by tracking random points with CoTracker and feeding trajectories to a Transformer model.
Details
Motivation: Most existing egocentric action recognition methods rely on RGB appearance, human pose estimation, or their combination, but there's potential for improvement by incorporating additional motion cues through 2D point tracking.
Method: Use CoTracker to follow randomly initialized points through video frames, then feed resulting trajectories along with image frames to a Transformer-based recognition model. The approach is surprisingly effective even with only the initial frame and its associated point tracks.
Result: Method consistently enhances performance compared to same model trained without motion information, achieving notable gains even with limited input (initial frame + point tracks only).
Conclusion: 2D point tracks serve as a lightweight yet effective representation for egocentric action understanding, offering substantial accuracy improvements without requiring hand/object detection or interaction region identification.
Abstract: We present a novel approach for egocentric action recognition that leverages 2D point tracks as an additional motion cue. While most existing methods rely on RGB appearance, human pose estimation, or their combination, our work demonstrates that tracking randomly sampled image points across video frames can substantially improve recognition accuracy. Unlike prior approaches, we do not detect hands, objects, or interaction regions. Instead, we employ CoTracker to follow a set of randomly initialized points through each video and use the resulting trajectories, together with the corresponding image frames, as input to a Transformer-based recognition model. Surprisingly, our method achieves notable gains even when only the initial frame and its associated point tracks are provided, without incorporating the full video sequence. Experimental results confirm that integrating 2D point tracks consistently enhances performance compared to the same model trained without motion information, highlighting their potential as a lightweight yet effective representation for egocentric action understanding.
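A rough sketch of how tracked 2D points could be packaged as Transformer tokens, assuming one token per trajectory; CoTracker's actual interface and the paper's token layout are not specified here and may differ:

```python
import torch

T, N, D = 32, 64, 128                       # frames, tracked points, model dim
tracks = torch.rand(T, N, 2)                # normalized (x, y) per frame
tokens = tracks.permute(1, 0, 2).reshape(N, T * 2)  # one token per point
proj = torch.nn.Linear(T * 2, D)            # embed each trajectory
encoder = torch.nn.TransformerEncoder(
    torch.nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True),
    num_layers=2)
motion_feats = encoder(proj(tokens).unsqueeze(0))   # (1, N, D)
print(motion_feats.shape)
```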
[222] BREATH-VL: Vision-Language-Guided 6-DoF Bronchoscopy Localization via Semantic-Geometric Fusion
Qingyao Tian, Bingyu Yang, Huai Liao, Xinyan Huang, Junyong Li, Dong Yi, Hongbin Liu
Main category: cs.CV
TL;DR: BREATH-VL: A hybrid vision-language framework for 6-DoF endoscopic camera localization that combines semantic understanding from VLMs with geometric registration, achieving 25.5% reduction in translational error.
Details
Motivation: Address three key challenges in applying VLMs to endoscopic localization: 1) lack of large-scale medical vision-language datasets, 2) limited fine-grained pose regression capability, and 3) high computational latency for temporal feature extraction.
Method: 1) Construct the BREATH dataset, the largest in-vivo endoscopic localization dataset; 2) Develop BREATH-VL, a hybrid framework integrating semantic cues from VLMs with geometric information from vision-based registration; 3) Introduce a lightweight context-learning mechanism using linguistic prompts for efficient temporal reasoning.
Result: Outperforms state-of-the-art vision-only methods in accuracy and generalization, reducing translational error by 25.5% compared to best baseline, while maintaining competitive computational latency. Vision-language module shows robust semantic localization in challenging surgical scenes.
Conclusion: The hybrid vision-language approach effectively combines semantic understanding from VLMs with geometric precision from registration methods, demonstrating superior performance for 6-DoF endoscopic camera localization in complex medical environments.
Abstract: Vision-language models (VLMs) have recently shown remarkable performance in navigation and localization tasks by leveraging large-scale pretraining for semantic understanding. However, applying VLMs to 6-DoF endoscopic camera localization presents several challenges: 1) the lack of large-scale, high-quality, densely annotated, and localization-oriented vision-language datasets in real-world medical settings; 2) limited capability for fine-grained pose regression; and 3) high computational latency when extracting temporal features from past frames. To address these issues, we first construct BREATH dataset, the largest in-vivo endoscopic localization dataset to date, collected in the complex human airway. Building on this dataset, we propose BREATH-VL, a hybrid framework that integrates semantic cues from VLMs with geometric information from vision-based registration methods for accurate 6-DoF pose estimation. Our motivation lies in the complementary strengths of both approaches: VLMs offer generalizable semantic understanding, while registration methods provide precise geometric alignment. To further enhance the VLM’s ability to capture temporal context, we introduce a lightweight context-learning mechanism that encodes motion history as linguistic prompts, enabling efficient temporal reasoning without expensive video-level computation. Extensive experiments demonstrate that the vision-language module delivers robust semantic localization in challenging surgical scenes. Building on this, our BREATH-VL outperforms state-of-the-art vision-only localization methods in both accuracy and generalization, reducing translational error by 25.5% compared with the best-performing baseline, while achieving competitive computational latency.
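A toy illustration of encoding motion history as a linguistic prompt, which is the gist of the context-learning mechanism; the pose format, units, and wording are assumptions:

```python
def motion_prompt(poses, keep=3):
    # poses: list of (tx, ty, tz) camera translations from past frames.
    # Summarize recent motion as text so the VLM can reason over temporal
    # context without expensive video-level feature extraction.
    steps = [f"step {i}: moved ({p[0]:+.1f}, {p[1]:+.1f}, {p[2]:+.1f}) mm"
             for i, p in enumerate(poses[-keep:])]
    return "Recent camera motion: " + "; ".join(steps) + "."

print(motion_prompt([(0.1, 0.0, 1.2), (0.0, -0.2, 1.1), (0.2, 0.1, 0.9)]))
```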
[223] CSMCIR: CoT-Enhanced Symmetric Alignment with Memory Bank for Composed Image Retrieval
Zhipeng Qian, Zihan Liang, Yufei Ma, Ben Chen, Huangyu Dai, Yiwei Ma, Jiayi Ji, Chenyi Lei, Han Li, Xiaoshuai Sun
Main category: cs.CV
TL;DR: CSMCIR is a unified framework for composed image retrieval that addresses representation space fragmentation through symmetric architecture, multimodal chain-of-thought prompting, and dynamic memory bank for negative sampling.
Details
Motivation: Existing CIR methods suffer from representation space fragmentation where queries (image+text) and targets (images) are processed by different encoders, creating misaligned representation spaces that limit retrieval performance. This manifests as three separate clusters in feature space.
Method: 1) Multi-level Chain-of-Thought (MCoT) prompting for MLLMs to generate discriminative captions for target images, establishing modal symmetry. 2) Symmetric dual-tower architecture with shared-parameter Q-Former for both query and target encoding. 3) Entropy-based, temporally dynamic Memory Bank for high-quality negative samples.
Result: Extensive experiments on four benchmark datasets show CSMCIR achieves state-of-the-art performance with superior training efficiency. Ablation studies validate each component’s effectiveness.
Conclusion: CSMCIR successfully addresses representation space fragmentation in CIR through architectural symmetry and unified representation learning, enabling efficient query-target alignment and improved retrieval performance.
Abstract: Composed Image Retrieval (CIR) enables users to search for target images using both a reference image and manipulation text, offering substantial advantages over single-modality retrieval systems. However, existing CIR methods suffer from representation space fragmentation: queries and targets comprise heterogeneous modalities and are processed by distinct encoders, forcing models to bridge misaligned representation spaces only through post-hoc alignment, which fundamentally limits retrieval performance. This architectural asymmetry manifests as three distinct, well-separated clusters in the feature space, directly demonstrating how heterogeneous modalities create fundamentally misaligned representation spaces from initialization. In this work, we propose CSMCIR, a unified representation framework that achieves efficient query-target alignment through three synergistic components. First, we introduce a Multi-level Chain-of-Thought (MCoT) prompting strategy that guides Multimodal Large Language Models to generate discriminative, semantically compatible captions for target images, establishing modal symmetry. Building upon this, we design a symmetric dual-tower architecture where both query and target sides utilize the identical shared-parameter Q-Former for cross-modal encoding, ensuring consistent feature representations and further reducing the alignment gap. Finally, this architectural symmetry enables an entropy-based, temporally dynamic Memory Bank strategy that provides high-quality negative samples while maintaining consistency with the evolving model state. Extensive experiments on four benchmark datasets demonstrate that our CSMCIR achieves state-of-the-art performance with superior training efficiency. Comprehensive ablation studies further validate the effectiveness of each proposed component.
[224] MATANet: A Multi-context Attention and Taxonomy-Aware Network for Fine-Grained Underwater Recognition of Marine Species
Donghwan Lee, Byeongjin Kim, Geunhee Kim, Hyukjin Kwon, Nahyeon Maeng, Wooju Kim
Main category: cs.CV
TL;DR: MATANet is a novel model for fine-grained marine species classification that uses multi-context environmental attention and taxonomic hierarchy encoding to improve classification accuracy.
Details
Motivation: Existing methods for marine animal classification often overlook contextual interactions from the surrounding environment and insufficiently incorporate the hierarchical structure of marine biological taxonomy, which limits their effectiveness for fine-grained classification tasks.
Method: MATANet consists of two key components: 1) Multi-Context Environmental Attention Module (MCEAM) that learns relationships between regions of interest and their surrounding environments, and 2) Hierarchical Separation-Induced Learning Module (HSLM) that encodes taxonomic hierarchy into the feature space. The model combines instance features, environmental context, and taxonomic structure.
Result: Experiments on FathomNet2025, FAIR1M, and LifeCLEF2015-Fish datasets demonstrate state-of-the-art performance in fine-grained marine species classification.
Conclusion: MATANet effectively addresses the limitations of existing methods by incorporating environmental context and taxonomic hierarchy, providing a robust solution for fine-grained marine species classification that supports ecological research and conservation efforts.
Abstract: Fine-grained classification of marine animals supports ecology, biodiversity and habitat conservation, and evidence-based policy-making. However, existing methods often overlook contextual interactions from the surrounding environment and insufficiently incorporate the hierarchical structure of marine biological taxonomy. To address these challenges, we propose MATANet (Multi-context Attention and Taxonomy-Aware Network), a novel model designed for fine-grained marine species classification. MATANet mimics expert strategies by using taxonomy and environmental context to interpret ambiguous features of underwater animals. It consists of two key components: a Multi-Context Environmental Attention Module (MCEAM), which learns relationships between regions of interest (ROIs) and their surrounding environments, and a Hierarchical Separation-Induced Learning Module (HSLM), which encodes taxonomic hierarchy into the feature space. MATANet combines instance and environmental features with taxonomic structure to enhance fine-grained classification. Experiments on the FathomNet2025, FAIR1M, and LifeCLEF2015-Fish datasets demonstrate state-of-the-art performance. The source code is available at: https://github.com/dhlee-work/fathomnet-cvpr2025-ssl
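One common way to encode a taxonomic hierarchy into training, and a plausible reading of what HSLM aims at, is a rank-weighted sum of classification losses over coarse-to-fine heads; the rank sizes and weights below are assumptions:

```python
import torch
import torch.nn.functional as F

def taxonomy_loss(feats, heads, targets, weights=(0.2, 0.3, 0.5)):
    # feats: (B, D) image features; heads: one classifier per taxonomic rank
    # (e.g. family -> genus -> species); targets: one (B,) label tensor per
    # rank. Coarse ranks regularize while the finest rank dominates.
    return sum(w * F.cross_entropy(h(feats), t)
               for w, h, t in zip(weights, heads, targets))

heads = [torch.nn.Linear(256, n) for n in (20, 80, 300)]
feats = torch.randn(8, 256)
targets = [torch.randint(0, n, (8,)) for n in (20, 80, 300)]
print(taxonomy_loss(feats, heads, targets))
```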
[225] RadDiff: Describing Differences in Radiology Image Sets with Natural Language
Xiaoxian Shen, Yuhui Zhang, Sahithi Ankireddy, Xiaohan Wang, Maya Varma, Henry Guo, Curtis Langlotz, Serena Yeung-Levy
Main category: cs.CV
TL;DR: RadDiff is a multimodal AI system that performs radiologist-style comparative reasoning to identify clinically meaningful differences between paired radiology studies, outperforming general-domain baselines on a new benchmark.
Details
Motivation: Understanding differences between radiology image sets is critical for clinical insights and interpreting medical AI systems. Current methods lack the sophisticated comparative reasoning that radiologists perform when analyzing paired studies.
Method: RadDiff builds on a proposer-ranker framework with four innovations: (1) medical knowledge injection via domain-adapted vision-language models, (2) multimodal reasoning integrating images with clinical reports, (3) iterative hypothesis refinement across multiple reasoning rounds, and (4) targeted visual search that localizes and zooms in on salient regions.
Result: On RadDiffBench (57 expert-validated study pairs), RadDiff achieves 47% accuracy (50% with ground-truth reports), significantly outperforming the VisDiff baseline. The system demonstrates versatility across clinical tasks including COVID-19 phenotype comparison, racial subgroup analysis, and survival-related feature discovery.
Conclusion: RadDiff and RadDiffBench provide the first method-and-benchmark foundation for systematically uncovering meaningful differences in radiological data, advancing comparative reasoning in medical imaging analysis.
Abstract: Understanding how two radiology image sets differ is critical for generating clinical insights and for interpreting medical AI systems. We introduce RadDiff, a multimodal agentic system that performs radiologist-style comparative reasoning to describe clinically meaningful differences between paired radiology studies. RadDiff builds on a proposer-ranker framework from VisDiff, and incorporates four innovations inspired by real diagnostic workflows: (1) medical knowledge injection through domain-adapted vision-language models; (2) multimodal reasoning that integrates images with their clinical reports; (3) iterative hypothesis refinement across multiple reasoning rounds; and (4) targeted visual search that localizes and zooms in on salient regions to capture subtle findings. To evaluate RadDiff, we construct RadDiffBench, a challenging benchmark comprising 57 expert-validated radiology study pairs with ground-truth difference descriptions. On RadDiffBench, RadDiff achieves 47% accuracy, and 50% accuracy when guided by ground-truth reports, significantly outperforming the general-domain VisDiff baseline. We further demonstrate RadDiff’s versatility across diverse clinical tasks, including COVID-19 phenotype comparison, racial subgroup analysis, and discovery of survival-related imaging features. Together, RadDiff and RadDiffBench provide the first method-and-benchmark foundation for systematically uncovering meaningful differences in radiological data.
[226] HyperCOD: The First Challenging Benchmark and Baseline for Hyperspectral Camouflaged Object Detection
Shuyan Bai, Tingfa Xu, Peifu Liu, Yuhao Qiu, Huiyan Bai, Huan Chen, Yanyan Peng, Jianan Li
Main category: cs.CV
TL;DR: The paper introduces HyperCOD, the first large-scale benchmark for hyperspectral camouflaged object detection (HCOD), and proposes HSC-SAM, a novel method that adapts SAM for HCOD by decoupling hyperspectral images into spatial and spectral components.
Details
Motivation: RGB-based camouflaged object detection fails in real-world scenarios with ambiguous color/texture cues. Hyperspectral imaging offers better spectral signatures, but HCOD research has been limited by the lack of a dedicated, large-scale benchmark.
Method: Proposes HSC-SAM which reformulates hyperspectral images by decoupling them into: 1) a spatial map fed to SAM’s image encoder, and 2) a spectral saliency map that serves as an adaptive prompt. This bridges the modality gap between hyperspectral data and SAM’s RGB-based architecture.
Result: HSC-SAM achieves state-of-the-art performance on the new HyperCOD benchmark (350 high-resolution hyperspectral images) and demonstrates robust generalization to other public HSI datasets.
Conclusion: The HyperCOD dataset and HSC-SAM baseline provide a strong foundation for future HCOD research, addressing the critical need for dedicated benchmarks in this emerging field.
Abstract: RGB-based camouflaged object detection struggles in real-world scenarios where color and texture cues are ambiguous. While hyperspectral imaging offers a powerful alternative by capturing fine-grained spectral signatures, progress in hyperspectral camouflaged object detection (HCOD) has been critically hampered by the absence of a dedicated, large-scale benchmark. To spur innovation, we introduce HyperCOD, the first challenging benchmark for HCOD. Comprising 350 high-resolution hyperspectral images, it features complex real-world scenarios with minimal objects, intricate shapes, severe occlusions, and dynamic lighting to challenge current models. The advent of foundation models like the Segment Anything Model (SAM) presents a compelling opportunity. To adapt SAM for HCOD, we propose HyperSpectral Camouflage-aware SAM (HSC-SAM). HSC-SAM ingeniously reformulates the hyperspectral image by decoupling it into a spatial map fed to SAM’s image encoder and a spectral saliency map that serves as an adaptive prompt. This translation effectively bridges the modality gap. Extensive experiments show that HSC-SAM sets a new state-of-the-art on HyperCOD and generalizes robustly to other public HSI datasets. The HyperCOD dataset and our HSC-SAM baseline provide a robust foundation to foster future research in this emerging area.
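One plausible, assumed construction of a spectral saliency map: score each pixel by its spectral angle to the scene's mean spectrum, so spectrally anomalous pixels stand out even when their spatial texture matches the background. The paper's actual map may be computed differently:

```python
import numpy as np

def spectral_saliency(hsi):
    # hsi: (H, W, B) hyperspectral cube. Larger spectral angle to the mean
    # spectrum = more spectrally anomalous = more salient as a prompt.
    H, W, B = hsi.shape
    flat = hsi.reshape(-1, B)
    mean = flat.mean(axis=0)
    cos = (flat @ mean) / (np.linalg.norm(flat, axis=1)
                           * np.linalg.norm(mean) + 1e-8)
    return np.arccos(np.clip(cos, -1.0, 1.0)).reshape(H, W)

sal = spectral_saliency(np.random.rand(64, 64, 100))
print(sal.shape)  # (64, 64) saliency map to prompt the segmenter
```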
[227] I2E: From Image Pixels to Actionable Interactive Environments for Text-Guided Image Editing
Jinghan Yu, Junhao Xiao, Chenyu Zhu, Jiaming Li, Jia Li, HanMing Deng, Xirui Wang, Guoli Jia, Jianjun Li, Zhiyuan Ma, Xiang Bai, Bowen Zhou
Main category: cs.CV
TL;DR: I2E introduces a “Decompose-then-Action” paradigm for compositional image editing using object decomposition and physics-aware agents, outperforming existing methods on complex spatial reasoning tasks.
Details
Motivation: Existing pixel-level inpainting methods struggle with compositional editing requiring precise local control and multi-object spatial reasoning due to implicit planning-execution coupling, lack of object-level granularity, and unstructured pixel-centric modeling.
Method: I2E uses a Decomposer to transform images into discrete object layers, then employs a physics-aware Vision-Language-Action Agent that parses complex instructions into atomic actions via Chain-of-Thought reasoning.
Result: I2E significantly outperforms state-of-the-art methods on I2E-Bench and multiple public benchmarks in handling complex compositional instructions, maintaining physical plausibility, and ensuring multi-turn editing stability.
Conclusion: The “Decompose-then-Action” paradigm effectively addresses limitations of existing image editing methods by introducing structured object-level control and explicit reasoning for complex compositional tasks.
Abstract: Existing text-guided image editing methods primarily rely on end-to-end pixel-level inpainting paradigm. Despite its success in simple scenarios, this paradigm still significantly struggles with compositional editing tasks that require precise local control and complex multi-object spatial reasoning. This paradigm is severely limited by 1) the implicit coupling of planning and execution, 2) the lack of object-level control granularity, and 3) the reliance on unstructured, pixel-centric modeling. To address these limitations, we propose I2E, a novel “Decompose-then-Action” paradigm that revisits image editing as an actionable interaction process within a structured environment. I2E utilizes a Decomposer to transform unstructured images into discrete, manipulable object layers and then introduces a physics-aware Vision-Language-Action Agent to parse complex instructions into a series of atomic actions via Chain-of-Thought reasoning. Further, we also construct I2E-Bench, a benchmark designed for multi-instance spatial reasoning and high-precision editing. Experimental results on I2E-Bench and multiple public benchmarks demonstrate that I2E significantly outperforms state-of-the-art methods in handling complex compositional instructions, maintaining physical plausibility, and ensuring multi-turn editing stability.
[228] MVP: Enhancing Video Large Language Models via Self-supervised Masked Video Prediction
Xiaokun Sun, Zezhong Wu, Zewen Ding, Linli Xu
Main category: cs.CV
TL;DR: Proposes Masked Video Prediction (MVP) post-training objective for VideoLLMs to enhance temporal reasoning and causal understanding by reconstructing masked video segments from distractors.
Details
Motivation: Current RL-based post-training for VideoLLMs focuses on holistic content understanding but lacks explicit supervision for temporal coherence and inter-frame correlations, limiting models' ability to capture intricate dynamics and fine-grained visual causality.
Method: Introduces Masked Video Prediction (MVP) objective requiring models to reconstruct masked continuous video segments from challenging distractors. Uses scalable data synthesis pipeline to transform video corpora into MVP training samples, and employs Group Relative Policy Optimization (GRPO) with fine-grained reward function.
Result: Comprehensive evaluations demonstrate that MVP enhances video reasoning capabilities by directly reinforcing temporal reasoning and causal understanding.
Conclusion: MVP bridges the gap in current VideoLLM post-training by explicitly targeting temporal coherence and inter-frame correlations, improving models’ ability to understand video dynamics and causality.
Abstract: Reinforcement learning based post-training paradigms for Video Large Language Models (VideoLLMs) have achieved significant success by optimizing for visual-semantic tasks such as captioning or VideoQA. However, while these approaches effectively enhance perception abilities, they primarily target holistic content understanding, often lacking explicit supervision for intrinsic temporal coherence and inter-frame correlations. This tendency limits the models’ ability to capture intricate dynamics and fine-grained visual causality. To explicitly bridge this gap, we propose a novel post-training objective: Masked Video Prediction (MVP). By requiring the model to reconstruct a masked continuous segment from a set of challenging distractors, MVP forces the model to attend to the sequential logic and temporal context of events. To support scalable training, we introduce a scalable data synthesis pipeline capable of transforming arbitrary video corpora into MVP training samples, and further employ Group Relative Policy Optimization (GRPO) with a fine-grained reward function to enhance the model’s understanding of video context and temporal properties. Comprehensive evaluations demonstrate that MVP enhances video reasoning capabilities by directly reinforcing temporal reasoning and causal understanding.
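Constructing an MVP training sample amounts to masking a contiguous segment and mixing its true continuation with distractors; a toy sketch, with distractors drawn from the same video as an assumed simplification of the synthesis pipeline:

```python
import random

def build_mvp_sample(clips, mask_len=1, n_distractors=3):
    # clips: ordered list of segment ids for one video. Mask a contiguous
    # span, then shuffle the true segment(s) among distractors; the model
    # must pick the choice that restores temporal coherence.
    start = random.randrange(len(clips) - mask_len)
    answer = clips[start:start + mask_len]
    pool = [c for c in clips if c not in answer]
    choices = random.sample(pool, n_distractors) + answer
    random.shuffle(choices)
    context = clips[:start] + ["<MASK>"] * mask_len + clips[start + mask_len:]
    return context, choices, answer

print(build_mvp_sample([f"seg{i}" for i in range(8)]))
```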
[229] A Comparative Study of 3D Model Acquisition Methods for Synthetic Data Generation of Agricultural Products
Steven Moonen, Rob Salaets, Kenneth Batstone, Abdellatif Bey-Temsamani, Nick Michiels
Main category: cs.CV
TL;DR: The paper presents techniques for generating synthetic training data without CAD models in agricultural settings, demonstrated on separating stones from potatoes in a bin-picking environment.
Details
Motivation: AI vision systems in manufacturing need large annotated datasets, but agricultural industries lack readily available CAD models for synthetic data generation, making training costly and difficult.Method: Proposes techniques to substitute CAD files: using highly representative 3D models acquired through scanning or image-to-3D approaches to generate synthetic datasets for training object detection models.
Result: Demonstrates that representative 3D models can generate effective synthetic training data. Finetuning on small real datasets significantly improves performance, achieving similar results even with less representative models.
Conclusion: Synthetic data generation using alternative 3D modeling approaches (scanning/image-to-3D) can effectively reduce data acquisition costs in agricultural AI applications, especially when combined with limited real data finetuning.
Abstract: In the manufacturing industry, computer vision systems based on artificial intelligence (AI) are widely used to reduce costs and increase production. Training these AI models requires a large amount of training data that is costly to acquire and annotate, especially in high-variance, low-volume manufacturing environments. A popular approach to reduce the need for real data is the use of synthetic data that is generated by leveraging computer-aided design (CAD) models available in the industry. However, in the agricultural industry these models are not readily available, increasing the difficulty in leveraging synthetic data. In this paper, we present different techniques for substituting CAD files to create synthetic datasets. We measure their relative performance when used to train an AI object detection model to separate stones and potatoes in a bin picking environment. We demonstrate that highly representative 3D models, acquired by scanning or through image-to-3D approaches, can be used to generate synthetic data for training object detection models. Finetuning on a small real dataset can significantly improve the performance of the models and even achieve similar performance when less representative models are used.
[230] From Brute Force to Semantic Insight: Performance-Guided Data Transformation Design with LLMs
Usha Shrestha, Dmitry Ignatov, Radu Timofte
Main category: cs.CV
TL;DR: LLMs can autonomously engineer optimal code transformations by learning from empirical performance feedback, reducing search by 600x while maintaining competitive accuracy.
Details
Motivation: Current data-aware augmentation for code synthesis relies on heuristic design or brute-force approaches, which are inefficient and limited. There's a need for LLMs to autonomously learn optimal transformations from empirical performance feedback rather than symbolic objectives.Method: Fine-tune LLMs with Low-Rank Adaptation on a novel repository of 6,000+ empirically evaluated PyTorch augmentation functions, each annotated solely by downstream model accuracy. Training uses pairwise performance ordering (better-worse transformations) to align models through empirical feedback without reinforcement learning, reward models, or symbolic objectives.
Result: Achieves up to 600x fewer evaluated candidates than brute-force discovery while maintaining competitive peak accuracy. Shifts generation from random synthesis to task-aligned design. Direct prompting outperforms Chain-of-Thought prompting which introduces syntactic noise. Model internalizes semantic performance cues rather than memorizing syntax.
Conclusion: LLMs can exhibit task-level reasoning through non-textual feedback loops, bypassing explicit symbolic rewards. This demonstrates that performance-aware closed-loop solutions enable autonomous engineering of optimal code transformations through empirical learning.
Abstract: Large language models (LLMs) have achieved notable performance in code synthesis; however, data-aware augmentation remains a limiting factor, handled via heuristic design or brute-force approaches. We introduce a performance-aware, closed-loop solution in the NNGPT ecosystem of projects that enables LLMs to autonomously engineer optimal transformations by internalizing empirical performance cues. We fine-tune LLMs with Low-Rank Adaptation on a novel repository of more than 6,000 empirically evaluated PyTorch augmentation functions, each annotated solely by downstream model accuracy. Training uses pairwise performance ordering (better-worse transformations), enabling alignment through empirical feedback without reinforcement learning, reward models, or symbolic objectives. This reduces the need for exhaustive search, achieving up to 600x fewer evaluated candidates than brute-force discovery while maintaining competitive peak accuracy and shifting generation from random synthesis to task-aligned design. Ablation studies show that structured Chain-of-Thought prompting introduces syntactic noise and degrades performance, whereas direct prompting ensures stable optimization in performance-critical code tasks. Qualitative and quantitative analyses demonstrate that the model internalizes semantic performance cues rather than memorizing syntax. These results show that LLMs can exhibit task-level reasoning through non-textual feedback loops, bypassing explicit symbolic rewards.
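The abstract specifies only that training uses pairwise performance ordering without RL or reward models; one natural reading is a Bradley-Terry-style loss on the model's sequence log-likelihoods of the better- and worse-performing functions. A minimal sketch under that assumption:

```python
import torch
import torch.nn.functional as F

def pairwise_ordering_loss(logp_better, logp_worse, margin=0.0):
    """Bradley-Terry-style loss on sequence log-likelihoods: push the LLM's
    likelihood of the better-performing augmentation function above that of
    the worse one. The paper states only that training uses pairwise
    performance ordering; this exact loss form is an assumption."""
    return -F.logsigmoid(logp_better - logp_worse - margin).mean()

# logp_* would be per-sample sums of token log-probs from the LoRA model.
better = torch.tensor([-42.0, -37.5])
worse = torch.tensor([-45.0, -36.0])
print(pairwise_ordering_loss(better, worse))
```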
[231] EvalBlocks: A Modular Pipeline for Rapidly Evaluating Foundation Models in Medical Imaging
Jan Tagscherer, Sarah de Boer, Lena Philipp, Fennie van der Graaf, Dré Peeters, Joeran Bosma, Lars Leijten, Bogdan Obreja, Ewoud Smit, Alessa Hering
Main category: cs.CV
TL;DR: EvalBlocks is a modular framework built on Snakemake for efficient evaluation of medical imaging foundation models, enabling reproducible experiments with centralized tracking and parallel execution.
Details
Motivation: Current medical imaging foundation model development requires manual tracking of numerous experiments and design choices, which is slow, error-prone, and burdens researchers with evaluation logistics instead of focusing on model innovation.Method: Built on Snakemake, EvalBlocks provides a modular, plug-and-play framework that supports seamless integration of new datasets, foundation models, aggregation methods, and evaluation strategies with efficient caching and parallel execution.
Result: Demonstrated on five state-of-the-art foundation models and three medical imaging classification tasks, EvalBlocks successfully streamlines model evaluation, enabling faster iteration and reproducible experiments with centralized tracking.
Conclusion: EvalBlocks addresses the evaluation bottleneck in medical imaging foundation model development by providing an open-source framework that automates experiment tracking and execution, allowing researchers to focus more on model innovation rather than evaluation logistics.
Abstract: Developing foundation models in medical imaging requires continuous monitoring of downstream performance. Researchers are burdened with tracking numerous experiments, design choices, and their effects on performance, often relying on ad-hoc, manual workflows that are inherently slow and error-prone. We introduce EvalBlocks, a modular, plug-and-play framework for efficient evaluation of foundation models during development. Built on Snakemake, EvalBlocks supports seamless integration of new datasets, foundation models, aggregation methods, and evaluation strategies. All experiments and results are tracked centrally and are reproducible with a single command, while efficient caching and parallel execution enable scalable use on shared compute infrastructure. Demonstrated on five state-of-the-art foundation models and three medical imaging classification tasks, EvalBlocks streamlines model evaluation, enabling researchers to iterate faster and focus on model innovation rather than evaluation logistics. The framework is released as open source software at https://github.com/DIAGNijmegen/eval-blocks.
[232] IDESplat: Iterative Depth Probability Estimation for Generalizable 3D Gaussian Splatting
Wei Long, Haifeng Wu, Shiyin Jiang, Jinhua Zhang, Xinchun Ji, Shuhang Gu
Main category: cs.CV
TL;DR: IDESplat improves 3D Gaussian Splatting by iteratively boosting depth probability estimation through cascaded warp operations for more accurate Gaussian mean prediction.
Details
Motivation: Existing methods for generalizable 3D Gaussian Splatting rely on single-warp depth estimation, which fails to fully leverage cross-view geometric cues, resulting in unstable and coarse depth maps that hinder accurate Gaussian mean prediction.Method: Proposes IDESplat with Depth Probability Boosting Unit (DPBU) that integrates epipolar attention maps from cascading warp operations multiplicatively, and stacks multiple DPBUs in an iterative process to progressively refine depth probability estimates and identify high-likelihood depth candidates.
Result: Achieves state-of-the-art performance on RealEstate10K, ACID, and DL3DV datasets with real-time efficiency. Outperforms DepthSplat by 0.33 dB PSNR on RE10K using only 10.7% parameters and 70% memory, and improves by 2.95 dB PSNR on DTU in cross-dataset experiments.
Conclusion: IDESplat’s iterative depth probability boosting approach significantly improves Gaussian mean prediction accuracy, leading to better reconstruction quality and strong generalization ability while maintaining computational efficiency.
Abstract: Generalizable 3D Gaussian Splatting aims to directly predict Gaussian parameters using a feed-forward network for scene reconstruction. Among these parameters, Gaussian means are particularly difficult to predict, so depth is usually estimated first and then unprojected to obtain the Gaussian sphere centers. Existing methods typically rely solely on a single warp to estimate depth probability, which hinders their ability to fully leverage cross-view geometric cues, resulting in unstable and coarse depth maps. To address this limitation, we propose IDESplat, which iteratively applies warp operations to boost depth probability estimation for accurate Gaussian mean prediction. First, to eliminate the inherent instability of a single warp, we introduce a Depth Probability Boosting Unit (DPBU) that integrates epipolar attention maps produced by cascading warp operations in a multiplicative manner. Next, we construct an iterative depth estimation process by stacking multiple DPBUs, progressively identifying potential depth candidates with high likelihood. As IDESplat iteratively boosts depth probability estimates and updates the depth candidates, the depth map is gradually refined, resulting in accurate Gaussian means. We conduct experiments on RealEstate10K, ACID, and DL3DV. IDESplat achieves outstanding reconstruction quality and state-of-the-art performance with real-time efficiency. On RE10K, it outperforms DepthSplat by 0.33 dB in PSNR, using only 10.7% of the parameters and 70% of the memory. Additionally, our IDESplat improves PSNR by 2.95 dB over DepthSplat on the DTU dataset in cross-dataset experiments, demonstrating its strong generalization ability.
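To make the multiplicative fusion at the heart of the DPBU concrete, here is a toy sketch: per-warp epipolar attention maps over D depth candidates are multiplied element-wise and renormalized, sharpening the depth probability. The learned components of the actual unit are omitted:

```python
import torch

def dpbu_fuse(attn_maps, eps=1e-8):
    """Toy Depth Probability Boosting Unit: combine per-warp epipolar
    attention maps (each [B, D, H, W] over D depth candidates) by
    element-wise multiplication, then renormalize over candidates."""
    prob = torch.ones_like(attn_maps[0])
    for a in attn_maps:
        prob = prob * a.clamp_min(eps)
    return prob / prob.sum(dim=1, keepdim=True)

B, D, H, W = 1, 32, 16, 16
warps = [torch.softmax(torch.randn(B, D, H, W), dim=1) for _ in range(3)]
depth_prob = dpbu_fuse(warps)              # sharper than any single warp
best_candidate = depth_prob.argmax(dim=1)  # per-pixel depth index
```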
[233] Bayesian Monocular Depth Refinement via Neural Radiance Fields
Arun Muthukkumar
Main category: cs.CV
TL;DR: MDENeRF: An iterative framework that refines monocular depth estimates using Neural Radiance Fields (NeRFs) to add fine geometric details while maintaining global structure.
Details
Motivation: Current monocular depth estimation methods often produce smooth depth maps lacking fine geometric details needed for accurate scene understanding in applications like autonomous navigation and extended reality.Method: Three-component iterative framework: (1) initial monocular estimate for global structure, (2) NeRF trained on perturbed viewpoints with per-pixel uncertainty derived from volume rendering, (3) Bayesian fusion of noisy monocular and NeRF depths to iteratively inject high-frequency details.
Result: Demonstrates superior performance on key metrics and experiments using indoor scenes from the SUN RGB-D dataset.
Conclusion: MDENeRF effectively combines monocular depth estimation with NeRF-based refinement to produce detailed depth maps with both global structure and fine geometric details.
Abstract: Monocular depth estimation has applications in many fields, such as autonomous navigation and extended reality, making it an essential computer vision task. However, current methods often produce smooth depth maps that lack the fine geometric detail needed for accurate scene understanding. We propose MDENeRF, an iterative framework that refines monocular depth estimates using depth information from Neural Radiance Fields (NeRFs). MDENeRF consists of three components: (1) an initial monocular estimate for global structure, (2) a NeRF trained on perturbed viewpoints, with per-pixel uncertainty, and (3) Bayesian fusion of the noisy monocular and NeRF depths. We derive NeRF uncertainty from the volume rendering process to iteratively inject high-frequency fine details. Meanwhile, our monocular prior maintains global structure. We demonstrate superior performance on key metrics and experiments using indoor scenes from the SUN RGB-D dataset.
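The Bayesian fusion step admits a standard closed form: under independent per-pixel Gaussian noise, the fused depth is the precision-weighted average of the two estimates. A minimal sketch (the paper's exact weighting may differ):

```python
import numpy as np

def bayesian_depth_fusion(d_mono, var_mono, d_nerf, var_nerf):
    """Per-pixel fusion of two noisy depth maps under independent Gaussian
    noise: the posterior mean is the precision-weighted average (a standard
    product-of-Gaussians update)."""
    w_mono, w_nerf = 1.0 / var_mono, 1.0 / var_nerf
    d_fused = (w_mono * d_mono + w_nerf * d_nerf) / (w_mono + w_nerf)
    var_fused = 1.0 / (w_mono + w_nerf)  # fused estimate is more certain
    return d_fused, var_fused

d, v = bayesian_depth_fusion(
    d_mono=np.full((4, 4), 2.0), var_mono=np.full((4, 4), 0.5),
    d_nerf=np.full((4, 4), 2.2), var_nerf=np.full((4, 4), 0.1))
```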
[234] FLNet: Flood-Induced Agriculture Damage Assessment using Super Resolution of Satellite Images
Sanidhya Ghosal, Anurag Sharma, Sushil Ghildiyal, Mukesh Saini
Main category: cs.CV
TL;DR: FLNet uses deep learning with super-resolution to enhance Sentinel-2 satellite images from 10m to 3m resolution for improved crop damage assessment after floods, achieving near-commercial high-resolution performance at lower cost.
Details
Motivation: Traditional manual flood damage surveys are slow and biased, while current satellite methods face limitations like cloud cover and low spatial resolution. There's a need for rapid, accurate, and cost-effective crop damage assessment for post-disaster agricultural management in flood-prone regions like India.Method: FLNet is a novel deep learning architecture that first applies super-resolution to enhance Sentinel-2 satellite images from 10m to 3m spatial resolution, then performs damage classification on the enhanced images.
Result: On the Bihar Flood Impacted Croplands Dataset (BFCD-22), FLNet improved the critical “Full Damage” F1-score from 0.83 to 0.89, nearly matching the 0.89 score achieved using commercial high-resolution imagery.
Conclusion: FLNet provides a cost-effective and scalable solution for automated, high-fidelity crop damage assessment, enabling a potential nationwide shift from manual to automated assessment methods for post-flood agricultural management.
Abstract: Distributing government relief efforts after a flood is challenging. In India, the crops are widely affected by floods; therefore, making rapid and accurate crop damage assessment is crucial for effective post-disaster agricultural management. Traditional manual surveys are slow and biased, while current satellite-based methods face challenges like cloud cover and low spatial resolution. Therefore, to bridge this gap, this paper introduced FLNet, a novel deep learning based architecture that used super-resolution to enhance the 10 m spatial resolution of Sentinel-2 satellite images into 3 m resolution before classifying damage. We tested our model on the Bihar Flood Impacted Croplands Dataset (BFCD-22), and the results showed an improved critical “Full Damage” F1-score from 0.83 to 0.89, nearly matching the 0.89 score of commercial high-resolution imagery. This work presented a cost-effective and scalable solution, paving the way for a nationwide shift from manual to automated, high-fidelity damage assessment.
[235] HemBLIP: A Vision-Language Model for Interpretable Leukemia Cell Morphology Analysis
Julie van Logtestijn, Petru Manescu
Main category: cs.CV
TL;DR: HemBLIP is a vision-language model that generates interpretable descriptions of blood cell morphology for leukemia diagnosis, outperforming existing models while being more transparent and computationally efficient.
Details
Motivation: Current deep learning models for white blood cell morphology analysis in leukemia diagnosis act as black boxes, limiting clinical trust and adoption. There's a need for more interpretable models that can provide transparent morphological descriptions.Method: Developed HemBLIP, a vision-language model trained on a new dataset of 14k healthy and leukemic cells paired with expert-derived attribute captions. Used both full fine-tuning and LoRA-based parameter-efficient training, and benchmarked against the biomedical foundation model MedGEMMA.
Result: HemBLIP achieves higher caption quality and morphological accuracy compared to MedGEMMA. LoRA adaptation provides further performance gains with significantly reduced computational cost.
Conclusion: Vision-language models like HemBLIP show promise for transparent and scalable hematological diagnostics by generating interpretable, morphology-aware descriptions of blood cells.
Abstract: Microscopic evaluation of white blood cell morphology is central to leukemia diagnosis, yet current deep learning models often act as black boxes, limiting clinical trust and adoption. We introduce HemBLIP, a vision-language model designed to generate interpretable, morphology-aware descriptions of peripheral blood cells. Using a newly constructed dataset of 14k healthy and leukemic cells paired with expert-derived attribute captions, we adapt a general-purpose VLM via both full fine-tuning and LoRA-based parameter-efficient training, and benchmark against the biomedical foundation model MedGEMMA. HemBLIP achieves higher caption quality and morphological accuracy, while LoRA adaptation provides further gains with significantly reduced computational cost. These results highlight the promise of vision-language models for transparent and scalable hematological diagnostics.
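As a rough illustration of the LoRA route, here is how parameter-efficient adaptation of a captioning VLM typically looks with the peft library; the base checkpoint and target modules below are illustrative, since the paper does not name them:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForVision2Seq

# Hypothetical backbone; the paper only says "a general-purpose VLM".
model = AutoModelForVision2Seq.from_pretrained("Salesforce/blip2-opt-2.7b")
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only the low-rank adapters train
```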
[236] FocusUI: Efficient UI Grounding via Position-Preserving Visual Token Selection
Mingyu Ouyang, Kevin Qinghong Lin, Mike Zheng Shou, Hwee Tou Ng
Main category: cs.CV
TL;DR: FocusUI is an efficient UI grounding framework that reduces visual token overhead by selecting instruction-relevant patches while preserving positional continuity, achieving faster inference with minimal accuracy loss.
Details
Motivation: Current Vision-Language Models for UI grounding tokenize high-resolution screenshots into thousands of visual tokens, causing significant computational overhead and diluted attention, unlike humans who focus on relevant regions.Method: FocusUI addresses two challenges: (1) Eliminates redundant tokens using patch-level supervision combining instruction-conditioned scores with rule-based UI-graph scores, and (2) Preserves positional continuity with PosPad strategy that compresses dropped token sequences into special markers at their last index.
Result: FocusUI surpasses GUI-specific baselines on four grounding benchmarks, achieving 3.7% improvement over GUI-Actor-7B on ScreenSpot-Pro. With only 30% token retention, it drops only 3.2% accuracy while achieving 1.44x faster inference and 17% lower peak GPU memory.
Conclusion: FocusUI demonstrates that efficient UI grounding is achievable through intelligent visual token selection that preserves positional information, offering significant computational benefits with minimal performance degradation.
Abstract: Vision-Language Models (VLMs) have shown remarkable performance in User Interface (UI) grounding tasks, driven by their ability to process increasingly high-resolution screenshots. However, screenshots are tokenized into thousands of visual tokens (e.g., about 4700 for 2K resolution), incurring significant computational overhead and diluting attention. In contrast, humans typically focus on regions of interest when interacting with UI. In this work, we pioneer the task of efficient UI grounding. Guided by practical analysis of the task’s characteristics and challenges, we propose FocusUI, an efficient UI grounding framework that selects patches most relevant to the instruction while preserving positional continuity for precise grounding. FocusUI addresses two key challenges: (1) Eliminating redundant tokens in visual encoding. We construct patch-level supervision by fusing an instruction-conditioned score with a rule-based UI-graph score that down-weights large homogeneous regions to select distinct and instruction-relevant visual tokens. (2) Preserving positional continuity during visual token selection. We find that general visual token pruning methods suffer from severe accuracy degradation on UI grounding tasks due to broken positional information. We introduce a novel PosPad strategy, which compresses each contiguous sequence of dropped visual tokens into a single special marker placed at the sequence’s last index to preserve positional continuity. Comprehensive experiments on four grounding benchmarks demonstrate that FocusUI surpasses GUI-specific baselines. On the ScreenSpot-Pro benchmark, FocusUI-7B achieves a performance improvement of 3.7% over GUI-Actor-7B. Even with only 30% visual token retention, FocusUI-7B drops by only 3.2% while achieving up to 1.44x faster inference and 17% lower peak GPU memory.
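The PosPad idea is simple enough to show directly: each contiguous run of dropped tokens collapses into one placeholder that keeps the run's last position index, so position embeddings stay continuous. A toy sketch (the real method operates on visual tokens inside the VLM):

```python
def pospad_select(tokens, keep, pad_id=-1):
    """Toy PosPad: keep selected tokens and replace each contiguous run of
    dropped tokens with one placeholder carrying the run's last position
    index. Returns (token, position) pairs; `pad_id` marks placeholders."""
    out, run_end = [], None
    for pos, (tok, k) in enumerate(zip(tokens, keep)):
        if k:
            if run_end is not None:
                out.append((pad_id, run_end))  # one marker per dropped run
                run_end = None
            out.append((tok, pos))
        else:
            run_end = pos  # extend the current dropped run
    if run_end is not None:
        out.append((pad_id, run_end))
    return out

tokens = list("ABCDEFGH")
keep = [1, 0, 0, 1, 1, 0, 0, 0]
print(pospad_select(tokens, keep))
# [('A', 0), (-1, 2), ('D', 3), ('E', 4), (-1, 7)]
```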
[237] ResTok: Learning Hierarchical Residuals in 1D Visual Tokenizers for Autoregressive Image Generation
Xu Zhang, Cheng Da, Huan Yang, Kun Gai, Ming Lu, Zhan Ma
Main category: cs.CV
TL;DR: ResTok introduces a hierarchical residual visual tokenizer for autoregressive image generation that improves efficiency and quality by incorporating vision-specific priors rather than treating images as flat token sequences like language.
Details
Motivation: Current 1D visual tokenizers follow language modeling principles, treating visual data as flat sequential tokens and overlooking key vision properties like hierarchical and residual network designs that are essential for convergence and efficiency in visual models.Method: Proposes Residual Tokenizer (ResTok) that builds hierarchical residuals for both image tokens and latent tokens, enabling cross-level feature fusion at each layer. Also introduces a hierarchical AR generator that predicts entire levels of latent tokens at once to reduce sampling steps.
Result: Achieves gFID of 2.34 on ImageNet-256 with only 9 sampling steps, significantly improving AR image generation by restoring hierarchical residual priors in visual tokenization.
Conclusion: Incorporating vision-specific hierarchical residual priors into visual tokenization substantially enhances representational capacity and generation efficiency for autoregressive image modeling, outperforming language-inspired approaches.
Abstract: Existing 1D visual tokenizers for autoregressive (AR) generation largely follow the design principles of language modeling, as they are built directly upon transformers whose priors originate in language, yielding single-hierarchy latent tokens and treating visual data as flat sequential token streams. However, this language-like formulation overlooks key properties of vision, particularly the hierarchical and residual network designs that have long been essential for convergence and efficiency in visual models. To bring “vision” back to vision, we propose the Residual Tokenizer (ResTok), a 1D visual tokenizer that builds hierarchical residuals for both image tokens and latent tokens. The hierarchical representations obtained through progressively merging enable cross-level feature fusion at each layer, substantially enhancing representational capacity. Meanwhile, the semantic residuals between hierarchies prevent information overlap, yielding more concentrated latent distributions that are easier for AR modeling. Cross-level bindings consequently emerge without any explicit constraints. To accelerate the generation process, we further introduce a hierarchical AR generator that substantially reduces sampling steps by predicting an entire level of latent tokens at once rather than generating them strictly token-by-token. Extensive experiments demonstrate that restoring hierarchical residual priors in visual tokenization significantly improves AR image generation, achieving a gFID of 2.34 on ImageNet-256 with only 9 sampling steps. Code is available at https://github.com/Kwai-Kolors/ResTok.
[238] FUSION: Full-Body Unified Motion Prior for Body and Hands via Diffusion
Enes Duran, Nikos Athanasiou, Muhammed Kocabas, Michael J. Black, Omid Taheri
Main category: cs.CV
TL;DR: FUSION is the first diffusion-based unconditional full-body motion prior that jointly models body and hand motion, addressing the lack of large-scale datasets for detailed hand articulation in full-body motion synthesis.
Details
Motivation: Existing motion synthesis methods either ignore hand motions or generate full-body motions only for narrow tasks. There's a lack of large-scale datasets that jointly capture diverse full-body motion with detailed hand articulation, as current datasets are either limited in scale/diversity or focus on body/hand separately.Method: Curate and unify existing hand motion datasets with large-scale body motion data to generate full-body sequences. Propose FUSION, a diffusion-based unconditional full-body motion prior that jointly models body and hand motion. Develop an optimization pipeline that refines the latent space of the diffusion model for task-specific motions.
Result: FUSION surpasses state-of-the-art skeletal control models on Keypoint Tracking in HumanML3D dataset and achieves superior motion naturalness. Successfully demonstrates two applications: (1) generating detailed full-body motion with fingers during object interaction, and (2) generating Self-Interaction motions using LLM to transform natural language into motion constraints.
Conclusion: FUSION enables precise control over hand motion while maintaining plausible full-body coordination, going beyond typical motion prior uses. The approach addresses the critical gap in full-body motion synthesis with detailed hand articulation, with code to be made public.
Abstract: Hands are central to interacting with our surroundings and conveying gestures, making their inclusion essential for full-body motion synthesis. Despite this, existing human motion synthesis methods fall short: some ignore hand motions entirely, while others generate full-body motions only for narrowly scoped tasks under highly constrained settings. A key obstacle is the lack of large-scale datasets that jointly capture diverse full-body motion with detailed hand articulation. While some datasets capture both, they are limited in scale and diversity. Conversely, large-scale datasets typically focus either on body motion without hands or on hand motions without the body. To overcome this, we curate and unify existing hand motion datasets with large-scale body motion data to generate full-body sequences that capture both hand and body. We then propose the first diffusion-based unconditional full-body motion prior, FUSION, which jointly models body and hand motion. Despite using a pose-based motion representation, FUSION surpasses state-of-the-art skeletal control models on the Keypoint Tracking task in the HumanML3D dataset and achieves superior motion naturalness. Beyond standard benchmarks, we demonstrate that FUSION can go beyond typical uses of motion priors through two applications: (1) generating detailed full-body motion including fingers during interaction given the motion of an object, and (2) generating Self-Interaction motions using an LLM to transform natural language cues into actionable motion constraints. For these applications, we develop an optimization pipeline that refines the latent space of our diffusion model to generate task-specific motions. Experiments on these tasks highlight precise control over hand motion while maintaining plausible full-body coordination. The code will be public.
[239] PosterVerse: A Full-Workflow Framework for Commercial-Grade Poster Generation with HTML-Based Scalable Typography
Junle Liu, Peirong Zhang, Yuyi Zhang, Pengyu Yan, Hui Zhou, Xinyue Zhou, Fengjun Guo, Lianwen Jin
Main category: cs.CV
TL;DR: PosterVerse is a full-workflow commercial poster generation system that automates design through LLM-based blueprint creation, diffusion-based background generation, and MLLM-powered HTML rendering, using a novel HTML-based dataset called PosterDNA.
Details
Motivation: Current automated poster generation systems have significant limitations: incomplete design workflows, poor text rendering accuracy, and insufficient flexibility for commercial applications. Commercial-grade posters require seamless integration of aesthetic appeal with precise, informative content delivery.Method: Three-stage approach: (1) Blueprint creation using fine-tuned LLMs to extract key design elements from user requirements, (2) Graphical background generation via customized diffusion models for visual appeal, (3) Unified layout-text rendering with MLLM-powered HTML engine for high text accuracy and customization. Also introduces PosterDNA dataset - first Chinese poster generation dataset with HTML typography files.
Result: Experimental results show PosterVerse consistently produces commercial-grade posters with appealing visuals, accurate text alignment, and customizable layouts. The HTML-based approach fundamentally solves challenges of rendering small and high-density text.
Conclusion: PosterVerse is a promising solution for automating commercial poster design, offering a full-workflow approach that addresses key limitations of existing systems through its three-stage pipeline and novel HTML-based dataset.
Abstract: Commercial-grade poster design demands the seamless integration of aesthetic appeal with precise, informative content delivery. Current automated poster generation systems face significant limitations, including incomplete design workflows, poor text rendering accuracy, and insufficient flexibility for commercial applications. To address these challenges, we propose PosterVerse, a full-workflow, commercial-grade poster generation method that seamlessly automates the entire design process while delivering high-density and scalable text rendering. PosterVerse replicates professional design through three key stages: (1) blueprint creation using fine-tuned LLMs to extract key design elements from user requirements, (2) graphical background generation via customized diffusion models to create visually appealing imagery, and (3) unified layout-text rendering with an MLLM-powered HTML engine to guarantee high text accuracy and flexible customization. In addition, we introduce PosterDNA, a commercial-grade, HTML-based dataset tailored for training and validating poster design models. To the best of our knowledge, PosterDNA is the first Chinese poster generation dataset to introduce HTML typography files, enabling scalable text rendering and fundamentally solving the challenges of rendering small and high-density text. Experimental results demonstrate that PosterVerse consistently produces commercial-grade posters with appealing visuals, accurate text alignment, and customizable layouts, making it a promising solution for automating commercial poster design. The code and model are available at https://github.com/wuhaer/PosterVerse.
[240] Thinking with Frames: Generative Video Distortion Evaluation via Frame Reward Model
Yuan Wang, Borui Liao, Huijuan Huang, Jinda Lu, Ouxiang Li, Kuien Liu, Meng Wang, Xiang Wang
Main category: cs.CV
TL;DR: REACT is a frame-level reward model for evaluating structural distortions in generative videos, trained on a large human-annotated dataset and using a two-stage framework with supervised fine-tuning and reinforcement learning.
Details
Motivation: Existing video reward models focus on visual quality, motion quality, and text alignment but overlook structural distortions like abnormal object appearances and interactions, which degrade generative video quality.Method: 1) Construct large-scale human preference dataset with structural distortion taxonomy; 2) Use CoT synthesis pipeline for additional data; 3) Two-stage training: supervised fine-tuning with masked loss, then reinforcement learning with GRPO and pairwise rewards; 4) Dynamic sampling during inference to focus on distortion-prone frames.
Result: REACT effectively complements existing reward models in assessing structural distortions, achieving accurate quantitative evaluations and interpretable attribution analysis, as demonstrated through REACT-Bench benchmark.
Conclusion: REACT addresses the gap in structural distortion evaluation for generative videos, providing a specialized reward model that enhances overall video quality assessment through frame-level analysis and reasoning.
Abstract: Recent advances in video reward models and post-training strategies have improved text-to-video (T2V) generation. While these models typically assess visual quality, motion quality, and text alignment, they often overlook key structural distortions, such as abnormal object appearances and interactions, which can degrade the overall quality of the generative video. To address this gap, we introduce REACT, a frame-level reward model designed specifically for structural distortion evaluation in generative videos. REACT assigns point-wise scores and attribution labels by reasoning over video frames, focusing on recognizing distortions. To support this, we construct a large-scale human preference dataset, annotated based on our proposed taxonomy of structural distortions, and generate additional data using an efficient Chain-of-Thought (CoT) synthesis pipeline. REACT is trained with a two-stage framework: (1) supervised fine-tuning with masked loss for domain knowledge injection, followed by (2) reinforcement learning with Group Relative Policy Optimization (GRPO) and pairwise rewards to enhance reasoning capability and align output scores with human preferences. During inference, a dynamic sampling mechanism is introduced to focus on frames most likely to exhibit distortion. We also present REACT-Bench, a benchmark for generative video distortion evaluation. Experimental results demonstrate that REACT complements existing reward models in assessing structural distortions, achieving both accurate quantitative evaluations and interpretable attribution analysis.
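The dynamic sampling mechanism can be sketched as a simple top-k selection over per-frame distortion-likelihood scores, kept in temporal order; how those scores are produced inside REACT is part of the model and not reproduced here:

```python
import torch

def dynamic_frame_sampling(frames, scores, k=8):
    """Keep the k frames with the highest distortion-likelihood scores,
    preserving temporal order."""
    idx = torch.topk(scores, k=min(k, scores.numel())).indices.sort().values
    return frames[idx], idx

frames = torch.randn(32, 3, 224, 224)  # T x C x H x W clip
scores = torch.rand(32)                # stand-in per-frame scores
sampled, idx = dynamic_frame_sampling(frames, scores, k=8)
```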
[241] Unsupervised Modular Adaptive Region Growing and RegionMix Classification for Wind Turbine Segmentation
Raül Pérez-Gonzalo, Riccardo Magro, Andreas Espersen, Antonio Agudo
Main category: cs.CV
TL;DR: Annotation-efficient wind turbine blade segmentation using unsupervised region growing and region classification instead of pixel-wise deep learning.
Details
Motivation: Wind turbine inspections require accurate blade segmentation, but traditional pixel-wise deep learning methods need extensive annotated datasets, creating scalability challenges.Method: Reframes segmentation as binary region classification using unsupervised Modular Adaptive Region Growing with Adaptive Thresholding and Region Merging, plus RegionMix augmentation for better generalization.
Result: Achieves state-of-the-art segmentation accuracy and strong cross-site generalization across different windfarms.
Conclusion: The annotation-efficient approach enables reliable turbine blade segmentation without extensive labeled data, improving scalability for automated wind turbine inspections.
Abstract: Reliable operation of wind turbines requires frequent inspections, as even minor surface damages can degrade aerodynamic performance, reduce energy output, and accelerate blade wear. Central to automating these inspections is the accurate segmentation of turbine blades from visual data. This task is traditionally addressed through dense, pixel-wise deep learning models. However, such methods demand extensive annotated datasets, posing scalability challenges. In this work, we introduce an annotation-efficient segmentation approach that reframes the pixel-level task into a binary region classification problem. Image regions are generated using a fully unsupervised, interpretable Modular Adaptive Region Growing technique, guided by image-specific Adaptive Thresholding and enhanced by a Region Merging process that consolidates fragmented areas into coherent segments. To improve generalization and classification robustness, we introduce RegionMix, an augmentation strategy that synthesizes new training samples by combining distinct regions. Our framework demonstrates state-of-the-art segmentation accuracy and strong cross-site generalization by consistently segmenting turbine blades across distinct windfarms.
[242] Mind the Generative Details: Direct Localized Detail Preference Optimization for Video Diffusion Models
Zitong Huang, Kaidong Zhang, Yukang Ding, Chao Gao, Rui Ding, Ying Chen, Wangmeng Zuo
Main category: cs.CV
TL;DR: LocalDPO: A novel post-training framework that aligns text-to-video diffusion models using localized preference pairs from real videos, optimizing at spatio-temporal region level for more efficient and fine-grained alignment.
Details
Motivation: Existing DPO methods for text-to-video alignment are inefficient (relying on multi-sample ranking and task-specific critic models) and provide ambiguous global supervision. There's a need for more efficient and fine-grained alignment approaches.Method: Proposes LocalDPO with automated pipeline: uses real videos as positive samples, generates negatives by locally corrupting them with random spatio-temporal masks and restoring only masked regions using frozen base model. Uses region-aware DPO loss that restricts preference learning to corrupted areas for rapid convergence.
Result: Experiments on Wan2.1 and CogVideoX show LocalDPO consistently improves video fidelity, temporal coherence and human preference scores over other post-training approaches.
Conclusion: LocalDPO establishes a more efficient and fine-grained paradigm for video generator alignment by operating at spatio-temporal region level rather than global supervision.
Abstract: Aligning text-to-video diffusion models with human preferences is crucial for generating high-quality videos. Existing Direct Preference Optimization (DPO) methods rely on multi-sample ranking and task-specific critic models, an approach that is inefficient and often yields ambiguous global supervision. To address these limitations, we propose LocalDPO, a novel post-training framework that constructs localized preference pairs from real videos and optimizes alignment at the spatio-temporal region level. We design an automated pipeline that efficiently collects preference pairs with a single inference per prompt, eliminating the need for external critic models or manual annotation. Specifically, we treat high-quality real videos as positive samples and generate corresponding negatives by locally corrupting them with random spatio-temporal masks and restoring only the masked regions using the frozen base model. During training, we introduce a region-aware DPO loss that restricts preference learning to corrupted areas for rapid convergence. Experiments on Wan2.1 and CogVideoX demonstrate that LocalDPO consistently improves video fidelity, temporal coherence and human preference scores over other post-training approaches, establishing a more efficient and fine-grained paradigm for video generator alignment.
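The region-aware loss can be sketched as a masked variant of a Diffusion-DPO-style objective: per-pixel denoising errors for the real (preferred) and locally corrupted (dispreferred) clips are compared only inside the corruption mask. The reference-model term of the full objective is omitted here for brevity, so this is a simplification, not the paper's exact loss:

```python
import torch
import torch.nn.functional as F

def local_dpo_loss(err_win, err_lose, mask, beta=500.0):
    """Simplified region-aware preference loss. `err_win` / `err_lose` are
    per-pixel squared denoising errors of the policy model on the preferred
    and dispreferred clips; `mask` marks the corrupted region."""
    m = mask / mask.sum().clamp_min(1.0)
    delta = ((err_win - err_lose) * m).sum()  # restricted to masked region
    return -F.logsigmoid(-beta * delta)       # prefer lower error on winner

# Toy shapes: [C, T, H, W] error maps, binary spatio-temporal mask.
err_w, err_l = torch.rand(4, 8, 32, 32), torch.rand(4, 8, 32, 32)
mask = torch.zeros(4, 8, 32, 32)
mask[:, 2:5, 8:16, 8:16] = 1.0
loss = local_dpo_loss(err_w, err_l, mask)
```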
[243] Analyzing Reasoning Consistency in Large Multimodal Models under Cross-Modal Conflicts
Zhihao Zhu, Jiafeng Liang, Shixin Jiang, Jinlan Fu, Ming Liu, Guanglu Sun, See-Kiong Ng, Bing Qin
Main category: cs.CV
TL;DR: LMMs suffer from “textual inertia” - once hallucinations occur in reasoning chains, models blindly follow erroneous text ignoring visual evidence. Proposed LogicGraph Perturbation Protocol reveals poor self-correction (<10% success), and Active Visual-Context Refinement improves robustness.
Details
Motivation: Large Multimodal Models show impressive video reasoning via Chain-of-Thought, but their reasoning robustness is questionable. The paper identifies "textual inertia" - models blindly adhere to erroneous text in reasoning chains while neglecting conflicting visual evidence, which undermines reliable reasoning.Method: 1) LogicGraph Perturbation Protocol: Structurally injects perturbations into reasoning chains of diverse LMMs to evaluate self-reflection capabilities. 2) Active Visual-Context Refinement: Training-free inference paradigm with active visual re-grounding for fine-grained verification and adaptive context refinement to summarize/denoise reasoning history.
Result: Models successfully self-correct in less than 10% of cases and predominantly succumb to blind textual error propagation. The proposed Active Visual-Context Refinement approach significantly stifles hallucination propagation and enhances reasoning robustness.
Conclusion: Textual inertia is a critical failure mode in LMM reasoning. The LogicGraph Perturbation Protocol reveals severe self-correction deficiencies, while Active Visual-Context Refinement provides an effective training-free solution to mitigate hallucination propagation and improve reasoning reliability.
Abstract: Large Multimodal Models (LMMs) have demonstrated impressive capabilities in video reasoning via Chain-of-Thought (CoT). However, the robustness of their reasoning chains remains questionable. In this paper, we identify a critical failure mode termed textual inertia, where once a textual hallucination occurs in the thinking process, models tend to blindly adhere to the erroneous text while neglecting conflicting visual evidence. To systematically investigate this, we propose the LogicGraph Perturbation Protocol that structurally injects perturbations into the reasoning chains of diverse LMMs spanning both native reasoning architectures and prompt-driven paradigms to evaluate their self-reflection capabilities. The results reveal that models successfully self-correct in less than 10% of cases and predominantly succumb to blind textual error propagation. To mitigate this, we introduce Active Visual-Context Refinement, a training-free inference paradigm which orchestrates an active visual re-grounding mechanism to enforce fine-grained verification coupled with an adaptive context refinement strategy to summarize and denoise the reasoning history. Experiments demonstrate that our approach significantly stifles hallucination propagation and enhances reasoning robustness.
[244] Gen3R: 3D Scene Generation Meets Feed-Forward Reconstruction
Jiaxin Huang, Yuanbo Yang, Bangbang Yang, Lin Ma, Yuewen Ma, Yiyi Liao
Main category: cs.CV
TL;DR: Gen3R bridges foundational 3D reconstruction models with video diffusion models for scene-level 3D generation, producing both RGB videos and corresponding 3D geometry from single or multiple images.
Details
Motivation: To leverage the complementary strengths of reconstruction models (strong geometric priors) and generative models (appearance priors) for improved 3D scene generation, enabling mutual enhancement between reconstruction and generation tasks.Method: Repurposes VGGT reconstruction model to produce geometric latents, trains an adapter on its tokens that are regularized to align with appearance latents from pre-trained video diffusion models, then jointly generates disentangled but aligned latents for both geometry and appearance.
Result: Achieves state-of-the-art results in single- and multi-image conditioned 3D scene generation, produces RGB videos with corresponding 3D geometry (camera poses, depth maps, global point clouds), and enhances reconstruction robustness through generative priors.
Conclusion: Tight coupling of reconstruction and generative models provides mutual benefits, enabling high-quality scene-level 3D generation that combines geometric accuracy with realistic appearance, demonstrating the synergy between these complementary approaches.
Abstract: We present Gen3R, a method that bridges the strong priors of foundational reconstruction models and video diffusion models for scene-level 3D generation. We repurpose the VGGT reconstruction model to produce geometric latents by training an adapter on its tokens, which are regularized to align with the appearance latents of pre-trained video diffusion models. By jointly generating these disentangled yet aligned latents, Gen3R produces both RGB videos and corresponding 3D geometry, including camera poses, depth maps, and global point clouds. Experiments demonstrate that our approach achieves state-of-the-art results in single- and multi-image conditioned 3D scene generation. Additionally, our method can enhance the robustness of reconstruction by leveraging generative priors, demonstrating the mutual benefit of tightly coupling reconstruction and generative models.
[245] GeoReason: Aligning Thinking And Answering In Remote Sensing Vision-Language Models Via Logical Consistency Reinforcement Learning
Wenshuai Li, Xiantai Xiang, Zixiao Wen, Guangyao Zhou, Ben Niu, Feng Wang, Lijia Huang, Qiantong Wang, Yuxin Hu
Main category: cs.CV
TL;DR: GeoReason framework enhances RS-VLMs’ cognitive reliability by synchronizing internal reasoning with final decisions through logic-driven dataset and two-stage training with consistency-aware reinforcement learning.
Details
Motivation: Current Remote Sensing Vision-Language Models suffer from logical hallucinations where correct answers come from flawed reasoning or positional shortcuts, undermining reliability in strategic spatial decision-making. There's a need to transition from perception-centric recognition to high-level deductive reasoning.Method: 1. Construct GeoReason-Bench: 4,000 reasoning trajectories synthesized from geometric primitives and expert knowledge. 2. Two-stage training: Supervised Knowledge Initialization (teaches reasoning syntax and domain expertise) and Consistency-Aware Reinforcement Learning with Logical Consistency Reward (penalizes logical drift via option permutation strategy).
Result: The framework significantly enhances cognitive reliability and interpretability of RS-VLMs, achieving state-of-the-art performance compared to other advanced methods.
Conclusion: GeoReason successfully addresses logical hallucinations in RS-VLMs by ensuring decisions are anchored in verifiable reasoning traces, improving reliability for complex spatial tasks.
Abstract: The evolution of Remote Sensing Vision-Language Models (RS-VLMs) emphasizes the importance of transitioning from perception-centric recognition toward high-level deductive reasoning to enhance cognitive reliability in complex spatial tasks. However, current models often suffer from logical hallucinations, where correct answers are derived from flawed reasoning chains or rely on positional shortcuts rather than spatial logic. This decoupling undermines reliability in strategic spatial decision-making. To address this, we present GeoReason, a framework designed to synchronize internal thinking with final decisions. We first construct GeoReason-Bench, a logic-driven dataset containing 4,000 reasoning trajectories synthesized from geometric primitives and expert knowledge. We then formulate a two-stage training strategy: (1) Supervised Knowledge Initialization to equip the model with reasoning syntax and domain expertise, and (2) Consistency-Aware Reinforcement Learning to refine deductive reliability. This second stage integrates a novel Logical Consistency Reward, which penalizes logical drift via an option permutation strategy to anchor decisions in verifiable reasoning traces. Experimental results demonstrate that our framework significantly enhances the cognitive reliability and interpretability of RS-VLMs, achieving state-of-the-art performance compared to other advanced methods.
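The option permutation strategy behind the Logical Consistency Reward is straightforward to sketch: re-ask the same question with shuffled options and reward answers that track content rather than position. The `ask_model` interface below is a hypothetical stand-in for the policy being trained:

```python
import random

def consistency_reward(ask_model, question, options, rounds=3, rng=random):
    """Re-ask a multiple-choice question with shuffled option order and
    reward the model only when it keeps choosing the same content, not
    the same letter. `ask_model(question, options) -> chosen index`."""
    base = options[ask_model(question, options)]
    hits = 0
    for _ in range(rounds):
        perm = options[:]
        rng.shuffle(perm)
        if perm[ask_model(question, perm)] == base:
            hits += 1
    return hits / rounds  # 1.0 = fully order-invariant answer
```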
[246] Pixel-Wise Multimodal Contrastive Learning for Remote Sensing Images
Leandro Stival, Ricardo da Silva Torres, Helio Pedrini
Main category: cs.CV
TL;DR: Proposes PIMC, a multimodal self-supervised approach using 2D recurrence plots from pixel time series and remote sensing imagery, outperforming SOTA on Earth observation tasks.
Details
Motivation: Satellites generate massive Earth observation data (SITS), but most deep learning models process entire images or complete time series, missing pixel-level temporal patterns. Need more effective feature extraction from pixel-wise time series data.Method: 1) Generate recurrence plots from pixel-based vegetation index time series (NDVI, EVI, SAVI) as 2D representations instead of raw values. 2) Propose PIxel-wise Multimodal Contrastive (PIMC) - multimodal self-supervision that creates encoders using 2D pixel time series representations and remote sensing imagery.
Result: Evaluated on PASTIS (pixel-level forecasting/classification) and EuroSAT (land cover classification). Outperforms SOTA methods on all downstream tasks. 2D representations enhance SITS feature extraction, contrastive learning improves both pixel time series and RSI representations.
Conclusion: Multimodal method outperforms existing models, establishing robust self-supervision framework for processing both SITS and RSI. 2D recurrence plots provide more informative representations than raw pixel values.
Abstract: Satellites continuously generate massive volumes of data, particularly for Earth observation, including satellite image time series (SITS). However, most deep learning models are designed to process either entire images or complete time series sequences to extract meaningful features for downstream tasks. In this study, we propose a novel multimodal approach that leverages pixel-wise two-dimensional (2D) representations to encode visual property variations from SITS more effectively. Specifically, we generate recurrence plots from pixel-based vegetation index time series (NDVI, EVI, and SAVI) as an alternative to using raw pixel values, creating more informative representations. Additionally, we introduce PIxel-wise Multimodal Contrastive (PIMC), a new multimodal self-supervision approach that produces effective encoders based on two-dimensional pixel time series representations and remote sensing imagery (RSI). To validate our approach, we assess its performance on three downstream tasks: pixel-level forecasting and classification using the PASTIS dataset, and land cover classification on the EuroSAT dataset. Moreover, we compare our results to state-of-the-art (SOTA) methods on all downstream tasks. Our experimental results show that the use of 2D representations significantly enhances feature extraction from SITS, while contrastive learning improves the quality of representations for both pixel time series and RSI. These findings suggest that our multimodal method outperforms existing models in various Earth observation tasks, establishing it as a robust self-supervision framework for processing both SITS and RSI. Code available on
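A recurrence plot is a generic construction, so the 2D representation step can be sketched exactly: R[i, j] = 1 when two time points of the series lie within a threshold eps of each other. The thresholding rule below is a common default, not necessarily the paper's:

```python
import numpy as np

def recurrence_plot(series, eps=None):
    """Binary recurrence plot of a 1-D pixel time series (e.g., NDVI):
    R[i, j] = 1 when |x_i - x_j| < eps."""
    x = np.asarray(series, dtype=float)
    if eps is None:
        eps = 0.1 * (x.max() - x.min() + 1e-12)  # tie eps to series spread
    return (np.abs(x[:, None] - x[None, :]) < eps).astype(np.uint8)

ndvi = np.sin(np.linspace(0, 4 * np.pi, 64)) + 0.05 * np.random.randn(64)
rp = recurrence_plot(ndvi)  # a 64x64 2D image of one pixel's history
```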
[247] Diffusion-DRF: Differentiable Reward Flow for Video Diffusion Fine-Tuning
Yifan Wang, Yanyu Li, Sergey Tulyakov, Yun Fu, Anil Kag
Main category: cs.CV
TL;DR: Diffusion-DRF: A differentiable reward flow method that uses frozen VLMs as training-free critics to fine-tune video diffusion models without needing reward models or preference datasets.
Details
Motivation: Current DPO methods for T2V generation rely on non-differentiable preference signals from human annotations or learned reward models, which makes training label-intensive, bias-prone, and susceptible to reward hacking and unstable training.Method: Uses a frozen, off-the-shelf VLM as a training-free critic; backpropagates VLM feedback through the diffusion denoising chain; converts logit-level responses into token-aware gradients; employs automated aspect-structured prompting for multi-dimensional VLM feedback; uses gradient checkpointing for efficient updates through final denoising steps.
Result: Improves video quality and semantic alignment while mitigating reward hacking and collapse; works without additional reward models or preference datasets; is model-agnostic and generalizes to other diffusion-based generative tasks.
Conclusion: Diffusion-DRF provides a more stable and efficient approach to fine-tuning video diffusion models by leveraging frozen VLMs as differentiable critics, addressing limitations of current DPO methods while maintaining generalization capabilities.
Abstract: Direct Preference Optimization (DPO) has recently improved Text-to-Video (T2V) generation by enhancing visual fidelity and text alignment. However, current methods rely on non-differentiable preference signals from human annotations or learned reward models. This reliance makes training label-intensive, bias-prone, and easy-to-game, which often triggers reward hacking and unstable training. We propose Diffusion-DRF, a differentiable reward flow for fine-tuning video diffusion models using a frozen, off-the-shelf Vision-Language Model (VLM) as a training-free critic. Diffusion-DRF directly backpropagates VLM feedback through the diffusion denoising chain, converting logit-level responses into token-aware gradients for optimization. We propose an automated, aspect-structured prompting pipeline to obtain reliable multi-dimensional VLM feedback, while gradient checkpointing enables efficient updates through the final denoising steps. Diffusion-DRF improves video quality and semantic alignment while mitigating reward hacking and collapse – without additional reward models or preference datasets. It is model-agnostic and readily generalizes to other diffusion-based generative tasks.
[248] ToTMNet: FFT-Accelerated Toeplitz Temporal Mixing Network for Lightweight Remote Photoplethysmography
Vladimir Frants, Sos Agaian, Karen Panetta
Main category: cs.CV
TL;DR: ToTMNet: Lightweight rPPG architecture using FFT-accelerated Toeplitz temporal mixing instead of attention, achieving strong heart-rate estimation with only 63k parameters.
Details
Motivation: Current deep rPPG models have high computational costs and quadratic scaling with temporal length due to attention mechanisms. Need for efficient, lightweight alternatives that maintain accuracy.Method: Proposes ToTMNet with FFT-accelerated Toeplitz temporal mixing layer that provides full-sequence temporal receptive field with linear parameters. Combines local depthwise temporal convolution with gated global Toeplitz mixing in compact architecture.
Result: Achieves 1.055 bpm MAE with Pearson correlation 0.996 on UBFC-rPPG intra-dataset, and 1.582 bpm MAE with Pearson correlation 0.994 in synthetic-to-real setting (SCAMPS to UBFC-rPPG). Only 63k parameters.
Conclusion: Toeplitz-structured temporal mixing is a practical and efficient alternative to attention for rPPG, enabling strong performance with compact design. Gating mechanism crucial for domain shift robustness.
Abstract: Remote photoplethysmography (rPPG) estimates a blood volume pulse (BVP) waveform from facial videos captured by commodity cameras. Although recent deep models improve robustness compared to classical signal-processing approaches, many methods increase computational cost and parameter count, and attention-based temporal modeling introduces quadratic scaling with respect to the temporal length. This paper proposes ToTMNet, a lightweight rPPG architecture that replaces temporal attention with an FFT-accelerated Toeplitz temporal mixing layer. The Toeplitz operator provides full-sequence temporal receptive field using a linear number of parameters in the clip length and can be applied in near-linear time using circulant embedding and FFT-based convolution. ToTMNet integrates the global Toeplitz temporal operator into a compact gated temporal mixer that combines a local depthwise temporal convolution branch with gated global Toeplitz mixing, enabling efficient long-range temporal filtering while only having 63k parameters. Experiments on two datasets, UBFC-rPPG (real videos) and SCAMPS (synthetic videos), show that ToTMNet achieves strong heart-rate estimation accuracy with a compact design. On UBFC-rPPG intra-dataset evaluation, ToTMNet reaches 1.055 bpm MAE with Pearson correlation 0.996. In a synthetic-to-real setting (SCAMPS to UBFC-rPPG), ToTMNet reaches 1.582 bpm MAE with Pearson correlation 0.994. Ablation results confirm that the gating mechanism is important for effectively using global Toeplitz mixing, especially under domain shift. The main limitation of this preprint study is the use of only two datasets; nevertheless, the results indicate that Toeplitz-structured temporal mixing is a practical and efficient alternative to attention for rPPG.
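The FFT acceleration rests on a textbook identity: an n x n Toeplitz matrix embeds into a 2n circulant, whose matrix-vector product is a circular convolution computable with FFTs. A self-contained sketch of that core trick (the learned layer around it is simplified away):

```python
import torch

def toeplitz_matvec_fft(col, row, x):
    """Multiply an n x n Toeplitz matrix (first column `col`, first row
    `row`, with col[0] == row[0]) by x in O(n log n) via the standard
    2n circulant embedding."""
    n = x.shape[-1]
    # First column of the 2n circulant: [col, 0, reversed(row[1:])].
    c = torch.cat([col, col.new_zeros(1), row[1:].flip(0)])
    y = torch.fft.irfft(
        torch.fft.rfft(c) * torch.fft.rfft(torch.cat([x, x.new_zeros(n)])),
        n=2 * n)
    return y[:n]

# Sanity check against the dense Toeplitz matrix.
n = 8
col, row, x = torch.randn(n), torch.randn(n), torch.randn(n)
row[0] = col[0]
d = torch.arange(n)[:, None] - torch.arange(n)[None, :]
T = torch.where(d >= 0, col[d.clamp(min=0)], row[(-d).clamp(min=0)])
assert torch.allclose(T @ x, toeplitz_matvec_fft(col, row, x), atol=1e-5)
```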
[249] ImLoc: Revisiting Visual Localization with Image-based Representation
Xudong Jiang, Fangjinhua Wang, Silvano Galliani, Christoph Vogel, Marc Pollefeys
Main category: cs.CV
TL;DR: 2D image-based localization augmented with depth maps achieves state-of-the-art accuracy while maintaining easy map building and updating.
Details
Motivation: Existing visual localization methods have trade-offs: 2D image-based methods are easy to build/maintain but limited in geometric reasoning, while 3D structure-based methods are accurate but require centralized reconstruction and are hard to update.
Method: Augment 2D images with estimated depth maps to capture geometric structure, use dense matchers for matching, implement compact compression and GPU-accelerated LO-RANSAC for efficiency.
Result: Achieves new state-of-the-art accuracy on various standard benchmarks and outperforms existing memory-efficient methods at comparable map sizes.
Conclusion: Depth-augmented 2D image representation provides the best of both worlds: easy map building/maintenance like 2D methods with high accuracy approaching 3D methods, while being efficient in storage and computation.
Abstract: Existing visual localization methods are typically either 2D image-based, which are easy to build and maintain but limited in effective geometric reasoning, or 3D structure-based, which achieve high accuracy but require a centralized reconstruction and are difficult to update. In this work, we revisit visual localization with a 2D image-based representation and propose to augment each image with estimated depth maps to capture the geometric structure. Supported by the effective use of dense matchers, this representation is not only easy to build and maintain, but also achieves the highest accuracy in challenging conditions. With compact compression and a GPU-accelerated LO-RANSAC implementation, the whole pipeline is efficient in both storage and computation and allows for a flexible trade-off between accuracy and memory efficiency. Our method achieves a new state-of-the-art accuracy on various standard benchmarks and outperforms existing memory-efficient methods at comparable map sizes. Code will be available at https://github.com/cvg/Hierarchical-Localization.
[250] Choreographing a World of Dynamic Objects
Yanzhe Lyu, Chen Geng, Karthik Dharmarajan, Yunzhi Zhang, Hadi Alzayer, Shangzhe Wu, Jiajun Wu
Main category: cs.CV
TL;DR: CHORD is a universal generative pipeline for synthesizing 4D (3D+time) dynamic scenes by distilling Lagrangian motion information from 2D videos, enabling category-agnostic generation of diverse multi-body dynamics.
Details
Motivation: Traditional rule-based graphics pipelines for creating 4D scene dynamics are labor-intensive and category-specific, while learning-based methods require large datasets that may not cover all object categories of interest. There's a need for a universal, scalable approach to generate diverse 4D dynamics.
Method: Proposes a distillation-based pipeline that extracts rich Lagrangian motion information from the Eulerian representations of 2D videos using video generative models. The approach is universal, versatile, and category-agnostic.
Result: Demonstrates effectiveness in generating diverse multi-body 4D dynamics, shows advantages over existing methods, and demonstrates applicability in generating robotics manipulation policies.
Conclusion: CHORD provides a universal generative pipeline for synthesizing 4D dynamic scenes by leveraging video generative models to extract motion information from 2D videos, offering a scalable alternative to traditional rule-based and data-intensive learning approaches.
Abstract: Dynamic objects in our physical 4D (3D + time) world are constantly evolving, deforming, and interacting with other objects, leading to diverse 4D scene dynamics. In this paper, we present a universal generative pipeline, CHORD, for CHOReographing Dynamic objects and scenes and synthesizing this type of phenomena. Traditional rule-based graphics pipelines for creating these dynamics rely on category-specific heuristics and are labor-intensive and not scalable. Recent learning-based methods typically demand large-scale datasets, which may not cover all object categories of interest. Our approach instead inherits its universality from video generative models through a distillation-based pipeline that extracts the rich Lagrangian motion information hidden in the Eulerian representations of 2D videos. Our method is universal, versatile, and category-agnostic. We demonstrate its effectiveness by conducting experiments to generate a diverse range of multi-body 4D dynamics, show its advantage compared to existing methods, and demonstrate its applicability in generating robotics manipulation policies. Project page: https://yanzhelyu.github.io/chord
[251] Generalized Logit Adjustment: Calibrating Fine-tuned Models by Removing Label Bias in Foundation Models
Beier Zhu, Kaihua Tang, Qianru Sun, Hanwang Zhang
Main category: cs.CV
TL;DR: GLA (Generalized Logit Adjustment) addresses inherent biases in foundation models like CLIP by estimating and correcting class biases without access to pre-training data, achieving significant performance gains across multiple tasks.
Details
Motivation: Foundation models trained on imbalanced web-scale data have inherent biases toward frequent semantics, which persist even after fine-tuning or ensembling. Current methods overlook these biases, limiting zero-shot and downstream task performance.
Method: Proposes Generalized Logit Adjustment (GLA) with optimization-based bias estimation to debias foundation models. Unlike traditional long-tailed classification, GLA estimates biases without explicit access to pre-training data distribution.
Result: Achieves 1.5 pp accuracy gain on ImageNet, 1.4-4.6 pp average improvement on 11 few-shot datasets, and 2.4 pp gains on long-tailed classification tasks.
Conclusion: Addressing inherent biases in foundation models is crucial for improving performance. GLA effectively corrects these biases without requiring access to pre-training data, demonstrating significant improvements across diverse tasks.
Abstract: Foundation models like CLIP allow zero-shot transfer on various tasks without additional training data. Yet, the zero-shot performance is less competitive than a fully supervised one. Thus, to enhance the performance, fine-tuning and ensembling are also commonly adopted to better fit the downstream tasks. However, we argue that such prior work has overlooked the inherent biases in foundation models. Due to the highly imbalanced Web-scale training set, these foundation models are inevitably skewed toward frequent semantics, and thus the subsequent fine-tuning or ensembling is still biased. In this study, we systematically examine the biases in foundation models and demonstrate the efficacy of our proposed Generalized Logit Adjustment (GLA) method. Note that bias estimation in foundation models is challenging, as most pre-train data cannot be explicitly accessed like in traditional long-tailed classification tasks. To this end, GLA has an optimization-based bias estimation approach for debiasing foundation models. As our work resolves a fundamental flaw in the pre-training, the proposed GLA demonstrates significant improvements across a diverse range of tasks: it achieves 1.5 pp accuracy gains on ImageNet, a large average improvement (1.4-4.6 pp) on 11 few-shot datasets, and 2.4 pp gains on long-tailed classification. Code is available at https://github.com/BeierZhu/GLA.
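The adjustment step itself follows the classical logit-adjustment rule; a minimal sketch, assuming the log class-prior (which GLA must estimate by optimization, since the pre-training distribution is inaccessible) is already in hand:

```python
import torch

def adjust_logits(logits: torch.Tensor, log_prior: torch.Tensor, tau: float = 1.0):
    """Classical logit adjustment: subtract the scaled log class-prior so
    frequent classes stop dominating zero-shot predictions."""
    return logits - tau * log_prior

# Hypothetical usage with a CLIP-style zero-shot classifier.
logits = torch.randn(4, 1000)                                 # [batch, classes]
est_log_prior = torch.log_softmax(torch.randn(1000), dim=0)   # stands in for GLA's estimate
debiased = adjust_logits(logits, est_log_prior)
```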
[252] Efficient 3D affinely equivariant CNNs with adaptive fusion of augmented spherical Fourier-Bessel bases
Wenzhao Zhao, Steffen Albert, Barbara D. Wichtmann, Angelika Maurer, Ulrike Attenberger, Frank G. Zöllner, Jürgen Hesser
Main category: cs.CV
TL;DR: A novel 3D affine group equivariant CNN using adaptive Monte Carlo augmented spherical Fourier-Bessel filters achieves better equivariance and segmentation accuracy for volumetric images than existing methods.
Details
Motivation: Existing filter-decomposition-based group equivariant CNNs underperform for 3D medical images with dense textures due to parameter sharing and discrete transformation groups, limiting their effectiveness in modern deep architectures.
Method: Proposes a non-parameter-sharing continuous 3D affine group equivariant network using adaptive aggregation of Monte Carlo augmented spherical Fourier-Bessel filter bases, incorporating both angular and radial orthogonality for improved feature extraction.
Result: Experiments on four medical image segmentation datasets and two seismic datasets show superior affine group equivariance and segmentation accuracy compared to existing 3D group equivariant CNN layers, with significant improvements in training stability and data efficiency.
Conclusion: The proposed method effectively addresses limitations of existing group equivariant CNNs for volumetric images, achieving better performance through continuous affine equivariance and improved filter design with both angular and radial orthogonality.
Abstract: Filter-decomposition-based group equivariant convolutional neural networks (CNNs) have shown promising stability and data efficiency for 3D image feature extraction. However, these networks, which rely on parameter sharing and discrete transformation groups, often underperform in modern deep neural network architectures for processing volumetric images with dense 3D textures, such as the common 3D medical images. To address these limitations, this paper presents an efficient non-parameter-sharing continuous 3D affine group equivariant neural network for volumetric images. This network uses an adaptive aggregation of Monte Carlo augmented spherical Fourier-Bessel filter bases to improve the efficiency and flexibility of 3D group equivariant CNNs for volumetric data. Unlike existing methods that focus only on angular orthogonality in filter bases, the introduced spherical Fourier-Bessel filter basis incorporates both angular and radial orthogonality to improve feature extraction. Experiments on four medical image segmentation datasets and two seismic datasets show that the proposed methods achieve better affine group equivariance and superior segmentation accuracy than existing 3D group equivariant convolutional neural network layers, significantly improving the training stability and data efficiency of conventional CNN layers (at the 0.05 significance level). The code is available at https://github.com/ZhaoWenzhao/WMCSFB.
[253] Point Cloud Synthesis Using Inner Product Transforms
Ernst Röell, Bastian Rieck
Main category: cs.CV
TL;DR: A novel method for point cloud synthesis using inner products to encode geometrical-topological characteristics, achieving high quality results with orders of magnitude faster inference times than existing methods.
Details
Motivation: Point cloud synthesis remains challenging despite numerous complex machine learning models, creating a need for more efficient and expressive representations.
Method: Develops a novel encoding method that captures geometrical-topological characteristics of point clouds using inner products, creating a highly-efficient representation with provable expressivity properties that can be integrated into deep learning models.
Result: The method exhibits high quality performance in typical tasks like reconstruction, generation, and interpolation, with inference times orders of magnitude faster than existing methods.
Conclusion: The inner product-based encoding provides an efficient and expressive alternative to complex machine learning models for point cloud synthesis, enabling high-quality results with dramatically improved computational efficiency.
Abstract: Point cloud synthesis, i.e. the generation of novel point clouds from an input distribution, remains a challenging task, for which numerous complex machine learning models have been devised. We develop a novel method that encodes geometrical-topological characteristics of point clouds using inner products, leading to a highly-efficient point cloud representation with provable expressivity properties. Integrated into deep learning models, our encoding exhibits high quality in typical tasks like reconstruction, generation, and interpolation, with inference times orders of magnitude faster than existing methods.
[254] Difficulty Controlled Diffusion Model for Synthesizing Effective Training Data
Zerun Wang, Jiafeng Mao, Xueting Wang, Toshihiko Yamasaki
Main category: cs.CV
TL;DR: A method to generate valuable ‘hard samples’ for training by controlling learning difficulty during generation, achieving better performance with less synthetic data.
Details
Motivation: Current generative models for training data synthesis only generate 'easy samples' that align with common dataset features, missing rare 'hard samples' that are crucial for performance improvement. This leads to needing large volumes of synthetic data for limited gains.
Method: Incorporates learning difficulty as an additional conditioning signal in generative models, with a designed encoder structure and training-generation strategy to control sample difficulty while maintaining domain alignment.
Result: Achieves higher performance with lower generation cost - best performance with only 10% additional synthetic data, saving 63.4 GPU hours compared to previous SOTA on ImageNet. Also provides visualizations of category-specific hard factors for dataset analysis.
Conclusion: The method effectively generates valuable hard samples that yield significant performance improvements, offering an efficient alternative to large-scale synthetic data generation while providing insights into dataset characteristics.
Abstract: Generative models have become a powerful tool for synthesizing training data in computer vision tasks. Current approaches solely focus on aligning generated images with the target dataset distribution. As a result, they capture only the common features in the real dataset and mostly generate ‘easy samples’, which are already well learned by models trained on real data. In contrast, those rare ‘hard samples’, with atypical features but crucial for enhancing performance, cannot be effectively generated. Consequently, these approaches must synthesize large volumes of data to yield appreciable performance gains, yet the improvement remains limited. To overcome this limitation, we present a novel method that can learn to control the learning difficulty of samples during generation while also achieving domain alignment. Thus, it can efficiently generate valuable ‘hard samples’ that yield significant performance improvements for target tasks. This is achieved by incorporating learning difficulty as an additional conditioning signal in generative models, together with a designed encoder structure and training-generation strategy. Experimental results across multiple datasets show that our method can achieve higher performance with lower generation cost. Specifically, we obtain the best performance with only 10% additional synthetic data, saving 63.4 GPU hours of generation time compared to the previous SOTA on ImageNet. Moreover, our method provides insightful visualizations of category-specific hard factors, serving as a tool for analyzing datasets.
[255] BiPO: Bidirectional Partial Occlusion Network for Text-to-Motion Synthesis
Seong-Eun Hong, Soobin Lim, Juyeong Hwang, Minwook Chang, Hyeongyeop Kang
Main category: cs.CV
TL;DR: BiPO introduces a bidirectional autoregressive network with partial occlusion for text-to-motion synthesis, achieving SOTA results on HumanML3D and excelling in motion editing tasks.
Details
Motivation: Text-to-motion synthesis is challenging due to the complexity of coordinating full-body dynamics and capturing nuanced motion patterns over extended sequences that accurately reflect textual descriptions.
Method: BiPO integrates part-based generation with a bidirectional autoregressive architecture, using Partial Occlusion technique to probabilistically occlude certain motion part information during training to relax interdependency among body parts.
Result: BiPO achieves state-of-the-art performance on HumanML3D dataset, outperforming recent methods like ParCo, MoMask, and BAMM in FID scores and motion quality. It also excels in motion editing tasks.
Conclusion: BiPO demonstrates effectiveness in advancing text-to-motion synthesis and shows potential for practical applications, particularly in motion editing tasks.
Abstract: Generating natural and expressive human motions from textual descriptions is challenging due to the complexity of coordinating full-body dynamics and capturing nuanced motion patterns over extended sequences that accurately reflect the given text. To address this, we introduce BiPO, Bidirectional Partial Occlusion Network for Text-to-Motion Synthesis, a novel model that enhances text-to-motion synthesis by integrating part-based generation with a bidirectional autoregressive architecture. This integration allows BiPO to consider both past and future contexts during generation while enhancing detailed control over individual body parts without requiring ground-truth motion length. To relax the interdependency among body parts caused by the integration, we devise the Partial Occlusion technique, which probabilistically occludes certain motion part information during training. In our comprehensive experiments, BiPO achieves state-of-the-art performance on the HumanML3D dataset, outperforming recent methods such as ParCo, MoMask, and BAMM in terms of FID scores and overall motion quality. Notably, BiPO excels not only in the text-to-motion generation task but also in motion editing tasks that synthesize motion based on partially generated motion sequences and textual descriptions. These results reveal BiPO’s effectiveness in advancing text-to-motion synthesis and its potential for practical applications.
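A minimal sketch of the occlusion idea, with illustrative shapes and drop rate rather than BiPO's actual configuration:

```python
import torch

def partial_occlusion(part_tokens: torch.Tensor, p: float = 0.15) -> torch.Tensor:
    """Randomly drop each body part's motion information with probability p
    during training, relaxing inter-part dependencies.

    part_tokens: [B, num_parts, T, D] per-part motion features (assumed layout).
    """
    B, num_parts, _, _ = part_tokens.shape
    keep = (torch.rand(B, num_parts, 1, 1,
                       device=part_tokens.device) > p).to(part_tokens.dtype)
    return part_tokens * keep   # occluded parts contribute zeros
```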
[256] Video LLMs for Temporal Reasoning in Long Videos
Fawad Javed Fateh, Umer Ahmed, Hamza Khan, M. Zeeshan Zia, Quoc-Huy Tran
Main category: cs.CV
TL;DR: TemporalVLM is a video LLM for temporal reasoning in long videos using time-aware visual features with local-global encoding via BiLSTM, evaluated on a new industrial assembly dataset.
Details
Motivation: Need for video LLMs that can handle temporal reasoning and fine-grained understanding in long videos, particularly for industrial applications like assembly processes where temporal relationships are crucial.
Method: 1) Visual encoder divides videos into short-term clips with timestamps, fuses them into time-sensitive local features; 2) BiLSTM module aggregates local features into global representations; 3) First work to incorporate LSTMs into video LLMs.
Result: Outperforms previous methods on temporal reasoning tasks: dense video captioning, temporal video grounding, video highlight detection, and temporal action segmentation. Introduces IndustryASM dataset for industrial assembly evaluation.
Conclusion: TemporalVLM effectively addresses temporal reasoning in long videos through novel time-aware feature encoding with BiLSTM aggregation, demonstrating superior performance on multiple video understanding tasks with practical industrial applications.
Abstract: We introduce TemporalVLM, a video large language model (video LLM) for temporal reasoning and fine-grained understanding in long videos. Our approach includes a visual encoder for mapping a long-term video into features which are time-aware and contain both local and global cues. It first divides an input video into short-term clips, which are jointly encoded with timestamps and fused across overlapping temporal windows into time-sensitive local features. Next, the local features are passed through a bidirectional long short-term memory (BiLSTM) module for global feature aggregation. Moreover, to facilitate the evaluation of TemporalVLM, we present a large-scale long video dataset of industry assembly processes, namely IndustryASM, consisting of videos recorded on factory floors with actions and timestamps annotated by industrial engineers for time and motion studies and temporal action segmentation evaluation. Finally, extensive experiments show that TemporalVLM outperforms previous methods across temporal reasoning and fine-grained understanding tasks, i.e., dense video captioning, temporal video grounding, video highlight detection, and temporal action segmentation. To the best of our knowledge, our work is the first to incorporate LSTMs into video LLMs.
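The global aggregation step is a plain bidirectional LSTM over per-clip features; a minimal sketch with illustrative dimensions:

```python
import torch
import torch.nn as nn

class ClipAggregator(nn.Module):
    """Local-to-global step: per-clip features (already time-stamped and
    fused) are aggregated with a BiLSTM. Dimensions are illustrative,
    not the paper's."""
    def __init__(self, dim: int = 768, hidden: int = 384):
        super().__init__()
        self.bilstm = nn.LSTM(dim, hidden, batch_first=True,
                              bidirectional=True)

    def forward(self, clip_feats: torch.Tensor) -> torch.Tensor:
        # clip_feats: [B, num_clips, dim]
        global_feats, _ = self.bilstm(clip_feats)
        return global_feats                     # [B, num_clips, 2 * hidden]
```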
[257] PhysDepth: Plug-and-Play Physical Refinement for Monocular Depth Estimation in Challenging Environments
Kebin Peng, Haotang Li, Zhenyu Qi, Huashan Chen, Zi Wang, Wei Zhang, Sen He, Huanrui Yang, Qing Guo
Main category: cs.CV
TL;DR: PhysDepth improves monocular depth estimation in challenging conditions by incorporating physical priors based on atmospheric attenuation and Rayleigh scattering theory.
Details
Motivation: Current monocular depth estimation models struggle in challenging environments because they overlook robust physical information. The authors found that prediction error increases with atmospheric attenuation, indicating a fundamental weakness in existing approaches.
Method: Proposes PhysDepth framework with two key components: 1) Physical Prior Module (PPM) that uses Rayleigh Scattering theory to extract robust features from the high-SNR red channel, and 2) physics-derived Red Channel Attenuation Loss (RCA) that enforces learning of the Beer-Lambert law.
Result: Extensive evaluations demonstrate that PhysDepth achieves state-of-the-art accuracy in challenging conditions, effectively addressing the fragility of existing models in environments with atmospheric attenuation.
Conclusion: Incorporating physical priors based on atmospheric physics significantly improves monocular depth estimation robustness in challenging environments, providing a plug-and-play solution that can enhance existing SOTA backbones.
Abstract: State-of-the-art monocular depth estimation (MDE) models often struggle in challenging environments, primarily because they overlook robust physical information. To demonstrate this, we first conduct an empirical study by computing the covariance between a model’s prediction error and atmospheric attenuation. We find that the error of existing SOTAs increases with atmospheric attenuation. Based on this finding, we propose PhysDepth, a plug-and-play framework that solves this fragility by infusing physical priors into modern SOTA backbones. PhysDepth incorporates two key components: a Physical Prior Module (PPM) that leverages Rayleigh Scattering theory to extract robust features from the high-SNR red channel, and a physics-derived Red Channel Attenuation Loss (RCA) that compels the model to learn the Beer-Lambert law. Extensive evaluations demonstrate that PhysDepth achieves SOTA accuracy in challenging conditions.
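The law being enforced is Beer-Lambert attenuation, $I = I_0 e^{-\beta d}$. One plausible way such a consistency term could look, assuming unit source intensity and a scalar $\beta$ (the paper's actual RCA loss may be formulated differently):

```python
import torch
import torch.nn.functional as F

def red_channel_attenuation_loss(pred_depth: torch.Tensor,
                                 image: torch.Tensor,
                                 beta: float = 0.1,
                                 eps: float = 1e-6) -> torch.Tensor:
    """Consistency with Beer-Lambert attenuation, I = I0 * exp(-beta * d).

    image: [B, 3, H, W] in [0, 1]; channel 0 is the high-SNR red channel.
    Assuming I0 ~ 1, inverting the law gives a pseudo-depth that the
    predicted depth should agree with (up to the assumed scalar beta).
    """
    red = image[:, 0:1].clamp_min(eps)       # [B, 1, H, W]
    pseudo_depth = -torch.log(red) / beta    # inverted Beer-Lambert law
    return F.l1_loss(pred_depth, pseudo_depth)
```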
[258] A Novel Convolution and Attention Mechanism-based Model for 6D Object Pose Estimation
Alexander Du, Xiujin Liu
Main category: cs.CV
TL;DR: PoseLecTr is a graph-based encoder-decoder framework using Legendre convolution with attention for 6-DOF object pose estimation from RGB images, achieving competitive performance on standard datasets.
Details
Motivation: Conventional grid-structured convolutions in learning-based approaches have limitations in modeling higher-order and long-range dependencies among image features, especially in cluttered or occluded scenes. This paper addresses these limitations for more robust 6-DOF object pose estimation.
Method: PoseLecTr constructs a graph representation from image features where spatial relationships are explicitly modeled through graph connectivity. It incorporates a Legendre convolution layer for improved numerical stability in graph convolution, along with spatial-attention and self-attention distillation mechanisms to enhance feature selection.
Result: Experiments on LINEMOD, Occluded LINEMOD, and YCB-VIDEO datasets demonstrate competitive performance and consistent improvements across a wide range of objects and scene complexities.
Conclusion: The proposed PoseLecTr framework effectively addresses limitations of conventional grid-structured convolutions by leveraging graph representations with Legendre convolution and attention mechanisms, resulting in robust 6-DOF object pose estimation performance across various challenging scenarios.
Abstract: This paper proposes PoseLecTr, a graph-based encoder-decoder framework that integrates a novel Legendre convolution with attention mechanisms for six-degree-of-freedom (6-DOF) object pose estimation from monocular RGB images. Conventional learning-based approaches predominantly rely on grid-structured convolutions, which can limit their ability to model higher-order and long-range dependencies among image features, especially in cluttered or occluded scenes. PoseLecTr addresses this limitation by constructing a graph representation from image features, where spatial relationships are explicitly modeled through graph connectivity. The proposed framework incorporates a Legendre convolution layer to improve numerical stability in graph convolution, together with spatial-attention and self-attention distillation to enhance feature selection. Experiments conducted on the LINEMOD, Occluded LINEMOD, and YCB-VIDEO datasets demonstrate that our method achieves competitive performance and shows consistent improvements across a wide range of objects and scene complexities.
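For intuition, Legendre polynomials of a (rescaled) graph Laplacian can be applied with the standard three-term recurrence; a sketch of that basis computation only, not PoseLecTr's full learnable layer or its attention distillation:

```python
import torch

def legendre_graph_filter(L: torch.Tensor, x: torch.Tensor, coeffs) -> torch.Tensor:
    """Apply sum_k coeffs[k] * P_k(L) @ x using the Legendre recurrence
        (k + 1) P_{k+1}(z) = (2k + 1) z P_k(z) - k P_{k-1}(z).

    L: [N, N] rescaled graph Laplacian with spectrum in [-1, 1] (assumed);
    x: [N, D] node features; coeffs: list of >= 2 filter coefficients.
    """
    p_prev, p_curr = x, L @ x                    # P_0(L) x and P_1(L) x
    out = coeffs[0] * p_prev + coeffs[1] * p_curr
    for k in range(1, len(coeffs) - 1):
        p_next = ((2 * k + 1) * (L @ p_curr) - k * p_prev) / (k + 1)
        out = out + coeffs[k + 1] * p_next
        p_prev, p_curr = p_curr, p_next
    return out
```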
[259] FairT2I: Mitigating Social Bias in Text-to-Image Generation via Large Language Model-Assisted Detection and Attribute Rebalancing
Jinya Sakurai, Yuki Koyama, Issei Sato
Main category: cs.CV
TL;DR: FairT2I is a training-free framework that uses latent variable guidance and LLM-based bias detection to mitigate societal biases in text-to-image generation while maintaining image quality and diversity.
Details
Motivation: Text-to-image models trained on large uncurated datasets often reproduce societal biases, creating a need for effective debiasing approaches that don't require retraining.
Method: A mathematically principled latent variable guidance formulation that decomposes generative score functions, LLM-based bias detection to identify bias-prone categories, and attribute resampling with adjustable distributions.
Result: LLMs outperform human annotators in bias detection granularity, and FairT2I achieves superior bias mitigation and image diversity compared to baselines while preserving quality and prompt fidelity.
Conclusion: FairT2I provides a unified, flexible, and training-free framework for bias-aware text-to-image generation that subsumes existing approaches and enables real-time interactive debiasing.
Abstract: Text-to-image (T2I) models have advanced creative content generation, yet their reliance on large uncurated datasets often reproduces societal biases. We present FairT2I, a training-free and interactive framework grounded in a mathematically principled latent variable guidance formulation. This formulation decomposes the generative score function into attribute-conditioned components and reweights them according to a defined distribution, providing a unified and flexible mechanism for bias-aware generation that also subsumes many existing ad hoc debiasing approaches as special cases. Building upon this foundation, FairT2I incorporates (1) latent variable guidance as the core mechanism, (2) LLM-based bias detection to automatically infer bias-prone categories and attributes from text prompts as part of the latent structure, and (3) attribute resampling, which allows users to adjust or redefine the attribute distribution based on uniform, real-world, or user-specified statistics. The accompanying user interface supports this pipeline by enabling users to inspect detected biases, modify attributes or weights, and generate debiased images in real time. Experimental results show that LLMs outperform average human annotators in the number and granularity of detected bias categories and attributes. Moreover, FairT2I achieves superior performance to baseline models in both societal bias mitigation and image diversity, while preserving image quality and prompt fidelity.
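To make the decomposition concrete, here is a plausible reading of the latent variable guidance formulation (a paraphrase, not the paper's exact equations): marginalizing the image distribution over a latent attribute $a$ gives the score identity

$$\nabla_x \log p(x) \;=\; \sum_a p(a \mid x)\, \nabla_x \log p(x \mid a),$$

and bias-aware generation replaces the (biased) posterior weights $p(a \mid x)$ with a user-defined target distribution $q(a)$, guiding sampling with $\tilde{s}(x) = \sum_a q(a)\, \nabla_x \log p(x \mid a)$. Choosing $q$ as uniform, real-world, or user-specified statistics corresponds to the attribute resampling options described above.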
[260] Deflickering Vision-Based Occupancy Networks through Lightweight Spatio-Temporal Correlation
Fengcheng Yu, Haoran Xu, Canming Xia, Ziyang Zong, Guang Tan
Main category: cs.CV
TL;DR: OccLinker is a plugin framework that improves temporal consistency in vision-based occupancy networks for autonomous driving by efficiently integrating historical information to reduce flickering artifacts.
Details
Motivation: Existing vision-based occupancy networks suffer from temporal inconsistencies (flickering effects) that degrade 3D reconstruction quality and affect downstream decision-making in autonomous driving. Current solutions that incorporate historical information are computationally expensive and may introduce misaligned or redundant features that interfere with object detection.
Method: OccLinker is a plugin framework that can be integrated into existing VONs. It efficiently consolidates historical static and motion cues, learns sparse latent correlations with current features through a dual cross-attention mechanism, and generates correction occupancy components to refine base network predictions. The method also introduces a new temporal consistency metric to quantitatively measure flickering effects.
Result: Extensive experiments on two benchmark datasets demonstrate that OccLinker achieves superior performance with minimal computational overhead while effectively reducing flickering artifacts.
Conclusion: OccLinker provides an effective solution to the temporal inconsistency problem in vision-based occupancy networks, offering improved performance with low computational cost, making it suitable for real-world autonomous driving applications.
Abstract: Vision-based occupancy networks (VONs) provide an end-to-end solution for reconstructing 3D environments in autonomous driving. However, existing methods often suffer from temporal inconsistencies, manifesting as flickering effects that degrade temporal coherence and adversely affect downstream decision-making. While recent approaches incorporate historical information to alleviate this issue, they often incur high computational costs and may introduce misaligned or redundant features that interfere with object detection. We propose OccLinker, a novel plugin framework that can be easily integrated into existing VONs to improve performance. Our method efficiently consolidates historical static and motion cues, learns sparse latent correlations with current features through a dual cross-attention mechanism, and generates correction occupancy components to refine the base network predictions. In addition, we introduce a new temporal consistency metric to quantitatively measure flickering effects. Extensive experiments on two benchmark datasets demonstrate that our method achieves superior performance with minimal computational overhead while effectively reducing flickering artifacts.
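The paper's consistency metric is not detailed in this summary; as a rough stand-in, one common way to quantify flicker is frame-to-frame IoU stability of the predicted occupancy (an assumption for illustration, not OccLinker's definition):

```python
import torch

def temporal_consistency(occ_seq: torch.Tensor) -> torch.Tensor:
    """Illustrative consistency proxy: mean frame-to-frame IoU of binarized
    occupancy predictions; lower values indicate more flickering.

    occ_seq: [T, X, Y, Z] boolean occupancy over a clip.
    """
    inter = (occ_seq[1:] & occ_seq[:-1]).flatten(1).sum(-1).float()
    union = (occ_seq[1:] | occ_seq[:-1]).flatten(1).sum(-1).float()
    return (inter / union.clamp_min(1.0)).mean()
```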
[261] U-REPA: Aligning Diffusion U-Nets to ViTs
Yuchuan Tian, Hanting Chen, Mengyu Zheng, Yuchen Liang, Chao Xu, Yunhe Wang
Main category: cs.CV
TL;DR: U-REPA adapts representation alignment from DiT to U-Net diffusion models by addressing architectural differences through middle-stage alignment, upsampling, and manifold loss, achieving faster convergence and better image generation quality.
Details
Motivation: REPA has shown effectiveness in DiT training but hasn't been validated on U-Net architectures, which have faster convergence. Adapting REPA to U-Net faces challenges due to different block functionalities, spatial-dimension inconsistencies from downsampling, and space gaps between U-Net and ViT that hinder tokenwise alignment.
Method: U-REPA proposes three key solutions: 1) Align with U-Net’s middle stage (best option due to skip connections), 2) Upsample U-Net features after MLPs to handle spatial inconsistencies, 3) Use manifold loss instead of tokenwise similarity alignment to regularize relative similarity between samples.
Result: U-REPA achieves excellent generation quality and greatly accelerates convergence speed. With CFG guidance, it reaches FID < 1.5 in 200 epochs or 1M iterations on ImageNet 256×256, and needs only half the total epochs to outperform REPA under sd-vae-ft-ema.
Conclusion: U-REPA successfully adapts representation alignment to U-Net architectures, overcoming the unique challenges of U-Net-ViT alignment and demonstrating superior performance and faster convergence compared to the original REPA approach.
Abstract: Representation Alignment (REPA), which aligns Diffusion Transformer (DiT) hidden states with ViT visual encoders, has proven highly effective in DiT training, demonstrating superior convergence properties, but it has not been validated on the canonical diffusion U-Net architecture that shows faster convergence compared to DiTs. However, adapting REPA to U-Net architectures presents unique challenges: (1) different block functionalities necessitate revised alignment strategies; (2) spatial-dimension inconsistencies emerge from U-Net’s spatial downsampling operations; (3) space gaps between U-Net and ViT hinder the effectiveness of tokenwise alignment. To address these challenges, we propose \textbf{U-REPA}, a representation alignment paradigm that bridges U-Net hidden states and ViT features as follows: Firstly, we observe that, due to skip connections, the middle stage of U-Net is the best alignment option. Secondly, we propose upsampling U-Net features after passing them through MLPs. Thirdly, we observe difficulty when performing tokenwise similarity alignment, and further introduce a manifold loss that regularizes the relative similarity between samples. Experiments indicate that the resulting U-REPA achieves excellent generation quality and greatly accelerates convergence speed. With a CFG guidance interval, U-REPA reaches $FID<1.5$ in 200 epochs or 1M iterations on ImageNet 256 $\times$ 256, and needs only half the total epochs to perform better than REPA under sd-vae-ft-ema. Codes: https://github.com/YuchuanTian/U-REPA
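A minimal sketch of what a manifold-style loss can look like, matching sample-to-sample similarity structure rather than individual tokens (the exact U-REPA formulation may differ):

```python
import torch
import torch.nn.functional as F

def manifold_loss(unet_feats: torch.Tensor, vit_feats: torch.Tensor) -> torch.Tensor:
    """Match the relative similarity structure between samples in a batch.

    unet_feats: [B, ...] projected (and upsampled) U-Net features;
    vit_feats:  [B, ...] ViT encoder features. Shapes are illustrative.
    """
    u = F.normalize(unet_feats.flatten(1), dim=-1)
    v = F.normalize(vit_feats.flatten(1), dim=-1)
    return F.mse_loss(u @ u.T, v @ v.T)   # compare [B, B] similarity maps
```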
[262] PM4Bench: Benchmarking Large Vision-Language Models with Parallel Multilingual Multi-Modal Multi-task Corpus
Junyuan Gao, Jiahe Song, Jiang Wu, Runchuan Zhu, Guanlin Shen, Shasha Wang, Xingjian Wei, Haote Yang, Songyang Zhang, Weijia Li, Bin Wang, Dahua Lin, Lijun Wu, Conghui He
Main category: cs.CV
TL;DR: PM4Bench is a multilingual multimodal benchmark using parallel corpora across 10 languages with visually embedded text to fairly evaluate LVLM cross-lingual alignment and real-world multimodal understanding.
Details
Motivation: Current LVLM evaluation has two critical limitations: 1) non-parallel corpora conflate language capability gaps with dataset artifacts, preventing fair cross-lingual assessment; 2) disjointed multimodal inputs don't reflect real-world scenarios where text is embedded within visual contexts.
Method: Created PM4Bench, the first multilingual multi-modal multi-task benchmark using strictly parallel corpus across 10 languages. Introduced vision setting where textual queries are visually fused into images, forcing models to jointly “see, read, and think” rather than processing text and images separately.
Result: Evaluation of 10 LVLMs shows substantial performance drop in vision setting compared to standard inputs. Analysis reveals OCR capability is both a general bottleneck and contributes to cross-lingual performance disparities, suggesting multilingual OCR improvement is essential for advancing LVLM performance.
Conclusion: PM4Bench enables fair cross-lingual LVLM evaluation by eliminating content divergence. The vision setting reveals critical limitations in current models’ ability to process visually embedded text, highlighting the need for improved multilingual OCR capabilities in LVLMs.
Abstract: While Large Vision-Language Models (LVLMs) demonstrate promising multilingual capabilities, their evaluation is currently hindered by two critical limitations: (1) the use of non-parallel corpora, which conflates inherent language capability gaps with dataset artifacts, precluding a fair assessment of cross-lingual alignment; and (2) disjointed multimodal inputs, which deviate from real-world scenarios where most texts are embedded within visual contexts. To address these challenges, we propose PM4Bench, the first Multilingual Multi-Modal Multi-task Benchmark constructed on a strictly parallel corpus across 10 languages. By eliminating content divergence, our benchmark enables a fair comparison of model capabilities across different languages. We also introduce a vision setting where textual queries are visually fused into images, compelling models to jointly “see,” “read,” and “think”. Extensive evaluation of 10 LVLMs uncovers a substantial performance drop in the Vision setting compared to standard inputs. Further analysis reveals that OCR capability is not only a general bottleneck but also contributes to cross-lingual performance disparities, suggesting that improving multilingual OCR is essential for advancing LVLM performance. We will release PM4Bench at https://github.com/opendatalab/PM4Bench.
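The vision setting amounts to rendering the query into the pixels; an illustrative PIL sketch, with layout and font as assumptions rather than PM4Bench's actual rendering pipeline:

```python
from PIL import Image, ImageDraw

def fuse_query_into_image(image: Image.Image, query: str, pad: int = 60) -> Image.Image:
    """Paste the textual query into the image so the model must jointly
    'see', 'read', and 'think' rather than receive the text separately."""
    w, h = image.size
    canvas = Image.new("RGB", (w, h + pad), "white")   # room for the query
    canvas.paste(image, (0, pad))
    ImageDraw.Draw(canvas).text((10, 10), query, fill="black")
    return canvas
```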
[263] Boosting Resolution Generalization of Diffusion Transformers with Randomized Positional Encodings
Liang Hou, Cong Liu, Mingwu Zheng, Xin Tao, Pengfei Wan, Di Zhang, Kun Gai
Main category: cs.CV
TL;DR: RPE-2D: A novel 2D randomized positional encoding method for diffusion transformers that enables resolution generalization by focusing on patch order rather than absolute distances, allowing seamless high- and low-resolution image generation without multi-resolution training.
Details
Motivation: Current diffusion transformers struggle with resolution generalization due to positional encoding mismatches between training and inference. Existing interpolation/extrapolation methods don't fully solve this problem, limiting the ability to generate higher-resolution images without training at multiple resolutions.
Method: Proposes RPE-2D: 1) Independently samples positions along horizontal and vertical axes over expanded range during training, 2) Uses random resize-and-crop augmentation to strengthen order modeling, 3) Adds micro-conditioning to indicate cropping patterns, 4) Focuses on patch order rather than absolute distances.
Result: Achieves SOTA resolution generalization on ImageNet: outperforms competitive methods when trained at 256² and evaluated at 384²/512², and when trained at 512² and evaluated at 768²/1024². Also excels at low-resolution generation, multi-stage training acceleration, and multi-resolution inheritance.
Conclusion: RPE-2D effectively solves the positional encoding mismatch problem in diffusion transformers, enabling robust resolution generalization without requiring training at multiple resolutions, while maintaining strong performance across various resolution-related tasks.
Abstract: Resolution generalization in image generation tasks enables the production of higher-resolution images with lower training resolution overhead. However, a key obstacle for diffusion transformers in addressing this problem is the mismatch between positional encodings seen at inference and those used during training. Existing strategies such as positional encodings interpolation, extrapolation, or hybrids, do not fully resolve this mismatch. In this paper, we propose a novel two-dimensional randomized positional encodings, namely RPE-2D, that prioritizes the order of image patches rather than their absolute distances, enabling seamless high- and low-resolution generation without training on multiple resolutions. Concretely, RPE-2D independently samples positions along the horizontal and vertical axes over an expanded range during training, ensuring that the encodings used at inference lie within the training distribution and thereby improving resolution generalization. We further introduce a simple random resize-and-crop augmentation to strengthen order modeling and add micro-conditioning to indicate the applied cropping pattern. On the ImageNet dataset, RPE-2D achieves state-of-the-art resolution generalization performance, outperforming competitive methods when trained at $256^2$ and evaluated at $384^2$ and $512^2$, and when trained at $512^2$ and evaluated at $768^2$ and $1024^2$. RPE-2D also exhibits outstanding capabilities in low-resolution image generation, multi-stage training acceleration, and multi-resolution inheritance.
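The core sampling step is simple; a minimal sketch, where max_pos is an assumed expanded range and the paper's resize-and-crop augmentation and micro-conditioning are omitted:

```python
import torch

def sample_rpe2d_positions(h: int, w: int, max_pos: int = 4096) -> torch.Tensor:
    """Sample sorted positions independently per axis from an expanded range,
    so positions seen when generating at larger resolutions still lie inside
    the training distribution; only the order of patches is preserved."""
    ys = torch.sort(torch.randperm(max_pos)[:h]).values
    xs = torch.sort(torch.randperm(max_pos)[:w]).values
    # Patch (i, j) is encoded with the position pair (ys[i], xs[j]).
    return torch.stack(torch.meshgrid(ys, xs, indexing="ij"), dim=-1)  # [h, w, 2]
```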
[264] VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning
Zhong-Yu Li, Ruoyi Du, Juncheng Yan, Le Zhuo, Qilong Wu, Zhen Li, Peng Gao, Zhanyu Ma, Ming-Ming Cheng
Main category: cs.CV
TL;DR: VisualCloze is a universal image generation framework that uses visual in-context learning instead of language instructions, supported by Graph200K dataset and leveraging pre-trained infilling models’ architecture.
Details
Motivation: Current diffusion models are task-specific, limiting efficiency for diverse needs. Universal models face challenges with task instruction generalization, appropriate task distributions, and unified architectural design.
Method: Proposes VisualCloze with visual in-context learning (identifying tasks from visual demonstrations), Graph200K dataset (graph-structured with interrelated tasks), and leverages pre-trained infilling models without architecture modification.
Result: The framework supports a wide range of in-domain tasks, generalization to unseen tasks, unification of multiple tasks, and reverse generation by addressing task ambiguity and sparse task distribution issues.
Conclusion: VisualCloze provides a universal image generation solution that overcomes limitations of language-based task instruction and sparse task distributions through visual demonstrations and enhanced task density.
Abstract: Recent progress in diffusion models significantly advances various image generation tasks. However, the current mainstream approach remains focused on building task-specific models, which have limited efficiency when supporting a wide range of different needs. While universal models attempt to address this limitation, they face critical challenges, including generalizable task instruction, appropriate task distributions, and unified architectural design. To tackle these challenges, we propose VisualCloze, a universal image generation framework, which supports a wide range of in-domain tasks, generalization to unseen ones, unification of multiple tasks, and reverse generation. Unlike existing methods that rely on language-based task instruction, leading to task ambiguity and weak generalization, we integrate visual in-context learning, allowing models to identify tasks from visual demonstrations. Meanwhile, the inherent sparsity of visual task distributions hampers the learning of transferable knowledge across tasks. To this end, we introduce Graph200K, a graph-structured dataset that establishes various interrelated tasks, enhancing task density and transferable knowledge. Furthermore, we uncover that our unified image generation formulation shares a consistent objective with image infilling, enabling us to leverage the strong generative priors of pre-trained infilling models without modifying their architectures.
[265] VISTA: Mitigating Semantic Inertia in Video-LLMs via Training-Free Dynamic Chain-of-Thought Routing
Hongbo Jin, Jiayu Ding, Siyi Xie, Guibo Luo, Ge Li
Main category: cs.CV
TL;DR: VISTA is a training-free framework that addresses Semantic Inertia in Video-LLMs by aligning perception with logical deduction through dynamic inference routing and explicit textual anchors, achieving significant performance gains on video understanding benchmarks.
Details
Motivation: Current Video-LLMs suffer from Semantic Inertia - a cognitive misalignment where models suppress valid visual evidence in favor of dominant language priors, despite advances in System 2 reasoning for text-based LLMs.
Method: VISTA uses a training-free framework with dynamic inference path routing, materializes implicit visual features into explicit textual anchors to counterbalance parametric knowledge influence, and incorporates a Latent Reasoning Consensus mechanism to mitigate stochastic hallucinations.
Result: VISTA outperforms its base model by 9.3% on EgoSchema and 5.6% on VideoEspresso, achieving outstanding results across a wide range of benchmarks and rivaling or surpassing larger proprietary models.
Conclusion: The proposed VISTA framework effectively addresses Semantic Inertia in Video-LLMs through perception-logic alignment without requiring additional training, demonstrating that cognitive misalignment rather than perceptual limitations is the key challenge in video understanding.
Abstract: Recent advancements in Large Language Models have successfully transitioned towards System 2 reasoning, yet applying these paradigms to video understanding remains challenging. While prevailing research attributes failures in Video-LLMs to perceptual limitations, our empirical analysis reveals a cognitive misalignment termed Semantic Inertia, where models suppress valid visual evidence in favor of dominant language priors. To rectify this, we propose VISTA, a training-free framework designed to align perception with logical deduction. By dynamically routing inference paths and materializing implicit visual features into explicit textual anchors, our approach effectively counterbalances the influence of parametric knowledge. Furthermore, we incorporate a Latent Reasoning Consensus mechanism to mitigate stochastic hallucinations. VISTA shows outstanding results on a wide range of benchmarks and outperforms its base model by 9.3% on EgoSchema and 5.6% on VideoEspresso, rivalling or even surpassing larger and proprietary models. Our codebase will be publicly available soon.
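The consensus step is reminiscent of self-consistency decoding; a toy sketch of the voting idea only (VISTA's mechanism operates on latent reasoning paths and is not specified in this summary):

```python
from collections import Counter

def consensus_answer(sampled_answers):
    """Majority vote over answers from several sampled reasoning paths,
    damping stochastic hallucinations from any single path."""
    return Counter(sampled_answers).most_common(1)[0][0]

print(consensus_answer(["B", "B", "C", "B"]))   # -> "B"
```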
[266] Adapting Vision-Language Foundation Model for Next Generation Medical Ultrasound Image Analysis
Jingguo Qu, Xinyang Han, Jia Ai, Juan Wu, Tong Zhao, Tonghuan Xiao, Sheng Ning, Yuqi Yang, Jing Qin, Ann Dorothy King, Winnie Chiu-Wing Chu, Jing Cai, Michael Tin-Cheung Ying
Main category: cs.CV
TL;DR: A novel Hybrid-tuning strategy for adapting CLIP-based vision-language models to medical ultrasound analysis, addressing domain shift issues through frequency-domain filtering, noise estimation, and multi-scale feature aggregation.
Details
Motivation: Vision-Language Models struggle with medical ultrasound due to significant domain shift from natural images to sonographic data with unique physics like speckle noise, shadowing, and artifacts, leading to suboptimal performance of off-the-shelf foundation models.
Method: Proposes Hybrid-tuning strategy with lightweight adapter module integrated into frozen visual backbone, featuring frequency-domain filtering to suppress periodic artifacts and dynamic noise estimation for feature calibration. Includes specialized segmentation and classification heads with multi-scale feature aggregation to leverage pre-trained semantic priors.
Result: Extensive evaluations across six multi-center datasets (lymph nodes, breast, thyroid, prostate) show HT-enhanced models significantly outperform state-of-the-art methods including BiomedCLIP and standard LoRA fine-tuning, demonstrating superior data efficiency and robustness.
Conclusion: The approach enables practical foundational intelligence for automated ultrasound diagnosis by efficiently adapting VLMs to medical ultrasound domain, addressing unique physics challenges while maintaining pre-trained semantic knowledge.
Abstract: Vision-Language Models (VLMs) have demonstrated remarkable generalization capabilities, yet their application to medical ultrasound remains constrained by the significant domain shift between natural images and sonographic data. The unique physics of ultrasound, manifesting as speckle noise, shadowing, and variable artifacts, often leads to suboptimal performance when applying off-the-shelf foundation models. To address this, we propose a novel Hybrid-tuning (HT) strategy for the efficient adaptation of CLIP-based models to ultrasound analysis. Our method introduces a lightweight adapter module integrated into the frozen visual backbone, featuring frequency-domain filtering to suppress periodic artifacts and dynamic noise estimation to calibrate feature representations. Furthermore, we design specialized segmentation and classification heads that employ multi-scale feature aggregation to maximize the utility of pre-trained semantic priors. Extensive evaluations across six multi-center datasets (covering lymph nodes, breast, thyroid, and prostate) reveal that our HT-enhanced models significantly outperform existing state-of-the-art methods, including BiomedCLIP and standard LoRA fine-tuning. The results highlight the superior data efficiency and robustness of our approach, paving the way for practical, foundational intelligence in automated ultrasound diagnosis. The source code is available at https://github.com/jinggqu/NextGen-UIA.
[267] MoTE: Mixture of Ternary Experts for Memory-efficient Large Multimodal Models
Hongyu Wang, Jiayu Xu, Ruiping Wang, Yan Feng, Yitao Zhai, Peng Pei, Xunliang Cai, Xilin Chen
Main category: cs.CV
TL;DR: MoTE: A memory-efficient approach that trains ternary-valued experts (-1, 0, 1) instead of full-precision experts in multimodal MoE models, reducing memory footprint while maintaining performance.
Details
Motivation: Large multimodal MoE models with full-precision experts have high memory requirements that challenge deployment on edge devices. Current approaches use fewer high-precision experts, but this limits scaling potential.
Method: Train ternary experts with parameters constrained to {-1, 0, 1} values, using pre-trained FFN as shared expert. This allows training more low-precision experts during up-cycling rather than fewer high-precision ones.
Result: MoTE achieves comparable performance to full-precision MoE-LLaVA baseline while reducing memory footprint. With 3.4GB expert memory and post-training quantization, it outperforms MoE-LLaVA by 4.3% average accuracy on end tasks.
Conclusion: MoTE provides a scalable, memory-efficient approach for multimodal MoE models that works well with memory-constrained devices and is compatible with post-training quantization methods.
Abstract: Large multimodal Mixture-of-Experts (MoEs) effectively scale the model size to boost performance while maintaining fixed active parameters. However, previous works primarily utilized full-precision experts during sparse up-cycling. Although they show superior performance on end tasks, the large number of experts introduces a higher memory footprint, which poses significant challenges for deployment on edge devices. In this work, we propose MoTE, a scalable and memory-efficient approach to train Mixture-of-Ternary-Experts models from a dense checkpoint. Instead of training fewer high-precision experts, we propose to train more low-precision experts during up-cycling. Specifically, we use the pre-trained FFN as a shared expert and train ternary routed experts with parameters in {-1, 0, 1}. Extensive experiments show that our approach has a promising scaling trend along model size. MoTE achieves comparable performance to the full-precision baseline MoE-LLaVA while offering a lower memory footprint. Furthermore, our approach is compatible with post-training quantization methods, and the advantage is further amplified as the memory constraint tightens. Given the same expert memory footprint of 3.4GB and combined with post-training quantization, MoTE outperforms MoE-LLaVA by a gain of 4.3% average accuracy on end tasks, demonstrating its effectiveness and potential for memory-constrained devices.
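Ternary weights can be obtained with an absmean-style quantizer; the sketch below shows a common recipe (as popularized by BitNet b1.58), while MoTE's exact scheme is an assumption here:

```python
import torch

def ternarize(w: torch.Tensor, eps: float = 1e-8):
    """Absmean ternarization to {-1, 0, 1} with a per-tensor scale;
    the forward pass then uses scale * w_t in place of w."""
    scale = w.abs().mean().clamp_min(eps)
    w_t = (w / scale).round().clamp(-1, 1)
    return w_t, scale

w = torch.randn(256, 1024)
w_t, s = ternarize(w)
assert set(w_t.unique().tolist()) <= {-1.0, 0.0, 1.0}
```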
[268] Benchmarking Content-Based Puzzle Solvers on Corrupted Jigsaw Puzzles
Richard Dirauf, Florian Wolz, Dario Zanca, Björn Eskofier
Main category: cs.CV
TL;DR: The paper evaluates content-based puzzle solvers’ robustness to real-world corruptions like missing pieces and erosion, finding performance declines but showing deep learning models can improve through fine-tuning, with Positional Diffusion models performing best.
Details
Motivation: Current content-based puzzle solvers lack evaluation on realistic challenges needed for real-world applications like reassembling fragmented artefacts or shredded documents, which often involve various types of corruption.
Method: The study introduces three types of jigsaw puzzle corruptions (missing pieces, eroded edges, eroded contents) and evaluates both heuristic and deep learning-based solvers. It analyzes their ability to handle these corruptions and tests deep learning models’ robustness through fine-tuning with augmented data.
Result: Standard puzzle solvers show rapid performance decline with increasing corruption. Deep learning models can significantly improve robustness through fine-tuning, with Positional Diffusion models outperforming competitors in most experiments.
Conclusion: The research highlights promising directions for enhancing automated reconstruction of real-world artefacts by addressing corruption robustness in puzzle solvers, with deep learning fine-tuning showing particular promise.
Abstract: Content-based puzzle solvers have been extensively studied, demonstrating significant progress in computational techniques. However, their evaluation often lacks realistic challenges crucial for real-world applications, such as the reassembly of fragmented artefacts or shredded documents. In this work, we investigate the robustness of state-of-the-art content-based puzzle solvers by introducing three types of jigsaw puzzle corruptions: missing pieces, eroded edges, and eroded contents. Evaluating both heuristic and deep learning-based solvers, we analyse their ability to handle these corruptions and identify key limitations. Our results show that solvers developed for standard puzzles decline rapidly in performance as more pieces are corrupted. However, deep learning models can significantly improve their robustness through fine-tuning with augmented data. Notably, the advanced Positional Diffusion model adapts particularly well, outperforming its competitors in most experiments. Based on our findings, we highlight promising research directions for enhancing the automated reconstruction of real-world artefacts.
[269] Sortblock: Similarity-Aware Feature Reuse for Diffusion Model
Hanqi Chen, Xu Zhang, Xiaoliu Guan, Lielin Jiang, Guanzhong Wang, Zeyu Chen, Yi Liu
Main category: cs.CV
TL;DR: Sortblock is a training-free inference acceleration framework for Diffusion Transformers that dynamically caches block-wise features and selectively skips redundant computations to achieve over 2x speedup with minimal quality degradation.
Details
Motivation: Diffusion Transformers have high inference latency due to their sequential denoising process, limiting real-time deployment. Existing acceleration approaches overlook the evolving semantic focus across denoising stages and Transformer blocks.
Method: Sortblock dynamically caches block-wise features based on similarity across adjacent timesteps, ranks evolution of residuals to determine recomputation ratio, selectively skips redundant computations, and incorporates lightweight linear prediction to reduce accumulated errors.
Result: Extensive experiments across various tasks and DiT architectures demonstrate Sortblock achieves over 2x inference speedup with minimal degradation in output quality.
Conclusion: Sortblock offers an effective and generalizable training-free solution for accelerating diffusion-based generative models while preserving generation quality.
Abstract: Diffusion Transformers (DiTs) have demonstrated remarkable generative capabilities, particularly benefiting from Transformer architectures that enhance visual and artistic fidelity. However, their inherently sequential denoising process results in high inference latency, limiting their deployment in real-time scenarios. Existing training-free acceleration approaches typically reuse intermediate features at fixed timesteps or layers, overlooking the evolving semantic focus across denoising stages and Transformer blocks. To address this, we propose Sortblock, a training-free inference acceleration framework that dynamically caches block-wise features based on their similarity across adjacent timesteps. By ranking the evolution of residuals, Sortblock adaptively determines a recomputation ratio, selectively skipping redundant computations while preserving generation quality. Furthermore, we incorporate a lightweight linear prediction mechanism to reduce accumulated errors in skipped blocks. Extensive experiments across various tasks and DiT architectures demonstrate that Sortblock achieves over 2$\times$ inference speedup with minimal degradation in output quality, offering an effective and generalizable solution for accelerating diffusion-based generative models.
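A minimal sketch of the reuse decision and the lightweight prediction, with a fixed similarity threshold standing in for Sortblock's adaptive, ranking-based recomputation ratio:

```python
import torch
import torch.nn.functional as F

def should_skip(feat_prev: torch.Tensor, feat_curr: torch.Tensor,
                threshold: float = 0.99) -> bool:
    """Reuse a cached block output when it barely changed across adjacent
    timesteps; the threshold here is illustrative only."""
    sim = F.cosine_similarity(feat_prev.flatten(), feat_curr.flatten(), dim=0)
    return bool(sim > threshold)

def linear_predict(feat_prev: torch.Tensor, feat_curr: torch.Tensor) -> torch.Tensor:
    # Lightweight linear extrapolation for a skipped block's next output,
    # reducing the error accumulated by pure reuse.
    return 2 * feat_curr - feat_prev
```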
[270] ViMoNet: A Multimodal Vision-Language Framework for Human Behavior Understanding from Motion and Video
Rajan Das Gupta, Lei Wei, Md Yeasin Rahat, Nafiz Fahad, Abir Ahmed, Liew Tze Hui
Main category: cs.CV
TL;DR: ViMoNet is a multimodal vision-language framework that combines motion and video data for human behavior understanding, outperforming existing methods and showing potential for assistive healthcare applications.
Details
Motivation: Prior approaches using only motion or only video data have limitations in capturing both fine-grained motion dynamics and contextual semantics of human actions. Integrating these complementary modalities is essential for comprehensive human behavior understanding.
Method: Proposed ViMoNet framework with two-stage alignment and instruction-tuning strategy combining precise motion-text supervision with large-scale video-text data. Also introduced VIMOS dataset (motion sequences, videos, instruction annotations) and ViMoNet-Bench benchmark.
Result: ViMoNet consistently outperforms existing methods across caption generation, motion understanding, and human behavior interpretation tasks. Shows significant potential for assistive healthcare applications like elderly monitoring and fall detection.
Conclusion: The framework contributes to SDG 3 (Good Health and Well-being) by enabling accessible AI-driven tools that promote universal health coverage, reduce preventable health issues, and enhance overall well-being through multimodal human behavior understanding.
Abstract: This study investigates the use of large language models (LLMs) for human behavior understanding by jointly leveraging motion and video data. We argue that integrating these complementary modalities is essential for capturing both fine-grained motion dynamics and contextual semantics of human actions, addressing the limitations of prior motion-only or video-only approaches. To this end, we propose ViMoNet, a multimodal vision-language framework trained through a two-stage alignment and instruction-tuning strategy that combines precise motion-text supervision with large-scale video-text data. We further introduce VIMOS, a multimodal dataset comprising human motion sequences, videos, and instruction-level annotations, along with ViMoNet-Bench, a standardized benchmark for evaluating behavior-centric reasoning. Experimental results demonstrate that ViMoNet consistently outperforms existing methods across caption generation, motion understanding, and human behavior interpretation tasks. The proposed framework shows significant potential in assistive healthcare applications, such as elderly monitoring, fall detection, and early identification of health risks in aging populations. This work contributes to the United Nations Sustainable Development Goal 3 (SDG 3: Good Health and Well-being) by enabling accessible AI-driven tools that promote universal health coverage, reduce preventable health issues, and enhance overall well-being.
[271] Where MLLMs Attend and What They Rely On: Explaining Autoregressive Token Generation
Ruoyu Chen, Xiaoqing Guo, Kangwei Liu, Siyuan Liang, Shiming Liu, Qunli Zhang, Laiyuan Wang, Hua Zhang, Xiaochun Cao
Main category: cs.CV
TL;DR: EAGLE is a lightweight black-box framework for explaining token generation in multimodal LLMs, attributing tokens to visual regions while quantifying language vs. perceptual influence.
Details
Motivation: Current MLLMs lack understanding of how generated tokens depend on visual modalities, limiting interpretability and reliability. There's a need for better attribution methods to understand what tokens rely on visual evidence vs. language priors.
Method: EAGLE uses an objective function unifying sufficiency (insight score) and indispensability (necessity score), optimized via greedy search over sparsified image regions. It performs modality-aware analysis to disentangle token dependencies.
Result: EAGLE consistently outperforms existing methods in faithfulness, localization, and hallucination diagnosis across open-source MLLMs, while requiring substantially less GPU memory.
Conclusion: EAGLE provides an effective and practical framework for advancing MLLM interpretability through faithful attribution of token generation to visual regions and modality analysis.
Abstract: Multimodal large language models (MLLMs) have demonstrated remarkable capabilities in aligning visual inputs with natural language outputs. Yet, the extent to which generated tokens depend on visual modalities remains poorly understood, limiting interpretability and reliability. In this work, we present EAGLE, a lightweight black-box framework for explaining autoregressive token generation in MLLMs. EAGLE attributes any selected tokens to compact perceptual regions while quantifying the relative influence of language priors and perceptual evidence. The framework introduces an objective function that unifies sufficiency (insight score) and indispensability (necessity score), optimized via greedy search over sparsified image regions for faithful and efficient attribution. Beyond spatial attribution, EAGLE performs modality-aware analysis that disentangles what tokens rely on, providing fine-grained interpretability of model decisions. Extensive experiments across open-source MLLMs show that EAGLE consistently outperforms existing methods in faithfulness, localization, and hallucination diagnosis, while requiring substantially less GPU memory. These results highlight its effectiveness and practicality for advancing the interpretability of MLLMs.
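A rough sketch of the greedy search over sparsified regions described above, under assumed interfaces: `score_fn(mask)` stands in for the model's log-probability of the explained token given a masked image, and the sufficiency/necessity weighting is simplified to an unweighted sum rather than EAGLE's actual objective.

```python
import numpy as np

def greedy_attribution(num_regions, score_fn, budget=5):
    """Greedily grow a region set that is sufficient (high score when only
    it is kept) and indispensable (score collapses when it is removed)."""
    selected, remaining = [], list(range(num_regions))
    for _ in range(budget):
        best, best_gain = None, -np.inf
        for r in remaining:
            keep = np.zeros(num_regions, dtype=bool)
            keep[selected + [r]] = True
            insight = score_fn(keep)        # sufficiency of the kept set
            necessity = -score_fn(~keep)    # loss when the set is masked out
            if insight + necessity > best_gain:
                best, best_gain = r, insight + necessity
        selected.append(best)
        remaining.remove(best)
    return selected
```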
[272] UniVideo: Unified Understanding, Generation, and Editing for Videos
Cong Wei, Quande Liu, Zixuan Ye, Qiulin Wang, Xintao Wang, Pengfei Wan, Kun Gai, Wenhu Chen
Main category: cs.CV
TL;DR: UniVideo is a unified multimodal framework for video generation and editing that combines an MLLM for instruction understanding with an MMDiT for video synthesis, enabling diverse video tasks under a single instruction paradigm.
Details
Motivation: Current unified multimodal models are mostly limited to the image domain, lacking comprehensive video generation and editing capabilities. There's a need to extend unified modeling to video while maintaining instruction understanding and visual consistency.
Method: Dual-stream architecture with Multimodal Large Language Model (MLLM) for instruction interpretation and Multimodal DiT (MMDiT) for video generation. Joint training across diverse video tasks under unified multimodal instruction paradigm.
Result: Matches or surpasses state-of-the-art task-specific baselines in text/image-to-video generation, in-context video generation and editing. Enables task composition and generalization to unseen editing instructions without explicit training.
Conclusion: UniVideo successfully extends unified multimodal modeling to video domain, demonstrating strong performance, task composition capabilities, and generalization to novel editing scenarios while maintaining visual consistency.
Abstract: Unified multimodal models have shown promising results in multimodal content generation and editing but remain largely limited to the image domain. In this work, we present UniVideo, a versatile framework that extends unified modeling to the video domain. UniVideo adopts a dual-stream design, combining a Multimodal Large Language Model (MLLM) for instruction understanding with a Multimodal DiT (MMDiT) for video generation. This design preserves the MLLM’s original text generation capabilities, enables accurate interpretation of complex multimodal instructions, and maintains visual consistency in the generated content. Built on this architecture, UniVideo unifies diverse video generation and editing tasks under a single multimodal instruction paradigm and is jointly trained across them. Extensive experiments demonstrate that UniVideo matches or surpasses state-of-the-art task-specific baselines in text/image-to-video generation, in-context video generation and in-context video editing. Notably, the unified design of UniVideo enables two forms of generalization. First, UniVideo supports task composition, such as combining editing with style transfer, by integrating multiple capabilities within a single instruction. Second, even without explicit training on free-form video editing, UniVideo transfers its editing capability from large-scale image editing data to this setting, handling unseen instructions such as changing the environment or altering materials within a video. Beyond these core capabilities, UniVideo also supports visual-prompt-based video generation, where the MLLM interprets visual prompts and guides the MMDiT during synthesis. To foster future research, we released our model and code.
[273] Semantic-E2VID: a Semantic-Enriched Paradigm for Event-to-Video Reconstruction
Jingqian Wu, Yunbo Jia, Shengpeng Xu, Edmund Y. Lam
Main category: cs.CV
TL;DR: Semantic-E2VID improves event-to-video reconstruction by incorporating semantic information from pretrained vision models to address the inherent semantic under-determination of event streams.
Details
Motivation: Event streams lack object-level structure and contextual information due to their change-driven sensing mechanism, making existing temporal-spatial approaches insufficient for faithful reconstruction. The paper argues that effective E2V reconstruction requires explicit semantic modeling beyond just temporal and spatial signal recovery.
Method: Proposes Semantic-E2VID framework that: 1) performs semantic abstraction by bridging event representations with semantics from pretrained Segment Anything Model (SAM), 2) fuses learned semantics into event latent space in representation-compatible manner, and 3) introduces semantic-aware supervision to guide reconstruction toward semantically meaningful regions.
Result: Extensive experiments on six public benchmarks show Semantic-E2VID consistently outperforms state-of-the-art E2V methods.
Conclusion: Reformulating E2V reconstruction as a semantic learning, fusing and decoding process significantly improves reconstruction quality by addressing the semantic under-determination of event data.
Abstract: Event cameras provide a promising sensing modality for high-speed and high-dynamic-range vision by asynchronously capturing brightness changes. A fundamental task in event-based vision is event-to-video (E2V) reconstruction, which aims to recover intensity videos from event streams. Most existing E2V approaches formulate reconstruction as a temporal–spatial signal recovery problem, relying on temporal aggregation and spatial feature learning to infer intensity frames. While effective to some extent, this formulation overlooks a critical limitation of event data: due to the change-driven sensing mechanism, event streams are inherently semantically under-determined, lacking object-level structure and contextual information that are essential for faithful reconstruction. In this work, we revisit E2V from a semantic perspective and argue that effective reconstruction requires going beyond temporal and spatial modeling to explicitly account for missing semantic information. Based on this insight, we propose \textit{Semantic-E2VID}, a semantic-enriched end-to-end E2V framework that reformulates reconstruction as a process of semantic learning, fusing and decoding. Our approach first performs semantic abstraction by bridging event representations with semantics extracted from a pretrained Segment Anything Model (SAM), while avoiding modality-induced feature drift. The learned semantics are then fused into the event latent space in a representation-compatible manner, enabling event features to capture object-level structure and contextual cues. Furthermore, semantic-aware supervision is introduced to explicitly guide the reconstruction process toward semantically meaningful regions, complementing conventional pixel-level and temporal objectives. Extensive experiments on six public benchmarks demonstrate that Semantic-E2VID consistently outperforms state-of-the-art E2V methods.
[274] ProCLIP: Progressive Vision-Language Alignment via LLM-based Embedder
Xiaoxing Hu, Kaicheng Yang, Ziyang Gong, Qi Ming, Zonghao Guo, Yu Tian, Xiang An, Ziyong Feng, Xue Yang
Main category: cs.CV
TL;DR: ProCLIP is a progressive alignment framework that uses curriculum learning to align CLIP’s image encoder with an LLM-based text embedder, overcoming limitations of CLIP’s short text encoder while preserving its vision-language alignment.
Details
Motivation: CLIP's text encoder has limitations: 77-token max length restricts long text processing, lacks multilingual support, and hampers fine-grained semantic understanding. Directly replacing it with LLM-based embedders disrupts CLIP's existing vision-language alignment due to mismatched representation spaces.
Method: ProCLIP uses curriculum learning with two stages: 1) Knowledge distillation from CLIP’s text encoder to LLM-based embedder to establish initial alignment while leveraging CLIP’s pretrained knowledge, 2) Image-text contrastive tuning with self-distillation regularization to prevent overfitting. Uses instance semantic alignment loss and embedding structure alignment loss during both stages.
Result: The framework effectively aligns CLIP’s image encoder with LLM-based embedders while preserving CLIP’s original vision-language alignment, enabling better long-text processing, multilingual understanding, and fine-grained semantic comprehension.
Conclusion: ProCLIP provides an effective progressive alignment approach that overcomes CLIP’s text encoder limitations by integrating LLM capabilities while maintaining the valuable vision-language alignment learned during CLIP’s pretraining.
Abstract: The original CLIP text encoder is limited by a maximum input length of 77 tokens, which hampers its ability to effectively process long texts and perform fine-grained semantic understanding. In addition, the CLIP text encoder lacks support for multilingual inputs. All these limitations significantly restrict its applicability across a broader range of tasks. Recent studies have attempted to replace the CLIP text encoder with an LLM-based embedder to enhance its ability in processing long texts, multilingual understanding, and fine-grained semantic comprehension. However, because the representation spaces of LLMs and the vision-language space of CLIP are pretrained independently without alignment priors, direct alignment using contrastive learning can disrupt the intrinsic vision-language alignment in the CLIP image encoder, leading to an underutilization of the knowledge acquired during pre-training. To address this challenge, we propose ProCLIP, a curriculum learning-based progressive vision-language alignment framework to effectively align the CLIP image encoder with an LLM-based embedder. Specifically, ProCLIP first distills knowledge from CLIP’s text encoder into the LLM-based embedder to leverage CLIP’s rich pretrained knowledge while establishing initial alignment between the LLM embedder and CLIP image encoder. Subsequently, ProCLIP further aligns the CLIP image encoder with the LLM-based embedder through image-text contrastive tuning, employing self-distillation regularization to avoid overfitting. To achieve a more effective alignment, instance semantic alignment loss and embedding structure alignment loss are employed during representation inheritance and contrastive tuning. The Code is available at https://github.com/VisionXLab/ProCLIP.
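The two training stages reduce to two loss functions. The sketch below is a simplified reading, not the released code: `clip_txt`, `llm_emb`, and `clip_img` are assumed encoder callables, and the instance semantic alignment and embedding structure alignment losses are rendered as a cosine term and a pairwise-similarity-matching term.

```python
import torch
import torch.nn.functional as F

def stage1_distill_loss(texts, clip_txt, llm_emb):
    """Distill CLIP's text space into the LLM-based embedder."""
    t_clip = clip_txt(texts)   # frozen teacher embeddings
    t_llm = llm_emb(texts)     # trainable student embeddings
    # Instance semantic alignment: match each embedding pair directly.
    inst = 1 - F.cosine_similarity(t_llm, t_clip).mean()
    # Embedding structure alignment: match pairwise similarity structure.
    struct = F.mse_loss(t_llm @ t_llm.T, t_clip @ t_clip.T)
    return inst + struct

def stage2_contrastive_loss(images, texts, clip_img, llm_emb, tau=0.07):
    """Image-text contrastive tuning against the LLM-based embedder."""
    v = F.normalize(clip_img(images), dim=-1)
    t = F.normalize(llm_emb(texts), dim=-1)
    logits = v @ t.T / tau
    labels = torch.arange(logits.shape[0], device=logits.device)
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2
```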
[275] $\mathbf{S^2LM}$: Towards Semantic Steganography via Large Language Models
Huanqi Wu, Huangbiao Xu, Runfeng Xie, Jiaxin Cai, Kaixin Zhang, Xiao Ke
Main category: cs.CV
TL;DR: S²LM: A semantic steganography method using LLMs to hide arbitrary sentence-level messages in images, outperforming traditional bit-level approaches.
Details
Motivation: Current steganography struggles with embedding semantically rich, sentence-level information into carriers, creating a need for methods that can hide structured content like sentences or paragraphs.
Method: Proposes S²LM (Semantic Steganographic Language Model) that leverages large language models to embed high-level textual information into images, redesigning the entire pipeline to enable hiding and recovery of arbitrary sentences.
Result: Experimental results show S²LM effectively enables direct sentence recovery beyond bit-level steganography. Also introduces the Invisible Text (IVT) benchmark dataset for evaluating semantic steganography methods.
Conclusion: S²LM represents a novel approach to semantic steganography that successfully hides sentence-level messages in images using LLMs, with promising results and an accompanying benchmark dataset for future research.
Abstract: Despite remarkable progress in steganography, embedding semantically rich, sentence-level information into carriers remains a challenging problem. In this work, we present a novel concept of Semantic Steganography, which aims to hide semantically meaningful and structured content, such as sentences or paragraphs, in cover media. Based on this concept, we present Sentence-to-Image Steganography as an instance that enables the hiding of arbitrary sentence-level messages within a cover image. To accomplish this feat, we propose S²LM: Semantic Steganographic Language Model, which leverages large language models (LLMs) to embed high-level textual information into images. Unlike traditional bit-level approaches, S²LM redesigns the entire pipeline, involving the LLM throughout the process to enable the hiding and recovery of arbitrary sentences. Furthermore, we establish a benchmark named Invisible Text (IVT), comprising a diverse set of sentence-level texts as secret messages to evaluate semantic steganography methods. Experimental results demonstrate that S²LM effectively enables direct sentence recovery beyond bit-level steganography. The source code and IVT dataset will be released soon.
[276] Multivariate Diffusion Transformer with Decoupled Attention for High-Fidelity Mask-Text Collaborative Facial Generation
Yushe Cao, Dianxi Shi, Xing Fu, Xuechao Zou, Haikuo Peng, Xueqi Li, Chun Yu, Junliang Xing
Main category: cs.CV
TL;DR: MDiTFace is a diffusion transformer framework for multimodal facial generation that uses unified tokenization and decoupled attention to efficiently fuse semantic masks and text inputs, achieving superior results with 94% computational reduction.
Details
Motivation: Current multimodal facial generation methods using semantic masks and text descriptions suffer from ineffective cross-modal interactions due to conventional feature fusion approaches, leading to suboptimal generation quality.
Method: MDiTFace employs a unified tokenization strategy to process semantic masks and text inputs, uses stacked multivariate transformer blocks for synchronous condition processing, and introduces a novel decoupled attention mechanism that separates mask tokens from temporal embeddings into dynamic and static pathways.
Result: The framework reduces additional computational overhead from mask conditions by over 94% while maintaining performance, and significantly outperforms competing methods in both facial fidelity and conditional consistency.
Conclusion: MDiTFace effectively addresses cross-modal interaction challenges in facial generation through unified tokenization and efficient attention mechanisms, achieving state-of-the-art performance with dramatically reduced computational costs.
Abstract: While significant progress has been achieved in multimodal facial generation using semantic masks and textual descriptions, conventional feature fusion approaches often fail to enable effective cross-modal interactions, thereby leading to suboptimal generation outcomes. To address this challenge, we introduce MDiTFace, a customized diffusion transformer framework that employs a unified tokenization strategy to process semantic mask and text inputs, eliminating discrepancies between heterogeneous modality representations. The framework facilitates comprehensive multimodal feature interaction through stacked, newly designed multivariate transformer blocks that process all conditions synchronously. Additionally, we design a novel decoupled attention mechanism by dissociating implicit dependencies between mask tokens and temporal embeddings. This mechanism segregates internal computations into dynamic and static pathways, enabling caching and reuse of features computed in static pathways after initial calculation, thereby reducing the additional computational overhead introduced by the mask condition by over 94% while maintaining performance. Extensive experiments demonstrate that MDiTFace significantly outperforms other competing methods in terms of both facial fidelity and conditional consistency.
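The claimed 94% saving follows from the observation that mask tokens carry no timestep dependence, so their contribution to attention can be computed once and replayed at every denoising step. A minimal sketch of that static-pathway caching, with illustrative names (`StaticPathCache`, `to_k`, `to_v`) rather than the paper's implementation:

```python
import torch

class StaticPathCache:
    """Holds keys/values for condition tokens with no timestep dependence."""
    def __init__(self):
        self.kv = None

    def mask_kv(self, mask_tokens, to_k, to_v):
        # Mask tokens never see the timestep embedding, so their keys and
        # values are identical at every denoising step: compute them once.
        if self.kv is None:
            self.kv = (to_k(mask_tokens), to_v(mask_tokens))
        return self.kv

def decoupled_attention(q, mask_tokens, cache, to_k, to_v):
    k, v = cache.mask_kv(mask_tokens, to_k, to_v)
    scale = q.shape[-1] ** -0.5
    attn = torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)
    return attn @ v
```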
[277] Back to Basics: Let Denoising Generative Models Denoise
Tianhong Li, Kaiming He
Main category: cs.CV
TL;DR: The paper proposes JiT (Just image Transformers), a diffusion model that directly predicts clean images rather than noise, leveraging the manifold assumption that natural data lies on low-dimensional manifolds while noised data does not.
Details
Motivation: Current diffusion models predict noise or noised quantities rather than clean images, which contradicts classical denoising. The authors argue that predicting clean data is fundamentally different from predicting noised quantities because natural data lies on low-dimensional manifolds while noised data does not.
Method: JiT uses simple, large-patch Transformers on pixels without tokenizers, pre-training, or extra losses. The model directly predicts clean images by operating on the manifold assumption, allowing apparently under-capacity networks to work effectively in high-dimensional spaces.
Result: Competitive results on ImageNet at 256×256 and 512×512 resolutions with large patch sizes of 16 and 32, where traditional noise-predicting models can fail catastrophically.
Conclusion: Predicting clean data rather than noise is more effective for diffusion models, enabling simple Transformers to be strong generative models. The approach represents a self-contained paradigm for Transformer-based diffusion on raw natural data that goes “back to basics” by mapping back to the manifold.
Abstract: Today’s denoising diffusion models do not “denoise” in the classical sense, i.e., they do not directly predict clean images. Rather, the neural networks predict noise or a noised quantity. In this paper, we suggest that predicting clean data and predicting noised quantities are fundamentally different. According to the manifold assumption, natural data should lie on a low-dimensional manifold, whereas noised quantities do not. With this assumption, we advocate for models that directly predict clean data, which allows apparently under-capacity networks to operate effectively in very high-dimensional spaces. We show that simple, large-patch Transformers on pixels can be strong generative models: using no tokenizer, no pre-training, and no extra loss. Our approach is conceptually nothing more than “Just image Transformers”, or JiT, as we call it. We report competitive results using JiT with large patch sizes of 16 and 32 on ImageNet at resolutions of 256 and 512, where predicting high-dimensional noised quantities can fail catastrophically. With our networks mapping back to the basics of the manifold, our research goes back to basics and pursues a self-contained paradigm for Transformer-based diffusion on raw natural data.
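The distinction the paper draws is between two training targets for the same noised input. A simplified DDPM-style rendering of both objectives (the paper's exact parameterization may differ):

```python
import torch

def x_prediction_loss(model, x0, t, alpha_bar):
    """Train the network to output the clean image x0 directly."""
    eps = torch.randn_like(x0)
    a = alpha_bar[t].view(-1, 1, 1, 1)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps  # noised input
    x0_hat = model(x_t, t)                      # network predicts clean data
    return ((x0_hat - x0) ** 2).mean()

def eps_prediction_loss(model, x0, t, alpha_bar):
    """The conventional objective: predict the noise instead."""
    eps = torch.randn_like(x0)
    a = alpha_bar[t].view(-1, 1, 1, 1)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps
    eps_hat = model(x_t, t)                     # network predicts noise
    return ((eps_hat - eps) ** 2).mean()
```

Under the manifold assumption, the target of the first loss lies on a low-dimensional manifold while the target of the second does not, which is the paper's argument for why x-prediction tolerates large patches.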
[278] Generating Storytelling Images with Rich Chains-of-Reasoning
Xiujie Song, Qi Jia, Shota Watanabe, Xiaoyi Pang, Ruijie Chen, Mengyue Wu, Kenny Q. Zhu
Main category: cs.CV
TL;DR: Proposes StorytellingPainter, a two-stage pipeline using LLMs and T2I models to generate semantically rich storytelling images with logical visual reasoning chains.
Details
Motivation: Storytelling images that convey compelling stories through visual reasoning chains are valuable for applications like illustration and cognitive screening, but are scarce and complex to create manually.
Method: Two-stage pipeline: 1) LLMs generate story narratives, 2) Text-to-Image synthesis creates corresponding images. Also introduces lightweight Mini-Storytellers to bridge performance gaps between small and proprietary LLMs.
Result: Experimental results demonstrate feasibility of the approach. Dedicated evaluation framework assesses semantic complexity, diversity, and text-image alignment.
Conclusion: Proposes Storytelling Image Generation task and shows StorytellingPainter can effectively generate semantically rich storytelling images, addressing scarcity and complexity issues in manual creation.
Abstract: A single image can convey a compelling story through logically connected visual clues, forming Chains-of-Reasoning (CoRs). We define these semantically rich images as Storytelling Images. By conveying multi-layered information that inspires active interpretation, these images enable a wide range of applications, such as illustration and cognitive screening. Despite their potential, such images are scarce and complex to create. To address this, we introduce the Storytelling Image Generation task and propose StorytellingPainter, a two-stage pipeline combining the reasoning of Large Language Models (LLMs) with Text-to-Image (T2I) synthesis. We also develop a dedicated evaluation framework assessing semantic complexity, diversity, and text-image alignment. Furthermore, given the critical role of story generation in the task, we introduce lightweight Mini-Storytellers to bridge the performance gap between small-scale and proprietary LLMs. Experimental results demonstrate the feasibility of our approaches.
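At its core the pipeline is two calls. The skeleton below is a hypothetical rendering, with `generate_story` and `synthesize_image` as placeholders for an LLM call and a T2I model call; the actual prompts and clue-handling logic are not from the paper.

```python
def storytelling_image(theme, generate_story, synthesize_image):
    """Stage 1: draft a narrative with linked visual clues; Stage 2: render it."""
    story = generate_story(
        f"Write a short story about '{theme}' whose plot can be inferred "
        f"from a single image through a chain of visual clues. "
        f"End with a numbered list of the clues."
    )
    # The narrative and its clue list become one dense text-to-image prompt.
    prompt = f"A single detailed scene that visually encodes: {story}"
    return synthesize_image(prompt)
```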
[279] Do We Need Reformer for Vision? An Experimental Comparison with Vision Transformers
Ali El Bellaj, Mohammed-Amine Cheddadi, Rhassan Berber
Main category: cs.CV
TL;DR: Reformer-based vision model reduces theoretical complexity from O(n²) to O(n log n) using LSH attention, but ViT outperforms it in practical efficiency for typical high-resolution images.
Details
Motivation: Standard Vision Transformers (ViTs) are computationally expensive due to quadratic scaling of global self-attention, limiting practicality for high-resolution inputs and resource-constrained settings.
Method: Combine patch-based tokenization with locality-sensitive hashing (LSH) attention from Reformer architecture to approximate global self-attention while reducing theoretical time complexity.
Result: Reformer achieves higher accuracy on CIFAR-10 than ViT baseline, but ViT consistently outperforms Reformer in practical efficiency and end-to-end computation time on larger datasets (ImageNet-100) and high-resolution medical imaging.
Conclusion: Despite theoretical advantages of LSH-based attention, meaningful computation gains require sequence lengths substantially longer than those produced by typical high-resolution images, making ViT more practical for current vision tasks.
Abstract: Transformers have recently demonstrated strong performance in computer vision, with Vision Transformers (ViTs) leveraging self-attention to capture both low-level and high-level image features. However, standard ViTs remain computationally expensive, since global self-attention scales quadratically with the number of tokens, which limits their practicality for high-resolution inputs and resource-constrained settings. In this work, we investigate the Reformer architecture as an alternative vision backbone. By combining patch-based tokenization with locality-sensitive hashing (LSH) attention, our model approximates global self-attention while reducing its theoretical time complexity from $\mathcal{O}(n^2)$ to $\mathcal{O}(n \log n)$ in the sequence length $n$. We evaluate the proposed Reformer-based vision model on CIFAR-10 to assess its behavior on small-scale datasets, on ImageNet-100 to study its accuracy–efficiency trade-off in a more realistic setting, and on a high-resolution medical imaging dataset to evaluate the model under longer token sequences. While the Reformer achieves higher accuracy on CIFAR-10 compared to our ViT-style baseline, the ViT model consistently outperforms the Reformer in our experiments in terms of practical efficiency and end-to-end computation time across the larger and higher-resolution settings. These results suggest that, despite the theoretical advantages of LSH-based attention, meaningful computation gains require sequence lengths substantially longer than those produced by typical high-resolution images.
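For readers unfamiliar with LSH attention, a toy version of the bucketing behind the $\mathcal{O}(n \log n)$ behavior: random-projection (angular) hashing groups similar query/key vectors, and attention is computed only within each bucket. This omits Reformer's multi-round hashing, chunking, and causal masking, so it is a sketch of the principle, not the architecture.

```python
import torch

def lsh_buckets(x, n_buckets, seed=0):
    """Angular LSH: project onto random directions, take the argmax bucket."""
    g = torch.Generator().manual_seed(seed)
    proj = torch.randn(x.shape[-1], n_buckets // 2, generator=g)
    h = x @ proj
    return torch.cat([h, -h], dim=-1).argmax(dim=-1)  # bucket id per token

def bucketed_attention(qk, v, n_buckets=8):
    """Shared query/key vectors, as in Reformer; attend within buckets only."""
    buckets = lsh_buckets(qk, n_buckets)
    out = torch.zeros_like(v)
    scale = qk.shape[-1] ** -0.5
    for b in buckets.unique():
        idx = (buckets == b).nonzero(as_tuple=True)[0]
        q = qk[idx]
        attn = torch.softmax(q @ q.T * scale, dim=-1)
        out[idx] = attn @ v[idx]
    return out
```

The paper's empirical point is visible here: the bookkeeping only pays off when n is large enough that per-bucket attention beats one dense n-by-n product, which typical image token counts do not reach.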
[280] From Human Intention to Action Prediction: Intention-Driven End-to-End Autonomous Driving
Huan Zheng, Yucheng Zhou, Tianyi Yan, Jiayi Su, Hongjun Chen, Dubing Chen, Xingtai Gui, Wencheng Han, Runzhou Tao, Zhongying Qiu, Jianfei Yang, Jianbing Shen
Main category: cs.CV
TL;DR: This paper introduces Intention-Drive, a benchmark for intention-driven autonomous driving with natural language intentions, proposes Imagined Future Alignment evaluation metric, and explores two solution paradigms.
Details
Motivation: Current autonomous driving systems are limited to simple command-following and lack the ability to interpret high-level human intentions. There's a need for benchmarks and semantic-aware evaluation metrics to advance towards genuinely intelligent agents.
Method: 1) Formal definition of Intention-Driven End-to-End Autonomous Driving task; 2) Creation of Intention-Drive benchmark with large-scale dataset of natural language intentions paired with sensor data; 3) Introduction of Imagined Future Alignment (IFA) evaluation protocol using generative world models; 4) Proposal of two solution paradigms: end-to-end vision-language planner and hierarchical agent-based framework.
Result: Existing models show satisfactory driving stability but struggle significantly with intention fulfillment. The proposed frameworks demonstrate superior alignment with human intentions compared to existing approaches.
Conclusion: The paper addresses a critical gap in autonomous driving by providing a benchmark and evaluation framework for intention-driven systems, revealing current limitations and proposing promising solution directions for achieving genuine intelligence in autonomous agents.
Abstract: While end-to-end autonomous driving has achieved remarkable progress in geometric control, current systems remain constrained by a command-following paradigm that relies on simple navigational instructions. Transitioning to genuinely intelligent agents requires the capability to interpret and fulfill high-level, abstract human intentions. However, this advancement is hindered by the lack of dedicated benchmarks and semantic-aware evaluation metrics. In this paper, we formally define the task of Intention-Driven End-to-End Autonomous Driving and present Intention-Drive, a comprehensive benchmark designed to bridge this gap. We construct a large-scale dataset featuring complex natural language intentions paired with high-fidelity sensor data. To overcome the limitations of conventional trajectory-based metrics, we introduce the Imagined Future Alignment (IFA), a novel evaluation protocol leveraging generative world models to assess the semantic fulfillment of human goals beyond mere geometric accuracy. Furthermore, we explore the solution space by proposing two distinct paradigms: an end-to-end vision-language planner and a hierarchical agent-based framework. The experiments reveal a critical dichotomy where existing models exhibit satisfactory driving stability but struggle significantly with intention fulfillment. Notably, the proposed frameworks demonstrate superior alignment with human intentions.
[281] I-Scene: 3D Instance Models are Implicit Generalizable Spatial Learners
Lu Ling, Yunhao Ge, Yichen Sheng, Aniket Bera
Main category: cs.CV
TL;DR: A method that reprograms pre-trained 3D instance generators to learn scene-level spatial relations, enabling generalization to unseen layouts and novel object compositions without dataset-bounded supervision.
Details
Motivation: Existing learning-based 3D scene generation approaches are limited by dataset-bounded supervision, restricting generalization to new layouts and object compositions. The paper aims to overcome this by leveraging transferable spatial knowledge from pre-trained 3D instance generators.
Method: Reprogram a pre-trained 3D instance generator to act as a scene-level learner, replacing dataset supervision with model-centric spatial supervision. Uses a view-centric formulation of scene space instead of canonical space, creating a fully feed-forward generalizable scene generator that learns spatial relations directly from the instance model.
Result: The approach enables generalization to unseen layouts and novel object compositions. Spatial reasoning emerges even when training scenes are randomly composed objects, demonstrating that the generator’s transferable scene prior provides rich learning signals for inferring proximity, support, and symmetry from geometric cues.
Conclusion: 3D instance generators are implicit spatial learners and reasoners, pointing toward foundation models for interactive 3D scene understanding and generation. The method unlocks transferable spatial knowledge from pre-trained models for scene-level generalization.
Abstract: Generalization remains the central challenge for interactive 3D scene generation. Existing learning-based approaches ground spatial understanding in limited scene datasets, restricting generalization to new layouts. We instead reprogram a pre-trained 3D instance generator to act as a scene-level learner, replacing dataset-bounded supervision with model-centric spatial supervision. This reprogramming unlocks the generator's transferable spatial knowledge, enabling generalization to unseen layouts and novel object compositions. Remarkably, spatial reasoning still emerges even when the training scenes are randomly composed objects. This demonstrates that the generator’s transferable scene prior provides a rich learning signal for inferring proximity, support, and symmetry from purely geometric cues. Replacing the widely used canonical space, we instantiate this insight with a view-centric formulation of the scene space, yielding a fully feed-forward, generalizable scene generator that learns spatial relations directly from the instance model. Quantitative and qualitative results show that a 3D instance generator is an implicit spatial learner and reasoner, pointing toward foundation models for interactive 3D scene understanding and generation. Project page: https://luling06.github.io/I-Scene-project/
[282] Generative Refocusing: Flexible Defocus Control from a Single Image
Chun-Wei Tuan Mu, Jia-Bin Huang, Yu-Lun Liu
Main category: cs.CV
TL;DR: Generative Refocusing: A two-step method using DeblurNet and BokehNet for single-image refocusing with semi-supervised training combining synthetic and real data.
Details
Motivation: Depth-of-field control is essential in photography but difficult to achieve with single images. Current methods require all-in-focus inputs, rely on synthetic data, and have limited aperture control.
Method: Two-step process: 1) DeblurNet recovers all-in-focus images from various inputs, 2) BokehNet creates controllable bokeh. Uses semi-supervised training combining synthetic paired data with unpaired real bokeh images, leveraging EXIF metadata to capture real optical characteristics.
Result: Achieves state-of-the-art performance in defocus deblurring, bokeh synthesis, and refocusing benchmarks. Enables text-guided adjustments and custom aperture shapes.
Conclusion: Generative Refocusing overcomes limitations of current methods by combining synthetic and real data, providing better optical realism and flexible control over refocusing effects.
Abstract: Depth-of-field control is essential in photography, but getting the perfect focus often takes several tries or special equipment. Single-image refocusing is still difficult. It involves recovering sharp content and creating realistic bokeh. Current methods have significant drawbacks. They need all-in-focus inputs, depend on synthetic data from simulators, and have limited control over aperture. We introduce Generative Refocusing, a two-step process that uses DeblurNet to recover all-in-focus images from various inputs and BokehNet for creating controllable bokeh. Our main innovation is semi-supervised training. This method combines synthetic paired data with unpaired real bokeh images, using EXIF metadata to capture real optical characteristics beyond what simulators can provide. Our experiments show we achieve top performance in defocus deblurring, bokeh synthesis, and refocusing benchmarks. Additionally, our Generative Refocusing allows text-guided adjustments and custom aperture shapes.
[283] Deep But Reliable: Advancing Multi-turn Reasoning for Thinking with Images
Wenhao Yang, Yu Xia, Jinlong Huang, Shiyin Lu, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, Yuanyu Wan, Lijun Zhang
Main category: cs.CV
TL;DR: DRIM enables deep, reliable multi-turn reasoning in vision-language models by combining cold-start supervised fine-tuning with redundancy-penalized reinforcement learning to develop self-reflective reasoning patterns.
Details
Motivation: Existing vision-language models struggle to reflect on and correct incorrect reasoning trajectories when thinking with images in their multimodal chain-of-thought, limiting their reliability in complex visual tasks.
Method: Three-stage pipeline: 1) Data construction using high-resolution images with verifiable visual QA pairs requiring multi-turn tool calls; 2) Cold-start supervised fine-tuning using tool trajectories; 3) Reinforcement learning with redundancy-penalized policy optimization that penalizes incorrect answers without sufficient multi-scale exploration.
Result: DRIM achieves superior performance on visual understanding benchmarks through its self-reflective reasoning capabilities.
Conclusion: The proposed DRIM framework successfully addresses the limitation of existing VLMs by enabling deep, reliable multi-turn reasoning with self-reflection capabilities when thinking with images in multimodal chain-of-thought.
Abstract: Recent advances in large Vision-Language Models (VLMs) have exhibited strong reasoning capabilities on complex visual tasks by thinking with images in their Chain-of-Thought (CoT), which is achieved by actively invoking tools to analyze visual inputs rather than merely perceiving them. However, existing models often struggle to reflect on and correct themselves when attempting incorrect reasoning trajectories. To address this limitation, we propose DRIM, a model that enables deep but reliable multi-turn reasoning when thinking with images in its multimodal CoT. Our pipeline comprises three stages: data construction, cold-start SFT and RL. Based on a high-resolution image dataset, we construct high-difficulty and verifiable visual question-answer pairs, where solving each task requires multi-turn tool calls to reach the correct answer. In the SFT stage, we collect tool trajectories as cold-start data, guiding a multi-turn reasoning pattern. In the RL stage, we introduce redundancy-penalized policy optimization, which incentivizes the model to develop a self-reflective reasoning pattern. The basic idea is to impose judgment on reasoning trajectories and penalize those that produce incorrect answers without sufficient multi-scale exploration. Extensive experiments demonstrate that DRIM achieves superior performance on visual understanding benchmarks.
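The redundancy penalty can be read as a shaped reward. The sketch below is a loose, hypothetical interpretation: the thresholds, weights, and tool-call schema (`name`, `zoom_level`) are all invented for illustration and are not the paper's values.

```python
def drim_reward(correct, tool_calls, min_scales=2, penalty=0.5):
    """Reward an answer; extra-penalize wrong answers produced without
    multi-scale exploration of the image."""
    zooms = {c["zoom_level"] for c in tool_calls if c["name"] == "crop"}
    explored = len(zooms) >= min_scales
    if correct:
        return 1.0
    # Wrong and under-explored: the redundancy penalty discourages confident
    # answers given without sufficient multi-scale inspection.
    return -1.0 - (0.0 if explored else penalty)
```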
[284] Plasticine: A Traceable Diffusion Model for Medical Image Translation
Tianyang Zhang, Xinxing Cheng, Jun Cheng, Shaoming Zheng, He Zhao, Huazhu Fu, Alejandro F Frangi, Jiang Liu, Jinming Duan
Main category: cs.CV
TL;DR: Plasticine is the first end-to-end image-to-image translation framework with traceability as a core objective, combining intensity translation and spatial transformation in a denoising diffusion framework for medical images.
Details
Motivation: Domain gaps from different imaging devices and populations challenge medical image analysis. Existing methods generate diverse synthetic data but overlook spatial correspondence and traceability, the ability to provide pixel-level correspondences between original and translated images, which is crucial for clinical interpretability.
Method: Plasticine combines intensity translation and spatial transformation within a denoising diffusion framework. This enables generation of synthetic images with interpretable intensity transitions and spatially coherent deformations, supporting pixel-wise traceability throughout the translation process.
Result: The paper proposes the first end-to-end image-to-image translation framework explicitly designed with traceability as a core objective, addressing a significant gap in existing methods.
Conclusion: Plasticine represents an important advancement in medical image translation by prioritizing traceability alongside image generation, enhancing clinical interpretability through pixel-level correspondence preservation.
Abstract: Domain gaps arising from variations in imaging devices and population distributions pose significant challenges for machine learning in medical image analysis. Existing image-to-image translation methods primarily aim to learn mappings between domains, often generating diverse synthetic data with variations in anatomical scale and shape, but they usually overlook spatial correspondence during the translation process. For clinical applications, traceability, defined as the ability to provide pixel-level correspondences between original and translated images, is equally important. This property enhances clinical interpretability but has been largely overlooked in previous approaches. To address this gap, we propose Plasticine, which is, to the best of our knowledge, the first end-to-end image-to-image translation framework explicitly designed with traceability as a core objective. Our method combines intensity translation and spatial transformation within a denoising diffusion framework. This design enables the generation of synthetic images with interpretable intensity transitions and spatially coherent deformations, supporting pixel-wise traceability throughout the translation process.
[285] SpatialTree: How Spatial Abilities Branch Out in MLLMs
Yuxi Xiao, Longfei Li, Shen Yan, Xinhang Liu, Sida Peng, Yunchao Wei, Xiaowei Zhou, Bingyi Kang
Main category: cs.CV
TL;DR: SpatialTree introduces a cognitive-science-inspired hierarchy of spatial abilities in MLLMs with four levels (perception to agentic competence), creates a hierarchical benchmark, reveals skill interdependencies and transfer dynamics, and proposes an auto-think strategy to improve performance across all levels.
Details
Motivation: Current multimodal LLM studies focus on narrow spatial tasks without understanding the hierarchical development of spatial abilities from perception to reasoning and interaction, which cognitive science suggests develops progressively.
Method: Introduces SpatialTree hierarchy (L1-L4), constructs capability-centric hierarchical benchmark with 27 sub-abilities, evaluates mainstream MLLMs, conducts targeted supervised fine-tuning to study transfer dynamics, and proposes auto-think strategy to suppress unnecessary deliberation in RL training.
Result: Reveals clear structure: L1 skills are orthogonal while higher-level skills are strongly correlated; shows negative transfer within L1 but strong cross-level transfer from low to high abilities; naive RL helps complex reasoning but hurts intuitive perception; auto-think strategy enables consistent improvement across all levels.
Conclusion: SpatialTree provides a proof-of-concept framework for understanding and systematically scaling spatial abilities in MLLMs, demonstrating hierarchical organization, transfer dynamics, and effective training strategies for comprehensive spatial intelligence development.
Abstract: Cognitive science suggests that spatial ability develops progressively, from perception to reasoning and interaction. Yet in multimodal LLMs (MLLMs), this hierarchy remains poorly understood, as most studies focus on a narrow set of tasks. We introduce SpatialTree, a cognitive-science-inspired hierarchy that organizes spatial abilities into four levels: low-level perception (L1), mental mapping (L2), simulation (L3), and agentic competence (L4). Based on this taxonomy, we construct the first capability-centric hierarchical benchmark, thoroughly evaluating mainstream MLLMs across 27 sub-abilities. The evaluation results reveal a clear structure: L1 skills are largely orthogonal, whereas higher-level skills are strongly correlated, indicating increasing interdependency. Through targeted supervised fine-tuning, we uncover a surprising transfer dynamic: negative transfer within L1, but strong cross-level transfer from low- to high-level abilities with notable synergy. Finally, we explore how to improve the entire hierarchy. We find that naive RL that encourages extensive “thinking” is unreliable: it helps complex reasoning but hurts intuitive perception. We propose a simple auto-think strategy that suppresses unnecessary deliberation, enabling RL to consistently improve performance across all levels. By building SpatialTree, we provide a proof-of-concept framework for understanding and systematically scaling spatial abilities in MLLMs.
[286] VULCAN: Tool-Augmented Multi Agents for Iterative 3D Object Arrangement
Zhengfei Kuang, Rui Lin, Long Zhao, Gordon Wetzstein, Saining Xie, Sanghyun Woo
Main category: cs.CV
TL;DR: MLLMs extended to 3D scene manipulation via MCP-based API, visual tools for spatial understanding, and multi-agent framework for robust object arrangement tasks.
Details
Motivation: Multimodal Large Language Models (MLLMs) have shown strong performance in 2D vision-language tasks, but their application to complex 3D scene manipulation remains underexplored, creating a critical gap in 3D object arrangement capabilities.
Method: Three key innovations: 1) MCP-based API to shift from brittle raw code to robust function-level updates, 2) specialized visual tools for scene analysis, spatial information gathering, and action validation, 3) collaborative multi-agent framework with planning, execution, and verification roles for iterative error recovery.
Result: The approach significantly outperforms existing baselines on a diverse set of 25 complex object arrangement tasks, demonstrating effective 3D scene manipulation capabilities.
Conclusion: The proposed framework successfully bridges the gap between MLLMs and 3D scene manipulation by addressing visual grounding limitations, enhancing 3D scene understanding, and providing robust error handling through multi-agent collaboration.
Abstract: Despite the remarkable progress of Multimodal Large Language Models (MLLMs) in 2D vision-language tasks, their application to complex 3D scene manipulation remains underexplored. In this paper, we bridge this critical gap by tackling three key challenges in 3D object arrangement task using MLLMs. First, to address the weak visual grounding of MLLMs, which struggle to link programmatic edits with precise 3D outcomes, we introduce an MCP-based API. This shifts the interaction from brittle raw code manipulation to more robust, function-level updates. Second, we augment the MLLM’s 3D scene understanding with a suite of specialized visual tools to analyze scene state, gather spatial information, and validate action outcomes. This perceptual feedback loop is critical for closing the gap between language-based updates and precise 3D-aware manipulation. Third, to manage the iterative, error-prone updates, we propose a collaborative multi-agent framework with designated roles for planning, execution, and verification. This decomposition allows the system to robustly handle multi-step instructions and recover from intermediate errors. We demonstrate the effectiveness of our approach on a diverse set of 25 complex object arrangement tasks, where it significantly outperforms existing baselines. Website: vulcan-3d.github.io
[287] Language as Prior, Vision as Calibration: Metric Scale Recovery for Monocular Depth Estimation
Mingxing Zhan, Li Zhang, Beibei Wang, Yingjie Wang, Zenglin Shi
Main category: cs.CV
TL;DR: A method that uses language cues to predict uncertainty-aware bounds for metric depth calibration from relative-depth models, improving accuracy and robustness.
Details
Motivation: Monocular metric depth estimation remains ill-posed due to unidentifiable global scale and sensitivity to domain shifts, while relative-depth models transfer well but lack metric scale.
Method: Train lightweight calibration heads on frozen relative-depth backbone and CLIP text encoder. Use language to predict uncertainty-aware envelope bounding feasible calibration parameters, then use pooled multi-scale visual features to select image-specific calibration within this envelope. Supervise with closed-form least-squares oracle in inverse depth.
Result: Improves in-domain accuracy on NYUv2 and KITTI, and shows improved robustness in zero-shot transfer to SUN-RGBD and DDAD compared to language-only baselines.
Conclusion: Language can provide useful but noisy scale cues for metric depth calibration when used to predict uncertainty-aware bounds rather than point estimates, enabling more robust metric depth recovery from relative-depth models.
Abstract: Relative-depth foundation models transfer well, yet monocular metric depth remains ill-posed due to unidentifiable global scale and heightened domain-shift sensitivity. Under a frozen-backbone calibration setting, we recover metric depth via an image-specific affine transform in inverse depth and train only lightweight calibration heads while keeping the relative-depth backbone and the CLIP text encoder fixed. Since captions provide coarse but noisy scale cues that vary with phrasing and missing objects, we use language to predict an uncertainty-aware envelope that bounds feasible calibration parameters in an unconstrained space, rather than committing to a text-only point estimate. We then use pooled multi-scale frozen visual features to select an image-specific calibration within this envelope. During training, a closed-form least-squares oracle in inverse depth provides per-image supervision for learning the envelope and the selected calibration. Experiments on NYUv2 and KITTI improve in-domain accuracy, while zero-shot transfer to SUN-RGBD and DDAD demonstrates improved robustness over strong language-only baselines.
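The closed-form oracle is an ordinary least-squares fit of an affine map in inverse depth. A minimal version, assuming `inv_rel` and `inv_gt` are NumPy arrays of predicted and ground-truth inverse depth for one image:

```python
import numpy as np

def affine_oracle(inv_rel, inv_gt):
    """Per-image (s, b) minimizing ||s * inv_rel + b - inv_gt||^2."""
    A = np.stack([inv_rel.ravel(), np.ones(inv_rel.size)], axis=1)
    (s, b), *_ = np.linalg.lstsq(A, inv_gt.ravel(), rcond=None)
    return s, b

# Metric depth then follows from the calibrated inverse depth:
#   s, b = affine_oracle(inv_rel, 1.0 / gt_depth)
#   metric_depth = 1.0 / (s * inv_rel + b)
```

In the paper this oracle provides the per-image supervision target; the learned heads must land inside the language-predicted envelope around it rather than reproduce it exactly.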
[288] FastV-RAG: Towards Fast and Fine-Grained Video QA with Retrieval-Augmented Generation
Gen Li, Peiyu Liu
Main category: cs.CV
TL;DR: VideoSpeculateRAG: An efficient VLM-based RAG framework using speculative decoding and similarity-based filtering to improve inference speed and accuracy for knowledge-intensive multimodal tasks.
Details
Motivation: Vision-Language Models struggle with integrating external knowledge efficiently. Current RAG methods are inefficient and often fail to maintain high answer quality, creating a need for better solutions.
Method: Two key innovations: 1) Speculative decoding pipeline with lightweight draft model generating answer candidates verified by heavyweight model, 2) Similarity-based filtering strategy to address incorrect entity recognition in retrieved knowledge.
Result: Achieves comparable or higher accuracy than standard RAG approaches while accelerating inference by approximately 2x.
Conclusion: Combining speculative decoding with retrieval-augmented reasoning enhances efficiency and reliability in complex, knowledge-intensive multimodal tasks.
Abstract: Vision-Language Models (VLMs) excel at visual reasoning but still struggle with integrating external knowledge. Retrieval-Augmented Generation (RAG) is a promising solution, but current methods remain inefficient and often fail to maintain high answer quality. To address these challenges, we propose VideoSpeculateRAG, an efficient VLM-based RAG framework built on two key ideas. First, we introduce a speculative decoding pipeline: a lightweight draft model quickly generates multiple answer candidates, which are then verified and refined by a more accurate heavyweight model, substantially reducing inference latency without sacrificing correctness. Second, we identify a major source of error, incorrect entity recognition in retrieved knowledge, and mitigate it with a simple yet effective similarity-based filtering strategy that improves entity alignment and boosts overall answer accuracy. Experiments demonstrate that VideoSpeculateRAG achieves comparable or higher accuracy than standard RAG approaches while accelerating inference by approximately 2x. Our framework highlights the potential of combining speculative decoding with retrieval-augmented reasoning to enhance efficiency and reliability in complex, knowledge-intensive multimodal tasks.
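The draft-then-verify control flow is compact enough to sketch. Below, `draft`, `verify`, and `answer` are placeholder callables for the lightweight model, the heavyweight scorer, and the heavyweight full decode; the acceptance threshold is illustrative, not from the paper.

```python
def speculative_answer(question, evidence, draft, verify, answer,
                       n_candidates=4, threshold=0.8):
    """Cheap path: accept a verified draft; otherwise fall back to a full decode."""
    candidates = [draft(question, evidence) for _ in range(n_candidates)]
    scored = [(verify(question, evidence, c), c) for c in candidates]
    score, best = max(scored, key=lambda pair: pair[0])
    if score >= threshold:
        return best                    # verified draft, no heavyweight decode
    return answer(question, evidence)  # fallback: heavyweight model answers
```

Verification here is a single scoring pass per candidate, which is what makes the roughly 2x speedup plausible when most drafts are acceptable.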
[289] SortWaste: A Densely Annotated Dataset for Object Detection in Industrial Waste Sorting
Sara Inácio, Hugo Proença, João C. Neves
Main category: cs.CV
TL;DR: SortWaste dataset for waste detection with ClutterScore metric shows current models struggle with cluttered scenes despite decent overall performance.
Details
Motivation: Manual waste sorting is inefficient and hazardous, while automated systems struggle with real-world waste variability and clutter due to lack of appropriate datasets.
Method: Introduces SortWaste dataset from Material Recovery Facility and proposes ClutterScore metric to quantify scene hardness using object count, class/size entropy, and spatial overlap.
Result: State-of-the-art models achieve 59.7% mAP for plastic detection but performance significantly drops in highly cluttered scenes, showing current limitations.
Conclusion: Need for more challenging waste detection datasets and improved models that can handle cluttered real-world waste streams effectively.
Abstract: The increasing production of waste, driven by population growth, has created challenges in managing and recycling materials effectively. Manual waste sorting is a common practice; however, it remains inefficient for handling large-scale waste streams and presents health risks for workers. On the other hand, existing automated sorting approaches still struggle with the high variability, clutter, and visual complexity of real-world waste streams. The lack of real-world datasets for waste sorting is a major reason automated systems for this problem are underdeveloped. Accordingly, we introduce SortWaste, a densely annotated object detection dataset collected from a Material Recovery Facility. Additionally, we contribute to standardizing waste detection in sorting lines by proposing ClutterScore, an objective metric that gauges the scene’s hardness level using a set of proxies that affect visual complexity (e.g., object count, class and size entropy, and spatial overlap). In addition to these contributions, we provide an extensive benchmark of state-of-the-art object detection models, detailing their results with respect to the hardness level assessed by the proposed metric. Despite achieving promising results (mAP of 59.7% in the plastic-only detection task), performance significantly decreases in highly cluttered scenes. This highlights the need for novel and more challenging datasets on the topic.
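The exact ClutterScore weighting is not given in this summary, but a score built from the named proxies might look like the following sketch, where `iou_fn` is an assumed pairwise IoU helper and the weights `w` are placeholders.

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def clutter_score(boxes, classes, iou_fn, w=(0.25, 0.25, 0.25, 0.25)):
    """Combine object count, class/size entropy, and spatial overlap."""
    if len(boxes) < 2:
        return 0.0
    count = np.log1p(len(boxes))
    cls_ent = entropy(classes)
    areas = np.array([(x2 - x1) * (y2 - y1) for x1, y1, x2, y2 in boxes])
    size_ent = entropy(np.digitize(areas, np.quantile(areas, [0.33, 0.66])))
    overlap = np.mean([iou_fn(a, b) for i, a in enumerate(boxes)
                       for b in boxes[i + 1:]])
    return w[0] * count + w[1] * cls_ent + w[2] * size_ent + w[3] * overlap
```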
[290] BEDS: Bayesian Emergent Dissipative Structures: A Formal Framework for Continuous Inference Under Energy Constraints
Laurent Caraffa
Main category: cs.CV
TL;DR: BEDS is a framework for analyzing inference systems under energy constraints, showing that continuous belief maintenance has fundamental thermodynamic costs proportional to precision and dissipation rates.
Details
Motivation: Classical computational models assume perfect memory and one-shot computation, but real inference systems must maintain beliefs continuously under energy constraints with inherent information loss (dissipation). There's a need to formalize the thermodynamics of continuous inference.
Method: Introduces BEDS (Bayesian Emergent Dissipative Structures) framework that explicitly incorporates dissipation as a fundamental constraint. Defines three problem classes (BEDS-attainable, BEDS-maintainable, BEDS-crystallizable) and proves thermodynamic bounds linking energy, precision, and dissipation.
Result: Proves that maintaining a belief with precision τ against dissipation rate γ requires power P ≥ γk_BT/2, with scaling P ∝ γ·τ. Shows the three BEDS problem classes are distinct from classical decidability. Proposes the Gödel-Landauer-Prigogine conjecture linking pathologies across formal systems, computation, and thermodynamics.
Conclusion: Continuous inference under energy constraints has fundamental thermodynamic limitations. The BEDS framework provides a formal way to analyze these constraints, revealing connections between thermodynamics, computation, and formal systems that go beyond classical computational theory.
Abstract: We introduce BEDS (Bayesian Emergent Dissipative Structures), a formal framework for analyzing inference systems that must maintain beliefs continuously under energy constraints. Unlike classical computational models that assume perfect memory and focus on one-shot computation, BEDS explicitly incorporates dissipation (information loss over time) as a fundamental constraint. We prove a central result linking energy, precision, and dissipation: maintaining a belief with precision $\tau$ against dissipation rate $\gamma$ requires power $P \geq \gamma k_{\rm B} T / 2$, with scaling $P \propto \gamma \cdot \tau$. This establishes a fundamental thermodynamic cost for continuous inference. We define three classes of problems – BEDS-attainable, BEDS-maintainable, and BEDS-crystallizable – and show these are distinct from classical decidability. We propose the Gödel-Landauer-Prigogine conjecture, suggesting that closure pathologies across formal systems, computation, and thermodynamics share a common structure.
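A tiny numerical illustration of the bound stated in the abstract, P ≥ γ k_B T / 2 (the precision-dependent P ∝ γ·τ scaling is not modeled here):

```python
k_B = 1.380649e-23  # Boltzmann constant, J/K

def min_maintenance_power(gamma_hz, temperature_k=300.0):
    """Lower bound from the abstract: P >= gamma * k_B * T / 2,
    the power needed to hold a belief against dissipation rate gamma."""
    return gamma_hz * k_B * temperature_k / 2

# A belief dissipating at 1 kHz at room temperature:
print(f"{min_maintenance_power(1e3):.2e} W")  # ~2.07e-18 W
```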
[291] HOLO: Homography-Guided Pose Estimator Network for Fine-Grained Visual Localization on SD Maps
Xuchang Zhong, Xu Cao, Jinke Feng, Hao Fang
Main category: cs.CV
TL;DR: A homography-guided pose estimator network for visual localization between multi-view images and SD maps, using BEV feature projection and homography constraints to improve training efficiency and accuracy.
Details
Motivation: Existing regression-based visual localization methods on SD maps overlook geometric priors, leading to suboptimal training efficiency and limited localization accuracy. There's a need to better incorporate geometric constraints for improved performance.Method: Proposes a homography-guided pose estimator network that: 1) constructs input pairs satisfying homography constraints by projecting ground-view features into BEV domain, 2) enforces semantic alignment with map features, 3) uses homography relationships to guide feature fusion, and 4) restricts pose outputs to valid feasible regions.
Result: Extensive experiments on nuScenes dataset show the approach significantly outperforms existing state-of-the-art visual localization methods. The framework also naturally supports cross-resolution inputs due to explicit homography modeling.
Conclusion: This is the first work to unify BEV semantic reasoning with homography learning for image-to-map localization. The method improves both training efficiency and localization accuracy compared to attention-based fusion and direct 3-DoF pose regression approaches.
Abstract: Visual localization on standard-definition (SD) maps has emerged as a promising low-cost and scalable solution for autonomous driving. However, existing regression-based approaches often overlook inherent geometric priors, resulting in suboptimal training efficiency and limited localization accuracy. In this paper, we propose a novel homography-guided pose estimator network for fine-grained visual localization between multi-view images and standard-definition (SD) maps. We construct input pairs that satisfy a homography constraint by projecting ground-view features into the BEV domain and enforcing semantic alignment with map features. Then we leverage homography relationships to guide feature fusion and restrict the pose outputs to a valid feasible region, which significantly improves training efficiency and localization accuracy compared to prior methods relying on attention-based fusion and direct 3-DoF pose regression. To the best of our knowledge, this is the first work to unify BEV semantic reasoning with homography learning for image-to-map localization. Furthermore, by explicitly modeling homography transformations, the proposed framework naturally supports cross-resolution inputs, enhancing model flexibility. Extensive experiments on the nuScenes dataset demonstrate that our approach significantly outperforms existing state-of-the-art visual localization methods. Code and pretrained models will be publicly released to foster future research.
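For readers unfamiliar with the geometric prior involved: points on the ground plane map between image and BEV coordinates through a single 3x3 homography. A minimal numpy sketch of that relation (the matrix below is made up; the paper learns features end-to-end around this constraint):

```python
import numpy as np

def warp_points(H, pts):
    """Apply a 3x3 homography H to an Nx2 array of points."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])  # to homogeneous coords
    out = pts_h @ H.T
    return out[:, :2] / out[:, 2:3]                   # perspective divide

# Made-up homography mapping image pixels to BEV/map coordinates.
H = np.array([[1.0, 0.2,   5.0],
              [0.0, 1.5,   2.0],
              [0.0, 0.001, 1.0]])
print(warp_points(H, np.array([[100.0, 200.0], [320.0, 240.0]])))
```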
cs.AI
[292] Mastering the Game of Go with Self-play Experience Replay
Jingbin Liu, Xuechun Wang
Main category: cs.AI
TL;DR: QZero is a model-free RL algorithm that masters Go without search during training, achieving AlphaGo-level performance using only 7 GPUs for 5 months.
Details
Motivation: Previous Go AI approaches like AlphaGo rely on model-based MCTS, which requires extensive search. The authors aim to demonstrate that model-free reinforcement learning can efficiently master complex games like Go without search during training.Method: QZero uses entropy-regularized Q-learning with a single Q-value network that unifies policy evaluation and improvement. It learns through self-play and off-policy experience replay without any human data, starting tabula rasa.
Result: Trained for 5 months with modest compute resources (7 GPUs), QZero achieved performance comparable to AlphaGo, demonstrating model-free RL can master Go efficiently.
Conclusion: This work shows model-free reinforcement learning is efficient for mastering complex games like Go, and demonstrates the feasibility of off-policy RL in solving large-scale, complex environments.
Abstract: The game of Go has long served as a benchmark for artificial intelligence, demanding sophisticated strategic reasoning and long-term planning. Previous approaches such as AlphaGo and its successors, have predominantly relied on model-based Monte-Carlo Tree Search (MCTS). In this work, we present QZero, a novel model-free reinforcement learning algorithm that forgoes search during training and learns a Nash equilibrium policy through self-play and off-policy experience replay. Built upon entropy-regularized Q-learning, QZero utilizes a single Q-value network to unify policy evaluation and improvement. Starting tabula rasa without human data and trained for 5 months with modest compute resources (7 GPUs), QZero achieved a performance level comparable to that of AlphaGo. This demonstrates, for the first time, the efficiency of using model-free reinforcement learning to master the game of Go, as well as the feasibility of off-policy reinforcement learning in solving large-scale and complex environments.
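The summary does not spell out QZero's update rule, but a minimal tabular sketch of the entropy-regularized Q-learning it builds on looks like this (single-agent form; self-play, the Q-network, and the replay buffer are omitted):

```python
import numpy as np

def soft_value(q_row, tau):
    """Soft (entropy-regularized) state value: V = tau * logsumexp(Q / tau)."""
    z = q_row / tau
    m = z.max()
    return tau * (m + np.log(np.exp(z - m).sum()))

def soft_policy(q_row, tau):
    """Boltzmann policy implied by the soft value: pi ∝ exp(Q / tau)."""
    p = np.exp((q_row - q_row.max()) / tau)
    return p / p.sum()

def soft_q_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99, tau=0.05):
    """One off-policy replay update toward r + gamma * V_soft(s')."""
    target = r + (0.0 if done else gamma * soft_value(Q[s_next], tau))
    Q[s, a] += alpha * (target - Q[s, a])

Q = np.zeros((4, 2))                      # 4 toy states, 2 actions
soft_q_update(Q, s=0, a=1, r=1.0, s_next=2, done=False)
print(soft_policy(Q[0], tau=0.05))
```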
[293] Digital Red Queen: Adversarial Program Evolution in Core War with LLMs
Akarsh Kumar, Ryan Bahlous-Boldi, Prafull Sharma, Phillip Isola, Sebastian Risi, Yujin Tang, David Ha
Main category: cs.AI
TL;DR: DRQ is a self-play algorithm using LLMs to evolve Core War warriors through Red Queen dynamics, showing improved generality and behavioral convergence over time.
Details
Motivation: Current LLM-evolution frameworks use static optimization, missing the open-ended adversarial dynamics of real-world evolution. The paper aims to study Red Queen dynamics through continual adaptation to changing objectives.Method: Digital Red Queen (DRQ) uses LLMs to evolve assembly-like programs (warriors) for Core War. In each round, the model evolves a new warrior to defeat all previous ones, creating a sequence of adapted warriors through self-play.
Result: Warriors become increasingly general (relative to held-out human warriors) and show less behavioral diversity across independent runs, indicating convergence toward general-purpose strategies similar to convergent evolution.
Conclusion: Shifting from static to dynamic Red Queen objectives has value. Core War serves as a rich sandbox for studying adversarial adaptation, and DRQ’s simplicity suggests similar approaches could work in practical domains like cybersecurity and drug resistance.
Abstract: Large language models (LLMs) are increasingly being used to evolve solutions to problems in many domains, in a process inspired by biological evolution. However, unlike biological evolution, most LLM-evolution frameworks are formulated as static optimization problems, overlooking the open-ended adversarial dynamics that characterize real-world evolutionary processes. Here, we study Digital Red Queen (DRQ), a simple self-play algorithm that embraces these so-called “Red Queen” dynamics via continual adaptation to a changing objective. DRQ uses an LLM to evolve assembly-like programs, called warriors, which compete against each other for control of a virtual machine in the game of Core War, a Turing-complete environment studied in artificial life and connected to cybersecurity. In each round of DRQ, the model evolves a new warrior to defeat all previous ones, producing a sequence of adapted warriors. Over many rounds, we observe that warriors become increasingly general (relative to a set of held-out human warriors). Interestingly, warriors also become less behaviorally diverse across independent runs, indicating a convergence pressure toward a general-purpose behavioral strategy, much like convergent evolution in nature. This result highlights a potential value of shifting from static objectives to dynamic Red Queen objectives. Our work positions Core War as a rich, controllable sandbox for studying adversarial adaptation in artificial systems and for evaluating LLM-based evolution methods. More broadly, the simplicity and effectiveness of DRQ suggest that similarly minimal self-play approaches could prove useful in other more practical multi-agent adversarial domains, like real-world cybersecurity or combating drug resistance.
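A skeleton of the round structure described in the abstract, with `llm_evolve` and `play_match` as hypothetical stand-ins for the LLM mutation step and the Core War simulator:

```python
def digital_red_queen(n_rounds, seed_warrior, llm_evolve, play_match,
                      max_tries=10):
    """Round structure of DRQ. `llm_evolve` (LLM mutation/rewrite) and
    `play_match` (Core War simulator) are hypothetical callables."""
    archive = [seed_warrior]
    for _ in range(n_rounds):
        winner = None
        for _ in range(max_tries):
            candidate = llm_evolve(archive)  # propose a new warrior program
            if all(play_match(candidate, w) == "win" for w in archive):
                winner = candidate
                break
        if winner is None:
            break                            # stuck: nothing beats the archive
        archive.append(winner)               # the objective shifts next round
    return archive
```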
[294] Enhancing LLM Instruction Following: An Evaluation-Driven Multi-Agentic Workflow for Prompt Instructions Optimization
Alberto Purpura, Li Wang, Sahil Badyal, Eugenio Beaufrand, Adam Faulkner
Main category: cs.AI
TL;DR: Multi-agent workflow decouples task description optimization from constraint refinement using quantitative feedback to improve LLM compliance with formal requirements.
Details
Motivation: LLMs often generate conceptually correct content but fail to adhere to formal constraints, creating procedurally flawed outputs. Traditional prompt engineering focuses on rephrasing main task descriptions while neglecting granular constraints that serve as acceptance criteria.Method: Proposes a multi-agentic workflow that separates optimization of primary task description from constraint refinement. Uses quantitative scores as feedback to iteratively rewrite and improve both task descriptions and constraints.
Result: The method produces revised prompts that yield significantly higher compliance scores from models like Llama 3.1 8B and Mixtral-8x7B, demonstrating improved adherence to formal constraints.
Conclusion: Decoupling task description optimization from constraint refinement using quantitative feedback is an effective approach for improving LLM compliance with formal requirements, addressing a key limitation in traditional prompt engineering.
Abstract: Large Language Models (LLMs) often generate substantively relevant content but fail to adhere to formal constraints, leading to outputs that are conceptually correct but procedurally flawed. Traditional prompt refinement approaches focus on rephrasing the description of the primary task an LLM has to perform, neglecting the granular constraints that function as acceptance criteria for its response. We propose a novel multi-agentic workflow that decouples optimization of the primary task description from its constraints, using quantitative scores as feedback to iteratively rewrite and improve them. Our evaluation demonstrates this method produces revised prompts that yield significantly higher compliance scores from models like Llama 3.1 8B and Mixtral-8x7B.
[295] Exploration Through Introspection: A Self-Aware Reward Model
Michael Petrowski, Milica Gašić
Main category: cs.AI
TL;DR: Introspective RL agents with pain-belief inference outperform baselines and replicate human-like behaviors, showing self-awareness improves learning.
Details
Motivation: To advance Theory of Mind in AI by exploring self-awareness in artificial agents, particularly how inferring internal states (like pain) affects learning, inspired by biological pain as a learning signal.Method: Use reinforcement learning agents in gridworld environments with introspective exploration component. Implement hidden Markov model to infer “pain-belief” from online observations, integrate this signal into subjective reward function. Compare normal vs. chronic pain perception models.
Result: Introspective agents significantly outperform standard baseline agents and can replicate complex human-like behaviors.
Conclusion: Self-awareness through internal state inference (pain-belief modeling) enhances agent learning capabilities and provides computational framework for studying differences between normal and chronic pain perception.
Abstract: Understanding how artificial agents model internal mental states is central to advancing Theory of Mind in AI. Evidence points to a unified system for self- and other-awareness. We explore this self-awareness by having reinforcement learning agents infer their own internal states in gridworld environments. Specifically, we introduce an introspective exploration component that is inspired by biological pain as a learning signal by utilizing a hidden Markov model to infer “pain-belief” from online observations. This signal is integrated into a subjective reward function to study how self-awareness affects the agent’s learning abilities. Further, we use this computational framework to investigate the difference in performance between normal and chronic pain perception models. Results show that introspective agents in general significantly outperform standard baseline agents and can replicate complex human-like behaviors.
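A minimal sketch of the pain-belief filter: one HMM forward step plus one plausible way to fold the belief into a subjective reward (the penalty form and all probabilities are assumptions):

```python
import numpy as np

def update_pain_belief(belief, obs, T, O):
    """One HMM forward step: predict through transition matrix T,
    then condition on observation likelihoods O[:, obs]."""
    predicted = T.T @ belief
    posterior = O[:, obs] * predicted
    return posterior / posterior.sum()

def subjective_reward(env_reward, pain_belief, pain_weight=1.0):
    """Assumed shaping: external reward minus the inferred probability
    of the 'pain' state (index 1)."""
    return env_reward - pain_weight * pain_belief[1]

T = np.array([[0.9, 0.1], [0.3, 0.7]])  # states: (no-pain, pain)
O = np.array([[0.8, 0.2], [0.2, 0.8]])  # P(observation | state)
b = np.array([0.5, 0.5])
b = update_pain_belief(b, obs=1, T=T, O=O)
print(b, subjective_reward(1.0, b))      # belief shifts toward 'pain'
```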
[296] Toward Maturity-Based Certification of Embodied AI: Quantifying Trustworthiness Through Measurement Mechanisms
Michael C. Darling, Alan H. Hesu, Michael A. Mardikes, Brian C. McGuigan, Reed M. Milewicz
Main category: cs.AI
TL;DR: Proposes a maturity-based certification framework for embodied AI systems with structured assessment, quantitative scoring, and multi-objective trade-off navigation, demonstrated through uncertainty quantification in UAS detection.
Details
Motivation: Current embodied AI systems lack structured certification frameworks with explicit measurement mechanisms for trustworthiness assessment.Method: Develops a maturity-based certification framework with structured assessment, quantitative scoring mechanisms, and methods for navigating multi-objective trade-offs in trustworthiness evaluation.
Result: Demonstrates feasibility through uncertainty quantification as an exemplar measurement mechanism applied to an Uncrewed Aircraft System (UAS) detection case study.
Conclusion: A maturity-based framework with explicit measurement mechanisms enables certifiable embodied AI systems, addressing trustworthiness evaluation challenges through structured assessment and quantitative scoring.
Abstract: We propose a maturity-based framework for certifying embodied AI systems through explicit measurement mechanisms. We argue that certifiable embodied AI requires structured assessment frameworks, quantitative scoring mechanisms, and methods for navigating multi-objective trade-offs inherent in trustworthiness evaluation. We demonstrate this approach using uncertainty quantification as an exemplar measurement mechanism and illustrate feasibility through an Uncrewed Aircraft System (UAS) detection case study.
[297] CPGPrompt: Translating Clinical Guidelines into LLM-Executable Decision Support
Ruiqi Deng, Geoffrey Martin, Tony Wang, Gongbo Zhang, Yi Liu, Chunhua Weng, Yanshan Wang, Justin F Rousseau, Yifan Peng
Main category: cs.AI
TL;DR: CPGPrompt is an auto-prompting system that translates narrative clinical guidelines into structured decision trees that LLMs navigate, achieving strong performance on binary specialty-referral decisions but variable results on multi-class pathway classification.
Details
Motivation: Clinical practice guidelines (CPGs) provide evidence-based care recommendations, but integrating them into AI systems is challenging. Previous rule-based approaches suffer from poor interpretability, inconsistent guideline adherence, and narrow domain applicability.Method: Developed CPGPrompt, an auto-prompting system that translates CPGs into structured decision trees and uses LLMs to dynamically navigate them for patient case evaluation. Tested with synthetic vignettes across three domains (headache, lower back pain, prostate cancer) in four decision scenario categories.
Result: Binary specialty referral classification achieved strong performance across all domains (F1: 0.85-1.00) with perfect recall (1.00 ± 0.00). Multi-class pathway assignment showed reduced performance: headache (F1: 0.47), lower back pain (F1: 0.72), prostate cancer (F1: 0.77). Performance differences reflected guideline structure challenges.
Conclusion: CPGPrompt effectively integrates CPGs into LLMs for clinical decision support, particularly for binary referral decisions. Performance varies for complex multi-class pathways depending on guideline structure, with quantifiable tests (prostate cancer) enabling more reliable decisions than negation handling (headache) or temporal reasoning (back pain) requirements.
Abstract: Clinical practice guidelines (CPGs) provide evidence-based recommendations for patient care; however, integrating them into Artificial Intelligence (AI) remains challenging. Previous approaches, such as rule-based systems, face significant limitations, including poor interpretability, inconsistent adherence to guidelines, and narrow domain applicability. To address this, we develop and validate CPGPrompt, an auto-prompting system that converts narrative clinical guidelines into large language models (LLMs). Our framework translates CPGs into structured decision trees and utilizes an LLM to dynamically navigate them for patient case evaluation. Synthetic vignettes were generated across three domains (headache, lower back pain, and prostate cancer) and distributed into four categories to test different decision scenarios. System performance was assessed on both binary specialty-referral decisions and fine-grained pathway-classification tasks. The binary specialty referral classification achieved consistently strong performance across all domains (F1: 0.85-1.00), with high recall (1.00 $\pm$ 0.00). In contrast, multi-class pathway assignment showed reduced performance, with domain-specific variations: headache (F1: 0.47), lower back pain (F1: 0.72), and prostate cancer (F1: 0.77). Domain-specific performance differences reflected the structure of each guideline. The headache guideline highlighted challenges with negation handling. The lower back pain guideline required temporal reasoning. In contrast, prostate cancer pathways benefited from quantifiable laboratory tests, resulting in more reliable decision-making.
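A minimal sketch of the navigation step: a guideline rendered as a decision tree that an LLM walks by answering one branching question at a time. The tree content and the `toy_decide` stand-in are invented for illustration:

```python
def navigate_guideline(node, case, decide):
    """Walk a guideline decision tree to a leaf recommendation;
    `decide` answers one branching question about the case."""
    while isinstance(node, dict):
        node = node["branches"][decide(node["question"], case)]
    return node

tree = {  # invented guideline fragment
    "question": "Are red-flag symptoms present?",
    "branches": {
        "yes": "urgent specialty referral",
        "no": {
            "question": "Have symptoms persisted beyond 6 weeks?",
            "branches": {"yes": "routine referral", "no": "conservative care"},
        },
    },
}

def toy_decide(question, case):
    """Stand-in for the LLM call: answers from structured fields."""
    if "red-flag" in question:
        return "yes" if case["red_flags"] else "no"
    return "yes" if case["weeks_of_symptoms"] > 6 else "no"

print(navigate_guideline(tree, {"red_flags": False, "weeks_of_symptoms": 8},
                         toy_decide))  # -> routine referral
```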
[298] Personalization of Large Foundation Models for Health Interventions
Stefan Konigorski, Johannes E. Vedder, Babajide Alamu Owoyele, İbrahim Özkan
Main category: cs.AI
TL;DR: LFMs can’t replace N-of-1 trials for personalized medicine; they’re complementary - LFMs generate hypotheses from population data while N-of-1 trials provide causal validation for individuals.
Details
Motivation: Despite LFMs transforming healthcare AI, it's unclear if they can provide truly personalized treatment recommendations due to paradoxes like the generalizability paradox (models accurate in one study perform poorly in others), privacy-performance paradox, scale-specificity paradox, and automation-empathy paradox.Method: Proposes a hybrid framework combining LFMs and N-of-1 trials: LFMs generate ranked intervention candidates with uncertainty estimates from population patterns using multimodal data, which then trigger subsequent N-of-1 trials (crossover self-experiments) for causal validation at individual level.
Result: LFMs and N-of-1 trials are complementary rather than substitutive - LFMs excel at rapid hypothesis generation while N-of-1 trials excel at causal validation for individuals, resolving tensions between personalization and external validity.
Conclusion: Clarifying the boundary between prediction and causation and explicitly addressing paradoxical tensions are essential for responsible AI integration in personalized medicine. The hybrid framework enables personalization while navigating identified paradoxes.
Abstract: Large foundation models (LFMs) transform healthcare AI in prevention, diagnostics, and treatment. However, whether LFMs can provide truly personalized treatment recommendations remains an open question. Recent research has revealed multiple challenges for personalization, including the fundamental generalizability paradox: models achieving high accuracy in one clinical study perform at chance level in others, demonstrating that personalization and external validity exist in tension. This exemplifies broader contradictions in AI-driven healthcare: the privacy-performance paradox, scale-specificity paradox, and the automation-empathy paradox. As another challenge, the degree of causal understanding required for personalized recommendations, as opposed to mere predictive capacities of LFMs, remains an open question. N-of-1 trials – crossover self-experiments and the gold standard for individual causal inference in personalized medicine – resolve these tensions by providing within-person causal evidence while preserving privacy through local experimentation. Despite their impressive capabilities, this paper argues that LFMs cannot replace N-of-1 trials. We argue that LFMs and N-of-1 trials are complementary: LFMs excel at rapid hypothesis generation from population patterns using multimodal data, while N-of-1 trials excel at causal validation for a given individual. We propose a hybrid framework that combines the strengths of both to enable personalization and navigate the identified paradoxes: LFMs generate ranked intervention candidates with uncertainty estimates, which trigger subsequent N-of-1 trials. Clarifying the boundary between prediction and causation and explicitly addressing the paradoxical tensions are essential for responsible AI integration in personalized medicine.
[299] Evolving Programmatic Skill Networks
Haochen Shi, Xingdi Yuan, Bang Liu
Main category: cs.AI
TL;DR: PSN is a framework for continual skill acquisition using executable symbolic programs that form a compositional network, with LLM-based mechanisms for reflection, optimization, and refactoring.
Details
Motivation: To enable continual skill acquisition in open-ended embodied environments where agents need to construct, refine, and reuse an expanding library of executable skills.Method: Programmatic Skill Network (PSN) uses executable symbolic programs as skills in a compositional network. It employs three LLM-based mechanisms: REFLECT for structured fault localization, progressive optimization with maturity-aware update gating, and canonical structural refactoring with rollback validation.
Result: Experiments on MineDojo and Crafter show robust skill reuse, rapid adaptation, and strong generalization across open-ended task distributions.
Conclusion: PSN effectively enables continual skill acquisition with structural parallels to neural network training, demonstrating practical value for open-ended embodied learning.
Abstract: We study continual skill acquisition in open-ended embodied environments where an agent must construct, refine, and reuse an expanding library of executable skills. We introduce the Programmatic Skill Network (PSN), a framework in which skills are executable symbolic programs forming a compositional network that evolves through experience. PSN defines three core mechanisms instantiated via large language models: (1) REFLECT for structured fault localization over skill compositions, (2) progressive optimization with maturity-aware update gating that stabilizes reliable skills while maintaining plasticity for uncertain ones, and (3) canonical structural refactoring under rollback validation that maintains network compactness. We further show that PSN’s learning dynamics exhibit structural parallels to neural network training. Experiments on MineDojo and Crafter demonstrate robust skill reuse, rapid adaptation, and strong generalization across open-ended task distributions. We plan to open-source the code.
[300] Variance Computation for Weighted Model Counting with Knowledge Compilation Approach
Kengo Nakamura, Masaaki Nishino, Norihito Yasuda
Main category: cs.AI
TL;DR: This paper investigates the tractability of computing the variance of weighted model counting (WMC) to measure uncertainty in probabilistic inference outcomes when parameters have uncertainty.
Details
Motivation: In practical inference tasks, model parameters often have uncertainty because they are learned from data. This uncertainty should be quantified in the inference outcomes, but the tractability of computing such variance is largely unknown.Method: The authors develop a polynomial-time algorithm to compute WMC variance when input is given as structured d-DNNF. They also prove hardness results for structured DNNFs, d-DNNFs, and FBDDs, and apply their approach to Bayesian networks.
Result: 1) Polynomial-time algorithm for WMC variance on structured d-DNNFs. 2) Hardness proofs for structured DNNFs, d-DNNFs, and FBDDs. 3) Empirical demonstration on real-world Bayesian networks showing variance computation for marginal probabilities.
Conclusion: The paper establishes tractability boundaries for computing WMC variance, provides practical algorithms for certain representations, and demonstrates applicability to uncertainty quantification in Bayesian network inference.
Abstract: One of the most important queries in knowledge compilation is weighted model counting (WMC), which has been applied to probabilistic inference on various models, such as Bayesian networks. In practical situations on inference tasks, the model’s parameters have uncertainty because they are often learned from data, and thus we want to compute the degree of uncertainty in the inference outcome. One possible approach is to regard the inference outcome as a random variable by introducing distributions for the parameters and evaluate the variance of the outcome. Unfortunately, the tractability of computing such a variance is hardly known. Motivated by this, we consider the problem of computing the variance of WMC and investigate this problem’s tractability. First, we derive a polynomial time algorithm to evaluate the WMC variance when the input is given as a structured d-DNNF. Second, we prove the hardness of this problem for structured DNNFs, d-DNNFs, and FBDDs, which is intriguing because the latter two allow polynomial time WMC algorithms. Finally, we show an application that measures the uncertainty in the inference of Bayesian networks. We empirically show that our algorithm can evaluate the variance of the marginal probability on real-world Bayesian networks and analyze the impact of the variances of parameters on the variance of the marginal.
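For intuition about the quantity being computed, here is a naive Monte Carlo baseline that estimates Var[WMC] by brute-force enumeration under sampled parameters; this is exactly what the paper's polynomial-time structured d-DNNF algorithm avoids:

```python
import itertools
import numpy as np

def wmc(clauses, n_vars, weights):
    """Brute-force weighted model count of a CNF. Literals are DIMACS-style
    nonzero ints; weights[i] = (w_false, w_true) for variable i+1."""
    total = 0.0
    for assign in itertools.product([0, 1], repeat=n_vars):
        if all(any(assign[abs(l) - 1] == (l > 0) for l in c) for c in clauses):
            total += np.prod([weights[i][assign[i]] for i in range(n_vars)])
    return total

rng = np.random.default_rng(0)
clauses = [(1, 2), (-1, 3)]               # (x1 or x2) and (not x1 or x3)
samples = []
for _ in range(2000):
    p = rng.beta(5, 5, size=3)            # assumed parameter distribution
    samples.append(wmc(clauses, 3, [(1 - pi, pi) for pi in p]))
print(np.mean(samples), np.var(samples))  # MC estimate of E[WMC], Var[WMC]
```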
[301] STAR-S: Improving Safety Alignment through Self-Taught Reasoning on Safety Rules
Di Wu, Yanyan Zhao, Xin Lu, Mingzhe Li, Bing Qin
Main category: cs.AI
TL;DR: STAR-S is a self-taught framework that improves LLM safety by iteratively learning to reason about safety rules through a cycle of reasoning elicitation, reflection, and fine-tuning.
Details
Motivation: Current approaches to defend against jailbreak attacks in LLMs rely on training models to reason over safety rules, but it's difficult to determine what form of safety reasoning is most effective, and this reasoning is hard to explicitly design or obtain.Method: STAR-S integrates safety rule reasoning learning into a self-taught loop: 1) eliciting reasoning and reflection guided by safety rules, 2) leveraging fine-tuning to enhance safety reasoning, and 3) repeating this process to create a synergistic cycle where improved reasoning produces better training data for further enhancement.
Result: Experiments show that STAR-S effectively defends against jailbreak attacks and outperforms baseline methods.
Conclusion: The self-taught approach to learning safety reasoning through iterative refinement provides an effective framework for defending against jailbreak attacks in LLMs, with code made publicly available.
Abstract: Defending against jailbreak attacks is crucial for the safe deployment of Large Language Models (LLMs). Recent research has attempted to improve safety by training models to reason over safety rules before responding. However, a key issue lies in determining what form of safety reasoning effectively defends against jailbreak attacks, which is difficult to explicitly design or directly obtain. To address this, we propose STAR-S (Self-TAught Reasoning based on Safety rules), a framework that integrates the learning of safety rule reasoning into a self-taught loop. The core of STAR-S involves eliciting reasoning and reflection guided by safety rules, then leveraging fine-tuning to enhance safety reasoning. Repeating this process creates a synergistic cycle. Improvements in the model’s reasoning and interpretation of safety rules allow it to produce better reasoning data under safety rule prompts, which is then utilized for further training. Experiments show that STAR-S effectively defends against jailbreak attacks, outperforming baselines. Code is available at: https://github.com/pikepokenew/STAR_S.git.
[302] ReEfBench: Quantifying the Reasoning Efficiency of LLMs
Zhizhang Fu, Yuancheng Gu, Chenkai Hu, Hanmeng Liu, Yue Zhang
Main category: cs.AI
TL;DR: The paper proposes a neuro-symbolic framework for evaluating LLM reasoning beyond just output length, identifies four behavioral prototypes, and reveals that longer token generation doesn’t guarantee better reasoning while training data mixing and model distillation have critical limitations.
Details
Motivation: Current Chain-of-Thought evaluation methods are limited in distinguishing whether performance gains come from genuine reasoning improvements or just increased verbosity, creating a need for more comprehensive process-centric evaluation frameworks.Method: Proposes a novel neuro-symbolic framework for non-intrusive, comprehensive process-centric evaluation of reasoning, analyzes behavioral prototypes, failure modes, and examines impact of inference mode, training strategy, and model scale.
Result: Identifies four distinct behavioral prototypes, reveals that extended token generation is not necessary for deep reasoning, shows that mixing long and short CoT data in training risks premature saturation and collapse, and finds that distillation into smaller models captures behavioral length but fails to replicate logical efficacy due to capacity limits.
Conclusion: The proposed evaluation framework provides better insights into LLM reasoning capabilities, showing that current training and scaling approaches have critical limitations that need addressing to achieve genuine reasoning improvements rather than just behavioral mimicry.
Abstract: Test-time scaling has enabled Large Language Models (LLMs) to tackle complex reasoning, yet the limitations of current Chain-of-Thought (CoT) evaluation obscure whether performance gains stem from genuine reasoning or mere verbosity. To address this, (1) we propose a novel neuro-symbolic framework for the non-intrusive, comprehensive process-centric evaluation of reasoning. (2) Through this lens, we identify four distinct behavioral prototypes and diagnose the failure modes. (3) We examine the impact of inference mode, training strategy, and model scale. Our analysis reveals that extended token generation is not a prerequisite for deep reasoning. Furthermore, we reveal critical constraints: mixing long and short CoT data in training risks premature saturation and collapse, while distillation into smaller models captures behavioral length but fails to replicate logical efficacy due to intrinsic capacity limits.
[303] SCRIBE: Structured Mid-Level Supervision for Tool-Using Language Models
Yuxuan Jiang, Francis Ferraro
Main category: cs.AI
TL;DR: SCRIBE is a reinforcement learning framework that uses skill-conditioned reward modeling with intermediate behavioral evaluation to improve credit assignment in multi-step reasoning tasks, achieving state-of-the-art performance on reasoning and tool-use benchmarks.
Details
Motivation: Training reliable tool-augmented agents is challenging due to difficult credit assignment in multi-step reasoning. Existing LLM-based judges produce noisy and inconsistent signals because they lack fine-grained, task-specific rubrics to distinguish high-level planning from low-level execution.Method: SCRIBE introduces a reinforcement learning framework that intervenes at a novel mid-level abstraction. It grounds reward modeling in a curated library of skill prototypes, transforming open-ended LLM evaluation into a constrained verification problem. By routing each subgoal to a corresponding prototype, the reward model gets precise, structured rubrics that reduce reward variance.
Result: SCRIBE achieves state-of-the-art performance across reasoning and tool-use benchmarks. It improves AIME25 accuracy of a Qwen3-4B model from 43.3% to 63.3%, and significantly increases success rates in complex multi-turn tool interactions. Analysis shows co-evolution across abstraction levels where mid-level skill mastery precedes effective high-level planning.
Conclusion: SCRIBE provides a scalable and complementary pathway toward more autonomous and reliable tool-using agents. It is additive to low-level tool optimizations and demonstrates that structured mid-level abstractions can effectively address credit assignment challenges in multi-step reasoning.
Abstract: Training reliable tool-augmented agents remains a significant challenge, largely due to the difficulty of credit assignment in multi-step reasoning. While process-level reward models offer a promising direction, existing LLM-based judges often produce noisy and inconsistent signals because they lack fine-grained, task-specific rubrics to distinguish high-level planning from low-level execution. In this work, we introduce SCRIBE (Skill-Conditioned Reward with Intermediate Behavioral Evaluation), a reinforcement learning framework that intervenes at a novel mid-level abstraction. SCRIBE grounds reward modeling in a curated library of skill prototypes, transforming open-ended LLM evaluation into a constrained verification problem. By routing each subgoal to a corresponding prototype, the reward model is equipped with precise, structured rubrics that substantially reduce reward variance. Experimental results show that SCRIBE achieves state-of-the-art performance across a range of reasoning and tool-use benchmarks. In particular, it improves the AIME25 accuracy of a Qwen3-4B model from 43.3% to 63.3%, and significantly increases success rates in complex multi-turn tool interactions. Further analysis of training dynamics reveals a co-evolution across abstraction levels, where mastery of mid-level skills consistently precedes the emergence of effective high-level planning behaviors. Finally, we demonstrate that SCRIBE is additive to low-level tool optimizations, providing a scalable and complementary pathway toward more autonomous and reliable tool-using agents.
[304] SPIO: Ensemble and Selective Strategies via LLM-Based Multi-Agent Planning in Automated Data Science
Wonduk Seo, Juhyeon Lee, Yanjun Shao, Qingshan Zhou, Seunghyun Lee, Yi Bu
Main category: cs.AI
TL;DR: SPIO is a multi-agent framework for automated data analytics that replaces rigid single-path workflows with adaptive multi-path planning across data preprocessing, feature engineering, model selection, and hyperparameter tuning, achieving 5.6% average performance gain over state-of-the-art baselines.
Details
Motivation: Current LLM-based multi-agent systems for automated data analytics are limited by rigid, single-path workflows that restrict strategic exploration and often lead to suboptimal outcomes. There's a need for more flexible and adaptive approaches that can explore multiple solution paths.Method: SPIO (Sequential Plan Integration and Optimization) uses specialized agents across four core modules (data preprocessing, feature engineering, model selection, hyperparameter tuning) to generate diverse candidate strategies. These are cascaded and refined by an optimization agent. SPIO offers two modes: SPIO-S for selecting a single optimal pipeline, and SPIO-E for ensembling top-k pipelines for robustness.
Result: Extensive evaluations on Kaggle and OpenML benchmarks show SPIO consistently outperforms state-of-the-art baselines, achieving an average performance gain of 5.6%.
Conclusion: SPIO provides a more flexible, accurate, and reliable foundation for automated data science by explicitly exploring and integrating multiple solution paths, overcoming the limitations of rigid single-path workflows in existing multi-agent systems.
Abstract: Large Language Models (LLMs) have enabled dynamic reasoning in automated data analytics, yet recent multi-agent systems remain limited by rigid, single-path workflows that restrict strategic exploration and often lead to suboptimal outcomes. To overcome these limitations, we propose SPIO (Sequential Plan Integration and Optimization), a framework that replaces rigid workflows with adaptive, multi-path planning across four core modules: data preprocessing, feature engineering, model selection, and hyperparameter tuning. In each module, specialized agents generate diverse candidate strategies, which are cascaded and refined by an optimization agent. SPIO offers two operating modes: SPIO-S for selecting a single optimal pipeline, and SPIO-E for ensembling top-k pipelines to maximize robustness. Extensive evaluations on Kaggle and OpenML benchmarks show that SPIO consistently outperforms state-of-the-art baselines, achieving an average performance gain of 5.6%. By explicitly exploring and integrating multiple solution paths, SPIO delivers a more flexible, accurate, and reliable foundation for automated data science.
[305] Controllable LLM Reasoning via Sparse Autoencoder-Based Steering
Yi Fang, Wenjie Wang, Mingfeng Xue, Boyi Deng, Fengli Xu, Dayiheng Liu, Fuli Feng
Main category: cs.AI
TL;DR: SAE-Steering uses Sparse Autoencoders to disentangle reasoning strategies in Large Reasoning Models, enabling precise control over reasoning paths and improving accuracy by redirecting models from erroneous to correct reasoning.
Details
Motivation: Current Large Reasoning Models autonomously select reasoning strategies, which often leads to inefficient or erroneous reasoning paths. Existing methods cannot effectively control fine-grained reasoning strategies due to conceptual entanglement in hidden states, creating a need for better strategy control methods.Method: Proposes SAE-Steering: uses Sparse Autoencoders to decompose strategy-entangled hidden states into disentangled features, then employs a two-stage pipeline (1) recalls features amplifying strategy-specific keywords (filtering 99%+ features), (2) ranks remaining features by control effectiveness.
Result: SAE-Steering outperforms existing methods by over 15% in control effectiveness. Controlling reasoning strategies redirects LRMs from erroneous to correct paths, achieving 7% absolute accuracy improvement.
Conclusion: SAE-Steering enables reliable and flexible control over reasoning strategies in Large Reasoning Models by disentangling strategy features, significantly improving both control effectiveness and task accuracy through strategic redirection.
Abstract: Large Reasoning Models (LRMs) exhibit human-like cognitive reasoning strategies (e.g. backtracking, cross-verification) during reasoning process, which improves their performance on complex tasks. Currently, reasoning strategies are autonomously selected by LRMs themselves. However, such autonomous selection often produces inefficient or even erroneous reasoning paths. To make reasoning more reliable and flexible, it is important to develop methods for controlling reasoning strategies. Existing methods struggle to control fine-grained reasoning strategies due to conceptual entanglement in LRMs’ hidden states. To address this, we leverage Sparse Autoencoders (SAEs) to decompose strategy-entangled hidden states into a disentangled feature space. To identify the few strategy-specific features from the vast pool of SAE features, we propose SAE-Steering, an efficient two-stage feature identification pipeline. SAE-Steering first recalls features that amplify the logits of strategy-specific keywords, filtering out over 99% of features, and then ranks the remaining features by their control effectiveness. Using the identified strategy-specific features as control vectors, SAE-Steering outperforms existing methods by over 15% in control effectiveness. Furthermore, controlling reasoning strategies can redirect LRMs from erroneous paths to correct ones, achieving a 7% absolute accuracy improvement.
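A minimal numpy sketch of the two ideas named above: recalling candidate features by how strongly their decoder directions amplify strategy-keyword logits, then steering by adding a feature direction to the hidden state. All shapes and weights are toy placeholders; the paper's second-stage ranking by control effectiveness is omitted:

```python
import numpy as np

def recall_features(W_dec, W_unembed, keyword_ids, top_k=64):
    """Stage 1: keep SAE features whose decoder direction most amplifies
    the logits of strategy-specific keyword tokens."""
    effect = W_dec @ W_unembed[:, keyword_ids]   # (n_features, n_keywords)
    return np.argsort(effect.max(axis=1))[-top_k:]

def steer(hidden, feature_direction, alpha):
    """Add a strategy feature's decoder direction to the hidden state."""
    return hidden + alpha * feature_direction

rng = np.random.default_rng(0)
W_dec = rng.standard_normal((2048, 512)) * 0.02    # toy SAE decoder
W_unembed = rng.standard_normal((512, 1000)) * 0.02
candidates = recall_features(W_dec, W_unembed, keyword_ids=[101, 202])
h = rng.standard_normal(512)
h_steered = steer(h, W_dec[candidates[-1]], alpha=4.0)
```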
[306] Interleaved Tool-Call Reasoning for Protein Function Understanding
Chuanliu Fan, Zicheng Ma, Huanran Meng, Aijia Zhang, Wenjie Du, Jun Zhang, Yi Qin Gao, Ziqiang Cao, Guohong Fu
Main category: cs.AI
TL;DR: PFUA is a tool-augmented protein reasoning agent that outperforms text-only LLMs by 103% on protein function prediction tasks by integrating domain-specific tools rather than relying on pure text reasoning.
Details
Motivation: Chain-of-thought reasoning from LLMs fails for protein function understanding because it amplifies superficial keyword patterns without introducing new biological knowledge, limiting generalization. Protein function prediction requires external biological priors and computational tools rather than purely internal reasoning.Method: PFUA (tool-augmented protein reasoning agent) unifies problem decomposition, tool invocation, and grounded answer generation. Instead of long unconstrained reasoning traces, it integrates domain-specific tools to produce verifiable intermediate evidence.
Result: Experiments on four benchmarks show PFUA consistently outperforms text-only reasoning models with an average performance improvement of 103%.
Conclusion: Protein function prediction requires domain-specific tools and verifiable evidence rather than pure text-based reasoning. PFUA demonstrates that tool-augmented approaches are essential for knowledge-intensive scientific tasks in biology.
Abstract: Recent advances in large language models (LLMs) have highlighted the effectiveness of chain-of-thought reasoning in symbolic domains such as mathematics and programming. However, our study shows that directly transferring such text-based reasoning paradigms to protein function understanding is ineffective: reinforcement learning mainly amplifies superficial keyword patterns while failing to introduce new biological knowledge, resulting in limited generalization. We argue that protein function prediction is a knowledge-intensive scientific task that fundamentally relies on external biological priors and computational tools rather than purely internal reasoning. To address this gap, we propose PFUA, a tool-augmented protein reasoning agent that unifies problem decomposition, tool invocation, and grounded answer generation. Instead of relying on long unconstrained reasoning traces, PFUA integrates domain-specific tools to produce verifiable intermediate evidence. Experiments on four benchmarks demonstrate that PFUA consistently outperforms text-only reasoning models with an average performance improvement of 103%.
[307] When Identity Skews Debate: Anonymization for Bias-Reduced Multi-Agent Reasoning
Hyeong Kyu Choi, Xiaojin Zhu, Sharon Li
Main category: cs.AI
TL;DR: This paper introduces a principled framework to mitigate identity bias (sycophancy and self-bias) in multi-agent debate by formalizing debate dynamics, proposing response anonymization, and defining a bias metric.
Details
Motivation: Multi-agent debate systems suffer from identity-driven biases where agents either uncritically adopt peers' views (sycophancy) or stubbornly adhere to their own prior outputs (self-bias), undermining the reliability and trustworthiness of debate outcomes.Method: 1) Formalize debate dynamics as an identity-weighted Bayesian update process; 2) Propose response anonymization by removing identity markers from prompts to force equal weights on agent identity; 3) Define Identity Bias Coefficient (IBC) to measure agents’ tendency to follow peers versus themselves.
Result: Empirical studies across multiple models and benchmarks confirm that identity bias is widespread, with sycophancy being far more common than self-bias. Response anonymization effectively reduces bias and improves trustworthiness.
Conclusion: The work highlights the need for MAD systems to reason based on content rather than identity, provides a principled framework for bias mitigation, and offers a quantitative metric (IBC) for measuring identity bias in multi-agent debates.
Abstract: Multi-agent debate (MAD) aims to improve large language model (LLM) reasoning by letting multiple agents exchange answers and then aggregate their opinions. Yet recent studies reveal that agents are not neutral: they are prone to identity-driven sycophancy and self-bias, uncritically adopting a peer’s view or stubbornly adhering to their own prior output, undermining the reliability of debate. In this work, we present the first principled framework that joins sycophancy and self-bias to mitigate and quantify identity bias in MAD. First, we formalize the debate dynamics as an identity-weighted Bayesian update process. Second, we propose response anonymization: by removing identity markers from prompts, agents cannot distinguish “self” from “peer”, which forces equal weights on agent identity, thereby reducing bias and improving trustworthiness. Third, we define the Identity Bias Coefficient (IBC), a principled bias metric that measures an agent’s tendency to follow its peer versus itself. Empirical studies across multiple models and benchmarks confirm that identity bias is widespread, with sycophancy far more common than self-bias. Our findings highlight the need to ensure that MAD systems reason based on content rather than identity. Code is released in https://github.com/deeplearning-wisc/MAD-identity-bias.
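The paper's exact IBC formula is not given in the summary; one plausible operationalization, counting only rounds where an agent's prior answer and its peer's answer disagree:

```python
def identity_bias_coefficient(rounds):
    """rounds: (own_previous, peer, own_updated) answer triples.
    +1 = always follows the peer (sycophancy);
    -1 = always keeps its own answer (self-bias)."""
    followed_peer = kept_self = 0
    for own_prev, peer, own_new in rounds:
        if own_prev == peer:
            continue                 # agreement carries no identity signal
        if own_new == peer:
            followed_peer += 1
        elif own_new == own_prev:
            kept_self += 1
    n = followed_peer + kept_self
    return (followed_peer - kept_self) / n if n else 0.0

print(identity_bias_coefficient([("A", "B", "B"),
                                 ("A", "B", "A"),
                                 ("C", "D", "D")]))  # 1/3: leans sycophantic
```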
[308] Architecting Agentic Communities using Design Patterns
Zoran Milosevic, Fethi Rabhi
Main category: cs.AI
TL;DR: The paper presents a formal framework for architecting AI agent systems using enterprise design patterns, focusing on Agentic Communities where AI agents and humans coordinate through governed roles and protocols.
Details
Motivation: The rapid evolution of LLMs and Agentic AI requires systematic architectural guidance for building production-grade systems that can coordinate effectively in enterprise environments.Method: Classifies patterns into three tiers (LLM Agents, Agentic AI, Agentic Communities), draws on distributed systems principles, and grounds patterns in a formal framework specifying collaboration agreements with roles, protocols, and governance structures.
Result: Provides a framework with practical guidance and formal verification capabilities, enabling expression of organizational, legal, and ethical rules through accountability mechanisms for verifiable governance of multi-agent ecosystems.
Conclusion: The approach offers actionable guidance for practitioners while maintaining formal rigor essential for enterprise deployment, validated through a clinical trial matching case study.
Abstract: The rapid evolution of Large Language Models (LLM) and subsequent Agentic AI technologies requires systematic architectural guidance for building sophisticated, production-grade systems. This paper presents an approach for architecting such systems using design patterns derived from enterprise distributed systems standards, formal methods, and industry practice. We classify these patterns into three tiers: LLM Agents (task-specific automation), Agentic AI (adaptive goal-seekers), and Agentic Communities (organizational frameworks where AI agents and human participants coordinate through formal roles, protocols, and governance structures). We focus on Agentic Communities - coordination frameworks encompassing LLM Agents, Agentic AI entities, and humans - most relevant for enterprise and industrial applications. Drawing on established coordination principles from distributed systems, we ground these patterns in a formal framework that specifies collaboration agreements where AI agents and humans fill roles within governed ecosystems. This approach provides both practical guidance and formal verification capabilities, enabling expression of organizational, legal, and ethical rules through accountability mechanisms that ensure operational and verifiable governance of inter-agent communication, negotiation, and intent modeling. We validate this framework through a clinical trial matching case study. Our goal is to provide actionable guidance to practitioners while maintaining the formal rigor essential for enterprise deployment in dynamic, multi-agent ecosystems.
[309] How Does the Thinking Step Influence Model Safety? An Entropy-based Safety Reminder for LRMs
Su-Hyeon Kim, Hyundong Jin, Yejin Lee, Yo-Sub Han
Main category: cs.AI
TL;DR: SafeRemind: A decoding-time defense method that injects safe-reminding phrases into thinking steps of Large Reasoning Models to prevent unsafe behavior amplification while preserving reasoning utility.
Details
Motivation: Large Reasoning Models (LRMs) use explicit thinking steps that can amplify unsafe behaviors, but conventional defenses overlook the unique reasoning dynamics of LRMs. The authors discovered that safe-reminding phrases within thinking steps play a crucial role in ensuring LRM safety.Method: SafeRemind is a decoding-time defense method that dynamically injects safe-reminding phrases into thinking steps. It uses entropy triggers to intervene at decision-locking points, redirecting potentially harmful trajectories toward safer outcomes without requiring parameter updates.
Result: Extensive evaluations across five LRMs and six benchmarks show SafeRemind substantially enhances safety, achieving improvements of up to 45.5 percentage points while preserving core reasoning utility.
Conclusion: SafeRemind effectively addresses the novel safety risks introduced by thinking steps in LRMs through targeted phrase injection at critical decision points, offering a practical defense mechanism that maintains reasoning capabilities.
Abstract: Large Reasoning Models (LRMs) achieve remarkable success through explicit thinking steps, yet the thinking steps introduce a novel risk by potentially amplifying unsafe behaviors. Despite this vulnerability, conventional defense mechanisms remain ineffective as they overlook the unique reasoning dynamics of LRMs. In this work, we find that the emergence of safe-reminding phrases within thinking steps plays a pivotal role in ensuring LRM safety. Motivated by this finding, we propose SafeRemind, a decoding-time defense method that dynamically injects safe-reminding phrases into thinking steps. By leveraging entropy triggers to intervene at decision-locking points, SafeRemind redirects potentially harmful trajectories toward safer outcomes without requiring any parameter updates. Extensive evaluations across five LRMs and six benchmarks demonstrate that SafeRemind substantially enhances safety, achieving improvements of up to 45.5%p while preserving core reasoning utility.
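A decoding-loop sketch of the mechanism described above: monitor next-token entropy and splice in a reminder phrase when it spikes. `model.next_token_probs` and the reminder text are hypothetical, and real systems would sample rather than decode greedily:

```python
import numpy as np

REMINDER = " Wait, I should check this request against my safety rules."

def decode_with_safe_remind(model, tokenizer, prompt, threshold=3.0,
                            max_tokens=512, max_injections=2):
    """Greedy decoding with entropy-triggered reminder injection."""
    ids = tokenizer.encode(prompt)
    injections = 0
    for _ in range(max_tokens):
        probs = model.next_token_probs(ids)          # hypothetical interface
        entropy = float(-np.sum(probs * np.log(probs + 1e-12)))
        if entropy > threshold and injections < max_injections:
            ids += tokenizer.encode(REMINDER)        # redirect the trajectory
            injections += 1
            continue
        ids.append(int(np.argmax(probs)))
    return tokenizer.decode(ids)
```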
[310] Sandwich Reasoning: An Answer-Reasoning-Answer Approach for Low-Latency Query Correction
Chen Zhang, Kepu Zhang, Jiatong Zhang, Xiao Zhang, Jun Xu
Main category: cs.AI
TL;DR: SandwichR introduces an Answer-Reasoning-Answer paradigm for query correction that achieves CoT-level accuracy with 40-70% latency reduction by aligning initial answers with post-hoc reasoning through consistency-aware RL.
Details
Motivation: Query correction needs high accuracy within real-time constraints. CoT reasoning improves accuracy but has prohibitive latency. Early answer approaches can't leverage reasoning to improve accuracy since answers are generated before reasoning in autoregressive decoding.Method: SandwichR uses Answer-Reasoning-Answer paradigm: initial correction → explicit reasoning → final refined correction. Uses consistency-aware RL with dedicated consistency reward to align initial and final corrections, plus margin-based rejection sampling to prioritize borderline samples where reasoning has biggest impact. Also constructs specialized query correction dataset.
Result: Achieves state-of-the-art accuracy comparable to standard CoT while delivering 40-70% latency reduction, resolving the latency-accuracy trade-off in online search systems.
Conclusion: SandwichR enables low-latency query correction without sacrificing reasoning-aware accuracy by explicitly aligning fast initial answers with post-hoc reasoning, making it practical for real-time search applications.
Abstract: Query correction is a critical entry point in modern search pipelines, demanding high accuracy strictly within real-time latency constraints. Chain-of-Thought (CoT) reasoning improves accuracy but incurs prohibitive latency for real-time query correction. A potential solution is to output an answer before reasoning to reduce latency; however, under autoregressive decoding, the early answer is independent of subsequent reasoning, preventing the model from leveraging its reasoning capability to improve accuracy. To address this issue, we propose Sandwich Reasoning (SandwichR), a novel approach that explicitly aligns a fast initial answer with post-hoc reasoning, enabling low-latency query correction without sacrificing reasoning-aware accuracy. SandwichR follows an Answer-Reasoning-Answer paradigm, producing an initial correction, an explicit reasoning process, and a final refined correction. To align the initial answer with post-reasoning insights, we design a consistency-aware reinforcement learning (RL) strategy: a dedicated consistency reward enforces alignment between the initial and final corrections, while margin-based rejection sampling prioritizes borderline samples where reasoning drives the most impactful corrective gains. Additionally, we construct a high-quality query correction dataset, addressing the lack of specialized benchmarks for complex query correction. Experimental results demonstrate that SandwichR achieves SOTA accuracy comparable to standard CoT while delivering a 40-70% latency reduction, resolving the latency-accuracy trade-off in online search.
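A minimal sketch of the consistency-aware reward idea: reward the final, post-reasoning correction for correctness and add a term that pulls the fast initial answer toward it. The additive form and weight are assumptions, not the paper's exact RL objective:

```python
def sandwich_reward(initial, final, gold, consistency_weight=0.5):
    """Reward the refined final correction for correctness, plus a
    consistency bonus aligning the fast initial answer with it."""
    correctness = 1.0 if final == gold else 0.0
    consistency = 1.0 if initial == final else 0.0
    return correctness + consistency_weight * consistency

# At serving time only the initial answer is emitted (low latency);
# training has aligned it with the post-reasoning final answer.
print(sandwich_reward("new york", "new york", "new york"))  # 1.5
```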
[311] Personalized Medication Planning via Direct Domain Modeling and LLM-Generated Heuristics
Yonatan Vernik, Alexander Tuisov, David Izhaki, Hana Weitman, Gal A. Kaminka, Alexander Shleyfman
Main category: cs.AI
TL;DR: Automated medication planning scales from 7 to 28+ medications using LLM-generated domain-specific heuristics with GBFS search.
Details
Motivation: Previous automated medication planning was limited to only 7 medications, which is clinically impractical. Need to scale up to realistic clinical levels for practical applications.Method: Programmatically specify domain (initial state + successor generation), use LLM to generate problem-specific heuristics, then apply GBFS search algorithm with these heuristics.
Result: Dramatic improvements in coverage and planning time, scaling from 7 to at least 28 medications, making medication planning more practical.
Conclusion: LLM-generated domain-specific heuristics enable significant scaling of automated medication planning, bringing it closer to practical clinical applications.
Abstract: Personalized medication planning involves selecting medications and determining a dosing schedule to achieve medical goals specific to each individual patient. Previous work successfully demonstrated that automated planners, using general domain-independent heuristics, are able to generate personalized treatments, when the domain and problems are modeled using a general domain description language (PDDL). Unfortunately, this process was limited in practice to consider no more than seven medications. In clinical terms, this is a non-starter. In this paper, we explore the use of automatically-generated domain- and problem-specific heuristics to be used with general search, as a method of scaling up medication planning to levels allowing closer work with clinicians. Specifically, we specify the domain programmatically (specifying an initial state and a successor generation procedure), and use an LLM to generate a problem-specific heuristic that can be used by a fixed search algorithm (GBFS). The results indicate dramatic improvements in coverage and planning time, scaling up the number of medications to at least 28, and bringing medication planning one step closer to practical applications.
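The search side is standard: greedy best-first search with a pluggable heuristic, which is where the LLM-generated, problem-specific heuristic slots in. A self-contained sketch on a toy numeric domain:

```python
import heapq
from itertools import count

def gbfs(start, is_goal, successors, heuristic):
    """Greedy best-first search; states must be hashable. The heuristic
    is a plain callable, so an LLM-generated one plugs in directly."""
    tie = count()  # tiebreaker so heapq never compares states
    frontier = [(heuristic(start), next(tie), start, [])]
    seen = {start}
    while frontier:
        _, _, state, plan = heapq.heappop(frontier)
        if is_goal(state):
            return plan
        for action, nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier,
                               (heuristic(nxt), next(tie), nxt, plan + [action]))
    return None

# Toy stand-in for dosing: reach level 10 from 0 with +1 / +3 doses.
plan = gbfs(0, lambda s: s == 10,
            lambda s: [(f"+{d}", s + d) for d in (1, 3) if s + d <= 10],
            heuristic=lambda s: 10 - s)
print(plan)  # ['+3', '+3', '+3', '+1']
```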
[312] EntroCoT: Enhancing Chain-of-Thought via Adaptive Entropy-Guided Segmentation
Zihang Li, Yuhang Wang, Yikun Zong, Wenhan Yu, Xiaokun Yuan, Runhan Jiang, Zirui Liu, Tong Yang, Arthur Jiang
Main category: cs.AI
TL;DR: EntroCoT is a framework that automatically identifies and filters low-quality Chain-of-Thought reasoning traces by segmenting reasoning steps at uncertain points and evaluating each step’s contribution, creating higher-quality training data for mathematical reasoning.
Details
Motivation: Existing fine-tuning datasets for Chain-of-Thought prompting often contain "answer right but reasoning wrong" problems where correct final answers come from hallucinated, redundant, or logically invalid intermediate steps, which undermines the quality of supervision for training LLMs.
Method: 1) Uses entropy-based mechanism to segment reasoning traces into multiple steps at uncertain junctures; 2) Employs Monte Carlo rollout-based mechanism to evaluate the marginal contribution of each step; 3) Filters deceptive reasoning samples to construct high-quality dataset where every intermediate step facilitates the final answer.
Result: Extensive experiments on mathematical benchmarks show that fine-tuning on the subset constructed by EntroCoT consistently outperforms baselines using full-dataset supervision.
Conclusion: EntroCoT effectively addresses the “answer right but reasoning wrong” problem in CoT datasets, providing a unified framework for automatic identification and refinement of low-quality reasoning traces, leading to improved mathematical reasoning performance in LLMs.
Abstract: Chain-of-Thought (CoT) prompting has significantly enhanced the mathematical reasoning capabilities of Large Language Models. We find that existing fine-tuning datasets frequently suffer from the “answer right but reasoning wrong” problem, where correct final answers are derived from hallucinated, redundant, or logically invalid intermediate steps. This paper proposes EntroCoT, a unified framework for automatically identifying and refining low-quality CoT supervision traces. EntroCoT first proposes an entropy-based mechanism to segment the reasoning trace into multiple steps at uncertain junctures, and then introduces a Monte Carlo rollout-based mechanism to evaluate the marginal contribution of each step. By accurately filtering deceptive reasoning samples, EntroCoT constructs a high-quality dataset where every intermediate step in each reasoning trace facilitates the final answer. Extensive experiments on mathematical benchmarks demonstrate that fine-tuning on the subset constructed by EntroCoT consistently outperforms baselines trained with full-dataset supervision.
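As a rough illustration of the two mechanisms, the sketch below cuts a reasoning trace at high-entropy token positions and scores a step by the change in rollout success rate with and without it. The threshold-based cut and all names are our assumptions; the paper does not commit to this exact formulation.

```python
import math

def token_entropy(probs: list[float]) -> float:
    """Shannon entropy of the model's next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def segment_at_uncertain_junctures(tokens: list[str],
                                   entropies: list[float],
                                   threshold: float) -> list[list[str]]:
    """Cut the reasoning trace after every token whose predictive
    entropy exceeds the threshold, yielding candidate steps."""
    steps, current = [], []
    for tok, h in zip(tokens, entropies):
        current.append(tok)
        if h > threshold:
            steps.append(current)
            current = []
    if current:
        steps.append(current)
    return steps

def marginal_contribution(solve_rate_with: float,
                          solve_rate_without: float) -> float:
    """Monte Carlo estimate: difference in success rate of rollouts
    that continue from the prefix with vs. without the step under
    test; near-zero or negative values flag a dispensable step."""
    return solve_rate_with - solve_rate_without
```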
[313] ROI-Reasoning: Rational Optimization for Inference via Pre-Computation Meta-Cognition
Muyang Zhao, Qi Qi, Hao Sun
Main category: cs.AI
TL;DR: ROI-Reasoning: A two-stage framework that teaches LLMs to strategically allocate computation under strict token budgets by predicting task difficulty and expected utility.
Details
Motivation: LLMs lack inherent understanding of how much computation different tasks require, and current approaches don't optimize reasoning under strict global token constraints, leading to inefficient computation allocation.
Method: Two-stage approach: 1) Meta-Cognitive Fine-Tuning teaches models to predict reasoning cost and expected utility before generation, enabling solve-or-skip decisions; 2) Rationality-Aware Reinforcement Learning optimizes sequential decision making under hard token budgets for long-horizon allocation strategies.
Result: ROI-Reasoning consistently improves overall score while substantially reducing regret under tight computation budgets across budgeted mathematical reasoning benchmarks.
Conclusion: The framework successfully endows LLMs with budget-aware rationality by formalizing the problem as Ordered Stochastic Multiple-Choice Knapsack Problem and teaching models to anticipate task difficulty, estimate ROI, and allocate computation strategically.
Abstract: Large language models (LLMs) can achieve strong reasoning performance with sufficient computation, but they do not inherently know how much computation a task requires. We study budgeted inference-time reasoning for multiple tasks under a strict global token constraint and formalize it as an Ordered Stochastic Multiple-Choice Knapsack Problem (OS-MCKP). This perspective highlights a meta-cognitive requirement: anticipating task difficulty, estimating return on investment (ROI), and allocating computation strategically. We propose ROI-Reasoning, a two-stage framework that endows LLMs with intrinsic, budget-aware rationality. In the first stage, Meta-Cognitive Fine-Tuning teaches models to predict reasoning cost and expected utility before generation, enabling explicit solve-or-skip decisions. Next, Rationality-Aware Reinforcement Learning optimizes sequential decision making under a hard token budget, allowing models to learn long-horizon allocation strategies. Across budgeted mathematical reasoning benchmarks, ROI-Reasoning consistently improves overall score while substantially reducing regret under tight computation budgets.
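For intuition, a toy version of the solve-or-skip decision: rank tasks by predicted utility per token and attempt them while the budget lasts. This offline greedy ignores the ordered, online nature of OS-MCKP and the RL stage, so treat it as an illustration only; all names are ours.

```python
from dataclasses import dataclass

@dataclass
class TaskEstimate:
    task_id: int
    predicted_cost: int      # tokens the model expects to spend (> 0)
    expected_utility: float  # predicted probability of solving

def allocate_under_budget(estimates: list[TaskEstimate], budget: int) -> list[int]:
    """Greedy knapsack-style allocation: attempt tasks in decreasing
    ROI (utility per token) and skip anything the remaining budget
    cannot cover. The paper's online setting sees tasks in a fixed
    order, so this offline greedy is an upper-bound caricature."""
    chosen, remaining = [], budget
    for est in sorted(estimates,
                      key=lambda e: e.expected_utility / e.predicted_cost,
                      reverse=True):
        if est.predicted_cost <= remaining:
            chosen.append(est.task_id)
            remaining -= est.predicted_cost
    return chosen
```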
[314] Defeasible Conditionals using Answer Set Programming
Racquel Dennison, Jesse Heyninck, Thomas Meyer
Main category: cs.AI
TL;DR: ASP-based declarative implementation of Rational Closure for defeasible reasoning with improved efficiency over existing imperative solvers.
Details
Motivation: The KLM framework provides foundational properties for defeasible reasoning, but existing implementations (like InfOCF) are imperative. There's a need for declarative approaches using Answer Set Programming to compute Rational Closure more efficiently.
Method: Developed a declarative ASP encoding to compute Rational Closure, enabling automatic construction of minimal ranked models from knowledge bases and supporting entailment checking for queries.
Result: Formally proved correctness of ASP encoding. Empirical evaluation shows improved computational efficiency compared to existing imperative implementations (InfOCF solver).
Conclusion: ASP-based approach successfully implements Rational Closure while adhering to theoretical foundations and offering better performance than imperative alternatives.
Abstract: Defeasible entailment is concerned with drawing plausible conclusions from incomplete information. A foundational framework for modelling defeasible entailment is the KLM framework. Introduced by Kraus, Lehmann, and Magidor, the KLM framework outlines several key properties for defeasible entailment. One of the most prominent algorithms within this framework is Rational Closure (RC). This paper presents a declarative definition for computing RC using Answer Set Programming (ASP). Our approach enables the automatic construction of the minimal ranked model from a given knowledge base and supports entailment checking for specified queries. We formally prove the correctness of our ASP encoding and conduct empirical evaluations to compare the performance of our implementation with that of existing imperative implementations, specifically the InfOCF solver. The results demonstrate that our ASP-based approach adheres to RC’s theoretical foundations and offers improved computational efficiency.
[315] XAI-LAW: A Logic Programming Tool for Modeling, Explaining, and Learning Legal Decisions
Agostino Dovier, Talissa Dreossi, Andrea Formisano, Benedetta Strizzolo
Main category: cs.AI
TL;DR: An ASP-based system for modeling Italian Criminal Code articles and learning legal rules from judicial decisions to support legal reasoning during trials.
Details
Motivation: To support legal experts during criminal trials by providing automated reasoning about legal outcomes and making judicial decision-making more interpretable through explainable AI.
Method: Encode Italian Criminal Code articles (crimes against person and property offenses) in Answer Set Programming (ASP), validate on previous verdicts, handle contradictions, and use inductive logic programming to learn legal rules from case examples.
Result: Developed a tool that generates possible decisions for new cases, provides explanations using stable model “supportedness,” and can generalize legal rules from examples through inductive learning.
Conclusion: ASP-based modeling of legal codes combined with inductive learning enables effective legal reasoning support with automatic explainability, enhancing interpretability of judicial decisions.
Abstract: We propose an approach to model articles of the Italian Criminal Code (ICC), using Answer Set Programming (ASP), and to semi-automatically learn legal rules from examples based on prior judicial decisions. The developed tool is intended to support legal experts during the criminal trial phase by providing reasoning and possible legal outcomes. The methodology involves analyzing and encoding articles of the ICC in ASP, including “crimes against the person” and property offenses. The resulting model is validated on a set of previous verdicts and refined as necessary. During the encoding process, contradictions may arise; these are properly handled by the system, which also generates possible decisions for new cases and provides explanations through a tool that leverages the “supportedness” of stable models. The automatic explainability offered by the tool can also be used to clarify the logic behind judicial decisions, making the decision-making process more interpretable. Furthermore, the tool integrates an inductive logic programming system for ASP, which is employed to generalize legal rules from case examples.
[316] Formally Explaining Decision Tree Models with Answer Set Programming
Akihiro Takemura, Masayuki Otani, Katsumi Inoue
Main category: cs.AI
TL;DR: ASP-based method for generating multiple explanation types (sufficient, contrastive, majority, tree-specific) for decision tree models, offering flexibility and enumeration capabilities compared to SAT approaches.
Details
Motivation: Decision tree models like random forests and gradient-boosted trees are widely used but difficult to interpret, especially in safety-critical applications where formal justification of model decisions is required.
Method: Proposes using Answer Set Programming (ASP) to generate various types of explanations (sufficient, contrastive, majority, and tree-specific explanations). ASP offers greater flexibility in encoding user preferences and supports enumeration of all possible explanations compared to SAT-based approaches.
Result: Empirical evaluation on diverse datasets demonstrates the effectiveness and limitations of the ASP-based approach compared to existing methods.
Conclusion: ASP provides a flexible framework for generating multiple explanation types for decision tree models, addressing interpretability needs in safety-critical applications while supporting user preferences and complete enumeration capabilities.
Abstract: Decision tree models, including random forests and gradient-boosted decision trees, are widely used in machine learning due to their high predictive performance. However, their complex structures often make them difficult to interpret, especially in safety-critical applications where model decisions require formal justification. Recent work has demonstrated that logical and abductive explanations can be derived through automated reasoning techniques. In this paper, we propose a method for generating various types of explanations, namely, sufficient, contrastive, majority, and tree-specific explanations, using Answer Set Programming (ASP). Compared to SAT-based approaches, our ASP-based method offers greater flexibility in encoding user preferences and supports enumeration of all possible explanations. We empirically evaluate the approach on a diverse set of datasets and demonstrate its effectiveness and limitations compared to existing methods.
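For intuition about what "sufficient explanation" means here, the brute-force sketch below checks sufficiency on a toy classifier with small discrete feature domains and shrinks to a subset-minimal explanation by deletion. The toy `predict` function and the shrinking loop are ours; the paper's ASP encoding does this declaratively, with the preference handling and enumeration this sketch does not attempt.

```python
from itertools import product

# Toy decision function over three binary features; stands in for a tree.
def predict(x):
    a, b, c = x
    return 1 if (a == 1 and b == 1) or c == 1 else 0

DOMAINS = [(0, 1), (0, 1), (0, 1)]

def is_sufficient(instance, fixed):
    """A subset of features is a sufficient explanation if every
    completion of the free features keeps the prediction unchanged."""
    target = predict(instance)
    free = [i for i in range(len(instance)) if i not in fixed]
    for values in product(*(DOMAINS[i] for i in free)):
        x = list(instance)
        for i, v in zip(free, values):
            x[i] = v
        if predict(tuple(x)) != target:
            return False
    return True

def shrink_to_minimal(instance):
    """Deletion-based shrinking: start from all features and drop any
    feature whose removal keeps the remaining set sufficient."""
    fixed = set(range(len(instance)))
    for i in list(fixed):
        if is_sufficient(instance, fixed - {i}):
            fixed.discard(i)
    return fixed

print(shrink_to_minimal((1, 1, 0)))  # {0, 1}: a=1 and b=1 suffice for class 1
```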
[317] xDNN(ASP): Explanation Generation System for Deep Neural Networks powered by Answer Set Programming
Ly Ly Trieu, Tran Cao Son
Main category: cs.AI
TL;DR: xDNN(ASP) is an explainable AI system that extracts logic programs from deep neural networks to provide global explanations, maintaining accuracy while revealing feature importance and hidden node impacts for network optimization.
Details
Motivation: Current xAI methods like SHAP, rule extraction, and counterfactuals focus on input-output relationships but neglect the internal structure of neural networks in explanation generation. There's a need for global explanations that capture the network's internal logic and structure.
Method: xDNN(ASP) extracts a logic program under answer set semantics from a trained neural network and its training data. The extracted program ideally represents the trained model with one-to-one correspondence between answer sets and input-output pairs of the network.
Result: Experimental evaluation on two synthetic datasets shows the extracted logic program maintains high prediction accuracy while providing valuable model understanding: feature importance analysis and hidden node impact assessment that can guide network optimization by reducing hidden layer nodes.
Conclusion: xDNN(ASP) successfully provides global explanations for deep neural networks by extracting interpretable logic programs that capture both input-output relationships and internal network structure, enabling better model understanding and optimization.
Abstract: Explainable artificial intelligence (xAI) has gained significant attention in recent years. Among other things, explainability for deep neural networks has been a topic of intensive research due to the meteoric rise in prominence of deep neural networks and their “black-box” nature. xAI approaches can be characterized along different dimensions such as their scope (global versus local explanations) or underlying methodologies (statistic-based versus rule-based strategies). Methods generating global explanations aim to provide a reasoning process applicable to all possible output classes, while local explanation methods focus only on a single, specific class. SHAP (SHapley Additive exPlanations), a well-known statistical technique, identifies important features of a network. Deep neural network rule extraction methods construct IF-THEN rules that link input conditions to a class. Another approach focuses on generating counterfactuals, which help explain how small changes to an input can affect the model's predictions. However, these techniques primarily focus on the input-output relationship and thus neglect the structure of the network in explanation generation. In this work, we propose xDNN(ASP), an explanation generation system for deep neural networks that provides global explanations. Given a neural network model and its training data, xDNN(ASP) extracts a logic program under answer set semantics that, in the ideal case, represents the trained model, i.e., answer sets of the extracted program correspond one-to-one to input-output pairs of the network. We demonstrate experimentally, using two synthetic datasets, that the extracted logic program not only maintains a high level of accuracy on the prediction task but also provides valuable information for understanding the model, such as the importance of features and the impact of hidden nodes on the prediction. The latter can be used as a guide for reducing the number of nodes used in hidden layers, i.e., providing a means for optimizing the network.
[318] Investigating the Grounding Bottleneck for a Large-Scale Configuration Problem: Existing Tools and Constraint-Aware Guessing
Veronika Semmelrock, Gerhard Friedrich
Main category: cs.AI
TL;DR: ASP faces scaling challenges for large configuration problems; constraint-aware guessing reduces memory demands to handle 30k+ component systems.
Details
Motivation: Current ASP solving techniques may not scale for large configuration problems like electronic systems with over 30,000 components, due to the grounding bottleneck where memory demands increase sharply with problem size.
Method: Investigates incremental solving approach and develops constraint-aware guessing method based on grounding analysis to reduce memory requirements.
Result: Incremental solving proved effective but still faced memory limitations; constraint-aware guessing significantly reduced memory needs for large configuration problems.
Conclusion: While ASP has realized the AI vision in many domains, scaling to large configuration problems requires addressing the grounding bottleneck through methods like constraint-aware guessing to manage memory demands.
Abstract: Answer set programming (ASP) aims to realize the AI vision: The user specifies the problem, and the computer solves it. Indeed, ASP has made this vision true in many application domains. However, will current ASP solving techniques scale up for large configuration problems? As a benchmark for such problems, we investigated the configuration of electronic systems, which may comprise more than 30,000 components. We show the potential and limits of current ASP technology, focusing on methods that address the so-called grounding bottleneck, i.e., the sharp increase of memory demands with the size of the problem instances. To push the limits, we investigated the incremental solving approach, which proved effective in practice. However, even in the incremental approach, memory demands impose significant limits. Based on an analysis of grounding, we developed the method constraint-aware guessing, which significantly reduced the memory requirements.
[319] Current Agents Fail to Leverage World Model as Tool for Foresight
Cheng Qian, Emre Can Acikgoz, Bingxuan Li, Xiusi Chen, Yuji Zhang, Bingxiang He, Qinyu Luo, Dilek Hakkani-Tür, Gokhan Tur, Yunzhu Li, Heng Ji
Main category: cs.AI
TL;DR: Current AI agents struggle to effectively use generative world models for future state anticipation, showing low simulation usage, frequent misuse of predictions, and inconsistent performance when simulation is available.
Details
Motivation: As agents face more tasks requiring future state anticipation rather than short-horizon reasoning, generative world models offer potential as external simulators. The paper investigates whether current agents can effectively leverage these world models to enhance their cognitive capabilities.
Method: The study empirically examines agent performance across diverse agentic and visual question answering tasks when given access to generative world models. It analyzes simulation invocation rates, misuse of predicted rollouts, and performance changes when simulation is available or enforced. Attribution analysis identifies specific bottlenecks in agent-world model interaction.
Result: Agents rarely invoke simulation (<1%), frequently misuse predicted rollouts (~15%), and often show inconsistent or degraded performance (up to 5% worse) when simulation is available. The main bottlenecks are agents’ inability to decide when to simulate, interpret predicted outcomes, and integrate foresight into reasoning.
Conclusion: Current agents lack the capacity to effectively use world models as cognitive tools. The findings highlight the need for mechanisms that enable calibrated, strategic interaction with world models to achieve reliable anticipatory cognition in future agent systems.
Abstract: Agents built on vision-language models increasingly face tasks that demand anticipating future states rather than relying on short-horizon reasoning. Generative world models offer a promising remedy: agents could use them as external simulators to foresee outcomes before acting. This paper empirically examines whether current agents can leverage such world models as tools to enhance their cognition. Across diverse agentic and visual question answering tasks, we observe that some agents rarely invoke simulation (fewer than 1%), frequently misuse predicted rollouts (approximately 15%), and often exhibit inconsistent or even degraded performance (up to 5%) when simulation is available or enforced. Attribution analysis further indicates that the primary bottleneck lies in the agents’ capacity to decide when to simulate, how to interpret predicted outcomes, and how to integrate foresight into downstream reasoning. These findings underscore the need for mechanisms that foster calibrated, strategic interaction with world models, paving the way toward more reliable anticipatory cognition in future agent systems.
[320] Trade-R1: Bridging Verifiable Rewards to Stochastic Environments via Process-Level Reasoning Verification
Rui Sun, Yifan Sun, Sheng Xu, Li Zhao, Jing Li, Daxin Jiang, Chen Hua, Zuo Bai
Main category: cs.AI
TL;DR: Trade-R1 framework uses process-level reasoning verification to adapt RL for financial decisions, addressing noisy market rewards through structured RAG verification and triangular consistency metrics.
Details
Motivation: RL works well for LLMs in domains with verifiable rewards (math, coding), but financial markets have verifiable but inherently noisy rewards that cause standard RL to degenerate into reward hacking.
Method: Proposes Trade-R1 framework with verification method that transforms financial document reasoning evaluation into structured RAG task. Uses triangular consistency metric assessing pairwise alignment between retrieved evidence, reasoning chains, and decisions as validity filter. Two reward strategies: Fixed-effect Semantic Reward (FSR) for stable alignment, and Dynamic-effect Semantic Reward (DSR) for coupled magnitude optimization.
Result: Experiments on asset selection across different country markets show that the paradigm reduces reward hacking, with DSR achieving superior cross-market generalization while maintaining the highest reasoning consistency.
Conclusion: Trade-R1 successfully bridges verifiable rewards to stochastic financial environments through process-level reasoning verification, addressing the challenge of noisy market returns in RL applications.
Abstract: Reinforcement Learning (RL) has enabled Large Language Models (LLMs) to achieve remarkable reasoning in domains like mathematics and coding, where verifiable rewards provide clear signals. However, extending this paradigm to financial decision-making is challenged by the market's stochastic nature: rewards are verifiable but inherently noisy, causing standard RL to degenerate into reward hacking. To address this, we propose Trade-R1, a model training framework that bridges verifiable rewards to stochastic environments via process-level reasoning verification. Our key innovation is a verification method that transforms the problem of evaluating reasoning over lengthy financial documents into a structured Retrieval-Augmented Generation (RAG) task. We construct a triangular consistency metric, assessing pairwise alignment between retrieved evidence, reasoning chains, and decisions to serve as a validity filter for noisy market returns. We explore two reward integration strategies: Fixed-effect Semantic Reward (FSR) for stable alignment signals, and Dynamic-effect Semantic Reward (DSR) for coupled magnitude optimization. Experiments on asset selection across different country markets demonstrate that our paradigm reduces reward hacking, with DSR achieving superior cross-market generalization while maintaining the highest reasoning consistency.
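A minimal reading of the triangular consistency metric, assuming vector embeddings of the three artifacts and cosine similarity per leg; the min-aggregation and the threshold are our choices, since the abstract does not pin them down:

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def triangular_consistency(evidence_vec, reasoning_vec, decision_vec) -> float:
    """Pairwise alignment over the three legs of the triangle:
    evidence<->reasoning, reasoning<->decision, evidence<->decision.
    Taking the minimum is a conservative aggregation; the paper does
    not specify one, so this is only a plausible reading."""
    legs = [cosine(evidence_vec, reasoning_vec),
            cosine(reasoning_vec, decision_vec),
            cosine(evidence_vec, decision_vec)]
    return min(legs)

def validity_filter(score: float, tau: float = 0.7) -> bool:
    """Let a noisy market return through as a reward signal only when
    the reasoning trace is internally consistent."""
    return score >= tau
```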
[321] Anti-Length Shift: Dynamic Outlier Truncation for Training Efficient Reasoning Models
Wei Wu, Liyi Chen, Congxi Xiao, Tianfu Wang, Qimeng Wang, Chengqiang Lu, Yan Gao, Yi Wu, Yao Hu, Hui Xiong
Main category: cs.AI
TL;DR: DOT (Dynamic Outlier Truncation) reduces model verbosity by targeting excessive reasoning tokens during training, achieving 78% token reduction while improving accuracy.
Details
Motivation: Large reasoning models with RL-enhanced chain-of-thought often generate excessive verbosity on simple queries, increasing deployment costs. Existing methods using explicit length penalties create optimization conflicts and don't address the underlying generative mechanisms causing overthinking.
Method: Introduces Dynamic Outlier Truncation (DOT), a training-time intervention that selectively suppresses redundant tokens by targeting only the extreme tail of response lengths within fully correct rollout groups. Combines with auxiliary KL regularization and predictive dynamic sampling for stable convergence.
Result: Significantly pushes the efficiency-performance Pareto frontier outward across multiple model scales. On AIME-24, reduces inference token usage by 78% while simultaneously increasing accuracy compared to initial policy, surpassing state-of-the-art efficient reasoning methods.
Conclusion: DOT effectively addresses the length shift phenomenon where models generate unnecessary reasoning on trivial inputs, enabling more efficient reasoning while preserving long-horizon reasoning capabilities for complex problems.
Abstract: Large reasoning models enhanced by reinforcement learning with verifiable rewards have achieved significant performance gains by extending their chain-of-thought. However, this paradigm incurs substantial deployment costs as models often exhibit excessive verbosity on simple queries. Existing efficient reasoning methods relying on explicit length penalties often introduce optimization conflicts and leave the generative mechanisms driving overthinking largely unexamined. In this paper, we identify a phenomenon termed length shift where models increasingly generate unnecessary reasoning on trivial inputs during training. To address this, we introduce Dynamic Outlier Truncation (DOT), a training-time intervention that selectively suppresses redundant tokens. This method targets only the extreme tail of response lengths within fully correct rollout groups while preserving long-horizon reasoning capabilities for complex problems. To complement this intervention and ensure stable convergence, we further incorporate auxiliary KL regularization and predictive dynamic sampling. Experimental results across multiple model scales demonstrate that our approach significantly pushes the efficiency-performance Pareto frontier outward. Notably, on AIME-24, our method reduces inference token usage by 78% while simultaneously increasing accuracy compared to the initial policy and surpassing state-of-the-art efficient reasoning methods.
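A sketch of how the truncation targets could be selected, under our own assumptions about data layout; the paper's exact tail definition and suppression mechanism are not given at this level of detail.

```python
def dot_mask(groups: list[list[dict]], quantile: float = 0.9) -> list[list[bool]]:
    """Dynamic outlier truncation, sketched: inside each rollout group
    whose responses are all correct, flag responses in the extreme
    length tail so the training step can suppress them. Groups that
    contain any incorrect rollout are left alone, preserving long
    chains on hard problems. Rollouts are dicts with keys 'length'
    and 'correct' (names are ours, not the paper's)."""
    masks = []
    for group in groups:
        if group and all(r["correct"] for r in group):
            lengths = sorted(r["length"] for r in group)
            cutoff = lengths[int(quantile * (len(lengths) - 1))]
            masks.append([r["length"] > cutoff for r in group])
        else:
            masks.append([False] * len(group))
    return masks

groups = [[{"length": 180, "correct": True}, {"length": 210, "correct": True},
           {"length": 950, "correct": True}]]
print(dot_mask(groups))  # [[False, False, True]]: only the outlier is flagged
```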
[322] MobileDreamer: Generative Sketch World Model for GUI Agent
Yilin Cao, Yufeng Zhong, Zhixiong Zeng, Liming Zheng, Jing Huang, Haibo Qiu, Peng Shi, Wenji Mao, Wan Guanglu
Main category: cs.AI
TL;DR: MobileDreamer introduces an efficient world-model-based lookahead framework for mobile GUI agents that uses textual sketch world modeling and rollout imagination to improve performance on long-horizon tasks.
Details
Motivation: Existing mobile GUI agents are mostly reactive, making decisions only from current screens, which limits their performance on long-horizon tasks. Building a world model that can forecast action outcomes would enable better decision-making, but this is challenging due to the need for spatial awareness and computational efficiency.
Method: MobileDreamer consists of two main components: 1) Textual sketch world model that transforms digital images into key task-related sketches and uses an order-invariant learning strategy to preserve spatial information of GUI elements, and 2) Rollout imagination strategy that optimizes action selection by leveraging the world model's prediction capabilities.
Result: Experiments on Android World show MobileDreamer achieves state-of-the-art performance and improves task success by 5.25%. World model evaluations confirm that the textual sketch modeling accurately forecasts key GUI elements.
Conclusion: MobileDreamer successfully addresses the limitations of reactive GUI agents by providing an efficient world-model-based lookahead framework that enables better decision-making through future imagination, significantly improving performance on long-horizon mobile GUI tasks.
Abstract: Mobile GUI agents have shown strong potential in real-world automation and practical applications. However, most existing agents remain reactive, making decisions mainly from the current screen, which limits their performance on long-horizon tasks. Building a world model from repeated interactions enables forecasting action outcomes and supports better decision making for mobile GUI agents. This is challenging because the model must predict post-action states with spatial awareness while remaining efficient enough for practical deployment. In this paper, we propose MobileDreamer, an efficient world-model-based lookahead framework that equips GUI agents with the future imagination provided by the world model. It consists of a textual sketch world model and a rollout imagination strategy for the GUI agent. The textual sketch world model forecasts post-action states through a learning process that transforms digital images into key task-related sketches, and it introduces a novel order-invariant learning strategy to preserve the spatial information of GUI elements. The rollout imagination strategy optimizes the action-selection process by leveraging the prediction capability of the world model. Experiments on Android World show that MobileDreamer achieves state-of-the-art performance and improves task success by 5.25%. World model evaluations further verify that our textual sketch modeling accurately forecasts key GUI elements.
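In its simplest one-step form, the rollout-imagination idea reduces to scoring each candidate action by the value of the state the world model imagines it leading to. A hypothetical sketch, where `imagine` and `value` are our placeholders for the sketch world model and a goal-progress scorer:

```python
from typing import Callable

def lookahead_select(state: str,
                     candidate_actions: list[str],
                     imagine: Callable[[str, str], str],
                     value: Callable[[str, str], float],
                     goal: str) -> str:
    """One-step rollout imagination: ask the world model to predict
    the post-action sketch for each candidate action, score the
    imagined state against the task goal, and act greedily."""
    scored = [(value(imagine(state, a), goal), a) for a in candidate_actions]
    return max(scored)[1]  # action whose imagined outcome scores best
```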
[323] ComfySearch: Autonomous Exploration and Reasoning for ComfyUI Workflows
Jinwei Su, Qizhen Lan, Zeyu Wang, Yinghui Xia, Hairu Wen, Yiqun Duan, Xi Xiao, Tianyu Shi, Yang Jingsong, Lewei He
Main category: cs.AI
TL;DR: ComfySearch is an agentic framework that generates functional ComfyUI pipelines through validation-guided workflow construction, outperforming existing methods on complex creative tasks.
Details
Motivation: AI-generated content has shifted to modular workflows (like ComfyUI), but the large number of components and difficulty maintaining structural consistency under graph constraints lead to low pass rates and limited quality workflows.
Method: ComfySearch is an agentic framework that explores the component space and generates functional ComfyUI pipelines via validation-guided workflow construction.
Result: ComfySearch substantially outperforms existing methods on complex and creative tasks, achieving higher executability (pass) rates, higher solution rates, and stronger generalization.
Conclusion: The proposed framework effectively addresses the limitations of current modular AI workflow generation by providing a systematic approach to constructing functional pipelines with better performance metrics.
Abstract: AI-generated content has progressed from monolithic models to modular workflows, especially on platforms like ComfyUI, allowing users to customize complex creative pipelines. However, the large number of components in ComfyUI and the difficulty of maintaining long-horizon structural consistency under strict graph constraints frequently lead to low pass rates and workflows of limited quality. To tackle these limitations, we present ComfySearch, an agentic framework that can effectively explore the component space and generate functional ComfyUI pipelines via validation-guided workflow construction. Experiments demonstrate that ComfySearch substantially outperforms existing methods on complex and creative tasks, achieving higher executability (pass) rates, higher solution rates, and stronger generalization.
[324] Agent Drift: Quantifying Behavioral Degradation in Multi-Agent LLM Systems Over Extended Interactions
Abhishek Rath
Main category: cs.AI
TL;DR: This paper introduces “agent drift” - the progressive degradation of multi-agent LLM systems over time - and proposes a framework to measure and mitigate it through the Agent Stability Index (ASI) and three mitigation strategies.
Details
Motivation: Multi-agent LLM systems are powerful for complex tasks but their long-term behavioral stability is unexamined. The paper addresses the problem of progressive degradation in agent behavior, decision quality, and inter-agent coherence over extended interactions.
Method: The authors introduce a theoretical framework for agent drift with three manifestations (semantic, coordination, behavioral drift), develop the Agent Stability Index (ASI), a composite metric across 12 dimensions, and propose three mitigation strategies: episodic memory consolidation, drift-aware routing protocols, and adaptive behavioral anchoring.
Result: Through simulation-based analysis and theoretical modeling, the study demonstrates that unchecked agent drift leads to substantial reductions in task completion accuracy and increased human intervention requirements. Theoretical analysis suggests the proposed mitigation strategies can significantly reduce drift-related errors while maintaining system throughput.
Conclusion: This work establishes a foundational methodology for monitoring, measuring, and mitigating agent drift in production agentic AI systems, with important implications for enterprise deployment reliability and AI safety research.
Abstract: Multi-agent Large Language Model (LLM) systems have emerged as powerful architectures for complex task decomposition and collaborative problem-solving. However, their long-term behavioral stability remains largely unexamined. This study introduces the concept of agent drift, defined as the progressive degradation of agent behavior, decision quality, and inter-agent coherence over extended interaction sequences. We present a comprehensive theoretical framework for understanding drift phenomena, proposing three distinct manifestations: semantic drift (progressive deviation from original intent), coordination drift (breakdown in multi-agent consensus mechanisms), and behavioral drift (emergence of unintended strategies). We introduce the Agent Stability Index (ASI), a novel composite metric framework for quantifying drift across twelve dimensions, including response consistency, tool usage patterns, reasoning pathway stability, and inter-agent agreement rates. Through simulation-based analysis and theoretical modeling, we demonstrate how unchecked agent drift can lead to substantial reductions in task completion accuracy and increased human intervention requirements. We propose three mitigation strategies: episodic memory consolidation, drift-aware routing protocols, and adaptive behavioral anchoring. Theoretical analysis suggests these approaches can significantly reduce drift-related errors while maintaining system throughput. This work establishes a foundational methodology for monitoring, measuring, and mitigating agent drift in production agentic AI systems, with direct implications for enterprise deployment reliability and AI safety research.
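The ASI is described as a composite over twelve dimensions; a weighted mean is one natural instantiation, sketched below with four of the named dimensions and a weighting scheme that is our assumption, not the paper's.

```python
from typing import Dict, Optional

def agent_stability_index(scores: Dict[str, float],
                          weights: Optional[Dict[str, float]] = None) -> float:
    """Composite stability metric, sketched: a weighted mean over
    per-dimension stability scores in [0, 1] (e.g., response
    consistency, tool-usage pattern stability, inter-agent
    agreement). Uniform weights by default."""
    if weights is None:
        weights = {k: 1.0 for k in scores}
    total = sum(weights[k] for k in scores)
    return sum(weights[k] * scores[k] for k in scores) / total

asi = agent_stability_index({
    "response_consistency": 0.92,
    "tool_usage_stability": 0.85,
    "reasoning_pathway_stability": 0.78,
    "inter_agent_agreement": 0.81,
})
print(f"ASI = {asi:.3f}")  # ASI = 0.840
```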
[325] FÆRDXEL: An Expert System for Danish Traffic Law
Luís Cruz-Filipe, Jonas Vistrup
Main category: cs.AI
TL;DR: FÆRDXEL is an explainable AI tool for symbolic reasoning in Danish traffic law that combines logic programming with transparent reasoning navigation.
Details
Motivation: To create an AI system for legal reasoning in Danish traffic law that is both accurate and explainable, addressing the need for transparent AI tools in the legal sector.
Method: Combines logic programming techniques with a novel interface that allows users to navigate through the system's reasoning process, ensuring explainability.
Result: Two evaluations: (1) Empirical evaluation showing FÆRDXEL’s conclusions align with Danish judges’ decisions in selected court cases; (2) Qualitative evaluation from legal experts indicating potential for real-world AI tools in the Danish legal sector.
Conclusion: FÆRDXEL demonstrates both accuracy in legal reasoning and explainability, showing promise as a foundation for practical AI tools in the Danish legal system.
Abstract: We present FÆRDXEL, a tool for symbolic reasoning in the domain of Danish traffic law. FÆRDXEL combines techniques from logic programming with a novel interface that allows users to navigate through its reasoning process, thereby ensuring the system’s explainability. Towards the goal of better understanding the value of FÆRDXEL, two evaluations of the system have been performed: (1) An empirical evaluation showing that for a selection of court cases, the conclusions of FÆRDXEL align with those of Danish judges. (2) A qualitative evaluation from legal experts indicating that this work has potential to become a foundation for real-world AI tools supporting professionals in the Danish legal sector.
[326] Imagining and building wise machines: The centrality of AI metacognition
Samuel G. B. Johnson, Amir-Hossein Karimi, Yoshua Bengio, Nick Chater, Tobias Gerstenberg, Kate Larson, Sydney Levine, Melanie Mitchell, Iyad Rahwan, Bernhard Schölkopf, Igor Grossmann
Main category: cs.AI
TL;DR: The paper argues that while AI has become smart, it lacks wisdom, which involves strategies for solving intractable problems including both object-level heuristics and metacognitive strategies like intellectual humility and perspective-taking.
Details
Motivation: AI systems have become increasingly smart but their wisdom hasn't kept pace. Current AI particularly struggles with metacognition, which limits its robustness, explainability, cooperation, and safety in novel environments.
Method: The paper analyzes human wisdom as a set of strategies for solving intractable problems, distinguishing between object-level strategies (heuristics for managing problems) and metacognitive strategies (like intellectual humility, perspective-taking, context-adaptability for managing object-level strategies).
Result: The analysis shows that improved metacognition in AI would lead to systems more robust to novel environments, more explainable to users, more cooperative with others, and safer by risking fewer misaligned goals with human users.
Conclusion: The paper sketches a vision for wise AI and discusses how such systems might be benchmarked, trained, and implemented, emphasizing the importance of developing metacognitive capabilities in AI systems.
Abstract: Although AI has become increasingly smart, its wisdom has not kept pace. In this article, we examine what is known about human wisdom and sketch a vision of its AI counterpart. We analyze human wisdom as a set of strategies for solving intractable problems, those outside the scope of analytic techniques, including both object-level strategies like heuristics [for managing problems] and metacognitive strategies like intellectual humility, perspective-taking, or context-adaptability [for managing object-level strategies]. We argue that AI systems particularly struggle with metacognition; improved metacognition would lead to AI that is more robust to novel environments, more explainable to users, more cooperative with others, and safer because it risks fewer misaligned goals with human users. We discuss how wise AI might be benchmarked, trained, and implemented.
[327] VERUS-LM: a Versatile Framework for Combining LLMs with Symbolic Reasoning
Benjamin Callewaert, Simon Vandevelde, Joost Vennekens
Main category: cs.AI
TL;DR: VERUS-LM is a neurosymbolic framework that combines LLMs with symbolic solvers using generic prompting, knowledge-query separation, and support for diverse logical reasoning tasks, outperforming LLMs and achieving competitive results on reasoning benchmarks.
Details
Motivation: Current neurosymbolic approaches combining LLMs and symbolic solvers have limitations: poor generalizability due to task-specific prompts, inefficiency from lack of knowledge-query separation, and restricted inferential capabilities, hindering scalability across domains.
Method: VERUS-LM employs a generic prompting mechanism, clearly separates domain knowledge from queries, and supports a wide range of logical reasoning tasks including optimization and constraint satisfaction, enhancing adaptability and reducing computational costs.
Result: The approach succeeds in diverse reasoning on a novel dataset, markedly outperforming LLMs. It achieves competitive results on common reasoning benchmarks compared to state-of-the-art approaches and significantly surpasses them on the difficult AR-LSAT dataset.
Conclusion: VERUS-LM represents a significant step towards more versatile neurosymbolic AI systems by pushing the boundaries of hybrid reasoning through improved generalizability, efficiency, and inferential capabilities.
Abstract: A recent approach to neurosymbolic reasoning is to explicitly combine the strengths of large language models (LLMs) and symbolic solvers to tackle complex reasoning tasks. However, current approaches face significant limitations, including poor generalizability due to task-specific prompts, inefficiencies caused by the lack of separation between knowledge and queries, and restricted inferential capabilities. These shortcomings hinder their scalability and applicability across diverse domains. In this paper, we introduce VERUS-LM, a novel framework designed to address these challenges. VERUS-LM employs a generic prompting mechanism, clearly separates domain knowledge from queries, and supports a wide range of different logical reasoning tasks. This framework enhances adaptability, reduces computational cost, and allows for richer forms of reasoning, such as optimization and constraint satisfaction. We show that our approach succeeds in diverse reasoning on a novel dataset, markedly outperforming LLMs. Additionally, our system achieves competitive results on common reasoning benchmarks when compared to similar state-of-the-art approaches, and significantly surpasses them on the difficult AR-LSAT dataset. By pushing the boundaries of hybrid reasoning, VERUS-LM represents a significant step towards more versatile neurosymbolic AI systems.
[328] Beyond Chemical QA: Evaluating LLM’s Chemical Reasoning with Modular Chemical Operations
Hao Li, He Cao, Bin Feng, Yanjun Shao, Xiangru Tang, Zhiyuan Yan, Li Yuan, Yonghong Tian, Yu Li
Main category: cs.AI
TL;DR: ChemCoTBench is a reasoning framework that applies Chain-of-Thought principles to chemistry tasks by treating molecular transformations as modular operations, enabling systematic reasoning for complex problems like molecular optimization and reaction prediction.
Details
Motivation: Current LLM benchmarks for chemistry focus on simple knowledge retrieval rather than the step-by-step reasoning needed for complex real-world tasks like drug design and reaction engineering. There's an untapped potential for systematic reasoning in chemistry that requires rigorous structural analysis.
Method: Introduces ChemCoTBench, a reasoning framework that bridges molecular structure understanding with arithmetic-inspired operations (addition, deletion, substitution) to formalize chemical problem-solving into transparent, step-by-step workflows. Treats molecular transformations as modular "chemical operations" to enable slow-thinking reasoning.
Result: The framework is evaluated on two high-impact tasks: Molecular Property Optimization and Chemical Reaction Prediction. The paper provides annotated datasets, a reasoning taxonomy, and baseline evaluations to establish benchmarks for systematic chemical reasoning.
Conclusion: ChemCoTBench bridges the gap between abstract reasoning methods and practical chemical discovery, establishing a foundation for advancing LLMs as tools for AI-driven scientific innovation in chemistry.
Abstract: While large language models (LLMs) with Chain-of-Thought (CoT) reasoning excel in mathematics and coding, their potential for systematic reasoning in chemistry, a domain demanding rigorous structural analysis for real-world tasks like drug design and reaction engineering, remains untapped. Current benchmarks focus on simple knowledge retrieval, neglecting step-by-step reasoning required for complex tasks such as molecular optimization and reaction prediction. To address this, we introduce ChemCoTBench, a reasoning framework that bridges molecular structure understanding with arithmetic-inspired operations, including addition, deletion, and substitution, to formalize chemical problem-solving into transparent, step-by-step workflows. By treating molecular transformations as modular “chemical operations”, the framework enables slow-thinking reasoning, mirroring the logic of mathematical proofs while grounding solutions in real-world chemical constraints. We evaluate models on two high-impact tasks: Molecular Property Optimization and Chemical Reaction Prediction. These tasks mirror real-world challenges while providing structured evaluability. By providing annotated datasets, a reasoning taxonomy, and baseline evaluations, ChemCoTBench bridges the gap between abstract reasoning methods and practical chemical discovery, establishing a foundation for advancing LLMs as tools for AI-driven scientific innovation.
[329] A framework for Conditional Reasoning in Answer Set Programming
Mario Alviano, Laura Giordano, Daniele Theseider Dupré
Main category: cs.AI
TL;DR: Introduces Conditional ASP framework combining conditional logic with ASP for conditional reasoning over answer sets using multi-preferential semantics.
Details
Motivation: To extend Answer Set Programming with conditional reasoning capabilities, allowing for more expressive knowledge representation that can handle typicality and exceptions.
Method: Builds on conditional logic with typicality, combines conditional knowledge base with ASP program, uses multi-preferential semantics (including KLM preferential semantics as special case), encodes conditional entailment in ASP.
Result: Develops a formal Conditional ASP framework that enables conditional reasoning over answer sets, provides complexity upper-bound for the approach.
Conclusion: Proposes a novel Conditional ASP framework that successfully integrates conditional logic with ASP, providing a principled approach for conditional reasoning in answer set programming with established semantics and complexity analysis.
Abstract: In this paper we introduce a Conditional Answer Set Programming framework (Conditional ASP) for the definition of conditional extensions of Answer Set Programming (ASP). The approach builds on a conditional logic with typicality and on the combination of a conditional knowledge base with an ASP program, and it allows for conditional reasoning over the answer sets of the program. The formalism relies on a multi-preferential semantics, with the KLM preferential semantics as a special case. Conditional entailment is encoded in ASP, and a complexity upper bound is provided.
[330] The ASP-based Nurse Scheduling System at the University of Yamanashi Hospital
Hidetomo Nabeshima, Mutsunori Banbara, Torsten Schaub, Takehide Soh
Main category: cs.AI
TL;DR: ASP-based nurse scheduling system successfully deployed at University of Yamanashi Hospital, addressing real-world challenges beyond academic benchmarks.
Details
Motivation: Nurse scheduling is a complex optimization problem requiring reconciliation of individual preferences with hospital staffing needs, balancing hard/soft constraints and allowing interactive adjustments. Real-world deployment presents unique challenges beyond typical academic benchmark problems.
Method: Answer Set Programming (ASP) was used to build the nurse scheduling system, with focus on practical application and necessary technological advancements to handle real-world complexities.
Result: Successful deployment of the ASP-based nurse scheduling system at the University of Yamanashi Hospital, demonstrating practical application of ASP technology in real healthcare settings.
Conclusion: The paper presents insights gained from real-world deployment and highlights the advancements in ASP technology needed to effectively manage the complexities of nurse scheduling in practical hospital environments.
Abstract: We present the design principles of a nurse scheduling system built using Answer Set Programming (ASP) and successfully deployed at the University of Yamanashi Hospital. Nurse scheduling is a complex optimization problem requiring the reconciliation of individual nurse preferences with hospital staffing needs across various wards. This involves balancing hard and soft constraints and the flexibility of interactive adjustments. While extensively studied in academia, real-world nurse scheduling presents unique challenges that go beyond typical benchmark problems and competitions. This paper details the practical application of ASP to address these challenges at the University of Yamanashi Hospital, focusing on the insights gained and the advancements in ASP technology necessary to effectively manage the complexities of real-world deployment.
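For readers unfamiliar with ASP-based scheduling, here is a toy encoding run through clingo's Python API (pip install clingo), showing the hard/soft-constraint split the abstract mentions. The rules, names, and preferences are deliberately tiny illustrations of ours, not the hospital system's encoding.

```python
import clingo  # pip install clingo

# Toy encoding: three nurses, three days, two shifts, one hard
# constraint (no day shift right after a night shift) and one soft
# preference. The deployed encoding is far richer than this.
PROGRAM = r"""
nurse(ann; bob; cat).
day(1..3).
shift(dayshift; nightshift).

% exactly one nurse staffs each shift on each day
1 { works(N, D, S) : nurse(N) } 1 :- day(D), shift(S).

% hard constraint: never a day shift immediately after a night shift
:- works(N, D, nightshift), works(N, D+1, dayshift), day(D+1).

% soft constraint: bob prefers to avoid night shifts
:~ works(bob, D, nightshift). [1@1, D]

#show works/3.
"""

ctl = clingo.Control()
ctl.add("base", [], PROGRAM)
ctl.ground([("base", [])])
ctl.solve(on_model=lambda m: print("schedule:", m))  # prints improving models
```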
[331] Interpretable Hybrid Machine Learning Models Using FOLD-R++ and Answer Set Programming
Sanne Wielinga, Jesse Heyninck
Main category: cs.AI
TL;DR: Hybrid approach combining Answer Set Programming rules with black-box ML classifiers to correct uncertain predictions and provide explanations, achieving better accuracy and interpretability in medical domains.
Details
Motivation: High-performing ML methods like neural networks are opaque, limiting trust in high-stakes domains like healthcare, while interpretable symbolic methods like ASP lack predictive power compared to ML models.
Method: Integrates ASP-derived rules from FOLD-R++ algorithm with black-box ML classifiers to selectively correct uncertain predictions and provide human-readable explanations.
Result: Experiments on five medical datasets show statistically significant performance gains in accuracy and F1 score.
Conclusion: Combining symbolic reasoning with conventional ML can achieve high interpretability without sacrificing accuracy, showing promise for trustworthy AI in critical domains.
Abstract: Machine learning (ML) techniques play a pivotal role in high-stakes domains such as healthcare, where accurate predictions can greatly enhance decision-making. However, most high-performing methods such as neural networks and ensemble methods are often opaque, limiting trust and broader adoption. In parallel, symbolic methods like Answer Set Programming (ASP) offer the possibility of interpretable logical rules but do not always match the predictive power of ML models. This paper proposes a hybrid approach that integrates ASP-derived rules from the FOLD-R++ algorithm with black-box ML classifiers to selectively correct uncertain predictions and provide human-readable explanations. Experiments on five medical datasets reveal statistically significant performance gains in accuracy and F1 score. This study underscores the potential of combining symbolic reasoning with conventional ML to achieve high interpretability without sacrificing accuracy.
[332] An ASP-Based Framework for MUSes
Mohimenul Kabir, Kuldeep S Meel
Main category: cs.AI
TL;DR: MUS-ASP: An answer set programming framework for online enumeration of minimal unsatisfiable subsets (MUSes) that accelerates both MUS enumeration and counting tasks.
Details
Motivation: Understanding the core reason for unsatisfiability in formulas is crucial for many applications. Minimal unsatisfiable subsets (MUSes) capture this core reason, but current approaches either enumerate MUSes within time limits or count total MUSes, lacking efficient online enumeration capabilities.
Method: Developed MUS-ASP, an answer set programming-based framework that translates MUS enumeration into answer set solving. Leverages ASP's strengths in knowledge representation and computational efficiency of state-of-the-art ASP systems for online enumeration.
Result: Extensive experimental evaluation demonstrates MUS-ASP’s effectiveness and highlights significant acceleration in both MUS enumeration and counting tasks, especially when integrated within hybrid solvers.
Conclusion: MUS-ASP provides an efficient ASP-based framework for online MUS enumeration that outperforms existing approaches and enables faster analysis of unsatisfiable formulas through integration with hybrid solvers.
Abstract: Given an unsatisfiable formula, understanding the core reason for unsatisfiability is crucial in several applications. One effective way to capture this is through the minimal unsatisfiable subset (MUS), the subset-minimal set of clauses that remains unsatisfiable. Current research broadly focuses on two directions: (i) enumerating as many MUSes as possible within a given time limit, and (ii) counting the total number of MUSes for a given unsatisfiable formula. In this paper, we introduce an answer set programming-based framework, named MUS-ASP, designed for online enumeration of MUSes. ASP is a powerful tool owing to its strengths in knowledge representation, and it is particularly well suited to specifying complex combinatorial problems. By translating MUS enumeration into answer set solving, MUS-ASP leverages the computational efficiency of state-of-the-art ASP systems. Our extensive experimental evaluation demonstrates the effectiveness of MUS-ASP and highlights the acceleration in both MUS enumeration and counting tasks, particularly when integrated within hybrid solvers, including the framework proposed in this paper.
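The paper's contribution is the ASP encoding itself, but the underlying notion is easy to state: a MUS is an unsatisfiable clause set that becomes satisfiable if any clause is dropped. The classic deletion-based extraction below (with a brute-force oracle standing in for a real solver) finds one MUS, whereas MUS-ASP targets enumeration.

```python
from itertools import product
from typing import Callable, FrozenSet, Tuple

Clause = Tuple[int, ...]  # DIMACS-style clause of positive/negative ints

def brute_force_sat(clauses: FrozenSet[Clause]) -> bool:
    """Tiny SAT oracle by exhaustive assignment; a stand-in only."""
    variables = sorted({abs(l) for cl in clauses for l in cl})
    for bits in product([False, True], repeat=len(variables)):
        assign = dict(zip(variables, bits))
        if all(any((l > 0) == assign[abs(l)] for l in cl) for cl in clauses):
            return True
    return False

def shrink_to_mus(clauses: FrozenSet[Clause],
                  is_sat: Callable[[FrozenSet[Clause]], bool]) -> FrozenSet[Clause]:
    """Classic deletion-based MUS extraction (not the paper's ASP
    encoding): drop any clause whose removal keeps the set UNSAT.
    What remains is subset-minimal unsatisfiable."""
    assert not is_sat(clauses), "input must be unsatisfiable"
    core = set(clauses)
    for cl in list(core):
        if not is_sat(frozenset(core - {cl})):
            core.discard(cl)
    return frozenset(core)

# (x1) & (!x1 | x2) & (!x2) is the unsatisfiable core; (x3) is irrelevant.
cnf = frozenset({(1,), (-1, 2), (-2,), (3,)})
print(sorted(shrink_to_mus(cnf, brute_force_sat)))  # [(-2,), (-1, 2), (1,)]
```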
[333] League of LLMs: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models
Qianhong Guo, Wei Xie, Xiaofang Cai, Enze Wang, Shuoyoucheng Ma, Xiaobing Sun, Tian Xia, Kai Chen, Xiaofeng Wang, Baosheng Wang
Main category: cs.AI
TL;DR: LOL (League of LLMs) is a benchmark-free evaluation paradigm that organizes multiple LLMs into a self-governed league for multi-round mutual evaluation to address data contamination, opaque operation, and subjective preference issues in LLM evaluation.
Details
Motivation: Current LLM evaluation faces critical challenges: data contamination (models trained on test data), opaque operation (black-box evaluation processes), and subjective preferences (human biases in evaluation). These issues undermine reliable assessment of LLM capabilities.
Method: LOL organizes multiple LLMs into a self-governed league for multi-round mutual evaluation. It integrates four core criteria: dynamic (adaptive evaluation), transparent (open process), objective (minimized bias), and professional (domain-specific assessment). The system enables LLMs to evaluate each other's responses in mathematics and programming domains.
Result: Experiments on eight mainstream LLMs show LOL effectively distinguishes capabilities with high internal ranking stability (Top-k consistency = 70.7%). It reveals novel findings: “memorization-based answering” behaviors in some models and statistically significant homophily bias within OpenAI family (Δ = 9, p < 0.05).
Conclusion: LOL provides a valuable complement to current LLM evaluation ecosystem by offering a benchmark-free, transparent, and objective assessment framework that can capture nuanced behaviors and biases not detectable by traditional evaluation paradigms.
Abstract: Although large language models (LLMs) have shown exceptional capabilities across a wide range of tasks, reliable evaluation remains a critical challenge due to data contamination, opaque operation, and subjective preferences. To address these issues, we propose League of LLMs (LOL), a novel benchmark-free evaluation paradigm that organizes multiple LLMs into a self-governed league for multi-round mutual evaluation. LOL integrates four core criteria (dynamic, transparent, objective, and professional) to mitigate key limitations of existing paradigms. Experiments on eight mainstream LLMs in mathematics and programming demonstrate that LOL can effectively distinguish LLM capabilities while maintaining high internal ranking stability (Top-k consistency = 70.7%). Beyond ranking, LOL reveals empirical findings that are difficult for traditional paradigms to capture. For instance, “memorization-based answering” behaviors are observed in some models, and a statistically significant homophily bias is found within the OpenAI family (Δ = 9, p < 0.05). Finally, we make our framework and code publicly available as a valuable complement to the current LLM evaluation ecosystem.
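The league mechanics can be caricatured in a few lines: peers grade each other, self-grades are excluded, and scores aggregate across judges. Everything below, including the aggregation by mean, is our simplification of the multi-round protocol.

```python
from statistics import mean
from typing import Callable

def league_ranking(models: list[str],
                   grade: Callable[[str, str], float]) -> list[tuple[str, float]]:
    """One round of mutual evaluation, sketched: every model grades
    every other model's answers, and a model's score is the mean grade
    its peers assign it. Self-grading is excluded, one simple (and
    admittedly partial) guard against the homophily bias the paper
    measures. `grade(judge, player)` abstracts away the whole
    question-generation / answering / judging exchange."""
    scores = {
        player: mean(grade(judge, player) for judge in models if judge != player)
        for player in models
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```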
[334] Attractive Metadata Attack: Inducing LLM Agents to Invoke Malicious Tools
Kanghua Mo, Li Hu, Yucheng Long, Zhihao Li
Main category: cs.AI
TL;DR: A new attack surface in LLM agents: manipulating tool metadata (names, descriptions, parameters) to influence agent behavior without prompt injection or model access.
Details
Motivation: LLM agents rely on external tools, but the tool-centric paradigm introduces an underexplored vulnerability where adversaries can manipulate tool metadata to influence agent decisions, creating a stealthy attack surface.
Method: Proposes Attractive Metadata Attack (AMA), a black-box in-context learning framework that generates highly attractive but syntactically/semantically valid tool metadata through iterative optimization, seamlessly integrating into standard tool ecosystems.
Result: High attack success rates (81%-95%) across ten realistic tool-use scenarios and various LLM agents, with significant privacy leakage and negligible impact on primary tasks. Attack remains effective against prompt-level defenses, auditor detection, and structured protocols like Model Context Protocol.
Conclusion: Metadata manipulation is a potent, stealthy attack surface orthogonal to injection attacks. Current defenses are insufficient, requiring execution-level protections beyond prompt-level and auditor-based mechanisms.
Abstract: Large language model (LLM) agents have demonstrated remarkable capabilities in complex reasoning and decision-making by leveraging external tools. However, this tool-centric paradigm introduces a previously underexplored attack surface, where adversaries can manipulate tool metadata – such as names, descriptions, and parameter schemas – to influence agent behavior. We identify this as a new and stealthy threat surface that allows malicious tools to be preferentially selected by LLM agents, without requiring prompt injection or access to model internals. To demonstrate and exploit this vulnerability, we propose the Attractive Metadata Attack (AMA), a black-box in-context learning framework that generates highly attractive but syntactically and semantically valid tool metadata through iterative optimization. The proposed attack integrates seamlessly into standard tool ecosystems and requires no modification to the agent’s execution framework. Extensive experiments across ten realistic, simulated tool-use scenarios and a range of popular LLM agents demonstrate consistently high attack success rates (81%-95%) and significant privacy leakage, with negligible impact on primary task execution. Moreover, the attack remains effective even against prompt-level defenses, auditor-based detection, and structured tool-selection protocols such as the Model Context Protocol, revealing systemic vulnerabilities in current agent architectures. These findings reveal that metadata manipulation constitutes a potent and stealthy attack surface. Notably, AMA is orthogonal to injection attacks and can be combined with them to achieve stronger attack efficacy, highlighting the need for execution-level defenses beyond prompt-level and auditor-based mechanisms. Code is available at https://github.com/SEAIC-M/AMA.
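The attack itself is described as black-box iterative optimization over tool metadata. The sketch below shows only that loop shape: a hypothetical hill-climb where `selection_likelihood` stands in for the real signal (how often the target agent picks the malicious tool); AMA's actual candidate generation and scoring are not reproduced here.

```python
import random

# Minimal hill-climb over tool metadata in the spirit of AMA (toy signal).
ATTRACTIVE_WORDS = ["official", "fast", "secure", "all-in-one", "recommended"]

def selection_likelihood(description):
    """Stand-in for the black-box signal (e.g., empirical selection rate)."""
    return sum(description.count(w) for w in ATTRACTIVE_WORDS) + random.random()

def mutate(description):
    """Propose a syntactically valid variant of the tool description."""
    return description + " " + random.choice(ATTRACTIVE_WORDS)

def optimize_metadata(seed_description, iterations=20):
    best, best_score = seed_description, selection_likelihood(seed_description)
    for _ in range(iterations):
        candidate = mutate(best)
        score = selection_likelihood(candidate)
        if score > best_score:           # keep the more "attractive" metadata
            best, best_score = candidate, score
    return best

print(optimize_metadata("Looks up the weather for a city."))
```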
[335] Answering the Unanswerable Is to Err Knowingly: Analyzing and Mitigating Abstention Failures in Large Reasoning Models
Yi Liu, Xiangyu Liu, Zequn Sun, Wei Hu
Main category: cs.AI
TL;DR: LRMs fail to abstain from unanswerable questions despite having cognitive ability to recognize flaws; proposed two-stage method improves abstention rates without harming reasoning performance.
Details
Motivation: Large reasoning models (LRMs) fail to provide appropriate abstentions when confronted with inherently unanswerable questions, creating trustworthiness issues in AI systems.Method: Two-stage method combining cognitive monitoring with inference-time intervention; lightweight approach that leverages models’ internal cognition capabilities.
Result: Experimental results show significant improvement in abstention rate while maintaining overall reasoning performance.
Conclusion: The proposed method successfully resolves the misalignment between LRMs’ internal cognition and external response, enhancing trustworthiness for unanswerable questions.
Abstract: Large reasoning models (LRMs) have shown remarkable progress on complex reasoning tasks. However, some questions posed to LRMs are inherently unanswerable, such as math problems lacking sufficient conditions. We find that LRMs continually fail to provide appropriate abstentions when confronted with these unanswerable questions. In this paper, we systematically analyze, investigate, and resolve this issue for trustworthy AI. We first conduct a detailed analysis of the distinct response behaviors of LRMs when facing unanswerable questions. Then, we show that LRMs possess sufficient cognitive capabilities to recognize the flaws in these questions. However, they fail to exhibit appropriate abstention behavior, revealing a misalignment between their internal cognition and external response. Finally, to resolve this issue, we propose a lightweight, two-stage method that combines cognitive monitoring with inference-time intervention. Experimental results demonstrate that our method significantly improves the abstention rate while maintaining the overall reasoning performance.
[336] D-Artemis: A Deliberative Cognitive Framework for Mobile GUI Multi-Agents
Hongze Mi, Yibo Feng, Wenjie Lu, Yuqi Wang, Jinyuan Li, Song Cao, He Cui, Tengfei Tian, Xuelin Zhang, Haotian Luo, Di Sun, Jun Fang, Hua Chai, Naiqiang Tan, Gang Pan
Main category: cs.AI
TL;DR: D-Artemis is a novel deliberative framework for GUI agents that uses a cognitive loop of Thinking, Alignment, and Reflection to automate user interactions without needing complex trajectory training data.
Details
Motivation: Current GUI agents face three critical challenges: data bottleneck in end-to-end training, high cost of delayed error detection, and risk of contradictory guidance. The authors aim to overcome these limitations by developing a more robust framework.Method: D-Artemis employs a three-stage cognitive loop: 1) Thinking with app-specific tip retrieval, 2) Pre-execution Alignment with Thought-Action Consistency Check and Action Correction Agent to prevent failures, and 3) post-execution Status Reflection Agent for strategic learning. It enhances general-purpose MLLMs without complex trajectory training.
Result: Achieves new SOTA results: 75.8% success rate on AndroidWorld and 96.8% on ScreenSpot-V2 benchmarks. Ablation studies confirm each component’s significant contribution to the framework’s performance.
Conclusion: D-Artemis demonstrates strong generalization capabilities for GUI automation tasks by implementing a human-inspired cognitive loop, effectively addressing key challenges in current approaches while avoiding the need for complex training datasets.
Abstract: Graphical User Interface (GUI) agents aim to automate a wide spectrum of human tasks by emulating user interaction. Despite rapid advancements, current approaches are hindered by several critical challenges: data bottleneck in end-to-end training, high cost of delayed error detection, and risk of contradictory guidance. Inspired by the human cognitive loop of Thinking, Alignment, and Reflection, we present D-Artemis, a novel deliberative framework. D-Artemis leverages a fine-grained, app-specific tip retrieval mechanism to inform its decision-making process. It also employs a proactive Pre-execution Alignment stage, where a Thought-Action Consistency (TAC) Check module and an Action Correction Agent (ACA) work in concert to mitigate the risk of execution failures. A post-execution Status Reflection Agent (SRA) completes the cognitive loop, enabling strategic learning from experience. Crucially, D-Artemis enhances the capabilities of general-purpose multimodal large language models (MLLMs) for GUI tasks without the need for training on complex trajectory datasets, demonstrating strong generalization. D-Artemis establishes new state-of-the-art (SOTA) results across both major benchmarks, achieving a 75.8% success rate on AndroidWorld and 96.8% on ScreenSpot-V2. Extensive ablation studies further demonstrate the significant contribution of each component to the framework.
[337] Multiplayer Nash Preference Optimization
Fang Wu, Xu Huang, Weihao Xuan, Zhiwei Zhang, Yijia Xiao, Guancheng Wan, Xiaomin Li, Bing Hu, Peng Xia, Jure Leskovec, Yejin Choi
Main category: cs.AI
TL;DR: MNPO generalizes Nash learning from human feedback to multiplayer games, addressing limitations of two-player methods by modeling alignment as an n-player competition for better handling of non-transitive and heterogeneous preferences.
Details
Motivation: Existing RLHF methods based on Bradley-Terry assumptions struggle with non-transitive and heterogeneous real-world preferences. While NLHF reframes alignment as a two-player Nash game, it suffers from single-opponent bias that fails to capture complex preference structures.Method: Multiplayer Nash Preference Optimization (MNPO) formulates alignment as an n-player game where each policy competes against a population of opponents while being regularized toward a reference model, extending beyond two-player interactions.
Result: MNPO inherits equilibrium guarantees from two-player methods while enabling richer competitive dynamics and improved coverage of diverse preference structures. It consistently outperforms existing NLHF baselines on instruction-following benchmarks, achieving superior alignment under heterogeneous annotator conditions.
Conclusion: MNPO establishes a principled and scalable framework for aligning LLMs with complex, non-transitive human preferences, addressing limitations of both reward-based RLHF and two-player NLHF approaches.
Abstract: Reinforcement learning from human feedback (RLHF) has emerged as the standard paradigm for aligning large language models with human preferences. However, reward-based methods built on the Bradley-Terry assumption struggle to capture the non-transitive and heterogeneous nature of real-world preferences. To address this, recent studies have reframed alignment as a two-player Nash game, giving rise to Nash learning from human feedback (NLHF). While this perspective has inspired algorithms such as INPO, ONPO, and EGPO with strong theoretical and empirical guarantees, they remain fundamentally restricted to two-player interactions, creating a single-opponent bias that fails to capture the full complexity of realistic preference structures. This work introduces Multiplayer Nash Preference Optimization (MNPO), a novel framework that generalizes NLHF to the multiplayer regime. It formulates alignment as an n-player game, where each policy competes against a population of opponents while being regularized toward a reference model. We demonstrate that MNPO inherits the equilibrium guarantees of two-player methods while enabling richer competitive dynamics and improved coverage of diverse preference structures. Comprehensive empirical evaluation shows that MNPO consistently outperforms existing NLHF baselines on instruction-following benchmarks, achieving superior alignment quality under heterogeneous annotator conditions and mixed-policy evaluation scenarios. Together, these results establish MNPO as a principled and scalable framework for aligning LLMs with complex, non-transitive human preferences. Code is available at https://github.com/smiles724/MNPO.
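To make the n-player formulation concrete, here is a toy numerical sketch (not the paper's update rule): each of three policies over five discrete responses takes multiplicative-weights steps against the average of its opponents under a synthetic preference matrix, with an anchor toward a uniform reference policy playing the role of the regularizer.

```python
import numpy as np

# Toy multiplayer preference game: K "responses", P[a, b] = Pr(a beats b).
rng = np.random.default_rng(0)
K, N_PLAYERS, ETA, LAM = 5, 3, 0.5, 0.1

logits = rng.normal(size=(K, K))
P = 1.0 / (1.0 + np.exp(-(logits - logits.T)))   # consistent: P + P.T = 1

ref = np.full(K, 1.0 / K)                         # reference policy (uniform)
policies = [np.full(K, 1.0 / K) for _ in range(N_PLAYERS)]

for step in range(200):
    new_policies = []
    for i, pi in enumerate(policies):
        # Opponent population = average of the other players' policies.
        opp = np.mean([p for j, p in enumerate(policies) if j != i], axis=0)
        payoff = P @ opp                          # win-rate of each response
        # Multiplicative-weights step, anchored toward the reference policy.
        log_pi = (1 - LAM) * np.log(pi) + LAM * np.log(ref) + ETA * payoff
        new = np.exp(log_pi - log_pi.max())
        new_policies.append(new / new.sum())
    policies = new_policies

print("approximate equilibrium policy of player 0:", np.round(policies[0], 3))
```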
[338] Agentic Exploration of Physics Models
Maximilian Nägele, Florian Marquardt
Main category: cs.AI
TL;DR: SciExplorer is an AI agent that uses large language models to autonomously explore and discover scientific laws in unknown physical systems without domain-specific instructions.
Details
Motivation: Current machine learning approaches require task-specific tailoring and cannot fully automate the iterative scientific discovery process of exploring unknown systems through experiments and analysis.Method: SciExplorer leverages large language model tool-use capabilities with minimal tools (primarily code execution) to explore physical systems without domain-specific blueprints, testing on mechanical dynamical systems, wave evolution, and quantum many-body physics.
Result: Impressive performance on recovering equations of motion from observed dynamics and inferring Hamiltonians from expectation values, demonstrating effectiveness without finetuning or task-specific instructions.
Conclusion: The approach opens doors to similar scientific exploration in other domains, enabling autonomous discovery of scientific laws without requiring domain-specific knowledge or customization.
Abstract: The process of scientific discovery relies on an interplay of observations, analysis, and hypothesis generation. Machine learning is increasingly being adopted to address individual aspects of this process. However, it remains an open challenge to fully automate the heuristic, iterative loop required to discover the laws of an unknown system by exploring it through experiments and analysis, without tailoring the approach to the specifics of a given task. Here, we introduce SciExplorer, an agent that leverages large language model tool-use capabilities to enable exploration of systems without any domain-specific blueprints, and apply it to physical systems that are initially unknown to the agent. We test SciExplorer on a broad set of models spanning mechanical dynamical systems, wave evolution, and quantum many-body physics. Despite using a minimal set of tools, primarily based on code execution, we observe impressive performance on tasks such as recovering equations of motion from observed dynamics and inferring Hamiltonians from expectation values. The demonstrated effectiveness of this setup opens the door towards similar scientific exploration in other domains, without the need for finetuning or task-specific instructions.
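One of the cited tasks, recovering equations of motion from observed dynamics, reduces to a small code-execution exercise of the kind such an agent could run. The sketch below simulates a damped oscillator and refits its coefficients by least squares; it illustrates the subtask, not SciExplorer's own tooling.

```python
import numpy as np

# Simulate a damped oscillator x'' = -k x - c x', then recover k and c
# from the trajectory alone, as an agent with a code tool might.
k_true, c_true, dt = 4.0, 0.3, 0.001
t = np.arange(0, 10, dt)
x, v = np.empty_like(t), np.empty_like(t)
x[0], v[0] = 1.0, 0.0
for i in range(len(t) - 1):                       # semi-implicit Euler
    a = -k_true * x[i] - c_true * v[i]
    v[i + 1] = v[i] + a * dt
    x[i + 1] = x[i] + v[i + 1] * dt

# Estimate acceleration by central differences, then least-squares fit
# x'' = theta_1 * x + theta_2 * x'.
acc = (x[2:] - 2 * x[1:-1] + x[:-2]) / dt**2
features = np.column_stack([x[1:-1], v[1:-1]])
theta, *_ = np.linalg.lstsq(features, acc, rcond=None)
print(f"recovered k ~ {-theta[0]:.3f}, c ~ {-theta[1]:.3f}")  # ~4.0, ~0.3
```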
[339] DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search
Fang Wu, Weihao Xuan, Heli Qi, Ximing Lu, Aaron Tu, Li Erran Li, Yejin Choi
Main category: cs.AI
TL;DR: DeepSearch integrates Monte Carlo Tree Search into RLVR training to overcome exploration bottlenecks, achieving SOTA results with 5.7x fewer GPU hours.
Details
Motivation: Current RLVR methods suffer from training plateaus due to sparse exploration patterns that miss critical reasoning paths and fail to systematically cover the solution space, leading to diminishing returns despite increased computation.Method: DeepSearch embeds Monte Carlo Tree Search directly into RLVR training with: (1) global frontier selection prioritizing promising nodes, (2) entropy-based guidance for confident path selection, and (3) adaptive replay buffer training with solution caching.
Result: Achieves 62.95% average accuracy on mathematical reasoning benchmarks, establishing new SOTA for 1.5B reasoning models while using 5.7x fewer GPU hours than extended training approaches.
Conclusion: Strategic exploration through systematic search is more effective than brute-force scaling for advancing RLVR methodologies, establishing a new direction for scaling reasoning capabilities.
Abstract: Although RLVR has become an essential component for developing advanced reasoning skills in language models, contemporary studies have documented training plateaus after thousands of optimization steps, i.e., notable decreases in performance gains despite increased computational investment. This limitation stems from the sparse exploration patterns inherent in current RLVR practices, where models rely on limited rollouts that often miss critical reasoning paths and fail to provide systematic coverage of the solution space. We present DeepSearch, a framework that integrates Monte Carlo Tree Search (MCTS) directly into RLVR training. In contrast to existing methods that rely on tree search only at inference, DeepSearch embeds structured search into the training loop, enabling systematic exploration and fine-grained credit assignment across reasoning steps. Through training-time exploration, DeepSearch addresses the fundamental bottleneck of insufficient exploration, which leads to diminishing performance improvements over prolonged training steps. Our contributions include: (1) a global frontier selection strategy that prioritizes promising nodes across the search tree, (2) selection with entropy-based guidance that identifies confident paths for supervision, and (3) adaptive replay buffer training with solution caching for efficiency. Experiments on mathematical reasoning benchmarks show that DeepSearch achieves 62.95% average accuracy and establishes a new state-of-the-art for 1.5B reasoning models, while using 5.7x fewer GPU hours than extended training approaches. These results highlight the importance of strategic exploration over brute-force scaling and demonstrate the promise of algorithmic innovation for advancing RLVR methodologies. DeepSearch establishes a new direction for scaling reasoning capabilities through systematic search rather than prolonged computation.
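A minimal sketch of the first two ingredients, global frontier selection with entropy-based guidance, is shown below: candidate nodes sit in one global priority queue scored by a value estimate plus an entropy bonus. The scores, node contents, and the form of the bonus are illustrative assumptions, not the paper's exact formulation.

```python
import heapq
import math

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_and_expand(candidates, budget=5, beta=0.5):
    """Pop the most promising nodes from one global frontier (toy scoring)."""
    frontier = []
    for node_id, (value, token_probs) in candidates.items():
        score = value + beta * entropy(token_probs)   # exploration bonus
        heapq.heappush(frontier, (-score, node_id))   # heapq is a min-heap
    expanded = []
    while frontier and len(expanded) < budget:
        neg_score, node_id = heapq.heappop(frontier)
        expanded.append((node_id, -neg_score))
    return expanded

candidates = {
    "step_a": (0.7, [0.9, 0.1]),      # confident, high value
    "step_b": (0.6, [0.5, 0.5]),      # uncertain -> larger entropy bonus
    "step_c": (0.2, [0.8, 0.2]),
}
print(select_and_expand(candidates, budget=2))
```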
[340] ChartAgent: A Multimodal Agent for Visually Grounded Reasoning in Complex Chart Question Answering
Rachneet Kaur, Nishan Srishankar, Zhen Zeng, Sumitra Ganesh, Manuela Veloso
Main category: cs.AI
TL;DR: ChartAgent: A visual reasoning agent framework for chart understanding that outperforms existing methods by up to 16.07% on chart QA benchmarks, especially excelling on unannotated charts requiring spatial reasoning.
Details
Motivation: Current multimodal LLMs struggle with unannotated charts that require precise visual interpretation rather than relying on textual shortcuts. There's a need for systems that can perform visual reasoning directly in the chart's spatial domain like humans do.Method: ChartAgent is an agentic framework that iteratively decomposes queries into visual subtasks and actively manipulates chart images using specialized vision tools. It performs actions like drawing annotations, cropping regions (segmenting pie slices, isolating bars), and localizing axes through a library of chart-specific tools.
Result: Achieves state-of-the-art accuracy on ChartBench and ChartX benchmarks, with up to 16.07% absolute gain overall and 17.31% on unannotated, numerically intensive queries. Effective across diverse chart types and complexity levels, and works as a plug-and-play framework with various underlying LLMs.
Conclusion: ChartAgent demonstrates the effectiveness of visually grounded reasoning for chart understanding using tool-augmented multimodal agents, closely mirroring human cognitive strategies and significantly outperforming prior methods on challenging chart comprehension tasks.
Abstract: Recent multimodal LLMs have shown promise in chart-based visual question answering, but their performance declines sharply on unannotated charts: those requiring precise visual interpretation rather than relying on textual shortcuts. To address this, we introduce ChartAgent, a novel agentic framework that explicitly performs visual reasoning directly within the chart’s spatial domain. Unlike textual chain-of-thought reasoning, ChartAgent iteratively decomposes queries into visual subtasks and actively manipulates and interacts with chart images through specialized actions such as drawing annotations, cropping regions (e.g., segmenting pie slices, isolating bars), and localizing axes, using a library of chart-specific vision tools to fulfill each subtask. This iterative reasoning process closely mirrors human cognitive strategies for chart comprehension. ChartAgent achieves state-of-the-art accuracy on the ChartBench and ChartX benchmarks, surpassing prior methods by up to 16.07% absolute gain overall and 17.31% on unannotated, numerically intensive queries. Furthermore, our analyses show that ChartAgent is (a) effective across diverse chart types, (b) achieves the highest scores across varying visual and reasoning complexity levels, and (c) serves as a plug-and-play framework that boosts performance across diverse underlying LLMs. Our work is among the first to demonstrate visually grounded reasoning for chart understanding using tool-augmented multimodal agents.
[341] ELAIPBench: A Benchmark for Expert-Level Artificial Intelligence Paper Understanding
Xinbang Dai, Huikang Hu, Yongrui Chen, Jiaqi Li, Rihui Jin, Yuyang Zhang, Xiaoguang Li, Lifeng Shang, Guilin Qi
Main category: cs.AI
TL;DR: ELAIPBench is a new benchmark for evaluating LLMs’ comprehension of AI research papers, featuring 403 expert-curated multiple-choice questions across three difficulty levels. Current LLMs achieve only 39.95% accuracy, far below human performance, with thinking modes and RAG systems failing to improve results.
Details
Motivation: Existing benchmarks for evaluating LLMs' comprehension of academic papers are inadequate - they either use surface-level questions or unreliable evaluation metrics. There's a need for a benchmark that captures deep comprehension and reasoning about full-length academic papers, particularly in the AI domain.Method: Developed ELAIPBench through an incentive-driven, adversarial annotation process involving domain experts. Created 403 multiple-choice questions from 137 AI research papers, spanning three difficulty levels. Questions emphasize non-trivial reasoning rather than shallow retrieval. Evaluated various LLMs including frontier models with thinking modes and RAG systems.
Result: Best-performing LLM achieved only 39.95% accuracy, far below human performance. Frontier LLMs with thinking modes or RAG systems failed to improve results - in some cases even harming accuracy due to overthinking or noisy retrieval. The benchmark reveals significant limitations in current LLMs’ ability to comprehend academic papers.
Conclusion: There is a substantial gap between current LLM capabilities and genuine comprehension of academic papers. Existing enhancement techniques like thinking modes and RAG systems are insufficient for this task. ELAIPBench provides a valuable benchmark for future research into improving LLMs’ deep reasoning about complex academic content.
Abstract: While large language models (LLMs) excel at many domain-specific tasks, their ability to deeply comprehend and reason about full-length academic papers remains underexplored. Existing benchmarks often fall short of capturing such depth, either due to surface-level question design or unreliable evaluation metrics. To address this gap, we introduce ELAIPBench, a benchmark curated by domain experts to evaluate LLMs’ comprehension of artificial intelligence (AI) research papers. Developed through an incentive-driven, adversarial annotation process, ELAIPBench features 403 multiple-choice questions from 137 papers. It spans three difficulty levels and emphasizes non-trivial reasoning rather than shallow retrieval. Our experiments show that the best-performing LLM achieves an accuracy of only 39.95%, far below human performance. Moreover, we observe that frontier LLMs equipped with a thinking mode or a retrieval-augmented generation (RAG) system fail to improve final results, and can even harm accuracy due to overthinking or noisy retrieval. These findings underscore the significant gap between current LLM capabilities and genuine comprehension of academic papers.
[342] Better Call CLAUSE: A Discrepancy Benchmark for Auditing LLMs Legal Reasoning Capabilities
Manan Roy Choudhury, Adithya Chandramouli, Mannan Anand, Vivek Gupta
Main category: cs.AI
TL;DR: CLAUSE benchmark evaluates LLMs’ ability to detect subtle legal flaws in contracts, revealing models often miss nuanced errors and struggle with legal justifications.
Details
Motivation: The rapid integration of LLMs into high-stakes legal work has exposed a critical gap: no benchmark exists to systematically stress-test their reliability against the nuanced, adversarial, and often subtle flaws present in real-world contracts.Method: Introduces CLAUSE benchmark with over 7500 real-world perturbed contracts from CUAD and ContractNLI datasets. Uses a novel persona-driven pipeline to generate 10 distinct anomaly categories, validated against official statutes using Retrieval-Augmented Generation (RAG) system for legal fidelity.
Result: Analysis shows key weakness: leading LLMs often miss subtle errors and struggle even more to justify them legally. Models demonstrate fragility in legal reasoning when faced with fine-grained discrepancies.
Conclusion: The work outlines a path to identify and correct reasoning failures in legal AI, providing a systematic benchmark for evaluating LLM reliability in high-stakes legal applications.
Abstract: The rapid integration of large language models (LLMs) into high-stakes legal work has exposed a critical gap: no benchmark exists to systematically stress-test their reliability against the nuanced, adversarial, and often subtle flaws present in real-world contracts. To address this, we introduce CLAUSE, a first-of-its-kind benchmark designed to evaluate the fragility of an LLM’s legal reasoning. We study the capabilities of LLMs to detect and reason about fine-grained discrepancies by producing over 7500 real-world perturbed contracts from foundational datasets like CUAD and ContractNLI. Our novel, persona-driven pipeline generates 10 distinct anomaly categories, which are then validated against official statutes using a Retrieval-Augmented Generation (RAG) system to ensure legal fidelity. We use CLAUSE to evaluate leading LLMs’ ability to detect embedded legal flaws and explain their significance. Our analysis shows a key weakness: these models often miss subtle errors and struggle even more to justify them legally. Our work outlines a path to identify and correct such reasoning failures in legal AI.
[343] Chain-of-Thought as a Lens: Evaluating Structured Reasoning Alignment between Human Preferences and Large Language Models
Boxuan Wang, Zhuoyun Li, Xinmiao Huang, Xiaowei Huang, Yi Dong
Main category: cs.AI
TL;DR: The paper introduces Alignment Score, a semantic-level metric to measure how well LLM reasoning chains align with human preferences, showing it correlates with task accuracy and peaks at 2-hop reasoning.
Details
Motivation: There's a need to quantitatively assess whether multi-step reasoning in large language models aligns with human preferences, as current methods lack semantic-level evaluation of reasoning chains.Method: Developed Alignment Score metric that constructs semantic-entropy-based matrices over intermediate reasoning steps, comparing model-produced chain-of-thought traces with human-preferred references and measuring their divergence.
Result: Alignment Score tracks task accuracy across models and reasoning depths, peaks at 2-hop reasoning, and shows misalignment at greater depths is driven by thematic shift and redundant reasoning errors.
Conclusion: Alignment Score provides a meaningful diagnostic signal for structured reasoning, with strong correlation to accuracy performance, supporting its use for evaluating reasoning alignment.
Abstract: This paper demonstrates a method for quantitatively assessing the alignment between multi-step, structured reasoning in large language models and human preferences. We introduce the Alignment Score, a semantic-level metric that compares model-produced chain-of-thought traces with a human-preferred reference by constructing semantic-entropy-based matrices over intermediate steps and measuring their divergence. Our analysis shows that the Alignment Score tracks task accuracy across models and hop depths, and peaks at 2-hop reasoning. Empirical results further indicate that misalignment at greater reasoning depths is driven mainly by alignment errors such as thematic shift and redundant reasoning. Viewing chain sampling as drawing from a distribution over reasoning paths, we empirically demonstrate a strong and consistent correlation between the Alignment Score and accuracy, supporting its use as a meaningful diagnostic signal for structured reasoning.
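The general shape of the metric, building similarity matrices over intermediate steps, converting them to distributions, and comparing entropy profiles, can be illustrated with a toy computation. The bag-of-words embeddings, softmax rows, and the final `1/(1 + divergence)` mapping below are stand-ins; the paper's exact construction differs.

```python
import numpy as np

# Toy comparison of two reasoning chains via entropy profiles of their
# step-similarity matrices (the paper's construction is not reproduced).
def embed(step, vocab):
    return np.array([step.lower().split().count(w) for w in vocab], float)

def entropy_profile(steps, vocab):
    E = np.stack([embed(s, vocab) for s in steps])
    E /= np.linalg.norm(E, axis=1, keepdims=True) + 1e-9
    sim = E @ E.T                                   # step-similarity matrix
    probs = np.exp(sim) / np.exp(sim).sum(axis=1, keepdims=True)
    return -(probs * np.log(probs)).sum(axis=1)     # entropy per step

model_chain = ["compute the train speed", "convert hours to minutes",
               "multiply speed by time"]
ref_chain = ["compute the train speed", "multiply speed by travel time",
             "convert hours to minutes"]
vocab = sorted({w for s in model_chain + ref_chain for w in s.lower().split()})

divergence = float(np.abs(entropy_profile(model_chain, vocab)
                          - entropy_profile(ref_chain, vocab)).mean())
print("toy alignment score:", round(1.0 / (1.0 + divergence), 3))
```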
[344] Faithful-First Reasoning, Planning, and Acting for Multimodal LLMs
Junxian Li, Xinyue Xu, Sai Ma, Di Zhang, Sichao Li
Main category: cs.AI
TL;DR: Faithful-First RPA framework improves perceptual faithfulness in multimodal reasoning by using FaithEvi for supervision and FaithAct for planning, reducing hallucinations without sacrificing accuracy.
Details
Motivation: Multimodal Large Language Models (MLLMs) often generate unfaithful reasoning chains that drift from visual evidence or contradict final predictions, leading to hallucination problems in multimodal reasoning tasks.Method: Proposes Faithful-First Reasoning, Planning, and Acting (RPA) framework with two components: FaithEvi provides step-wise and chain-level supervision by evaluating faithfulness of intermediate reasoning, and FaithAct uses these signals to plan and execute faithfulness-aware actions during inference.
Result: Improves perceptual faithfulness by up to 24% over prompt-based and tool-augmented reasoning frameworks across multiple multimodal reasoning benchmarks, without degrading task accuracy. Analysis shows it produces perceptually faithful reasoning trajectories and mitigates hallucination behavior.
Conclusion: Establishes a unified framework for both evaluating and enforcing faithfulness in multimodal reasoning, demonstrating that treating faithfulness as a guiding principle effectively addresses hallucination problems in MLLMs.
Abstract: Multimodal Large Language Models (MLLMs) frequently suffer from unfaithfulness, generating reasoning chains that drift from visual evidence or contradict final predictions. We propose the Faithful-First Reasoning, Planning, and Acting (RPA) framework, in which FaithEvi provides step-wise and chain-level supervision by evaluating the faithfulness of intermediate reasoning, and FaithAct uses these signals to plan and execute faithfulness-aware actions during inference. Experiments across multiple multimodal reasoning benchmarks show that faithful-first RPA improves perceptual faithfulness by up to 24% over prompt-based and tool-augmented reasoning frameworks, without degrading task accuracy. Our analysis shows that treating faithfulness as a guiding principle yields perceptually faithful reasoning trajectories and mitigates hallucination behavior. This work thereby establishes a unified framework for both evaluating and enforcing faithfulness in multimodal reasoning. Code will be released upon acceptance.
[345] Boosting In-Silicon Directed Evolution with Fine-Tuned Protein Language Model and Tree Search
Yaodong Yang, Yang Wang, Jinpeng Li, Pei Guo, Da Han, Guangyong Chen, Pheng-Ann Heng
Main category: cs.AI
TL;DR: AlphaDE is a novel framework that combines fine-tuned protein language models with Monte Carlo tree search for efficient directed protein evolution, outperforming previous methods.
Details
Motivation: Existing directed evolution methods rely on heuristic strategies and fail to efficiently integrate protein language models with advanced optimization techniques like reinforcement learning for adaptive evolution policies.Method: Two-step approach: 1) Fine-tune pretrained protein language models using masked language modeling on homologous sequences to activate evolutionary plausibility; 2) Use test-time inference with Monte Carlo tree search to evolve proteins with guidance from the fine-tuned model.
Result: AlphaDE remarkably outperforms previous state-of-the-art methods even with few-shot fine-tuning, and successfully condenses the protein sequence space of avGFP through computational evolution.
Conclusion: AlphaDE bridges the gap between protein language models and advanced optimization techniques, providing an effective framework for in-silico directed protein evolution with evolutionary guidance.
Abstract: Protein evolution through amino acid mutations is a cornerstone of life sciences. Recent advances in protein language models have shown rich evolutionary patterns, offering unprecedented potential for in silico directed evolution. However, existing directed evolution methods largely rely on heuristic evolution strategies and have yet to efficiently integrate the transformative protein language models with advanced optimization techniques, such as reinforcement learning, to adaptively learn superior evolution policies. To bridge this gap, we propose AlphaDE, a novel framework that evolves protein sequences by harnessing the innovative paradigms of large language models, such as fine-tuning and test-time inference. First, AlphaDE fine-tunes pretrained protein language models using masked language modeling on homologous protein sequences to activate the evolutionary plausibility of the interested protein family. Second, AlphaDE introduces test-time inference based on Monte Carlo tree search, which effectively evolves proteins with evolutionary guidance from the fine-tuned protein language model. Extensive benchmark experiments show that AlphaDE remarkably outperforms previous state-of-the-art methods even with few-shot fine-tuning. A case study further demonstrates that AlphaDE supports condensing the protein sequence space of avGFP through computational evolution.
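As a stripped-down illustration of search guided by a sequence scorer, the sketch below runs a one-step UCB bandit over single-point mutations, with a stand-in `plausibility` function in place of the fine-tuned protein language model and a bandit in place of full Monte Carlo tree search.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def plausibility(seq):
    """Stand-in for the fine-tuned protein LM's (noisy) sequence score."""
    return -abs(seq.count("A") - 3) + random.gauss(0, 0.1)

def ucb_evolve(seq, rounds=400, c=1.4):
    """Pick a single-point mutation by UCB over (position, residue) arms."""
    arms = [(i, aa) for i in range(len(seq)) for aa in AMINO_ACIDS]
    counts, values = [0] * len(arms), [0.0] * len(arms)
    for t in range(1, rounds + 1):
        scores = [
            (values[k] / counts[k]) + c * math.sqrt(math.log(t) / counts[k])
            if counts[k] else float("inf")
            for k in range(len(arms))
        ]
        k = scores.index(max(scores))             # explore-exploit choice
        i, aa = arms[k]
        reward = plausibility(seq[:i] + aa + seq[i + 1:])
        counts[k] += 1
        values[k] += reward
    best = max(range(len(arms)), key=lambda k: values[k] / max(counts[k], 1))
    i, aa = arms[best]
    return seq[:i] + aa + seq[i + 1:]

print(ucb_evolve("MKTFFVLLA"))
```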
[346] Multi-Agent LLM Orchestration Achieves Deterministic, High-Quality Decision Support for Incident Response
Philip Drammeh
Main category: cs.AI
TL;DR: Multi-agent LLM systems achieve 100% actionable incident response recommendations vs 1.7% for single-agent, with zero quality variance, making them production-ready.
Details
Motivation: Single-agent LLM approaches generate vague, unusable recommendations for incident response, lacking the quality and consistency needed for production systems.Method: Developed MyAntFarm.ai, a reproducible containerized framework comparing single-agent vs multi-agent systems through 348 controlled trials on identical incident scenarios, introducing Decision Quality (DQ) metric.
Result: Multi-agent orchestration achieved 100% actionable recommendation rate vs 1.7% for single-agent, with 80x improvement in action specificity, 140x improvement in solution correctness, and zero quality variance across all trials.
Conclusion: Multi-agent orchestration transforms from performance optimization to production-readiness requirement for LLM-based incident response, enabling SLA commitments impossible with single-agent systems.
Abstract: Large language models (LLMs) promise to accelerate incident response in production systems, yet single-agent approaches generate vague, unusable recommendations. We present MyAntFarm.ai, a reproducible containerized framework demonstrating that multi-agent orchestration fundamentally transforms LLM-based incident response quality. Through 348 controlled trials comparing a single-agent copilot with multi-agent systems on identical incident scenarios, we find that multi-agent orchestration achieves a 100% actionable recommendation rate versus 1.7% for single-agent approaches, an 80-times improvement in action specificity and a 140-times improvement in solution correctness. Critically, multi-agent systems exhibit zero quality variance across all trials, enabling production SLA commitments impossible with inconsistent single-agent outputs. Both architectures achieve similar comprehension latency (approximately 40 s), establishing that the architectural value lies in deterministic quality, not speed. We introduce Decision Quality (DQ), a novel metric capturing validity, specificity, and correctness properties essential for operational deployment that existing LLM metrics do not address. These findings reframe multi-agent orchestration from a performance optimization to a production-readiness requirement for LLM-based incident response. All code, Docker configurations, and trial data are publicly available for reproduction.
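The DQ metric is described as capturing validity, specificity, and correctness. A hypothetical reading, three boolean checks averaged into a score, is sketched below; the paper's actual rubric and weighting are not given here.

```python
from dataclasses import dataclass

# Sketch of a Decision Quality-style check; predicates mirror the three
# named properties but are illustrative proxies, not the paper's rubric.
@dataclass
class Recommendation:
    text: str
    references_failing_component: bool   # validity proxy
    includes_concrete_command: bool      # specificity proxy
    matches_known_fix: bool              # correctness proxy

def decision_quality(rec: Recommendation) -> float:
    checks = [rec.references_failing_component,
              rec.includes_concrete_command,
              rec.matches_known_fix]
    return sum(checks) / len(checks)     # 1.0 only when fully actionable

rec = Recommendation(
    "Restart payments-svc: kubectl rollout restart deploy/payments-svc",
    references_failing_component=True,
    includes_concrete_command=True,
    matches_known_fix=True,
)
print("DQ =", decision_quality(rec))
```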
[347] Navigating Taxonomic Expansions of Entity Sets Driven by Knowledge Bases
Giovanni Amendola, Pietro Cofone, Marco Manna, Aldo Ricioppo
Main category: cs.AI
TL;DR: The paper proposes efficient reasoning tasks for navigating expansion graphs without full materialization, enabling practical entity set expansion with taxonomic structures.
Details
Motivation: Traditional linear entity set expansion doesn't reveal rich taxonomic structures present in knowledge resources. While expansion graphs provide this structure, their potentially large size makes full materialization impractical in real-world scenarios.Method: Formalize reasoning tasks that check whether two tuples belong to comparable, incomparable, or the same nodes in the expansion graph. Implement these tasks efficiently under realistic assumptions like bounding input or limiting entity descriptions.
Result: Under realistic assumptions, the reasoning tasks can be implemented efficiently, enabling local, incremental navigation of expansion graphs without requiring full graph construction.
Conclusion: The approach supports practical applications of taxonomic entity set expansion by allowing efficient local navigation of expansion graphs rather than requiring full materialization of potentially large structures.
Abstract: Recognizing similarities among entities is central to both human cognition and computational intelligence. Within this broader landscape, Entity Set Expansion is one prominent task aimed at taking an initial set of (tuples of) entities and identifying additional ones that share relevant semantic properties with the former – potentially repeating the process to form increasingly broader sets. However, this "linear" approach does not unveil the richer "taxonomic" structures present in knowledge resources. A recent logic-based framework introduces the notion of an expansion graph: a rooted directed acyclic graph where each node represents a semantic generalization labeled by a logical formula, and edges encode strict semantic inclusion. This structure supports taxonomic expansions of entity sets driven by knowledge bases. Yet, the potentially large size of such graphs may make full materialization impractical in real-world scenarios. To overcome this, we formalize reasoning tasks that check whether two tuples belong to comparable, incomparable, or the same nodes in the graph. Our results show that, under realistic assumptions – such as bounding the input or limiting entity descriptions – these tasks can be implemented efficiently. This enables local, incremental navigation of expansion graphs, supporting practical applications without requiring full graph construction.
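The comparability checks can be pictured on a plain adjacency-list DAG: two nodes are comparable exactly when one reaches the other along strict-inclusion edges, so classification needs only local reachability rather than a materialized order over the whole graph. The toy taxonomy below is illustrative.

```python
# Toy expansion-graph comparability check on an adjacency-list DAG,
# where edges point from a node to its strictly more general parents.
GRAPH = {
    "poodle": ["dog"],
    "dog": ["mammal"],
    "cat": ["mammal"],
    "mammal": ["animal"],
    "animal": [],
}

def reaches(graph, src, dst):
    """Depth-first reachability along inclusion edges."""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, []))
    return False

def relation(graph, a, b):
    """Classify two nodes without building the full expansion graph order."""
    if a == b:
        return "same node"
    if reaches(graph, a, b) or reaches(graph, b, a):
        return "comparable"
    return "incomparable"

print(relation(GRAPH, "poodle", "animal"))   # comparable
print(relation(GRAPH, "dog", "cat"))         # incomparable
```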
[348] NEMO-4-PAYPAL: Leveraging NVIDIA’s Nemo Framework for empowering PayPal’s Commerce Agent
Sudhanshu Garg, Andrew Wang, Chaitanya Kulkarni, Ali Sahami, Farhad Farahani, Sean Yun-Shiuan Chuang, Jian Wan, Srinivasan Manoharan, Uma Kona, Nitin Sharma, Linsey Pang, Prakhar Mehrotra, Jessica Clark, Mark Moyou
Main category: cs.AI
TL;DR: PayPal developed a commerce agent using NVIDIA’s NeMo Framework, fine-tuning a Nemotron SLM to optimize search/retrieval performance, achieving significant latency and cost improvements while maintaining quality.
Details
Motivation: To revolutionize agentic commerce on PayPal by optimizing multi-agent system performance, specifically addressing the retrieval component which accounts for over 50% of total agent response time and represents a key bottleneck.Method: Used NVIDIA’s NeMo Framework for LLM fine-tuning, replacing base model with fine-tuned Nemotron small language model (SLM). Conducted systematic hyperparameter sweeps across learning rates, optimizers (Adam, AdamW), cosine annealing schedules, and LoRA ranks using llama3.1-nemotron-nano-8B-v1 architecture with LoRA-based training.
Result: Fine-tuned Nemotron SLM effectively resolved key performance issues in retrieval component while maintaining or enhancing overall system performance. Achieved significant improvements in latency and cost for commerce-specific tasks.
Conclusion: Successfully demonstrated the first application of NVIDIA’s NeMo Framework to commerce agent optimization, creating a scalable framework for multi-agent system optimization in production e-commerce environments with practical performance and cost benefits.
Abstract: We present the development and optimization of PayPal’s Commerce Agent, powered by NEMO-4-PAYPAL, a multi-agent system designed to revolutionize agentic commerce on the PayPal platform. Through our strategic partnership with NVIDIA, we leveraged the NeMo Framework for LLM model fine-tuning to enhance agent performance. Specifically, we optimized the Search and Discovery agent by replacing our base model with a fine-tuned Nemotron small language model (SLM). We conducted comprehensive experiments using the llama3.1-nemotron-nano-8B-v1 architecture, training LoRA-based models through systematic hyperparameter sweeps across learning rates, optimizers (Adam, AdamW), cosine annealing schedules, and LoRA ranks. Our contributions include: (1) the first application of NVIDIA’s NeMo Framework to commerce-specific agent optimization, (2) LLM powered fine-tuning strategy for retrieval-focused commerce tasks, (3) demonstration of significant improvements in latency and cost while maintaining agent quality, and (4) a scalable framework for multi-agent system optimization in production e-commerce environments. Our results demonstrate that the fine-tuned Nemotron SLM effectively resolves the key performance issue in the retrieval component, which represents over 50% of total agent response time, while maintaining or enhancing overall system performance.
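The sweep itself is a plain grid over the listed hyperparameters. The skeleton below shows that structure with a placeholder `train_and_eval`; the actual runs were launched through NVIDIA's NeMo Framework, whose APIs are not reproduced here.

```python
from itertools import product

# Skeleton of a systematic hyperparameter sweep; values are illustrative.
learning_rates = [1e-5, 5e-5, 1e-4]
optimizers = ["adam", "adamw"]
lora_ranks = [8, 16, 32]

def train_and_eval(lr, optimizer, rank):
    """Placeholder: launch a LoRA fine-tune, return a validation metric."""
    return 1.0 / (1.0 + abs(lr - 5e-5) * 1e4 + (optimizer == "adam") + rank / 64)

results = []
for lr, opt, rank in product(learning_rates, optimizers, lora_ranks):
    score = train_and_eval(lr, opt, rank)
    results.append(((lr, opt, rank), score))

best_config, best_score = max(results, key=lambda r: r[1])
print("best config:", best_config, "score:", round(best_score, 3))
```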
[349] Monadic Context Engineering
Yifan Zhang, Yang Yuan, Mengdi Wang, Andrew Chi-Chih Yao
Main category: cs.AI
TL;DR: MCE introduces a monadic architecture for AI agents using Functors, Applicatives, and Monads to manage state, errors, and concurrency systematically.
Details
Motivation: Current AI agent architectures are brittle with ad hoc patterns, leading to difficulties in state management, error handling, and concurrency. There's a need for a formal foundation for agent design.Method: Monadic Context Engineering (MCE) uses algebraic structures (Functors, Applicative Functors, Monads) to treat agent workflows as computational contexts. Monads handle sequential composition, Applicatives manage parallel execution, and Monad Transformers enable systematic composition of capabilities.
Result: Enables construction of complex, resilient, and efficient AI agents from simple, independently verifiable components. Extends to Meta-Agents for generative orchestration through metaprogramming.
Conclusion: MCE provides a formal architectural paradigm that addresses brittleness in current agent systems by leveraging algebraic structures for systematic management of cross-cutting concerns in AI agent workflows.
Abstract: The proliferation of Large Language Models (LLMs) has catalyzed a shift towards autonomous agents capable of complex reasoning and tool use. However, current agent architectures are frequently constructed using imperative, ad hoc patterns. This results in brittle systems plagued by difficulties in state management, error handling, and concurrency. This paper introduces Monadic Context Engineering (MCE), a novel architectural paradigm leveraging the algebraic structures of Functors, Applicative Functors, and Monads to provide a formal foundation for agent design. MCE treats agent workflows as computational contexts where cross-cutting concerns, such as state propagation, short-circuiting error handling, and asynchronous execution, are managed intrinsically by the algebraic properties of the abstraction. We demonstrate how Monads enable robust sequential composition, how Applicatives provide a principled structure for parallel execution, and crucially, how Monad Transformers allow for the systematic composition of these capabilities. This layered approach enables developers to construct complex, resilient, and efficient AI agents from simple, independently verifiable components. We further extend this framework to describe Meta-Agents, which leverage MCE for generative orchestration, dynamically creating and managing sub-agent workflows through metaprogramming.
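The core claim, that short-circuiting error handling falls out of the monadic structure rather than from scattered try/except blocks, can be shown with a minimal Result monad. The example below is a generic Python illustration, not code from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar, Union

T = TypeVar("T")

# Minimal Result monad: failures propagate through bind without explicit
# error-handling code at each agent step.
@dataclass
class Ok(Generic[T]):
    value: T

@dataclass
class Err:
    reason: str

Result = Union[Ok, Err]

def bind(result: Result, f: Callable) -> Result:
    """Monadic bind: apply f only on success, pass Err through untouched."""
    return f(result.value) if isinstance(result, Ok) else result

def retrieve(query: str) -> Result:
    return Ok(f"docs for {query}") if query else Err("empty query")

def summarize(docs: str) -> Result:
    return Ok(docs.upper())

# An agent pipeline composed by bind; a failure anywhere short-circuits.
print(bind(retrieve("nash games"), summarize))   # Ok(value='DOCS FOR NASH GAMES')
print(bind(retrieve(""), summarize))             # Err(reason='empty query')
```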
[350] MindWatcher: Toward Smarter Multimodal Tool-Integrated Reasoning
Jiawei Chen, Xintian Shen, Lihao Zheng, Zhenwei Shao, Handong Cui, Chaoqun Du, Li Gong, Feng Gu, Xuefeng Hao, Wei He, Jiabang He, Yi Hu, Bin Huang, Shanshan Li, Qizhen Li, Jing Luo, Zide Liu, Xiaobo Liu, Ning Mao, Lifu Mu, Xuhao Pan, Zhiheng Qu, Chang Ren, Xudong Rao, Haoyi Sun, Qian Wang, Shuai Wang, Zhichao Wang, Wei Wang, Lian Wen, Jiqing Zhan, Hongfu Yang, Sheng Yang, Jiajun Yang, Pengfei Yu, Hongyuan Zhang, Bin Zhang, Chunpeng Zhou, Zheng Zhou, Shucheng Zhou, Shuo Xie, Yun Zhu, Hao Ma, Tao Wei, Pan Zhou, Wei Chen
Main category: cs.AI
TL;DR: MindWatcher is a tool-integrated reasoning agent that combines interleaved thinking with multimodal chain-of-thought reasoning to autonomously decide when and how to invoke tools, outperforming larger models through superior tool usage.
Details
Motivation: Traditional workflow-based agents have limited intelligence for real-world problems requiring tool invocation. There's a need for autonomous reasoning agents that can make complex decisions involving multi-step interactions with external environments without relying on human prompts or predefined workflows.Method: MindWatcher integrates interleaved thinking (switching between thinking and tool calling at any stage) with multimodal chain-of-thought reasoning (manipulating images during reasoning). It uses automated data auditing/evaluation pipelines, manually curated datasets, and a comprehensive suite of auxiliary reasoning tools. Features include a large-scale local image retrieval database covering 8 categories and efficient training infrastructure.
Result: MindWatcher matches or exceeds performance of larger/recent models through superior tool invocation. The research also uncovered critical insights like genetic inheritance phenomenon in agentic RL. The system was evaluated using the MindWatcher-Evaluate Bench (MWE-Bench).
Conclusion: MindWatcher demonstrates that tool-integrated reasoning agents with interleaved thinking and multimodal CoT capabilities can effectively address broad-domain multimodal problems, achieving strong performance despite small model size through robust tool invocation and efficient training infrastructure.
Abstract: Traditional workflow-based agents exhibit limited intelligence when addressing real-world problems requiring tool invocation. Tool-integrated reasoning (TIR) agents capable of autonomous reasoning and tool invocation are rapidly emerging as a powerful approach for complex decision-making tasks involving multi-step interactions with external environments. In this work, we introduce MindWatcher, a TIR agent integrating interleaved thinking and multimodal chain-of-thought (CoT) reasoning. MindWatcher can autonomously decide whether and how to invoke diverse tools and coordinate their use, without relying on human prompts or workflows. The interleaved thinking paradigm enables the model to switch between thinking and tool calling at any intermediate stage, while its multimodal CoT capability allows manipulation of images during reasoning to yield more precise search results. We implement automated data auditing and evaluation pipelines, complemented by manually curated high-quality datasets for training, and we construct a benchmark, called MindWatcher-Evaluate Bench (MWE-Bench), to evaluate its performance. MindWatcher is equipped with a comprehensive suite of auxiliary reasoning tools, enabling it to address broad-domain multimodal problems. A large-scale, high-quality local image retrieval database, covering eight categories including cars, animals, and plants, endows the model with robust object recognition despite its small size. Finally, we design a more efficient training infrastructure for MindWatcher, enhancing training speed and hardware utilization. Experiments not only demonstrate that MindWatcher matches or exceeds the performance of larger or more recent models through superior tool invocation, but also uncover critical insights for agent training, such as the genetic inheritance phenomenon in agentic RL.
[351] Jenius Agent: Towards Experience-Driven Accuracy Optimization in Real-World Scenarios
Defei Xia, Bingfeng Pi, Shenbin Zhang, Song Hua, Yunfei Wei, Lei Zuo
Main category: cs.AI
TL;DR: Jenius-Agent framework improves LLM-based autonomous agents with adaptive prompts, context-aware tool orchestration, and layered memory, achieving 20% better task accuracy with reduced costs.
Details
Motivation: As LLM-powered agents advance, improving task performance in context understanding, tool usage, and response generation is critical. While prior studies have advanced overall agent design, systematic optimization of internal reasoning and tool-use pipelines remains underexplored.Method: Introduces Jenius-Agent framework with three key innovations: (1) adaptive prompt generation strategy aligned with agent state and task goals, (2) context-aware tool orchestration module with tool categorization, semantic retrieval, and adaptive invocation, (3) layered memory mechanism integrating session memory, task history, and external summaries with dynamic summarization and compression. Integrated with Model Context Protocol (MCP) tools, file I/O, and execution feedback.
Result: Experiments show 20% improvement in task accuracy, along with reduced token cost, response latency, and invocation failures. Framework is deployed in Jenius (https://www.jenius.cn).
Conclusion: Provides a lightweight and scalable solution for robust, protocol-compatible autonomous agents, demonstrating practical deployment value in real-world applications.
Abstract: As agent systems powered by large language models (LLMs) advance, improving the task performance of an autonomous agent, especially in context understanding, tool usage, and response generation, has become increasingly critical. Although prior studies have advanced the overall design of LLM-based agents, systematic optimization of their internal reasoning and tool-use pipelines remains underexplored. This paper introduces an agent framework grounded in real-world practical experience, with three key innovations: (1) an adaptive prompt generation strategy that aligns with the agent’s state and task goals to improve reliability and robustness; (2) a context-aware tool orchestration module that performs tool categorization, semantic retrieval, and adaptive invocation based on user intent and context; and (3) a layered memory mechanism that integrates session memory, task history, and external summaries to improve relevance and efficiency through dynamic summarization and compression. An end-to-end framework named Jenius-Agent has been integrated with three key optimizations, including tools based on the Model Context Protocol (MCP), file input/output (I/O), and execution feedback. The experiments show a 20 percent improvement in task accuracy, along with a reduced token cost, response latency, and invocation failures. The framework is already deployed in Jenius (https://www.jenius.cn), providing a lightweight and scalable solution for robust, protocol-compatible autonomous agents.
[352] FormuLLA: A Large Language Model Approach to Generating Novel 3D Printable Formulations
Adeshola Okubena, Yusuf Ali Mohammed, Moe Elbadawi
Main category: cs.AI
TL;DR: Researchers fine-tuned large language models on FDM 3D printing formulation data to recommend excipients and predict filament properties, finding that model selection and parameterization significantly impact performance, with smaller models showing catastrophic forgetting issues.
Details
Motivation: While AI has been integrated into pharmaceutical 3D printing, most efforts remain narrowly focused and fail to address broader formulation challenges. Recent advances in artificial general intelligence and large language models offer potential for more generalized, human-like reasoning in pharmaceutical formulation development.Method: Researchers fine-tuned four LLM architectures on a fused deposition modeling dataset containing over 1400 formulations. The models were trained to recommend suitable excipients based on API dose and predict filament mechanical properties. Systematic evaluation of both fine-tuning and generative parameter configurations was conducted.
Result: Llama2 performed best for recommending excipients for FDM formulations. Model selection and parameterization significantly influenced performance, with smaller LLMs exhibiting catastrophic forgetting. Key findings: (i) even with 1400+ formulations, models can experience catastrophic forgetting; (ii) standard LLM metrics only evaluate linguistic performance, not formulation processability; (iii) LLMs trained on biomedical data don’t always produce the best results.
Conclusion: Addressing challenges like catastrophic forgetting and developing metrics that evaluate formulation processability (not just linguistic performance) is essential to advance LLMs beyond linguistic proficiency toward reliable systems for pharmaceutical formulation development.
Abstract: Pharmaceutical three-dimensional (3D) printing is an advanced fabrication technology with the potential to enable truly personalised dosage forms. Recent studies have integrated artificial intelligence (AI) to accelerate formulation and process development, drastically transforming current approaches to pharmaceutical 3D printing. To date, most AI-driven efforts remain narrowly focused, while failing to account for the broader formulation challenges inherent to the technology. Recent advances in AI have introduced artificial general intelligence concepts, wherein systems extend beyond conventional predictive modelling toward more generalised, human-like reasoning. In this work, we investigate the application of large language models (LLMs), fine-tuned on a fused deposition modelling (FDM) dataset comprising over 1400 formulations, to recommend suitable excipients based on active pharmaceutical ingredient (API) dose, and predict filament mechanical properties. Four LLM architectures were fine-tuned, with systematic evaluation of both fine-tuning and generative parameter configurations. Our results demonstrate that Llama2 was best suited for recommending excipients for FDM formulations. Additionally, model selection and parameterisation significantly influence performance, with smaller LLMs exhibiting instances of catastrophic forgetting. Furthermore, we demonstrate: (i) even a relatively small dataset of over 1400 formulations can lead to catastrophic forgetting; (ii) standard LLM metrics only evaluate linguistic performance but not formulation processability; and (iii) LLMs trained on biomedically-related data do not always produce the best results. Addressing these challenges is essential to advancing LLMs beyond linguistic proficiency and toward reliable systems for pharmaceutical formulation development.
[353] HAL: Inducing Human-likeness in LLMs with Alignment
Masum Hasan, Junjie Zhao, Ehsan Hoque
Main category: cs.AI
TL;DR: HAL framework aligns LLMs to conversational human-likeness using interpretable, data-driven rewards from contrastive dialogue data, enabling targeted alignment without affecting overall model performance.
Details
Motivation: Conversational human-likeness is crucial for human-AI interaction but has been difficult to define, measure, and optimize. Current improvements rely on scale or broad supervised training rather than targeted alignment.Method: HAL derives explicit conversational traits from contrastive dialogue data, combines them into a compact scalar score, and uses this as a transparent reward signal for alignment with standard preference optimization methods.
Result: Models aligned with HAL are more frequently perceived as human-like in conversation in large-scale human evaluations, without affecting overall performance. The framework enables inspection of alignment behavior and diagnosis of unintended effects.
Conclusion: HAL demonstrates how soft, qualitative properties of language can be made measurable and aligned in an interpretable and explainable way, expanding the scope of alignment beyond traditional approaches.
Abstract: Conversational human-likeness plays a central role in human-AI interaction, yet it has remained difficult to define, measure, and optimize. As a result, improvements in human-like behavior are largely driven by scale or broad supervised training, rather than targeted alignment. We introduce Human Aligning LLMs (HAL), a framework for aligning language models to conversational human-likeness using an interpretable, data-driven reward. HAL derives explicit conversational traits from contrastive dialogue data, combines them into a compact scalar score, and uses this score as a transparent reward signal for alignment with standard preference optimization methods. Using this approach, we align models of varying sizes without affecting their overall performance. In large-scale human evaluations, models aligned with HAL are more frequently perceived as human-like in conversation. Because HAL operates over explicit, interpretable traits, it enables inspection of alignment behavior and diagnosis of unintended effects. More broadly, HAL demonstrates how soft, qualitative properties of language, previously outside the scope of alignment, can be made measurable and aligned in an interpretable and explainable way.
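HAL's reward is described as explicit traits combined into a compact scalar. The sketch below shows that shape with invented trait names, keyword-based detectors, and arbitrary weights; in HAL the traits and their combination are derived from contrastive dialogue data rather than hand-written rules.

```python
# Toy trait-based human-likeness reward; traits, detectors, and weights
# are illustrative assumptions, not HAL's learned components.
TRAIT_WEIGHTS = {
    "backchanneling": 0.2,     # short acknowledgements ("yeah", "right")
    "self_disclosure": 0.3,    # shares opinions or experiences
    "informal_register": 0.2,  # contractions, colloquialisms
    "turn_brevity": 0.3,       # avoids walls of text
}

def trait_scores(response: str) -> dict:
    """Placeholder keyword detectors standing in for learned trait scorers."""
    lowered = response.lower()
    return {
        "backchanneling": float(lowered.startswith(("yeah", "right"))),
        "self_disclosure": float("i think" in lowered),
        "informal_register": float("'" in response),
        "turn_brevity": float(len(response.split()) < 40),
    }

def human_likeness_reward(response: str) -> float:
    """Compact scalar reward: weighted sum of per-trait scores."""
    scores = trait_scores(response)
    return sum(TRAIT_WEIGHTS[t] * scores[t] for t in TRAIT_WEIGHTS)

print(human_likeness_reward("Yeah, I think that's fair, honestly."))  # 1.0
```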
cs.SD
[354] Investigation into respiratory sound classification for an imbalanced data set using hybrid LSTM-KAN architectures
Nithinkumar K., Anand R
Main category: cs.SD
TL;DR: Hybrid LSTM-KAN model with imbalance mitigation achieves 94.6% accuracy for respiratory sound classification on highly imbalanced dataset.
Details
Motivation: Automated respiratory sound classification faces challenges due to subtle acoustic differences and severe class imbalance in clinical datasets, which hinders minority class recognition.
Method: Proposes a hybrid deep learning model combining an LSTM for sequential feature encoding with a Kolmogorov-Arnold Network (KAN) for classification, integrated with a feature extraction pipeline and imbalance mitigation strategies including focal loss, class-specific data augmentation, and SMOTE.
Result: Achieves 94.6% overall accuracy and 0.703 macro-averaged F1 score on public respiratory sound database with six classes (COPD dominant at 86+%), with improved minority class detection compared to baselines.
Conclusion: The hybrid LSTM-KAN architecture with targeted imbalance mitigation effectively addresses class imbalance in respiratory sound classification, demonstrating superior performance for minority classes while maintaining high overall accuracy.
Abstract: Respiratory sounds captured via auscultation contain critical clues for diagnosing pulmonary conditions. Automated classification of these sounds faces challenges due to subtle acoustic differences and severe class imbalance in clinical datasets. This study investigates respiratory sound classification with a focus on mitigating pronounced class imbalance. We propose a hybrid deep learning model that combines a Long Short-Term Memory (LSTM) network for sequential feature encoding with a Kolmogorov-Arnold Network (KAN) for classification. The model is integrated with a comprehensive feature extraction pipeline and targeted imbalance mitigation strategies. Experiments were conducted on a public respiratory sound database comprising six classes with a highly skewed distribution. Techniques such as focal loss, class-specific data augmentation, and Synthetic Minority Over-sampling Technique (SMOTE) were employed to enhance minority class recognition. The proposed Hybrid LSTM-KAN model achieves an overall accuracy of 94.6 percent and a macro-averaged F1 score of 0.703, despite the dominant COPD class accounting for over 86 percent of the data. Improved detection performance is observed for minority classes compared to baseline approaches, demonstrating the effectiveness of the proposed architecture for imbalanced respiratory sound classification.
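Focal loss, one of the mitigation strategies above, has a standard multi-class form; a minimal PyTorch sketch (the gamma value and optional class weights are illustrative defaults, not the paper's settings):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=None):
    """Multi-class focal loss: down-weights easy, well-classified examples
    so that minority classes contribute more to the gradient."""
    log_p = F.log_softmax(logits, dim=-1)
    log_pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)  # log p of true class
    pt = log_pt.exp()
    loss = -((1.0 - pt) ** gamma) * log_pt
    if alpha is not None:                    # optional per-class weight vector
        loss = alpha[targets] * loss
    return loss.mean()

logits = torch.randn(8, 6)                   # e.g., six respiratory classes
targets = torch.randint(0, 6, (8,))
print(focal_loss(logits, targets))
```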
[355] Domain Adaptation of the Pyannote Diarization Pipeline for Conversational Indonesian Audio
Muhammad Daffa’i Rafi Prasetyo, Ramadhan Andika Putra, Zaidan Naufal Ilmi, Kurniawati Azizah
Main category: cs.SD
TL;DR: Domain adaptation for speaker diarization in Indonesian using synthetic TTS data improves performance from 53.47% to 29.24% DER.
Details
Motivation: Adapt English-centric speaker diarization pipelines to the low-resource Indonesian language, addressing the challenge of limited training data for conversational audio.
Method: Employs synthetic data generation using neural Text-to-Speech technology, with experiments on a small dataset (171 samples) and a large dataset (25 hours of synthetic speech). Domain adaptation of the baseline pyannote/segmentation-3.0 model trained on the AMI Corpus.
Result: Baseline achieves 53.47% DER zero-shot on Indonesian. Small dataset reduces DER to 34.31% (1 epoch) and 34.81% (2 epochs). Large 25-hour synthetic dataset achieves best performance: 29.24% DER (13.68% absolute improvement), 99.06% Recall, 87.14% F1-Score.
Conclusion: Domain adaptation with synthetic TTS data effectively improves speaker diarization for low-resource languages, with larger synthetic datasets yielding better performance while maintaining high recall and F1-scores.
Abstract: This study presents a domain adaptation approach for speaker diarization targeting conversational Indonesian audio. We address the challenge of adapting an English-centric diarization pipeline to a low-resource language by employing synthetic data generation using neural Text-to-Speech technology. Experiments were conducted with varying training configurations: a small dataset (171 samples) and a large dataset containing 25 hours of synthetic speech. Results demonstrate that the baseline pyannote/segmentation-3.0 model, trained on the AMI Corpus, achieves a Diarization Error Rate (DER) of 53.47% when applied zero-shot to Indonesian. Domain adaptation significantly improves performance, with the small dataset models reducing DER to 34.31% (1 epoch) and 34.81% (2 epochs). The model trained on the 25-hour dataset achieves the best performance with a DER of 29.24%, representing a 13.68% absolute improvement over the baseline while maintaining 99.06% Recall and 87.14% F1-Score.
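DER, the headline metric here, can be computed with pyannote's own metrics package; a toy sketch (segment boundaries and speaker labels are made up):

```python
from pyannote.core import Annotation, Segment
from pyannote.metrics.diarization import DiarizationErrorRate

# Toy reference and hypothesis for one two-speaker recording.
reference = Annotation()
reference[Segment(0.0, 5.0)] = "spk_a"
reference[Segment(5.0, 9.0)] = "spk_b"

hypothesis = Annotation()
hypothesis[Segment(0.0, 4.5)] = "s1"
hypothesis[Segment(4.5, 9.0)] = "s2"

metric = DiarizationErrorRate()
der = metric(reference, hypothesis)   # fraction; multiply by 100 for percent
print(f"DER = {100 * der:.2f}%")
```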
[356] IndexTTS 2.5 Technical Report
Yunpei Li, Xun Zhou, Jinchao Wang, Lu Wang, Yong Wu, Siyi Zhou, Yiquan Zhou, Jingchen Shu
Main category: cs.SD
TL;DR: IndexTTS 2.5 enhances the zero-shot TTS foundation model with multilingual support, faster inference, and better quality through semantic compression, architectural upgrades, cross-lingual strategies, and RL optimization.
Details
Motivation: To improve upon IndexTTS 2 by expanding multilingual coverage, increasing inference speed, and enhancing overall synthesis quality while maintaining zero-shot emotional TTS capabilities.
Method: Four key improvements: 1) Semantic codec compression (50Hz→25Hz), 2) Architectural upgrade (U-DiT→Zipformer), 3) Multilingual extension with cross-lingual strategies, 4) Reinforcement learning optimization (GRPO) for T2S module.
Result: Achieves 2.28× RTF improvement while maintaining comparable WER and speaker similarity; supports Chinese, English, Japanese, Spanish; enables emotion transfer without target-language emotional training data.
Conclusion: IndexTTS 2.5 successfully enhances multilingual coverage, inference speed, and synthesis quality while preserving zero-shot emotional TTS capabilities across multiple languages.
Abstract: In prior work, we introduced IndexTTS 2, a zero-shot neural text-to-speech foundation model comprising two core components: a transformer-based Text-to-Semantic (T2S) module and a non-autoregressive Semantic-to-Mel (S2M) module, which together enable faithful emotion replication and establish the first autoregressive duration-controllable generative paradigm. Building upon this, we present IndexTTS 2.5, which significantly enhances multilingual coverage, inference speed, and overall synthesis quality through four key improvements: 1) Semantic Codec Compression: we reduce the semantic codec frame rate from 50 Hz to 25 Hz, halving sequence length and substantially lowering both training and inference costs; 2) Architectural Upgrade: we replace the U-DiT-based backbone of the S2M module with a more efficient Zipformer-based modeling architecture, achieving notable parameter reduction and faster mel-spectrogram generation; 3) Multilingual Extension: We propose three explicit cross-lingual modeling strategies, boundary-aware alignment, token-level concatenation, and instruction-guided generation, establishing practical design principles for zero-shot multilingual emotional TTS that supports Chinese, English, Japanese, and Spanish, and enables robust emotion transfer even without target-language emotional training data; 4) Reinforcement Learning Optimization: we apply GRPO in post-training of the T2S module, improving pronunciation accuracy and naturalness. Experiments show that IndexTTS 2.5 not only supports broader language coverage but also replicates emotional prosody in unseen languages under the same zero-shot setting. IndexTTS 2.5 achieves a 2.28 times improvement in RTF while maintaining comparable WER and speaker similarity to IndexTTS 2.
[357] Lightweight and perceptually-guided voice conversion for electro-laryngeal speech
Benedikt Mayrhofer, Franz Pernkopf, Philipp Aichinger, Martin Hagmüller
Main category: cs.SD
TL;DR: Lightweight StreamVC adaptation for electro-laryngeal speech rehabilitation improves naturalness and intelligibility using self-supervised pretraining with perceptual/intelligibility losses.
Details
Motivation: Electro-laryngeal speech suffers from constant pitch, limited prosody, and mechanical noise, which reduces naturalness and intelligibility, creating a need for voice rehabilitation solutions.
Method: Adapts the StreamVC framework by removing pitch/energy modules, combining self-supervised pretraining with supervised fine-tuning on parallel EL/healthy speech data using perceptual and intelligibility losses.
Result: Best model (WavLM features + human-feedback predictions) drastically reduces CER, raises nMOS from 1.1 to 3.3, and narrows gap to healthy ground-truth speech across all metrics.
Conclusion: Lightweight voice conversion architectures can be adapted for EL voice rehabilitation, with prosody generation and intelligibility improvements identified as remaining bottlenecks.
Abstract: Electro-laryngeal (EL) speech is characterized by constant pitch, limited prosody, and mechanical noise, reducing naturalness and intelligibility. We propose a lightweight adaptation of the state-of-the-art StreamVC framework to this setting by removing pitch and energy modules and combining self-supervised pretraining with supervised fine-tuning on parallel EL and healthy (HE) speech data, guided by perceptual and intelligibility losses. Objective and subjective evaluations across different loss configurations confirm their influence: the best model variant, based on WavLM features and human-feedback predictions (+WavLM+HF), drastically reduces character error rate (CER) of EL inputs, raises naturalness mean opinion score (nMOS) from 1.1 to 3.3, and consistently narrows the gap to HE ground-truth speech in all evaluated metrics. These findings demonstrate the feasibility of adapting lightweight voice conversion architectures to EL voice rehabilitation while also identifying prosody generation and intelligibility improvements as the main remaining bottlenecks.
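The best variant relies on WavLM features; a minimal sketch of extracting such features with Hugging Face transformers (the checkpoint choice and the plain L1 distance are assumptions, not the paper's exact perceptual loss):

```python
import torch
from transformers import AutoFeatureExtractor, WavLMModel

extractor = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-base")
wavlm = WavLMModel.from_pretrained("microsoft/wavlm-base").eval()

def wavlm_features(waveform_16k: torch.Tensor) -> torch.Tensor:
    """Return frame-level WavLM hidden states for a mono 16 kHz waveform."""
    inputs = extractor(waveform_16k.numpy(), sampling_rate=16000,
                       return_tensors="pt")
    with torch.no_grad():
        return wavlm(inputs.input_values).last_hidden_state

# A perceptual-style distance between converted and healthy reference speech
# (evaluation-style sketch; a training loss would keep gradients flowing
# through the converted branch).
converted, reference = torch.randn(16000), torch.randn(16000)
loss = torch.nn.functional.l1_loss(wavlm_features(converted),
                                   wavlm_features(reference))
print(loss)
```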
[358] Muse: Towards Reproducible Long-Form Song Generation with Fine-Grained Style Control
Changhao Jiang, Jiahao Chen, Zhenghao Xiang, Zhixiong Yang, Hanchen Wang, Jiabao Zhuang, Xinmeng Che, Jiajun Sun, Hui Li, Yifei Cao, Shihan Dou, Ming Zhang, Junjie Ye, Tao Ji, Tao Gui, Qi Zhang, Xuanjing Huang
Main category: cs.SD
TL;DR: Open-source system for long-form song generation with style conditioning, including licensed synthetic dataset, training pipelines, and Muse model that achieves competitive performance despite modest scale.
Details
Motivation: Academic research in long-form song generation lags behind commercial systems like Suno due to a lack of publicly available training data and non-reproducible research, hindering fair comparison and progress.
Method: Releases a fully open-source system with: 1) a dataset of 116k licensed synthetic songs with auto-generated lyrics and style descriptions paired with SunoV5 audio; 2) Muse, a model trained via single-stage supervised finetuning of a Qwen-based language model extended with discrete audio tokens using MuCodec, without task-specific losses or additional components.
Result: Muse achieves competitive performance on phoneme error rate, text-music style similarity, and audio aesthetic quality despite modest data scale and model size. Enables controllable segment-level generation across different musical structures.
Conclusion: The fully open-source release (data, model weights, pipelines) paves the way for continued progress in controllable long-form song generation research by enabling reproducible academic work and fair comparisons.
Abstract: Recent commercial systems such as Suno demonstrate strong capabilities in long-form song generation, while academic research remains largely non-reproducible due to the lack of publicly available training data, hindering fair comparison and progress. To this end, we release a fully open-source system for long-form song generation with fine-grained style conditioning, including a licensed synthetic dataset, training and evaluation pipelines, and Muse, an easy-to-deploy song generation model. The dataset consists of 116k fully licensed synthetic songs with automatically generated lyrics and style descriptions paired with audio synthesized by SunoV5. We train Muse via single-stage supervised finetuning of a Qwen-based language model extended with discrete audio tokens using MuCodec, without task-specific losses, auxiliary objectives, or additional architectural components. Our evaluations find that although Muse is trained with a modest data scale and model size, it achieves competitive performance on phoneme error rate, text–music style similarity, and audio aesthetic quality, while enabling controllable segment-level generation across different musical structures. All data, model weights, and training and evaluation pipelines will be publicly released, paving the way for continued progress in controllable long-form song generation research. The project repository is available at https://github.com/yuhui1038/Muse.
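Extending a text LM with discrete audio tokens, as Muse does, amounts to growing the vocabulary and embedding table; a minimal Hugging Face sketch (the checkpoint and codebook size are placeholders, and MuCodec itself is not shown):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B")

# Add one special token per discrete audio codec entry (codebook size assumed).
codebook_size = 1024
audio_tokens = [f"<audio_{i}>" for i in range(codebook_size)]
tokenizer.add_tokens(audio_tokens)
model.resize_token_embeddings(len(tokenizer))

# Lyrics/style text and codec indices can now share one token sequence,
# trained with the ordinary next-token cross-entropy objective.
```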
[359] BENYO-S2ST-Corpus-1: A Bilingual English-to-Yoruba Direct Speech-to-Speech Translation Corpus
Emmanuel Adetiba, Abdultaofeek Abayomi, Raymond J. Kala, Ayodele H. Ifijeh, Oluwatobi E. Dare, Olabode Idowu-Bismark, Gabriel O. Sobola, Joy N. Adetiba, Monsurat Adepeju Lateef
Main category: cs.SD
TL;DR: Created BENYO-S2ST-Corpus-1, a large English-to-Yoruba speech-to-speech translation dataset using hybrid architecture with audio augmentation, and built YoruTTS-1.5 model as proof of concept.
Details
Motivation: Address the major shortage of S2ST datasets for high-to-low resource language pairs like English-to-Yoruba, and bridge digital divides in translation between high- and low-resource African languages.
Method: Developed a hybrid architecture for large-scale direct S2ST corpus creation: leveraged the YORULECT Corpus’s Yoruba audios/transcripts, generated English audios using Facebook MMS models, and created the AcoustAug audio augmentation algorithm based on three latent acoustic features.
Result: Created BENYO-S2ST-Corpus-1 with 12,032 audio samples per language (24,064 total samples, 41.20 hours). Built YoruTTS-1.5 model achieving F0 RMSE of 63.54 after 1,000 epochs, showing moderate fundamental pitch similarity with reference audio.
Conclusion: The corpus architecture enables curation of multilingual high-to-low-resource African language datasets, bridging digital divides. BENYO-S2ST-Corpus-1 and YoruTTS-1.5 are publicly available for research and development.
Abstract: There is a major shortage of Speech-to-Speech Translation (S2ST) datasets for high resource-to-low resource language pairs such as English-to-Yoruba. Thus, in this study, we curated the Bilingual English-to-Yoruba Speech-to-Speech Translation Corpus Version 1 (BENYO-S2ST-Corpus-1). The corpus is based on a hybrid architecture we developed for large-scale direct S2ST corpus creation at reduced cost. To achieve this, we leveraged non speech-to-speech Standard Yoruba (SY) real-time audios and transcripts in the YORULECT Corpus as well as the corresponding Standard English (SE) transcripts. The YORULECT Corpus is small-scale (1,504 samples) and does not include paired English audios. Therefore, we generated the SE audios using pre-trained AI models (i.e., Facebook MMS). We also developed an audio augmentation algorithm named AcoustAug, based on three latent acoustic features, to generate augmented audios from the raw audios of the two languages. BENYO-S2ST-Corpus-1 has 12,032 audio samples per language, for a total of 24,064 samples. The total audio duration for the two languages is 41.20 hours, a substantial size for this language pair. Beyond building S2ST models, BENYO-S2ST-Corpus-1 can be used to build pretrained models or improve existing ones. The created corpus and the Coqui framework were used to build a pretrained Yoruba TTS model (named YoruTTS-1.5) as a proof of concept. YoruTTS-1.5 gave an F0 RMSE value of 63.54 after 1,000 epochs, which indicates moderate fundamental pitch similarity with the reference real-time audio. Ultimately, the corpus architecture in this study can be leveraged by researchers and developers to curate datasets for multilingual high-resource-to-low-resource African languages. This will bridge the huge digital divides in translations among high- and low-resource language pairs. BENYO-S2ST-Corpus-1 and YoruTTS-1.5 are publicly available at (https://bit.ly/40bGMwi).
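The summary does not specify AcoustAug's three latent acoustic features; as a generic stand-in, a minimal waveform-augmentation sketch with librosa (pitch, tempo, and noise settings are illustrative, and this is not the authors' algorithm):

```python
import numpy as np
import librosa

def augment(y: np.ndarray, sr: int, n_steps: float = 2.0,
            rate: float = 1.1, noise_db: float = -30.0) -> np.ndarray:
    """Generate one augmented copy via pitch shift, time stretch, and noise."""
    y = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)
    y = librosa.effects.time_stretch(y, rate=rate)
    noise = np.random.randn(len(y)) * 10 ** (noise_db / 20)
    return (y + noise).astype(np.float32)

sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
y = 0.5 * np.sin(2 * np.pi * 220 * t).astype(np.float32)  # stand-in waveform
augmented = augment(y, sr)
```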
[360] DiFlow-TTS: Compact and Low-Latency Zero-Shot Text-to-Speech with Factorized Discrete Flow Matching
Ngoc-Son Nguyen, Thanh V. T. Tran, Hieu-Nghia Huynh-Nguyen, Truong-Son Hy, Van Nguyen
Main category: cs.SD
TL;DR: DiFlow-TTS is a zero-shot TTS system using discrete flow matching for generative speech modeling, achieving strong performance with compact model size (11.7× smaller) and fast inference (34× faster) compared to SOTA baselines.
Details
Motivation: To develop an efficient zero-shot text-to-speech system that can generate natural, expressive speech with accurate speaker identity while maintaining low inference latency, serving as an entry point for further research in discrete flow matching for speech synthesis.
Method: Uses factorized speech representations with a deterministic Phoneme-Content Mapper for linguistic content modeling, and a Factorized Discrete Flow Denoiser that jointly models multiple discrete token streams for prosody and acoustics to capture expressive speech attributes.
Result: Achieves strong performance across multiple metrics while maintaining compact model size (up to 11.7× smaller) and enabling low-latency inference (up to 34× faster than recent SOTA baselines).
Conclusion: DiFlow-TTS demonstrates the effectiveness of discrete flow matching for zero-shot TTS, offering a promising research direction with practical advantages in model efficiency and inference speed.
Abstract: This paper introduces DiFlow-TTS, a novel zero-shot text-to-speech (TTS) system that employs discrete flow matching for generative speech modeling. We position this work as an entry point that may facilitate further advances in this research direction. Through extensive empirical evaluation, we analyze both the strengths and limitations of this approach across key aspects, including naturalness, expressive attributes, speaker identity, and inference latency. To this end, we leverage factorized speech representations and design a deterministic Phoneme-Content Mapper for modeling linguistic content, together with a Factorized Discrete Flow Denoiser that jointly models multiple discrete token streams corresponding to prosody and acoustics to capture expressive speech attributes. Experimental results demonstrate that DiFlow-TTS achieves strong performance across multiple metrics while maintaining a compact model size, up to 11.7 times smaller, and enabling low-latency inference that is up to 34 times faster than recent state-of-the-art baselines. Audio samples are available on our demo page: https://diflow-tts.github.io.
[361] Zero-Day Audio DeepFake Detection via Retrieval Augmentation and Profile Matching
Xuechen Liu, Xin Wang, Junichi Yamagishi
Main category: cs.SD
TL;DR: Training-free retrieval-augmented framework for zero-day audio deepfake detection using knowledge representations and voice profile matching, achieving performance comparable to supervised baselines without additional training.
Details
Motivation: Current audio deepfake detectors struggle with zero-day attacks from novel synthesis methods not seen in training data. Fine-tuning approaches are problematic when a prompt response is needed.
Method: Proposes a training-free retrieval-augmented framework leveraging knowledge representations and voice profile matching. Includes simple yet effective retrieval and ensemble methods, and introduces training-free fusion strategies for cross-database generalization.
Result: Achieves performance comparable to supervised baselines and their fine-tuned counterparts on DeepFake-Eval-2024 benchmark without any additional model training. Demonstrates cross-database generalizability through ablation studies on voice profile attributes.
Conclusion: The framework provides an effective training-free solution for zero-day audio deepfake detection that matches supervised approaches while offering better adaptability to novel attacks without requiring model retraining.
Abstract: Modern audio deepfake detectors built on foundation models and large training datasets achieve promising detection performance. However, they struggle with zero-day attacks, where the audio samples are generated by novel synthesis methods that models have not seen in their training data. Conventional approaches fine-tune the detector, which can be problematic when a prompt response is needed. This paper proposes a training-free retrieval-augmented framework for zero-day audio deepfake detection that leverages knowledge representations and voice profile matching. Within this framework, we propose simple yet effective retrieval and ensemble methods that reach performance comparable to supervised baselines and their fine-tuned counterparts on the DeepFake-Eval-2024 benchmark, without any additional model training. We also conduct ablations on voice profile attributes and demonstrate the cross-database generalizability of the framework by introducing simple, training-free fusion strategies.
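A minimal sketch of the retrieval idea: score a query utterance by the labels of its nearest neighbours in a bank of embedded bona fide and spoofed examples (the embedding source and k are assumptions):

```python
import numpy as np

def cosine_knn_score(query_emb, bank_embs, bank_labels, k=5):
    """Training-free detection: fraction of the k nearest bank entries
    (by cosine similarity) labeled spoof (1) vs bona fide (0)."""
    q = query_emb / np.linalg.norm(query_emb)
    b = bank_embs / np.linalg.norm(bank_embs, axis=1, keepdims=True)
    sims = b @ q
    top_k = np.argsort(sims)[-k:]
    return bank_labels[top_k].mean()   # > 0.5 => flag as deepfake

rng = np.random.default_rng(0)
bank = rng.normal(size=(100, 192))     # e.g., speaker/profile embeddings
labels = rng.integers(0, 2, size=100)
print(cosine_knn_score(rng.normal(size=192), bank, labels))
```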
cs.LG
[362] Lightweight Transformer Architectures for Edge Devices in Real-Time Applications
Hema Hariharan Samson
Main category: cs.LG
TL;DR: Survey of lightweight transformer architectures for edge deployment, covering compression techniques, performance benchmarks, and practical deployment strategies.
Details
Motivation: Enable real-time AI applications on resource-constrained edge devices by overcoming the challenges of deploying transformer models under limited computational power, memory, and energy.
Method: A comprehensive survey analyzing model compression techniques (quantization, pruning, knowledge distillation), reviewing lightweight transformer variants (MobileBERT, TinyBERT, DistilBERT, etc.), benchmarking performance across standard datasets, and examining deployment frameworks and hardware platforms.
Result: Modern lightweight transformers achieve 75-96% of full-model accuracy with 4-10x size reduction and 3-9x latency improvement, enabling deployment on 2-5W devices. Identified optimal strategies include sparse attention, mixed-precision quantization, and hardware-aware NAS.
Conclusion: Lightweight transformers are viable for edge deployment with proper optimization strategies. Established performance boundaries and provided practical 6-step deployment pipeline achieving 8-12x size reduction with minimal accuracy loss.
Abstract: The deployment of transformer-based models on resource-constrained edge devices represents a critical challenge in enabling real-time artificial intelligence applications. This comprehensive survey examines lightweight transformer architectures specifically designed for edge deployment, analyzing recent advances in model compression, quantization, pruning, and knowledge distillation techniques. We systematically review prominent lightweight variants including MobileBERT, TinyBERT, DistilBERT, EfficientFormer, EdgeFormer, and MobileViT, providing detailed performance benchmarks on standard datasets such as GLUE, SQuAD, ImageNet-1K, and COCO. Our analysis encompasses current industry adoption patterns across major hardware platforms (NVIDIA Jetson, Qualcomm Snapdragon, Apple Neural Engine, ARM architectures), deployment frameworks (TensorFlow Lite, ONNX Runtime, PyTorch Mobile, CoreML), and optimization strategies. Experimental results demonstrate that modern lightweight transformers can achieve 75-96% of full-model accuracy while reducing model size by 4-10x and inference latency by 3-9x, enabling deployment on devices with as little as 2-5W power consumption. We identify sparse attention mechanisms, mixed-precision quantization (INT8/FP16), and hardware-aware neural architecture search as the most effective optimization strategies. Novel findings include memory-bandwidth bottleneck analysis revealing 15-40M parameter models achieve optimal hardware utilization (60-75% efficiency), quantization sweet spots for different model types, and comprehensive energy efficiency profiling across edge platforms. We establish real-time performance boundaries and provide a practical 6-step deployment pipeline achieving 8-12x size reduction with less than 2% accuracy degradation.
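INT8 quantization is one of the levers the survey highlights; for Linear-heavy transformers, PyTorch's post-training dynamic quantization is nearly a one-liner (the checkpoint is illustrative):

```python
import os
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english")

# Post-training dynamic quantization: Linear weights stored and computed
# in int8, activations quantized on the fly at inference time.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8)

# Rough on-disk size comparison.
torch.save(model.state_dict(), "fp32.pt")
torch.save(quantized.state_dict(), "int8.pt")
print(os.path.getsize("fp32.pt") / 1e6, "MB ->",
      os.path.getsize("int8.pt") / 1e6, "MB")
```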
[363] Why LLMs Aren’t Scientists Yet: Lessons from Four Autonomous Research Attempts
Dhruv Trehan, Paras Chopra
Main category: cs.LG
TL;DR: Four attempts to autonomously generate ML research papers using LLM agents; three failed, one succeeded and was accepted to an experimental AI-first-author conference, revealing six common failure modes in AI scientific systems.
Details
Motivation: To explore the feasibility and challenges of autonomous scientific paper generation using LLM agents, identifying failure patterns in AI-driven scientific workflows.
Method: A case study of four end-to-end attempts using a pipeline of six LLM agents mapped to scientific workflow stages, with analysis of failure modes and successful pipeline completion.
Result: Three attempts failed during implementation/evaluation; one completed pipeline and was accepted to Agents4Science 2025 (experimental AI-first-author venue), passing human and multi-AI review. Identified six recurring failure modes.
Conclusion: AI-scientist systems face significant challenges including six documented failure modes; paper proposes four design principles for more robust systems and discusses implications for autonomous scientific discovery, with all artifacts released publicly.
Abstract: We report a case study of four end-to-end attempts to autonomously generate ML research papers using a pipeline of six LLM agents mapped to stages of the scientific workflow. Of these four, three attempts failed during implementation or evaluation. One completed the pipeline and was accepted to Agents4Science 2025, an experimental inaugural venue that required AI systems as first authors, passing both human and multi-AI review. From these attempts, we document six recurring failure modes: bias toward training data defaults, implementation drift under execution pressure, memory and context degradation across long-horizon tasks, overexcitement that declares success despite obvious failures, insufficient domain intelligence, and weak scientific taste in experimental design. We conclude by discussing four design principles for more robust AI-scientist systems, implications for autonomous scientific discovery, and we release all prompts, artifacts, and outputs at https://github.com/Lossfunk/ai-scientist-artefacts-v1
[364] Ratio-Variance Regularized Policy Optimization for Efficient LLM Fine-tuning
Yu Luo, Shuo Han, Yihan Hu, Dong Li, Jianye Hao
Main category: cs.LG
TL;DR: R²VPO: A novel RL method for LLM fine-tuning that replaces hard policy ratio clipping with variance constraints, enabling better gradient preservation and off-policy data reuse for improved stability and sample efficiency.
Details
Motivation: Current on-policy RL methods (PPO, GRPO) use hard clipping that indiscriminately truncates gradients from valuable high-return actions ("eureka moments") and discards stale data, leading to poor sample efficiency and suppression of rare but informative reasoning patterns.
Method: Proposes R²VPO (Ratio-Variance Regularized Policy Optimization), a primal-dual framework that constrains the variance (second central moment) of policy ratios instead of hard clipping. This provides smooth relaxation, preserves gradient signals, and enables principled off-policy data reuse through dynamic reweighting of stale samples.
Result: R²VPO achieves superior asymptotic performance with average relative gains up to 17% over clipping-based baselines, requires ~50% fewer rollouts to reach convergence, and shows consistent improvements across mathematical reasoning benchmarks on models like DeepSeek-Distill-Qwen-1.5B and openPangu-Embedded series (1B and 7B).
Conclusion: Ratio-variance control represents a promising direction for improving both stability and data efficiency in RL-based LLM alignment, addressing fundamental limitations of hard clipping while enabling better utilization of valuable training signals.
Abstract: On-policy reinforcement learning (RL), particularly Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO), has become the dominant paradigm for fine-tuning large language models (LLMs). While policy ratio clipping stabilizes training, this heuristic hard constraint incurs a fundamental cost: it indiscriminately truncates gradients from high-return yet high-divergence actions, suppressing rare but highly informative “eureka moments” in complex reasoning. Moreover, once data becomes slightly stale, hard clipping renders it unusable, leading to severe sample inefficiency. In this work, we revisit the trust-region objective in policy optimization and show that explicitly constraining the variance (second central moment) of the policy ratio provides a principled and smooth relaxation of hard clipping. This distributional constraint stabilizes policy updates while preserving gradient signals from valuable trajectories. Building on this insight, we propose R²VPO (Ratio-Variance Regularized Policy Optimization), a novel primal-dual framework that supports stable on-policy learning and enables principled off-policy data reuse by dynamically reweighting stale samples rather than discarding them. We extensively evaluate R²VPO on fine-tuning state-of-the-art LLMs, including DeepSeek-Distill-Qwen-1.5B and the openPangu-Embedded series (1B and 7B), across challenging mathematical reasoning benchmarks. Experimental results show that R²VPO consistently achieves superior asymptotic performance, with average relative gains of up to 17% over strong clipping-based baselines, while requiring approximately 50% fewer rollouts to reach convergence. These findings establish ratio-variance control as a promising direction for improving both stability and data efficiency in RL-based LLM alignment.
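A minimal sketch contrasting PPO-style hard clipping with a ratio-variance penalty of the kind the paper argues for (the penalty coefficient and exact objective are guesses from the summary, not the published algorithm):

```python
import torch

def ppo_clip_loss(logp_new, logp_old, adv, eps=0.2):
    """Standard PPO surrogate: the clip truncates gradients whenever the
    ratio leaves [1 - eps, 1 + eps], even for high-advantage actions."""
    ratio = (logp_new - logp_old).exp()
    return -torch.min(ratio * adv,
                      ratio.clamp(1 - eps, 1 + eps) * adv).mean()

def ratio_variance_loss(logp_new, logp_old, adv, lam=1.0):
    """No hard clipping: keep gradients from high-return, high-divergence
    actions, but penalize the variance (second central moment) of the
    policy ratio to keep updates inside a soft trust region."""
    ratio = (logp_new - logp_old).exp()
    pg = -(ratio * adv).mean()
    var_penalty = ((ratio - ratio.mean()) ** 2).mean()
    return pg + lam * var_penalty
```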
[365] Aligning Findings with Diagnosis: A Self-Consistent Reinforcement Learning Framework for Trustworthy Radiology Reporting
Kun Zhao, Siyuan Dai, Pan Wang, Jifeng Song, Hui Ji, Chenghua Lin, Liang Zhan, Haoteng Tang
Main category: cs.LG
TL;DR: The paper proposes a “Reason-then-Summarize” framework with Group Relative Policy Optimization for radiology report generation, achieving SOTA performance on MIMIC-CXR with reduced hallucinations.
Details
Motivation: MLLMs show potential for radiology report generation but face challenges: architectural heterogeneity, factual hallucinations, standard fine-tuning fails to align outputs with visual evidence, and existing RL approaches have high computational costs or limited exploration.
Method: 1) Systematic evaluation to identify optimal vision encoder and LLM backbone configurations; 2) Novel “Reason-then-Summarize” architecture with think block for detailed findings and answer block for structured disease labels; 3) Optimized via Group Relative Policy Optimization (GRPO) with multi-dimensional composite reward function that penalizes logical discrepancies.
Result: Extensive experiments on MIMIC-CXR benchmark demonstrate state-of-the-art performance in clinical efficacy metrics and significant reduction in hallucinations compared to strong supervised baselines.
Conclusion: The proposed framework effectively addresses hallucination issues in radiology report generation by restructuring generation into reasoning and summarization components with explicit logical consistency enforcement, enabling more clinically reliable MLLM applications.
Abstract: Multimodal Large Language Models (MLLMs) have shown strong potential for radiology report generation, yet their clinical translation is hindered by architectural heterogeneity and the prevalence of factual hallucinations. Standard supervised fine-tuning often fails to strictly align linguistic outputs with visual evidence, while existing reinforcement learning approaches struggle with either prohibitive computational costs or limited exploration. To address these challenges, we propose a comprehensive framework for self-consistent radiology report generation. First, we conduct a systematic evaluation to identify optimal vision encoder and LLM backbone configurations for medical imaging. Building on this foundation, we introduce a novel “Reason-then-Summarize” architecture optimized via Group Relative Policy Optimization (GRPO). This framework restructures generation into two distinct components: a think block for detailed findings and an answer block for structured disease labels. By utilizing a multi-dimensional composite reward function, we explicitly penalize logical discrepancies between the generated narrative and the final diagnosis. Extensive experiments on the MIMIC-CXR benchmark demonstrate that our method achieves state-of-the-art performance in clinical efficacy metrics and significantly reduces hallucinations compared to strong supervised baselines.
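A minimal sketch of the reward plumbing such a "Reason-then-Summarize" setup needs: parse the two blocks, then penalize disagreement between the narrative and the final labels (the tag format, label matching, and scoring are assumptions, not the paper's reward):

```python
import re

def parse_report(output: str):
    """Split a generation into a findings block and a structured answer block."""
    think = re.search(r"<think>(.*?)</think>", output, re.S)
    answer = re.search(r"<answer>(.*?)</answer>", output, re.S)
    return (think.group(1) if think else ""), (answer.group(1) if answer else "")

def consistency_reward(output: str, diseases=("pneumonia", "edema", "effusion")):
    """Reward 1.0 when every disease mentioned in the findings appears in the
    labels and vice versa; each discrepancy lowers the reward."""
    findings, answer = parse_report(output)
    mentioned = {d for d in diseases if d in findings.lower()}
    labeled = {d for d in diseases if d in answer.lower()}
    return 1.0 - len(mentioned ^ labeled) / max(len(diseases), 1)

out = "<think>Findings suggest pleural effusion.</think><answer>effusion</answer>"
print(consistency_reward(out))  # 1.0: narrative and labels agree
```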
[366] Mathematical Foundations of Polyphonic Music Generation via Structural Inductive Bias
Joonwon Seo
Main category: cs.LG
TL;DR: Novel polyphonic music generation approach using structural inductive bias to solve “Missing Middle” problem, with mathematical proofs and empirical validation on Beethoven piano sonatas.
Details
Motivation: Address the "Missing Middle" problem in polyphonic music generation - the gap between low-level note generation and high-level musical structure. Current AI music generation lacks mathematical grounding and verifiable theoretical foundations.
Method: 1) Empirical verification of pitch-hand independence using normalized mutual information (NMI=0.167). 2) Smart Embedding architecture reducing parameters by 48.30%. 3) Mathematical proofs using information theory (0.153 bits loss bound), Rademacher complexity (28.09% tighter bound), and category theory. 4) Validation through SVD analysis and expert listening study (N=53).
Result: 9.47% reduction in validation loss, improved stability and generalization. Smart Embedding achieves 48.30% parameter reduction while maintaining performance. Expert listening study confirms quality.
Conclusion: The approach bridges theoretical and applied AI music generation, providing mathematically grounded deep learning with verifiable insights. Offers a dual framework addressing both theoretical soundness and practical performance in polyphonic music generation.
Abstract: This monograph introduces a novel approach to polyphonic music generation by addressing the “Missing Middle” problem through structural inductive bias. Focusing on Beethoven’s piano sonatas as a case study, we empirically verify the independence of pitch and hand attributes using normalized mutual information (NMI=0.167) and propose the Smart Embedding architecture, achieving a 48.30% reduction in parameters. We provide rigorous mathematical proofs using information theory (negligible loss bounded at 0.153 bits), Rademacher complexity (28.09% tighter generalization bound), and category theory to demonstrate improved stability and generalization. Empirical results show a 9.47% reduction in validation loss, confirmed by SVD analysis and an expert listening study (N=53). This dual theoretical and applied framework bridges gaps in AI music generation, offering verifiable insights for mathematically grounded deep learning.
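The pitch-hand independence check is reproducible with scikit-learn's NMI; a toy sketch on synthetic attributes (the study itself uses Beethoven sonata data and reports NMI = 0.167):

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

rng = np.random.default_rng(0)
pitch = rng.integers(21, 109, size=10_000)   # MIDI pitch range of a piano
hand = rng.integers(0, 2, size=10_000)       # 0 = left hand, 1 = right hand

# Values near 0 support treating the attributes as independent and
# factorizing their embeddings, as the paper does.
print(normalized_mutual_info_score(pitch, hand))
```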
[367] Sensor to Pixels: Decentralized Swarm Gathering via Image-Based Reinforcement Learning
Yigal Koifman, Eran Iceland, Erez Koifman, Ariel Barel, Alfred M. Bruckstein
Main category: cs.LG
TL;DR: Proposes image-based RL for decentralized multi-agent swarm control using visual observation encoding, achieving high convergence rates comparable to fast neural methods while maintaining reliability.
Details
Motivation: Traditional multi-agent RL methods rely on handcrafted features or raw vector representations that limit the scalability and efficiency of learned policies with respect to input order and size. Better ways for agents to sense, interpret, and process inputs are needed for effective policy learning in swarm tasks.
Method: An image-based reinforcement learning method for decentralized multi-agent control where observations are encoded as structured visual inputs. Neural networks process these visual inputs to extract spatial features and generate novel decentralized motion control rules.
Result: Achieves high convergence in multi-agent aggregation tasks with limited-range bearing-only sensing. Performs nearly as fast as VariAntNet (fast neural method) while maintaining high success rates. In some scenarios, serves as the only practical alternative between slow analytical solutions and unreliable fast methods.
Conclusion: Image-based RL provides an effective approach for decentralized swarm control, offering a balance between convergence speed and reliability by leveraging visual encoding of observations for spatial feature extraction.
Abstract: This study highlights the potential of image-based reinforcement learning methods for addressing swarm-related tasks. In multi-agent reinforcement learning, effective policy learning depends on how agents sense, interpret, and process inputs. Traditional approaches often rely on handcrafted feature extraction or raw vector-based representations, which limit the scalability and efficiency of learned policies with respect to input order and size. In this work we propose an image-based reinforcement learning method for decentralized control of a multi-agent system, where observations are encoded as structured visual inputs that can be processed by Neural Networks, extracting its spatial features and producing novel decentralized motion control rules. We evaluate our approach on a multi-agent convergence task of agents with limited-range and bearing-only sensing that aim to keep the swarm cohesive during the aggregation. The algorithm’s performance is evaluated against two benchmarks: an analytical solution proposed by Bellaiche and Bruckstein, which ensures convergence but progresses slowly, and VariAntNet, a neural network-based framework that converges much faster but shows moderate success rates in hard constellations. Our method achieves high convergence, with a pace nearly matching that of VariAntNet. In some scenarios, it serves as the only practical alternative.
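A minimal sketch of the paper's core encoding step as we read it: rasterizing bearing-only neighbour detections into an egocentric image that a CNN policy can consume (grid size and channel layout are assumptions):

```python
import numpy as np

def bearings_to_image(bearings, grid=32):
    """Rasterize bearing-only neighbour detections onto a ring in an
    egocentric single-channel image (the agent sits at the center).
    This representation is invariant to the order and number of inputs."""
    img = np.zeros((grid, grid), dtype=np.float32)
    c, radius = grid // 2, grid // 2 - 1
    for theta in bearings:                       # bearings in radians
        x = int(round(c + radius * np.cos(theta)))
        y = int(round(c + radius * np.sin(theta)))
        img[y, x] = 1.0
    return img

obs = [0.3, 2.1, -1.0]                           # three detected neighbours
print(bearings_to_image(obs).sum())              # 3.0: three pixels marked
```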
[368] HEEGNet: Hyperbolic Embeddings for EEG
Shanglin Li, Shiwen Chu, Okan Koç, Yi Ding, Qibin Zhao, Motoaki Kawanabe, Ziheng Chen
Main category: cs.LG
TL;DR: HEEGNet is a hybrid hyperbolic network that captures EEG’s hierarchical structure using hyperbolic embeddings, improving generalization across domains through coarse-to-fine domain adaptation.
Details
Motivation: EEG-based brain-computer interfaces suffer from poor generalization due to distribution shifts across subjects/domains. Learning robust, domain-invariant representations that capture EEG's hierarchical structure could mitigate these shifts.
Method: HEEGNet combines Euclidean and hyperbolic encoders with a novel coarse-to-fine domain adaptation strategy. First demonstrates EEG’s hyperbolicity, then uses hyperbolic embeddings to represent hierarchical data, leveraging hyperbolic spaces as natural geometry for tree-like structures.
Result: Extensive experiments on multiple public EEG datasets (visual evoked potentials, emotion recognition, intracranial EEG) show HEEGNet achieves state-of-the-art performance, with hyperbolic embeddings improving generalization across domains.
Conclusion: EEG data exhibits hyperbolicity, and hyperbolic embeddings effectively capture its hierarchical structure, enabling better domain adaptation and generalization in EEG decoding tasks through the proposed HEEGNet architecture.
Abstract: Electroencephalography (EEG)-based brain-computer interfaces facilitate direct communication with a computer, enabling promising applications in human-computer interactions. However, their utility is currently limited because EEG decoding often suffers from poor generalization due to distribution shifts across domains (e.g., subjects). Learning robust representations that capture underlying task-relevant information would mitigate these shifts and improve generalization. One promising approach is to exploit the underlying hierarchical structure in EEG, as recent studies suggest that hierarchical cognitive processes, such as visual processing, can be encoded in EEG. While many decoding methods still rely on Euclidean embeddings, recent work has begun exploring hyperbolic geometry for EEG. Hyperbolic spaces, regarded as the continuous analogue of tree structures, provide a natural geometry for representing hierarchical data. In this study, we first empirically demonstrate that EEG data exhibit hyperbolicity and show that hyperbolic embeddings improve generalization. Motivated by these findings, we propose HEEGNet, a hybrid hyperbolic network architecture to capture the hierarchical structure in EEG and learn domain-invariant hyperbolic embeddings. To this end, HEEGNet combines both Euclidean and hyperbolic encoders and employs a novel coarse-to-fine domain adaptation strategy. Extensive experiments on multiple public EEG datasets, covering visual evoked potentials, emotion recognition, and intracranial EEG, demonstrate that HEEGNet achieves state-of-the-art performance.
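Hyperbolic embeddings commonly live in the Poincaré ball, where the geodesic distance has a closed form; a minimal PyTorch sketch (the paper's exact manifold operations may differ):

```python
import torch

def poincare_distance(u, v, eps=1e-5):
    """Geodesic distance between points inside the unit Poincaré ball:
    d(u, v) = arcosh(1 + 2 ||u - v||^2 / ((1 - ||u||^2)(1 - ||v||^2)))."""
    sq_diff = (u - v).pow(2).sum(-1)
    denom = (1 - u.pow(2).sum(-1)).clamp_min(eps) * \
            (1 - v.pow(2).sum(-1)).clamp_min(eps)
    return torch.acosh(1 + 2 * sq_diff / denom)

u = torch.tensor([0.1, 0.2])      # embeddings must stay inside the unit ball
v = torch.tensor([0.5, -0.3])
print(poincare_distance(u, v))    # grows rapidly as points approach the rim
```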
[369] Extreme-value forest fire prediction: A study of the Loss Function in an Ordinality Scheme
Nicolas Caron, Christophe Guyeux, Hassan Noura, Benjamin Aynes
Main category: cs.LG
TL;DR: This paper introduces an ordinal classification framework for wildfire severity forecasting in France, comparing different loss functions to improve prediction of rare high-severity events, with Weighted Kappa Loss showing best results.
Details
Motivation: Wildfires are highly imbalanced natural hazards with rare but critical high-severity events that are challenging to predict using conventional methods. There's a need for forecasting systems aligned with operational decision-making that can better handle extreme events.
Method: The authors introduce the first ordinal classification framework for wildfire severity forecasting in France. They investigate loss-function design, comparing standard cross-entropy with ordinal-aware objectives including a novel probabilistic TDeGPD loss derived from truncated discrete exponentiated Generalized Pareto Distribution. They conduct extensive benchmarking over multiple neural architectures using real operational data.
Result: Ordinal supervision substantially improves model performance over conventional approaches. Weighted Kappa Loss (WKLoss) achieves the best overall results with more than +0.1 IoU gain on the most extreme severity classes while maintaining competitive calibration quality. However, performance remains limited for the rarest events due to their extremely low representation in the dataset.
Conclusion: The findings highlight the importance of integrating severity ordering, data imbalance considerations, and seasonality risk into wildfire forecasting systems. Future work will focus on incorporating seasonal dynamics and uncertainty information to further improve extreme-event prediction reliability.
Abstract: Wildfires are highly imbalanced natural hazards in both space and severity, making the prediction of extreme events particularly challenging. In this work, we introduce the first ordinal classification framework for forecasting wildfire severity levels directly aligned with operational decision-making in France. Our study investigates the influence of loss-function design on the ability of neural models to predict rare yet critical high-severity fire occurrences. We compare standard cross-entropy with several ordinal-aware objectives, including the proposed probabilistic TDeGPD loss derived from a truncated discrete exponentiated Generalized Pareto Distribution. Through extensive benchmarking over multiple architectures and real operational data, we show that ordinal supervision substantially improves model performance over conventional approaches. In particular, the Weighted Kappa Loss (WKLoss) achieves the best overall results, with more than +0.1 IoU gain on the most extreme severity classes while maintaining competitive calibration quality. However, performance remains limited for the rarest events due to their extremely low representation in the dataset. These findings highlight the importance of integrating both severity ordering, data imbalance considerations, and seasonality risk into wildfire forecasting systems. Future work will focus on incorporating seasonal dynamics and uncertainty information into training to further improve the reliability of extreme-event prediction.
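A simplified sketch of the quadratic-weighting idea behind WKLoss: penalize predicted probability mass by its squared ordinal distance from the true severity class (this shows only the expected-disagreement term, not the full kappa normalization):

```python
import torch
import torch.nn.functional as F

def quadratic_ordinal_penalty(logits, targets, num_classes=5):
    """Expected squared ordinal distance between the true severity class and
    the predicted class distribution; misranking a severity-0 day as
    severity-4 costs 16x more than an off-by-one error."""
    classes = torch.arange(num_classes, dtype=torch.float32)
    w = (classes.unsqueeze(0) - classes.unsqueeze(1)) ** 2 / (num_classes - 1) ** 2
    probs = F.softmax(logits, dim=-1)
    return (w[targets] * probs).sum(-1).mean()

logits = torch.randn(4, 5)                 # four days, five severity levels
targets = torch.tensor([0, 1, 4, 2])
print(quadratic_ordinal_penalty(logits, targets))
```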
[370] Inferring Clinically Relevant Molecular Subtypes of Pancreatic Cancer from Routine Histopathology Using Deep Learning
Abdul Rehman Akbar, Alejandro Levya, Ashwini Esnakula, Elshad Hasanov, Anne Noonan, Upender Manne, Vaibhav Sahai, Lingbin Meng, Susan Tsai, Anil Parwani, Wei Chen, Ashish Manne, Muhammad Khalid Khan Niazi
Main category: cs.LG
TL;DR: PanSubNet is an interpretable deep learning framework that predicts pancreatic cancer molecular subtypes (basal-like vs classical) directly from routine H&E-stained whole slide images, offering a clinically deployable alternative to expensive RNA-seq methods.
Details
Motivation: Current molecular subtyping of pancreatic ductal adenocarcinoma (PDAC) using RNA-seq is limited in clinical practice due to high cost, long turnaround time, and tissue requirements, restricting its application in PDAC management.
Method: PanSubNet uses a dual-scale architecture that fuses cellular-level morphology with tissue-level architecture from H&E-stained WSIs, employing attention mechanisms for multi-scale representation learning and transparent feature attribution. Developed on 1,055 patients across two cohorts with paired histology and RNA-seq data.
Result: Achieved mean AUC of 88.5% on internal validation and 84.0% on external validation without fine-tuning. Preserved and strengthened prognostic stratification compared to RNA-seq labels, with prediction uncertainty linked to intermediate transcriptional states rather than classification noise.
Conclusion: PanSubNet enables rapid, cost-effective molecular stratification from routine H&E slides, offering a clinically deployable and interpretable tool for genetic subtyping that can advance precision oncology for PDAC.
Abstract: Molecular subtyping of PDAC into basal-like and classical has established prognostic and predictive value. However, its use in clinical practice is limited by cost, turnaround time, and tissue requirements, thereby restricting its application in the management of PDAC. We introduce PanSubNet, an interpretable deep learning framework that predicts therapy-relevant molecular subtypes directly from standard H&E-stained WSIs. PanSubNet was developed using data from 1,055 patients across two multi-institutional cohorts (PANCAN, n=846; TCGA, n=209) with paired histology and RNA-seq data. Ground-truth labels were derived using the validated Moffitt 50-gene signature refined by GATA6 expression. The model employs a dual-scale architecture that fuses cellular-level morphology with tissue-level architecture, leveraging attention mechanisms for multi-scale representation learning and transparent feature attribution. On internal validation within PANCAN using five-fold cross-validation, PanSubNet achieved a mean AUC of 88.5% with balanced sensitivity and specificity. External validation on the independent TCGA cohort without fine-tuning demonstrated robust generalizability (AUC 84.0%). PanSubNet preserved and, in metastatic disease, strengthened prognostic stratification compared to RNA-seq based labels. Prediction uncertainty was linked to intermediate transcriptional states, not classification noise. Model predictions are aligned with established transcriptomic programs, differentiation markers, and DNA damage repair signatures. By enabling rapid, cost-effective molecular stratification from routine H&E-stained slides, PanSubNet offers a clinically deployable and interpretable tool for genetic subtyping. We are gathering data from two institutions to validate and assess real-world performance, supporting integration into digital pathology workflows and advancing precision oncology for PDAC.
[371] Attention mechanisms in neural networks
Hasi Hays
Main category: cs.LG
TL;DR: A comprehensive mathematical treatment of attention mechanisms covering theory, computation, and applications across NLP, vision, and multimodal tasks.
Details
Motivation: Attention mechanisms represent a fundamental paradigm shift in neural networks, enabling selective focus on relevant input portions through learned weighting functions. A rigorous mathematical treatment of their foundations and properties has been lacking.
Method: Provides a comprehensive mathematical analysis of attention mechanisms, including theoretical foundations, computational properties, and practical implementations. Examines applications in language modeling (autoregressive transformers, bidirectional encoders), sequence-to-sequence translation, Vision Transformers for image classification, and cross-modal attention for vision-language tasks.
Result: Empirical analysis reveals training characteristics, scaling laws relating performance to model size/computation, attention pattern visualizations, and performance benchmarks across standard datasets. Shows interpretability of learned attention patterns and their relationship to linguistic/visual structures.
Conclusion: Concludes with critical examination of current limitations including computational scalability, data efficiency, systematic generalization, and interpretability challenges. Provides comprehensive understanding of attention mechanisms as fundamental neural network paradigm.
Abstract: Attention mechanisms represent a fundamental paradigm shift in neural network architectures, enabling models to selectively focus on relevant portions of input sequences through learned weighting functions. This monograph provides a comprehensive and rigorous mathematical treatment of attention mechanisms, encompassing their theoretical foundations, computational properties, and practical implementations in contemporary deep learning systems. Applications in natural language processing, computer vision, and multimodal learning demonstrate the versatility of attention mechanisms. We examine language modeling with autoregressive transformers, bidirectional encoders for representation learning, sequence-to-sequence translation, Vision Transformers for image classification, and cross-modal attention for vision-language tasks. Empirical analysis reveals training characteristics, scaling laws that relate performance to model size and computation, attention pattern visualizations, and performance benchmarks across standard datasets. We discuss the interpretability of learned attention patterns and their relationship to linguistic and visual structures. The monograph concludes with a critical examination of current limitations, including computational scalability, data efficiency, systematic generalization, and interpretability challenges.
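The central object of the monograph is scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V; a minimal implementation:

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)   # learned focus over positions
    return weights @ v, weights

q = k = v = torch.randn(1, 10, 64)            # (batch, seq_len, d_k)
out, attn = scaled_dot_product_attention(q, k, v)
print(out.shape, attn.shape)                  # (1, 10, 64) (1, 10, 10)
```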
[372] LUT-KAN: Segment-wise LUT Quantization for Fast KAN Inference
Oleksandr Kuznetsov
Main category: cs.LG
TL;DR: LUT-KAN introduces a lookup-table compilation and quantization method for KAN networks that replaces expensive spline evaluations with quantized LUTs, achieving 10-12x speedup while maintaining accuracy.
Details
Motivation: KAN networks use learnable spline functions instead of scalar weights, which makes CPU inference expensive due to many spline evaluations. Standard quantization methods don't work well because the main computation isn't matrix multiplication but spline evaluation.
Method: LUT-KAN converts each edge function into per-segment lookup tables with affine int8/uint8 quantization and linear interpolation. It provides explicit inference contracts with boundary conventions and OOB policies, and uses an "honest baseline" methodology for fair speed comparisons.
Result: In a DoS attack detection case study, LUT-KAN preserved classification quality (F1 drop <0.0002) while reducing CPU inference latency by 12x under NumPy and 10x under Numba backends. Memory overhead is ~10x at L=64 resolution.
Conclusion: LUT-KAN enables efficient CPU inference for KAN networks through LUT compilation and quantization, achieving significant speedups while maintaining accuracy, with reproducible artifacts and explicit inference contracts.
Abstract: Kolmogorov–Arnold Networks (KAN) replace scalar weights by learnable univariate functions, often implemented with B-splines. This design can be accurate and interpretable, but it makes inference expensive on CPU because each layer requires many spline evaluations. Standard quantization toolchains are also hard to apply because the main computation is not a matrix multiply but repeated spline basis evaluation. This paper introduces LUT-KAN, a segment-wise lookup-table (LUT) compilation and quantization method for PyKAN-style KAN layers. LUT-KAN converts each edge function into a per-segment LUT with affine int8/uint8 quantization and linear interpolation. The method provides an explicit and reproducible inference contract, including boundary conventions and out-of-bounds (OOB) policies. We propose an “honest baseline” methodology for speed evaluation: B-spline evaluation and LUT evaluation are compared under the same backend optimization (NumPy vs NumPy and Numba vs Numba), which separates representation gains from vectorization and JIT effects. Experiments include controlled sweeps over LUT resolution L in 16, 32, 64, 128 and two quantization schemes (symmetric int8 and asymmetric uint8). We report accuracy, speed, and memory metrics with mean and standard deviation across multiple seeds. A two-by-two OOB robustness matrix evaluates behavior under different boundary modes and OOB policies. In a case study, we compile a trained KAN model for DoS attack detection (CICIDS2017 pipeline) into LUT artifacts. The compiled model preserves classification quality (F1 drop below 0.0002) while reducing steady-state CPU inference latency by 12x under NumPy and 10x under Numba backends (honest baseline). The memory overhead is approximately 10x at L=64. All code and artifacts are publicly available with fixed release tags for reproducibility.
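A minimal sketch of the compile-then-interpolate step described above: sample an edge function into a table, quantize to symmetric int8, and evaluate by linear interpolation with clamped out-of-bounds inputs (everything beyond the L=64/int8 setting is illustrative):

```python
import numpy as np

def compile_lut(fn, lo, hi, L=64):
    """Sample fn on [lo, hi] at L+1 knots and quantize to symmetric int8."""
    xs = np.linspace(lo, hi, L + 1)
    ys = fn(xs)
    scale = np.abs(ys).max() / 127.0
    q = np.round(ys / scale).astype(np.int8)
    return q, scale, lo, hi

def lut_eval(x, q, scale, lo, hi):
    """Dequantize and linearly interpolate; clamp out-of-bounds inputs
    (one possible OOB policy among those the paper enumerates)."""
    L = len(q) - 1
    t = np.clip((x - lo) / (hi - lo), 0.0, 1.0) * L
    i = np.minimum(t.astype(int), L - 1)
    frac = t - i
    y = q.astype(np.float32) * scale
    return y[i] * (1 - frac) + y[i + 1] * frac

q, s, lo, hi = compile_lut(np.tanh, -3.0, 3.0)
x = np.linspace(-3, 3, 7)
print(np.abs(lut_eval(x, q, s, lo, hi) - np.tanh(x)).max())  # small error
```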
[373] Physics-Informed Gaussian Process Regression for the Constitutive Modeling of Concrete: A Data-Driven Improvement to Phenomenological Models
Chenyang Li, Himanshu Sharma, Youcai Wu, Joseph Magallanes, K. T. Ramesh, Michael D. Shields
Main category: cs.LG
TL;DR: A physics-informed Gaussian Process Regression framework replaces empirical failure surfaces in concrete constitutive models, improving generalization and uncertainty quantification while maintaining modular elastoplastic structure.
Details
Motivation: Existing phenomenological concrete models like KCC rely on empirically calibrated failure surfaces that lack flexibility in model form and proper uncertainty quantification, limiting their reliability and generalization capabilities.
Method: Develops a physics-informed framework that retains KCC’s modular elastoplastic structure but replaces empirical failure surfaces with constrained Gaussian Process Regression surrogates learned directly from experimental data, incorporating derivative-based constraints aligned with known material behavior.
Result: Physics-informed GPR with derivative constraints yields markedly better accuracy and reliability than unconstrained GPR, especially under extrapolation to higher confinement levels beyond training range, while also reducing predictive variance and producing tighter confidence intervals.
Conclusion: The proposed approach delivers a robust, uncertainty-aware surrogate that improves generalization and streamlines calibration without sacrificing interpretability and numerical efficiency, offering a practical path toward improved constitutive models for concrete.
Abstract: Understanding and modeling the constitutive behavior of concrete is crucial for civil and defense applications, yet widely used phenomenological models such as Karagozian & Case concrete (KCC) model depend on empirically calibrated failure surfaces that lack flexibility in model form and associated uncertainty quantification. This work develops a physics-informed framework that retains the modular elastoplastic structure of KCC model while replacing its empirical failure surface with a constrained Gaussian Process Regression (GPR) surrogate that can be learned directly from experimentally accessible observables. Triaxial compression data under varying confinement levels are used for training, and the surrogate is then evaluated at confinement levels not included in the training set to assess its generalization capability. Results show that an unconstrained GPR interpolates well near training conditions but deteriorates and violates essential physical constraints under extrapolation, even when augmented with simulated data. In contrast, a physics-informed GPR that incorporates derivative-based constraints aligned with known material behavior yields markedly better accuracy and reliability, including at higher confinement levels beyond the training range. Probabilistic enforcement of these constraints also reduces predictive variance, producing tighter confidence intervals in data-scarce regimes. Overall, the proposed approach delivers a robust, uncertainty-aware surrogate that improves generalization and streamlines calibration without sacrificing the interpretability and numerical efficiency of the KCC model, offering a practical path toward an improved constitutive models for concrete.
[374] Improving Underwater Acoustic Classification Through Learnable Gabor Filter Convolution and Attention Mechanisms
Lucas Cesar Ferreira Domingos, Russell Brinkworth, Paulo Eduardo Santos, Karl Sammut
Main category: cs.LG
TL;DR: GSE ResNeXt: A deep learning model combining learnable Gabor filters with ResNeXt backbone and squeeze-excitation attention for underwater acoustic target classification, showing improved performance and training efficiency.
Details
Motivation: Underwater acoustic target classification is crucial for environmental monitoring and defense, but faces challenges from complex ship-radiated/environmental noise, limited datasets, and lack of standardized experimentation that hinder generalization and robustness.
Method: Proposes GSE ResNeXt architecture integrating learnable Gabor convolutional layers with ResNeXt backbone enhanced by squeeze-and-excitation attention. Gabor filters act as 2D adaptive band-pass filters extending feature channel representation. Evaluated using three training-test split strategies addressing data leakage, temporal separation, and taxonomy issues.
Result: GSE ResNeXt consistently outperforms baseline models (Xception, ResNet, MobileNetV2) in classification performance. Adding Gabor convolutions reduced training time by up to 62%. Temporal separation between subsets proved more influential than training data volume on performance.
Conclusion: Signal processing can enhance model reliability and generalization in data-limited underwater acoustic classification. Future work should focus on mitigating environmental effects on input signals.
Abstract: Remotely detecting and classifying underwater acoustic targets is critical for environmental monitoring and defence. However, the complexity of ship-radiated and environmental noise poses significant challenges for accurate signal processing. While recent advancements in machine learning have improved classification accuracy, limited dataset availability and a lack of standardised experimentation hinder generalisation and robustness. This paper introduces GSE ResNeXt, a deep learning architecture integrating learnable Gabor convolutional layers with a ResNeXt backbone enhanced by squeeze-and-excitation attention. The Gabor filters serve as two-dimensional adaptive band-pass filters, extending the feature channel representation. Its combination with channel attention improves training stability and convergence while enhancing the model’s ability to extract discriminative features. The model is evaluated using three training-test split strategies that reflect increasingly complex classification tasks, demonstrating how systematic evaluation design addresses issues such as data leakage, temporal separation, and taxonomy. Results show that GSE ResNeXt consistently outperforms baseline models like Xception, ResNet, and MobileNetV2 in terms of classification performance. Regarding stability and convergence, adding Gabor convolutions to the initial layers of the model reduced training time by up to 62%. During the evaluation of training-testing splits, temporal separation between subsets significantly affected performance, proving more influential than training data volume. These findings suggest that signal processing can enhance model reliability and generalisation under varying environmental conditions, particularly in data-limited underwater acoustic classification. Future developments should focus on mitigating environmental effects on input signals.
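A learnable Gabor front-end of the kind GSE ResNeXt describes can be sketched as a convolution whose kernels are synthesized from trainable frequency, orientation, and bandwidth parameters. The parameterization below is an assumption for illustration, not the authors' exact layer.

```python
import torch
import torch.nn as nn

class LearnableGaborConv2d(nn.Module):
    """Conv layer whose kernels are generated on the fly from learnable
    Gabor parameters (frequency, orientation, envelope width)."""
    def __init__(self, out_channels, kernel_size=15):
        super().__init__()
        self.k = kernel_size
        self.freq = nn.Parameter(torch.rand(out_channels) * 0.5 + 0.1)
        self.theta = nn.Parameter(torch.rand(out_channels) * torch.pi)
        self.sigma = nn.Parameter(torch.ones(out_channels) * 4.0)

    def kernels(self):
        r = torch.arange(self.k, dtype=torch.float32) - self.k // 2
        y, x = torch.meshgrid(r, r, indexing="ij")
        x, y = x[None], y[None]                       # (1, k, k)
        th = self.theta[:, None, None]
        xr = x * torch.cos(th) + y * torch.sin(th)    # rotated coordinates
        yr = -x * torch.sin(th) + y * torch.cos(th)
        env = torch.exp(-(xr**2 + yr**2) / (2 * self.sigma[:, None, None] ** 2))
        carrier = torch.cos(2 * torch.pi * self.freq[:, None, None] * xr)
        return (env * carrier)[:, None]               # (out, 1, k, k)

    def forward(self, x):                             # x: (B, 1, H, W)
        return nn.functional.conv2d(x, self.kernels(), padding=self.k // 2)
```

Because the kernel bank is rebuilt from a handful of parameters each forward pass, gradients flow into the Gabor parameters rather than into free-form weights, which is one plausible reason such a front-end can stabilize early training.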
[375] Enhancing Small Dataset Classification Using Projected Quantum Kernels with Convolutional Neural Networks
A. M. A. S. D. Alagiyawanna, Asoka Karunananda, A. Mahasinghe, Thushari Silva
Main category: cs.LG
TL;DR: PQK-enhanced CNN achieves 95% accuracy on MNIST and 90% on CIFAR-10 with only 1000 samples, significantly outperforming classical CNN (60% and 12%).
Details
Motivation: CNNs require large labeled datasets, but many applications have limited data availability. Quantum computing principles offer potential to capture complex patterns that traditional CNNs might miss.
Method: Introduce projected quantum kernels (PQK) derived from quantum computing principles to enhance CNN feature extraction for small datasets.
Result: PQK-enhanced CNN achieved 95% accuracy on MNIST and 90% on CIFAR-10 with only 1000 training samples, while classical CNN achieved only 60% and 12% respectively.
Conclusion: Projected quantum kernels can overcome data scarcity issues in machine learning and serve as a powerful approach for enhancing CNN-based classification in data-constrained environments.
Abstract: Convolutional Neural Networks (CNNs) have shown promising results in efficiency and accuracy in image classification. However, their efficacy often relies on large, labeled datasets, posing challenges for applications with limited data availability. Our research addresses these challenges by introducing an innovative approach that leverages projected quantum kernels (PQK) to enhance feature extraction for CNNs, specifically tailored for small datasets. Projected quantum kernels, derived from quantum computing principles, offer a promising avenue for capturing complex patterns and intricate data structures that traditional CNNs might miss. By incorporating these kernels into the feature extraction process, we improved the representational ability of CNNs. Our experiments demonstrated that, with 1000 training samples, the PQK-enhanced CNN achieved 95% accuracy on the MNIST dataset and 90% on the CIFAR-10 dataset, significantly outperforming the classical CNN, which achieved only 60% and 12% accuracy on the respective datasets. This research reveals the potential of quantum computing in overcoming data scarcity issues in machine learning and paves the way for future exploration of quantum-assisted neural networks, suggesting that projected quantum kernels can serve as a powerful approach for enhancing CNN-based classification in data-constrained environments.
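One common form of the projected quantum kernel (the abstract does not spell out which variant is used here) compares single-qubit reduced density matrices of the data-encoding quantum states. A hedged sketch, assuming those matrices have been precomputed elsewhere:

```python
import numpy as np

def projected_quantum_kernel(rdms_a, rdms_b, gamma=1.0):
    """One standard PQK form:
    k(x, x') = exp(-gamma * sum_q ||rho_q(x) - rho_q(x')||_F^2),
    where rho_q are single-qubit reduced density matrices of the
    encoded states, assumed precomputed with shape (n_qubits, 2, 2)."""
    d2 = sum(np.linalg.norm(ra - rb, "fro") ** 2
             for ra, rb in zip(rdms_a, rdms_b))
    return np.exp(-gamma * d2)
```

The resulting kernel (or the projected features themselves) can then feed a classical head, which matches the paper's description of PQK-enhanced feature extraction for a CNN.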
[376] Weather-Aware Transformer for Real-Time Route Optimization in Drone-as-a-Service Operations
Kamal Mohamed, Lillian Wassim, Ali Hamdi, Khaled Shaban
Main category: cs.LG
TL;DR: Weather-aware deep learning models accelerate drone route prediction by incorporating weather heuristics into transformer/attention architectures, achieving real-time performance while maintaining optimization quality.
Details
Motivation: Classical path-planning algorithms like A* and Dijkstra have computational limitations for real-time drone operations in dynamic environments, especially when weather conditions affect routing decisions.
Method: Train ML/DL models on synthetic datasets from classical algorithm simulations, using transformer-based and attention-based architectures that incorporate weather heuristics (wind patterns, bearing, temperature) to predict optimal next-node selections.
Result: Weather-aware models achieve significant computational speedup over traditional algorithms while maintaining route optimization performance, with transformer-based architectures showing superior adaptation to dynamic environmental constraints.
Conclusion: The framework enables real-time, weather-responsive route optimization for large-scale Drone-as-a-Service operations, advancing efficiency and safety of autonomous drone systems.
Abstract: This paper presents a novel framework to accelerate route prediction in Drone-as-a-Service operations through weather-aware deep learning models. While classical path-planning algorithms, such as A* and Dijkstra, provide optimal solutions, their computational complexity limits real-time applicability in dynamic environments. We address this limitation by training machine learning and deep learning models on synthetic datasets generated from classical algorithm simulations. Our approach incorporates transformer-based and attention-based architectures that utilize weather heuristics to predict optimal next-node selections while accounting for meteorological conditions affecting drone operations. The attention mechanisms dynamically weight environmental factors including wind patterns, wind bearing, and temperature to enhance routing decisions under adverse weather conditions. Experimental results demonstrate that our weather-aware models achieve significant computational speedup over traditional algorithms while maintaining route optimization performance, with transformer-based architectures showing superior adaptation to dynamic environmental constraints. The proposed framework enables real-time, weather-responsive route optimization for large-scale DaaS operations, representing a substantial advancement in the efficiency and safety of autonomous drone systems.
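The described mechanism, attention that dynamically weights weather factors when scoring candidate next nodes, can be sketched roughly as follows. The dimensions, scoring head, and use of a stock multi-head attention module are all assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn

class WeatherAttention(nn.Module):
    """Each candidate next node attends over its edge's weather factors
    (wind speed, wind bearing, temperature) before being scored."""
    def __init__(self, d=32, n_weather=3):
        super().__init__()
        self.embed_w = nn.Linear(1, d)                 # embed each scalar factor
        self.attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.score = nn.Linear(d, 1)
        self.n_weather = n_weather

    def forward(self, node_emb, weather):
        # node_emb: (B, C, d) candidate-node embeddings
        # weather:  (B, C, n_weather) per-candidate weather scalars
        B, C, _ = weather.shape
        w = self.embed_w(weather.reshape(B * C, self.n_weather, 1))
        q = node_emb.reshape(B * C, 1, -1)             # one query per candidate
        ctx, _ = self.attn(q, w, w)                    # weight weather factors
        return self.score(ctx).reshape(B, C)           # next-node logits
```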
[377] SPD Matrix Learning for Neuroimaging Analysis: Perspectives, Methods, and Challenges
Ce Ju, Reinmar Kobler, Antoine Collas, Motoaki Kawanabe, Cuntai Guan, Bertrand Thirion
Main category: cs.LG
TL;DR: This review paper consolidates machine learning methods that operate on symmetric positive definite (SPD) matrices for neuroimaging analysis, presenting SPD matrix learning as a unified framework that bridges classical geometric statistics with modern AI approaches.
Details
Motivation: Neuroimaging faces modality-specific challenges including measurement noise, spatial/temporal distortions, heterogeneous protocols, and limited sample sizes. SPD-valued representations naturally emerge across neuroimaging modalities, and Riemannian geometry provides principled statistical modeling on the resulting manifold.
Method: SPD matrix learning framework that endows SPD space with Riemannian metrics, enabling non-Euclidean geometric structure for statistical modeling and machine learning. The approach preserves symmetry and positive definiteness while avoiding degeneracies inherent to Euclidean embeddings.
Result: SPD matrix learning provides: (1) mathematically natural and numerically stable modeling on the SPD manifold, (2) extension of established geometric statistical tools used across neuroimaging, and (3) integration with new-generation AI technologies for previously inaccessible neuroimaging problems.
Conclusion: SPD matrix learning offers a principled and forward-looking framework for next-generation neuroimaging analytics, bringing conceptual clarity across modalities, establishing continuity with decades of geometric statistics, and serving as a methodological bridge between classical analysis and emerging AI paradigms.
Abstract: Neuroimaging provides essential tools for characterizing brain activity by quantifying connectivity strength between remote regions, using different modalities that capture different aspects of connectivity. Yet, decoding meaningful neural signatures must contend with modality-specific challenges, including measurement noise, spatial and temporal distortions, heterogeneous acquisition protocols, and limited sample sizes. A unifying perspective emerges when these data are expressed through symmetric positive definite (SPD)-valued representations: across neuroimaging modalities, SPD-valued representations naturally give rise to SPD matrices that capture dependencies between sensors or brain regions. Endowing the SPD space with Riemannian metrics equips it with a non-Euclidean geometric structure, enabling principled statistical modeling and machine learning on the resulting manifold. This review consolidates machine learning methodologies that operate on the SPD manifold under a unified framework termed SPD matrix learning. SPD matrix learning brings conceptual clarity across multiple modalities, establishes continuity with decades of geometric statistics in neuroimaging, and positions SPD modeling as a methodological bridge between classical analysis and emerging AI-driven paradigms. We show that (i) modeling on the SPD manifold is mathematically natural and numerically stable, preserving symmetry and positive definiteness while avoiding degeneracies inherent to Euclidean embeddings; (ii) SPD matrix learning extends a broad family of established geometric statistical tools used across neuroimaging; and (iii) SPD matrix learning integrates new-generation AI technologies, driving a new class of neuroimaging problems that were previously out of reach. Taken together, SPD matrix learning offers a principled and forward-looking framework for next-generation neuroimaging analytics.
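Two of the geometric primitives this review builds on, the affine-invariant distance and the log-Euclidean mean on the SPD manifold, are compact enough to sketch directly. These are the standard formulas, not code from the paper:

```python
import numpy as np
from scipy.linalg import eigh, logm, expm

def airm_distance(A, B):
    """Affine-invariant Riemannian distance between SPD matrices:
    d(A, B) = sqrt(sum_i log(lambda_i)^2), where lambda_i are the
    generalized eigenvalues of the pencil (B, A)."""
    w = eigh(B, A, eigvals_only=True)
    return np.sqrt(np.sum(np.log(w) ** 2))

def log_euclidean_mean(mats):
    """Log-Euclidean Frechet mean: expm of the average of matrix logs
    (both stay real for SPD inputs)."""
    return expm(np.mean([logm(S) for S in mats], axis=0))
```

Both operations respect positive definiteness by construction, which is exactly the degeneracy-avoidance property the review attributes to manifold-aware modeling.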
[378] SIGMA: Scalable Spectral Insights for LLM Collapse
Yi Gu, Lingyou Pang, Xiangkun Ye, Tianyu Wang, Jianyu Lin, Carey E. Priebe, Alexander Aue
Main category: cs.LG
TL;DR: SIGMA framework uses spectral analysis of embedding Gram matrices to quantify and predict model collapse in LLMs trained on synthetic data.
Details
Motivation: Model collapse is a degenerative process in LLMs trained recursively on synthetic data, causing distributional variance contraction and representational quality degradation. Current methods lack rigorous quantification and prediction of collapse onset in high-dimensional spaces.
Method: Introduces SIGMA (Spectral Inequalities for Gram Matrix Analysis) - a unified framework that benchmarks model collapse through spectral analysis of embedding Gram matrices. Uses deterministic and stochastic bounds on the matrix spectrum to track representation space contraction. Stochastic formulation enables scalable estimation for large models.
Result: SIGMA effectively captures the transition towards degenerate states, providing both theoretical insights into collapse mechanics and a practical, scalable tool for monitoring recursive training pipeline health.
Conclusion: SIGMA offers a mathematically grounded, scalable framework to quantify and predict model collapse in LLMs, addressing a critical challenge in synthetic data training pipelines.
Abstract: The rapid adoption of synthetic data for training Large Language Models (LLMs) has introduced the technical challenge of “model collapse”: a degenerative process where recursive training on model-generated content leads to a contraction of distributional variance and representational quality. While the phenomenology of collapse is increasingly evident, rigorous methods to quantify and predict its onset in high-dimensional spaces remain elusive. In this paper, we introduce SIGMA (Spectral Inequalities for Gram Matrix Analysis), a unified framework that benchmarks model collapse through the spectral lens of the embedding Gram matrix. By deriving and utilizing deterministic and stochastic bounds on the matrix’s spectrum, SIGMA provides a mathematically grounded metric to track the contraction of the representation space. Crucially, our stochastic formulation enables scalable estimation of these bounds, making the framework applicable to large-scale foundation models where full eigendecomposition is intractable. We demonstrate that SIGMA effectively captures the transition towards degenerate states, offering both theoretical insights into the mechanics of collapse and a practical, scalable tool for monitoring the health of recursive training pipelines.
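The spectral lens itself is easy to illustrate: compute the eigenvalues of the embedding Gram matrix and summarize them with a spectral entropy, whose shrinkage signals variance contraction. A minimal sketch (the paper's actual bounds and stochastic estimators are more involved):

```python
import numpy as np

def gram_spectral_entropy(E):
    """Spectral entropy of the Gram matrix G = E @ E.T for an embedding
    matrix E (n_samples x dim). A shrinking entropy / effective rank is
    one symptom of representation-space contraction under collapse."""
    G = E @ E.T
    w = np.clip(np.linalg.eigvalsh(G), 0.0, None)
    p = w / w.sum()
    p = p[p > 0]
    H = -(p * np.log(p)).sum()
    return H, np.exp(H)   # entropy and effective rank (spectrum perplexity)
```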
[379] Jailbreaking LLMs Without Gradients or Priors: Effective and Transferable Attacks
Zhakshylyk Nurlanov, Frank R. Schmidt, Florian Bernard
Main category: cs.LG
TL;DR: RAILS is a gradient-free jailbreak attack framework that uses random iterative local search on model logits, achieving high success rates on open-source models and strong transferability to closed-source systems.
Details
Motivation: Current safety evaluations overestimate LLM robustness because existing automated attacks rely on restrictive assumptions like handcrafted priors or white-box gradient access. There's a need for more effective jailbreak attacks that don't require these constraints.
Method: RAILS uses token-level iterative optimization without gradients or priors. It operates solely on model logits with two key innovations: 1) an auto-regressive loss that enforces exact prefix matching, and 2) a history-based selection strategy that bridges the gap between proxy optimization and true attack success. It also enables cross-tokenizer ensemble attacks.
Result: RAILS achieves near 100% success rates on multiple open-source models and demonstrates high black-box attack transferability to closed-source systems like GPT and Gemini through cross-tokenizer ensemble attacks.
Conclusion: Token-level iterative optimization can succeed without gradients or priors, challenging current assumptions about jailbreak attacks. RAILS provides a more realistic evaluation of LLM safety by enabling effective gradient-free attacks with strong cross-model transferability.
Abstract: As Large Language Models (LLMs) are increasingly deployed in safety-critical domains, rigorously evaluating their robustness against adversarial jailbreaks is essential. However, current safety evaluations often overestimate robustness because existing automated attacks are limited by restrictive assumptions. They typically rely on handcrafted priors or require white-box access for gradient propagation. We challenge these constraints by demonstrating that token-level iterative optimization can succeed without gradients or priors. We introduce RAILS (RAndom Iterative Local Search), a framework that operates solely on model logits. RAILS matches the effectiveness of gradient-based methods through two key innovations: a novel auto-regressive loss that enforces exact prefix matching, and a history-based selection strategy that bridges the gap between the proxy optimization objective and the true attack success rate. Crucially, by eliminating gradient dependency, RAILS enables cross-tokenizer ensemble attacks. This allows for the discovery of shared adversarial patterns that generalize across disjoint vocabularies, significantly enhancing transferability to closed-source systems. Empirically, RAILS achieves near 100% success rates on multiple open-source models and high black-box attack transferability to closed-source systems like GPT and Gemini.
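The core loop of a gradient-free random iterative local search can be sketched as below; `score_fn` is a stand-in for the paper's logit-based auto-regressive loss, and the candidate-sampling schedule is illustrative rather than RAILS's exact procedure:

```python
import random

def rails_search(score_fn, vocab_size, suffix_len=20, iters=500, cand=64):
    """Random iterative local search over an adversarial token suffix.
    score_fn(suffix) -> float returns a proxy loss (lower = closer to
    forcing the target prefix); only repeated scoring is used, never
    gradients."""
    suffix = [random.randrange(vocab_size) for _ in range(suffix_len)]
    best, best_loss = list(suffix), score_fn(suffix)
    history = [(best_loss, list(best))]
    for _ in range(iters):
        pos = random.randrange(suffix_len)            # mutate one position
        trial = list(best)
        for tok in random.sample(range(vocab_size), cand):
            trial[pos] = tok
            loss = score_fn(trial)
            if loss < best_loss:
                best, best_loss = list(trial), loss
        history.append((best_loss, list(best)))
    # History-based selection: per the paper, the returned candidate
    # should be re-ranked by actual attack success, not just proxy loss.
    return best, history
```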
[380] Spectral Archaeology: The Causal Topology of Model Evolution
Valentin Noël
Main category: cs.LG
TL;DR: Training-free mechanistic probe using attention-graph spectra reveals stable “spectral fingerprints” that expose model discontinuities missed by standard evaluation, including a syntax-triggered connectivity failure called PTCC.
Details
Motivation: Behavioral benchmarks only tell us what a model does, not how it works internally. There's a need for mechanistic analysis tools that can reveal internal processing strategies and identify potential failures in model architecture.
Method: Introduced attention-graph spectra analysis: treat each layer as a token graph, compute algebraic connectivity (λ₂), smoothness, and spectral entropy. Applied across 12 models and 10 languages to create “spectral fingerprints.”
Result: Four key findings: (1) PTCC - syntax-triggered connectivity failure on non-canonical constructions; (2) specialization trade-off between formal routing and stylistic flexibility; (3) four recurrent processing strategies identifiable via frozen-threshold rules; (4) PTCC localizes to a Layer 2 compensatory patch, partially recoverable via activation steering.
Conclusion: Attention-graph spectra provide practical tools for model auditing and training-regime verification. Topological regimes track tokenization density more than language identity, suggesting geometry varies systematically across scripts.
Abstract: Behavioral benchmarks tell us what a model does, but not how. We introduce a training-free mechanistic probe using attention-graph spectra. Treating each layer as a token graph, we compute algebraic connectivity (λ₂), smoothness, and spectral entropy. Across 12 models and 10 languages, these measures yield stable “spectral fingerprints” that expose discontinuities missed by standard evaluation. We report four results. (1) Models undergoing specific curriculum transitions (e.g., code-to-chat) show an English-only, syntax-triggered connectivity failure on non-canonical constructions, reaching Δλ₂ ≈ -0.76. We term this scar Passive-Triggered Connectivity Collapse (PTCC). Analysis of the Phi lineage reveals that PTCC appears and resolves across developmental stages, implicating brittle curriculum shifts rather than synthetic data per se. (2) PTCC reflects a specialization trade-off: strengthened formal routing at the expense of stylistic flexibility. (3) We identify four recurrent processing strategies; simple frozen-threshold rules enable perfect forensic identification across lineages. (4) Mechanistically, PTCC localizes to a sparse Layer 2 “compensatory patch” of heads that fails under syntactic stress; activation steering can partially restore connectivity, recovering ≈ 38% of lost information flow. Finally, dominant topological regimes track tokenization density more than language identity, suggesting “healthy” geometry varies systematically across scripts. Overall, attention-graph spectra provide a practical tool for auditing and training-regime verification.
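The probe's basic quantities are computable from a single attention matrix: symmetrize it into a weighted token graph, form the Laplacian, and read off λ₂ plus a spectral entropy. A minimal sketch of that computation (not the authors' code):

```python
import numpy as np

def layer_spectrum(attn):
    """Treat an attention matrix (n_tokens x n_tokens, head-averaged)
    as a weighted token graph; return algebraic connectivity (lambda_2)
    and spectral entropy as 'fingerprint' coordinates."""
    W = (attn + attn.T) / 2            # symmetrize into an undirected graph
    np.fill_diagonal(W, 0.0)
    L = np.diag(W.sum(axis=1)) - W     # graph Laplacian
    w = np.clip(np.linalg.eigvalsh(L), 0.0, None)   # ascending eigenvalues
    lam2 = w[1]                        # algebraic connectivity
    p = w[1:] / max(w[1:].sum(), 1e-12)
    entropy = -(p * np.log(np.clip(p, 1e-12, None))).sum()
    return lam2, entropy
```

A drop of λ₂ toward zero on some input family is exactly the kind of connectivity collapse the paper labels PTCC.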
[381] The Illusion of Specialization: Unveiling the Domain-Invariant “Standing Committee” in Mixture-of-Experts Models
Yan Wang, Yitao Xu, Nanhan Shen, Jinyan Su, Jimin Huang, Zining Zhu
Main category: cs.LG
TL;DR: MoE models don’t achieve true domain specialization; instead they form a compact “Standing Committee” of experts that handles most computation across all domains, while only peripheral experts handle domain-specific knowledge.
Details
Motivation: To challenge the common assumption that Mixture of Experts models achieve domain specialization through sparse routing, and to understand the actual routing behavior in these models.
Method: Introduces COMMITTEEAUDIT, a post hoc framework that analyzes routing behavior at the expert group level rather than individual experts. Applied across three representative models and the MMLU benchmark.
Result: Uncovered a domain-invariant “Standing Committee” - a compact coalition of routed experts that consistently captures majority of routing mass across domains, layers, and routing budgets, even in architectures with shared experts. Standing Committees anchor reasoning structure and syntax while peripheral experts handle domain-specific knowledge.
Conclusion: Specialization in MoE models is far less pervasive than commonly believed, revealing a strong structural bias toward centralized computation. Current training objectives like load-balancing losses that enforce uniform expert utilization may be working against the model’s natural optimization path, limiting training efficiency and performance.
Abstract: Mixture of Experts models are widely assumed to achieve domain specialization through sparse routing. In this work, we question this assumption by introducing COMMITTEEAUDIT, a post hoc framework that analyzes routing behavior at the level of expert groups rather than individual experts. Across three representative models and the MMLU benchmark, we uncover a domain-invariant Standing Committee. This is a compact coalition of routed experts that consistently captures the majority of routing mass across domains, layers, and routing budgets, even when architectures already include shared experts. Qualitative analysis further shows that Standing Committees anchor reasoning structure and syntax, while peripheral experts handle domain-specific knowledge. These findings reveal a strong structural bias toward centralized computation, suggesting that specialization in Mixture of Experts models is far less pervasive than commonly believed. This inherent bias also indicates that current training objectives, such as load-balancing losses that enforce uniform expert utilization, may be working against the model’s natural optimization path, thereby limiting training efficiency and performance.
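A back-of-the-envelope version of the committee audit: given per-domain routing mass over experts, greedily grow the smallest expert set that captures a majority of mass in every domain. This is a simplified proxy for COMMITTEEAUDIT; the greedy rule and coverage threshold are assumptions:

```python
import numpy as np

def standing_committee(routing_mass, coverage=0.5):
    """routing_mass: (n_domains, n_experts), each row summing to 1.
    Returns the smallest greedy expert set whose combined routing mass
    exceeds `coverage` in every domain simultaneously."""
    order = np.argsort(-routing_mass.mean(axis=0))   # rank by mean mass
    committee = []
    for e in order:
        committee.append(int(e))
        if (routing_mass[:, committee].sum(axis=1) >= coverage).all():
            break
    return committee
```

A small committee that survives this test across all domains is the domain-invariant coalition the paper describes; a large or domain-dependent one would indicate genuine specialization.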
[382] Enabling Agents to Communicate Entirely in Latent Space
Zhuoyun Du, Runze Wang, Huiyu Bai, Zouying Cao, Xiaoyong Zhu, Yu Cheng, Bo Zheng, Wei Chen, Haochao Ying
Main category: cs.LG
TL;DR: Interlat enables LLM agents to communicate via continuous hidden states instead of natural language, improving collaborative problem-solving through latent space communication and compression.
Details
Motivation: Natural language communication between LLM agents limits information depth and nuance due to downsampling rich internal states into discrete tokens, hindering collaborative problem-solving. Inspired by telepathy's bypass of symbolic language.
Method: Proposes Interlat (Inter-agent Latent Space Communication) using continuous last hidden states of LLMs as thought representations for direct communication. Adds learned compression process for latent space reasoning to further compress communication.
Result: Outperforms fine-tuned chain-of-thought prompting and single-agent baselines, works across heterogeneous models, promotes exploratory behavior, enables genuine latent information utilization. Compression accelerates inference up to 24× while maintaining competitive performance.
Conclusion: Demonstrates feasibility of entirely latent space inter-agent communication, highlighting its potential for future research with efficient information-preserving mechanisms.
Abstract: While natural language is the de facto communication medium for LLM-based agents, it presents a fundamental constraint. The process of downsampling rich, internal latent states into discrete tokens inherently limits the depth and nuance of information that can be transmitted, thereby hindering collaborative problem-solving. Inspired by telepathy, which bypasses symbolic language in communication, we propose Interlat (Inter-agent Latent Space Communication), a paradigm that leverages the continuous last hidden states of an LLM as a representation of its thought for direct communication (termed latent communication). An additional learned compression process further compresses latent communication via latent space reasoning. Experiments demonstrate that Interlat outperforms both fine-tuned chain-of-thought (CoT) prompting and single-agent baselines, even across heterogeneous models, promoting more exploratory behavior and enabling genuine utilization of latent information. Further compression not only substantially accelerates inference by up to 24 times but also maintains competitive performance through an efficient information-preserving mechanism. We position this work as a feasibility study of entirely latent space inter-agent communication, and our results highlight its potential, offering valuable insights for future research.
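The basic plumbing of latent communication, projecting one agent's last hidden states into another agent's embedding space and prepending them as soft inputs, can be sketched as follows. The linear adapter and its placement are assumptions, and Interlat's learned compression module is omitted:

```python
import torch
import torch.nn as nn

class LatentBridge(nn.Module):
    """Minimal sketch of latent-space communication between two agents:
    map agent A's last hidden states into agent B's embedding space and
    prepend them as soft tokens fed via inputs_embeds."""
    def __init__(self, dim_a, dim_b):
        super().__init__()
        self.proj = nn.Linear(dim_a, dim_b)   # trained adapter (assumed)

    def forward(self, hidden_a, embeds_b):
        # hidden_a: (B, T_a, dim_a) last hidden states from agent A
        # embeds_b: (B, T_b, dim_b) agent B's own input embeddings
        msg = self.proj(hidden_a)                 # continuous "message"
        return torch.cat([msg, embeds_b], dim=1)  # B consumes latent prefix
```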
[383] VNU-Bench: A Benchmarking Dataset for Multi-Source Multimodal News Video Understanding
Zibo Liu, Muyang Li, Zhe Jiang, Shigang Chen
Main category: cs.LG
TL;DR: VNU-Bench is the first benchmark for multi-source, cross-video news understanding, testing models’ ability to compare perspectives, align multimodal evidence, and synthesize information across different news sources.
Details
Motivation: Existing benchmarks focus on single-source, intra-video reasoning, but real-world news consumption is inherently multi-sourced with different outlets providing complementary details, distinct narratives, and sometimes conflicting claims that unfold over time. Robust news understanding requires models to compare perspectives across sources.
Method: Introduces VNU-Bench with new question types for multi-source multimodal news understanding. Uses a novel hybrid human-model QA generation process to address scalability and quality control issues. The dataset includes 429 news groups, 1,405 videos, and 2,501 high-quality questions.
Result: Comprehensive evaluation of both closed- and open-source multimodal models shows that VNU-Bench poses substantial challenges for current MLLMs, indicating that existing models struggle with multi-source, cross-video news understanding.
Conclusion: VNU-Bench fills a critical gap in evaluating multimodal models for real-world news understanding by focusing on multi-source, cross-video reasoning, revealing significant limitations in current MLLMs’ ability to handle complex, multi-perspective news narratives.
Abstract: News videos are carefully edited multimodal narratives that combine narration, visuals, and external quotations into coherent storylines. In recent years, there have been significant advances in evaluating multimodal large language models (MLLMs) for news video understanding. However, existing benchmarks largely focus on single-source, intra-video reasoning, where each report is processed in isolation. In contrast, real-world news consumption is inherently multi-sourced: the same event is reported by different outlets with complementary details, distinct narrative choices, and sometimes conflicting claims that unfold over time. Robust news understanding, therefore, requires models to compare perspectives from different sources, align multimodal evidence across sources, and synthesize multi-source information. To fill this gap, we introduce VNU-Bench, the first benchmark for multi-source, cross-video understanding in the news domain. We design a set of new question types that are unique in testing models’ ability to understand multi-source multimodal news from a variety of angles. We design a novel hybrid human-model QA generation process that addresses the issues of scalability and quality control in building a large dataset for cross-source news understanding. The dataset comprises 429 news groups, 1,405 videos, and 2,501 high-quality questions. Comprehensive evaluation of both closed- and open-source multimodal models shows that VNU-Bench poses substantial challenges for current MLLMs.
[384] Soft Contextualized Encoder For User Defined Text Classification
Charu Maheshwari, Vyas Raina
Main category: cs.LG
TL;DR: Soft-contextualized encoder architecture for user-defined text classification achieves SOTA performance on unseen topic sets by contextualizing labels with input queries.
Details
Motivation: Real-world applications like enterprise analytics, content moderation, and domain-specific information retrieval frequently require classifying text to user-specified, previously unseen classes, which existing methods struggle with.
Method: Proposes a soft-contextualized encoder architecture that contextualizes each candidate label with both the label set and a static soft prompt representation of the input query, trained on diverse multi-source datasets.
Result: Achieves state-of-the-art performance across multiple unseen UDTC benchmarks, consistently outperforming or matching baselines on both held-out in-distribution test data and unseen domains.
Conclusion: The proposed architecture effectively generalizes to zero-shot classification over entirely unseen topic sets from arbitrary domains, demonstrating strong performance for real-world user-defined text classification tasks.
Abstract: User-Defined Text Classification (UDTC) considers the challenge of classifying input text to user-specified, previously unseen classes, a setting that arises frequently in real-world applications such as enterprise analytics, content moderation, and domain-specific information retrieval. We propose a soft-contextualized encoder architecture for UDTC which contextualizes each candidate label with the label set and a static soft prompt representation of the input query. Training on diverse, multi-source datasets enables the model to generalize effectively to zero-shot classification over entirely unseen topic sets drawn from arbitrary domains. We evaluate the proposed architecture both on held-out in-distribution test data and on multiple unseen UDTC benchmarks. Across datasets, the model achieves state-of-the-art performance, consistently outperforming or matching the baselines.
[385] An Expectation-Maximization Algorithm for Domain Adaptation in Gaussian Causal Models
Mohammad Ali Javidian
Main category: cs.LG
TL;DR: First-order EM algorithm for imputing systematically missing target variables in shifted domains using Gaussian causal DAGs, with theoretical guarantees and improved accuracy over baselines.
Details
Motivation: Address the problem of imputing target variables that are systematically missing in deployment domains when only source domain data is fully observed, leveraging causal DAG structure to transfer information across domains under distribution shifts.
Method: Unified EM-based framework combining source and target data through DAG structure. Introduces first-order (gradient) EM update replacing costly generalized least-squares M-step with single projected gradient step. Exploits causal DAG to freeze source-invariant mechanisms and re-estimate only shift-affected conditional distributions.
Result: Shows first-order EM operator is locally contractive around true target parameters under standard assumptions, yielding geometric convergence and finite-sample guarantees. Experiments on synthetic 7-node SEM, 64-node MAGIC-IRRI genetic network, and Sachs protein-signaling data demonstrate improved target imputation accuracy over baselines, with largest gains under pronounced domain shift.
Conclusion: Proposed DAG-aware first-order EM algorithm effectively handles covariate shift and local mechanism shifts in Gaussian SEMs, providing scalable solution for target imputation in shifted domains with theoretical guarantees and practical performance improvements.
Abstract: We study the problem of imputing a designated target variable that is systematically missing in a shifted deployment domain, when a Gaussian causal DAG is available from a fully observed source domain. We propose a unified EM-based framework that combines source and target data through the DAG structure to transfer information from observed variables to the missing target. On the methodological side, we formulate a population EM operator in the DAG parameter space and introduce a first-order (gradient) EM update that replaces the costly generalized least-squares M-step with a single projected gradient step. Under standard local strong-concavity and smoothness assumptions and a BWY-style gradient-stability (bounded missing-information) condition (Balakrishnan et al., 2017), we show that this first-order EM operator is locally contractive around the true target parameters, yielding geometric convergence and finite-sample guarantees on parameter error and the induced target-imputation error in Gaussian SEMs under covariate shift and local mechanism shifts. Algorithmically, we exploit the known causal DAG to freeze source-invariant mechanisms and re-estimate only those conditional distributions directly affected by the shift, making the procedure scalable to higher-dimensional models. In experiments on a synthetic seven-node SEM, the 64-node MAGIC-IRRI genetic network, and the Sachs protein-signaling data, the proposed DAG-aware first-order EM algorithm improves target imputation accuracy over a fit-on-source Bayesian network and a Kiiveri-style EM baseline, with the largest gains under pronounced domain shift.
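The first-order M-step is the easiest piece to illustrate: rather than solving the generalized least-squares problem, take one gradient step on the expected complete-data log-likelihood of the target's linear mechanism. A sketch under unit noise variance, with the projection onto the DAG parameter space omitted:

```python
import numpy as np

def first_order_em_step(beta, X_parents, y_expected, lr=0.1):
    """One gradient-EM step for a linear Gaussian mechanism
    y = X_parents @ beta + eps. The E-step quantity
    y_expected = E[y | observed] comes from Gaussian conditioning in
    the DAG (not shown); the gradient below is that of the expected
    complete-data log-likelihood w.r.t. beta (unit noise variance)."""
    grad = X_parents.T @ (y_expected - X_parents @ beta) / len(y_expected)
    return beta + lr * grad
```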
[386] Hybrid Approach for Driver Behavior Analysis with Machine Learning, Feature Optimization, and Explainable AI
Mehedi Hasan Shuvo, Md. Raihan Tapader, Nur Mohammad Tamjid, Sajjadul Islam, Ahnaf Atef Choudhury, Jia Uddin
Main category: cs.LG
TL;DR: A hybrid driver behavior analysis model combining Random Forest with LIME explainable AI achieves 94.2% accuracy while maintaining interpretability.
Details
Motivation: Previous ML/DL approaches for driver behavior analytics suffer from low feature optimization, compromising both performance and interpretability. There's a need for models that balance high accuracy with explainability for road safety applications.
Method: Used Kaggle dataset (12,857×18), applied preprocessing (label encoding, random oversampling, standard scaling), tested 13 ML algorithms, selected Random Forest (95% accuracy), then applied LIME XAI to identify top 10 influential features, and retrained models with feature optimization.
Result: Random Forest achieved 95% accuracy initially, then 94.2% after LIME-based feature optimization - demonstrating efficiency improvement without significant performance sacrifice. The hybrid approach provides both predictive power and explainability.
Conclusion: The proposed hybrid model successfully balances performance and interpretability for driver behavior analysis, offering ROI through improved predictive power and explainability while maintaining high accuracy for road safety applications.
Abstract: Progressive driver behavior analytics is crucial for improving road safety and mitigating the issues caused by aggressive or inattentive driving. Previous studies have employed machine learning and deep learning techniques, which often result in low feature optimization, thereby compromising both high performance and interpretability. To fill these voids, this paper proposes a hybrid approach to driver behavior analysis that uses a 12,857-row and 18-column data set taken from Kaggle. After applying preprocessing techniques such as label encoding, random oversampling, and standard scaling, 13 machine learning algorithms were tested. The Random Forest Classifier achieved an accuracy of 95%. After applying the LIME explainable AI (XAI) technique, the top 10 features with the most significant positive and negative influence on accuracy were identified, and the same algorithms were retrained. The accuracy of the Random Forest Classifier decreased slightly to 94.2%, confirming that the efficiency of the model can be improved without sacrificing performance. This hybrid model can provide a return on investment in terms of the predictive power and explainability of the driver behavior process.
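The LIME-then-retrain loop described here can be sketched with the real `lime` package API. The fitted classifier `rf`, the matrix `X_train`, and `feature_names` are assumed, and note that `as_list()` returns binned feature conditions (e.g. "speed <= 3.2") that must be mapped back to base features before retraining:

```python
import numpy as np
from collections import Counter
from lime.lime_tabular import LimeTabularExplainer

# Assumes: rf = fitted sklearn RandomForestClassifier,
#          X_train = numpy feature matrix, feature_names = list of str.
explainer = LimeTabularExplainer(X_train, feature_names=feature_names,
                                 mode="classification")
votes = Counter()
for row in X_train[:200]:                     # a sample of instances
    exp = explainer.explain_instance(row, rf.predict_proba, num_features=10)
    for feat, weight in exp.as_list():        # (condition, local weight)
        votes[feat] += abs(weight)
top10 = [f for f, _ in votes.most_common(10)] # candidate reduced feature set
```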
[387] From Bits to Chips: An LLM-based Hardware-Aware Quantization Agent for Streamlined Deployment of LLMs
Kaiyuan Deng, Hangyu Zheng, Minghai Qing, Kunxiong Zhu, Gen Li, Yang Xiao, Lan Emily Zhang, Linke Guo, Bo Hui, Yanzhi Wang, Geng Yuan, Gagan Agrawal, Wei Niu, Xiaolong Ma
Main category: cs.LG
TL;DR: HAQA is an automated LLM-powered framework that simplifies model quantization and deployment by automatically finding optimal hardware-aware settings, achieving up to 2.3x speedup while improving accuracy.
Details
Motivation: Deploying large language models is challenging for non-experts due to hardware constraints and the complexity of quantization tuning. Current quantization approaches require specialized expertise and manual effort, making deployment unfriendly to most users.
Method: HAQA uses LLMs to automate the entire quantization and deployment pipeline, including hyperparameter tuning and hardware configuration. It implements adaptive quantization strategies across diverse hardware platforms and can find optimal settings even when they appear counterintuitive.
Result: Achieves up to 2.3x inference speedup, increased throughput, and improved accuracy compared to unoptimized models on Llama. Demonstrates superior adaptability by automatically finding optimal settings across diverse hardware platforms.
Conclusion: HAQA successfully streamlines quantization and deployment, making it accessible to non-experts while simultaneously improving deployment quality and ease of use through LLM-powered automation.
Abstract: Deploying models, especially large language models (LLMs), is becoming increasingly attractive to a broader user base, including those without specialized expertise. However, due to the resource constraints of certain hardware, maintaining high accuracy with larger models while meeting the hardware requirements remains a significant challenge. Model quantization techniques help mitigate memory and compute bottlenecks, yet the added complexities of tuning and deploying quantized models further exacerbate these challenges, making the process unfriendly to most users. We introduce the Hardware-Aware Quantization Agent (HAQA), an automated framework that leverages LLMs to streamline the entire quantization and deployment process by enabling efficient hyperparameter tuning and hardware configuration, thereby simultaneously improving deployment quality and ease of use for a broad range of users. Our results demonstrate up to a 2.3x speedup in inference, along with increased throughput and improved accuracy compared to unoptimized models on Llama. Additionally, HAQA is designed to implement adaptive quantization strategies across diverse hardware platforms, as it automatically finds optimal settings even when they appear counterintuitive, thereby reducing extensive manual effort and demonstrating superior adaptability. Code will be released.
[388] VeRPO: Verifiable Dense Reward Policy Optimization for Code Generation
Longwen Wang, Xuan’er Wu, Xiaohui Hu, Yirui Liu, Yuankai Fan, Kaidong Yu, Qizhen Weng, Wei Xi, Xuelong Li
Main category: cs.LG
TL;DR: VeRPO introduces a novel RL framework for code generation that creates dense rewards from weighted partial success of unit tests, eliminating the need for external reward models while maintaining verifiable execution feedback.
Details
Motivation: Current RL approaches for code generation face challenges with sparse pass/fail rewards and problematic external reward models that suffer from misalignment and high computational costs.
Method: VeRPO constructs dense rewards by dynamically weighting unit tests based on execution statistics during training, combining partial test success with global execution outcomes to create robust, verifiable rewards.
Result: VeRPO consistently outperforms outcome-driven and RM-based baselines, achieving up to +8.83% gain in pass@1 with negligible time cost (<0.02%) and zero GPU memory overhead.
Conclusion: VeRPO provides an effective solution for RL-based code generation by synthesizing dense, verifiable rewards from execution feedback without the drawbacks of external reward models.
Abstract: Effective reward design is a central challenge in Reinforcement Learning (RL) for code generation. Mainstream pass/fail outcome rewards enforce functional correctness via executing unit tests, but the resulting sparsity limits potential performance gains. While recent work has explored external Reward Models (RM) to generate richer, continuous rewards, the learned RMs suffer from reward misalignment and prohibitive computational cost. In this paper, we introduce VeRPO (Verifiable Dense Reward Policy Optimization), a novel RL framework for code generation that synthesizes robust and dense rewards fully grounded in verifiable execution feedback. The core idea of VeRPO is constructing dense rewards from weighted partial success: by dynamically estimating the difficulty weight of each unit test based on the execution statistics during training, a dense reward is derived from the sum of weights of the passed unit tests. To solidify the consistency between partial success and end-to-end functional correctness, VeRPO further integrates the dense signal with global execution outcomes, establishing a robust and dense reward paradigm relying solely on verifiable execution feedback. Extensive experiments across diverse benchmarks and settings demonstrate that VeRPO consistently outperforms outcome-driven and RM-based baselines, achieving up to +8.83% gain in pass@1 with negligible time cost (< 0.02%) and zero GPU memory overhead.
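The reward construction can be illustrated in a few lines: weight each unit test by its observed difficulty, score partial success as the weighted sum of passed tests, and blend with the all-pass outcome. The specific weighting and blend below are illustrative, not the paper's exact formula:

```python
def dense_code_reward(passed, fail_rate, alpha=0.5):
    """passed:    list[bool], per-unit-test pass results for one sample.
    fail_rate: list[float], observed failure rate of each test during
               training (harder tests carry more weight).
    Returns a blend of the weighted partial-success term and the
    global all-tests-pass outcome."""
    total = max(sum(fail_rate), 1e-9)
    weights = [f / total for f in fail_rate]           # normalize to sum 1
    dense = sum(w for w, p in zip(weights, passed) if p)
    outcome = 1.0 if all(passed) else 0.0
    return alpha * outcome + (1 - alpha) * dense
```

Because both terms come from executing tests, the reward stays verifiable, which is the property that lets VeRPO drop the external reward model.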
[389] Green’s-Function Spherical Neural Operators for Biological Heterogeneity
Hao Tang, Hao Chen, Hao Li, Chao Li
Main category: cs.LG
TL;DR: GSNO is a spherical neural operator that uses designable Green’s functions to balance geometric inductive biases with real-world heterogeneity modeling, achieving superior performance on various spherical tasks.
Details
Motivation: Existing spherical deep learning approaches struggle to balance strong spherical geometric inductive biases with the need to model real-world heterogeneity while retaining spherical geometry.
Method: Introduces a designable Green's function framework (DGF) for spherical operator solutions, then proposes GSNO which fuses three operator solutions: Equivariant Solution for symmetry-consistent modeling, Invariant Solution to eliminate nuisance heterogeneity, and Anisotropic Solution to model anisotropic systems like fibers.
Result: GSNO demonstrates superiority on spherical MNIST, Shallow Water Equation, diffusion MRI fiber prediction, cortical parcellation, and molecule structure modeling tasks.
Conclusion: GSNO can adapt to real-world heterogeneous systems with nuisance variability and anisotropy while retaining spectral efficiency and spherical geometry, providing an effective solution for balancing geometric biases with heterogeneity modeling.
Abstract: Spherical deep learning has been widely applied to a broad range of real-world problems. Existing approaches often face challenges in balancing strong spherical geometric inductive biases with the need to model real-world heterogeneity. To solve this while retaining spherical geometry, we first introduce a designable Green’s function framework (DGF) that provides a new spherical operator solution strategy: designing systematic Green’s functions under the rotation group. Based on DGF, to model biological heterogeneity, we propose the Green’s-Function Spherical Neural Operator (GSNO), which fuses three operator solutions: (1) an Equivariant Solution derived from an Equivariant Green’s Function for symmetry-consistent modeling; (2) an Invariant Solution derived from an Invariant Green’s Function to eliminate nuisance heterogeneity, e.g., a consistent background field; (3) an Anisotropic Solution derived from an Anisotropic Green’s Function to model anisotropic systems, especially fibers with a preferred direction. The resulting model, GSNO, can therefore adapt to real-world heterogeneous systems with nuisance variability and anisotropy while retaining spectral efficiency. Evaluations on spherical MNIST, the Shallow Water Equation, diffusion MRI fiber prediction, cortical parcellation, and molecule structure modeling demonstrate the superiority of GSNO.
[390] A Proposed Paradigm for Imputing Missing Multi-Sensor Data in the Healthcare Domain
Vaibhav Gupta, Florian Grensing, Beyza Cinar, Maria Maleshkova
Main category: cs.LG
TL;DR: This paper reviews imputation techniques for handling missing data in continuous glucose monitoring, proposing feature-specific strategies based on temporal patterns and gap durations.
Details
Motivation: Chronic diseases like diabetes require continuous monitoring to predict hypoglycemia events, but wearable sensor data suffers from noise and frequent missing values that hinder effective analysis.
Method: The study conducts comprehensive analysis of existing imputation techniques, evaluates ML/DL methods from other healthcare contexts, and proposes a systematic paradigm with feature-specific imputation strategies tailored to missing data durations.
Result: The review identifies limitations in current datasets, emphasizes temporal characteristics of hypoglycemia features, and demonstrates that different features require different imputation approaches based on their temporal patterns.
Conclusion: Effective hypoglycemia prediction requires investigating temporal dynamics of individual features and implementing multiple, feature-specific imputation techniques to handle heterogeneous temporal patterns in wearable sensor data.
Abstract: Chronic diseases such as diabetes pose significant management challenges, particularly due to the risk of complications like hypoglycemia, which require timely detection and intervention. Continuous health monitoring through wearable sensors offers a promising solution for early prediction of glycemic events. However, effective use of multisensor data is hindered by issues such as signal noise and frequent missing values. This study examines the limitations of existing datasets and emphasizes the temporal characteristics of key features relevant to hypoglycemia prediction. A comprehensive analysis of imputation techniques is conducted, focusing on those employed in state-of-the-art studies. Furthermore, imputation methods derived from machine learning and deep learning applications in other healthcare contexts are evaluated for their potential to address longer gaps in time-series data. Based on this analysis, a systematic paradigm is proposed, wherein imputation strategies are tailored to the nature of specific features and the duration of missing intervals. The review concludes by emphasizing the importance of investigating the temporal dynamics of individual features and the implementation of multiple, feature-specific imputation techniques to effectively address heterogeneous temporal patterns inherent in the data.
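The proposed paradigm, dispatching imputation strategies by gap duration, can be sketched for a single feature as follows. The thresholds and fill rules are illustrative (the review argues they should be feature-specific), and long gaps are deliberately left for an ML/DL imputer:

```python
import pandas as pd

def impute_by_gap(series: pd.Series, short_max=3, medium_max=12):
    """Gap-duration-aware imputation: interpolate short gaps, fill
    medium gaps with the midpoint of the flanking observations, and
    leave long gaps as NaN for a model-based imputer."""
    s = series.copy()
    isna = s.isna()
    gap_id = (isna != isna.shift()).cumsum()       # label consecutive runs
    interp = s.interpolate(limit_area="inside")    # short-gap candidate
    midpoint = s.ffill().add(s.bfill()).div(2)     # medium-gap candidate
    runs = gap_id[isna].groupby(gap_id[isna]).groups
    for _, idx in runs.items():
        if len(idx) <= short_max:
            s.loc[idx] = interp.loc[idx]
        elif len(idx) <= medium_max:
            s.loc[idx] = midpoint.loc[idx]
    return s                                       # long gaps remain NaN
```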
[391] Local Intrinsic Dimensionality of Ground Motion Data for Early Detection of Complex Catastrophic Slope Failure
Yuansan Liu, Antoinette Tordesillas, James Bailey
Main category: cs.LG
TL;DR: The paper introduces stLID, a spatiotemporal Local Intrinsic Dimensionality method that enhances landslide failure detection by incorporating both spatial and temporal information, outperforming existing approaches in precision and lead-time.
Details
Motivation: Early and accurate identification of landslide failure zones is crucial for geohazard mitigation. Existing methods using surface displacement data often fail to capture both spatial correlations and temporal dynamics inherent in landslide monitoring data.
Method: Extends existing sLID (spatial LID) with three key enhancements: (1) Kinematic enhancement by incorporating velocity into sLID computation, (2) Spatial fusion using Bayesian estimation to aggregate sLID values across neighborhoods, (3) Temporal modeling via tLID that learns long-term dynamics from time series. These are integrated into a unified stLID framework.
Result: Extensive experiments show stLID consistently outperforms existing methods in failure detection precision and lead-time, enabling detection of complex landslides including multiple successive failures in distinct areas of the same slope.
Conclusion: The proposed stLID framework effectively addresses limitations of existing approaches by jointly modeling spatial and temporal dependencies, providing a robust method for early landslide failure detection with improved accuracy and timeliness.
Abstract: Local Intrinsic Dimensionality (LID) has shown strong potential for identifying anomalies and outliers in high-dimensional data across a wide range of real-world applications, including landslide failure detection in granular media. Early and accurate identification of failure zones in landslide-prone areas is crucial for effective geohazard mitigation. While existing approaches typically rely on surface displacement data analyzed through statistical or machine learning techniques, they often fall short in capturing both the spatial correlations and temporal dynamics that are inherent in such data. To address this gap, we focus on ground-monitored landslides and introduce a novel approach that jointly incorporates spatial and temporal information, enabling the detection of complex landslides, including multiple successive failures occurring in distinct areas of the same slope. To be specific, our method builds upon an existing LID-based technique, known as sLID. We extend its capabilities in three key ways. (1) Kinematic enhancement: we incorporate velocity into the sLID computation to better capture short-term temporal dependencies and deformation rate relationships. (2) Spatial fusion: we apply Bayesian estimation to aggregate sLID values across spatial neighborhoods, effectively embedding spatial correlations into the LID scores. (3) Temporal modeling: we introduce a temporal variant, tLID, that learns long-term dynamics from time series data, providing a robust temporal representation of displacement behavior. Finally, we integrate both components into a unified framework, referred to as spatiotemporal LID (stLID), to identify samples that are anomalous in either or both dimensions. Extensive experiments show that stLID consistently outperforms existing methods in failure detection precision and lead-time.
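The LID building block underneath sLID/stLID is the standard maximum-likelihood estimator from nearest-neighbor distances (Levina–Bickel). A minimal sketch; in this setting the query point would be a (velocity-augmented) displacement feature vector from one monitoring location:

```python
import numpy as np

def lid_mle(x, neighbors):
    """MLE estimate of local intrinsic dimensionality at point x from
    its k nearest neighbors:
    LID = -( (1/(k-1)) * sum_i log(r_i / r_k) )^{-1},
    where r_1 <= ... <= r_k are neighbor distances."""
    r = np.sort(np.linalg.norm(neighbors - x, axis=1))
    return -1.0 / np.mean(np.log(r[:-1] / r[-1]))
```

An unusually high or fast-changing LID score at a sensor is the kind of anomaly signal the framework then fuses spatially (Bayesian aggregation) and temporally (tLID).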
[392] Variational Inference, Entropy, and Orthogonality: A Unified Theory of Mixture-of-Experts
Ye Su, Yong Liu
Main category: cs.LG
TL;DR: This paper provides the first unified theoretical framework for Mixture-of-Experts (MoE) models, rigorously deriving Top-k routing and load balancing as optimal Bayesian approximations and information-theoretic mechanisms, while identifying routing as an NP-hard problem and proposing orthogonality regularization as the optimal solution.
Details
Motivation: Current MoE models use heuristic mechanisms (Top-k routing and auxiliary load balancing) without theoretical foundation. The paper aims to build a cohesive theoretical framework to understand and improve these practices.
Method: Develops a unified theoretical framework from Bayesian and information-theoretic perspectives, analyzes routing as NP-hard sparse subset selection problem, identifies the “Coherence Barrier” phenomenon, and proposes geometric orthogonality regularization in expert feature space.
Result: Proves that when expert representations have high mutual coherence, greedy routing strategies fail to recover optimal expert subsets. Shows that imposing geometric orthogonality narrows the gap between NP-hard global optimum and polynomial-time greedy approximation. Confirms orthogonality regularization as optimal engineering relaxation for large-scale models.
Conclusion: The work provides essential theoretical support and technical assurance for understanding and designing MoE models, offering a rigorous foundation for existing practices and novel designs through orthogonality regularization.
Abstract: Mixture-of-Experts models enable large language models to scale efficiently, as they only activate a subset of experts for each input. However, their core mechanisms, Top-k routing and auxiliary load balancing, remain heuristic, lacking a cohesive theoretical underpinning. To this end, we build the first unified theoretical framework that rigorously derives these practices as optimal sparse posterior approximation and prior regularization from a Bayesian perspective, while simultaneously framing them as mechanisms to minimize routing ambiguity and maximize channel capacity from an information-theoretic perspective. We also pinpoint the inherent combinatorial hardness of routing, defining it as the NP-hard sparse subset selection problem. We rigorously prove the existence of a “Coherence Barrier”: when expert representations exhibit high mutual coherence, greedy routing strategies theoretically fail to recover the optimal expert subset. Importantly, we formally verify that imposing geometric orthogonality in the expert feature space is sufficient to narrow the divide between the NP-hard global optimum and polynomial-time greedy approximation. Our comparative analyses confirm orthogonality regularization as the optimal engineering relaxation for large-scale models. Our work offers essential theoretical support and technical assurance for a deeper understanding and novel designs of MoE.
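The engineering relaxation the paper endorses reduces, in its simplest form, to a soft orthogonality penalty on expert representations. The exact loss below is an assumption, but it is the standard instantiation of such a regularizer:

```python
import torch

def orthogonality_penalty(expert_feats):
    """Penalize mutual coherence between expert representations:
    || E_n @ E_n.T - I ||_F^2 with row-normalized expert feature
    matrix E_n of shape (n_experts, d). Driving off-diagonal entries
    of the Gram matrix toward zero lowers mutual coherence, which is
    the condition the paper's 'Coherence Barrier' analysis targets."""
    E = torch.nn.functional.normalize(expert_feats, dim=1)
    G = E @ E.T
    I = torch.eye(G.shape[0], device=G.device)
    return ((G - I) ** 2).sum()
```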
[393] Local Gradient Regulation Stabilizes Federated Learning under Client Heterogeneity
Ping Luo, Jiahuan Wang, Ziqing Wen, Tao Sun, Dongsheng Li
Main category: cs.LG
TL;DR: ECGR stabilizes federated learning under data heterogeneity by regulating local gradient dynamics through a swarm intelligence-inspired approach that balances well-aligned and misaligned gradient components.
Details
Motivation: Federated learning faces stability challenges due to statistical heterogeneity in real-world deployments, where client data distributions differ significantly. This heterogeneity causes systematic drift in local gradient dynamics that accumulates across communication rounds and impedes global convergence.
Method: The paper develops a client-side perspective that regulates local gradient contributions without extra communication overhead. It introduces Exploratory–Convergent Gradient Re-aggregation (ECGR), inspired by swarm intelligence, which balances well-aligned and misaligned gradient components to preserve informative updates while suppressing destabilizing effects.
Result: Theoretical analysis and extensive experiments, including evaluations on the LC25000 medical imaging dataset, demonstrate that regulating local gradient dynamics consistently stabilizes federated learning across state-of-the-art methods under heterogeneous data distributions.
Conclusion: Local gradient dynamics serve as a key regulatory lever for stabilizing heterogeneous FL systems, and the proposed ECGR approach effectively addresses stability challenges by balancing gradient components without additional communication overhead.
Abstract: Federated learning (FL) enables collaborative model training across distributed clients without sharing raw data, yet its stability is fundamentally challenged by statistical heterogeneity in realistic deployments. Here, we show that client heterogeneity destabilizes FL primarily by distorting local gradient dynamics during client-side optimization, causing systematic drift that accumulates across communication rounds and impedes global convergence. This observation highlights local gradients as a key regulatory lever for stabilizing heterogeneous FL systems. Building on this insight, we develop a general client-side perspective that regulates local gradient contributions without incurring additional communication overhead. Inspired by swarm intelligence, we instantiate this perspective through Exploratory–Convergent Gradient Re-aggregation (ECGR), which balances well-aligned and misaligned gradient components to preserve informative updates while suppressing destabilizing effects. Theoretical analysis and extensive experiments, including evaluations on the LC25000 medical imaging dataset, demonstrate that regulating local gradient dynamics consistently stabilizes federated learning across state-of-the-art methods under heterogeneous data distributions.
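The gradient regulation idea can be sketched as a vector decomposition: project the local gradient onto a global reference direction (the convergent, well-aligned part) and damp the orthogonal remainder (the exploratory, misaligned part). The fixed recombination weight below is illustrative, not necessarily ECGR's actual rule:

```python
import torch

def regulate_local_gradient(local_grad, global_dir, explore=0.3):
    """local_grad, global_dir: 1-D flattened gradient tensors.
    Splits local_grad into its projection onto the global direction
    and the orthogonal residual, then recombines with a damped
    exploratory component."""
    g_hat = global_dir / (global_dir.norm() + 1e-12)
    convergent = (local_grad @ g_hat) * g_hat    # aligned component
    exploratory = local_grad - convergent        # orthogonal residual
    return convergent + explore * exploratory
```

Nothing here requires extra communication: the reference direction can be derived client-side from the received global model, consistent with the paper's zero-overhead claim.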
[394] ALERT: Zero-shot LLM Jailbreak Detection via Internal Discrepancy Amplification
Xiao Lin, Philip Li, Zhichen Zeng, Tingwei Li, Tianxin Wei, Xuying Ning, Gaotang Li, Yuzhong Chen, Hanghang Tong
Main category: cs.LG
TL;DR: ALERT is a zero-shot jailbreak detection framework that amplifies internal feature discrepancies between benign and malicious prompts to identify safety threats without relying on pre-existing jailbreak templates.
Details
Motivation: Current jailbreak detection methods depend on known attack templates from training data, but real-world scenarios involve constantly evolving zero-day attacks where no templates are available during training.
Method: Proposes a layer-wise, module-wise, and token-wise amplification framework that progressively magnifies internal feature discrepancies. Identifies safety-relevant layers, specific modules encoding discriminative signals, and informative safety tokens. ALERT uses two independent complementary classifiers on amplified representations.
Result: ALERT achieves consistently strong zero-shot detection performance across three safety benchmarks, reliably ranking among top two methods across all datasets and attack strategies, outperforming second-best baseline by 10-40% in Accuracy and F1-score.
Conclusion: The amplification-based approach effectively addresses the challenging zero-shot jailbreak detection problem, providing robust protection against evolving attacks without requiring prior knowledge of jailbreak templates.
Abstract: Despite rich safety alignment strategies, large language models (LLMs) remain highly susceptible to jailbreak attacks, which compromise safety guardrails and pose serious security risks. Existing detection methods mainly detect jailbreak status by relying on jailbreak templates present in the training data. However, few studies address the more realistic and challenging zero-shot jailbreak detection setting, where no jailbreak templates are available during training. This setting better reflects real-world scenarios where new attacks continually emerge and evolve. To address this challenge, we propose a layer-wise, module-wise, and token-wise amplification framework that progressively magnifies internal feature discrepancies between benign and jailbreak prompts. We uncover safety-relevant layers, identify specific modules that inherently encode zero-shot discriminative signals, and localize informative safety tokens. Building upon these insights, we introduce ALERT (Amplification-based Jailbreak Detector), an efficient and effective zero-shot jailbreak detector that introduces two independent yet complementary classifiers on amplified representations. Extensive experiments on three safety benchmarks demonstrate that ALERT achieves consistently strong zero-shot detection performance. Specifically, (i) across all datasets and attack strategies, ALERT reliably ranks among the top two methods, and (ii) it outperforms the second-best baseline by at least 10% in average Accuracy and F1-score, and sometimes by up to 40%.
[395] A Comparative Study of Traditional Machine Learning, Deep Learning, and Large Language Models for Mental Health Forecasting using Smartphone Sensing Data
Kaidong Feng, Zhu Sun, Roy Ka-Wei Lee, Xun Jiang, Yin-Leng Theng, Yi Ding
Main category: cs.LG
TL;DR: This paper presents the first comprehensive benchmarking study comparing traditional ML, deep learning, and LLM approaches for forecasting mental health using smartphone sensing data from college students.
Details
Motivation: While smartphone sensing can track behaviors linked to mental health, most prior work focuses on detecting existing conditions rather than forecasting future mental states. Forecasting enables proactive interventions through Just-in-Time Adaptive Interventions.
Method: Systematic evaluation of traditional ML, deep learning (including Transformer models), and LLM approaches using the College Experience Sensing (CES) dataset. The study examines temporal windows, feature granularities, personalization strategies, and class imbalance handling.
Result: Deep learning models, particularly Transformers (Macro-F1 = 0.58), achieve the best overall performance. LLMs show strength in contextual reasoning but weaker temporal modeling. Personalization substantially improves forecasts of severe mental health states.
Conclusion: This work establishes foundational benchmarks for mental health forecasting and reveals how different modeling approaches interpret behavioral data over time, laying groundwork for next-generation adaptive mental health technologies.
Abstract: Smartphone sensing offers an unobtrusive and scalable way to track daily behaviors linked to mental health, capturing changes in sleep, mobility, and phone use that often precede symptoms of stress, anxiety, or depression. While most prior studies focus on detection that responds to existing conditions, forecasting mental health enables proactive support through Just-in-Time Adaptive Interventions. In this paper, we present the first comprehensive benchmarking study comparing traditional machine learning (ML), deep learning (DL), and large language model (LLM) approaches for mental health forecasting using the College Experience Sensing (CES) dataset, the most extensive longitudinal dataset of college student mental health to date. We systematically evaluate models across temporal windows, feature granularities, personalization strategies, and class imbalance handling. Our results show that DL models, particularly Transformer (Macro-F1 = 0.58), achieve the best overall performance, while LLMs show strength in contextual reasoning but weaker temporal modeling. Personalization substantially improves forecasts of severe mental health states. By revealing how different modeling approaches interpret phone sensing behavioral data over time, this work lays the groundwork for next-generation, adaptive, and human-centered mental health technologies that can advance both research and real-world well-being.
[396] Policy-Guided Search on Tree-of-Thoughts for Efficient Problem Solving with Bounded Language Model Queries
Sumedh Pendurkar, Guni Sharon
Main category: cs.LG
TL;DR: Levin Tree Search (LTS) adapted to Tree-of-Thoughts framework reduces LM inference costs while maintaining or improving problem-solving accuracy under computational constraints.
Details
Motivation: Existing Tree-of-Thoughts search algorithms ignore the high computational costs of LM inference, making them impractical for resource-constrained applications. There's a need to improve LM problem-solving performance under limited computational budgets.
Method: Adapt Levin Tree Search (LTS) to the ToT framework, using LM probabilities as heuristics to guide tree exploration efficiently. The method leverages LMs as policies and provides theoretical bounds on state expansions.
Result: LTS achieves comparable or higher accuracy than baseline search algorithms under fixed LM query budgets across three domains (Blocksworld, PrOntoQA, Array Sorting) and four distinct LMs, while reducing thought evaluations.
Conclusion: LTS enables cost-effective and time-efficient problem-solving in ToT frameworks, making it suitable for latency-critical and resource-constrained applications by efficiently leveraging LM probabilities as search heuristics.
Abstract: Recent studies have explored integrating state-space search algorithms with Language Models (LMs) to perform look-ahead on the token generation process, the “Tree-of-Thoughts” (ToT) generated by LMs, thereby improving performance on problem-solving tasks. However, the affiliated search algorithms often overlook the significant computational costs associated with LM inference, particularly in scenarios with constrained computational budgets. Consequently, we address the problem of improving LM performance on problem-solving tasks under limited computational budgets. We demonstrate how the probabilities assigned to thoughts by LMs can serve as a heuristic to guide search within the ToT framework, thereby reducing the number of thought evaluations. Building on this insight, we adapt a heuristic search algorithm, Levin Tree Search (LTS), to the ToT framework, which leverages LMs as policies to guide the tree exploration efficiently. We extend the theoretical results of LTS by showing that, for ToT (a pruned tree), LTS guarantees a bound on the number of states expanded, and consequently, on the number of thoughts generated. Additionally, we analyze the sensitivity of this bound to the temperature values commonly used in the final softmax layer of the LM. Empirical evaluation under a fixed LM query budget demonstrates that LTS consistently achieves comparable or higher accuracy than baseline search algorithms within the ToT framework, across three domains (Blocksworld, PrOntoQA, Array Sorting) and four distinct LMs. These findings highlight the efficacy of LTS on ToT, particularly in enabling cost-effective and time-efficient problem-solving, making it well-suited for latency-critical and resource-constrained applications.
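Levin Tree Search orders node expansions by depth divided by path probability, which is what lets the LM's token probabilities act as the search heuristic. A self-contained sketch under that standard formulation follows; `expand`, `is_goal`, and the budget handling are illustrative names, not the paper's interface.

```python
import heapq
import itertools
import math

def levin_tree_search(root, expand, is_goal, budget=100):
    """Best-first search with the LTS priority: depth / path-probability.
    `expand(state)` yields (child_state, probability) pairs, with the
    probabilities taken from the LM policy. Illustrative sketch, not the
    paper's implementation.
    """
    tie = itertools.count()  # tiebreaker so states themselves never compare
    # store log-cost = log(depth + 1) - log path-probability (order-preserving)
    frontier = [(0.0, next(tie), 0, 0.0, root)]
    expansions = 0
    while frontier and expansions < budget:
        _, _, depth, logp, state = heapq.heappop(frontier)
        if is_goal(state):
            return state, expansions
        expansions += 1
        for child, p in expand(state):
            child_logp = logp + math.log(max(p, 1e-12))
            cost = math.log(depth + 2) - child_logp
            heapq.heappush(frontier, (cost, next(tie), depth + 1,
                                      child_logp, child))
    return None, expansions
```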
[397] Learning Shortest Paths When Data is Scarce
Dmytro Matsypura, Yu Pan, Hanzhao Wang
Main category: cs.LG
TL;DR: The paper proposes a method to calibrate biased simulator outputs for routing decisions using limited real-world measurements and edge similarity structure, with theoretical guarantees and active learning for cold-start settings.
Details
Motivation: Digital twins and simulators are increasingly used for routing decisions but often exhibit systematic bias, while ground-truth measurements are costly and scarce. There's a need to bridge the simulator-to-reality gap efficiently.
Method: Model simulator-to-reality discrepancy as unknown edge-specific bias varying smoothly over the similarity graph. Use Laplacian-regularized least squares to estimate bias. Develop finite-sample error bounds, path-level suboptimality guarantees, and data-driven certificates. For cold-start settings, propose bias-aware active learning algorithm that adaptively selects edges to measure.
Result: The approach yields calibrated edge cost estimates even in data-scarce regimes. Numerical experiments on multiple road networks and traffic graphs demonstrate effectiveness. Theoretical guarantees include finite-sample error bounds and path-level suboptimality guarantees.
Conclusion: The proposed framework effectively leverages abundant synthetic data and limited real measurements to calibrate simulators for routing decisions, with theoretical guarantees and practical algorithms for both data-scarce and cold-start scenarios.
Abstract: Digital twins and other simulators are increasingly used to support routing decisions in large-scale networks. However, simulator outputs often exhibit systematic bias, while ground-truth measurements are costly and scarce. We study a stochastic shortest-path problem in which a planner has access to abundant synthetic samples, limited real-world observations, and an edge-similarity structure capturing expected behavioral similarity across links. We model the simulator-to-reality discrepancy as an unknown, edge-specific bias that varies smoothly over the similarity graph, and estimate it using Laplacian-regularized least squares. This approach yields calibrated edge cost estimates even in data-scarce regimes. We establish finite-sample error bounds, translate estimation error into path-level suboptimality guarantees, and propose a computable, data-driven certificate that verifies near-optimality of a candidate route. For cold-start settings without initial real data, we develop a bias-aware active learning algorithm that leverages the simulator and adaptively selects edges to measure until a prescribed accuracy is met. Numerical experiments on multiple road networks and traffic graphs further demonstrate the effectiveness of our methods.
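The bias estimator described above has a standard closed form: with a diagonal mask over measured edges and the similarity-graph Laplacian, the regularized least-squares solution comes from one linear solve. A minimal numpy sketch with hypothetical argument names, assuming the similarity graph is connected (otherwise a small ridge term should be added):

```python
import numpy as np

def estimate_edge_bias(sim_costs, real_obs, measured, L, lam=1.0):
    """Laplacian-regularized least squares for edge-specific simulator bias.

    sim_costs: (E,) simulator edge costs
    real_obs:  (E,) real measurements (entries outside `measured` ignored)
    measured:  boolean (E,) mask of edges with ground-truth data
    L:         (E, E) Laplacian of the edge-similarity graph

    Solves  min_b  sum_{e measured} (real_e - sim_e - b_e)^2 + lam * b^T L b,
    whose normal equations are (M + lam * L) b = M r. Illustrative sketch.
    """
    M = np.diag(measured.astype(float))            # selects measured edges
    r = np.where(measured, real_obs - sim_costs, 0.0)
    b = np.linalg.solve(M + lam * L, r)            # one linear solve
    return sim_costs + b                           # calibrated edge costs
```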
[398] Kantorovich-Type Stochastic Neural Network Operators for the Mean-Square Approximation of Certain Second-Order Stochastic Processes
Sachin Saini, Uaday Singh
Main category: cs.LG
TL;DR: A new Kantorovich-type Stochastic Neural Network Operator (K-SNNO) is proposed that incorporates randomness through stochastic neurons driven by stochastic integrators, enabling approximation of stochastic processes with proven mean-square convergence and error estimates.
Details
Motivation: While artificial neural network operators are widely used for approximating deterministic functions, their extension to random dynamics remains relatively unexplored. There is a need for neural network operators that can effectively model and approximate stochastic signals while inheriting the probabilistic structure of underlying processes.
Method: The authors construct K-SNNOs where randomness is incorporated through stochastic neurons driven by stochastic integrators (not at coefficient level). This framework allows the operator to inherit the probabilistic structure of the underlying stochastic process. They establish mean-square convergence and derive quantitative error estimates using modulus of continuity.
Result: Theoretical results show mean-square convergence of K-SNNOs to target stochastic processes with quantitative error estimates. Numerical simulations validate accurate reconstruction of sample paths and rapid decay of mean square error. Graphical results demonstrate robustness and effectiveness of the stochastic-neuron-based operator.
Conclusion: The proposed K-SNNO framework successfully extends neural network operators to stochastic dynamics, providing a theoretically sound and practically effective approach for approximating stochastic signals with proven convergence properties and validated numerical performance.
Abstract: Artificial neural network operators (ANNOs) have been widely used for approximating deterministic input-output functions; however, their extension to random dynamics remains comparatively unexplored. In this paper, we construct a new class of Kantorovich-type Stochastic Neural Network Operators (K-SNNOs) in which randomness is incorporated not at the coefficient level, but through stochastic neurons driven by stochastic integrators. This framework enables the operator to inherit the probabilistic structure of the underlying process, making it suitable for modeling and approximating stochastic signals. We establish mean-square convergence of K-SNNOs to the target stochastic process and derive quantitative error estimates expressing the rate of approximation in terms of the modulus of continuity. Numerical simulations further validate the theoretical results by demonstrating accurate reconstruction of sample paths and rapid decay of the mean square error (MSE). Graphical results, including sample-wise approximations and empirical MSE behaviour, illustrate the robustness and effectiveness of the proposed stochastic-neuron-based operator.
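As a point of reference for the construction, the sketch below applies a deterministic Kantorovich-type operator (cell averages of a sample path weighted by a kernel forming a partition of unity) to a Brownian-like path. The hat-function kernel stands in for the paper's sigmoidal stochastic neurons, so this is only an illustrative analogue of the operator family being extended.

```python
import numpy as np

def kantorovich_operator(path, t_grid, n):
    """Kantorovich-type operator on [0, 1]: average of the path over each
    cell [k/n, (k+1)/n], weighted by a hat kernel (partition of unity).
    Deterministic analogue only; the paper's operator uses sigmoidal
    stochastic neurons driven by stochastic integrators.
    """
    def phi(x):                              # hat kernel, sum_k phi(x - k) = 1
        return np.maximum(0.0, 1.0 - np.abs(x))

    out = np.zeros_like(t_grid)
    for k in range(n):
        mask = (t_grid >= k / n) & (t_grid <= (k + 1) / n)
        cell_avg = path[mask].mean()         # ~ n * integral of path over cell
        out += cell_avg * phi(n * t_grid - k)
    return out

# toy usage: mean-square error against a Brownian-like sample path
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 1001)
bm = np.cumsum(rng.normal(scale=np.sqrt(1.0 / t.size), size=t.size))
mse = np.mean((bm - kantorovich_operator(bm, t, n=50)) ** 2)
```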
[399] ReLA: Representation Learning and Aggregation for Job Scheduling with Reinforcement Learning
Zhengyi Kwan, Zhang Wei, Aik Beng Ng, Zhengkui Wang, Simon See
Main category: cs.LG
TL;DR: ReLA is a reinforcement learning scheduler using structured representation learning and aggregation for job scheduling, achieving state-of-the-art performance across different problem scales.
Details
Motivation: Existing job scheduling solutions suffer from long running times or poor schedule quality, especially as problem scale increases, creating a need for more efficient and effective scheduling approaches.
Method: ReLA uses structured representation learning with two intra-entity modules (self-attention and convolution) and one inter-entity module (cross-attention) in a multi-scale architecture, aggregating outputs to support RL decision-making.
Result: ReLA achieves best makespan in most settings, reducing optimality gap by 13.0% on non-large instances and 78.6% on large-scale instances, with average gaps lowered to 7.3% and 2.1% respectively.
Conclusion: ReLA’s learned representations and aggregation provide strong decision support for RL scheduling, enabling fast job completion and decision-making suitable for real-world manufacturing applications.
Abstract: Job scheduling is widely used in real-world manufacturing systems to assign ordered job operations to machines under various constraints. Existing solutions remain limited by long running time or insufficient schedule quality, especially when problem scale increases. In this paper, we propose ReLA, a reinforcement-learning (RL) scheduler built on structured representation learning and aggregation. ReLA first learns diverse representations from scheduling entities, including job operations and machines, using two intra-entity learning modules with self-attention and convolution and one inter-entity learning module with cross-attention. These modules are applied in a multi-scale architecture, and their outputs are aggregated to support RL decision-making. Across experiments on small, medium, and large job instances, ReLA achieves the best makespan in most tested settings over the latest solutions. On non-large instances, ReLA reduces the optimality gap of the SOTA baseline by 13.0%, while on large-scale instances it reduces the gap by 78.6%, with the average optimality gaps lowered to 7.3% and 2.1%, respectively. These results confirm that ReLA’s learned representations and aggregation provide strong decision support for RL scheduling, and enable fast job completion and decision-making for real-world applications.
[400] Quantum Classical Ridgelet Neural Network For Time Series Model
Bahadur Yadav, Sanjay Kumar Mohanty
Main category: cs.LG
TL;DR: Quantum computing method combines Ridgelet transforms with single-qubit quantum processing for improved time series forecasting, showing superior performance on financial data.
Details
Motivation: To enhance feature extraction and forecasting capabilities in time series analysis by integrating quantum computing with Ridgelet neural networks.
Method: Integrates Ridgelet neural network with single-qubit quantum computing method in quantum processing pipelines for time series data.
Result: Experimental results using financial time series data demonstrate superior performance compared to existing models.
Conclusion: The quantum computing method with Ridgelet transforms effectively improves time series forecasting capabilities, particularly for financial data.
Abstract: In this study, we present a quantum computing method that incorporates ridgelet transforms into the quantum processing pipelines for time series data. Here, the Ridgelet neural network is integrated with a single-qubit quantum computing method, which improves feature extraction and forecasting capabilities. Furthermore, experimental results using financial time series data demonstrate the superior performance of our model compared to existing models.
[401] In Search of Grandmother Cells: Tracing Interpretable Neurons in Tabular Representations
Ricardo Knauer, Erik Rodner
Main category: cs.LG
TL;DR: The paper proposes information-theoretic measures to identify grandmother cell-like neurons in foundation models, finding evidence of moderately selective neurons for high-level concepts in TabPFN.
Details
Motivation: Foundation models are powerful but opaque, and there's ongoing interest in whether some neurons behave like "grandmother cells" - inherently interpretable neurons that respond exclusively to single concepts. The authors aim to quantify and identify such interpretable neurons in foundation models.
Method: Proposed two information-theoretic measures to quantify neuronal saliency and selectivity for single concepts. Applied these metrics to TabPFN (a tabular foundation model) and performed a simple search across neuron-concept pairs to find the most salient and selective pair.
Result: Found first evidence that some neurons in such models show moderate, statistically significant saliency and selectivity for high-level concepts. This suggests interpretable neurons can emerge naturally in foundation models.
Conclusion: Interpretable neurons can emerge naturally in foundation models and can sometimes be identified without complex interpretability techniques, providing a simpler approach to understanding model decision-making.
Abstract: Foundation models are powerful yet often opaque in their decision-making. A topic of continued interest in both neuroscience and artificial intelligence is whether some neurons behave like grandmother cells, i.e., neurons that are inherently interpretable because they exclusively respond to single concepts. In this work, we propose two information-theoretic measures that quantify the neuronal saliency and selectivity for single concepts. We apply these metrics to the representations of TabPFN, a tabular foundation model, and perform a simple search across neuron-concept pairs to find the most salient and selective pair. Our analysis provides the first evidence that some neurons in such models show moderate, statistically significant saliency and selectivity for high-level concepts. These findings suggest that interpretable neurons can emerge naturally and that they can, in some cases, be identified without resorting to more complex interpretability techniques.
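The paper's two measures are not spelled out in the summary; one plausible instantiation of saliency and selectivity uses discrete mutual information between a binned neuron activation and concept labels, as sketched below. The binning choice and the selectivity margin are assumptions, not the authors' definitions.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def neuron_concept_mi(activations, concepts, bins=16):
    """Mutual information between one neuron's binned activation and a
    binary concept label: one plausible saliency score (assumption)."""
    edges = np.histogram_bin_edges(activations, bins)
    return mutual_info_score(np.digitize(activations, edges), concepts)

def selectivity(activations, concept_matrix, target):
    """Saliency for the target concept minus the best saliency over all
    other concepts: high values suggest grandmother-cell-like behavior.
    concept_matrix: (n_samples, n_concepts) binary labels."""
    scores = [neuron_concept_mi(activations, concept_matrix[:, j])
              for j in range(concept_matrix.shape[1])]
    others = [s for j, s in enumerate(scores) if j != target]
    return scores[target] - max(others)
```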
[402] Group and Exclusive Sparse Regularization-based Continual Learning of CNNs
Basile Tousside, Janis Mohr, Jörg Frochte
Main category: cs.LG
TL;DR: GESCL is a regularization-based continual learning method that prevents catastrophic forgetting in fixed-capacity CNNs using stability and plasticity regularization terms without network expansion or data memorization.
Details
Motivation: To address catastrophic forgetting in continual learning for fixed-capacity CNNs without resorting to network expansion or storing past task data, which are computationally expensive approaches.
Method: Uses two regularization terms: stability regularization prevents important filters from deviating too much when learning new tasks, while plasticity regularization leverages CNN over-parameterization to sparsify the network and tune unimportant filters for future tasks.
Result: Significant improvements over state-of-the-art methods on popular CL vision benchmarks in terms of overall classification accuracy and avoiding catastrophic forgetting, with reduced parameters and computation.
Conclusion: GESCL effectively balances stability and plasticity in fixed-capacity CNNs for continual learning, achieving strong performance without network expansion or data memorization overhead.
Abstract: We present a regularization-based approach for continual learning (CL) of fixed-capacity convolutional neural networks (CNN) that does not suffer from the problem of catastrophic forgetting when learning multiple tasks sequentially. This method, referred to as Group and Exclusive Sparsity based Continual Learning (GESCL), avoids forgetting of previous tasks by ensuring the stability of the CNN via a stability regularization term, which prevents filters detected as important for past tasks from deviating too much when learning a new task. On top of that, GESCL makes the network plastic via a plasticity regularization term that leverages the over-parameterization of CNNs to efficiently sparsify the network and tunes unimportant filters, making them relevant for future tasks. In doing so, GESCL deals with significantly fewer parameters and less computation compared to CL approaches that either dynamically expand the network or memorize past tasks’ data. Experiments on popular CL vision benchmarks show that GESCL leads to significant improvements over state-of-the-art methods in terms of overall CL performance, as measured by classification accuracy, as well as in avoiding catastrophic forgetting.
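Group and exclusive sparsity penalties have standard forms, so a rough sketch of the two regularizers plus a drift penalty on important filters can illustrate the stability/plasticity split. GESCL's exact formulation and its filter-importance detection are more involved than this; all names and weights below are illustrative.

```python
import torch

def group_exclusive_penalty(conv_weight, important_mask,
                            lam_g=1e-4, lam_e=1e-4):
    """Group (L2,1) sparsity on filters not marked important, which frees
    capacity for future tasks, plus an exclusive (L1,2)^2 term encouraging
    competition within filters. Sketch only.

    conv_weight:    (out_channels, in_channels, kH, kW)
    important_mask: bool tensor (out_channels,), filters tied to past tasks
    """
    flat = conv_weight.flatten(1)                     # one row per filter
    group = flat[~important_mask].norm(dim=1).sum()   # group-sparsity term
    exclusive = (flat.abs().sum(dim=1) ** 2).sum()    # exclusive-sparsity term
    return lam_g * group + lam_e * exclusive

def stability_penalty(conv_weight, old_weight, important_mask, lam_s=1e-3):
    """Penalize drift of filters flagged important for earlier tasks."""
    drift = (conv_weight - old_weight).flatten(1)[important_mask]
    return lam_s * drift.pow(2).sum()
```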
[403] AMIR-GRPO: Inducing Implicit Preference Signals into GRPO
Amir Hossein Yari, Fajri Koto
Main category: cs.LG
TL;DR: AMIR-GRPO improves GRPO for LLM reasoning tasks by adding a DPO-style contrastive regularizer using intra-group reward rankings, addressing length bias and better utilizing rollout supervision.
Details
Motivation: GRPO has structural limitations in reasoning-heavy settings: sequence-level advantage normalization causes length bias, penalties for low-quality trajectories are diluted, and scalar objectives discard rich pairwise preference information from reward rankings, leading to underutilization of costly rollout supervision.
Method: AMIR-GRPO augments GRPO with an implicit DPO-style contrastive regularizer constructed directly from intra-group reward rankings, requiring no additional annotations. This mechanism amplifies suppression of low-reward trajectories, attenuates response-level length bias, and transforms each rollout group into denser supervision constraints.
Result: Across multiple mathematical reasoning benchmarks, AMIR-GRPO consistently outperforms strong GRPO baselines, yields clearer separation between correct and incorrect reasoning chains, and delivers broader coverage gains beyond instances solved by standard GRPO.
Conclusion: AMIR-GRPO effectively addresses GRPO’s limitations in reasoning tasks by better utilizing intra-group reward rankings through contrastive regularization, improving performance and supervision efficiency without additional annotation costs.
Abstract: Reinforcement learning has become the primary paradigm for aligning large language models (LLMs) on complex reasoning tasks, with group relative policy optimization (GRPO) widely used in large-scale post-training. However, GRPO faces structural limitations in reasoning-heavy settings: sequence-level advantage normalization introduces systematic length bias, penalties for low-quality trajectories are diluted, and the scalar objective discards rich pairwise preference information embedded in within-group reward rankings. As a result, valuable supervision from costly rollouts remains underutilized. We propose AMIR-GRPO, which augments GRPO with an implicit DPO-style contrastive regularizer constructed directly from intra-group reward rankings, requiring no additional annotations. This mechanism amplifies suppression of low-reward trajectories, attenuates response-level length bias, and transforms each rollout group into a denser set of supervision constraints. Across multiple mathematical reasoning benchmarks, AMIR-GRPO consistently outperforms strong GRPO baselines, yields clearer separation between correct and incorrect reasoning chains, and delivers broader coverage gains beyond the subset of instances solved by standard GRPO.
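A DPO-style regularizer over intra-group rankings can be pictured as a pairwise logistic loss over every (higher-reward, lower-reward) rollout pair, using policy-versus-reference log-probability margins. The sketch below is a hypothetical reading of that mechanism, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def implicit_dpo_regularizer(logps, ref_logps, rewards, beta=0.1):
    """Pairwise DPO-style loss from intra-group reward rankings: each pair
    with rewards[i] > rewards[j] contributes
    -log sigmoid(beta * (margin_i - margin_j)), where margin is the
    policy-vs-reference sequence log-prob difference. Hypothetical sketch.

    logps, ref_logps, rewards: (G,) tensors for one rollout group.
    """
    margin = logps - ref_logps                           # implicit reward
    pref = rewards.unsqueeze(1) > rewards.unsqueeze(0)   # (G, G) pair mask
    diff = margin.unsqueeze(1) - margin.unsqueeze(0)     # margin_w - margin_l
    losses = -F.logsigmoid(beta * diff)[pref]
    # keep a zero with a gradient path when a group has no ranked pairs
    return losses.mean() if losses.numel() > 0 else logps.sum() * 0.0
```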
[404] Stochastic Voronoi Ensembles for Anomaly Detection
Yang Cao
Main category: cs.LG
TL;DR: SVEAD is a linear-time anomaly detection method using random Voronoi ensembles to handle varying local densities without parameter tuning.
Details
Motivation: Existing anomaly detection methods struggle with datasets having varying local densities - distance-based methods miss local anomalies, while density-based methods require careful parameter selection and have quadratic time complexity.
Method: SVEAD constructs ensemble random Voronoi diagrams and scores points using normalized cell-relative distances weighted by local scale, decomposing data space into restricted regions to identify local anomalies.
Result: SVEAD achieves linear time complexity and constant space complexity, and outperforms 12 state-of-the-art approaches on 45 datasets.
Conclusion: The geometric insight of decomposing data space into restricted regions enables effective local anomaly detection with computational efficiency, making SVEAD a practical solution for real-world applications.
Abstract: Anomaly detection aims to identify data instances that deviate significantly from the majority of the data, and has been widely used in fraud detection, network security, and industrial quality control. Existing methods struggle with datasets exhibiting varying local densities: distance-based methods miss local anomalies, while density-based approaches require careful parameter selection and incur quadratic time complexity. We observe that local anomalies, though indistinguishable under global analysis, become conspicuous when the data space is decomposed into restricted regions and each region is examined independently. Leveraging this geometric insight, we propose SVEAD (Stochastic Voronoi Ensembles Anomaly Detector), which constructs ensemble random Voronoi diagrams and scores points by normalized cell-relative distances weighted by local scale. The proposed method achieves linear time complexity and constant space complexity. Experiments on 45 datasets demonstrate that SVEAD outperforms 12 state-of-the-art approaches.
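The scoring rule lends itself to a compact sketch: repeatedly draw random Voronoi centers from the data, assign each point to its nearest center, and normalize the point's distance by its cell's mean distance. Parameter names and the choice of local scale below are simplifying assumptions, not SVEAD's exact recipe.

```python
import numpy as np

def voronoi_ensemble_scores(X, n_ensembles=100, n_cells=16, seed=0):
    """Average, over random Voronoi partitions, of each point's distance
    to its cell center divided by the cell's mean distance (local scale).
    Higher scores suggest anomalies. Simplified sketch of the idea."""
    rng = np.random.default_rng(seed)
    n = len(X)
    scores = np.zeros(n)
    for _ in range(n_ensembles):
        centers = X[rng.choice(n, size=n_cells, replace=False)]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        cell = d.argmin(axis=1)                  # Voronoi cell per point
        dist = d[np.arange(n), cell]
        for c in range(n_cells):
            in_cell = cell == c
            scale = dist[in_cell].mean() + 1e-12   # local scale of the cell
            scores[in_cell] += dist[in_cell] / scale
    return scores / n_ensembles

# toy usage: one global outlier appended to a Gaussian blob
X = np.vstack([np.random.default_rng(1).normal(size=(200, 2)),
               np.array([[6.0, 6.0]])])
s = voronoi_ensemble_scores(X)
```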
[405] Disentangling Aleatoric and Epistemic Uncertainty in Physics-Informed Neural Networks. Application to Insulation Material Degradation Prognostics
Ibai Ramirez, Jokin Alcibar, Joel Pino, Mikel Sanz, Jose I. Aizpurua
Main category: cs.LG
TL;DR: A Bayesian Physics-Informed Neural Network (B-PINN) framework is proposed for transformer insulation ageing estimation, jointly modeling epistemic and aleatoric uncertainty to provide full predictive posteriors, outperforming existing PINN variants in accuracy and uncertainty calibration.
Details
Motivation: Current Physics-Informed Neural Networks (PINNs) in Prognostics and Health Management (PHM) have limited uncertainty quantification capabilities, being mostly deterministic or only accounting for epistemic uncertainty, which restricts their suitability for risk-aware decision-making in transformer asset management.
Method: Developed a heteroscedastic Bayesian Physics-Informed Neural Network (B-PINN) framework that integrates Bayesian Neural Networks (BNNs) with physics-based residual enforcement and prior distributions, enabling probabilistic inference within a physics-informed learning architecture for spatiotemporal insulation material ageing estimation.
Result: The proposed B-PINN provides improved predictive accuracy and better-calibrated uncertainty estimates compared to deterministic PINNs, dropout-based PINNs (d-PINNs), and alternative B-PINN variants, as validated with finite-element thermal models and field measurements from a solar power plant.
Conclusion: Bayesian physics-informed learning shows strong potential for uncertainty-aware prognostics and informed decision-making in transformer asset management, with systematic sensitivity studies revealing the impact of boundary conditions, initial conditions, and residual sampling strategies on model performance.
Abstract: Physics-Informed Neural Networks (PINNs) provide a framework for integrating physical laws with data. However, their application to Prognostics and Health Management (PHM) remains constrained by limited uncertainty quantification (UQ) capabilities. Most existing PINN-based prognostics approaches are deterministic or account only for epistemic uncertainty, limiting their suitability for risk-aware decision-making. This work introduces a heteroscedastic Bayesian Physics-Informed Neural Network (B-PINN) framework that jointly models epistemic and aleatoric uncertainty, yielding full predictive posteriors for spatiotemporal insulation material ageing estimation. The approach integrates Bayesian Neural Networks (BNNs) with physics-based residual enforcement and prior distributions, enabling probabilistic inference within a physics-informed learning architecture. The framework is evaluated on a transformer insulation ageing application, validated with a finite-element thermal model and field measurements from a solar power plant, and benchmarked against deterministic PINNs, dropout-based PINNs (d-PINNs), and alternative B-PINN variants. Results show that the proposed B-PINN provides improved predictive accuracy and better-calibrated uncertainty estimates than competing approaches. A systematic sensitivity study further analyzes the impact of boundary-condition, initial-condition, and residual sampling strategies on accuracy, calibration, and generalization. Overall, the findings highlight the potential of Bayesian physics-informed learning to support uncertainty-aware prognostics and informed decision-making in transformer asset management.
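The aleatoric half of the story, a heteroscedastic Gaussian likelihood combined with a physics-residual penalty, is standard and easy to sketch; the epistemic half (Bayesian weights with priors) sits on top of it and is omitted here. Names and the weighting are illustrative.

```python
import torch

def heteroscedastic_pinn_loss(mu, log_var, y_obs, physics_residual, lam=1.0):
    """Heteroscedastic data term (per-point aleatoric variance) plus a
    physics-residual penalty at collocation points. Sketch of the loss
    shape only; the B-PINN additionally places distributions over the
    network weights for epistemic uncertainty.
    """
    nll = 0.5 * ((y_obs - mu) ** 2 / log_var.exp() + log_var).mean()
    phys = physics_residual.pow(2).mean()    # PDE residual enforcement
    return nll + lam * phys
```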
[406] Rethinking Recurrent Neural Networks for Time Series Forecasting: A Reinforced Recurrent Encoder with Prediction-Oriented Proximal Policy Optimization
Xin Lai, Shiming Deng, Lu Yu, Yumin Lai, Shenghao Qiao, Xinze Zhang
Main category: cs.LG
TL;DR: RRE-PPO4Pred: A reinforced recurrent encoder with prediction-oriented PPO that improves RNN-based time series forecasting by treating RNN adaptation as a Markov Decision Process for better feature selection and temporal modeling.
Details
Motivation: Conventional RNN-based time series predictors treat all time steps and hidden states equally without considering their distinct contributions to forecasting, leading to suboptimal performance. There's a need for better adaptation mechanisms in RNNs for time series modeling.
Method: Proposes RRE-PPO4Pred with three innovations: 1) Reinforced Recurrent Encoder framework treating RNN adaptation as Markov Decision Process for feature selection, skip connections, and output selection; 2) Improved PPO4Pred algorithm with Transformer-based agent and dynamic transition sampling; 3) Co-evolutionary optimization paradigm for joint learning of RNN predictor and policy agent.
Result: Comprehensive evaluations on five real-world datasets show the method consistently outperforms existing baselines and achieves better accuracy than state-of-the-art Transformer models.
Conclusion: RRE-PPO4Pred provides an advanced time series predictor for engineering informatics by enhancing RNNs through reinforcement learning, achieving superior forecasting accuracy through adaptive and interactive time series modeling.
Abstract: Time series forecasting plays a crucial role in contemporary engineering information systems for supporting decision-making across various industries, where Recurrent Neural Networks (RNNs) have been widely adopted due to their capability in modeling sequential data. Conventional RNN-based predictors adopt an encoder-only strategy with sliding historical windows as inputs to forecast future values. However, this approach treats all time steps and hidden states equally without considering their distinct contributions to forecasting, leading to suboptimal performance. To address this limitation, we propose a novel Reinforced Recurrent Encoder with Prediction-oriented Proximal Policy Optimization, RRE-PPO4Pred, which significantly improves the time series modeling capacity and forecasting accuracy of RNN models. The core innovations of this method are: (1) A novel Reinforced Recurrent Encoder (RRE) framework that enhances RNNs by formulating their internal adaptation as a Markov Decision Process, creating a unified decision environment capable of learning input feature selection, hidden skip connection, and output target selection; (2) An improved Prediction-oriented Proximal Policy Optimization algorithm, termed PPO4Pred, which is equipped with a Transformer-based agent for temporal reasoning and develops a dynamic transition sampling strategy to enhance sampling efficiency; (3) A co-evolutionary optimization paradigm to facilitate the learning of the RNN predictor and the policy agent, providing adaptive and interactive time series modeling. Comprehensive evaluations on five real-world datasets indicate that our method consistently outperforms existing baselines and attains better accuracy than state-of-the-art Transformer models, thus providing an advanced time series predictor for engineering informatics.
[407] A Pre-trained Reaction Embedding Descriptor Capturing Bond Transformation Patterns
Weiqi Liu, Fenglei Cao, Yuan Qi, Li-Cheng Xu
Main category: cs.LG
TL;DR: RXNEmb is a novel reaction-level descriptor derived from a pre-trained model that learns bond formation/cleavage patterns, enabling better reaction clustering and visualization than rule-based categories.
Details
Motivation: There's a scarcity of general-purpose, reaction-wise descriptors for bridging real-world chemistry with digital representations, despite the rise of data-driven reaction prediction models.
Method: Developed RXNEmb descriptor from RXNGraphormer, a model pre-trained to distinguish real reactions from fictitious ones with erroneous bond changes, learning intrinsic bond formation and cleavage patterns.
Result: Successfully re-clustered USPTO-50k dataset showing bond-change similarities better than rule-based categories, enabled reaction space visualization, and attention analysis revealed focus on chemically critical sites.
Conclusion: RXNEmb serves as a powerful, interpretable tool for reaction fingerprinting and analysis, paving the way for more data-centric approaches in reaction analysis and discovery.
Abstract: With the rise of data-driven reaction prediction models, effective reaction descriptors are crucial for bridging the gap between real-world chemistry and digital representations. However, general-purpose, reaction-wise descriptors remain scarce. This study introduces RXNEmb, a novel reaction-level descriptor derived from RXNGraphormer, a model pre-trained to distinguish real reactions from fictitious ones with erroneous bond changes, thereby learning intrinsic bond formation and cleavage patterns. We demonstrate its utility by data-driven re-clustering of the USPTO-50k dataset, yielding a classification that more directly reflects bond-change similarities than rule-based categories. Combined with dimensionality reduction, RXNEmb enables visualization of reaction space diversity. Furthermore, attention weight analysis reveals the model’s focus on chemically critical sites, providing mechanistic insight. RXNEmb serves as a powerful, interpretable tool for reaction fingerprinting and analysis, paving the way for more data-centric approaches in reaction analysis and discovery.
[408] Inference Attacks Against Graph Generative Diffusion Models
Xiuling Wang, Xin Huang, Guibo Luo, Jianliang Xu
Main category: cs.LG
TL;DR: This paper investigates privacy risks in graph generative diffusion models through three black-box inference attacks and proposes defense mechanisms.
Details
Motivation: Graph generative diffusion models have become powerful for generating complex graph structures, but their privacy risks remain largely unexplored. The authors aim to investigate information leakage in these models through inference attacks.
Method: The authors design three types of black-box inference attacks: 1) graph reconstruction attack to reconstruct structurally similar training graphs, 2) property inference attack to infer properties like average graph density and density distribution, and 3) two membership inference attacks to determine if a graph was in the training set. They also propose two defense mechanisms.
Result: Extensive experiments on three different graph generative diffusion models and six real-world graphs demonstrate the effectiveness of these attacks, significantly outperforming baseline approaches. The proposed defense mechanisms achieve better trade-off between defense strength and model utility than existing methods.
Conclusion: Graph generative diffusion models are vulnerable to privacy attacks, and the proposed defense mechanisms can help mitigate these risks while maintaining model utility.
Abstract: Graph generative diffusion models have recently emerged as a powerful paradigm for generating complex graph structures, effectively capturing intricate dependencies and relationships within graph data. However, the privacy risks associated with these models remain largely unexplored. In this paper, we investigate information leakage in such models through three types of black-box inference attacks. First, we design a graph reconstruction attack, which can reconstruct graphs structurally similar to those training graphs from the generated graphs. Second, we propose a property inference attack to infer the properties of the training graphs, such as the average graph density and the distribution of densities, from the generated graphs. Third, we develop two membership inference attacks to determine whether a given graph is present in the training set. Extensive experiments on three different types of graph generative diffusion models and six real-world graphs demonstrate the effectiveness of these attacks, significantly outperforming the baseline approaches. Finally, we propose two defense mechanisms that mitigate these inference attacks and achieve a better trade-off between defense strength and target model utility than existing methods. Our code is available at https://zenodo.org/records/17946102.
[409] TreeAdv: Tree-Structured Advantage Redistribution for Group-Based RL
Lang Cao, Hui Ruan, Yongqian Li, Peng Chao, Wu Ning, Haonan Song, Renhong Chen, Yitong Li
Main category: cs.LG
TL;DR: TreeAdv improves group-based RL for LLMs by using tree-structured advantage redistribution instead of treating rollouts as flat sequences, reducing token usage while improving performance on math reasoning tasks.
Details
Motivation: Standard GRPO treats rollout trajectories as independent flat sequences and assigns single sequence-level advantages to all tokens, causing sample inefficiency and length bias toward verbose, redundant chains of thought without improving logical depth.
Method: TreeAdv makes the tree structure of group rollouts explicit, using entropy-driven sampling to build forests where trees branch at high-uncertainty decisions while sharing low-uncertainty tokens. It aggregates token-level advantages for internal tree segments by redistributing the advantages of complete rollouts (leaf nodes).
Result: Across 10 math reasoning benchmarks, TreeAdv consistently outperforms GRPO and GSPO while using substantially fewer generated tokens under identical supervision, data, and decoding budgets.
Conclusion: TreeAdv provides a more efficient approach to group-based RL for LLM alignment by leveraging tree structures for both exploration and advantage assignment, addressing key limitations of flat sequence treatment in existing methods.
Abstract: Reinforcement learning with group-based objectives, such as Group Relative Policy Optimization (GRPO), is a common framework for aligning large language models on complex reasoning tasks. However, standard GRPO treats each rollout trajectory as an independent flat sequence and assigns a single sequence-level advantage to all tokens, which leads to sample inefficiency and a length bias toward verbose, redundant chains of thought without improving logical depth. We introduce TreeAdv (Tree-Structured Advantage Redistribution for Group-Based RL), which makes the tree structure of group rollouts explicit for both exploration and advantage assignment. Specifically, TreeAdv builds a group of trees (a forest) based on an entropy-driven sampling method where each tree branches at high-uncertainty decisions while sharing low-uncertainty tokens across rollouts. Then, TreeAdv aggregates token-level advantages for internal tree segments by redistributing the advantages of complete rollouts (all leaf nodes), and TreeAdv can easily apply to group-based objectives such as GRPO or GSPO. Across 10 math reasoning benchmarks, TreeAdv consistently outperforms GRPO and GSPO, while using substantially fewer generated tokens under identical supervision, data, and decoding budgets.
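The redistribution step can be sketched directly: walk from each leaf (complete rollout) up to the root, accumulate advantages at every ancestor segment, and average. The node representation and names below are hypothetical, not the paper's data structures.

```python
from collections import defaultdict

def redistribute_advantages(parent, leaf_adv):
    """Give every internal segment of a rollout tree the mean advantage of
    the leaves below it, so shared prefixes receive a segment-level
    advantage instead of inheriting one flat sequence score.

    parent:   parent[i] is node i's parent index, -1 for the root
    leaf_adv: dict mapping leaf node index -> scalar advantage
    """
    total = defaultdict(float)
    count = defaultdict(int)
    for leaf, adv in leaf_adv.items():
        node = leaf
        while node != -1:              # credit every ancestor segment
            total[node] += adv
            count[node] += 1
            node = parent[node]
    return {n: total[n] / count[n] for n in total}
```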
[410] Investigating Knowledge Distillation Through Neural Networks for Protein Binding Affinity Prediction
Wajid Arshad Abbasi, Syed Ali Abbas, Maryum Bibi, Saiqa Andleeb, Muhammad Naveed Akhtar
Main category: cs.LG
TL;DR: Knowledge distillation framework transfers structural knowledge to sequence-based protein-protein binding affinity prediction, improving performance without requiring structures at inference time.
Details
Motivation: Predicting protein-protein binding affinity is challenging due to the trade-off between accuracy and data availability. Structure-based models outperform sequence-based ones but require experimentally resolved structures, which are often unavailable.
Method: Proposed knowledge distillation regression framework: uses structure-informed teacher network to supervise sequence-based student network during training via binding affinity labels and intermediate feature representations. Student only needs sequence data during inference.
Result: Sequence-only baseline: P_r=0.375, RMSE=2.712 kcal/mol. Structure-based models: P_r=0.512, RMSE=2.445 kcal/mol. Distillation-based student: P_r=0.481, RMSE=2.488 kcal/mol - significant improvement over sequence-only. Error analysis shows improved agreement and reduced bias.
Conclusion: Knowledge distillation effectively transfers structural knowledge to sequence-based predictors, bridging performance gap. Framework has potential to further improve with larger datasets. Code available at provided GitHub repository.
Abstract: The trade-off between predictive accuracy and data availability makes it difficult to predict protein–protein binding affinity accurately. The lack of experimentally resolved protein structures limits the performance of structure-based machine learning models, which generally outperform sequence-based methods. In order to overcome this constraint, we suggest a regression framework based on knowledge distillation that uses protein structural data during training and only needs sequence data during inference. The suggested method uses binding affinity labels and intermediate feature representations to jointly supervise the training of a sequence-based student network under the guidance of a structure-informed teacher network. Leave-One-Complex-Out (LOCO) cross-validation was used to assess the framework on a non-redundant protein–protein binding affinity benchmark dataset. A maximum Pearson correlation coefficient (P_r) of 0.375 and an RMSE of 2.712 kcal/mol were obtained by sequence-only baseline models, whereas a P_r of 0.512 and an RMSE of 2.445 kcal/mol were obtained by structure-based models. With a P_r of 0.481 and an RMSE of 2.488 kcal/mol, the distillation-based student model greatly enhanced sequence-only performance. Improved agreement and decreased bias were further confirmed by thorough error analyses. With the potential to close the performance gap between sequence-based and structure-based models as larger datasets become available, these findings show that knowledge distillation is an efficient method for transferring structural knowledge to sequence-based predictors. The source code for running inference with the proposed distillation-based binding affinity predictor can be accessed at https://github.com/wajidarshad/ProteinAffinityKD.
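The training objective described, ground-truth supervision plus intermediate feature matching against a frozen teacher, takes a simple two-term form. A minimal PyTorch sketch with illustrative weights follows; the paper's exact loss weighting and feature alignment may differ.

```python
import torch
import torch.nn.functional as F

def kd_regression_loss(student_pred, student_feat,
                       teacher_feat, affinity, alpha=0.5, beta=0.5):
    """Joint loss for a sequence-based student guided by a
    structure-informed teacher: affinity regression plus feature
    matching. alpha/beta and plain MSE matching are illustrative choices.

    student_pred: (B, 1) predicted binding affinities
    student_feat, teacher_feat: (B, D) intermediate representations
    affinity: (B,) ground-truth labels (e.g., kcal/mol)
    """
    label_loss = F.mse_loss(student_pred.squeeze(-1), affinity)
    feat_loss = F.mse_loss(student_feat, teacher_feat.detach())
    return alpha * label_loss + beta * feat_loss
```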
[411] The Geometry of the Pivot: A Note on Lazy Pivoted Cholesky and Farthest Point Sampling
Gil Shabat
Main category: cs.LG
TL;DR: Pivoted Cholesky decomposition for kernel matrices is geometrically equivalent to Farthest Point Sampling in RKHS, with Cholesky factor construction as implicit Gram-Schmidt orthogonalization.
Details
Motivation: While Pivoted Cholesky decomposition is widely used for scaling Gaussian Processes to large datasets, its geometric intuition within kernel methods remains obscure. The authors aim to bridge this gap between numerical linear algebra theory and practical kernel method applications.
Method: The authors provide a geometric interpretation of Pivoted Cholesky decomposition within Reproducing Kernel Hilbert Space (RKHS). They demonstrate that the pivotal selection step is mathematically equivalent to Farthest Point Sampling using the kernel metric, and that Cholesky factor construction corresponds to implicit Gram-Schmidt orthogonalization.
Result: The paper establishes a clear geometric connection between Pivoted Cholesky decomposition and kernel methods, showing that the algorithm performs Farthest Point Sampling in the RKHS metric space. This provides intuitive understanding of why the method works well for kernel matrix approximation.
Conclusion: Pivoted Cholesky decomposition has a natural geometric interpretation in RKHS as Farthest Point Sampling with implicit Gram-Schmidt orthogonalization. The authors provide both theoretical derivation and practical Python implementation to connect theory with application.
Abstract: Low-rank approximations of large kernel matrices are ubiquitous in machine learning, particularly for scaling Gaussian Processes to massive datasets. The Pivoted Cholesky decomposition is a standard tool for this task, offering a computationally efficient, greedy low-rank approximation. While its algebraic properties are well-documented in numerical linear algebra, its geometric intuition within the context of kernel methods often remains obscure. In this note, we elucidate the geometric interpretation of the algorithm within the Reproducing Kernel Hilbert Space (RKHS). We demonstrate that the pivotal selection step is mathematically equivalent to Farthest Point Sampling (FPS) using the kernel metric, and that the Cholesky factor construction is an implicit Gram-Schmidt orthogonalization. We provide a concise derivation and a minimalist Python implementation to bridge the gap between theory and practice.
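The equivalence is easy to see in code: the residual diagonal maintained by pivoted Cholesky equals the squared RKHS distance from each point to the span of the chosen pivots, so the argmax pivot rule is farthest point sampling in the kernel metric. A minimal sketch in the spirit of the note's minimalist implementation (not its actual code), assuming K is PSD with numerical rank at least `rank`:

```python
import numpy as np

def pivoted_cholesky(K, rank):
    """Greedy pivoted Cholesky of a PSD kernel matrix K, returning a
    factor with K ~ L @ L.T. The residual diagonal d[i] is the squared
    RKHS distance from phi(x_i) to span of the selected pivots, so the
    argmax pivot is the farthest point in the kernel metric.
    """
    n = K.shape[0]
    d = np.diag(K).astype(float).copy()   # residual = squared FPS distances
    L = np.zeros((n, rank))
    pivots = []
    for j in range(rank):
        i = int(np.argmax(d))             # farthest point in the RKHS
        pivots.append(i)
        # implicit Gram-Schmidt step against previously selected pivots
        L[:, j] = (K[:, i] - L @ L[i]) / np.sqrt(d[i])
        d -= L[:, j] ** 2                 # update residual diagonal
        d[i] = 0.0
    return L, pivots
```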
[412] R³L: Reflect-then-Retry Reinforcement Learning with Language-Guided Exploration, Pivotal Credit, and Positive Amplification
Weijie Shi, Yanxi Chen, Zexi Li, Xuchen Pan, Yuchang Sun, Jiajie Xu, Xiaofang Zhou, Yaliang Li
Main category: cs.LG
TL;DR: R³L is a reinforcement learning method for LLM reasoning that improves exploration via language-guided error diagnosis and retry, enables precise credit assignment to failure points, and amplifies positive signals for stable training.
Details
Motivation: Current RL approaches for LLMs struggle with exploration (low success rates, high rollout costs) and exploitation (coarse credit assignment, training instability from failure-dominated data).
Method: R³L uses three key components: 1) Reflect-then-Retry for active trajectory synthesis via language feedback to diagnose errors and restart from failure points, 2) Pivotal Credit Assignment that updates only diverging suffixes where contrastive signals exist, and 3) Positive Amplification that upweights successful trajectories to guide optimization.
Result: Experiments on agentic and reasoning tasks show 5% to 52% relative improvements over baselines while maintaining training stability.
Conclusion: R³L effectively addresses exploration-exploitation trade-offs in LLM RL through language-guided synthesis, precise credit assignment, and positive signal amplification, enabling stable training and significant performance gains on difficult tasks.
Abstract: Reinforcement learning drives recent advances in LLM reasoning and agentic capabilities, yet current approaches struggle with both exploration and exploitation. Exploration suffers from low success rates on difficult tasks and high costs of repeated rollouts from scratch. Exploitation suffers from coarse credit assignment and training instability: Trajectory-level rewards penalize valid prefixes for later errors, and failure-dominated groups overwhelm the few positive signals, leaving optimization without constructive direction. To this end, we propose R³L, Reflect-then-Retry Reinforcement Learning with Language-Guided Exploration, Pivotal Credit, and Positive Amplification. To synthesize high-quality trajectories, R³L shifts from stochastic sampling to active synthesis via reflect-then-retry, leveraging language feedback to diagnose errors, transform failed attempts into successful ones, and reduce rollout costs by restarting from identified failure points. With errors diagnosed and localized, Pivotal Credit Assignment updates only the diverging suffix where contrastive signals exist, excluding the shared prefix from gradient update. Since failures dominate on difficult tasks and reflect-then-retry produces off-policy data, risking training instability, Positive Amplification upweights successful trajectories to ensure positive signals guide the optimization process. Experiments on agentic and reasoning tasks demonstrate 5% to 52% relative improvements over baselines while maintaining training stability. Our code is released at https://github.com/shiweijiezero/R3L.
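Pivotal Credit Assignment amounts to masking the shared prefix out of the policy-gradient loss so only the diverging suffix is updated. A small sketch of that masking step, with hypothetical names:

```python
import torch

def pivotal_credit_mask(retry_ids, original_ids):
    """Zero out loss on the shared prefix of a retried trajectory so only
    the diverging suffix, where the contrastive signal lives, receives a
    gradient. Illustrative sketch of the Pivotal Credit Assignment idea.

    retry_ids, original_ids: lists of token ids for the retry and the
    failed attempt it restarts from.
    """
    limit = min(len(retry_ids), len(original_ids))
    shared = 0
    while shared < limit and retry_ids[shared] == original_ids[shared]:
        shared += 1                        # length of the common prefix
    mask = torch.ones(len(retry_ids))
    mask[:shared] = 0.0                    # exclude shared prefix from update
    return mask
```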
[413] ETR: Outcome-Guided Elastic Trust Regions for Policy Optimization
Shijie Zhang, Kevin Zhang, Zheyuan Gu, Xiang Guo, Rujun Guo, Shaoyu Liu, Guanjun Jiang, Xiaozhao Wang
Main category: cs.LG
TL;DR: ETR (Elastic Trust Regions) improves RLVR by replacing GRPO’s static trust regions with dynamic, signal-aware constraints that adapt to advantage magnitudes and group variances, preventing entropy collapse while accelerating learning.
Details
Motivation: GRPO's static trust region constraint assumes signal homogeneity, which misaligns with the heterogeneous nature of outcome-driven learning where advantage magnitudes and variances fluctuate. This leads to suboptimal exploitation of high-quality signals and insufficient noise suppression, often causing rapid entropy collapse.
Method: ETR introduces dual-level elasticity: (1) micro-level scaling of clipping boundaries based on advantage magnitude to accelerate learning from high-confidence paths, and (2) macro-level leveraging of group variance to implicitly allocate larger update budgets to tasks in the optimal learning zone.
Result: Extensive experiments on AIME and MATH benchmarks show ETR consistently outperforms GRPO, achieving superior accuracy while effectively mitigating policy entropy degradation to ensure sustained exploration.
Conclusion: ETR’s dynamic, signal-aware trust region mechanism addresses GRPO’s structural limitations by aligning optimization constraints with signal quality, enabling more efficient reinforcement learning with verifiable rewards while maintaining exploration.
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an important paradigm for unlocking reasoning capabilities in large language models, exemplified by the success of OpenAI o1 and DeepSeek-R1. Currently, Group Relative Policy Optimization (GRPO) stands as the dominant algorithm in this domain due to its stable training and critic-free efficiency. However, we argue that GRPO suffers from a structural limitation: it imposes a uniform, static trust region constraint across all samples. This design implicitly assumes signal homogeneity, a premise misaligned with the heterogeneous nature of outcome-driven learning, where advantage magnitudes and variances fluctuate significantly. Consequently, static constraints fail to fully exploit high-quality signals while insufficiently suppressing noise, often precipitating rapid entropy collapse. To address this, we propose Elastic Trust Regions (ETR), a dynamic mechanism that aligns optimization constraints with signal quality. ETR constructs a signal-aware landscape through dual-level elasticity: at the micro level, it scales clipping boundaries based on advantage magnitude to accelerate learning from high-confidence paths; at the macro level, it leverages group variance to implicitly allocate larger update budgets to tasks in the optimal learning zone. Extensive experiments on AIME and MATH benchmarks demonstrate that ETR consistently outperforms GRPO, achieving superior accuracy while effectively mitigating policy entropy degradation to ensure sustained exploration.
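One way to picture the dual-level elasticity is a clip range that widens with |advantage| and with the group's reward variance. The linear-tanh scaling below is purely a guess at the shape of ETR's rule, included only to illustrate the mechanism.

```python
import torch

def elastic_clip_ratio(ratio, advantage, group_var,
                       eps0=0.2, k_adv=0.5, k_var=0.5):
    """PPO-style clipping with an elastic boundary: the clip range widens
    with |advantage| (micro level) and with group reward variance (macro
    level). Hypothetical scaling rule, not ETR's published formula.
    """
    eps = eps0 * (1.0 + k_adv * advantage.abs().tanh()) \
               * (1.0 + k_var * torch.as_tensor(group_var).tanh())
    return torch.clamp(ratio, 1.0 - eps, 1.0 + eps)

def etr_surrogate(ratio, advantage, group_var):
    """Clipped surrogate objective using the elastic trust region."""
    clipped = elastic_clip_ratio(ratio, advantage, group_var)
    return torch.minimum(ratio * advantage, clipped * advantage).mean()
```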
[414] EDCO: Dynamic Curriculum Orchestration for Domain-specific Large Language Model Fine-tuning
Jing-Cheng Pang, Liu Sun, Chang Zhou, Xian Tang, Haichuan Ma, Kun Jiang, Jianlong Wang, Kai Zhang, Sijie Wu, Haoran Cai, Chenwei Wu, Xubin Li, Xin Chen
Main category: cs.LG
TL;DR: EDCO is a dynamic curriculum learning framework for fine-tuning domain-specific LLMs that adaptively selects training samples based on inference entropy, outperforming static curriculum methods while reducing computational costs.
Details
Motivation: Current LLM fine-tuning methods use static curricula designed before training, which lack adaptability to the model's evolving needs during fine-tuning. There's a need for dynamic curriculum learning that can adjust based on the model's current state.
Method: EDCO framework with three components: 1) Efficient entropy estimator using prefix tokens to approximate full-sequence entropy, 2) Entropy-based curriculum generator selecting data points with highest inference entropy, 3) LLM trainer optimizing the model on the selected curriculum. Inspired by findings that high answer entropy benefits long-term reasoning.
Result: EDCO outperforms traditional curriculum strategies for fine-tuning Qwen3-4B and Llama3.2-3B models in communication, medicine, and law domains under supervised and reinforcement learning settings. Efficient entropy estimation reduces computational time by 83.5% while maintaining high accuracy.
Conclusion: EDCO provides an effective dynamic curriculum learning framework for domain-specific LLM fine-tuning that adapts to model needs during training, improves performance across multiple domains, and significantly reduces computational overhead through efficient entropy estimation.
Abstract: Domain-specific large language models (LLMs), typically developed by fine-tuning a pre-trained general-purpose LLM on specialized datasets, represent a significant advancement in applied AI. A common strategy in LLM fine-tuning is curriculum learning, which pre-orders training samples based on metrics like difficulty to improve learning efficiency compared to a random sampling strategy. However, most existing methods for LLM fine-tuning rely on a static curriculum, designed prior to training, which lacks adaptability to the model’s evolving needs during fine-tuning. To address this, we propose EDCO, a novel framework based on two key concepts: inference entropy and dynamic curriculum orchestration. Inspired by recent findings that maintaining high answer entropy benefits long-term reasoning gains, EDCO prioritizes samples with high inference entropy in a continuously adapted curriculum. EDCO integrates three core components: an efficient entropy estimator that uses prefix tokens to approximate full-sequence entropy, an entropy-based curriculum generator that selects data points with the highest inference entropy, and an LLM trainer that optimizes the model on the selected curriculum. In comprehensive experiments in the communication, medicine, and law domains, EDCO outperforms traditional curriculum strategies for fine-tuning Qwen3-4B and Llama3.2-3B models under supervised and reinforcement learning settings. Furthermore, the proposed efficient entropy estimation reduces computational time by 83.5% while maintaining high accuracy.
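A rough sketch of the two data-side components described above: a prefix-token entropy estimate and a top-k selection that re-orders the curriculum each round. It assumes a Hugging Face-style `generate` API; `n_prefix` and per-token averaging are illustrative choices, not EDCO's published settings.

```python
import torch

@torch.no_grad()
def prefix_entropy(model, input_ids, n_prefix=32):
    """Approximate full-sequence inference entropy from the first n_prefix
    generated tokens. `input_ids` is a single tokenized prompt (batch of 1)."""
    out = model.generate(input_ids, max_new_tokens=n_prefix,
                         output_scores=True, return_dict_in_generate=True)
    ent = 0.0
    for scores in out.scores:                # one [batch, vocab] tensor per step
        p = torch.softmax(scores, dim=-1)
        ent = ent + (-(p * torch.log(p + 1e-12)).sum(-1))
    return ent / len(out.scores)             # mean per-token entropy

def select_curriculum(model, pool, k):
    """Dynamic curriculum step: pick the k pool samples the current model
    is most uncertain about (highest estimated inference entropy)."""
    scored = [(prefix_entropy(model, x).item(), i) for i, x in enumerate(pool)]
    top = sorted(scored, key=lambda t: -t[0])[:k]
    return [pool[i] for _, i in top]
```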
[415] Probabilistic Transformers for Joint Modeling of Global Weather Dynamics and Decision-Centric Variables
Paulius Rauba, Viktor Cikojevic, Fran Bartolic, Sam Levang, Ty Dickinson, Chase Dwelle
Main category: cs.LG
TL;DR: GEM-2 is a lightweight probabilistic transformer that directly learns decision-relevant weather targets (like extremes and accumulations) alongside atmospheric dynamics, outperforming operational NWP models and competing with more complex ML approaches while being much more efficient.
Details
Motivation: Current weather forecasts require users to post-process atmospheric state variables into decision-relevant targets (extremes, accumulations, threshold exceedances), which introduces suboptimality and structural bias. Decisions depend on distributions over these functionals that models aren't trained to learn directly.
Method: GEM-2 is a probabilistic transformer (~275M parameters) trained on a CRPS objective to jointly learn global atmospheric dynamics alongside user-relevant variables. It’s lightweight and computationally efficient (~20-100x training speedup vs state-of-the-art).
Result: Directly outperforms operational numerical weather prediction models; competitive with ML models using expensive multi-step diffusion or bespoke fine-tuning; achieves state-of-the-art economic value metrics; shows stable convergence to climatology at S2S/seasonal timescales; surprisingly insensitive to many architectural/training design choices.
Conclusion: GEM-2 demonstrates that lightweight transformers can directly learn decision-relevant weather targets alongside atmospheric dynamics, providing efficient, high-quality probabilistic forecasts that better serve downstream decision-making needs.
Abstract: Weather forecasts sit upstream of high-stakes decisions in domains such as grid operations, aviation, agriculture, and emergency response. Yet forecast users often face a difficult trade-off. Many decision-relevant targets are functionals of the atmospheric state variables, such as extrema, accumulations, and threshold exceedances, rather than state variables themselves. As a result, users must estimate these targets via post-processing, which can be suboptimal and can introduce structural bias. The core issue is that decisions depend on distributions over these functionals that the model is not trained to learn directly. In this work, we introduce GEM-2, a probabilistic transformer that jointly learns global atmospheric dynamics alongside a suite of variables that users directly act upon. Using this training recipe, we show that a lightweight (~275M params) and computationally efficient (~20-100x training speedup relative to state-of-the-art) transformer trained on the CRPS objective can directly outperform operational numerical weather prediction (NWP) models and be competitive with ML models that rely on expensive multi-step diffusion processes or require bespoke multi-stage fine-tuning strategies. We further demonstrate state-of-the-art economic value metrics under decision-theoretic evaluation, stable convergence to climatology at S2S and seasonal timescales, and a surprising insensitivity to many commonly assumed architectural and training design choices.
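For reference, the CRPS objective mentioned above has a standard sample-based estimator; the sketch below shows that generic form. GEM-2's exact implementation (fair-CRPS corrections, per-variable weighting, ensemble size) is not specified in this summary.

```python
import torch

def crps_ensemble(samples, obs):
    """Sample-based CRPS: E|X - y| - 0.5 * E|X - X'| averaged over the grid.

    samples: ensemble forecasts, shape [m, ...]
    obs:     observations, shape [...]
    """
    term1 = (samples - obs.unsqueeze(0)).abs().mean(0)          # E|X - y|
    term2 = (samples.unsqueeze(0) - samples.unsqueeze(1)).abs().mean((0, 1))
    return (term1 - 0.5 * term2).mean()                          # lower is better
```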
[416] Learning Shrinks the Hard Tail: Training-Dependent Inference Scaling in a Solvable Linear Model
Noam Levi
Main category: cs.LG
TL;DR: The paper analyzes neural scaling laws in a solvable model of last-layer fine-tuning where targets have instance-heterogeneous difficulty, showing that pass@k failure rates follow power-law decay with training-dependent exponents that saturate at an intrinsic limit.
Details
Motivation: To understand how neural scaling laws behave when targets have intrinsic, instance-specific difficulty, and to connect generalization loss to inference performance metrics like pass@k failure rates.
Method: Develops a Latent Instance Difficulty (LID) model where each input’s target variance is governed by a latent “precision” drawn from a heavy-tailed distribution. Analyzes neural scaling laws in this solvable model of last-layer fine-tuning, deriving closed-form predictions for pass@k behavior.
Result: Pass@k failure rates exhibit power-law decay k^{-β_eff} with training-dependent exponents β_eff that grow with sample size N before saturating at an intrinsic limit β set by the difficulty distribution’s tail. Learning shrinks the “hard tail” of the error distribution, and improvements in generalization error steepen the pass@k curve until irreducible target variance dominates.
Conclusion: The LID model reveals coupling between learning and inference: training reduces the hard tail of errors until intrinsic target variance limits further improvement. This yields a compute-allocation rule favoring training before saturation and inference attempts after, validated in simulations and real-data proxies (CIFAR-10H and maths distillation tasks).
Abstract: We analyze neural scaling laws in a solvable model of last-layer fine-tuning where targets have intrinsic, instance-heterogeneous difficulty. In our Latent Instance Difficulty (LID) model, each input’s target variance is governed by a latent “precision” drawn from a heavy-tailed distribution. While generalization loss recovers standard scaling laws, our main contribution connects this to inference. The pass@$k$ failure rate exhibits a power-law decay, $k^{-\beta_\text{eff}}$, but the observed exponent $\beta_\text{eff}$ is training-dependent. It grows with sample size $N$ before saturating at an intrinsic limit $\beta$ set by the difficulty distribution's tail. This coupling reveals that learning shrinks the “hard tail” of the error distribution: improvements in the model’s generalization error steepen the pass@$k$ curve until irreducible target variance dominates. The LID model yields testable, closed-form predictions for this behavior, including a compute-allocation rule that favors training before saturation and inference attempts after. We validate these predictions in simulations and in two real-data proxies: CIFAR-10H (human-label variance) and a maths teacher-student distillation task.
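The pass@k coupling is easy to reproduce numerically: given per-instance success probabilities p_i, the failure rate after k independent attempts is the mean of (1 - p_i)^k, and a heavy-tailed difficulty distribution yields the power-law tail. The Lomax-distributed failure odds below are an illustrative stand-in for the LID model's latent precision, not the paper's exact parameterization.

```python
import numpy as np

def failure_at_k(p_success, ks):
    """Pass@k failure rate: mean over instances of (1 - p)^k."""
    p = np.asarray(p_success)[None, :]
    k = np.asarray(ks)[:, None]
    return ((1.0 - p) ** k).mean(axis=1)

rng = np.random.default_rng(0)
# Heavy-tailed per-instance difficulty: Lomax-distributed failure odds
# give P(failure prob near 1) ~ eps^a, hence failure@k ~ k^{-a} for large k.
odds = rng.pareto(a=1.5, size=10_000)
p = 1.0 / (1.0 + odds)
ks = np.array([1, 2, 4, 8, 16, 32, 64])
print(failure_at_k(p, ks))   # approximately power-law decay with exponent ~1.5
```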
[417] Improving Compactness and Reducing Ambiguity of CFIRE Rule-Based Explanations
Sebastian Müller, Tobias Schneider, Ruben Kemna, Vanessa Toborek
Main category: cs.LG
TL;DR: CFIRE algorithm for tabular data explanations produces ambiguous rule assignments; proposed pruning strategy reduces ambiguity while maintaining performance.
Details
Motivation: Tabular models in sensitive domains need transparent explanations. CFIRE creates surrogate rule models but suffers from ambiguity when assigning multiple conflicting rules to the same sample.
Method: Post-hoc pruning strategy that removes rules with low contribution or conflicting coverage from CFIRE’s surrogate rule models.
Result: Experiments show smaller, less ambiguous models with preserved fidelity and minimal impact on predictive performance across multiple datasets.
Conclusion: The pruning strategy effectively addresses CFIRE’s ambiguity problem, producing more interpretable surrogate models suitable for sensitive applications.
Abstract: Models trained on tabular data are widely used in sensitive domains, increasing the demand for explanation methods to meet transparency needs. CFIRE is a recent algorithm in this domain that constructs compact surrogate rule models from local explanations. While effective, CFIRE may assign rules associated with different classes to the same sample, introducing ambiguity. We investigate this ambiguity and propose a post-hoc pruning strategy that removes rules with low contribution or conflicting coverage, yielding smaller and less ambiguous models while preserving fidelity. Experiments across multiple datasets confirm these improvements with minimal impact on predictive performance.
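One plausible reading of "removes rules with low contribution or conflicting coverage", sketched as a greedy filter over rules represented as (class, coverage-mask) pairs. The greedy order, the uniqueness-based contribution measure, and both thresholds are our assumptions, not CFIRE's published procedure.

```python
import numpy as np

def prune_rules(rules, min_contribution=5):
    """Drop rules that uniquely cover few samples (low contribution) or whose
    coverage largely overlaps samples already claimed by a different class
    (conflicting coverage). `rules` = list of (class_label, bool mask)."""
    kept, claimed = [], {}
    for cls, cov in sorted(rules, key=lambda r: -r[1].sum()):  # big rules first
        idx = np.flatnonzero(cov)
        new = [i for i in idx if i not in claimed]
        conflicts = sum(claimed.get(i, cls) != cls for i in idx)
        if len(new) >= min_contribution and conflicts <= len(idx) // 2:
            kept.append((cls, cov))
            for i in new:
                claimed[i] = cls
    return kept  # smaller rule set; each sample tends to one class's rules
```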
[418] Prompt Tuning without Labeled Samples for Zero-Shot Node Classification in Text-Attributed Graphs
Sethupathy Parameswaran, Suresh Sundaram, Yuan Fang
Main category: cs.LG
TL;DR: Zero-shot Prompt Tuning (ZPT) framework uses a Universal Bimodal Conditional Generator to create synthetic node embeddings from class names for zero-shot node classification in text-attributed graphs.
Details
Motivation: Zero-shot node classification in text-attributed graphs is challenging due to lack of labeled data, with applications in social networks, article grouping, and e-commerce categorization.
Method: Pre-train graph-language model to capture structure and text, train conditional generator to learn joint distribution, generate synthetic embeddings from class names, and perform continuous prompt tuning for classification.
Result: Extensive experiments show ZPT outperforms state-of-the-art baselines on multiple benchmark datasets, with ablation studies validating the bimodal generator’s contribution.
Conclusion: The proposed ZPT framework effectively addresses zero-shot node classification in TAGs by generating synthetic data from class names and using prompt tuning, demonstrating superior performance.
Abstract: Node classification is a fundamental problem in information retrieval with many real-world applications, such as community detection in social networks, grouping articles published online and product categorization in e-commerce. Zero-shot node classification in text-attributed graphs (TAGs) presents a significant challenge, particularly due to the absence of labeled data. In this paper, we propose a novel Zero-shot Prompt Tuning (ZPT) framework to address this problem by leveraging a Universal Bimodal Conditional Generator (UBCG). Our approach begins with pre-training a graph-language model to capture both the graph structure and the associated textual descriptions of each node. Following this, a conditional generative model is trained to learn the joint distribution of nodes in both graph and text modalities, enabling the generation of synthetic samples for each class based solely on the class name. These synthetic node and text embeddings are subsequently used to perform continuous prompt tuning, facilitating effective node classification in a zero-shot setting. Furthermore, we conduct extensive experiments on multiple benchmark datasets, demonstrating that our framework performs better than existing state-of-the-art baselines. We also provide ablation studies to validate the contribution of the bimodal generator. The code is provided at: https://github.com/Sethup123/ZPT.
[419] Quantum vs. Classical Machine Learning: A Benchmark Study for Financial Prediction
Rehan Ahmad, Muhammad Kashif, Nouhaila Innan, Muhammad Shafique
Main category: cs.LG
TL;DR: A reproducible benchmarking framework compares quantum machine learning (QML) models with classical counterparts on three financial tasks, showing QML can outperform classical methods when data structure and circuit design are well-aligned.
Details
Motivation: To provide a fair, systematic comparison between QML models and classical machine learning methods in financial applications, identifying scenarios where quantum approaches offer tangible improvements versus where classical methods still dominate.
Method: Developed a reproducible benchmarking framework that standardizes data splits, features, and evaluation metrics. Compared QML models with architecture-matched classical counterparts across three tasks: directional return prediction on US/Turkish equities, live-trading simulation with Quantum LSTMs vs classical LSTMs on S&P 500, and realized volatility forecasting using Quantum Support Vector Regression.
Result: Quantum approaches show performance gains when data structure and circuit design are well-aligned. Hybrid quantum neural networks surpassed classical ANNs by +3.8 AUC/+3.4 accuracy points on AAPL and +4.9 AUC/+3.6 accuracy points on Turkish stock KCHOL. QLSTM achieved higher risk-adjusted returns in 2 of 4 S&P 500 regimes. Angle-encoded QSVR attained lowest QLIKE on KCHOL and remained competitive on S&P 500 and AAPL.
Conclusion: The benchmarking framework successfully identifies specific scenarios where current QML architectures offer measurable improvements over classical methods, while also revealing areas where classical approaches continue to dominate, providing guidance for practical QML adoption in finance.
Abstract: In this paper, we present a reproducible benchmarking framework that systematically compares QML models with architecture-matched classical counterparts across three financial tasks: (i) directional return prediction on U.S. and Turkish equities, (ii) live-trading simulation with Quantum LSTMs versus classical LSTMs on the S&P 500, and (iii) realized volatility forecasting using Quantum Support Vector Regression. By standardizing data splits, features, and evaluation metrics, our study provides a fair assessment of when current-generation QML models can match or exceed classical methods. Our results reveal that quantum approaches show performance gains when data structure and circuit design are well aligned. In directional classification, hybrid quantum neural networks surpass the parameter-matched ANN by \textbf{+3.8 AUC} and \textbf{+3.4 accuracy points} on \texttt{AAPL} stock and by \textbf{+4.9 AUC} and \textbf{+3.6 accuracy points} on Turkish stock \texttt{KCHOL}. In live trading, the QLSTM achieves higher risk-adjusted returns in \textbf{two of four} S&P500 regimes. For volatility forecasting, an angle-encoded QSVR attains the \textbf{lowest QLIKE} on \texttt{KCHOL} and remains within $\sim$0.02-0.04 QLIKE of the best classical kernels on \texttt{S&P500} and \texttt{AAPL}. Our benchmarking framework clearly identifies the scenarios where current QML architectures offer tangible improvements and where established classical methods continue to dominate.
[420] Detecting Semantic Backdoors in a Mystery Shopping Scenario
Arpad Berta, Gabor Danner, Istvan Hegedus, Mark Jelasity
Main category: cs.LG
TL;DR: Proposes a method to detect semantic backdoors in ML models by training reference models, calibrating distance thresholds, and using adversarial training with model inversion to distinguish clean vs poisoned models.
Details
Motivation: Semantic backdoors (activated by natural out-of-distribution inputs) are harder to detect than trigger-based backdoors, and there is little existing work on this problem. The research is motivated by consumer protection scenarios where authorities need to verify whether ML service providers have inserted backdoors.
Method: Create a reference model pool by training clean and poisoned models on trusted infrastructure, and calibrate a model distance threshold. Request adversarial training from the provider and measure model distances using input samples generated by inverting the models so as to maximize distance from clean samples.
Result: The method can often completely separate clean and poisoned models, proving superior to state-of-the-art backdoor detectors. Most reliable approach uses adversarial training with model inversion for distance measurement.
Conclusion: Proposed approach effectively detects semantic backdoors in ML models, especially in consumer protection scenarios where authorities can verify service providers. The combination of reference models, calibrated thresholds, and adversarial training with model inversion provides robust detection capabilities.
Abstract: Detecting semantic backdoors in classification models, where some classes can be activated by certain natural but out-of-distribution inputs, is an important problem that has received relatively little attention. Semantic backdoors are significantly harder to detect than backdoors that are based on trigger patterns due to the lack of such clearly identifiable patterns. We tackle this problem under the assumption that the clean training dataset and the training recipe of the model are both known. These assumptions are motivated by a consumer protection scenario, in which the responsible authority performs mystery shopping to test a machine learning service provider. In this scenario, the authority uses the provider’s resources and tools to train a model on a given dataset and tests whether the provider included a backdoor. In our proposed approach, the authority creates a reference model pool by training a small number of clean and poisoned models using trusted infrastructure, and calibrates a model distance threshold to identify clean models. We propose and experimentally analyze a number of approaches to compute model distances and we also test a scenario where the provider performs an adaptive attack to avoid detection. The most reliable method is based on requesting adversarial training from the provider. The model distance is best measured using a set of input samples generated by inverting the models in such a way as to maximize the distance from clean samples. With these settings, our method can often completely separate clean and poisoned models, and it proves to be superior to state-of-the-art backdoor detectors as well.
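Schematically, the mystery-shopping pipeline reduces to computing distances between the provider's model and a trusted reference pool, then thresholding. The sketch below assumes models expose probability outputs on a shared probe set and uses a simple quantile rule for calibration; the paper's inversion-based probe generation is taken as given.

```python
import numpy as np

def model_distance(model_a, model_b, probes):
    """Mean output disagreement on probe inputs. In the paper, probes are
    model-inversion samples pushed far from clean data; here they are taken
    as given arrays, and each model maps probes -> class probabilities."""
    return float(np.abs(model_a(probes) - model_b(probes)).mean())

def calibrate_threshold(clean_pool, probes, quantile=0.95):
    """Threshold = a high quantile of pairwise distances within the trusted
    clean reference pool (the quantile rule is our assumption)."""
    dists = [model_distance(a, b, probes)
             for i, a in enumerate(clean_pool) for b in clean_pool[i + 1:]]
    return float(np.quantile(dists, quantile))

def flag_poisoned(candidate, clean_pool, probes, threshold):
    d = np.mean([model_distance(candidate, ref, probes) for ref in clean_pool])
    return d > threshold   # far from all clean references => suspicious
```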
[421] Logic Tensor Network-Enhanced Generative Adversarial Network
Nijesh Upreti, Vaishak Belle
Main category: cs.LG
TL;DR: LTN-GAN enhances GANs by incorporating Logic Tensor Networks to enforce domain-specific logical constraints during sample generation, improving logical consistency while maintaining sample quality.
Details
Motivation: Traditional GANs lack mechanisms to incorporate prior knowledge or enforce logical consistency, limiting their applicability in domains requiring rule adherence. There's a need to combine realistic data synthesis with logical reasoning capabilities.
Method: LTN-GAN integrates Logic Tensor Networks (LTNs) with Generative Adversarial Networks. LTNs provide a principled way to integrate first-order logic with neural networks, enabling models to reason over and satisfy logical constraints during the generative process.
Result: The model significantly outperforms traditional GANs in adherence to predefined logical constraints while maintaining quality and diversity of generated samples across multiple datasets including synthetic datasets (gaussian, grid, rings) and MNIST.
Conclusion: This work demonstrates the potential of neuro-symbolic approaches to enhance generative modeling in knowledge-intensive domains by combining the strengths of GANs for realistic data synthesis with LTNs for logical reasoning.
Abstract: In this paper, we introduce Logic Tensor Network-Enhanced Generative Adversarial Network (LTN-GAN), a novel framework that enhances Generative Adversarial Networks (GANs) by incorporating Logic Tensor Networks (LTNs) to enforce domain-specific logical constraints during the sample generation process. Although GANs have shown remarkable success in generating realistic data, they often lack mechanisms to incorporate prior knowledge or enforce logical consistency, limiting their applicability in domains requiring rule adherence. LTNs provide a principled way to integrate first-order logic with neural networks, enabling models to reason over and satisfy logical constraints. By combining the strengths of GANs for realistic data synthesis with LTNs for logical reasoning, we gain valuable insights into how logical constraints influence the generative process while improving both the diversity and logical consistency of the generated samples. We evaluate LTN-GAN across multiple datasets, including synthetic datasets (gaussian, grid, rings) and the MNIST dataset, demonstrating that our model significantly outperforms traditional GANs in terms of adherence to predefined logical constraints while maintaining the quality and diversity of generated samples. This work highlights the potential of neuro-symbolic approaches to enhance generative modeling in knowledge-intensive domains.
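The coupling described above amounts to adding a differentiable constraint-satisfaction term to the generator objective. Below is a minimal sketch in that spirit: `sat_fn` returns a fuzzy truth value in [0, 1] for the domain rules, and `lam` weights the violation penalty. Both the aggregation and the example predicate are illustrative, not LTN-GAN's exact formulation.

```python
import torch
import torch.nn.functional as F

def generator_loss(d_fake_logits, sat_fn, fake_samples, lam=1.0):
    """Non-saturating GAN generator loss plus an LTN-style constraint term."""
    adv = F.binary_cross_entropy_with_logits(
        d_fake_logits, torch.ones_like(d_fake_logits))
    sat = sat_fn(fake_samples)             # fuzzy truth of the rules in [0, 1]
    return adv + lam * (1.0 - sat.mean())  # penalize constraint violation

# Example fuzzy predicate: "generated point lies inside the unit disc".
def inside_unit_disc(x):
    return torch.sigmoid(5.0 * (1.0 - x.norm(dim=-1)))
```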
[422] Feature-Aware One-Shot Federated Learning via Hierarchical Token Sequences
Shudong Liu, Hanwen Zhang, Xiuling Wang, Yuesheng Zhu, Guibo Luo
Main category: cs.LG
TL;DR: FALCON is a one-shot federated learning framework that uses hierarchical token sequences and knowledge distillation to handle non-IID image data effectively, achieving 9.58% higher accuracy than existing methods.
Details
Motivation: Existing one-shot federated learning methods struggle with robust performance on real-world domains like medical imaging and are inefficient with non-IID data, creating a need for more effective approaches.
Method: Uses pretrained visual encoder with hierarchical scale encoding to compress images into hierarchical token sequences, multi-scale autoregressive transformer generator to model distributions, and knowledge distillation in global training.
Result: Outperforms best OSFL baselines by 9.58% in average accuracy on medical and natural image datasets across diverse non-IID scenarios.
Conclusion: FALCON effectively addresses non-IID challenges in one-shot federated learning through hierarchical token sequences and knowledge distillation, demonstrating superior performance on real-world image data.
Abstract: One-shot federated learning (OSFL) reduces the communication cost and privacy risks of iterative federated learning by constructing a global model with a single round of communication. However, most existing methods struggle to achieve robust performance on real-world domains such as medical imaging, or are inefficient when handling non-IID data (data that are not independent and identically distributed). To address these limitations, we introduce FALCON, a framework that enhances the effectiveness of OSFL over non-IID image data. The core idea of FALCON is to leverage feature-aware hierarchical token sequence generation and knowledge distillation in OSFL. First, each client leverages a pretrained visual encoder with hierarchical scale encoding to compress images into hierarchical token sequences, which capture multi-scale semantics. Second, a multi-scale autoregressive transformer generator is used to model the distribution of these token sequences and generate the synthetic sequences. Third, clients upload the synthetic sequences along with the local classifier trained on the real token sequences to the server. Finally, the server incorporates knowledge distillation into global training to reduce reliance on precise distribution modeling. Experiments on medical and natural image datasets validate the effectiveness of FALCON in diverse non-IID scenarios, outperforming the best OSFL baselines by 9.58% in average accuracy.
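The server-side distillation step can be sketched as training the global model against the averaged soft predictions of the uploaded client classifiers on the synthetic token sequences. The temperature and plain averaging below are our assumptions, not FALCON's published choices.

```python
import torch
import torch.nn.functional as F

def distill_global(student, client_classifiers, synthetic_batch, T=2.0):
    """One distillation step: client classifiers act as an ensemble teacher
    over synthetic sequences; the global (student) model matches their
    softened average prediction via KL divergence."""
    with torch.no_grad():
        teacher_logits = torch.stack(
            [clf(synthetic_batch) for clf in client_classifiers]).mean(0)
    s = F.log_softmax(student(synthetic_batch) / T, dim=-1)
    t = F.softmax(teacher_logits / T, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * T * T
```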
[423] Spectral Manifold Regularization for Stable and Modular Routing in Deep MoE Architectures
Ibrahim Delibasoglu
Main category: cs.LG
TL;DR: SR-MoE uses spectral regularization to prevent expert collapse in Mixture of Experts architectures, maintaining modularity and enabling stable lifelong learning.
Details
Motivation: MoE architectures suffer from expert collapse where routing converges to few dominant experts, reducing model capacity and causing catastrophic interference during adaptation.
Method: Spectrally-Regularized Mixture of Experts (SR-MoE) imposes geometric constraints on routing manifold using dual regularization: spectral norm constraints bound routing function Lipschitz continuity, and stable rank penalties preserve high-dimensional feature diversity in expert selection.
Result: Traditional linear gating fails with increasing depth (accuracy drops up to 4.72% due to expert entanglement), while SR-MoE maintains structural integrity (mean interference -0.32%). Spectral constraints facilitate positive knowledge transfer and enable localized expert updates without global performance decay.
Conclusion: SR-MoE provides a general solution for building high-capacity, modular networks capable of stable lifelong learning by preventing expert collapse through spectral regularization.
Abstract: Mixture of Experts (MoE) architectures enable efficient scaling of neural networks but suffer from expert collapse, where routing converges to a few dominant experts. This reduces model capacity and causes catastrophic interference during adaptation. We propose the Spectrally-Regularized Mixture of Experts (SR-MoE), which imposes geometric constraints on the routing manifold to enforce structural modularity. Our method uses dual regularization: spectral norm constraints bound routing function Lipschitz continuity, while stable rank penalties preserve high-dimensional feature diversity in expert selection. We evaluate SR-MoE across architectural scales and dataset complexities using modular one-shot adaptation tasks. Results show that traditional linear gating fails with increasing depth (accuracy drops up to 4.72% due to expert entanglement), while SR-MoE maintains structural integrity (mean interference -0.32%). Our spectral constraints facilitate positive knowledge transfer, enabling localized expert updates without global performance decay. SR-MoE provides a general solution for building high-capacity, modular networks capable of stable lifelong learning.
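A sketch of the dual regularizer for a linear routing matrix W, under our reading of the two constraints: penalize spectral norm above a target (bounding the router's Lipschitz constant) and reward high stable rank ||W||_F^2 / ||W||_2^2 (preserving feature diversity). The hinge form and the coefficients are assumptions.

```python
import torch

def sr_moe_router_penalty(W, target_norm=1.0, lam_spec=1.0, lam_rank=0.1):
    """Dual regularizer sketch for a linear router W (experts x features)."""
    spec = torch.linalg.matrix_norm(W, ord=2)        # largest singular value
    stable_rank = (W ** 2).sum() / (spec ** 2 + 1e-8)
    spec_pen = torch.relu(spec - target_norm) ** 2   # bound Lipschitz constant
    rank_pen = -stable_rank                          # reward spectral diversity
    return lam_spec * spec_pen + lam_rank * rank_pen # add to the training loss
```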
[424] Adaptive-Boundary-Clipping GRPO: Ensuring Bounded Ratios for Stable and Generalizable Training
Chi Liu, Xin Chen
Main category: cs.LG
TL;DR: ABC-GRPO improves GRPO with adaptive asymmetric clipping for better performance and exploration in LLM reinforcement learning.
Details
Motivation: The authors identify that GRPO's clipping mechanism is suboptimal in certain scenarios, limiting flexibility and generalization. They aim to enhance GRPO for improved performance in mathematical reasoning tasks with LLMs.
Method: Propose Adaptive-Boundary-Clipping GRPO (ABC-GRPO), an asymmetric and adaptive refinement of GRPO’s clipping mechanism that allows for more flexible policy updates while maintaining training stability.
Result: ABC-GRPO achieves superior performance over standard GRPO on mathematical reasoning tasks using Qwen3 LLMs, while maintaining substantially higher entropy throughout training to preserve exploration capacity and mitigate premature convergence.
Conclusion: ABC-GRPO represents a significant enhancement to GRPO that improves both performance and exploration capabilities, with publicly available implementation for reproducibility.
Abstract: Group Relative Policy Optimization (GRPO) has emerged as a popular algorithm for reinforcement learning with large language models (LLMs). However, upon analyzing its clipping mechanism, we argue that it is suboptimal in certain scenarios. With appropriate modifications, GRPO can be significantly enhanced to improve both flexibility and generalization. To this end, we propose Adaptive-Boundary-Clipping GRPO (ABC-GRPO), an asymmetric and adaptive refinement of the original GRPO framework. We demonstrate that ABC-GRPO achieves superior performance over standard GRPO on mathematical reasoning tasks using the Qwen3 LLMs. Moreover, ABC-GRPO maintains substantially higher entropy throughout training, thereby preserving the model’s exploration capacity and mitigating premature convergence. The implementation code is available online to ease reproducibility https://github.com/chi2liu/ABC-GRPO.
[425] A Gap Between Decision Trees and Neural Networks
Akash Kumar
Main category: cs.LG
TL;DR: Shallow ReLU networks struggle to approximate axis-aligned decision trees with geometrically simple boundaries due to infinite Radon total variation, but a smooth barrier score can achieve finite complexity while exactly recovering box decision regions.
Details
Motivation: To understand the trade-off between interpretability (geometric simplicity of decision boundaries) and approximation accuracy when shallow neural networks try to approximate axis-aligned decision trees, which have rule-based, box-like decision regions.
Method: Analyze infinite-width, bounded-norm, single-hidden-layer ReLU networks using the Radon total variation (RTV) seminorm to measure geometric complexity. Study different smoothing approaches for tree indicators and construct a smooth barrier score with finite RTV that exactly recovers box decision regions.
Result: Hard tree indicators and common smoothing methods (piecewise-linear, sigmoidal) have infinite RTV in dimensions >1. Gaussian convolution yields finite RTV but with exponential dependence on dimension. A smooth barrier score achieves finite RTV and exact recovery of box regions with polynomial calibration bounds.
Conclusion: There’s a fundamental conflict between geometric simplicity and accurate approximation of axis-aligned trees by shallow networks, but careful score construction can achieve finite complexity while exactly recovering decision regions, revealing an accuracy-complexity tradeoff.
Abstract: We study when geometric simplicity of decision boundaries, used here as a notion of interpretability, can conflict with accurate approximation of axis-aligned decision trees by shallow neural networks. Decision trees induce rule-based, axis-aligned decision regions (finite unions of boxes), whereas shallow ReLU networks are typically trained as score models whose predictions are obtained by thresholding. We analyze the infinite-width, bounded-norm, single-hidden-layer ReLU class through the Radon total variation ($\mathrm{RTV}$) seminorm, which controls the geometric complexity of level sets. We first show that the hard tree indicator $1_A$ has infinite $\mathrm{RTV}$. Moreover, two natural split-wise continuous surrogates, piecewise-linear ramp smoothing and sigmoidal (logistic) smoothing, also have infinite $\mathrm{RTV}$ in dimensions $d>1$, while Gaussian convolution yields finite $\mathrm{RTV}$ but with an explicit exponential dependence on $d$. We then separate two goals that are often conflated: classification after thresholding (recovering the decision set) versus score learning (learning a calibrated score close to $1_A$). For classification, we construct a smooth barrier score $S_A$ with finite $\mathrm{RTV}$ whose fixed threshold $\tau=1$ exactly recovers the box. Under a mild tube-mass condition near $\partial A$, we prove an $L_1(P)$ calibration bound that decays polynomially in a sharpness parameter, along with an explicit $\mathrm{RTV}$ upper bound in terms of face measures. Experiments on synthetic unions of rectangles illustrate the resulting accuracy-complexity tradeoff and how threshold selection shifts where training lands along it.
[426] FOREVER: Forgetting Curve-Inspired Memory Replay for Language Model Continual Learning
Yujie Feng, Hao Wang, Jian Li, Xu Chu, Zhaolu Kang, Yiran Liu, Yasha Wang, Philip S. Yu, Xiao-Ming Wu
Main category: cs.LG
TL;DR: FOREVER is a continual learning framework for LLMs that uses model-centric time (based on optimizer updates) and forgetting curve-inspired replay scheduling to better align with actual learning progress.
Details
Motivation: Current memory replay methods for continual learning in LLMs rely on fixed step-based heuristics that misalign with actual learning progress, since identical training steps can produce varying degrees of parameter change. Recent findings show LLM forgetting follows the Ebbinghaus human forgetting curve, suggesting better alignment is possible.
Method: FOREVER defines “model time” using the magnitude of optimizer updates rather than training steps. It uses a forgetting curve-based replay scheduler to determine when to replay, and an intensity-aware regularization mechanism to adaptively control how to replay, aligning replay intervals with the model’s internal evolution.
Result: Extensive experiments on three continual learning benchmarks with models ranging from 0.6B to 13B parameters demonstrate that FOREVER consistently mitigates catastrophic forgetting.
Conclusion: Aligning replay schedules with model-centric time (based on optimizer updates) and forgetting curve principles provides more effective continual learning for LLMs compared to traditional step-based heuristics.
Abstract: Continual learning (CL) for large language models (LLMs) aims to enable sequential knowledge acquisition without catastrophic forgetting. Memory replay methods are widely used for their practicality and effectiveness, but most rely on fixed, step-based heuristics that often misalign with the model’s actual learning progress, since identical training steps can result in varying degrees of parameter change. Motivated by recent findings that LLM forgetting mirrors the Ebbinghaus human forgetting curve, we propose FOREVER (FORgEtting curVe-inspired mEmory Replay), a novel CL framework that aligns replay schedules with a model-centric notion of time. FOREVER defines model time using the magnitude of optimizer updates, allowing forgetting curve-inspired replay intervals to align with the model’s internal evolution rather than raw training steps. Building on this approach, FOREVER incorporates a forgetting curve-based replay scheduler to determine when to replay and an intensity-aware regularization mechanism to adaptively control how to replay. Extensive experiments on three CL benchmarks and models ranging from 0.6B to 13B parameters demonstrate that FOREVER consistently mitigates catastrophic forgetting.
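The central idea, "model time" advancing with update magnitude rather than step count, can be sketched as a scheduler that accumulates parameter change and fires replay at growing intervals. The L2 norm, the exponential spacing, and both constants below are illustrative choices, not FOREVER's exact schedule.

```python
import torch

class ModelTimeReplayScheduler:
    """Accumulate optimizer-update magnitude as 'model time' and trigger
    replay at forgetting-curve-inspired (exponentially spaced) intervals."""

    def __init__(self, base_interval=1.0, growth=2.0):
        self.model_time = 0.0
        self.next_replay = base_interval
        self.growth = growth

    @torch.no_grad()
    def step(self, model, prev_params):
        # Time advances with total parameter change, not with step count.
        delta = sum((p - q).norm().item()
                    for p, q in zip(model.parameters(), prev_params))
        self.model_time += delta
        if self.model_time >= self.next_replay:
            self.next_replay *= self.growth  # Ebbinghaus-style spacing
            return True                      # caller should replay memory now
        return False
```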
[427] Stage-specific cancer survival prediction enriched by explainable machine learning
Parisa Poorhasani, Bogdan Iancu
Main category: cs.LG
TL;DR: The paper develops explainable ML models for stage-specific cancer survival prediction using SEER data, focusing on colorectal, stomach, and liver cancers, with SHAP and LIME for interpretability.
Details
Motivation: Traditional survival prediction models combine all cancer stages, potentially overestimating performance and ignoring stage-specific variations. There's limited research on explainability and transparency in ML survival models.
Method: Used SEER dataset to create explainable ML models for stage-specific cancer survivability prediction. Applied SHAP and LIME interpretability techniques to reveal feature-stage interactions and important factors at each stage.
Result: Identified significant feature-cancer stage interactions that traditional black-box models would miss. Discovered how demographic and clinical variables influence survival differently across cancer stages and types (colorectal, stomach, liver).
Conclusion: Stage-specific explainable ML models provide transparency and clinical relevance for personalized treatment planning, revealing important factors at each cancer stage that would be hidden in traditional combined-stage models.
Abstract: Despite the fact that cancer survivability rates vary greatly between stages, traditional survival prediction models have frequently been trained and assessed using examples from all combined phases of the disease. This method may result in an overestimation of performance and ignore the stage-specific variations. Using the SEER dataset, we created and verified explainable machine learning (ML) models to predict stage-specific cancer survivability in colorectal, stomach, and liver cancers. ML-based cancer survival analysis has been a long-standing topic in the literature; however, studies involving the explainability and transparency of ML survivability models are limited. Our use of explainability techniques, including SHapley Additive exPlanations (SHAP) and Local Interpretable Model-agnostic Explanations (LIME), enabled us to illustrate significant feature-cancer stage interactions that would have remained hidden in traditional black-box models. We identified how certain demographic and clinical variables influenced survival differently across cancer stages and types. These insights provide not only transparency but also clinical relevance, supporting personalized treatment planning. By focusing on stage-specific models, this study provides new insights into the most important factors at each stage of cancer, offering transparency and potential clinical relevance to support personalized treatment planning.
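A minimal sketch of the stage-specific setup: fit one model and one SHAP explainer per cancer stage instead of a single combined-stage model, so attributions can differ by stage. The column names, the model family, and the preprocessing are illustrative, not the paper's exact pipeline.

```python
import shap
from sklearn.ensemble import GradientBoostingClassifier

def explain_per_stage(df, feature_cols, label_col="survived_5yr",
                      stage_col="stage"):
    """Train and explain a separate survivability model for each stage.
    Returns {stage: SHAP values}, exposing stage-specific feature effects."""
    results = {}
    for stage, sub in df.groupby(stage_col):
        X, y = sub[feature_cols], sub[label_col]
        model = GradientBoostingClassifier().fit(X, y)
        explainer = shap.TreeExplainer(model)
        results[stage] = explainer.shap_values(X)   # per-stage attributions
    return results
```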
[428] Modeling Behavioral Patterns in News Recommendations Using Fuzzy Neural Networks
Kevin Innerebner, Stephan Bartl, Markus Reiter-Haas, Elisabeth Lex
Main category: cs.LG
TL;DR: Transparent news recommender system using fuzzy neural networks to learn human-readable rules for predicting article clicks, balancing accuracy with interpretability.
Details
Motivation: Current news recommender systems are black-box models that lack transparency for editorial decision-making, making it difficult for editors to understand and align content curation with audience behavior.
Method: Uses fuzzy neural networks to learn human-readable rules from behavioral data for predicting article clicks. Rules can be extracted at configurable thresholds to control complexity and interpretability.
Result: Evaluated on MIND and EB-NeRD datasets, the system accurately predicts click behavior compared to established baselines while learning interpretable rules. The learned rules reveal news consumption patterns.
Conclusion: The transparent recommender system enables editors to understand audience behavior and align content curation goals, bridging the gap between algorithmic recommendations and editorial decision-making.
Abstract: News recommender systems are increasingly driven by black-box models, offering little transparency for editorial decision-making. In this work, we introduce a transparent recommender system that uses fuzzy neural networks to learn human-readable rules from behavioral data for predicting article clicks. By extracting the rules at configurable thresholds, we can control rule complexity and thus, the level of interpretability. We evaluate our approach on two publicly available news datasets (i.e., MIND and EB-NeRD) and show that we can accurately predict click behavior compared to several established baselines, while learning human-readable rules. Furthermore, we show that the learned rules reveal news consumption patterns, enabling editors to align content curation goals with target audience behavior.
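To illustrate what a "human-readable rule" means operationally, here is generic fuzzy-rule evaluation: each rule is a conjunction (product t-norm) of per-feature membership functions, and rules whose strength falls below a configurable threshold can be dropped to trade accuracy for interpretability. The membership form, feature names, and the example rule are hypothetical.

```python
import numpy as np

def gaussian_membership(x, center, width):
    return np.exp(-0.5 * ((x - center) / width) ** 2)

def rule_firing(features, rule):
    """Firing strength of one rule = product of per-feature memberships.
    `rule` maps feature name -> (center, width) of its fuzzy set."""
    return np.prod([gaussian_membership(features[name], c, w)
                    for name, (c, w) in rule.items()])

# Hypothetical rule: "recent clicks high AND article age low -> click".
rule = {"user_recent_clicks": (0.8, 0.2), "article_age_hours": (0.1, 0.3)}
print(rule_firing({"user_recent_clicks": 0.7, "article_age_hours": 0.2}, rule))
```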
[429] Symbolic Regression for Shared Expressions: Introducing Partial Parameter Sharing
Viktor Martinek, Roland Herzog
Main category: cs.LG
TL;DR: Extends symbolic regression to handle multiple categorical variables with intermediate parameter sharing levels, reducing parameter count while revealing more problem structure.
Details
Motivation: To improve symbolic regression for scientific discovery by handling multiple categorical variables and introducing flexible parameter sharing schemes that reduce parameter count while extracting more structural information about problems.
Method: Extends existing symbolic regression approaches by considering multiple categorical variables and introducing intermediate levels of parameter sharing (parameters shared across one category but varying across another). Tests approach on synthetic fitting-only examples and applies to astrophysics dataset previously studied with only one categorical variable.
Result: Achieves similar fit quality as previous approaches but requires significantly fewer individual parameters. Demonstrates data requirement reduction and transfer learning benefits in synthetic tests. Extracts additional information about problem structure from astrophysics dataset.
Conclusion: The proposed multi-categorical variable approach with intermediate parameter sharing reduces parameter complexity while revealing more structural information, making symbolic regression more efficient and informative for scientific discovery applications.
Abstract: Symbolic Regression aims to find symbolic expressions that describe datasets. Due to better interpretability, it is a machine learning paradigm particularly powerful for scientific discovery. In recent years, several works have expanded the concept to allow the description of similar phenomena using a single expression with varying sets of parameters, thereby introducing categorical variables. Some previous works allow only “non-shared” (category-value-specific) parameters, and others also incorporate “shared” (category-value-agnostic) parameters. We expand upon those efforts by considering multiple categorical variables, and introducing intermediate levels of parameter sharing. With two categorical variables, an intermediate level of parameter sharing emerges, i.e., parameters which are shared across either category but change across the other. The new approach potentially decreases the number of parameters, while revealing additional information about the problem. Using a synthetic, fitting-only example, we test the limits of this setup in terms of data requirement reduction and transfer learning. As a real-world symbolic regression example, we demonstrate the benefits of the proposed approach on an astrophysics dataset used in a previous study, which considered only one categorical variable. We achieve a similar fit quality but require significantly fewer individual parameters, and extract additional information about the problem.
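The sharing scheme is easiest to see on a toy fit. Below, with two categorical variables c1 and c2, the amplitude a varies with c1, the offset d varies with c2, and the decay rate b is fully shared, so the parameter count is n1 + 1 + n2 rather than one full set per category combination. The functional form and data are fabricated for illustration only.

```python
import numpy as np
from scipy.optimize import least_squares

def residuals(theta, x, y, c1, c2, n1, n2):
    # theta = [a_1..a_n1 | b (shared) | d_1..d_n2]
    a, b, d = theta[:n1], theta[n1], theta[n1 + 1:]
    return a[c1] * np.exp(-b * x) + d[c2] - y

rng = np.random.default_rng(0)
n1, n2, n = 3, 2, 200
c1, c2 = rng.integers(0, n1, n), rng.integers(0, n2, n)
x = rng.uniform(0, 3, n)
y = np.array([1.0, 2.0, 0.5])[c1] * np.exp(-0.7 * x) + np.array([0.1, -0.2])[c2]

fit = least_squares(residuals, x0=np.ones(n1 + 1 + n2),
                    args=(x, y, c1, c2, n1, n2))
print(fit.x)   # recovers per-category a's, the shared b, per-category d's
```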
[430] LinkD: AutoRegressive Diffusion Model for Mechanical Linkage Synthesis
Yayati Jadhav, Amir Barati Farimani
Main category: cs.LG
TL;DR: Autoregressive diffusion framework for inverse design of mechanical linkages that sequentially constructs linkage graphs to match target trajectories, enabling scalable synthesis of complex mechanisms with up to 20 nodes.
Details
Motivation: Traditional linkage design is challenging due to the intricate coupling between continuous node placements, discrete topological configurations, and nonlinear kinematic constraints. The highly nonlinear motion-to-configuration relationship means small perturbations drastically alter trajectories, and the combinatorially expanding design space makes conventional optimization and heuristic methods computationally intractable.
Method: Introduces an autoregressive diffusion framework that exploits the dyadic nature of linkage assembly by representing mechanisms as sequentially constructed graphs (nodes as joints, edges as rigid links). Combines a causal transformer with a Denoising Diffusion Probabilistic Model (DDPM), both conditioned on target trajectories encoded via a transformer encoder. The causal transformer autoregressively predicts discrete topology node-by-node, while the DDPM refines each node’s spatial coordinates and edge connectivity to previously generated nodes.
Result: The framework enables adaptive trial-and-error synthesis where problematic nodes exhibiting kinematic locking or collisions can be selectively regenerated, allowing autonomous correction of degenerate configurations during design. Successfully synthesizes linkage systems containing up to 20 nodes with extensibility to N-node architectures, surpassing traditional optimization approaches.
Conclusion: The graph-based, data-driven methodology enables scalable inverse design that generalizes to mechanisms with arbitrary node counts. This work advances autoregressive graph generation methodologies and computational kinematic synthesis, establishing new paradigms for scalable inverse design of complex mechanical systems.
Abstract: Designing mechanical linkages to achieve target end-effector trajectories presents a fundamental challenge due to the intricate coupling between continuous node placements, discrete topological configurations, and nonlinear kinematic constraints. The highly nonlinear motion-to-configuration relationship means small perturbations in joint positions drastically alter trajectories, while the combinatorially expanding design space renders conventional optimization and heuristic methods computationally intractable. We introduce an autoregressive diffusion framework that exploits the dyadic nature of linkage assembly by representing mechanisms as sequentially constructed graphs, where nodes correspond to joints and edges to rigid links. Our approach combines a causal transformer with a Denoising Diffusion Probabilistic Model (DDPM), both conditioned on target trajectories encoded via a transformer encoder. The causal transformer autoregressively predicts discrete topology node-by-node, while the DDPM refines each node’s spatial coordinates and edge connectivity to previously generated nodes. This sequential generation enables adaptive trial-and-error synthesis where problematic nodes exhibiting kinematic locking or collisions can be selectively regenerated, allowing autonomous correction of degenerate configurations during design. Our graph-based, data-driven methodology surpasses traditional optimization approaches, enabling scalable inverse design that generalizes to mechanisms with arbitrary node counts. We demonstrate successful synthesis of linkage systems containing up to 20 nodes with extensibility to N-node architectures. This work advances autoregressive graph generation methodologies and computational kinematic synthesis, establishing new paradigms for scalable inverse design of complex mechanical systems.
[431] Using Legacy Polysomnography Data to Train a Radar System to Quantify Sleep in Older Adults and People living with Dementia
M. Yin, K. G. Ravindran, C. Hadjipanayi, A. Bannon, A. Rapeaux, C. Della Monica, T. S. Lande, Derk-Jan Dijk, T. G. Constandinou
Main category: cs.LG
TL;DR: Deep transfer learning framework using adversarial domain adaptation improves sleep stage classification from UWB radar data by bridging knowledge gap between PSG and radar signals.
Details
Motivation: Limited availability of radar sleep data makes it challenging to build robust models that generalize across diverse cohorts and environments for unobtrusive in-home sleep monitoring.
Method: End-to-end neural network trained on combined PSG and radar data, using adversarial domain adaptation to bridge knowledge gap between PSG and radar signals. Validated on radar dataset of 47 older adults including Alzheimer’s patients.
Result: Achieved 79.5% accuracy with Kappa 0.65 for classifying wakefulness, REM, light sleep, and deep sleep. Deep transfer learning significantly enhanced sleep staging performance in target domain.
Conclusion: Method effectively addresses data variability and limited sample size challenges, improving reliability of automatic sleep staging models, especially when radar data is limited.
Abstract: Objective: Ultra-wideband radar technology offers a promising solution for unobtrusive and cost-effective in-home sleep monitoring. However, the limited availability of radar sleep data poses challenges in building robust models that generalize across diverse cohorts and environments. This study proposes a novel deep transfer learning framework to enhance sleep stage classification using radar data. Methods: An end-to-end neural network was developed to classify sleep stages based on nocturnal respiratory and motion signals. The network was trained using a combination of large-scale polysomnography (PSG) datasets and radar data. A domain adaptation approach employing adversarial learning was utilized to bridge the knowledge gap between PSG and radar signals. Validation was performed on a radar dataset of 47 older adults (mean age: 71.2), including 18 participants with prodromal or mild Alzheimer disease. Results: The proposed network structure achieves an accuracy of 79.5% with a Kappa value of 0.65 when classifying wakefulness, rapid eye movement, light sleep and deep sleep. Experimental results confirm that our deep transfer learning approach significantly enhances automatic sleep staging performance in the target domain. Conclusion: This method effectively addresses challenges associated with data variability and limited sample size, substantially improving the reliability of automatic sleep staging models, especially in contexts where radar data is limited. Significance: The findings underscore the viability of UWB radar as a nonintrusive, forward-looking sleep assessment tool that could significantly benefit care for older people and people with neurodegenerative disorders.
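The adversarial domain-adaptation component corresponds to the familiar DANN recipe: a domain classifier tries to tell PSG-derived features from radar-derived ones, while a gradient-reversal layer trains the shared encoder to fool it. The sketch below shows that generic mechanism; the paper's architecture and loss weighting are not reproduced.

```python
import torch
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Gradient reversal layer: identity forward, negated gradient backward."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad):
        return -ctx.lam * grad, None

def domain_adversarial_loss(features, domain_labels, domain_clf, lam=0.1):
    """domain_labels: 0 = PSG source, 1 = radar target. Minimizing this
    trains the classifier to separate domains while the reversed gradient
    pushes the shared encoder toward domain-invariant features.
    lam is a hypothetical schedule value."""
    logits = domain_clf(GradReverse.apply(features, lam))
    return F.cross_entropy(logits, domain_labels)
```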
[432] Minimum distance classification for nonlinear dynamical systems
Dominique Martinez
Main category: cs.LG
TL;DR: Dynafit: A kernel-based method for classifying trajectory data from nonlinear dynamical systems by learning a distance metric between training trajectories and underlying dynamics using Koopman operator approximation.
Details
Motivation: To address the problem of classifying trajectory data generated by nonlinear dynamics, where each class corresponds to a distinct dynamical system, requiring a method that can handle nonlinearities and potentially infinite-dimensional feature spaces.
Method: Proposes Dynafit, a kernel-based method that learns a distance metric between training trajectories and underlying dynamics by approximating the Koopman operator, which linearizes dynamics in a kernel feature space. Uses kernel trick to handle high-dimensional spaces and can incorporate partial knowledge of dynamics when available.
Result: Demonstrates effectiveness on three examples: chaos detection with logistic map, recognition of handwritten dynamics, and recognition of visual dynamic textures, showing applicability to various classification tasks involving nonlinear dynamical systems.
Conclusion: Dynafit provides an effective kernel-based approach for classifying trajectory data from nonlinear dynamical systems by learning dynamics-based distance metrics through Koopman operator approximation, with flexibility to incorporate prior knowledge and applicability to diverse domains.
Abstract: We address the problem of classifying trajectory data generated by some nonlinear dynamics, where each class corresponds to a distinct dynamical system. We propose Dynafit, a kernel-based method for learning a distance metric between training trajectories and the underlying dynamics. New observations are assigned to the class with the most similar dynamics according to the learned metric. The learning algorithm approximates the Koopman operator which globally linearizes the dynamics in a (potentially infinite) feature space associated with a kernel function. The distance metric is computed in feature space independently of its dimensionality by using the kernel trick common in machine learning. We also show that the kernel function can be tailored to incorporate partial knowledge of the dynamics when available. Dynafit is applicable to various classification tasks involving nonlinear dynamical systems and sensors. We illustrate its effectiveness on three examples: chaos detection with the logistic map, recognition of handwritten dynamics and of visual dynamic textures.
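As a crude, finite-dimensional stand-in for the kernel Koopman machinery, one can fit a kernel one-step predictor per class and apply the minimum-distance rule via prediction error. The sketch below does exactly that and should not be read as Dynafit's actual metric.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

def fit_dynamics(trajectory, gamma=1.0):
    """Fit x_{t+1} = f(x_t) with a kernel model; trajectory: [T, d] array.
    gamma and the ridge strength are illustrative settings."""
    X, Y = trajectory[:-1], trajectory[1:]
    return KernelRidge(kernel="rbf", gamma=gamma, alpha=1e-3).fit(X, Y)

def classify(trajectory, class_models):
    """Minimum-distance rule: assign the new trajectory to the class whose
    learned dynamics best predict its one-step transitions."""
    X, Y = trajectory[:-1], trajectory[1:]
    errs = {c: np.mean((m.predict(X) - Y) ** 2)
            for c, m in class_models.items()}
    return min(errs, key=errs.get)
```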
[433] Causal Data Augmentation for Robust Fine-Tuning of Tabular Foundation Models
Magnus Bühler, Lennart Purucker, Frank Hutter
Main category: cs.LG
TL;DR: CausalMixFT improves fine-tuning of tabular foundation models under data scarcity by generating causally-informed synthetic data, enhancing robustness and validation reliability.
Details
Motivation: Fine-tuning tabular foundation models with limited data is challenging because early stopping on scarce validation data often fails to reflect true generalization performance, leading to unreliable model selection.
Method: Proposes CausalMixFT which uses Structural Causal Models (SCMs) fitted on target datasets to generate structurally consistent synthetic samples that preserve feature dependencies, augmenting limited real data with causally-informed examples.
Result: Evaluated on 33 classification datasets with over 2300 fine-tuning runs, CausalMixFT improved median normalized ROC-AUC from 0.10 to 0.12, outperforming statistical generators (CTGAN, TabEBM, TableAugment) and reduced validation-test correlation gap from 0.67 to 0.30.
Conclusion: Incorporating causal structure into data augmentation provides an effective and principled approach for fine-tuning tabular foundation models in low-data regimes, enabling more reliable validation-based early stopping and improved fine-tuning stability.
Abstract: Fine-tuning tabular foundation models (TFMs) under data scarcity is challenging, as early stopping on even scarcer validation data often fails to capture true generalization performance. We propose CausalMixFT, a method that enhances fine-tuning robustness and downstream performance by generating structurally consistent synthetic samples using Structural Causal Models (SCMs) fitted on the target dataset. This approach augments limited real data with causally informed synthetic examples, preserving feature dependencies while expanding training diversity. Evaluated across 33 classification datasets from TabArena and over 2300 fine-tuning runs, our CausalMixFT method consistently improves median normalized ROC-AUC from 0.10 (standard fine-tuning) to 0.12, outperforming purely statistical generators such as CTGAN (-0.01), TabEBM (-0.04), and TableAugment (-0.09). Moreover, it narrows the median validation-test performance correlation gap from 0.67 to 0.30, enabling more reliable validation-based early stopping, a key step toward improving fine-tuning stability under data scarcity. These results demonstrate that incorporating causal structure into data augmentation provides an effective and principled route to fine-tuning tabular foundation models in low-data regimes.
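The augmentation step can be sketched with a toy linear-Gaussian SCM fitted along a known causal order: regress each feature on its predecessors, then resample through the fitted mechanisms so synthetic rows preserve the dependency structure. CausalMixFT's actual SCM class, and how it obtains the structure, are not specified in this summary.

```python
import numpy as np

def fit_linear_scm(X, order):
    """Fit a linear-Gaussian SCM along a given causal order: each feature
    is regressed on its predecessors (the order is assumed known here)."""
    n = X.shape[0]
    coefs, noise = [], []
    for j, v in enumerate(order):
        A = np.hstack([X[:, order[:j]], np.ones((n, 1))])
        w, *_ = np.linalg.lstsq(A, X[:, v], rcond=None)
        coefs.append(w)
        noise.append((X[:, v] - A @ w).std())
    return coefs, noise

def sample_scm(coefs, noise, order, m, rng):
    """Generate m synthetic rows by ancestral sampling through the SCM."""
    Xs = np.zeros((m, len(order)))
    for j, v in enumerate(order):
        A = np.hstack([Xs[:, order[:j]], np.ones((m, 1))])
        Xs[:, v] = A @ coefs[j] + rng.normal(0.0, noise[j], m)
    return Xs   # mix with real rows to augment the fine-tuning set
```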
[434] MORPHFED: Federated Learning for Cross-institutional Blood Morphology Analysis
Gabriel Ansah, Eden Ruffell, Delmiro Fernandez-Reyes, Petru Manescu
Main category: cs.LG
TL;DR: Federated learning framework for white blood cell morphology analysis that enables collaborative training across institutions without data sharing, addressing staining variability and privacy concerns in LMICs.
Details
Motivation: Automated blood morphology analysis in LMICs faces challenges from dataset shifts (staining variability, imaging differences, rare morphologies) and privacy/data-sharing restrictions that prevent building centralized datasets.
Method: Introduce a federated learning framework for white blood cell morphology analysis that enables collaborative training across multiple clinical sites without exchanging training data, using both convolutional and transformer-based architectures.
Result: Federated models achieve strong cross-site performance and improved generalization to unseen institutions compared to centralized training, learning robust domain-invariant representations while preserving complete data privacy.
Conclusion: Federated learning is a practical, privacy-preserving approach for developing equitable, scalable, and generalizable medical imaging AI in resource-limited healthcare environments.
Abstract: Automated blood morphology analysis can support hematological diagnostics in low- and middle-income countries (LMICs) but remains sensitive to dataset shifts from staining variability, imaging differences, and rare morphologies. Building centralized datasets to capture this diversity is often infeasible due to privacy regulations and data-sharing restrictions. We introduce a federated learning framework for white blood cell morphology analysis that enables collaborative training across institutions without exchanging training data. Using blood films from multiple clinical sites, our federated models learn robust, domain-invariant representations while preserving complete data privacy. Evaluations across convolutional and transformer-based architectures show that federated training achieves strong cross-site performance and improved generalization to unseen institutions compared to centralized training. These findings highlight federated learning as a practical and privacy-preserving approach for developing equitable, scalable, and generalizable medical imaging AI in resource-limited healthcare environments.
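At its core, the collaborative step is standard federated averaging over site-local weights, sketched below; MORPHFED's exact protocol (rounds, client sampling, any personalization) may differ.

```python
import torch

def fedavg(client_states, client_sizes):
    """One aggregation round: average each parameter across sites, weighted
    by local dataset size. Raw blood-film images never leave a site; only
    these state dicts travel to the server."""
    total = float(sum(client_sizes))
    avg = {k: torch.zeros_like(v, dtype=torch.float32)
           for k, v in client_states[0].items()}
    for state, n in zip(client_states, client_sizes):
        for k, v in state.items():
            avg[k] += v.float() * (n / total)
    return avg   # broadcast back to all sites for the next local round
```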
[435] Clinical Data Goes MEDS? Let’s OWL make sense of it
Alberto Marfoglia, Jong Ho Jhee, Adrien Coulet
Main category: cs.LG
TL;DR: MEDS-OWL is an OWL ontology that bridges the Medical Event Data Standard (MEDS) with Semantic Web technologies, enabling representation of clinical event data as FAIR-aligned RDF graphs.
Details
Motivation: Machine learning in healthcare faces interoperability and reproducibility challenges due to lack of standardized, semantically explicit data representations. While MEDS provides an event-centric data model, it lacks integration with Semantic Web ecosystems.
Method: Developed MEDS-OWL ontology with 13 classes, 10 object properties, 20 data properties, and 24 axioms. Created meds2rdf Python library to convert MEDS events to RDF graphs. Validated using synthetic clinical dataset for ruptured intracranial aneurysms with SHACL constraints.
Result: Successfully transformed MEDS data into FAIR-aligned RDF datasets. The ontology enables provenance-aware publishing and interoperability of event-based clinical data. Demonstrated on synthetic patient care pathway data.
Conclusion: MEDS-OWL bridges MEDS with Semantic Web, providing reusable semantic layer for event-based clinical data and establishing foundation for graph-based analytics, enhancing interoperability and reproducibility in healthcare ML workflows.
Abstract: The application of machine learning on healthcare data is often hindered by the lack of standardized and semantically explicit representation, leading to limited interoperability and reproducibility across datasets and experiments. The Medical Event Data Standard (MEDS) addresses these issues by introducing a minimal, event-centric data model designed for reproducible machine-learning workflows from health data. However, MEDS is defined as a data-format specification and does not natively provide integration with the Semantic Web ecosystem. In this article, we introduce MEDS-OWL, a lightweight OWL ontology that provides formal concepts and relations to enable representing MEDS datasets as RDF graphs. Additionally, we implemented meds2rdf, a Python conversion library that transforms MEDS events into RDF graphs, ensuring conformance with the ontology. We demonstrate the approach on a synthetic clinical dataset that describes patient care pathways for ruptured intracranial aneurysms and validate the resulting graph using SHACL constraints. The first release of MEDS-OWL comprises 13 classes, 10 object properties, 20 data properties, and 24 OWL axioms. Combined with meds2rdf, it enables data transformation into FAIR-aligned datasets, provenance-aware publishing, and interoperability of event-based clinical data. By bridging MEDS with the Semantic Web, this work contributes a reusable semantic layer for event-based clinical data and establishes a robust foundation for subsequent graph-based analytics.
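For a feel of what a meds2rdf-style conversion produces, here is a toy rdflib snippet; the namespace, class, and property names are invented for illustration and are not the released MEDS-OWL vocabulary.

```python
from rdflib import Graph, Literal, Namespace, RDF
from rdflib.namespace import XSD

# Hypothetical namespace and terms; NOT the actual MEDS-OWL vocabulary.
MEDS = Namespace("http://example.org/meds-owl#")

g = Graph()
event = MEDS["event-123"]
g.add((event, RDF.type, MEDS.MedicalEvent))
g.add((event, MEDS.hasSubject, MEDS["patient-42"]))
g.add((event, MEDS.hasCode, Literal("ICD10:I60.7")))  # ruptured aneurysm
g.add((event, MEDS.hasTime, Literal("2024-03-01T08:30:00", datatype=XSD.dateTime)))

print(g.serialize(format="turtle"))  # FAIR-aligned, queryable representation
```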
[436] Agentic Rubrics as Contextual Verifiers for SWE Agents
Mohit Raghavendra, Anisha Gunjal, Bing Liu, Yunzhong He
Main category: cs.LG
TL;DR: Agentic Rubrics: Expert agents create context-grounded rubric checklists for software patches, enabling scalable verification without test execution, achieving significant gains over baselines on SWE-Bench.
Details
Motivation: Current verification methods for software engineering agents rely on code execution (hard to scale) or less-grounded alternatives like patch classifiers and heuristics. There's a need for scalable, context-grounded verification that doesn't require test execution.
Method: Agentic Rubrics: An expert agent interacts with the repository to create a context-grounded rubric checklist. Candidate patches are then scored against this rubric without requiring test execution, providing a scalable verification signal.
Result: Achieved 54.2% on Qwen3-Coder-30B-A3B and 40.6% on Qwen3-32B on SWE-Bench Verified under parallel TTS evaluation, with at least +3.5 percentage-point gain over strongest baselines. Rubric scores align with ground-truth tests while also flagging issues tests miss.
Conclusion: Agentic Rubrics provide an efficient, scalable, and granular verification signal for SWE agents, with agentic context gathering being essential for producing codebase-specific, unambiguous criteria.
Abstract: Verification is critical for improving agents: it provides the reward signal for Reinforcement Learning and enables inference-time gains through Test-Time Scaling (TTS). Despite its importance, verification in software engineering (SWE) agent settings often relies on code execution, which can be difficult to scale due to environment setup overhead. Scalable alternatives such as patch classifiers and heuristic methods exist, but they are less grounded in codebase context and harder to interpret. To this end, we explore Agentic Rubrics: an expert agent interacts with the repository to create a context-grounded rubric checklist, and candidate patches are then scored against it without requiring test execution. On SWE-Bench Verified under parallel TTS evaluation, Agentic Rubrics achieve a score of 54.2% on Qwen3-Coder-30B-A3B and 40.6% on Qwen3-32B, with at least a +3.5 percentage-point gain over the strongest baseline in our comparison set. We further analyze rubric behavior, showing that rubric scores are consistent with ground-truth tests while also flagging issues that tests do not capture. Our ablations show that agentic context gathering is essential for producing codebase-specific, unambiguous criteria. Together, these results suggest that Agentic Rubrics provide an efficient, scalable, and granular verification signal for SWE agents.
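The scoring step reduces to grading a patch against checklist items, as in the sketch below; `judge` stands in for an LLM call returning 0 or 1 per criterion, and all names are illustrative rather than the paper's interface.

```python
def score_patch(patch: str, rubric: list[str], judge) -> float:
    """Fraction of rubric criteria a candidate patch satisfies.

    judge(criterion, patch) is a hypothetical LLM-backed check returning
    1 if the patch meets the criterion, else 0. No tests are executed.
    """
    return sum(judge(c, patch) for c in rubric) / len(rubric)

def select_best(candidates, rubric, judge):
    # Best-of-n selection under parallel test-time scaling.
    return max(candidates, key=lambda p: score_patch(p, rubric, judge))
```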
[437] Robust Physics Discovery from Highly Corrupted Data: A PINN Framework Applied to the Nonlinear Schrödinger Equation
Pietro de Oliveira Esteves
Main category: cs.LG
TL;DR: PINNs with automatic differentiation recover NLSE physical parameters from noisy data with <0.2% error, outperforming traditional methods.
Details
Motivation: Traditional finite difference methods fail under severe noise conditions due to noise amplification in numerical derivatives, especially when experimental data is scarce and noisy in spatiotemporal dynamics.
Method: Physics-Informed Neural Networks (PINNs) integrated with automatic differentiation to recover physical parameters from the Nonlinear Schrödinger Equation using only sparse, randomly sampled noisy data points.
Result: Achieved <0.2% relative error for nonlinear coefficient beta using 500 data points with 20% Gaussian noise, with consistent sub-1% accuracy across different regimes (beta 0.5-2.0) and data sizes (100-1000 points). Robust with <0.15% standard deviation, runs in ~80 minutes on Tesla T4 GPU.
Conclusion: Physics-based regularization effectively filters high measurement uncertainty, positioning PINNs as viable alternative to traditional optimization for inverse problems in spatiotemporal dynamics with scarce, noisy experimental data.
Abstract: We demonstrate a deep learning framework capable of recovering physical parameters from the Nonlinear Schrödinger Equation (NLSE) under severe noise conditions. By integrating Physics-Informed Neural Networks (PINNs) with automatic differentiation, we achieve reconstruction of the nonlinear coefficient beta with less than 0.2 percent relative error using only 500 sparse, randomly sampled data points corrupted by 20 percent additive Gaussian noise, a regime where traditional finite difference methods typically fail due to noise amplification in numerical derivatives. We validate the method’s generalization capabilities across different physical regimes (beta between 0.5 and 2.0) and varying data availability (between 100 and 1000 training points), demonstrating consistent sub-1 percent accuracy. Statistical analysis over multiple independent runs confirms robustness (standard deviation less than 0.15 percent for beta equals 1.0). The complete pipeline executes in approximately 80 minutes on modest cloud GPU resources (NVIDIA Tesla T4), making the approach accessible for widespread adoption. Our results indicate that physics-based regularization acts as an effective filter against high measurement uncertainty, positioning PINNs as a viable alternative to traditional optimization methods for inverse problems in spatiotemporal dynamics where experimental data is scarce and noisy. All code is made publicly available to facilitate reproducibility.
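For concreteness, the sketch below shows the core PINN mechanics: the unknown coefficient is a trainable parameter, and the PDE residual is computed by automatic differentiation. It assumes the NLSE in the form i*psi_t + 0.5*psi_xx + beta*|psi|^2*psi = 0 with psi = u + iv; network sizes and loss weights here are placeholders, not the paper's configuration.

```python
import torch

net = torch.nn.Sequential(
    torch.nn.Linear(2, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, 2),  # outputs (u, v) = (Re psi, Im psi)
)
beta = torch.nn.Parameter(torch.tensor(1.0))  # unknown coefficient to recover

def pde_loss(x, t):
    """PDE residual at collocation points (x, t), via autograd."""
    xt = torch.stack([x, t], dim=-1).requires_grad_(True)
    u, v = net(xt).unbind(-1)
    g = lambda f: torch.autograd.grad(f.sum(), xt, create_graph=True)[0]
    u_x, u_t = g(u).unbind(-1)
    v_x, v_t = g(v).unbind(-1)
    u_xx, v_xx = g(u_x)[..., 0], g(v_x)[..., 0]
    mag2 = u**2 + v**2
    # Real and imaginary parts of i*psi_t + 0.5*psi_xx + beta*|psi|^2*psi.
    res_re = -v_t + 0.5 * u_xx + beta * mag2 * u
    res_im = u_t + 0.5 * v_xx + beta * mag2 * v
    return (res_re**2 + res_im**2).mean()

# Total loss = pde_loss on collocation points + MSE against the noisy
# measurements; beta is optimized jointly with the network weights.
```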
[438] Lightweight Test-Time Adaptation for EMG-Based Gesture Recognition
Nia Touko, Matthew O A Ellis, Cristiano Capone, Alessio Burrello, Elisa Donati, Luca Manneschi
Main category: cs.LG
TL;DR: Lightweight test-time adaptation framework for EMG signal drift using TCN backbone with three deployment-ready strategies for robust long-term prosthetic control.
Details
Motivation: Surface EMG signals suffer from drift due to electrode shifts, muscle fatigue, and posture changes, causing performance degradation in existing models. Current solutions require large datasets or high-compute pipelines impractical for energy-efficient wearables.
Method: Proposes a lightweight TTA framework with TCN backbone and three strategies: (1) causal adaptive batch normalization for real-time statistical alignment, (2) GMM alignment with experience replay to prevent forgetting, and (3) meta-learning for rapid few-shot calibration.
Result: Framework significantly bridges inter-session accuracy gap with minimal overhead on NinaPro DB6 dataset. Experience-replay updates yield superior stability under limited data, while meta-learning achieves competitive performance in one- and two-shot regimes with a fraction of the data required by current benchmarks.
Conclusion: Establishes path toward robust “plug-and-play” myoelectric control for long-term prosthetic use with lightweight, deployment-ready adaptation strategies.
Abstract: Reliable long-term decoding of surface electromyography (EMG) is hindered by signal drift caused by electrode shifts, muscle fatigue, and posture changes. While state-of-the-art models achieve high intra-session accuracy, their performance often degrades sharply. Existing solutions typically demand large datasets or high-compute pipelines that are impractical for energy-efficient wearables. We propose a lightweight framework for Test-Time Adaptation (TTA) using a Temporal Convolutional Network (TCN) backbone. We introduce three deployment-ready strategies: (i) causal adaptive batch normalization for real-time statistical alignment; (ii) a Gaussian Mixture Model (GMM) alignment with experience replay to prevent forgetting; and (iii) meta-learning for rapid, few-shot calibration. Evaluated on the NinaPro DB6 multi-session dataset, our framework significantly bridges the inter-session accuracy gap with minimal overhead. Our results show that experience-replay updates yield superior stability under limited data, while meta-learning achieves competitive performance in one- and two-shot regimes using only a fraction of the data required by current benchmarks. This work establishes a path toward robust, “plug-and-play” myoelectric control for long-term prosthetic use.
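Of the three strategies, the normalization variant is the simplest to picture: feature statistics are tracked causally from past test windows only and used to re-normalize activations. A minimal sketch, with the momentum, shapes, and update rule all assumed:

```python
import torch

class CausalAdaptiveBN:
    """Test-time normalization using only past EMG windows (sketch)."""

    def __init__(self, num_features, momentum=0.01, eps=1e-5):
        self.mean = torch.zeros(num_features)
        self.var = torch.ones(num_features)
        self.momentum = momentum
        self.eps = eps

    def __call__(self, x):  # x: (batch, num_features, time)
        # Statistics from the current window; no future samples are used.
        m = x.mean(dim=(0, 2))
        v = x.var(dim=(0, 2), unbiased=False)
        # Exponential moving average tracks slow inter-session drift.
        self.mean = (1 - self.momentum) * self.mean + self.momentum * m
        self.var = (1 - self.momentum) * self.var + self.momentum * v
        return (x - self.mean[None, :, None]) / (
            self.var[None, :, None] + self.eps
        ).sqrt()
```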
[439] Practitioner Motives to Use Different Hyperparameter Optimization Methods
Niclas Kannengießer, Niklas Hasebrook, Felix Morsbach, Marc-André Zöller, Jörg Franke, Marius Lindauer, Frank Hutter, Ali Sunyaev
Main category: cs.LG
TL;DR: Study investigates why practitioners choose less efficient hyperparameter optimization methods like grid search over more efficient ones like Bayesian optimization, revealing practitioner motives and contextual factors.
Details
Motivation: Despite programmatic HPO methods being more sample-efficient, practitioners often use less efficient methods like grid search, leading to under-optimized models. The research aims to understand practitioner-specific motives to improve user-centered HPO tool development.
Method: Conducted 20 semi-structured interviews and an online survey with 49 ML experts to uncover practitioner motives for selecting different HPO methods.
Result: Identified main goals (e.g., increasing ML model understanding) and contextual factors (e.g., available computer resources) that affect practitioners’ selection of HPO methods.
Conclusion: Provides conceptual foundation for understanding why practitioners use different HPO methods, supporting development of more user-centered and context-adaptive HPO tools in automated ML.
Abstract: Programmatic hyperparameter optimization (HPO) methods, such as Bayesian optimization and evolutionary algorithms, are highly sample-efficient in identifying optimal hyperparameter configurations for machine learning (ML) models. However, practitioners frequently use less efficient methods, such as grid search, which can lead to under-optimized models. We suspect this behavior is driven by a range of practitioner-specific motives. Practitioner motives, however, still need to be clarified to enhance user-centered development of HPO tools. To uncover practitioner motives to use different HPO methods, we conducted 20 semi-structured interviews and an online survey with 49 ML experts. By presenting main goals (e.g., increase ML model understanding) and contextual factors affecting practitioners’ selection of HPO methods (e.g., available computer resources), this study offers a conceptual foundation to better understand why practitioners use different HPO methods, supporting development of more user-centered and context-adaptive HPO tools in automated ML.
[440] Discovering the Representation Bottleneck of Graph Neural Networks
Fang Wu, Siyuan Li, Stan Z. Li
Main category: cs.LG
TL;DR: GNNs suffer from a “representation bottleneck” where they fail to capture optimal node interaction complexity for different tasks, caused by inductive biases in graph construction. The paper proposes a dynamic graph rewiring approach to adjust node receptive fields based on learned interaction patterns.
Details
Motivation: Different graph learning tasks require different ranges and complexities of node interactions, but GNNs often fail to capture the most informative interaction styles due to limitations in existing graph construction mechanisms, creating a "representation bottleneck."
Method: Proposes a novel graph rewiring approach that dynamically adjusts each node's receptive fields based on interaction patterns learned by GNNs, allowing the model to adapt to appropriate interaction complexity for different tasks.
Result: Extensive experiments on real-world and synthetic datasets show the method effectively alleviates the representation bottleneck and outperforms state-of-the-art graph rewiring baselines in enhancing GNN performance.
Conclusion: The proposed dynamic graph rewiring approach successfully addresses GNNs’ representation bottleneck by enabling adaptive adjustment of node receptive fields, leading to improved performance across diverse graph learning tasks.
Abstract: Graph neural networks (GNNs) rely mainly on the message-passing paradigm to propagate node features and build interactions, and different graph learning problems require different ranges of node interactions. In this work, we explore the capacity of GNNs to capture node interactions under contexts of different complexities. We discover that GNNs usually fail to capture the most informative kinds of interaction styles for diverse graph learning tasks, and thus name this phenomenon GNNs’ representation bottleneck. As a response, we demonstrate that the inductive bias introduced by existing graph construction mechanisms can result in this representation bottleneck, i.e., preventing GNNs from learning interactions of the most appropriate complexity. To address that limitation, we propose a novel graph rewiring approach based on interaction patterns learned by GNNs to dynamically adjust each node’s receptive fields. Extensive experiments on both real-world and synthetic datasets prove the effectiveness of our algorithm in alleviating the representation bottleneck and its superiority in enhancing the performance of GNNs over state-of-the-art graph rewiring baselines.
[441] Instructor-inspired Machine Learning for Robust Molecular Property Prediction
Fang Wu, Shuting Jin, Siyuan Li, Stan Z. Li
Main category: cs.LG
TL;DR: InstructMol is an instructive learning algorithm that measures pseudo-label reliability to help target models leverage large-scale unlabeled biochemical data without requiring domain transfer knowledge.
Details
Motivation: Machine learning in chemical/biological science faces data sparsity challenges due to labor-intensive annotation of biochemical data, creating a need for methods that can effectively utilize unlabeled data.
Method: InstructMol uses instructive learning to measure pseudo-labels' reliability and helps target models leverage large-scale unlabeled data without requiring knowledge transfer between domains, avoiding pretraining-finetuning gaps.
Result: Demonstrated high accuracy on several real-world molecular datasets and out-of-distribution (OOD) benchmarks.
Conclusion: InstructMol provides an effective solution for data sparsity in biochemical ML by enabling reliable use of unlabeled data without domain transfer requirements, with code publicly available.
Abstract: Machine learning catalyzes a revolution in chemical and biological science. However, its efficacy heavily depends on the availability of labeled data, and annotating biochemical data is extremely laborious. To surmount this data sparsity challenge, we present an instructive learning algorithm named InstructMol to measure pseudo-labels’ reliability and help the target model leverage large-scale unlabeled data. InstructMol does not require transferring knowledge between multiple domains, which avoids the potential gap between the pretraining and fine-tuning stages. We demonstrated the high accuracy of InstructMol on several real-world molecular datasets and out-of-distribution (OOD) benchmarks. Code is available at https://github.com/smiles724/InstructMol.
[442] Tipping Point Forecasting in Non-Stationary Dynamics on Function Spaces
Miguel Liu-Schiaffini, Clare E. Singer, Nikola Kovachki, Sze Chai Leung, Tapio Schneider, Hyunji Jane Bae, Kamyar Azizzadenesheli, Anima Anandkumar
Main category: cs.LG
TL;DR: A novel recurrent neural operator (RNO) learns non-stationary dynamical systems and uses conformal prediction with physics constraints to forecast tipping points with uncertainty quantification.
Details
Motivation: Tipping points represent abrupt, irreversible changes in chaotic systems (like climate change effects), but forecasting them is challenging due to non-stationarity and limited pre-tipping data.
Method: Develop recurrent neural operator (RNO) to learn function space mappings from pre-tipping dynamics. Use conformal prediction framework to monitor deviations from physics constraints (conserved quantities, PDEs) for tipping point detection with uncertainty quantification.
Result: Method successfully forecasts tipping points in Lorenz-63, Kuramoto-Sivashinsky equations, climate stratocumulus cloud cover, and airfoil wake/stall transitions. Shows zero-shot generalization to multiple tipping points under varying Reynolds numbers.
Conclusion: Even partial or approximate physics constraints enable accurate tipping point forecasting. The RNO with conformal prediction provides rigorous uncertainty quantification for early warning of abrupt system changes.
Abstract: Tipping points are abrupt, drastic, and often irreversible changes in the evolution of non-stationary and chaotic dynamical systems. For instance, increased greenhouse gas concentrations are predicted to lead to drastic decreases in low cloud cover, referred to as a climatological tipping point. In this paper, we learn the evolution of such non-stationary dynamical systems using a novel recurrent neural operator (RNO), which learns mappings between function spaces. After training RNO on only the pre-tipping dynamics, we employ it to detect future tipping points using an uncertainty-based approach. In particular, we propose a conformal prediction framework to forecast tipping points by monitoring deviations from physics constraints (such as conserved quantities and partial differential equations), enabling forecasting of these abrupt changes along with a rigorous measure of uncertainty. We illustrate our proposed methodology on non-stationary ordinary and partial differential equations, such as the Lorenz-63 and Kuramoto-Sivashinsky equations. We also apply our methods to forecast a climate tipping point in stratocumulus cloud cover and airfoil wake and stall transitions using only limited knowledge of the governing equations. For the latter, we show that our proposed method zero-shot generalizes to forecasting multiple future tipping points under varying Reynolds numbers. In our experiments, we demonstrate that even partial or approximate physics constraints can be used to accurately forecast future tipping points.
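The conformal step can be stated in a few lines: collect physics-residual scores on held-out pre-tipping data, then flag a forecast whose residual exceeds the calibrated quantile. A sketch under those assumptions (the paper's exact score function and correction may differ):

```python
import numpy as np

def conformal_threshold(calib_residuals, alpha=0.05):
    """Finite-sample (1 - alpha) conformal quantile of calibration scores."""
    s = np.sort(np.asarray(calib_residuals))
    n = len(s)
    k = int(np.ceil((n + 1) * (1 - alpha)))  # +1 finite-sample correction
    return s[min(k, n) - 1]

# residual(state) would measure the violation of a conserved quantity or
# PDE on the RNO forecast; a tipping point is flagged when
#   residual(forecast) > conformal_threshold(calibration_residuals),
# which gives the flag a rigorous coverage guarantee.
```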
[443] Graph Reinforcement Learning for Power Grids: A Comprehensive Survey
Mohamed Hassouna, Clara Holzhüter, Pawel Lytaev, Josephine Thomas, Bernhard Sick, Christoph Scholz
Main category: cs.LG
TL;DR: Review paper analyzing how Graph Reinforcement Learning (Graph RL) can improve power grid control by combining Graph Neural Networks with Reinforcement Learning for better representation learning and decision-making.
Details
Motivation: Increasing renewable energy share and distributed generation require more flexible grid control approaches beyond traditional methods. Graph RL offers promise due to its ability to handle graph-structured data and learn control strategies.
Method: Review analysis of Graph RL approaches for power grids, examining three key aspects: graph structure representation, Graph Neural Network architectures, and Reinforcement Learning methods used in existing research.
Result: Graph RL shows adaptability to unpredictable events and noisy data, but current stage is primarily proof-of-concept. Not yet deployable for real-world applications due to open challenges and limitations.
Conclusion: While Graph Reinforcement Learning demonstrates potential for power grid control, significant research gaps remain before practical deployment. The paper identifies key challenges that need to be addressed for real-world implementation.
Abstract: The increasing share of renewable energy and distributed electricity generation requires the development of deep learning approaches to address the lack of flexibility inherent in traditional power grid methods. In this context, Graph Neural Networks are a promising solution due to their ability to learn from graph-structured data. Combined with Reinforcement Learning, they can be used as control approaches to determine remedial actions. This review analyses how Graph Reinforcement Learning can improve representation learning and decision-making in power grid applications, particularly transmission and distribution grids. We analyze the reviewed approaches in terms of the graph structure, the Graph Neural Network architecture, and the Reinforcement Learning approach. Although Graph Reinforcement Learning has demonstrated adaptability to unpredictable events and noisy data, its current stage is primarily proof-of-concept, and it is not yet deployable to real-world applications. We highlight the open challenges and limitations for real-world applications.
[444] Federated Clustering: An Unsupervised Cluster-Wise Training for Decentralized Data Distributions
Mirko Nardi, Lorenzo Valerio, Andrea Passarella
Main category: cs.LG
TL;DR: FedCRef is an unsupervised federated learning method for identifying all data categories across clients in label-free, non-uniform distributions through federated clustering and collaborative model refinement.
Details
Motivation: FL is mainly explored in supervised settings, but its potential in unsupervised scenarios is underexplored. There's a need for methods that can identify complete data categories across multiple clients with privacy constraints and non-uniform data distributions without labels.
Method: FedCRef (Federated Cluster-Wise Refinement): Clients with diverse local data distributions train models on their clusters to generate compressed representations. Local models are shared and compared via reconstruction error analysis to form federated groups. Within groups, clients collaboratively train shared models for each data distribution while iteratively refining their local clusters.
Result: The approach successfully identifies all potential data distributions across the network and develops robust representation models for each. Experiments on EMNIST and KMNIST datasets show FedCRef’s ability to refine and align cluster models with actual data distributions, significantly improving data representation precision in unsupervised federated settings.
Conclusion: FedCRef provides an effective distributed solution for unsupervised federated learning that outperforms traditional centralized methods, enabling accurate identification of global data categories while maintaining data privacy in decentralized environments.
Abstract: Federated Learning (FL) is a pivotal approach in decentralized machine learning, especially when data privacy is crucial and direct data sharing is impractical. While FL is typically associated with supervised learning, its potential in unsupervised scenarios is underexplored. This paper introduces a novel unsupervised federated learning methodology designed to identify the complete set of categories (global K) across multiple clients within label-free, non-uniform data distributions, a process known as Federated Clustering. Our approach, Federated Cluster-Wise Refinement (FedCRef), involves clients that collaboratively train models on clusters with similar data distributions. Initially, clients with diverse local data distributions (local K) train models on their clusters to generate compressed data representations. These local models are then shared across the network, enabling clients to compare them through reconstruction error analysis, leading to the formation of federated groups. In these groups, clients collaboratively train a shared model representing each data distribution, while continuously refining their local clusters to enhance data association accuracy. This iterative process allows our system to identify all potential data distributions across the network and develop robust representation models for each. To validate our approach, we compare it with traditional centralized methods, establishing a performance baseline and showcasing the advantages of our distributed solution. We also conduct experiments on the EMNIST and KMNIST datasets, demonstrating FedCRef’s ability to refine and align cluster models with actual data distributions, significantly improving data representation precision in unsupervised federated settings.
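The grouping rule rests on reconstruction error: if a model trained on another client's cluster reconstructs local data nearly as well as the local model does, the two clusters plausibly share a distribution. A toy sketch of that comparison, where the `reconstruct` interface and the tolerance are assumptions:

```python
import numpy as np

def reconstruction_error(model, X):
    """Mean squared reconstruction error of a cluster model on data X."""
    return float(np.mean((model.reconstruct(X) - X) ** 2))

def same_distribution(remote_model, local_model, X_local, tol=1.5):
    """Group two clusters if the remote model reconstructs local data
    within `tol` times the local model's own error (illustrative rule)."""
    local_err = reconstruction_error(local_model, X_local)
    return reconstruction_error(remote_model, X_local) <= tol * local_err
```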
[445] CktGen: Automated Analog Circuit Design with Generative Artificial Intelligence
Yuxuan Hou, Hehe Fan, Jianrong Zhang, Yue Zhang, Hua Chen, Min Zhou, Faxin Yu, Roger Zimmermann, Yi Yang
Main category: cs.LG
TL;DR: CktGen: A variational autoencoder approach for specification-conditioned analog circuit generation that handles one-to-many relationships between specifications and valid circuits.
Details
Motivation: Existing analog circuit synthesis methods treat the problem as single-objective optimization, ignoring that design specifications vary widely across applications. The goal is to leverage existing well-designed circuits to improve automation in analog circuit design.
Method: CktGen uses a variational autoencoder that maps discretized specifications and circuits into a joint latent space. It decouples encoding of circuits and specifications, aligns their latent spaces, uses contrastive training with filter masks, and employs classifier guidance with latent feature alignment to handle one-to-many relationships.
Result: Experimental results on open circuit benchmarks show CktGen achieves substantial improvements over state-of-the-art methods, with new metrics introduced to evaluate cross-model consistency.
Conclusion: The proposed specification-conditioned analog circuit generation approach effectively handles the one-to-many mapping problem and demonstrates superior performance in generating circuits that meet target specifications.
Abstract: The automatic synthesis of analog circuits presents significant challenges. Most existing approaches formulate the problem as a single-objective optimization task, overlooking that design specifications for a given circuit type vary widely across applications. To address this, we introduce specification-conditioned analog circuit generation, a task that directly generates analog circuits based on target specifications. The motivation is to leverage existing well-designed circuits to improve automation in analog circuit design. Specifically, we propose CktGen, a simple yet effective variational autoencoder that maps discretized specifications and circuits into a joint latent space and reconstructs the circuit from that latent vector. Notably, as a single specification may correspond to multiple valid circuits, naively fusing specification information into the generative model does not capture these one-to-many relationships. To address this, we decouple the encoding of circuits and specifications and align their mapped latent space. Then, we employ contrastive training with a filter mask to maximize differences between encoded circuits and specifications. Furthermore, classifier guidance along with latent feature alignment promotes the clustering of circuits sharing the same specification, avoiding model collapse into trivial one-to-one mappings. By canonicalizing the latent space with respect to specifications, we can search for an optimal circuit that meets valid target specifications. We conduct comprehensive experiments on the open circuit benchmark and introduce metrics to evaluate cross-model consistency. Experimental results demonstrate that CktGen achieves substantial improvements over state-of-the-art methods.
[446] An Overview of Prototype Formulations for Interpretable Deep Learning
Maximilian Xiling Li, Korbinian Franz Rudolf, Paul Mattes, Nils Blank, Rudolf Lioutikov
Main category: cs.LG
TL;DR: HyperPG introduces probabilistic hyperspherical prototypes that outperform Euclidean prototypes on fine-grained classification datasets with simpler training requirements.
Details
Motivation: To provide interpretable alternatives to black-box deep learning models through prototypical part networks, and to comprehensively analyze different prototype formulations to identify more effective and training-friendly approaches.
Method: Introduces HyperPG, a probabilistic prototype representation using Gaussian distributions on hyperspheres. Compares point-based and probabilistic approaches in both Euclidean and hyperspherical latent spaces across multiple datasets.
Result: Hyperspherical prototypes outperform standard Euclidean formulations on CUB-200-2011, Stanford Cars, and Oxford Flowers datasets. Hyperspherical prototypes maintain competitive performance with simplified training schemes, while Euclidean prototypes require extensive hyperparameter tuning.
Conclusion: Hyperspherical prototype representations offer superior performance and training efficiency compared to Euclidean formulations, making them more practical for interpretable deep learning models while maintaining competitive accuracy.
Abstract: Prototypical part networks offer interpretable alternatives to black-box deep learning models by learning visual prototypes for classification. This work provides a comprehensive analysis of prototype formulations, comparing point-based and probabilistic approaches in both Euclidean and hyperspherical latent spaces. We introduce HyperPG, a probabilistic prototype representation using Gaussian distributions on hyperspheres. Experiments on CUB-200-2011, Stanford Cars, and Oxford Flowers datasets show that hyperspherical prototypes outperform standard Euclidean formulations. Critically, hyperspherical prototypes maintain competitive performance under simplified training schemes, while Euclidean prototypes require extensive hyperparameter tuning.
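To illustrate what a Gaussian-on-the-hypersphere prototype can look like operationally, the sketch below scores features against a prototype direction with a learned concentration parameter; the exact HyperPG parameterization is not reproduced here.

```python
import torch
import torch.nn.functional as F

def hyperspherical_score(features, proto_mean, kappa):
    """Similarity of unit-normalized features to a prototype direction.

    kappa is a learned concentration: large kappa gives a tight prototype,
    small kappa a diffuse one (illustrative form, not the paper's exact one).
    """
    z = F.normalize(features, dim=-1)       # project onto the unit sphere
    mu = F.normalize(proto_mean, dim=-1)
    cos = z @ mu                            # angular agreement in [-1, 1]
    return torch.exp(-kappa * (1.0 - cos))  # peaks when z aligns with mu
```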
[447] FedDUAL: A Dual-Strategy with Adaptive Loss and Dynamic Aggregation for Mitigating Data Heterogeneity in Federated Learning
Pranab Sahoo, Ashutosh Tripathi, Sriparna Saha, Samrat Mondal
Main category: cs.LG
TL;DR: The paper proposes a dual-strategy approach to address label skew challenges in Federated Learning, featuring adaptive client loss functions and dynamic server aggregation for improved performance and convergence.
Details
Motivation: Federated Learning faces significant challenges with data heterogeneity, particularly label skew in image classification, leading to performance degradation, slower convergence, and reduced model robustness despite preserving data privacy.
Method: 1) Adaptive loss function for client training that preserves previous knowledge while balancing local optimization and global coherence. 2) Dynamic aggregation strategy at server that adapts to each client’s unique learning patterns to handle diverse data distributions.
Result: Comprehensive evaluation across three diverse real-world datasets with theoretical convergence guarantees demonstrates superior efficacy compared to several established state-of-the-art approaches.
Conclusion: The proposed dual-strategy approach effectively addresses label skew challenges in Federated Learning, improving performance, convergence, and robustness while maintaining data privacy benefits.
Abstract: Federated Learning (FL) marks a transformative approach to distributed model training by combining locally optimized models from various clients into a unified global model. While FL preserves data privacy by eliminating centralized storage, it encounters significant challenges such as performance degradation, slower convergence, and reduced robustness of the global model due to the heterogeneity in client data distributions. Among the various forms of data heterogeneity, label skew emerges as a particularly formidable and prevalent issue, especially in domains such as image classification. To address these challenges, we begin with comprehensive experiments to pinpoint the underlying issues in the FL training process. Based on our findings, we then introduce an innovative dual-strategy approach designed to effectively resolve these issues. First, we introduce an adaptive loss function for client-side training, meticulously crafted to preserve previously acquired knowledge while maintaining an optimal equilibrium between local optimization and global model coherence. Secondly, we develop a dynamic aggregation strategy for aggregating client models at the server. This approach adapts to each client’s unique learning patterns, effectively addressing the challenges of diverse data across the network. Our comprehensive evaluation, conducted across three diverse real-world datasets, coupled with theoretical convergence guarantees, demonstrates the superior efficacy of our method compared to several established state-of-the-art approaches.
[448] Context-Alignment: Activating and Enhancing LLM Capabilities in Time Series
Yuxiao Hu, Qian Li, Dongxiao Zhang, Jinyue Yan, Yuntian Chen
Main category: cs.LG
TL;DR: Proposes Context-Alignment (CA) paradigm to align time series data with linguistic components in LLMs’ familiar environments, enabling LLMs to better understand and process time series through structural and logical alignment using Dual-Scale Context-Alignment GNNs.
Details
Motivation: Existing methods for using LLMs on time series tasks focus on token-level alignment but overlook LLMs' core strength in understanding linguistic logic and structure. Need to activate LLMs' capabilities by aligning TS data with linguistic contexts they're familiar with.
Method: Context-Alignment paradigm with structural alignment (dual-scale nodes for hierarchical structure) and logical alignment (directed edges for logical relationships) via DSCA-GNNs. FSCA instantiation integrates this into pre-trained LLMs through few-shot prompting to enhance logic/structure awareness.
Result: Extensive experiments show effectiveness of FSCA and importance of Context-Alignment across tasks, particularly in few-shot and zero-shot forecasting. Confirms that Context-Alignment provides powerful prior knowledge on context.
Conclusion: Context-Alignment paradigm successfully activates LLMs’ capabilities for time series tasks by aligning TS data with linguistic environments, enabling better comprehension through structural and logical alignment. FSCA demonstrates flexible integration into LLMs to enhance performance.
Abstract: Recently, leveraging pre-trained Large Language Models (LLMs) for time series (TS) tasks has gained increasing attention, which involves activating and enhancing LLMs’ capabilities. Many methods aim to activate LLMs’ capabilities based on token-level alignment, but overlook LLMs’ inherent strength in natural language processing: their deep understanding of linguistic logic and structure rather than superficial embedding processing. We propose Context-Alignment (CA), a new paradigm that aligns TS with a linguistic component in the language environments familiar to LLMs to enable LLMs to contextualize and comprehend TS data, thereby activating their capabilities. Specifically, such context-level alignment comprises structural alignment and logical alignment, which is achieved by Dual-Scale Context-Alignment GNNs (DSCA-GNNs) applied to TS-language multimodal inputs. Structural alignment utilizes dual-scale nodes to describe hierarchical structure in TS-language, enabling LLMs to treat long TS data as a whole linguistic component while preserving intrinsic token features. Logical alignment uses directed edges to guide logical relationships, ensuring coherence in the contextual semantics. Following the DSCA-GNNs framework, we propose an instantiation method of CA, termed Few-Shot prompting Context-Alignment (FSCA), to enhance the capabilities of pre-trained LLMs in handling TS tasks. FSCA can be flexibly and repeatedly integrated into various layers of pre-trained LLMs to improve awareness of logic and structure, thereby enhancing performance. Extensive experiments show the effectiveness of FSCA and the importance of Context-Alignment across tasks, particularly in few-shot and zero-shot forecasting, confirming that Context-Alignment provides powerful prior knowledge on context. The code is open-sourced at https://github.com/tokaka22/ICLR25-FSCA.
[449] EquiTabPFN: A Target-Permutation Equivariant Prior Fitted Networks
Michael Arbel, David Salinas, Frank Hutter
Main category: cs.LG
TL;DR: A target-equivariant architecture for tabular data models that eliminates the “equivariance gap” caused by permutation sensitivity in target dimensions, improving stability and performance on multi-class tasks.
Details
Motivation: Existing tabular foundation models like TabPFN have limitations with target dimension ordering: they lack target equivariance, meaning permuting target dimensions changes predictions, creating an irreducible "equivariance gap" that causes prediction instability.
Method: Design a fully target-equivariant architecture using equivariant encoders, decoders, and a bi-attention mechanism to ensure permutation invariance in target dimensions.
Result: On datasets with more classes than seen during pre-training, the model matches or surpasses existing methods while having lower computational overhead.
Conclusion: Target equivariance is crucial for tabular models, and eliminating the equivariance gap improves stability and performance, especially for multi-class classification tasks beyond pre-training scope.
Abstract: Recent foundational models for tabular data, such as TabPFN, excel at adapting to new tasks via in-context learning, but remain constrained to a fixed, pre-defined number of target dimensions, often necessitating costly ensembling strategies. We trace this constraint to a deeper architectural shortcoming: these models lack target equivariance, so that permuting target dimension orderings alters their predictions. This deficiency gives rise to an irreducible “equivariance gap”, an error term that introduces instability in predictions. We eliminate this gap by designing a fully target-equivariant architecture, ensuring permutation invariance via equivariant encoders, decoders, and a bi-attention mechanism. Empirical evaluation on standard classification benchmarks shows that, on datasets with more classes than those seen during pre-training, our model matches or surpasses existing methods while incurring lower computational overhead.
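Target equivariance has a compact mechanical reading: apply the same map to every class slot and let slots interact only through operations that commute with permutation, such as attention without positional bias. A minimal sketch of such a decoder head, with dimensions chosen purely for illustration:

```python
import torch

attn = torch.nn.MultiheadAttention(embed_dim=32, num_heads=4, batch_first=True)
head = torch.nn.Linear(32, 1)

def equivariant_decode(class_states):
    """class_states: (batch, n_classes, 32), one embedding per target slot.

    Self-attention with no positional encoding commutes with slot
    permutations, so reordering the classes reorders the logits identically.
    """
    h, _ = attn(class_states, class_states, class_states)
    return head(h).squeeze(-1)  # (batch, n_classes) logits
```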
[450] TimeDistill: Efficient Long-Term Time Series Forecasting with MLP via Cross-Architecture Distillation
Juntong Ni, Zewen Liu, Shiyu Wang, Ming Jin, Wei Jin
Main category: cs.LG
TL;DR: TimeDistill: A knowledge distillation framework that transfers multi-scale/multi-period patterns from Transformers/CNNs to lightweight MLPs for efficient long-term time series forecasting.
Details
Motivation: Transformer and CNN models have strong forecasting performance but high computational/storage costs limit large-scale deployment. Need lightweight alternatives that maintain accuracy.
Method: Cross-architecture knowledge distillation framework that transfers complementary patterns (multi-scale, multi-period in temporal/frequency domains) from teacher models (Transformers/CNNs) to student MLP models. Includes theoretical analysis showing KD as specialized mixup augmentation.
Result: Improves MLP performance by up to 18.6%, surpassing teacher models on 8 datasets. Achieves 7X faster inference and 130X fewer parameters. Demonstrates versatility and effectiveness through extensive evaluations.
Conclusion: TimeDistill successfully bridges the efficiency-accuracy gap by distilling complex patterns from heavy models to lightweight MLPs, enabling efficient deployment while maintaining or exceeding teacher performance.
Abstract: Transformer-based and CNN-based methods demonstrate strong performance in long-term time series forecasting. However, their high computational and storage requirements can hinder large-scale deployment. To address this limitation, we propose integrating lightweight MLP with advanced architectures using knowledge distillation (KD). Our preliminary study reveals different models can capture complementary patterns, particularly multi-scale and multi-period patterns in the temporal and frequency domains. Based on this observation, we introduce TimeDistill, a cross-architecture KD framework that transfers these patterns from teacher models (e.g., Transformers, CNNs) to MLP. Additionally, we provide a theoretical analysis, demonstrating that our KD approach can be interpreted as a specialized form of mixup data augmentation. TimeDistill improves MLP performance by up to 18.6%, surpassing teacher models on eight datasets. It also achieves up to 7X faster inference and requires 130X fewer parameters. Furthermore, we conduct extensive evaluations to highlight the versatility and effectiveness of TimeDistill.
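A rough picture of the distillation objective: supervise the student MLP on ground truth while matching the teacher in both the time domain and the amplitude spectrum, where multi-period structure is visible. The loss terms and weights below are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def distill_loss(student, teacher, target, lam_t=0.5, lam_f=0.5):
    """Cross-architecture KD sketch for forecasting tensors (..., horizon)."""
    sup = F.mse_loss(student, target)          # ground-truth supervision
    temporal = F.mse_loss(student, teacher)    # multi-scale temporal match
    freq = F.mse_loss(                         # multi-period spectral match
        torch.fft.rfft(student, dim=-1).abs(),
        torch.fft.rfft(teacher, dim=-1).abs(),
    )
    return sup + lam_t * temporal + lam_f * freq
```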
[451] Correcting Mode Proportion Bias in Generalized Bayesian Inference via a Weighted Kernel Stein Discrepancy
Elham Afzali, Saman Muthukumarana, Liqun Wang
Main category: cs.LG
TL;DR: The paper proposes a weighted Kernelized Stein Discrepancy method to address the mode insensitivity problem in Generalized Bayesian Inference for intractable multimodal posteriors.
Details
Motivation: Generalized Bayesian Inference (GBI) enhances robustness to model misspecification but suffers from intractable likelihoods. While KSD-Bayes addresses intractability by using gradient information, it has critical pathologies including insensitivity to well-separated modes in multimodal posteriors.
Method: The authors propose a weighted KSD method that retains computational efficiency while effectively capturing multimodal structures. This method improves the GBI framework for handling intractable multimodal posteriors.
Result: Experimental results show the method substantially improves mode sensitivity compared to standard KSD-Bayes, while maintaining robust performance in unimodal settings and in the presence of outliers.
Conclusion: The proposed weighted KSD method successfully addresses the mode insensitivity limitation of KSD-Bayes while preserving key theoretical properties like posterior consistency and asymptotic normality, making GBI more effective for intractable multimodal posteriors.
Abstract: Generalized Bayesian Inference (GBI) provides a flexible framework for updating prior distributions using various loss functions instead of the traditional likelihoods, thereby enhancing robustness to model misspecification. However, GBI often struggles when the likelihood is intractable. Kernelized Stein Discrepancy (KSD), as utilized in a recent study, addresses this challenge by relying only on the gradient of the log-likelihood. Despite this innovation, KSD-Bayes suffers from critical pathologies, including insensitivity to well-separated modes in multimodal posteriors. To address this limitation, we propose a weighted KSD method that retains computational efficiency while effectively capturing multimodal structures. Our method improves the GBI framework for handling intractable multimodal posteriors while maintaining key theoretical properties such as posterior consistency and asymptotic normality. Experimental results demonstrate that our method substantially improves mode sensitivity compared to standard KSD-Bayes, while retaining robust performance in unimodal settings and in the presence of outliers.
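For reference, the Stein kernel behind KSD-Bayes needs only the score function, which is why intractable normalizing constants are tolerable; one illustrative way to introduce weights (not necessarily the paper's construction) replaces the base kernel k(x, x') by w(x) k(x, x') w(x') so that w can up-weight regions separating modes.

```latex
% Standard Stein kernel; only the score \nabla \log p is required.
k_p(x, x') = \nabla_x \log p(x)^\top k(x, x')\, \nabla_{x'} \log p(x')
           + \nabla_x \log p(x)^\top \nabla_{x'} k(x, x')
           + \nabla_x k(x, x')^\top \nabla_{x'} \log p(x')
           + \operatorname{tr} \nabla_x \nabla_{x'} k(x, x'),
\qquad
\mathrm{KSD}^2(q \,\|\, p) = \mathbb{E}_{x, x' \sim q}\big[\, k_p(x, x') \,\big].
```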
[452] Active operator learning with predictive uncertainty quantification for partial differential equations
Nick Winovich, Mitchell Daneker, Lu Lu, Guang Lin
Main category: cs.LG
TL;DR: A lightweight uncertainty quantification method for neural operators (DeepONets/FNO) that provides fast, accurate uncertainty estimates for PDE solutions, enabling efficient outer-loop analyses like Bayesian optimization and active learning.
Details
Motivation: Neural operators need reliable uncertainty quantification for scientific applications, but existing UQ methods (ensembles/Bayesian) are computationally expensive for both training and inference.
Method: Proposes a lightweight predictive UQ method tailored for DeepONets that generalizes to other operator networks. Includes inference optimization via precomputed trunk outputs and sparse placement matrix for 5x speedup. Extends to Fourier Neural Operators for active learning.
Result: Method provides unbiased uncertainty estimates and accurate out-of-distribution predictions with sufficient training data. Enables fast inference for outer-loop analyses (Bayesian optimization, active learning) that would be prohibitively expensive with conventional solvers.
Conclusion: The framework offers a practical route to uncertainty-aware operator learning in time-sensitive settings, with demonstrated applications in Bayesian optimization and active learning for improved accuracy and data-efficiency.
Abstract: With the increased prevalence of neural operators being used to provide rapid solutions to partial differential equations (PDEs), understanding the accuracy of model predictions and the associated error levels is necessary for deploying reliable surrogate models in scientific applications. Existing uncertainty quantification (UQ) frameworks employ ensembles or Bayesian methods, which can incur substantial computational costs during both training and inference. We propose a lightweight predictive UQ method tailored for Deep operator networks (DeepONets) that also generalizes to other operator networks. Numerical experiments on linear and nonlinear PDEs demonstrate that the framework’s uncertainty estimates are unbiased and provide accurate out-of-distribution uncertainty predictions with a sufficiently large training dataset. Our framework provides fast inference and uncertainty estimates that can efficiently drive outer-loop analyses that would be prohibitively expensive with conventional solvers. We demonstrate how predictive uncertainties can be used in the context of Bayesian optimization and active learning problems to yield improvements in accuracy and data-efficiency for outer-loop optimization procedures. In the active learning setup, we extend the framework to Fourier Neural Operators (FNO) and describe a generalized method for other operator networks. To enable real-time deployment, we introduce an inference strategy based on precomputed trunk outputs and a sparse placement matrix, reducing evaluation time by more than a factor of five. Our method provides a practical route to uncertainty-aware operator learning in time-sensitive settings.
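The precomputed-trunk trick is easy to see in code: for fixed query locations the trunk network is evaluated once, after which each new input function costs one branch pass and a matrix product. The placeholder networks below are illustrative stand-ins, not real DeepONet components.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 32  # latent basis dimension

# Placeholder "networks" (real models are trained neural nets).
W_branch = rng.standard_normal((p, 100))
W_trunk = rng.standard_normal((p, 1))

def branch_net(u_sensors):            # u sampled at 100 fixed sensors -> (p,)
    return np.tanh(W_branch @ u_sensors)

def trunk_net(query_points):          # (n_points, 1) -> (n_points, p)
    return np.tanh(query_points @ W_trunk.T)

# Precompute trunk outputs once for fixed query locations ...
query_points = np.linspace(0, 1, 200)[:, None]
trunk_out = trunk_net(query_points)   # (200, p), reused for every input

# ... so each new input function needs only a branch pass plus a matvec.
def deeponet_predict(u_sensors):
    return trunk_out @ branch_net(u_sensors)   # (200,) predicted field
```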
[453] TransMamba: A Sequence-Level Hybrid Transformer-Mamba Language Model
Yixing Li, Ruobing Xie, Zhen Yang, Xingwu Sun, Shuaipeng Li, Weidong Han, Zhanhui Kang, Yu Cheng, Chengzhong Xu, Di Wang, Jie Jiang
Main category: cs.LG
TL;DR: TransMamba is a sequence-level hybrid framework that unifies Transformer and Mamba models through shared parameter matrices, enabling dynamic switching between attention and state space mechanisms based on token lengths and layers.
Details
Motivation: Transformers have quadratic complexity limiting long-sequence efficiency, while Mamba offers linear complexity but suffers from unstable contextual learning and multitask generalization. Existing layer-level hybrids don't fully leverage both paradigms.
Method: Proposes TransMamba with shared QKV and CBx parameter matrices between Transformer and Mamba, Memory Converter to bridge attention outputs to SSM states, and TransPoint scheduling for balancing effectiveness and efficiency.
Result: Extensive experiments show TransMamba achieves superior training efficiency and performance compared to single and hybrid baselines, validating deeper consistency between Transformer and Mamba paradigms at sequence level.
Conclusion: TransMamba offers a scalable solution for next-generation language modeling by dynamically leveraging both attention and SSM mechanisms through unified parameter sharing and seamless information flow.
Abstract: Transformers are the cornerstone of modern large language models, but their quadratic computational complexity limits efficiency in long-sequence processing. Recent advancements in Mamba, a state space model (SSM) with linear complexity, offer promising efficiency gains but suffer from unstable contextual learning and multitask generalization. Some works conduct layer-level hybrid structures that combine Transformer and Mamba layers, aiming to make full use of both advantages. This paper proposes TransMamba, a novel sequence-level hybrid framework that unifies Transformer and Mamba through shared parameter matrices (QKV and CBx), and thus could dynamically switch between attention and SSM mechanisms at different token lengths and layers. We design the Memory Converter to bridge Transformer and Mamba by converting attention outputs into SSM-compatible states, ensuring seamless information flow at TransPoints where the transformation happens. The TransPoint scheduling is also thoroughly explored for balancing effectiveness and efficiency. We conducted extensive experiments demonstrating that TransMamba achieves superior training efficiency and performance compared to single and hybrid baselines, and validated the deeper consistency between Transformer and Mamba paradigms at sequence level, offering a scalable solution for next-generation language modeling. Code and data are available at https://github.com/Yixing-Li/TransMamba
[454] Architecture independent generalization bounds for overparametrized deep ReLU networks
Anandatheertha Bapu, Thomas Chen, Chun-Kai Kevin Chien, Patricia Muñoz Ewald, Andrew G. Moore
Main category: cs.LG
TL;DR: Overparametrized neural networks generalize with test error independent of overparametrization and VC dimension, depending only on data geometry, activation regularity, and weight norms.
Details
Motivation: To understand why overparametrized neural networks generalize well despite having huge capacity, and to provide theoretical guarantees that don't depend on traditional complexity measures like VC dimension.
Method: Theoretical analysis proving explicit generalization bounds based on metric geometry of data, activation function regularity, and weight/bias norms. For deep ReLU networks, explicit construction of zero-loss minimizers without gradient descent, with uniform generalization bounds independent of architecture.
Result: Proved test error independent of overparametrization level and VC dimension. For bounded training sample size, constructed zero-loss minimizers and proved uniform generalization bounds. Experimental validation on MNIST showed agreement with true test error within 22% margin on average.
Conclusion: Overparametrized neural networks can generalize well with bounds that depend on data geometry and network properties rather than traditional complexity measures, providing theoretical justification for the success of modern deep learning architectures.
Abstract: We prove that overparametrized neural networks are able to generalize with a test error that is independent of the level of overparametrization, and independent of the Vapnik-Chervonenkis (VC) dimension. We prove explicit bounds that only depend on the metric geometry of the test and training sets, on the regularity properties of the activation function, and on the operator norms of the weights and norms of biases. For overparametrized deep ReLU networks with a training sample size bounded by the input space dimension, we explicitly construct zero loss minimizers without use of gradient descent, and prove a uniform generalization bound that is independent of the network architecture. We perform computational experiments testing our theoretical results on MNIST, and obtain agreement with the true test error within a 22% margin on average.
[455] The Cost of Dynamic Reasoning: Demystifying AI Agents and Test-Time Scaling from an AI Infrastructure Perspective
Jiin Kim, Byeongjun Shin, Jinha Chung, Minsoo Rhu
Main category: cs.LG
TL;DR: First comprehensive system-level analysis of LLM-based AI agents reveals unsustainable computational demands and diminishing returns despite improved accuracy, calling for compute-efficient agent design.
Details
Motivation: While LLM-based AI agents show impressive versatility through dynamic reasoning and multi-turn workflows, this shift introduces serious concerns about system-level cost, efficiency, and sustainability that need comprehensive analysis.
Method: Conducted first comprehensive system-level analysis quantifying resource usage, latency behavior, energy consumption, and datacenter-wide power consumption across diverse agent designs and test-time scaling strategies; characterized how design choices impact accuracy-cost tradeoffs.
Result: Agents improve accuracy with increased compute but suffer from rapidly diminishing returns, widening latency variance, and unsustainable infrastructure costs; evaluation reveals profound computational demands and a looming sustainability crisis.
Conclusion: Calls for paradigm shift in agent design toward compute-efficient reasoning to balance performance with deployability under real-world constraints, addressing the sustainability challenges identified.
Abstract: Large-language-model (LLM)-based AI agents have recently showcased impressive versatility by employing dynamic reasoning, an adaptive, multi-step process that coordinates with external tools. This shift from static, single-turn inference to agentic, multi-turn workflows broadens task generalization and behavioral flexibility, but it also introduces serious concerns about system-level cost, efficiency, and sustainability. This paper presents the first comprehensive system-level analysis of AI agents, quantifying their resource usage, latency behavior, energy consumption, and datacenter-wide power consumption demands across diverse agent designs and test-time scaling strategies. We further characterize how AI agent design choices, such as few-shot prompting, reflection depth, and parallel reasoning, impact accuracy-cost tradeoffs. Our findings reveal that while agents improve accuracy with increased compute, they suffer from rapidly diminishing returns, widening latency variance, and unsustainable infrastructure costs. Through detailed evaluation of representative agents, we highlight the profound computational demands introduced by AI agent workflows, uncovering a looming sustainability crisis. These results call for a paradigm shift in agent design toward compute-efficient reasoning, balancing performance with deployability under real-world constraints.
[456] Reward Is Enough: LLMs Are In-Context Reinforcement Learners
Kefan Song, Amir Moeini, Peng Wang, Lei Gong, Rohan Chandra, Shangtong Zhang, Yanjun Qi
Main category: cs.LG
TL;DR: LLMs can perform reinforcement learning during inference through multi-round prompting with reward feedback, enabling self-improvement without training.
Details
Motivation: To discover if LLMs can exhibit RL-like behavior during inference time, enabling test-time self-improvement through reward optimization without model updates.Method: ICRL prompting: multi-round framework where LLM receives numerical reward feedback after each response, with context concatenating prior responses and rewards for iterative improvement.
Result: Response quality consistently improves with growing context across Game of 24, creative writing, ScienceWorld, and math competitions, outperforming Self-Refine and Reflexion baselines.
Conclusion: LLMs can perform in-context RL during inference, enabling effective test-time scaling and self-improvement through reward optimization, even with self-generated rewards.
Abstract: Reinforcement learning (RL) is a framework for solving sequential decision-making problems. In this work, we demonstrate that, surprisingly, RL emerges during the inference time of large language models (LLMs), a phenomenon we term in-context RL (ICRL). To reveal this capability, we introduce a simple multi-round prompting framework, which we call ICRL prompting, for inference-time self-improvement. The goal of ICRL prompting is to guide LLMs to perform reinforcement learning during inference for self-improvement on a given task. After each response, the model receives numerical scalar feedback, denoted as a reward. In the next round, we prompt the LLM again together with a context that concatenates all prior responses and their associated rewards. We consistently observe that response quality improves as the context grows. In other words, the LLM can optimize scalar reward signals during inference, exhibiting behavior analogous to reinforcement learning. We evaluate ICRL prompting on Game of 24, creative writing, ScienceWorld, and Olympiad-level math competitions (AIME and HMMT), demonstrating significant improvements over baselines such as Self-Refine and Reflexion. Notably, even when the reward signals are generated by the same LLM, ICRL prompting still improves performance, highlighting a promising new paradigm for test-time scaling.
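To make the ICRL loop concrete, here is a minimal Python sketch of the multi-round prompting scheme described in the abstract. The `llm` and `reward_fn` callables are hypothetical stand-ins for an LLM client and a task scorer; this is our illustration, not the paper's released code.

```python
# Minimal sketch of ICRL prompting: each round's context concatenates all
# prior responses and their scalar rewards. `llm` and `reward_fn` are
# caller-supplied stand-ins, not from the paper.

def icrl_prompting(task, llm, reward_fn, rounds=5):
    history = []                                  # (response, reward) pairs
    best, best_reward = None, float("-inf")
    for _ in range(rounds):
        context = "\n".join(f"Attempt: {r}\nReward: {w}" for r, w in history)
        response = llm(f"{context}\nTask: {task}\nImprove on prior attempts.")
        reward = reward_fn(response)              # numerical scalar feedback
        history.append((response, reward))
        if reward > best_reward:
            best, best_reward = response, reward
    return best

# Toy usage; swap in a real LLM client and a verifier or LLM-based scorer.
print(icrl_prompting("write a haiku", lambda p: "draft", lambda r: len(r)))
```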
[457] AsFT: Anchoring Safety During LLM Fine-Tuning Within Narrow Safety Basin
Shuo Yang, Qihui Zhang, Yuyang Liu, Xiaojun Jia, Kunpeng Ning, Jiayu Yao, Jigang Wang, Hailiang Dai, Yibing Song, Li Yuan
Main category: cs.LG
TL;DR: Fine-tuning LLMs creates safety vulnerabilities; perturbations orthogonal to alignment direction compromise safety, while updates along alignment preserve it. Proposed AsFT method constrains updates to maintain safety, reducing harmful behaviors by up to 7.60% and improving task performance.
Details
Motivation: Fine-tuning large language models improves performance but introduces critical safety vulnerabilities where even minimal harmful data can severely compromise safety measures. There's a need to maintain safety while allowing beneficial fine-tuning.Method: Proposed AsFT (Anchoring Safety in Fine-Tuning) explicitly constrains update directions during fine-tuning by penalizing updates orthogonal to the alignment direction (defined by weight differences between aligned and unaligned models). This keeps the model within the “narrow safety basin” revealed by the analysis.
Result: AsFT reduces harmful behaviors by up to 7.60%, improves task performance by 3.44%, and consistently outperforms existing methods across multiple tasks and datasets.
Conclusion: The parameter space forms a “narrow safety basin” where safety is fragile to orthogonal perturbations. AsFT effectively maintains safety during fine-tuning by constraining update directions, offering a practical solution to the safety-performance trade-off in LLM fine-tuning.
Abstract: Fine-tuning large language models (LLMs) improves performance but introduces critical safety vulnerabilities: even minimal harmful data can severely compromise safety measures. We observe that perturbations orthogonal to the alignment direction - defined by weight differences between aligned (safe) and unaligned models - rapidly compromise model safety. In contrast, updates along the alignment direction largely preserve it, revealing the parameter space as a “narrow safety basin”. To address this, we propose AsFT (Anchoring Safety in Fine-Tuning) to maintain safety by explicitly constraining update directions during fine-tuning. By penalizing updates orthogonal to the alignment direction, AsFT effectively constrains the model within the “narrow safety basin,” thus preserving its inherent safety. Extensive experiments on multiple datasets and models show that AsFT reduces harmful behaviors by up to 7.60%, improves task performance by 3.44%, and consistently outperforms existing methods across multiple tasks.
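The core constraint is easy to state: decompose each candidate weight update into components parallel and orthogonal to the alignment direction, and penalize the orthogonal part. Below is a hedged NumPy sketch under that reading; the flattening, normalization, and squared-L2 penalty form are our assumptions, not the paper's exact loss.

```python
import numpy as np

# Illustrative AsFT-style penalty: punish the component of an update that is
# orthogonal to the alignment direction (aligned minus unaligned weights).
# The exact penalty form here is an assumption for the sketch.

def orthogonal_penalty(delta_w, w_aligned, w_unaligned, lam=1.0):
    d = (w_aligned - w_unaligned).ravel()
    d = d / (np.linalg.norm(d) + 1e-12)          # unit alignment direction
    u = delta_w.ravel()
    parallel = np.dot(u, d) * d                  # component along alignment
    orthogonal = u - parallel                    # component to be penalized
    return lam * np.dot(orthogonal, orthogonal)  # squared L2 norm

# Toy check: an update exactly along the alignment direction costs nothing.
wa, wu = np.ones(4), np.zeros(4)
print(orthogonal_penalty(0.3 * (wa - wu), wa, wu))          # ~0.0
print(orthogonal_penalty(np.array([1., -1., 0., 0.]), wa, wu))  # > 0
```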
[458] Uncovering Bias Paths with LLM-guided Causal Discovery: An Active Learning and Dynamic Scoring Approach
Khadija Zanna, Akane Sano
Main category: cs.LG
TL;DR: LLM-guided causal discovery framework improves fairness pathway recovery in noisy data by combining statistical measures with semantic priors from variable metadata.
Details
Motivation: Existing causal discovery methods often fail to identify fairness-relevant causal pathways (e.g., race/gender influencing outcomes) in noisy, confounded, or corrupted data, limiting effective fairness auditing in high-stakes domains.Method: Hybrid LLM-guided causal discovery framework using breadth-first search with active learning and dynamic scoring. Combines mutual information, partial correlation, and LLM confidence into composite scores to prioritize variable pairs for querying, enabling more efficient structure discovery.
Result: LLM-guided methods, especially the active dynamically-scored variant, outperform baselines in recovering fairness-relevant causal structure under noisy conditions, particularly for fairness-critical paths like sex→education→income.
Conclusion: LLM semantic priors complement statistical dependencies for causal discovery in fairness contexts, offering improved pathway recovery for fairness auditing in high-stakes applications where traditional methods struggle with noise and confounding.
Abstract: Ensuring fairness in machine learning requires understanding how sensitive attributes like race or gender causally influence outcomes. Existing causal discovery (CD) methods often struggle to recover fairness-relevant pathways in the presence of noise, confounding, or data corruption. Large language models (LLMs) offer a complementary signal by leveraging semantic priors from variable metadata. We propose a hybrid LLM-guided CD framework that extends a breadth-first search strategy with active learning and dynamic scoring. Variable pairs are prioritized for querying using a composite score combining mutual information, partial correlation, and LLM confidence, enabling more efficient and robust structure discovery. To evaluate fairness sensitivity, we introduce a semi-synthetic benchmark based on the UCI Adult dataset, embedding domain-informed bias pathways alongside noise and latent confounders. We assess how well CD methods recover both global graph structure and fairness-critical paths (e.g., sex → education → income). Our results demonstrate that LLM-guided methods, including our active, dynamically scored variant, outperform baselines in recovering fairness-relevant structure under noisy conditions. We analyze when LLM-driven insights complement statistical dependencies and discuss implications for fairness auditing in high-stakes domains.
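The composite score is straightforward to sketch. In the illustration below, the weights and the toy statistics are invented for demonstration; the paper's actual estimators for mutual information and partial correlation, and its weighting, may differ.

```python
# Illustrative pair-prioritization by composite score: weighted sum of mutual
# information, absolute partial correlation, and LLM confidence. Weights and
# toy statistics are assumptions for this sketch.

def composite_score(mi, partial_corr, llm_conf, w=(0.4, 0.3, 0.3)):
    return w[0] * mi + w[1] * abs(partial_corr) + w[2] * llm_conf

def prioritize_pairs(stats):
    # stats: {(var_a, var_b): (mi, partial_corr, llm_conf)}
    return sorted(stats, key=lambda p: composite_score(*stats[p]), reverse=True)

pairs = {("sex", "education"): (0.12, 0.20, 0.9),
         ("education", "income"): (0.25, 0.35, 0.8),
         ("race", "zipcode"): (0.05, 0.10, 0.4)}
print(prioritize_pairs(pairs))   # query the LLM about top-ranked pairs first
```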
[459] Machine Learning Model Integration with Open World Temporal Logic for Process Automation
Dyuman Aditya, Colton Payne, Mario Leiva, Paulo Shakarian
Main category: cs.LG
TL;DR: Integration of ML model outputs with PyReason temporal logic programming framework for real-time, explainable decision-making in complex workflows.
Details
Motivation: There's a gap between ML models' perceptual/extractive capabilities and actionable, explainable decisions in complex operational workflows. Current ML outputs need to be translated into decisions that can be understood and acted upon within organizational processes.Method: Integrate diverse ML model outputs (probabilities, confidence scores) with PyReason, an open-world temporal logic programming reasoning engine. PyReason converts ML outputs into logical facts with truth intervals, continuously polls ML models, dynamically recomputes minimal models, and supports temporal reasoning and knowledge graph integration.
Result: A system that combines ML perception/extraction with logical deduction and transparency, enabling real-time decision-making with full explainability through interface traces. The integration handles time-sensitive process data and organizational knowledge.
Conclusion: The PyReason-ML integration creates a powerful system for automating complex processes across domains like manufacturing, healthcare, and business operations by bridging ML capabilities with logical reasoning and explainability.
Abstract: Recent advances in Machine Learning (ML) have produced models that extract structured information from complex data. However, a significant challenge lies in translating these perceptual or extractive outputs into actionable and explainable decisions within complex operational workflows. To address these challenges, this paper introduces a novel approach that integrates the outputs of various machine learning models directly with the PyReason framework, an open-world temporal logic programming reasoning engine. PyReason’s foundation in generalized annotated logic allows for the incorporation of real-valued outputs (e.g., probabilities, confidence scores) from a diverse set of ML models, treating them as truth intervals within its logical framework. Crucially, PyReason provides mechanisms, implemented in Python, to continuously poll ML model outputs, convert them into logical facts, and dynamically recompute the minimal model to enable decision-making in real-time. Furthermore, its native support for temporal reasoning, knowledge graph integration, and fully explainable interface traces enables an analysis of time-sensitive process data and existing organizational knowledge. By combining the strengths of perception and extraction from ML models with the logical deduction and transparency of PyReason, we aim to create a powerful system for automating complex processes. This integration is well suited for use cases in numerous domains, including manufacturing, healthcare, and business operations.
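The key conversion step, mapping a real-valued ML output to an annotated fact with a truth interval, can be sketched generically. The code below deliberately avoids reproducing PyReason's actual API; `Fact`, `ml_output_to_fact`, and the uncertainty margin are illustrative assumptions only.

```python
from dataclasses import dataclass

# Generic sketch (not the PyReason API) of turning a real-valued ML output
# into a logical fact annotated with a truth interval, in the spirit of
# generalized annotated logic. All names here are illustrative.

@dataclass
class Fact:
    predicate: str
    subject: str
    lower: float   # lower bound of the truth interval
    upper: float   # upper bound of the truth interval
    timestep: int

def ml_output_to_fact(predicate, subject, prob, margin, t):
    # Widen the point probability into an interval to reflect uncertainty.
    return Fact(predicate, subject, max(0.0, prob - margin),
                min(1.0, prob + margin), t)

# e.g. a defect detector reporting 0.87 confidence at timestep 12
print(ml_output_to_fact("defective", "part_42", 0.87, 0.05, 12))
```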
[460] Bridging Prediction and Intervention Problems in Social Systems
Lydia T. Liu, Inioluwa Deborah Raji, Angela Zhou, Luke Guerdan, Jessica Hullman, Daniel Malinsky, Bryan Wilder, Simone Zhang, Hammaad Adam, Amanda Coston, Ben Laufer, Ezinne Nwankwo, Michael Zanger-Tishler, Eli Ben-Michael, Solon Barocas, Avi Feller, Marissa Gerchick, Talia Gillis, Shion Guha, Daniel Ho, Lily Hu, Kosuke Imai, Sayash Kapoor, Joshua Loftus, Razieh Nabi, Arvind Narayanan, Ben Recht, Juan Carlos Perdomo, Matthew Salganik, Mark Sendak, Alexander Tolbert, Berk Ustun, Suresh Venkatasubramanian, Angelina Wang, Ashia Wilson
Main category: cs.LG
TL;DR: The paper argues for shifting from prediction-focused to intervention-oriented paradigms in automated decision systems (ADS), emphasizing that ADS operationalize policy interventions and shape population outcomes, requiring new problem setups beyond just prediction tasks.
Details
Motivation: Current ADS are designed for prediction problems but in reality operationalize holistic policy interventions that shape population outcomes. There's a gap between prediction-focused design and the actual intervention-oriented impact of deployed systems within social systems.Method: Proposes a paradigm shift from prediction-focused to intervention-oriented approach, advocating for new default problem setups that consider predictions as decision support, final decisions, and outcomes. Uses modern statistical frameworks and tools to study ADS design, implementation, and evaluation.
Result: Characterizes limitations of isolated prediction tasks and lays foundation for more intervention-oriented approach to developing and deploying ADS. Provides unified perspective that connects statistical frameworks with practical ADS implementation.
Conclusion: ADS must shift from prediction-focused to intervention-oriented paradigms, requiring new problem setups that account for how predictions become decisions and shape outcomes within social systems, with implications for research directions and practical implementation.
Abstract: Many automated decision systems (ADS) are designed to solve prediction problems – where the goal is to learn patterns from a sample of the population and apply them to individuals from the same population. In reality, these prediction systems operationalize holistic policy interventions in deployment. Once deployed, ADS can shape impacted population outcomes through an effective policy change in how decision-makers operate, while also being defined by past and present interactions between stakeholders and the limitations of existing organizational, as well as societal, infrastructure and context. In this work, we consider the ways in which we must shift from a prediction-focused paradigm to an intervention-oriented paradigm when considering the impact of ADS within social systems. We argue this requires a new default problem setup for ADS beyond prediction, to instead consider predictions as decision support, final decisions, and outcomes. We highlight how this perspective unifies modern statistical frameworks and other tools to study the design, implementation, and evaluation of ADS systems, and point to the research directions necessary to operationalize this paradigm shift. Using these tools, we characterize the limitations of focusing on isolated prediction tasks, and lay the foundation for a more intervention-oriented approach to developing and deploying ADS.
[461] Low Resource Reconstruction Attacks Through Benign Prompts
Sol Yarkoni, Mahmood Sharif, Roi Livni
Main category: cs.LG
TL;DR: A low-resource attack that identifies benign prompts causing unintended image reconstruction from diffusion models trained on scraped e-commerce data, revealing privacy risks even for non-expert users.
Details
Motivation: To address privacy, copyright, and data stewardship concerns in generative models by developing a practical attack that requires minimal resources and training data access, unlike existing computationally intensive methods.Method: Leverages domain knowledge about scraped e-commerce data where templated layouts and images are tied to pattern-like textual prompts. Identifies seemingly benign prompts that trigger memorized visual elements without requiring specialized knowledge or extensive computational resources.
Result: Demonstrates that even simple prompts like “blue Unisex T-Shirt” can generate real individuals’ faces from training data. Shows reconstructions occur unintentionally and combines identified vulnerabilities with real-world prompt data to discover prompts that reproduce memorized visual elements.
Conclusion: Reveals fundamental vulnerabilities in models trained on scraped e-commerce data, where pattern-like prompts can trigger unintended image reconstruction, posing significant privacy risks even for non-expert users. Provides publicly available code for the attack.
Abstract: Recent advances in generative models, such as diffusion models, have raised concerns related to privacy, copyright infringement, and data stewardship. To better understand and control these risks, prior work has introduced techniques and attacks that reconstruct images, or parts of images, from training data. While these results demonstrate that training data can be recovered, existing methods often rely on high computational resources, partial access to the training set, or carefully engineered prompts. In this work, we present a new attack that requires low resources, assumes little to no access to the training data, and identifies seemingly benign prompts that can lead to potentially risky image reconstruction. We further show that such reconstructions may occur unintentionally, even for users without specialized knowledge. For example, we observe that for one existing model, the prompt “blue Unisex T-Shirt” generates the face of a real individual. Moreover, by combining the identified vulnerabilities with real-world prompt data, we discover prompts that reproduce memorized visual elements. Our approach builds on insights from prior work and leverages domain knowledge to expose a fundamental vulnerability arising from the use of scraped e-commerce data, where templated layouts and images are closely tied to pattern-like textual prompts. The code for our attack is publicly available at https://github.com/TheSolY/lr-tmi.
[462] The Invisible Leash: Why RLVR May or May Not Escape Its Origin
Fang Wu, Weihao Xuan, Ximing Lu, Mingjie Liu, Yi Dong, Zaid Harchaoui, Yejin Choi
Main category: cs.LG
TL;DR: Current RLVR practice improves precision but may not expand reasoning boundaries, instead amplifying existing high-reward outputs while narrowing exploration and potentially missing correct underrepresented solutions.
Details
Motivation: To investigate whether Reinforcement Learning with Verifiable Rewards (RLVR) truly expands LLMs' reasoning capabilities or merely amplifies high-reward outputs that the base model already knows, addressing concerns about potential limitations in current RLVR approaches.Method: Empirical investigation examining RLVR as a support-constrained optimization mechanism, analyzing entropy-reward trade-offs, and conducting extensive experiments to measure pass@1 improvements versus support shrinkage/expansion under different sampling budgets.
Result: RLVR consistently improves pass@1 but shrinks empirical support more than expands it, failing to recover previously accessible correct answers. While token-level entropy sometimes increases, answer-level entropy declines, indicating generation paths converge onto fewer distinct answers.
Conclusion: Current RLVR recipe has limits in extending reasoning horizons, acting as an “invisible leash” that constrains discovery of novel solutions. Future innovations need explicit exploration mechanisms or hybrid strategies allocating probability mass to underrepresented solution regions.
Abstract: Recent advances in LLMs highlight Reinforcement Learning with Verifiable Rewards (RLVR) as a promising method for enhancing AI capabilities, particularly in solving complex logical tasks. However, it remains unclear whether the current practice of RLVR truly expands a model’s reasoning boundary or mainly amplifies high-reward outputs that the base model already knows, leading to improved precision. This study presents an empirical investigation that provides new insights into the potential limits of the common RLVR recipe. We examine how, under current training conditions, RLVR can operate as a support-constrained optimization mechanism that may restrict the discovery of entirely novel solutions, remaining constrained by the base model’s initial distribution. We also identify an entropy-reward trade-off: while the current RLVR recipe reliably enhances precision, it may progressively narrow exploration and potentially overlook correct yet underrepresented solutions. Extensive empirical experiments show that although the current RLVR recipe consistently improves pass@1, the shrinkage of empirical support generally outweighs the expansion of empirical support under larger sampling budgets, failing to recover correct answers that were previously accessible to the base model. Interestingly, we also observe that although RLVR sometimes increases token-level entropy, yielding greater uncertainty at each generation step, answer-level entropy declines. This suggests that these seemingly more uncertain generation paths ultimately converge onto a smaller set of distinct answers. Taken together, our findings reveal potential limits of the current RLVR recipe in extending reasoning horizons. Breaking this invisible leash may require future algorithmic innovations, such as explicit exploration mechanisms or hybrid strategies that allocate probability mass to underrepresented solution regions.
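The answer-level entropy measurement is simple to reproduce in spirit: sample many completions, reduce each to its final answer, and compute the Shannon entropy of the empirical answer distribution. The toy data below is fabricated purely to illustrate the shrinking-support pattern the paper reports.

```python
import math
from collections import Counter

# Shannon entropy of the empirical distribution over final answers. Helper
# names and the toy samples are ours, not the paper's data.

def answer_entropy(answers):
    counts = Counter(answers)
    n = len(answers)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

base_model_answers = ["42", "41", "42", "7", "42", "13", "42", "41"]
rlvr_model_answers = ["42", "42", "42", "42", "42", "42", "41", "42"]
print(answer_entropy(base_model_answers))   # higher: broader empirical support
print(answer_entropy(rlvr_model_answers))   # lower: support has shrunk
```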
[463] Neural Network Quantization for Microcontrollers: A Comprehensive Survey of Methods, Platforms, and Applications
Hamza A. Abushahla, Dara Varam, Ariel Justine N. Panopio, Mohamed I. AlHajri
Main category: cs.LG
TL;DR: A survey paper reviewing neural network quantization methods for microcontroller-based edge devices, focusing on hardware-software co-design, MCU platforms, real-world deployments, and future challenges in TinyML.
Details
Motivation: To address the fundamental challenges of deploying quantized neural networks on resource-constrained edge devices like microcontrollers, balancing model performance, computational complexity, and memory constraints in TinyML applications.Method: Provides a hardware-oriented survey approach, systematically reviewing quantization methods relevant to MCUs, analyzing hardware capabilities (memory hierarchies, numerical representations, accelerators), reviewing contemporary MCU platforms (ARM, RISC-V, NPU-integrated), and examining supporting software stacks.
Result: The survey consolidates knowledge on quantization techniques for extreme-edge devices, analyzes real-world MCU deployments of quantized models, identifies application domains where such systems are used, and provides a comprehensive overview of the current TinyML ecosystem.
Conclusion: The paper outlines open challenges and promising future directions for scalable, energy-efficient, and sustainable AI deployment on edge devices, emphasizing the need for continued hardware-software co-design in TinyML.
Abstract: The deployment of Quantized Neural Networks (QNNs) on resource-constrained edge devices, such as microcontrollers (MCUs), introduces fundamental challenges in balancing model performance, computational complexity, and memory constraints. Tiny Machine Learning (TinyML) addresses these issues by jointly advancing machine learning algorithms, hardware architectures, and software optimization techniques to enable deep neural network inference on embedded systems. This survey provides a hardware-oriented perspective on neural network quantization, systematically reviewing the quantization methods most relevant to MCUs and extreme-edge devices. Particular emphasis is placed on the critical trade-offs between model performance and the capabilities of MCU-class hardware, including memory hierarchies, numerical representations, and accelerator support. The survey further reviews contemporary MCU hardware platforms, including ARM-based and RISC-V-based designs, as well as MCUs integrating neural processing units (NPUs) for low-precision inference, together with the supporting software stacks. In addition, we analyze real-world deployments of quantized models on MCUs and consolidate the application domains in which such systems are used. Finally, we discuss open challenges and outline promising future directions toward scalable, energy-efficient, and sustainable AI deployment on edge devices.
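As background for the quantization methods the survey covers, the standard affine (asymmetric) quantization scheme looks like this; the snippet is a textbook formulation, not code from the paper.

```python
import numpy as np

# Textbook affine (asymmetric) quantization to n-bit unsigned integers,
# the kind of scheme commonly deployed on MCU-class hardware.

def quantize(x, n_bits=8):
    qmin, qmax = 0, 2 ** n_bits - 1
    scale = max(float(x.max() - x.min()), 1e-8) / (qmax - qmin)
    zero_point = int(round(qmin - float(x.min()) / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return scale * (q.astype(np.float32) - zero_point)

w = np.random.randn(6).astype(np.float32)
q, s, z = quantize(w)
print(w)
print(dequantize(q, s, z))  # reconstruction error is bounded by about scale/2
```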
[464] BiListing: Modality Alignment for Listings
Guillaume Guy, Mihajlo Grbovic, Chun How Tan, Han Zhao
Main category: cs.LG
TL;DR: BiListing is a bimodal embedding approach that aligns text and photos of Airbnb listings using LLMs and vision-language models, enabling single-vector representations per listing/modality for improved search ranking and recommendation.
Details
Motivation: Airbnb historically relied on structured data due to complexity of extracting meaningful information from unstructured text and images. With representation learning advances, leveraging rich unstructured data (multiple images, titles, descriptions, reviews) became possible but challenging to combine into single representations.Method: BiListing (Bimodal Listing) aligns text and photos by leveraging large-language models and pretrained language-image models to create single embedding vectors per listing and modality, enabling cross-modal understanding and search.
Result: Deployed in production with 0.425% NDCG gain in search ranking, driving tens of millions in incremental revenue. Enables zero-shot semantic search, overcomes cold start problem, and supports listing-to-listing search across modalities.
Conclusion: BiListing successfully addresses the challenge of combining diverse unstructured data from Airbnb listings into unified representations, demonstrating significant business impact through improved search ranking and revenue generation.
Abstract: Airbnb is a leader in offering travel accommodations. Airbnb has historically relied on structured data to understand, rank, and recommend listings to guests due to the limited capabilities and associated complexity arising from extracting meaningful information from text and images. With the rise of representation learning, leveraging rich information from text and photos has become easier. A popular approach has been to create embeddings for text documents and images to enable use cases of computing similarities between listings or using embeddings as features in an ML model. However, an Airbnb listing has diverse unstructured data: multiple images, various unstructured text documents such as title, description, and reviews, making this approach challenging. Specifically, it is a non-trivial task to combine multiple embeddings of different pieces of information to reach a single representation. This paper proposes BiListing, for Bimodal Listing, an approach to align text and photos of a listing by leveraging large-language models and pretrained language-image models. The BiListing approach has several favorable characteristics: capturing unstructured data into a single embedding vector per listing and modality, enabling zero-shot capability to search inventory efficiently in user-friendly semantics, overcoming the cold start problem, and enabling listing-to-listing search along a single modality, or both. We conducted offline and online tests to leverage the BiListing embeddings in the Airbnb search ranking model, and successfully deployed it in production, achieving a 0.425% NDCG gain and driving tens of millions in incremental revenue.
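On the retrieval side, the payoff of one embedding per listing per modality is that text-to-listing and listing-to-listing search reduce to cosine similarity. Below is a hedged sketch, with random vectors standing in for the actual BiListing encoders and mean pooling standing in for the learned alignment.

```python
import numpy as np

# Retrieval over one vector per listing per modality. Random embeddings
# stand in for real encoders; mean pooling is one simple collapse choice.

def l2_normalize(v):
    return v / (np.linalg.norm(v) + 1e-12)

def listing_vector(photo_embs):
    # Collapse many photo embeddings into one image-modality vector.
    return l2_normalize(np.mean(photo_embs, axis=0))

rng = np.random.default_rng(0)
index = {f"listing_{i}": listing_vector(rng.normal(size=(5, 64)))
         for i in range(3)}
query = l2_normalize(rng.normal(size=64))   # stands in for an encoded text query
scores = {name: float(vec @ query) for name, vec in index.items()}
print(max(scores, key=scores.get))          # zero-shot semantic retrieval
```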
[465] MyGO: Memory Yielding Generative Offline-consolidation for Lifelong Learning Systems
Shihao Ji, Zihui Song
Main category: cs.LG
TL;DR: MyGO is a lifelong learning framework using generative memory models instead of storing raw data, inspired by biological wake-sleep cycles to prevent catastrophic forgetting while maintaining privacy and storage efficiency.
Details
Motivation: Address limitations of existing continual learning approaches that rely on storing previous task samples (experience replay) or complex regularization, which face challenges with data privacy, storage limitations, and performance degradation on dissimilar tasks.Method: Two-phase approach: 1) “Wake” phase - rapidly learn new task and train compact generative model (G-mem) to capture data distribution; 2) “Sleep” phase - offline consolidation using all learned G-mem models to generate pseudo-data (“dreams”) and consolidate knowledge into core feature extractor via knowledge distillation.
Result: MyGO significantly mitigates catastrophic forgetting and maintains high average accuracy across tasks on computer vision (Split-MNIST) and natural language processing (Split-AG News) benchmarks compared to sequential fine-tuning baseline.
Conclusion: MyGO provides an effective, domain-general lifelong learning framework that eliminates raw data storage needs, offers privacy and storage efficiency advantages, and successfully prevents catastrophic forgetting through generative memory consolidation.
Abstract: Continual or Lifelong Learning aims to develop models capable of acquiring new knowledge from a sequence of tasks without catastrophically forgetting what has been learned before. Existing approaches often rely on storing samples from previous tasks (experience replay) or employing complex regularization terms to protect learned weights. However, these methods face challenges related to data privacy, storage limitations, and performance degradation when tasks are dissimilar. To address these challenges, we introduce MyGO (Memory Yielding Generative Offline-consolidation), a novel lifelong learning framework inspired by the biological wake-sleep cycle. During the “wake” phase, the system rapidly learns a new task and trains a compact generative model (Generative Memory, G-mem) to capture its data distribution. During the “sleep” phase, the system enters an offline state, using all learned G-mem models to generate pseudo-data (“dreams”) and consolidate new and old knowledge into a core feature extractor via knowledge distillation. This approach obviates the need to store any raw data, retaining only compact generative models, which offers significant advantages in privacy and storage efficiency. We evaluate MyGO on computer vision (Split-MNIST) and natural language processing (Split-AG News) benchmarks, comparing it against a sequential fine-tuning baseline. The results demonstrate that MyGO significantly mitigates catastrophic forgetting and maintains high average accuracy across tasks, proving the framework’s effectiveness and domain-generality.
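The wake-sleep loop is easy to sketch at the control-flow level. In the toy below, `GMem` is a stand-in generative memory that simply resamples stored items, and `train`/`distill` are injected callables; the paper's actual components are neural networks.

```python
from dataclasses import dataclass
import random

# Skeleton of the wake-sleep cycle: wake = learn task + fit generative
# memory; sleep = replay "dreams" from all memories and consolidate.
# All components here are toy stand-ins.

@dataclass
class GMem:
    data: list
    def sample(self, n):               # "dreams": pseudo-data from memory
        return [random.choice(self.data) for _ in range(n)]

def lifelong_learn(tasks, train, distill):
    core, g_mems = {}, []
    for data in tasks:
        train(core, data)              # wake: rapid learning of the new task
        g_mems.append(GMem(data))      # wake: fit compact generative model
        dreams = [x for g in g_mems for x in g.sample(8)]
        distill(core, dreams)          # sleep: consolidate old + new knowledge
    return core

core = lifelong_learn([[1, 2], [3, 4]],
                      train=lambda m, d: m.setdefault("seen", []).extend(d),
                      distill=lambda m, dreams: None)
print(core)
```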
[466] CaTS-Bench: Can Language Models Describe Time Series?
Luca Zhou, Pratham Yashwante, Marshall Fisher, Alessio Sampieri, Zihao Zhou, Fabio Galasso, Rose Yu
Main category: cs.LG
TL;DR: CaTS-Bench is a new benchmark for context-aware time series captioning with human-rewritten captions across 11 domains, plus synthetic data generation and evaluation tools.
Details
Motivation: Existing time series captioning benchmarks have limitations: they use synthetic or generic captions, neglect metadata and visual representations, and lack human-annotated quality data for proper evaluation.Method: 1) Created CaTS-Bench with 1746 human-rewritten captions across 11 diverse domains for gold-standard evaluation. 2) Developed scalable pipeline for generating high-fidelity synthetic captions to address data scarcity. 3) Built diagnostic suite with 910 multiple-choice questions and tailored numeric metrics.
Result: Proprietary vision-language models struggle with numeric nuances in temporal descriptions. Finetuning open-source models on synthetic data yields substantial performance gains. The synthetic caption quality was validated.
Conclusion: CaTS-Bench establishes a reliable foundation for grounded, multimodal language generation in numeric domains, addressing key gaps in time series captioning evaluation and enabling better model development.
Abstract: Time series captioning, the task of describing time series in natural language, requires numeric and temporal reasoning, trend interpretation, and contextual understanding. Existing benchmarks, however, often rely on fully synthetic or generic captions, and typically neglect metadata and visual representations. We introduce CaTS-Bench, a comprehensive benchmark for Context-aware Time Series reasoning across 11 diverse domains, centered on a gold-standard evaluation set of 1746 human-rewritten captions that measure how effectively models translate numeric trends into immediately interpretable narratives. To address the scarcity of human-annotated data, we also propose a scalable pipeline for generating high-fidelity synthetic captions, the quality of which we validate. We evaluate leading Vision-Language Models on our benchmark, revealing that even proprietary models struggle to capture numeric nuances in temporal descriptions, while finetuning open-source models on synthetic data yields substantial performance gains. Finally, we release a diagnostic suite of 910 multiple-choice questions and tailored numeric metrics to gauge time-series-specific reasoning capabilities, establishing CaTS-Bench as a reliable foundation for grounded, multimodal language generation in numeric domains.
[467] Towards Understanding Feature Learning in Parameter Transfer
Hua Yuan, Xuran Meng, Qiufeng Wang, Shiyu Xia, Ning Xu, Xu Yang, Jing Wang, Xin Geng, Yong Rui
Main category: cs.LG
TL;DR: Theoretical analysis of partial parameter transfer in ReLU CNNs, identifying conditions for beneficial transfer and factors causing negative transfer.
Details
Motivation: There's a lack of theoretical understanding about when partial parameter transfer is beneficial and what factors govern its effectiveness, despite parameter transfer being central to transfer learning.Method: Analyze ReLU convolutional neural networks (CNNs) in a theoretical framework, characterizing how inherited parameters act as universal knowledge carriers and identifying key factors affecting transfer effectiveness.
Result: Provides first dynamic analysis for parameter transfer and first theoretical proof of negative transfer existence. Numerical and real-world experiments validate theoretical findings.
Conclusion: Theoretical framework explains conditions for beneficial parameter transfer and reasons for negative transfer, advancing understanding of transfer learning mechanisms.
Abstract: Parameter transfer is a central paradigm in transfer learning, enabling knowledge reuse across tasks and domains by sharing model parameters between upstream and downstream models. However, when only a subset of parameters from the upstream model is transferred to the downstream model, there remains a lack of theoretical understanding of the conditions under which such partial parameter reuse is beneficial and of the factors that govern its effectiveness. To address this gap, we analyze a setting in which both the upstream and downstream models are ReLU convolutional neural networks (CNNs). Within this theoretical framework, we characterize how the inherited parameters act as carriers of universal knowledge and identify key factors that amplify their beneficial impact on the target task. Furthermore, our analysis provides insight into why, in certain cases, transferring parameters can lead to lower test accuracy on the target task than training a new model from scratch. To the best of our knowledge, our theory is the first to provide a dynamic analysis for parameter transfer and the first to prove the existence of negative transfer theoretically. Numerical and real-world data experiments empirically validate our theoretical findings.
[468] Operational early warning of thunderstorm-driven power outages from open data: a two-stage machine learning approach
Iryna Stanishevska, Seth Guikema
Main category: cs.LG
TL;DR: 48-hour early-warning model for thunderstorm power outages using only public data, with two-stage LSTM architecture that filters noise and focuses on severe events.
Details
Motivation: Thunderstorm-driven power outages are difficult to predict due to chaotic convective processes, noisy public data, and understudied nature despite rising economic losses from severe convective storms.Method: Two-stage LSTM-based architecture using only open-source outage (EAGLE-I) and weather (METAR) data. First stage uses logistic gate to filter routine periods, second stage uses LSTM regressor. Features include parameter-specific kriging to preserve convective extremes, rolling/k-NN inverse-distance aggregates to capture moisture advection, wind shifts, and pressure drops.
Result: Model detects more outage peaks (≥50,000 customers) with only one additional false alarm, reduces peak-conditional MASE (cMASE), and provides event-focused early warnings without utility-specific data.
Conclusion: The two-stage model effectively provides early warnings for thunderstorm-induced power outages using only public data, addressing the challenge of predicting rare but impactful convective events.
Abstract: Thunderstorm-driven power outages are difficult to predict because most storms do not cause damage, convective processes occur rapidly and chaotically, and the available public data are noisy and incomplete. Severe convective storms now account for a large and rising share of U.S. weather losses, yet thunderstorm-induced outages remain understudied. We develop a 48-hour early-warning model for summer thunderstorm-related outages in Michigan using only open-source outage (EAGLE-I) and weather (METAR) data. Relative to prior work, we (i) rely solely on public data, (ii) preserve convective extremes from a sparse station network via parameter-specific kriging and causal spatiotemporal features, and (iii) use a multi-level LSTM-based architecture evaluated on event-centric peak metrics. The pipeline builds rolling and k-NN inverse-distance aggregates to capture moisture advection, wind shifts, and pressure drops. A two-stage design uses a logistic gate followed by a long short-term memory (LSTM) regressor to filter routine periods and limit noise exposure. Evaluation focuses on state-level peaks of at least 50,000 customers without power, using hits, misses, false alarms, and peak-conditional MASE (cMASE) within 48-hour windows, with uncertainty quantified by block bootstrapping. On the test sample, the Two-Stage model detects more peaks with only one additional false alarm and reduces cMASE near peaks, providing event-focused early warnings without the utility-specific data.
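A minimal sketch of the two-stage gate-then-regress design: a logistic score acts as the gate that filters routine periods, and a linear head stands in for the LSTM regressor. Weights, features, and the threshold below are illustrative only.

```python
import numpy as np

# Two-stage prediction: stage 1 gates out routine periods; stage 2 predicts
# outage magnitude only when the gate fires. The linear head is a stand-in
# for the paper's LSTM regressor.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def two_stage_predict(x, gate_w, reg_w, threshold=0.5):
    p_event = sigmoid(x @ gate_w)     # stage 1: storm-outage risk gate
    if p_event < threshold:
        return 0.0                     # routine period: no alert
    return float(x @ reg_w)            # stage 2: customers-out estimate

x = np.array([0.8, 1.2, -0.3])         # engineered weather features
print(two_stage_predict(x, np.array([1.0, 0.5, 0.2]),
                        np.array([20000., 15000., 5000.])))
```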
[469] WebGym: Scaling Training Environments for Visual Web Agents with Realistic Tasks
Hao Bai, Alexey Taymanov, Tong Zhang, Aviral Kumar, Spencer Whitehead
Main category: cs.LG
TL;DR: WebGym is a large-scale open-source environment with nearly 300K tasks for training visual web agents on real websites, featuring a high-throughput rollout system and showing significant performance improvements over proprietary models on unseen websites.
Details
Motivation: Real websites are non-stationary and diverse, making artificial or small-scale task sets insufficient for robust policy learning. There's a need for large-scale, realistic environments to train visual web agents effectively.Method: 1) Created WebGym with nearly 300,000 tasks across diverse real-world websites with rubric-based evaluations. 2) Developed a high-throughput asynchronous rollout system achieving 4-5x speedup for web agent sampling. 3) Used simple RL recipe training on agent’s own interaction traces with task rewards as feedback. 4) Fine-tuned Qwen-3-VL-8B-Instruct vision-language model on WebGym.
Result: Fine-tuning on WebGym improved success rate from 26.2% to 42.9% on out-of-distribution test set (websites never seen during training), significantly outperforming GPT-4o (27.1%) and GPT-5-Thinking (29.8%). The system achieved 4-5x rollout speedup compared to naive implementations.
Conclusion: WebGym enables effective training of visual web agents through large-scale, diverse real-world tasks and efficient rollout systems, demonstrating that open-source models can outperform proprietary ones when trained on appropriate large-scale environments.
Abstract: We present WebGym, the largest-to-date open-source environment for training realistic visual web agents. Real websites are non-stationary and diverse, making artificial or small-scale task sets insufficient for robust policy learning. WebGym contains nearly 300,000 tasks with rubric-based evaluations across diverse, real-world websites and difficulty levels. We train agents with a simple reinforcement learning (RL) recipe, which trains on the agent’s own interaction traces (rollouts), using task rewards as feedback to guide learning. To enable scaling RL, we first speed up sampling of trajectories in WebGym by developing a high-throughput asynchronous rollout system, designed specifically for web agents. Our system achieves a 4-5x rollout speedup compared to naive implementations. Second, we scale the task set breadth, depth, and size, which results in continued performance improvement. Fine-tuning a strong base vision-language model, Qwen-3-VL-8B-Instruct, on WebGym results in an improvement in success rate on an out-of-distribution test set from 26.2% to 42.9%, significantly outperforming agents based on proprietary models such as GPT-4o and GPT-5-Thinking that achieve 27.1% and 29.8%, respectively. This improvement is substantial because, unlike many prior works on training visual web agents, our test set consists only of tasks on websites never seen during training.
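The asynchronous rollout idea can be illustrated with a toy asyncio pattern: many episodes stay in flight at once so slow page loads do not serialize sampling. This is our own sketch, not the WebGym system.

```python
import asyncio

# Toy asynchronous rollout pattern: concurrent episodes so that slow
# environment steps (e.g., page loads) don't block the sampler.

async def rollout(task_id):
    trace = []
    for step in range(3):
        await asyncio.sleep(0.01)      # stand-in for a page interaction
        trace.append((task_id, step))
    return trace

async def sample_batch(n_tasks):
    return await asyncio.gather(*(rollout(i) for i in range(n_tasks)))

traces = asyncio.run(sample_batch(8))  # 8 episodes in flight at once
print(len(traces))
```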
[470] A Hybrid Computational Intelligence Framework with Metaheuristic Optimization for Drug-Drug Interaction Prediction
Maryam Abdollahi Shamami, Babak Teimourpour, Farshad Sharifi
Main category: cs.LG
TL;DR: Interpretable ML framework combining molecular embeddings and clinical knowledge for improved drug-drug interaction prediction, achieving high accuracy and clinical relevance.
Details
Motivation: DDIs cause preventable adverse events and increase healthcare costs, while knowing non-interacting drugs is equally important for safer prescriptions and better patient outcomes.Method: Combines Mol2Vec (fragment-level structure) and SMILES-BERT (contextual chemical features) embeddings with rule-based clinical score (RBScore) for pharmacological knowledge. Uses lightweight neural classifier optimized with three-stage metaheuristic (RSmpl-ACO-PSO) balancing exploration and refinement.
Result: Achieves high predictive accuracy (ROC-AUC 0.911, PR-AUC 0.867 on DrugBank) and generalizes well to Type 2 Diabetes Mellitus cohort. Studies show contributions of embedding fusion, RBScore, and optimizer to precision and robustness.
Conclusion: Provides practical pathway for building reliable, interpretable, and computationally efficient models to support safer drug therapies and clinical decision-making.
Abstract: Drug-drug interactions (DDIs) are a leading cause of preventable adverse events, often complicating treatment and increasing healthcare costs. At the same time, knowing which drugs do not interact is equally important, as such knowledge supports safer prescriptions and better patient outcomes. In this study, we propose an interpretable and efficient framework that blends modern machine learning with domain knowledge to improve DDI prediction. Our approach combines two complementary molecular embeddings - Mol2Vec, which captures fragment-level structural patterns, and SMILES-BERT, which learns contextual chemical features - together with a leakage-free, rule-based clinical score (RBScore) that injects pharmacological knowledge without relying on interaction labels. A lightweight neural classifier is then optimized using a novel three-stage metaheuristic strategy (RSmpl-ACO-PSO), which balances global exploration and local refinement for stable performance. Experiments on real-world datasets demonstrate that the model achieves high predictive accuracy (ROC-AUC 0.911, PR-AUC 0.867 on DrugBank) and generalizes well to a clinically relevant Type 2 Diabetes Mellitus cohort. Beyond raw performance, studies show how embedding fusion, RBScore, and the optimizer each contribute to precision and robustness. Together, these results highlight a practical pathway for building reliable, interpretable, and computationally efficient models that can support safer drug therapies and clinical decision-making.
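The fusion step can be sketched as simple feature concatenation: two embeddings per drug plus the label-free RBScore for the pair, fed to a small classifier. The dimensions below are arbitrary, and the concatenation scheme is our assumption about how the fusion might look.

```python
import numpy as np

# Illustrative feature fusion for a drug pair: Mol2Vec + SMILES-BERT
# embeddings for each drug, plus a rule-based clinical score. Dimensions
# and the scheme are assumptions for this sketch.

def pair_features(mol2vec_a, bert_a, mol2vec_b, bert_b, rb_score):
    return np.concatenate([mol2vec_a, bert_a, mol2vec_b, bert_b, [rb_score]])

rng = np.random.default_rng(1)
m_a, b_a = rng.normal(size=16), rng.normal(size=16)   # drug A embeddings
m_b, b_b = rng.normal(size=16), rng.normal(size=16)   # drug B embeddings
x = pair_features(m_a, b_a, m_b, b_b, rb_score=0.7)   # label-free prior
print(x.shape)   # (65,) -> input to the lightweight neural classifier
```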
[471] Reinforcement Learning for Tool-Integrated Interleaved Thinking towards Cross-Domain Generalization
Zhengyu Chen, Jinluan Yang, Teng Xiao, Ruochen Zhou, Luan Zhang, Xiangyu Xi, Xiaowei Shi, Wei Wang, Jinggang Wang
Main category: cs.LG
TL;DR: RITE framework enables LLM agents trained only on math tasks to generalize tool usage across diverse domains through continuous Plan-Action-Reflection cycles and robust optimization.
Details
Motivation: Current tool-augmented RL approaches struggle with cross-domain generalization, treating tool usage as linear/isolated events that become brittle when transferring from restricted domains (like math) to open-ended tasks.Method: Proposes RITE (Reinforcement Learning for Interleaved Tool Execution) with continuous Plan-Action-Reflection cycles, Dr. GRPO optimization objective for token-level loss aggregation with importance sampling, dual-component reward system, and dynamic curriculum via online rollout filtering.
Result: Achieves state-of-the-art performance across diverse reasoning domains despite being trained solely on math tasks, demonstrating high token efficiency and strong generalization capabilities.
Conclusion: The RITE framework enables robust cross-domain generalization of tool-augmented LLM agents through interleaved execution and sophisticated optimization techniques, overcoming limitations of traditional linear tool usage paradigms.
Abstract: Recent advances in large language models (LLMs) have demonstrated remarkable capabilities in reasoning and tool utilization. However, the generalization of tool-augmented reinforcement learning (RL) across diverse domains remains a significant challenge. Standard paradigms often treat tool usage as a linear or isolated event, which becomes brittle when transferring skills from restricted domains (e.g., mathematics) to open-ended tasks. In this work, we investigate the cross-domain generalization of an LLM agent trained exclusively on mathematical problem-solving. To facilitate robust skill transfer, we propose Reinforcement Learning for Interleaved Tool Execution (RITE). Unlike traditional methods, RITE enforces a continuous “Plan-Action-Reflection” cycle, allowing the model to ground its reasoning in intermediate tool outputs and self-correct during long-horizon tasks. To effectively train this complex interleaved policy, we introduce Dr. GRPO, a robust optimization objective that utilizes token-level loss aggregation with importance sampling to mitigate reward sparsity and high-variance credit assignment. Furthermore, we employ a dual-component reward system and dynamic curriculum via online rollout filtering to ensure structural integrity and sample efficiency. Extensive experiments reveal that our approach, despite being trained solely on math tasks, achieves state-of-the-art performance across diverse reasoning domains, demonstrating high token efficiency and strong generalization capabilities.
[472] L-MoE: End-to-End Training of a Lightweight Mixture of Low-Rank Adaptation Experts
Shihao Ji, Zihui Song
Main category: cs.LG
TL;DR: L-MoE: A Lightweight Mixture of LoRA Experts that combines MoE architecture with LoRA fine-tuning into an end-to-end trainable framework using low-rank adapters as experts and differentiable gating.
Details
Motivation: To create a parameter-efficient MoE model that combines the scalability benefits of Mixture of Experts (sparse activation) with the parameter efficiency of Low-Rank Adaptation (LoRA) for specialized task fine-tuning, enabling dynamic skill composition and end-to-end training.Method: Redefines MoE experts as collections of task-specialized low-rank adapters instead of dense feed-forward networks. Uses a lightweight gating network trained jointly with experts to compute weighted averages of LoRA parameters for each input token. The composition is fully differentiable, allowing gradients to flow through the entire architecture.
Result: A novel framework called L-MoE that provides formal mathematical framework for differentiable routing mechanism and joint optimization, creating modular, parameter-efficient MoE models that can be trained end-to-end.
Conclusion: L-MoE offers a new path toward building more efficient, scalable, and specialized language models by unifying MoE and LoRA paradigms into a single trainable framework that enables dynamic skill composition and parameter efficiency.
Abstract: The Mixture of Experts (MoE) architecture enables the scaling of Large Language Models (LLMs) to trillions of parameters by activating a sparse subset of weights for each input, maintaining constant computational cost during inference. Concurrently, Low-Rank Adaptation (LoRA) has emerged as a dominant technique for parameter-efficiently fine-tuning LLMs on specialized tasks. In this work, we unify these two paradigms into a novel, end-to-end trainable framework named L-MoE: a Lightweight Mixture of LoRA Experts. L-MoE redefines MoE experts not as dense feed-forward networks, but as a collection of task-specialized, low-rank adapters. A lightweight gating network, trained jointly with the experts, learns to dynamically compose these LoRA adapters by computing a weighted average of their parameters for each input token. This composition is fully differentiable, allowing gradients from a standard auto-regressive language modeling objective to flow back through the entire architecture, simultaneously refining both the expert adapters and the routing strategy. This approach creates a highly parameter-efficient MoE model that is modular by design, allows for dynamic skill composition, and is trainable from end-to-end. We present the formal mathematical framework for L-MoE, detailing the differentiable routing mechanism and the joint optimization objective, thereby providing a new path toward building more efficient, scalable, and specialized language models.
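A compact PyTorch sketch of the composition, under one reading of the paper's weighted averaging: per-token gate weights average the experts' low-rank updates, which is equivalent to applying the weighted-average effective delta-W per token, and the whole path stays differentiable. Shapes and initialization are our assumptions.

```python
import torch
import torch.nn as nn

# Sketch of an L-MoE-style layer: a gating network produces per-token
# weights over LoRA experts, and the expert updates are averaged
# differentiably. Design details are illustrative assumptions.

class LoRAMoE(nn.Module):
    def __init__(self, d_model, rank, n_experts):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)
        self.A = nn.Parameter(torch.randn(n_experts, d_model, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(n_experts, rank, d_model))

    def forward(self, x):                     # x: (batch, seq, d_model)
        w = torch.softmax(self.gate(x), -1)   # (batch, seq, n_experts)
        h = torch.einsum("bsd,edr->bser", x, self.A)      # x @ A_e
        upd = torch.einsum("bser,erd->bsed", h, self.B)   # (x @ A_e) @ B_e
        delta = torch.einsum("bse,bsed->bsd", w, upd)     # weighted average
        return delta    # added to the frozen base layer's output

layer = LoRAMoE(d_model=32, rank=4, n_experts=3)
print(layer(torch.randn(2, 5, 32)).shape)     # torch.Size([2, 5, 32])
```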
[473] Think Outside the Policy: In-Context Steered Policy Optimization
Hsiu-Yuan Huang, Chenming Tang, Weijie Liu, Clive Bai, Saiyong Yang, Yunfang Wu
Main category: cs.LG
TL;DR: ICPO is a new RLVR framework that uses in-context learning to provide expert guidance from existing datasets, improving exploration and training stability without needing advanced model trajectories.
Details
Motivation: Existing RLVR methods like GRPO have limited exploration due to on-policy rollouts confined to current policy distribution. Recent approaches use expert model trajectories but increase computational costs and require inaccessible advanced models.Method: ICPO leverages LRMs’ in-context learning capability to provide expert guidance from existing datasets. It introduces mixed-policy GRPO with implicit expert forcing, expert region reject sampling to filter unreliable trajectories, and annealed expert-bonus reward shaping.
Result: ICPO consistently enhances RLVR performance and training stability on mathematical reasoning benchmarks, demonstrating a scalable and effective RLVR paradigm for LRMs.
Conclusion: ICPO provides a unified framework that expands exploration beyond current policy distribution without requiring advanced LRM trajectories, offering a more accessible and stable approach to RLVR for large reasoning models.
Abstract: Existing Reinforcement Learning from Verifiable Rewards (RLVR) methods, such as Group Relative Policy Optimization (GRPO), have achieved remarkable progress in improving the reasoning capabilities of Large Reasoning Models (LRMs). However, they exhibit limited exploration due to reliance on on-policy rollouts which are confined to the current policy’s distribution, resulting in narrow trajectory diversity. Recent approaches attempt to expand policy coverage by incorporating trajectories generated from stronger expert models, yet this reliance increases computational cost and such advanced models are often inaccessible. To address these issues, we propose In-Context Steered Policy Optimization (ICPO), a unified framework that leverages the inherent in-context learning capability of LRMs to provide expert guidance using existing datasets. ICPO introduces mixed-policy GRPO with implicit expert forcing, which expands exploration beyond the current policy distribution without requiring advanced LRM trajectories. To further stabilize optimization, ICPO integrates expert region reject sampling to filter unreliable off-policy trajectories and annealed expert-bonus reward shaping to balance early expert guidance with later autonomous improvement. Results demonstrate that ICPO consistently enhances RLVR performance and training stability on mathematical reasoning benchmarks, revealing a scalable and effective RLVR paradigm for LRMs. Our code is available at https://anonymous.4open.science/r/ICPO.
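The mixed-policy group construction can be sketched in a few lines: pool on-policy and expert-guided rewards, reject weak expert trajectories, add an annealed expert bonus, and compute group-normalized advantages in the GRPO style. The threshold and annealing schedule below are invented for illustration and are not the paper's hyperparameters.

```python
import numpy as np

# Illustrative mixed-policy group for GRPO-style advantages: expert-region
# reject sampling plus an annealed expert bonus. Constants are made up.

def mixed_group_advantages(on_policy_r, expert_r, step, total_steps,
                           min_expert_reward=0.1):
    bonus = 0.2 * (1.0 - step / total_steps)      # annealed expert bonus
    kept_expert = [r + bonus for r in expert_r if r >= min_expert_reward]
    group = np.array(on_policy_r + kept_expert)
    return (group - group.mean()) / (group.std() + 1e-8)

print(mixed_group_advantages([0.0, 1.0, 0.0], [1.0, 0.05],
                             step=10, total_steps=100))
```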
[474] Transolver is a Linear Transformer: Revisiting Physics-Attention through the Lens of Linear Attention
Wenjie Hu, Sidun Liu, Peng Qiao, Zhenglun Sun, Yong Dou
Main category: cs.LG
TL;DR: The paper proposes Linear Attention Neural Operator (LinearNO), which reformulates Physics-Attention from Transolver as canonical linear attention, achieving better performance with fewer parameters and lower computational cost across multiple PDE benchmarks.
Details
Motivation: The authors observe that Physics-Attention in Transolver can be reformulated as a special case of linear attention, and that its slice attention mechanism may actually hurt model performance. They argue that the effectiveness of Physics-Attention primarily comes from slice and deslice operations rather than interactions between slices.Method: The authors propose a two-step transformation to redesign Physics-Attention into a canonical linear attention, creating Linear Attention Neural Operator (LinearNO). This approach eliminates the potentially harmful slice attention while maintaining the beneficial aspects of the original architecture.
Result: LinearNO achieves state-of-the-art performance on six standard PDE benchmarks while reducing parameters by 40.0% and computational cost by 36.2% on average. It also delivers superior performance on two challenging industrial-level datasets: AirfRANS and Shape-Net Car.
Conclusion: The paper demonstrates that reformulating Physics-Attention as canonical linear attention leads to more efficient and effective neural operators for PDE solving, challenging the conventional wisdom about the importance of slice interactions in Physics-Attention.
Abstract: Recent advances in Transformer-based Neural Operators have enabled significant progress in data-driven solvers for Partial Differential Equations (PDEs). Most current research has focused on reducing the quadratic complexity of attention to address the resulting low training and inference efficiency. Among these works, Transolver stands out as a representative method that introduces Physics-Attention to reduce computational costs. Physics-Attention projects grid points into slices for slice attention, then maps them back through deslicing. However, we observe that Physics-Attention can be reformulated as a special case of linear attention, and that the slice attention may even hurt the model performance. Based on these observations, we argue that its effectiveness primarily arises from the slice and deslice operations rather than interactions between slices. Building on this insight, we propose a two-step transformation to redesign Physics-Attention into a canonical linear attention, which we call Linear Attention Neural Operator (LinearNO). Our method achieves state-of-the-art performance on six standard PDE benchmarks, while reducing the number of parameters by an average of 40.0% and computational cost by 36.2%. Additionally, it delivers superior performance on two challenging, industrial-level datasets: AirfRANS and Shape-Net Car.
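The slice/deslice pipeline that the paper identifies as the useful part of Physics-Attention can be sketched directly; note how soft assignment, aggregation, and broadcast-back compose into a linear-attention-like operator with cost linear in the number of points. The random projection here stands in for a learned one.

```python
import torch

# Sketch of slice/deslice: points are softly assigned to M slices,
# aggregated into slice tokens, then broadcast back. The projection is
# random here for illustration; in practice it is learned.

def slice_deslice(x, proj):              # x: (N, d) points, proj: (d, M)
    w = torch.softmax(x @ proj, dim=-1)                   # (N, M) assignments
    slices = w.t() @ x / (w.sum(0).unsqueeze(1) + 1e-8)   # (M, d) slice tokens
    return w @ slices                                     # deslice: (N, d)

x = torch.randn(100, 16)                 # 100 mesh points
proj = torch.randn(16, 8)                # 8 slices
print(slice_deslice(x, proj).shape)      # torch.Size([100, 16])
```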
[475] Parametric Expensive Multi-Objective Optimization via Generative Solution Modeling
Tingyang Wei, Jiao Liu, Abhishek Gupta, Chin Chun Ooi, Puay Siew Tan, Yew-Soon Ong
Main category: cs.LG
TL;DR: Novel Bayesian optimization framework for parametric expensive multi-objective problems that learns an inverse model to predict optimized solutions for any task-preference query without expensive re-evaluation.
Details
Motivation: Real-world applications often involve families of expensive multi-objective optimization problems under varying conditions, creating parametric problems with infinite distinct instances. Current methods require separate expensive evaluations for each task, which is impractical for continuous parameter spaces.Method: Parametric multi-task multi-objective Bayesian optimizer that alternates between: (1) acquisition-driven search leveraging inter-task synergies using task-aware Gaussian processes, and (2) generative solution sampling via conditional generative models.
Result: Theoretical justification for faster convergence through inter-task synergies, and empirical verification on synthetic and real-world benchmarks showing effectiveness of the generative alternating framework.
Conclusion: The proposed approach enables efficient optimization across related tasks and achieves direct solution prediction for unseen parameterized EMOPs without additional expensive evaluations, addressing the fundamental challenge of infinite distinct problems in continuous parameter spaces.
Abstract: Many real-world applications require solving families of expensive multi-objective optimization problems (EMOPs) under varying operational conditions. This gives rise to parametric expensive multi-objective optimization problems (P-EMOPs), where each task parameter defines a distinct optimization instance. Current multi-objective Bayesian optimization methods have been widely used for finding finite sets of Pareto optimal solutions for individual tasks. However, P-EMOPs present a fundamental challenge: the continuous task parameter space can contain infinitely many distinct problems, each requiring separate expensive evaluations. This demands learning an inverse model that can directly predict optimized solutions for any task-preference query without expensive re-evaluation. This paper introduces a novel parametric multi-task multi-objective Bayesian optimizer that learns this inverse model by alternating between (1) acquisition-driven search leveraging inter-task synergies and (2) generative solution sampling via conditional generative models. This approach enables efficient optimization across related tasks and ultimately achieves direct solution prediction for unseen parameterized EMOPs without additional expensive evaluations. We theoretically justify the faster convergence obtained by leveraging inter-task synergies through task-aware Gaussian processes, and empirical studies of our optimizer and inverse model on synthetic and real-world benchmarks further verify the effectiveness of the proposed generative alternating framework.
[476] CausalProfiler: Generating Synthetic Benchmarks for Rigorous and Transparent Evaluation of Causal Machine Learning
Panayiotis Panayiotou, Audrey Poinsot, Alessandro Leite, Nicolas Chesneau, Marc Schoenauer, Özgür Şimşek
Main category: cs.LG
TL;DR: CausalProfiler is a synthetic benchmark generator for Causal ML that randomly samples causal models, data, queries, and ground truths to enable rigorous evaluation of methods across observation, intervention, and counterfactual reasoning levels.
Details
Motivation: Current Causal ML evaluation practices are limited, relying on few hand-crafted or semi-synthetic datasets, leading to brittle and non-generalizable conclusions. There's a need for more comprehensive benchmarking tools.Method: CausalProfiler randomly samples causal models, data, queries, and ground truths based on explicit design choices about causal models, queries, and data classes. It operates across three levels of causal reasoning: observation, intervention, and counterfactual.
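The sampling idea can be illustrated with a small stand-in generator (linear-Gaussian SCMs only; the actual CausalProfiler covers far richer model, query, and data classes):

```python
import numpy as np

rng = np.random.default_rng(7)

def sample_linear_scm(d=5, edge_prob=0.4):
    """Sample a random DAG with linear-Gaussian mechanisms."""
    order = rng.permutation(d)
    W = np.zeros((d, d))                  # W[p, c] = weight of edge p -> c
    for i in range(d):
        for j in range(i + 1, d):
            if rng.random() < edge_prob:
                W[order[i], order[j]] = rng.uniform(0.5, 1.5)
    return W, order

def simulate(W, order, n, do=None):
    """Ancestral sampling; do={node: value} clamps a node (intervention)."""
    X = np.zeros((n, W.shape[0]))
    for j in order:                       # topological order by construction
        X[:, j] = X @ W[:, j] + rng.standard_normal(n)
        if do and j in do:
            X[:, j] = do[j]
    return X

W, order = sample_linear_scm()
obs  = simulate(W, order, n=2000)                       # observational level
intv = simulate(W, order, n=2000, do={order[0]: 1.0})   # interventional level
# Ground truth for queries like E[X_k | do(X_root = 1)] follows analytically
# from W, so a method's estimate can be scored exactly against it.
```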
Result: The paper demonstrates CausalProfiler’s utility by evaluating several state-of-the-art Causal ML methods under diverse conditions and assumptions, both in and out of the identification regime, showing the types of analyses it enables.
Conclusion: CausalProfiler provides the first random generator of synthetic causal benchmarks with coverage guarantees and transparent assumptions, enabling rigorous and transparent evaluation of Causal ML methods across different causal reasoning levels.
Abstract: Causal machine learning (Causal ML) aims to answer “what if” questions using machine learning algorithms, making it a promising tool for high-stakes decision-making. Yet, empirical evaluation practices in Causal ML remain limited. Existing benchmarks often rely on a handful of hand-crafted or semi-synthetic datasets, leading to brittle, non-generalizable conclusions. To bridge this gap, we introduce CausalProfiler, a synthetic benchmark generator for Causal ML methods. Based on a set of explicit design choices about the class of causal models, queries, and data considered, the CausalProfiler randomly samples causal models, data, queries, and ground truths constituting the synthetic causal benchmarks. In this way, Causal ML methods can be rigorously and transparently evaluated under a variety of conditions. This work offers the first random generator of synthetic causal benchmarks with coverage guarantees and transparent assumptions operating on the three levels of causal reasoning: observation, intervention, and counterfactual. We demonstrate its utility by evaluating several state-of-the-art methods under diverse conditions and assumptions, both in and out of the identification regime, illustrating the types of analyses and insights the CausalProfiler enables.
[477] The Mean-Field Dynamics of Transformers
Philippe Rigollet
Main category: cs.LG
TL;DR: Transformers are modeled as interacting particle systems with continuum limits connecting to Wasserstein gradient flows, synchronization models, and clustering dynamics, revealing global clustering phenomena and phase transitions in attention mechanisms.
Details
Motivation: To develop a mathematical framework for understanding Transformer attention dynamics through continuum limits and particle system interpretations, connecting to established mathematical theories to analyze clustering behavior and representation collapse in deep architectures.Method: Develop mathematical framework interpreting Transformer attention as interacting particle system; study continuum (mean-field) limits; idealize attention on sphere; connect to Wasserstein gradient flows, Kuramoto synchronization models, and mean-shift clustering; analyze equiangular reduction for exact clustering rates; examine normalization scheme effects.
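A minimal NumPy simulation of the idealized dynamics (tokens as particles on the sphere, softmax attention as the interaction kernel; parameter values are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, beta, dt, steps = 64, 3, 4.0, 0.1, 5000

X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)        # tokens on the sphere

for _ in range(steps):
    logits = beta * (X @ X.T)
    A = np.exp(logits - logits.max(1, keepdims=True))
    A /= A.sum(1, keepdims=True)                     # softmax attention kernel
    V = A @ X                                        # interaction drift
    V -= (V * X).sum(1, keepdims=True) * X           # project to tangent space
    X += dt * V
    X /= np.linalg.norm(X, axis=1, keepdims=True)    # retract to the sphere

# Pairwise inner products drift toward 1 as tokens coalesce, often after
# long metastable phases with several transient clusters:
print("min pairwise <x_i, x_j>:", (X @ X.T).min())
```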
Result: Identifies global clustering phenomenon with tokens asymptotically clustering after long metastable states; obtains exact clustering rates via equiangular reduction; shows normalization schemes alter contraction speeds; identifies phase transition for long-context attention; reveals mechanisms driving representation collapse and regimes preserving expressive multi-cluster structure.
Conclusion: The mathematical framework provides deep insights into Transformer dynamics, connecting attention mechanisms to established mathematical theories, revealing clustering phenomena and phase transitions that explain both representation collapse and preservation of expressive structure in deep attention architectures.
Abstract: We develop a mathematical framework that interprets Transformer attention as an interacting particle system and studies its continuum (mean-field) limits. By idealizing attention on the sphere, we connect Transformer dynamics to Wasserstein gradient flows, synchronization models (Kuramoto), and mean-shift clustering. Central to our results is a global clustering phenomenon whereby tokens cluster asymptotically after long metastable states where they are arranged into multiple clusters. We further analyze a tractable equiangular reduction to obtain exact clustering rates, show how commonly used normalization schemes alter contraction speeds, and identify a phase transition for long-context attention. The results highlight both the mechanisms that drive representation collapse and the regimes that preserve expressive, multi-cluster structure in deep attention architectures.
[478] GatedFWA: Linear Flash Windowed Attention with Gated Associative Memory
Jiaxu Liu, Yuhe Bai, Xiangyu Yin, Christos-Savvas Bouganis
Main category: cs.LG
TL;DR: GatedFWA is a memory-gated flash windowed attention mechanism that combines the efficiency of sliding window attention with stable memory updates and controllable gradient flow, addressing limitations of both softmax attention and SWA.
Details
Motivation: Softmax full attention scales quadratically with sequence length, while Sliding Window Attention (SWA) achieves linear-time efficiency but has unbounded training objectives under associative memory interpretation. Softmax attention suffers from memory shrinkage and gradient vanishing.Method: GatedFWA accumulates per-token/head gates into decay biases added to attention logits, acting as learnable contraction in memory recurrence. It uses fused one-pass gate preprocessing and a FlashAttention-compatible kernel with sliding mask injection for I/O efficiency and numerical stability.
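A naive, unfused reference of that reading (our interpretation for illustration, not the paper's kernel): a sigmoid gate per token is accumulated in log-space and added to the logits as a decay bias, so older tokens within the window are contracted.

```python
import numpy as np

def gated_window_attention(q, k, v, gate_logit, w=16):
    """Naive reference (not the fused kernel): causal sliding-window
    attention with a cumulative per-token decay bias on the logits."""
    T, d = q.shape
    log_g = -np.logaddexp(0.0, -gate_logit)   # log sigmoid: gate in (0, 1)
    G = np.cumsum(log_g)                      # accumulated log-decay
    out = np.zeros_like(v)
    for i in range(T):
        lo = max(0, i - w + 1)
        logits = q[i] @ k[lo:i + 1].T / np.sqrt(d)
        logits += G[i] - G[lo:i + 1]          # older tokens decay more
        p = np.exp(logits - logits.max())
        p /= p.sum()
        out[i] = p @ v[lo:i + 1]
    return out

rng = np.random.default_rng(0)
T, d = 128, 32
q, k, v = rng.standard_normal((3, T, d))
y = gated_window_attention(q, k, v, rng.standard_normal(T))
```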
Result: GatedFWA delivers competitive throughput with negligible overhead, better utilization of global context, clean integration with token compression/selection methods like NSA, and generalization to various autoregressive domains.
Conclusion: GatedFWA preserves SWA’s efficiency while stabilizing memory updates and making gradient flow controllable, offering a practical solution for efficient and stable attention mechanisms in autoregressive models.
Abstract: Modern autoregressive models rely on attention, yet the Softmax full attention in Transformers scales quadratically with sequence length. Sliding Window Attention (SWA) achieves linear-time encoding/decoding by constraining the attention pattern, but under an \textit{Associative Memory} interpretation, its difference-style update renders the training objective effectively \emph{unbounded}. In contrast, Softmax attention normalizes updates, leading to \emph{memory shrinkage and gradient vanishing}. We propose GatedFWA: a Memory-\underline{Gated} (\underline{F}lash) \underline{W}indowed \underline{A}ttention mechanism that preserves SWA's efficiency while stabilizing memory updates and making gradient flow controllable. In essence, GatedFWA accumulates a per-token/head gate into a decay bias added to the attention logits, acting as a learnable contraction in the memory recurrence. We implement a fused one-pass gate preprocessing and a FlashAttention-compatible kernel that injects the gate under a sliding mask, ensuring I/O efficiency and numerical stability. On language modelling benchmarks, GatedFWA delivers competitive throughput with negligible overhead and better use of global context, and it integrates cleanly with token compression/selection methods such as NSA and generalizes to various autoregressive domains.
[479] Error-Free Linear Attention is a Free Lunch: Exact Solution from Continuous-Time Dynamics
Jingdi Lei, Di Zhang, Soujanya Poria
Main category: cs.LG
TL;DR: EFLA is a numerically stable, error-free linear attention mechanism that solves the quadratic cost bottleneck in long-context models while maintaining full parallelism and linear-time complexity.
Details
Motivation: To address the quadratic computational cost bottleneck of softmax attention in long-context language models, and to create a more stable and theoretically sound alternative to existing linear-time attention and State Space Models.Method: Formulates online learning update as continuous-time dynamical system, leverages rank-1 structure of dynamics matrix to derive exact closed-form solution corresponding to infinite-order Runge-Kutta method, achieving error-free computation.
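The rank-1 closed form can be checked directly. For a rank-1 dynamics matrix A = βkkᵀ, the matrix exponential truncates to an exact expression, which is what an "infinite-order" integrator buys over a first-order delta-rule step (a sketch of the underlying identity, not the paper's attention kernel):

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(1)
d, beta = 8, 0.7
k = rng.standard_normal((d, 1))

# For A = beta * k k^T, using (k k^T)^n = ||k||^{2(n-1)} k k^T, the series
# for the matrix exponential sums exactly:
#   exp(-A) = I + (exp(-beta ||k||^2) - 1) * k k^T / ||k||^2
nrm2 = float(k.T @ k)
exact = np.eye(d) + (np.exp(-beta * nrm2) - 1.0) * (k @ k.T) / nrm2

assert np.allclose(exact, expm(-beta * (k @ k.T)))   # matches the series limit
# The usual first-order Euler step, I - beta k k^T, only approximates this.
```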
Result: EFLA achieves lower language modeling perplexity and superior downstream benchmark performance than DeltaNet without additional parameters, with robust performance in noisy environments.
Conclusion: Provides a new theoretical foundation for building high-fidelity, scalable linear-time attention models with error-free computation and full parallelism.
Abstract: Linear-time attention and State Space Models (SSMs) promise to solve the quadratic cost bottleneck in long-context language models employing softmax attention. We introduce Error-Free Linear Attention (EFLA), a numerically stable, fully parallel, and generalized formulation of the delta rule. Specifically, we formulate the online learning update as a continuous-time dynamical system and prove that its exact solution is not only attainable but also computable in linear time with full parallelism. By leveraging the rank-1 structure of the dynamics matrix, we directly derive the exact closed-form solution effectively corresponding to the infinite-order Runge-Kutta method. This attention mechanism is theoretically free from error accumulation, perfectly capturing the continuous dynamics while preserving the linear-time complexity. Through an extensive suite of experiments, we show that EFLA enables robust performance in noisy environments, achieving lower language modeling perplexity and superior downstream benchmark performance than DeltaNet without introducing additional parameters. Our work provides a new theoretical foundation for building high-fidelity, scalable linear-time attention models.
[480] Measuring Uncertainty Calibration
Kamil Ciosek, Nicolò Felicioni, Sina Ghiassian, Juan Elenter Litwin, Francesco Tonolini, David Gustafsson, Eva Garcia-Martin, Carmen Barcena Gonzalez, Raphaëlle Bertrand-Lalo
Main category: cs.LG
TL;DR: The paper provides non-asymptotic, distribution-free methods for estimating L₁ calibration error of binary classifiers with bounded variation assumptions and practical modification techniques.
Details
Motivation: There's a need for practical, finite-sample methods to estimate calibration error of binary classifiers without restrictive assumptions, as existing approaches may have limitations in real-world applications.Method: Two main contributions: 1) An upper bound for classifiers with bounded variation calibration functions, 2) A method to modify any classifier to enable efficient upper bounding of calibration error without significantly impacting performance.
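For reference, the quantity being bounded is the L₁ calibration error; a standard plug-in binned estimate looks as follows. This naive estimator is itself biased, which is precisely what motivates the paper's non-asymptotic bounds and classifier modification (sketch, not the paper's method):

```python
import numpy as np

def binned_l1_calibration_error(probs, labels, n_bins=15):
    """Plug-in estimate of the L1 calibration error E|E[Y|p] - p| for a
    binary classifier, using equal-width confidence bins."""
    probs, labels = np.asarray(probs), np.asarray(labels)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(probs, edges[1:-1]), 0, n_bins - 1)
    err, n = 0.0, len(probs)
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            gap = abs(labels[mask].mean() - probs[mask].mean())
            err += (mask.sum() / n) * gap
    return err

rng = np.random.default_rng(0)
p = rng.uniform(size=10000)
y = rng.binomial(1, p ** 1.3)     # a deliberately miscalibrated predictor
print(binned_l1_calibration_error(p, y))
```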
Result: The paper provides non-asymptotic, distribution-free results with practical procedures that can be implemented on real-world datasets with modest computational overhead.
Conclusion: The methods yield practical calibration error measurement procedures and provide advice on how to measure calibration error effectively in practice.
Abstract: We make two contributions to the problem of estimating the $L_1$ calibration error of a binary classifier from a finite dataset. First, we provide an upper bound for any classifier where the calibration function has bounded variation. Second, we provide a method of modifying any classifier so that its calibration error can be upper bounded efficiently without significantly impacting classifier performance and without any restrictive assumptions. All our results are non-asymptotic and distribution-free. We conclude by providing advice on how to measure calibration error in practice. Our methods yield practical procedures that can be run on real-world datasets with modest overhead.
[481] On the Sample Complexity of Learning for Blind Inverse Problems
Nathan Buskulic, Luca Calatroni, Lorenzo Rosasco, Silvia Villa
Main category: cs.LG
TL;DR: Theoretical analysis of learning in blind inverse problems using Linear Minimum Mean Square Estimators (LMMSE), establishing connections to Tikhonov regularization and providing convergence guarantees with finite-sample error bounds.
Details
Motivation: Blind inverse problems (where forward operator is unknown) lack interpretable data-driven methods with theoretical guarantees, limiting reliability in applied domains like imaging.Method: Analyze blind inverse problems within LMMSE framework, derive closed-form optimal estimators, establish equivalence with Tikhonov-regularized formulations, prove convergence under source conditions, and derive finite-sample error bounds.
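A toy instantiation of the setting (zero-mean data, per-sample random operators; purely illustrative, not the paper's analysis): the LMMSE estimator is learned from (x, y) pairs alone and never observes the operators.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dx, dy = 5000, 6, 4

# Blind setting: each sample has its own (unknown) random forward operator.
x = rng.standard_normal((n, dx))
A = rng.standard_normal((n, dy, dx)) + np.eye(dy, dx)   # random operators
y = np.einsum('nij,nj->ni', A, x) + 0.1 * rng.standard_normal((n, dy))

# LMMSE estimator learned from (x, y) pairs alone: W = C_xy C_yy^{-1}
C_xy = x.T @ y / n
C_yy = y.T @ y / n
W = C_xy @ np.linalg.inv(C_yy)
x_hat = y @ W.T

print("relative MSE:", np.mean((x_hat - x) ** 2) / np.mean(x ** 2))
```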
Result: Established theoretical equivalences between LMMSE and Tikhonov regularization, proved convergence results, derived rigorous finite-sample error bounds quantifying impact of operator randomness, and validated with numerical experiments.
Conclusion: Provides rigorous theoretical foundation for learning in blind inverse problems, offering interpretable connections to regularization methods and explicit performance guarantees that address limitations of purely empirical approaches.
Abstract: Blind inverse problems arise in many experimental settings where the forward operator is partially or entirely unknown. In this context, methods developed for the non-blind case cannot be adapted in a straightforward manner. Recently, data-driven approaches have been proposed to address blind inverse problems, demonstrating strong empirical performance and adaptability. However, these methods often lack interpretability and are not supported by rigorous theoretical guarantees, limiting their reliability in applied domains such as imaging inverse problems. In this work, we shed light on learning in blind inverse problems within the simplified yet insightful framework of Linear Minimum Mean Square Estimators (LMMSEs). We provide an in-depth theoretical analysis, deriving closed-form expressions for optimal estimators and extending classical results. In particular, we establish equivalences with suitably chosen Tikhonov-regularized formulations, where the regularization depends explicitly on the distributions of the unknown signal, the noise, and the random forward operators. We also prove convergence results under appropriate source condition assumptions. Furthermore, we derive rigorous finite-sample error bounds that characterize the performance of learned estimators as a function of the noise level, problem conditioning, and number of available samples. These bounds explicitly quantify the impact of operator randomness and reveal the associated convergence rates as this randomness vanishes. Finally, we validate our theoretical findings through illustrative numerical experiments that confirm the predicted convergence behavior.
[482] Neural Optimal Design of Experiment for Inverse Problems
John E. Darges, Babak Maboudi Afkham, Matthias Chung
Main category: cs.LG
TL;DR: NODE is a learning-based framework for optimal experimental design that jointly trains neural reconstruction models with continuous design variables, avoiding classical bilevel optimization and indirect sparsity regularization.
Details
Motivation: Traditional optimal experimental design methods rely on bilevel optimization and indirect sparsity regularization (like l1 tuning), which are computationally expensive and require careful parameter tuning. There's a need for a more efficient approach that directly optimizes measurement locations while enforcing sparsity naturally.Method: NODE jointly trains a neural reconstruction model and continuous design variables (representing sensor locations, sampling times, or measurement angles) within a single optimization loop. It directly optimizes measurement locations rather than weighting dense candidate grids, enforcing sparsity by design without l1 regularization.
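A minimal PyTorch sketch of the joint loop on an exponential-growth toy (sizes, architecture, and the name `t_design` are placeholders): the K sampling times are plain parameters, so gradients from the reconstruction loss move the measurement locations directly.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
K = 4                                       # fixed measurement budget
t_design = nn.Parameter(torch.rand(K))      # continuous sampling times in [0, 1]
net = nn.Sequential(nn.Linear(K, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(list(net.parameters()) + [t_design], lr=1e-2)

for step in range(2000):
    a = torch.rand(128, 1) + 0.5            # unknowns of u(t) = a * exp(b t)
    b = torch.rand(128, 1) * 2.0
    t = t_design.clamp(0.0, 1.0)            # keep designs inside the window
    y = a * torch.exp(b * t) + 0.01 * torch.randn(128, K)
    loss = ((net(y) - torch.cat([a, b], dim=1)) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
# The reconstruction loss backpropagates through the measurement model into
# t_design, so sensor times and the network are optimized jointly.
```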
Result: NODE outperforms baseline approaches on three benchmarks: an analytically tractable exponential growth problem, MNIST image sampling, and a real-world sparse view X-ray CT example. It demonstrates improved reconstruction accuracy and task-specific performance while reducing computational complexity.
Conclusion: NODE provides an effective learning-based alternative to classical optimal experimental design methods, offering improved performance, reduced computational complexity, and eliminating the need for l1 tuning by enforcing sparsity directly through continuous design variable optimization.
Abstract: We introduce Neural Optimal Design of Experiments (NODE), a learning-based framework for optimal experimental design in inverse problems that avoids classical bilevel optimization and indirect sparsity regularization. NODE jointly trains a neural reconstruction model and a fixed-budget set of continuous design variables, representing sensor locations, sampling times, or measurement angles, within a single optimization loop. By optimizing measurement locations directly rather than weighting a dense grid of candidates, the proposed approach enforces sparsity by design, eliminates the need for l1 tuning, and substantially reduces computational complexity. We validate NODE on an analytically tractable exponential growth benchmark, on MNIST image sampling, and illustrate its effectiveness on a real-world sparse-view X-ray CT example. In all cases, NODE outperforms baseline approaches, demonstrating improved reconstruction accuracy and task-specific performance.
[483] Cycling Race Time Prediction: A Personalized Machine Learning Approach Using Route Topology and Training Load
Francisco Aguilera Moreno
Main category: cs.LG
TL;DR: Machine learning model predicts cycling ride duration using route topology and athlete fitness metrics, achieving 6.60 min MAE and 0.922 R², outperforming topology-only approaches by 14%.
Details
Motivation: Existing physics-based cycling duration prediction models require impractical parameters like aerodynamic drag coefficients and real-time wind forecasts that are inaccessible to most amateur cyclists. There's a need for simpler, data-driven approaches that can leverage available historical performance data.Method: Machine learning approach using Lasso regression with route topology features combined with athlete fitness state derived from training load metrics (Chronic Training Load - CTL, Acute Training Load - ATL). The model learns athlete-specific patterns from historical ride data in an N-of-1 study design with rigorous feature engineering to prevent data leakage.
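The modeling recipe is compact enough to sketch end-to-end with scikit-learn (synthetic stand-in data and hypothetical feature columns; the paper's exact features and preprocessing are not reproduced here):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Synthetic stand-in for an N=96 ride log. Hypothetical feature columns:
# distance_km, elevation_gain_m, CTL (chronic load), ATL (acute load).
rng = np.random.default_rng(3)
n = 96
X = np.column_stack([
    rng.uniform(30, 150, n),       # distance_km
    rng.uniform(100, 2500, n),     # elevation_gain_m
    rng.uniform(40, 90, n),        # CTL
    rng.uniform(30, 110, n),       # ATL
])
# Duration grows with distance/climbing and shrinks as fitness (CTL) rises.
y = 1.8 * X[:, 0] + 0.05 * X[:, 1] - 0.8 * X[:, 2] + 0.2 * X[:, 3] \
    + rng.normal(0, 6, n)

model = Lasso(alpha=0.1)
# Time-ordered CV avoids leaking future fitness state into past predictions.
mae = -cross_val_score(model, X, y, cv=TimeSeriesSplit(n_splits=5),
                       scoring="neg_mean_absolute_error").mean()
print(f"MAE: {mae:.2f} min")
```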
Result: The model achieved MAE=6.60 minutes and R²=0.922 on a single-athlete dataset (N=96 rides). Integrating fitness metrics reduced error by 14% compared to topology-only approach (MAE=7.66 min). Progressive checkpoint predictions enable dynamic race planning.
Conclusion: Machine learning with topology and fitness features provides accurate cycling duration predictions without complex physical measurements, demonstrating that physiological state meaningfully constrains performance even in self-paced efforts. The approach offers practical value for amateur cyclists’ training planning and event preparation.
Abstract: Predicting cycling duration for a given route is essential for training planning and event preparation. Existing solutions rely on physics-based models that require extensive parameterization, including aerodynamic drag coefficients and real-time wind forecasts, parameters impractical for most amateur cyclists. This work presents a machine learning approach that predicts ride duration using route topology features combined with the athlete’s current fitness state derived from training load metrics. The model learns athlete-specific performance patterns from historical data, substituting complex physical measurements with historical performance proxies. We evaluate the approach using a single-athlete dataset (N=96 rides) in an N-of-1 study design. After rigorous feature engineering to eliminate data leakage, we find that Lasso regression with Topology + Fitness features achieves MAE=6.60 minutes and R²=0.922. Notably, integrating fitness metrics (Chronic Training Load (CTL), Acute Training Load (ATL)) reduces error by 14% compared to topology alone (MAE=7.66 min), demonstrating that physiological state meaningfully constrains performance even in self-paced efforts. Progressive checkpoint predictions enable dynamic race planning as route difficulty becomes apparent.
[484] Intrinsic-Metric Physics-Informed Neural Networks (IM-PINN) for Reaction-Diffusion Dynamics on Complex Riemannian Manifolds
Julian Evan Chrisnanto, Salsabila Rahma Alia, Nurfauzi Fadillah, Yulison Herry Chrisnanto
Main category: cs.LG
TL;DR: IM-PINN: A mesh-free neural network framework that solves PDEs on complex curved surfaces by embedding Riemannian metrics into automatic differentiation, enabling accurate simulation of reaction-diffusion patterns without geometric discretization constraints.
Details
Motivation: Traditional methods for simulating reaction-diffusion dynamics on complex non-Euclidean manifolds face challenges with high-fidelity mesh generation costs and symplectic drift in time-stepping schemes, limiting their ability to handle extreme curvature fluctuations and anisotropic pattern formation.Method: Intrinsic-Metric Physics-Informed Neural Network (IM-PINN) embeds the Riemannian metric tensor directly into the automatic differentiation graph to analytically reconstruct the Laplace-Beltrami operator. Uses dual-stream architecture with Fourier feature embeddings to mitigate spectral bias, operating directly in continuous parametric domain without mesh discretization.
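The core trick, assembling the Laplace-Beltrami operator from the metric of the parametrization inside the autodiff graph, can be sketched in PyTorch on a unit sphere, where the answer is known analytically (Δz = -2z on the sphere). The nested-autograd assembly below is our illustration, not the paper's implementation:

```python
import torch

def embed(p):                 # parametrization of the unit sphere, (u, v) -> R^3
    u, v = p[0], p[1]
    return torch.stack([torch.sin(u) * torch.cos(v),
                        torch.sin(u) * torch.sin(v),
                        torch.cos(u)])

def f(p):                     # scalar field on the surface: f = z = cos(u)
    return torch.cos(p[0])

def laplace_beltrami(f, embed, p):
    """Delta_g f = |g|^{-1/2} d_i( |g|^{1/2} g^{ij} d_j f ), with the metric
    g = J^T J pulled from the parametrization by automatic differentiation."""
    def flux(p):
        J = torch.autograd.functional.jacobian(embed, p, create_graph=True)
        g = J.T @ J
        ginv = torch.linalg.inv(g)
        sqrtg = torch.sqrt(torch.linalg.det(g))
        (df,) = torch.autograd.grad(f(p), p, create_graph=True)
        return sqrtg * (ginv @ df)            # |g|^{1/2} g^{ij} d_j f

    J = torch.autograd.functional.jacobian(embed, p)
    sqrtg = torch.sqrt(torch.linalg.det(J.T @ J))
    div = torch.autograd.functional.jacobian(flux, p, create_graph=True)
    return torch.einsum('ii->', div) / sqrtg  # divergence: trace of d_j flux_i

p = torch.tensor([1.0, 0.5], requires_grad=True)   # a point (u, v) off the poles
print(laplace_beltrami(f, embed, p))               # analytic: -2 cos(1) = -1.0806
```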
Result: Successfully simulates Gray-Scott model on “Stochastic Cloth” manifold with extreme curvature fluctuations (K ∈ [-2489, 3580]), recovering “splitting spot” and “labyrinthine” regimes. Outperforms Surface Finite Element Method with global mass conservation error of 0.157 vs 0.258, eliminating mass drift inherent in semi-implicit integration.
Conclusion: IM-PINN provides a memory-efficient, resolution-independent paradigm for simulating biological pattern formation on evolving surfaces, bridging differential geometry with physics-informed machine learning while maintaining thermodynamic consistency and eliminating discretization artifacts.
Abstract: Simulating nonlinear reaction-diffusion dynamics on complex, non-Euclidean manifolds remains a fundamental challenge in computational morphogenesis, constrained by high-fidelity mesh generation costs and symplectic drift in discrete time-stepping schemes. This study introduces the Intrinsic-Metric Physics-Informed Neural Network (IM-PINN), a mesh-free geometric deep learning framework that solves partial differential equations directly in the continuous parametric domain. By embedding the Riemannian metric tensor into the automatic differentiation graph, our architecture analytically reconstructs the Laplace-Beltrami operator, decoupling solution complexity from geometric discretization. We validate the framework on a “Stochastic Cloth” manifold with extreme Gaussian curvature fluctuations ($K \in [-2489, 3580]$), where traditional adaptive refinement fails to resolve anisotropic Turing instabilities. Using a dual-stream architecture with Fourier feature embeddings to mitigate spectral bias, the IM-PINN recovers the “splitting spot” and “labyrinthine” regimes of the Gray-Scott model. Benchmarking against the Surface Finite Element Method (SFEM) reveals superior physical rigor: the IM-PINN achieves global mass conservation error of $\mathcal{E}_{mass} \approx 0.157$ versus SFEM’s $0.258$, acting as a thermodynamically consistent global solver that eliminates mass drift inherent in semi-implicit integration. The framework offers a memory-efficient, resolution-independent paradigm for simulating biological pattern formation on evolving surfaces, bridging differential geometry and physics-informed machine learning.
[485] Attention Needs to Focus: A Unified Perspective on Attention Allocation
Zichuan Fu, Wentao Song, Guojing Li, Yejing Wang, Xian Wu, Yimin Deng, Hanyu Yan, Yefeng Zheng, Xiangyu Zhao
Main category: cs.LG
TL;DR: Lazy Attention addresses attention overload and underload issues in Transformers through positional discrimination and Elastic-Softmax, achieving competitive performance with high sparsity.
Details
Motivation: Standard attention mechanisms suffer from representational collapse and attention sink problems, which prior work addresses in isolation without recognizing their common root in improper attention allocation.Method: Proposes Lazy Attention with two components: 1) positional discrimination across heads and dimensions to sharpen token distinctions (mitigates overload), and 2) Elastic-Softmax normalization that relaxes softmax constraints to suppress attention on irrelevant tokens (mitigates underload).
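The exact form of Elastic-Softmax is not spelled out in this summary; one plausible relaxation adds a free slack term to the softmax denominator so the total attention mass can fall below 1 when nothing is relevant (assumed form, for illustration only):

```python
import numpy as np

def elastic_softmax(logits, slack=1.0):
    """Assumed relaxation (not necessarily the paper's): a slack term in
    the denominator lets total attention mass drop below 1 when no key is
    relevant, instead of forcing a full probability distribution."""
    m = logits.max(-1, keepdims=True)
    e = np.exp(logits - m)
    return e / (e.sum(-1, keepdims=True) + slack * np.exp(-m))

relevant   = np.array([4.0, 0.1, 0.2])
irrelevant = np.array([-3.0, -3.1, -2.9])
print(elastic_softmax(relevant).sum())    # ~1.0: normal, focused attention
print(elastic_softmax(irrelevant).sum())  # << 1: no forced spurious focus
```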
Result: Experiments on FineWeb-Edu corpus across nine benchmarks show Lazy Attention successfully mitigates attention sink, achieves competitive performance compared to standard attention and modern architectures, and reaches up to 59.58% attention sparsity.
Conclusion: Both representational collapse and attention sink stem from improper attention allocation, and Lazy Attention provides a unified solution that addresses both failure modes while maintaining performance and achieving high sparsity.
Abstract: The Transformer architecture, a cornerstone of modern Large Language Models (LLMs), has achieved extraordinary success in sequence modeling, primarily due to its attention mechanism. However, despite its power, the standard attention mechanism is plagued by well-documented issues: representational collapse and attention sink. Although prior work has proposed approaches for these issues, they are often studied in isolation, obscuring their deeper connection. In this paper, we present a unified perspective, arguing that both can be traced to a common root – improper attention allocation. We identify two failure modes: 1) Attention Overload, where tokens receive comparable high weights, blurring semantic features that lead to representational collapse; 2) Attention Underload, where no token is semantically relevant, yet attention is still forced to distribute, resulting in spurious focus such as attention sink. Building on this insight, we introduce Lazy Attention, a novel mechanism designed for a more focused attention distribution. To mitigate overload, it employs positional discrimination across both heads and dimensions to sharpen token distinctions. To counteract underload, it incorporates Elastic-Softmax, a modified normalization function that relaxes the standard softmax constraint to suppress attention on irrelevant tokens. Experiments on the FineWeb-Edu corpus, evaluated across nine diverse benchmarks, demonstrate that Lazy Attention successfully mitigates attention sink and achieves competitive performance compared to both standard attention and modern architectures, while reaching up to 59.58% attention sparsity.
[486] Wittgenstein’s Family Resemblance Clustering Algorithm
Golbahar Amanpour, Benyamin Ghojogh
Main category: cs.LG
TL;DR: The paper introduces a novel clustering algorithm called Wittgenstein’s Family Resemblance (WFR) that applies Wittgenstein’s philosophical concept of family resemblance to machine learning, creating a graph-based clustering method that doesn’t require knowing the number of clusters or their shapes beforehand.
Details
Motivation: The motivation is to bridge analytic philosophy and machine learning by applying Wittgenstein's concept of family resemblance to clustering problems. Traditional clustering often requires knowing the number of clusters or making assumptions about cluster shapes, which can be limiting. The philosophical insight that categories are defined by overlapping similarities rather than rigid definitions offers a more flexible approach to clustering.Method: The method develops the Wittgenstein’s Family Resemblance (WFR) clustering algorithm and its kernel variant (kernel WFR). It computes resemblance scores between neighboring data instances, thresholds these scores to construct a resemblance graph, and then identifies connected components of this graph as clusters. This graph-based approach naturally implements the philosophical concept of family resemblance through overlapping similarity chains.
Result: Simulations on benchmark datasets demonstrate that WFR is an effective nonlinear clustering algorithm. The algorithm successfully clusters data without requiring prior knowledge of the number of clusters or assumptions about their shapes, validating the application of philosophical concepts to machine learning problems.
Conclusion: The paper successfully bridges analytic philosophy and machine learning by developing a practical clustering algorithm based on Wittgenstein’s family resemblance concept. The WFR algorithm provides a flexible, assumption-free approach to clustering that aligns with philosophical insights about how categories are naturally formed through overlapping similarities rather than rigid definitions.
Abstract: This paper, introducing a novel method in philomatics, draws on Wittgenstein’s concept of family resemblance from analytic philosophy to develop a clustering algorithm for machine learning. According to Wittgenstein’s Philosophical Investigations (1953), family resemblance holds that members of a concept or category are connected by overlapping similarities rather than a single defining property. Consequently, a family of entities forms a chain of items sharing overlapping traits. This philosophical idea naturally lends itself to a graph-based approach in machine learning. Accordingly, we propose the Wittgenstein’s Family Resemblance (WFR) clustering algorithm and its kernel variant, kernel WFR. This algorithm computes resemblance scores between neighboring data instances, and after thresholding these scores, a resemblance graph is constructed. The connected components of this graph define the resulting clusters. Simulations on benchmark datasets demonstrate that WFR is an effective nonlinear clustering algorithm that does not require prior knowledge of the number of clusters or assumptions about their shapes.
[487] Towards a Principled Muon under $μ\mathsf{P}$: Ensuring Spectral Conditions throughout Training
John Zhao
Main category: cs.LG
TL;DR: Muon++ ensures μ-parameterization spectral conditions for matrix-based optimizers throughout LLM training by controlling optimizer updates instead of weights, eliminating costly spectral normalization.
Details
Motivation: Existing methods for ensuring μP spectral conditions with matrix-based optimizers like Muon either fail to guarantee conditions throughout training or require computationally expensive repeated spectral normalization of both weights and updates.Method: Developed Muon++ variant that maintains spectral control at optimizer update level rather than weight level, eliminating need for explicit weight spectral normalization. Also introduced adaptive spectral condition incorporating data-dependent effects.
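Update-level spectral control can be illustrated with a power-iteration rescaling (a sketch; Muon++'s actual mechanism is not reproduced here, and the √(d_out/d_in) target below is the commonly cited μP spectral scaling, stated as an assumption):

```python
import torch

def spectral_scale(update, target):
    """Rescale an optimizer update so its spectral norm hits `target`
    (power iteration on U^T U estimates the leading singular value)."""
    u = torch.randn(update.shape[1])
    for _ in range(20):
        u = update.T @ (update @ u)
        u = u / (u.norm() + 1e-12)
    sigma = (update @ u).norm()            # leading singular value estimate
    return update * (target / (sigma + 1e-12))

# Hypothetical muP-style target for a d_out x d_in hidden weight:
d_out, d_in = 512, 256
W_update = torch.randn(d_out, d_in) * 0.01
scaled = spectral_scale(W_update, target=(d_out / d_in) ** 0.5)
```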
Result: Muon++ reliably guarantees μP spectral conditions throughout entire training process, bridging gap between μP theory and practical deployment of matrix-based optimizers for long-horizon LLM training.
Conclusion: The work enables practical μP-compatible training with matrix-based optimizers by showing spectral control at update level is sufficient for moderately large models, with adaptive spectral conditions better suited for long-horizon training.
Abstract: The $μ$-parameterization ($μ$P) provides a principled foundation for large language model (LLM) training by prescribing width-independent learning dynamics, which in turn enables predictable scaling behavior and robust hyperparameter transfer across model sizes. A central requirement of $μ$P is the satisfaction of certain spectral conditions on weight matrices, which ensure consistent feature learning and optimization behavior as model width grows. While these conditions are well understood in theory, guaranteeing their validity in practical training for matrix-based optimizers such as Muon remains understudied. Existing works that study Muon under $μ$P exhibit important limitations: they either do not ensure that the spectral conditions hold throughout the entire training horizon, or require repeated spectral normalization (or Newton-Schulz iterations) applied to both weights and updates, leading to significant computational overhead and reduced practicality. In this work, we show how to reliably guarantee the spectral conditions required by $μ$P for Muon during the entire training process. Our key insight is that for moderately large models, maintaining spectral control at the level of optimizer updates alone is sufficient to preserve $μ$P-compatible scaling, eliminating the need for explicit spectral normalization of the weights. Based on this principle, we develop a variant of Muon, namely Muon++, that satisfies the spectral conditions throughout the training process. Our results bridge the gap between the theoretical promises of $μ$P and the practical deployment of matrix-based optimizers in long-horizon training. We also take the first step towards an adaptive spectral condition by incorporating data-dependent effects, making it better suited for long-horizon LLM training.
[488] A Differentiable Adversarial Framework for Task-Aware Data Subsampling
Jiacheng Lyu, Bihua Bao
Main category: cs.LG
TL;DR: ASSS is a differentiable, task-aware data subsampling framework that uses adversarial learning between selector and task networks to assign importance weights to samples, enabling intelligent data reduction while maintaining or improving model performance.
Details
Motivation: Traditional data subsampling methods are static and task-independent, often discarding critical information for downstream prediction. The computational challenge of large-scale datasets requires more intelligent, task-aware data reduction approaches.Method: ASSS uses an adversarial game between selector network and task network. The selector network assigns continuous importance weights to samples via Gumbel-Softmax relaxation, directly optimizing to identify and retain samples with maximum information for specific tasks while balancing fidelity and sparsity.
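A compressed sketch of the selector idea (the paper frames selector and task networks adversarially; this toy collapses them into one joint objective, and all sizes and the λ weight are placeholders):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Selector(nn.Module):
    """Scores each sample; Gumbel-Softmax keeps selection differentiable."""
    def __init__(self, d):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(d, 32), nn.ReLU(), nn.Linear(32, 2))
    def forward(self, x, tau=0.5):
        logits = self.score(x)                     # (N, 2): drop vs keep
        return F.gumbel_softmax(logits, tau=tau, hard=False)[:, 1]

d = 10
task = nn.Linear(d, 1)
sel = Selector(d)
opt = torch.optim.Adam(list(task.parameters()) + list(sel.parameters()), lr=1e-3)

x = torch.randn(1024, d)
y = x @ torch.randn(d, 1) + 0.1 * torch.randn(1024, 1)
lam = 0.05
for _ in range(500):
    w = sel(x)                                     # soft importance weights
    err = (task(x) - y).squeeze(1) ** 2
    fit = (w * err).sum() / (w.sum() + 1e-8)       # fidelity on selected data
    loss = fit + lam * w.mean()                    # sparsity pressure on w
    opt.zero_grad(); loss.backward(); opt.step()
```

After training, thresholding `w` yields the retained subsample; the fidelity/sparsity trade-off is set by λ.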
Result: Experiments on four large-scale real-world datasets show ASSS consistently outperforms heuristic baselines (clustering, nearest neighbor thinning) in maintaining model performance. Notably, ASSS sometimes exceeds training performance of using the entire dataset, demonstrating intelligent denoising effects.
Conclusion: ASSS establishes task-aware data subsampling as a learnable component, providing a principled solution for effective large-scale data learning through differentiable end-to-end optimization that connects with information bottleneck principles.
Abstract: The proliferation of large-scale datasets poses a major computational challenge to model training. Traditional data subsampling works as a static, task-independent preprocessing step that often discards information critical to downstream prediction. In this paper, we introduce the antagonistic soft selection subsampling (ASSS) framework as a novel paradigm that recasts data reduction as a differentiable end-to-end learning problem. ASSS sets up an adversarial game between a selector network and a task network, in which the selector learns to assign continuous importance weights to samples. This direct optimization, implemented via Gumbel-Softmax relaxation, allows the selector to identify and retain the most informative samples for a specific task target, guided by a loss function that balances prediction fidelity and sparsity. Theoretical analysis links this framework to the information bottleneck principle. Comprehensive experiments on four large-scale real-world datasets show that ASSS consistently outperforms heuristic subsampling baselines such as clustering and nearest neighbor thinning in maintaining model performance. Notably, ASSS can not only match but sometimes exceed the training performance of the entire dataset, demonstrating an intelligent denoising effect. This work establishes task-aware data subsampling as a learnable component, providing a principled solution for effective large-scale data learning.
[489] ELLA: Efficient Lifelong Learning for Adapters in Large Language Models
Shristi Das Biswas, Yue Zhang, Anwesan Pal, Radhika Bhargava, Kaushik Roy
Main category: cs.LG
TL;DR: ELLA is a continual learning framework for LLMs that uses selective subspace de-correlation to prevent catastrophic forgetting without replay or architectural expansion, achieving SOTA performance with minimal memory overhead.
Details
Motivation: LLMs suffer severe catastrophic forgetting when adapted sequentially to new tasks. Existing approaches are limited: replay-based methods are impractical and violate privacy, while strict orthogonality methods reduce degrees of freedom and eliminate forward transfer by forbidding overlap in shared representations.Method: ELLA uses selective subspace de-correlation - it characterizes past update structures and penalizes alignments along high-energy, task-specific directions while preserving freedom in low-energy residual subspaces. This is implemented via a lightweight regularizer on an aggregated update matrix, corresponding to an anisotropic shrinkage operator that bounds interference.
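The regularizer can be sketched as an energy-weighted projection penalty onto the top singular directions of the aggregated past update (illustrative; `subspace_decorrelation_penalty` is a hypothetical name and the paper's exact operator and weighting may differ):

```python
import torch

def subspace_decorrelation_penalty(delta_w, past_update, k=8):
    """Penalize alignment of a new update with the high-energy directions
    of the aggregated past update; low-energy residual directions remain
    free, which is what preserves forward transfer."""
    U, S, _ = torch.linalg.svd(past_update, full_matrices=False)
    U_k = U[:, :k]                       # top-k task-specific directions
    w = S[:k] / S[:k].sum()              # weight by spectral energy
    proj = U_k.T @ delta_w               # components along past directions
    return (w[:, None] * proj ** 2).sum()

# usage sketch: loss = task_loss + lam * subspace_decorrelation_penalty(dW, agg)
```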
Result: Achieves state-of-the-art CL performance on three popular benchmarks with relative accuracy gains up to 9.6% and 35× smaller memory footprint. Scales robustly across architectures and enhances zero-shot generalization on unseen tasks.
Conclusion: ELLA provides a principled and scalable solution for constructive lifelong LLM adaptation without data replay, architectural expansion, or significant storage requirements, enabling effective continual learning while preserving transfer capabilities.
Abstract: Large Language Models (LLMs) suffer severe catastrophic forgetting when adapted sequentially to new tasks in a continual learning (CL) setting. Existing approaches are fundamentally limited: replay-based methods are impractical and privacy-violating, while strict orthogonality-based methods collapse under scale: each new task is projected onto an orthogonal complement, progressively reducing the residual degrees of freedom and eliminating forward transfer by forbidding overlap in shared representations. In this work, we introduce ELLA, a training framework built on the principle of selective subspace de-correlation. Rather than forbidding all overlap, ELLA explicitly characterizes the structure of past updates and penalizes alignments along their high-energy, task-specific directions, while preserving freedom in the low-energy residual subspaces to enable transfer. Formally, this is realized via a lightweight regularizer on a single aggregated update matrix. We prove this mechanism corresponds to an anisotropic shrinkage operator that bounds interference, yielding a penalty that is both memory- and compute-constant regardless of task sequence length. ELLA requires no data replay, no architectural expansion, and negligible storage. Empirically, it achieves state-of-the-art CL performance on three popular benchmarks, with relative accuracy gains of up to $9.6\%$ and a $35\times$ smaller memory footprint. Further, ELLA scales robustly across architectures and actively enhances the model’s zero-shot generalization performance on unseen tasks, establishing a principled and scalable solution for constructive lifelong LLM adaptation.
cs.MA
[490] PC2P: Multi-Agent Path Finding via Personalized-Enhanced Communication and Crowd Perception
Guotao Li, Shaoyun Xu, Yuexing Hao, Yang Wang, Yuhui Sun
Main category: cs.MA
TL;DR: PC2P is a distributed MAPF method using Q-learning MARL with personalized communication, local crowd perception, and region-based deadlock breaking for improved scalability in diverse environments.
Details
Motivation: Existing distributed MAPF methods integrated with MARL have insufficient collaborative and perceptual capabilities, making them inadequate for scaling across diverse environmental conditions. There's a need for better communication mechanisms and perception to handle partially observable environments effectively.Method: PC2P uses a Q-learning-based MARL framework with three key components: 1) Personalized-enhanced communication mechanism based on dynamic graph topology with three-stage operations (selection, generation, aggregation), 2) Local crowd perception to enrich heuristic observation by integrating static spatial constraints and dynamic occupancy changes, 3) Region-based deadlock-breaking strategy using expert guidance for coordination in confined areas.
Result: PC2P achieves superior performance compared to state-of-the-art distributed MAPF methods in varied environments. Ablation studies confirm the effectiveness of each module for overall performance.
Conclusion: PC2P successfully addresses scalability challenges in distributed MAPF through enhanced communication, improved perception, and effective deadlock resolution, demonstrating robust performance across diverse environmental conditions.
Abstract: Distributed Multi-Agent Path Finding (MAPF) integrated with Multi-Agent Reinforcement Learning (MARL) has emerged as a prominent research focus, enabling real-time cooperative decision-making in partially observable environments through inter-agent communication. However, due to insufficient collaborative and perceptual capabilities, existing methods are inadequate for scaling across diverse environmental conditions. To address these challenges, we propose PC2P, a novel distributed MAPF method derived from a Q-learning-based MARL framework. Initially, we introduce a personalized-enhanced communication mechanism based on dynamic graph topology, which ascertains the core aspects of “who” and “what” in the interactive process through three-stage operations: selection, generation, and aggregation. Concurrently, we incorporate local crowd perception to enrich agents’ heuristic observation, thereby strengthening the model’s guidance for effective actions via the integration of static spatial constraints and dynamic occupancy changes. To resolve extreme deadlock issues, we propose a region-based deadlock-breaking strategy that leverages expert guidance to implement efficient coordination within confined areas. Experimental results demonstrate that PC2P achieves superior performance compared to state-of-the-art distributed MAPF methods in varied environments. Ablation studies further confirm the effectiveness of each module for overall performance.
[491] LLM-Enabled Multi-Agent Systems: Empirical Evaluation and Insights into Emerging Design Patterns & Paradigms
Harri Renney, Maxim N Nethercott, Nathan Renney, Peter Hayes
Main category: cs.MA
TL;DR: This paper analyzes design patterns for LLM-enabled multi-agent systems, testing them in three real-world case studies with promising development speed but highlighting reliability and scalability challenges.
Details
Motivation: To formalize emerging design patterns for LLM-enabled multi-agent systems and evaluate their practical utility across different domains, addressing the need for modular, domain-adaptive solutions.Method: Defined key architectural components (agent orchestration, communication mechanisms, control-flow strategies) and tested three real-world case studies in controlled, containerized pilots in telecommunications security, national heritage asset management, and utilities customer service automation.
Result: Prototypes were delivered within two weeks and pilot-ready solutions within one month, showing reduced development overhead and improved user accessibility compared to conventional approaches. However, limitations include LLM behavior variability that challenges production maturity.
Conclusion: The paper outlines critical research directions for improving reliability, scalability, and governance in MAS architectures, emphasizing the need to mature MAS design patterns to mitigate inherent challenges in transitioning from prototype to production.
Abstract: This paper formalises the literature on emerging design patterns and paradigms for Large Language Model (LLM)-enabled multi-agent systems (MAS), evaluating their practical utility across various domains. We define key architectural components, including agent orchestration, communication mechanisms, and control-flow strategies, and demonstrate how these enable rapid development of modular, domain-adaptive solutions. Three real-world case studies are tested in controlled, containerised pilots in telecommunications security, national heritage asset management, and utilities customer service automation. Initial empirical results show that, for these case studies, prototypes were delivered within two weeks and pilot-ready solutions within one month, suggesting reduced development overhead compared to conventional approaches and improved user accessibility. However, findings also reinforce limitations documented in the literature, including variability in LLM behaviour that leads to challenges in transitioning from prototype to production maturity. We conclude by outlining critical research directions for improving reliability, scalability, and governance in MAS architectures and the further work needed to mature MAS design patterns to mitigate the inherent challenges.
[492] A Chromatographic Process Design and Optimization Platform Powered by Large Language Models: A Case Application on Extract of Ginkgo Biloba Leaf
Zhilong Tang, Shaohua Wu, Xinyan Zhao, Yu Wang, Xingchu Gong
Main category: cs.MA
TL;DR: ChromR is an LLM-driven platform that automates chromatographic process development using a domain-specific language model, multi-agent system, and automated experimental device to reduce expert dependency and development time.
Details
Motivation: Traditional chromatographic process development is human-dependent, relying heavily on expert experience, resulting in long development cycles and labor-intensive processes that need automation and efficiency improvements.Method: Developed ChromR platform integrating: 1) ChromLLM (domain-specific LLM for chromatography), 2) Multi-agent system with four agents (domain knowledge answering, experimental design, experimental execution, data analysis), and 3) Automated chromatographic experimental device for end-to-end workflow automation.
Result: Successfully applied to Ginkgo biloba leaf extract purification case study, developing a chromatographic process in one week (vs. ~7 weeks conventional), meeting multiple objectives including fraction quality and production efficiency, reducing development time to approximately one-seventh.
Conclusion: Established an intelligent, automated, and universally applicable new paradigm for chromatographic process development that effectively reduces expert dependency while significantly decreasing labor input and development time.
Abstract: Chromatographic separation technology has been widely applied in pharmaceutical, chemical, and food industries due to its high efficiency. However, traditional human-dependent chromatographic process development faces challenges such as reliance on expert experience, long development cycles, and labor intensity. ChromR, a large language model (LLM)-driven platform for chromatographic process design and optimization, is presented in this work. The platform integrates ChromLLM, a domain-specific LLM trained for chromatography, along with a multi-agent system and an automated chromatographic experimental device. The multi-agent system comprises four agents: domain knowledge answering, experimental design, experimental execution, and data analysis. ChromR enables automatic completion of the entire workflow, including initial process parameter recommendation, experimental design, automated execution, data analysis, and multi-objective optimization. By utilizing ChromR, dependency on expert knowledge is effectively reduced, while labor input and development time are significantly decreased. Chromatographic purification of the extract of Ginkgo biloba leaf (EGBL) was selected as a case study. ChromR successfully developed a chromatographic process within one week that meets multiple objectives, including fraction quality and production efficiency, reducing development time to approximately one-seventh of that required by the conventional paradigm. An intelligent, automated, and universally applicable new paradigm was established for chromatographic process development.
[493] When Numbers Start Talking: Implicit Numerical Coordination Among LLM-Based Agents
Alessio Buscemi, Daniele Proverbio, Alessandro Di Stefano, The Anh Han, German Castignani, Pietro Liò
Main category: cs.MA
TL;DR: This paper studies covert communication in LLM-driven multi-agent systems using game theory, examining how agents coordinate through indirect signals rather than explicit messages.
Details
Motivation: While existing research focuses on individual LLM agents or explicit communication, little is known about how interacting agents coordinate implicitly through covert communication using indirect or non-linguistic signals embedded in their actions.Method: The paper uses a game-theoretic approach, analyzing interactions across four canonical game-theoretic settings under different communication regimes (explicit, restricted, absent). It considers heterogeneous agent personalities and both one-shot and repeated games to study when covert signals emerge.
Result: The research characterizes when covert signals emerge in LLM-driven multi-agent systems and how these signals shape coordination and strategic outcomes across different game settings and communication regimes.
Conclusion: The study provides insights into implicit coordination mechanisms in multi-agent LLM systems, revealing how covert communication through indirect signals affects strategic interactions and coordination outcomes in various game-theoretic contexts.
Abstract: LLMs-based agents increasingly operate in multi-agent environments where strategic interaction and coordination are required. While existing work has largely focused on individual agents or on interacting agents sharing explicit communication, less is known about how interacting agents coordinate implicitly. In particular, agents may engage in covert communication, relying on indirect or non-linguistic signals embedded in their actions rather than on explicit messages. This paper presents a game-theoretic study of covert communication in LLM-driven multi-agent systems. We analyse interactions across four canonical game-theoretic settings under different communication regimes, including explicit, restricted, and absent communication. Considering heterogeneous agent personalities and both one-shot and repeated games, we characterise when covert signals emerge and how they shape coordination and strategic outcomes.
[494] Computing Universal Plans for Partially Observable Multi-Agent Routing Using Answer Set Programming
Fengming Zhu, Fangzhen Lin
Main category: cs.MA
TL;DR: The paper proposes using universal plans (policies) instead of classical planning for multi-agent routing problems, implementing an ASP-based system to compute collision-free policies for partially observable agents in 2D environments.
Details
Motivation: Multi-agent routing problems have wide industrial applications (logistics, service robots). Classical planning approaches may be insufficient when agents are autonomous and face unforeseen situations, making universal planning (policies) more appropriate.Method: The system uses Answer Set Programming (ASP) to translate 2D maps and agent goal profiles into logic programs, computing feasible universal plans (policies) that map agent observations to actions while ensuring collision avoidance.
Result: Experiments reveal insights about which goal profiles and environments yield feasible policies, how feasibility depends on agent sensors, and how action preferences can be customized to compute efficient (near-optimal) policies.
Conclusion: Universal planning with ASP is effective for multi-agent routing with autonomous agents, providing flexible policy computation that handles partial observability and collision avoidance while allowing optimization through preference customization.
Abstract: Multi-agent routing problems have gained significant attention recently due to their wide range of industrial applications, ranging from logistics warehouse automation to indoor service robots. Conventionally, they are modeled as classical planning problems. In this paper, we argue that it can be beneficial to formulate them as universal planning problems, particularly when the agents are autonomous entities and may encounter unforeseen situations. We therefore propose universal plans, also known as policies, as the solution concept, and implement a system based on Answer Set Programming (ASP) to compute them. Given an arbitrary two-dimensional map and a profile of goals for a group of partially observable agents, the system translates the problem configuration into logic programs and finds a feasible universal plan for each agent, mapping its observations to actions while ensuring that there are no collisions with other agents. We use the system to conduct experiments and obtain findings regarding the types of goal profiles and environments that lead to feasible policies, as well as how feasibility may depend on the agents’ sensors. We also demonstrate how users can customize action preferences to compute more efficient policies, even (near-)optimal ones. The code is available at https://github.com/Fernadoo/MAPF_ASP.
[495] The Combined Problem of Online Task Assignment and Lifelong Path Finding in Logistics Warehouses: Rule-Based Systems Matter
Fengming Zhu, Weijia Xu, Yifei Guo, Fangzhen Lin
Main category: cs.MA
TL;DR: Online integration of task assignment and lifelong path finding for warehouse logistics, achieving 16.23% faster execution than current systems and 40% agent reduction for same throughput.
Details
Motivation: Most existing work either focuses on lifelong path finding with given task assignment or studies offline versions where tasks are known in advance. To maximize system throughput, the online integration of both components needs to be tackled directly.Method: Introduced formal framework for combined problem, designed rule-based lifelong planner that handles severe local congestion, and automated search for task assigner based on underlying path planner.
Result: Simulation in Meituan warehouse scenarios shows: (a) only 83.77% of the execution time of the currently deployed system (a 16.23% reduction), outperforming other SOTA algorithms by 8.09%; (b) the same throughput with only 60% of the current agents (a 40% reduction).
Conclusion: The integrated online approach to task assignment and lifelong path finding significantly improves both time and economic efficiency in warehouse logistics, with practical deployment potential at large-scale platforms like Meituan.
Abstract: We study the combined problem of online task assignment and lifelong path finding, which is crucial for the logistics industries. However, most literature either (1) focuses on lifelong path finding assuming a given task assigner, or (2) studies the offline version of this problem where tasks are known in advance. We argue that, to maximize the system throughput, the online version that integrates these two components should be tackled directly. To this end, we introduce a formal framework of the combined problem and its solution concept. Then, we design a rule-based lifelong planner under a practical robot model that works well even in environments with severe local congestion. Upon that, we automate the search for the task assigner with respect to the underlying path planner. Simulation experiments conducted in warehouse scenarios at Meituan, one of the largest shopping platforms in China, demonstrate that (a) in terms of time efficiency, our system requires only 83.77% of the execution time needed for the currently deployed system at Meituan, outperforming other SOTA algorithms by 8.09%; (b) in terms of economic efficiency, ours can achieve the same throughput with only 60% of the agents currently in use. The code and demos are available at https://github.com/Fernadoo/Online-TAPF.
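The automated search operates over rule-based assigners; as a point of reference, here is the kind of simple greedy rule such a search might refine — a hypothetical nearest-task heuristic, not the assigner found by the paper's search.

```python
from typing import Dict, List, Tuple

Pos = Tuple[int, int]

def manhattan(p: Pos, q: Pos) -> int:
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def assign_tasks(idle_agents: Dict[str, Pos], pending: List[Pos]) -> Dict[str, Pos]:
    """Greedy online rule: each idle agent claims the nearest unclaimed task."""
    assignment: Dict[str, Pos] = {}
    remaining = list(pending)
    for agent, pos in idle_agents.items():
        if not remaining:
            break
        best = min(remaining, key=lambda t: manhattan(pos, t))
        assignment[agent] = best
        remaining.remove(best)
    return assignment

print(assign_tasks({"r1": (0, 0), "r2": (5, 5)}, [(1, 0), (4, 6)]))
# {'r1': (1, 0), 'r2': (4, 6)}
```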
[496] Computational Foundations for Strategic Coopetition: Formalizing Trust and Reputation Dynamics
Vik Pant, Eric Yu
Main category: cs.MA
TL;DR: This paper bridges conceptual modeling (i*) with computational trust models by developing a two-layer trust system for coopetitive multi-stakeholder environments, featuring asymmetric updating where cooperation builds trust gradually but violations erode it sharply.
Details
Motivation: Existing approaches have limitations: conceptual modeling languages like i* represent trust relationships qualitatively but lack computational mechanisms for analyzing trust evolution, while computational trust models from multi-agent systems provide algorithmic updating but lack grounding in conceptual models that capture strategic dependencies and mixed motives of actors in coopetitive relationships.Method: Develops a computational trust model extending game-theoretic foundations for strategic coopetition with dynamic trust evolution. Introduces a two-layer trust system with immediate trust responding to current behavior and reputation tracking violation history. Trust evolves through asymmetric updating where cooperation builds trust gradually while violations erode it sharply, creating hysteresis effects and trust ceilings. Provides a structured translation framework to instantiate computational trust models from i* dependency networks (an illustrative update rule is sketched after the abstract).
Result: Comprehensive experimental validation across 78,125 parameter configurations establishes robust emergence of negativity bias, hysteresis effects, and cumulative damage amplification. Empirical validation using the Renault-Nissan Alliance case study (1999-2025) achieves 49/60 validation points (81.7%), successfully reproducing documented trust evolution across five distinct relationship phases including crisis and recovery periods. Companion work achieved 58/60 validation (96.7%) for logarithmic specifications.
Conclusion: The paper successfully bridges the gap between conceptual modeling and computational trust analysis, providing a practical framework for modeling trust evolution in coopetitive multi-stakeholder environments with validation demonstrating effectiveness in reproducing real-world trust dynamics.
Abstract: Modern socio-technical systems increasingly involve multi-stakeholder environments where actors simultaneously cooperate and compete. These coopetitive relationships exhibit dynamic trust evolution based on observed behavior over repeated interactions. While conceptual modeling languages like i* represent trust relationships qualitatively, they lack computational mechanisms for analyzing how trust changes with behavioral evidence. Conversely, computational trust models from multi-agent systems provide algorithmic updating but lack grounding in conceptual models that capture strategic dependencies covering mixed motives of actors. This technical report bridges this gap by developing a computational trust model that extends game-theoretic foundations for strategic coopetition with dynamic trust evolution. Building on companion work that achieved 58/60 validation (96.7%) for logarithmic specifications, we introduce trust as a two-layer system with immediate trust responding to current behavior and reputation tracking violation history. Trust evolves through asymmetric updating where cooperation builds trust gradually while violations erode it sharply, creating hysteresis effects and trust ceilings that constrain relationship recovery. We develop a structured translation framework enabling practitioners to instantiate computational trust models from i* dependency networks encompassing mixed motives of actors. Comprehensive experimental validation across 78,125 parameter configurations establishes robust emergence of negativity bias, hysteresis effects, and cumulative damage amplification. Empirical validation using the Renault-Nissan Alliance case study (1999-2025) achieves 49/60 validation points (81.7%), successfully reproducing documented trust evolution across five distinct relationship phases including crisis and recovery periods.
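The asymmetric two-layer dynamics can be made concrete with a small update rule. The sketch below is illustrative: the gain/loss constants, ceiling form, and decay are assumptions, not the paper's fitted specification (the companion work validates logarithmic forms).

```python
def update_trust(trust: float, reputation: float, cooperated: bool,
                 gain: float = 0.05, loss: float = 0.30,
                 rep_decay: float = 0.98) -> tuple[float, float]:
    """One interaction step. trust = immediate layer in [0, 1];
    reputation = violation-history layer in [0, 1]. All constants assumed."""
    if cooperated:
        # cooperation builds trust gradually, but past violations impose
        # a ceiling that constrains recovery (hysteresis)
        ceiling = 1.0 - 0.5 * reputation
        trust = min(ceiling, trust + gain * (1.0 - trust))
    else:
        # violations erode trust sharply (negativity bias) and accumulate
        trust = max(0.0, trust - loss)
        reputation = min(1.0, reputation + 0.2)
    return trust, reputation * rep_decay  # violation memory fades slowly
```

Running repeated cooperate/violate sequences through such a rule reproduces, qualitatively, the effects the paper validates at scale: sharp drops, slow constrained recovery, and cumulative damage amplification.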
[497] FinPos: A Position-Aware Trading Agent System for Real Financial Markets
Bijia Liu, Ronghao Dang
Main category: cs.MA
TL;DR: FinPos is a position-aware trading agent system that enhances continuous position management in financial markets using LLMs, outperforming existing methods in realistic trading simulations.
Details
Motivation: Existing LLM-based trading agents make decisions as isolated directional actions without continuous position-management awareness, lacking realism for actual market conditions.Method: FinPos uses three key mechanisms: 1) professional interpretation of heterogeneous market data, 2) dual-agent structure separating directional reasoning from risk-aware position adjustment, and 3) multi-timescale reward signals for experiential learning (a stand-in for the risk-aware adjustment is sketched after the abstract).
Result: FinPos surpasses state-of-the-art trading agents in position-aware trading tasks that closely mirror real market conditions.
Conclusion: LLM-centered agent systems have significant unexplored potential in long-term market decision-making, with position awareness being crucial for realistic trading applications.
Abstract: The exceptional potential of large language models (LLMs) in handling text information has garnered significant attention in the field of financial trading. However, most existing trading agents operate under intraday, independent unit-based trading tasks, where decisions are made as isolated directional actions, and thus lack awareness of continuous position management. Therefore, we propose a position-aware trading task designed to simulate a more realistic market. To address this task, we propose FinPos, a position-aware trading agent system designed to explicitly model and manage continuous positions. FinPos enhances position awareness through three key mechanisms: (1) professional-level interpretation of heterogeneous market information; (2) a dual-agent decision structure that separates directional reasoning from risk-aware position adjustment; and (3) multi-timescale reward signals, allowing the agent to internalize position awareness through experiential feedback rather than static instructions alone. Extensive experiments demonstrate that FinPos surpasses state-of-the-art trading agents in the position-aware trading task, which closely mirrors real market conditions. More importantly, our findings reveal that LLM-centered agent systems exhibit a vast, largely unexplored potential in long-term market decision-making.
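To make the dual-agent split concrete: one stage scores direction, a second converts that score into a bounded position change. The sketch below is a hypothetical stand-in for the risk-aware adjustment stage, not FinPos's LLM-based agents.

```python
def adjust_position(direction_score: float, current_pos: float,
                    risk_budget: float, max_pos: float = 1.0) -> float:
    """direction_score in [-1, 1] from the directional-reasoning stage;
    the risk-aware stage caps how fast the position may move toward it."""
    target = max_pos * max(-1.0, min(1.0, direction_score))
    delta = target - current_pos
    delta = max(-risk_budget, min(risk_budget, delta))  # per-step risk cap
    return current_pos + delta

pos = 0.0
for score in [0.9, 0.8, -0.5]:           # successive directional signals
    pos = adjust_position(score, pos, risk_budget=0.25)
    print(round(pos, 2))                  # 0.25, 0.5, 0.25
```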
cs.MM
[498] Transforming Video Subjective Testing with Training, Engagement, and Real-Time Feedback
Kumar Rahul, Sriram Sethuraman, Andrew Segall, Yixu Chen
Main category: cs.MM
TL;DR: Proposed framework improves subjective video quality assessment through automated rater training, real-time attention scoring, and an efficient chain-based pairwise-comparison procedure that recovers reliable quality scores from fewer comparisons.
Details
Motivation: Traditional subjective video quality assessment protocols have limitations in capturing nuanced perceptual differences and ensuring reliable user input. Better methods are needed to improve rater training, maintain attention, and reduce the number of comparisons needed.Method: Three-phase approach: 1) Automated training quiz to teach quality indicators and verify readiness, 2) Real-time attention scoring using “golden” video pairs with penalties for lapses, 3) Efficient chain-based pairwise comparisons yielding scores in Just-Objectionable-Difference (JOD) units (a pairwise-scaling sketch follows the abstract).
Result: Experiments with 80 participants across three groups showed training-quiz significantly improves data quality (golden unit accuracy, reduced tie rate). Real-time feedback further improves data quality and yields most monotonic quality ratings. Reduces non-monotonic cases on high-quality part of R-Q curve.
Conclusion: The integrated framework of training, quiz, and testing with feedback improves the reliability of subjective video quality assessment, reduces the comparison count, and helps train better objective video quality metrics by reducing non-monotonic ratings on the high-quality part of the R-Q curve, where viewers typically prefer slightly compressed, less-grainy content.
Abstract: Subjective video quality assessment is crucial for optimizing streaming and compression, yet traditional protocols face limitations in capturing nuanced perceptual differences and ensuring reliable user input. We propose an integrated framework that enhances rater training, enforces attention through real-time scoring, and streamlines pairwise comparisons to recover quality scores with fewer comparisons. Participants first undergo an automated training quiz to learn key video quality indicators (e.g., compression artifacts) and verify their readiness. During the test, a real-time attention scoring mechanism, using “golden” video pairs, monitors and reinforces rater focus by applying penalties for lapses. An efficient chain-based pairwise comparison procedure is then employed, yielding quality scores in Just-Objectionable-Differences (JOD) units. Experiments comparing three groups (no training, training without feedback, and training with feedback) with 80 participants demonstrate that the training quiz significantly improves data quality in terms of golden-unit accuracy and reduces the tie rate, while real-time feedback further improves data quality and yields the most monotonic quality ratings. The new three-phase approach (training, quiz, and testing with feedback) can significantly reduce the non-monotonic cases on the high-quality part of the R-Q curve, where normal viewers typically prefer the slightly compressed, less-grainy content, and help train a better objective video quality metric.
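JOD scaling is Thurstonian; as a self-contained stand-in, the sketch below fits a Bradley-Terry model to a pairwise win matrix with Hunter-style MM updates. It illustrates how chain-based comparisons become a latent quality scale — an analogous method, not the paper's exact JOD pipeline.

```python
import numpy as np

def bradley_terry(wins: np.ndarray, iters: int = 200) -> np.ndarray:
    """wins[i, j] = times condition i was preferred over condition j.
    Returns log-strengths usable as a relative quality scale."""
    n = wins.shape[0]
    p = np.ones(n)
    for _ in range(iters):
        for i in range(n):
            num = wins[i].sum()
            den = sum((wins[i, j] + wins[j, i]) / (p[i] + p[j])
                      for j in range(n) if j != i)
            p[i] = max(num, 1e-9) / max(den, 1e-9)
        p /= np.exp(np.mean(np.log(p)))   # fix the scale's gauge freedom
    return np.log(p)

# three conditions compared only in a chain: 0 vs 1, and 1 vs 2
wins = np.array([[0, 8, 0], [2, 0, 7], [0, 3, 0]], dtype=float)
print(bradley_terry(wins).round(2))       # monotone decreasing quality
```

Condition 0 never meets condition 2 directly, yet the chain still yields a consistent scale — the property that lets a chain-based procedure cut the comparison count.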
eess.AS
[499] Discriminating real and synthetic super-resolved audio samples using embedding-based classifiers
Mikhail Silaev, Konstantinos Drossos, Tuomas Virtanen
Main category: eess.AS
TL;DR: GANs and diffusion models produce high-quality audio super-resolution, but embedding-based classifiers can still perfectly distinguish real from synthetic audio, revealing a gap between perceptual quality and true distributional fidelity.
Details
Motivation: Existing evaluations of audio super-resolution models rely primarily on signal-level or perceptual metrics, leaving open the question of how closely the distributions of synthetic super-resolved and real wideband audio actually match.Method: Analyze separability of real and super-resolved audio in various embedding spaces using linear classifiers trained to distinguish real from synthetic samples based on multiple types of audio embeddings. Test on middle-band (4→16 kHz) and full-band (16→48 kHz) upsampling tasks for speech and music (a minimal probe is sketched after the abstract).
Result: Embedding-based classifiers achieve near-perfect separation between real and synthetic audio, even when generated audio attains high perceptual quality and state-of-the-art metric scores. This behavior is consistent across datasets and models, including recent diffusion-based approaches.
Conclusion: There is a persistent gap between perceptual quality and true distributional fidelity in audio super-resolution models, highlighting that current evaluation metrics don’t capture distributional differences that classifiers can easily detect.
Abstract: Generative adversarial networks (GANs) and diffusion models have recently achieved state-of-the-art performance in audio super-resolution (ADSR), producing perceptually convincing wideband audio from narrowband inputs. However, existing evaluations primarily rely on signal-level or perceptual metrics, leaving open the question of how closely the distributions of synthetic super-resolved and real wideband audio match. Here we address this problem by analyzing the separability of real and super-resolved audio in various embedding spaces. We consider both middle-band ($4\to 16$~kHz) and full-band ($16\to 48$~kHz) upsampling tasks for speech and music, training linear classifiers to distinguish real from synthetic samples based on multiple types of audio embeddings. Comparisons with objective metrics and subjective listening tests reveal that embedding-based classifiers achieve near-perfect separation, even when the generated audio attains high perceptual quality and state-of-the-art metric scores. This behavior is consistent across datasets and models, including recent diffusion-based approaches, highlighting a persistent gap between perceptual quality and true distributional fidelity in ADSR models.
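The probe itself is simple: embed both sets, fit a linear classifier, and read the held-out AUC. A minimal sketch, assuming precomputed embedding matrices (the paper evaluates several embedding types and both upsampling bands):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def separability(emb_real: np.ndarray, emb_synth: np.ndarray) -> float:
    """Held-out AUC of a linear real-vs-synthetic probe; values near 1.0
    mean the two distributions are trivially separable in this space."""
    X = np.vstack([emb_real, emb_synth])
    y = np.concatenate([np.zeros(len(emb_real)), np.ones(len(emb_synth))])
    Xtr, Xte, ytr, yte = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
    return roc_auc_score(yte, clf.predict_proba(Xte)[:, 1])
```

Near-perfect AUC alongside strong perceptual scores is precisely the perceptual-vs-distributional gap the paper reports.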
[500] Learning from Limited Labels: Transductive Graph Label Propagation for Indian Music Analysis
Parampreet Singh, Akshay Raina, Sayeedul Islam Sheikh, Vipul Arora
Main category: eess.AS
TL;DR: Label propagation using graph-based semi-supervised learning reduces annotation overhead for audio/music tasks by propagating labels from small labeled sets to larger unlabeled collections.
Details
Motivation: Audio and music domains lack large annotated datasets due to resource-intensive, laborious annotation processes requiring expert domain knowledge, creating a bottleneck for supervised machine learning approaches.Method: Uses label propagation (LP), a graph-based semi-supervised learning technique, by constructing similarity graphs over audio embeddings to propagate limited label information from small annotated subsets to larger unlabeled corpora in transductive settings (a compact LP call is sketched after the abstract).
Result: LP significantly reduces labeling overhead and produces higher-quality annotations compared to conventional baseline methods, including pretrained inductive models, for both Raga identification and Instrument classification tasks in Indian Art Music.
Conclusion: Graph-based semi-supervised learning has potential to democratize data annotation and accelerate progress in music information retrieval by reducing dependency on extensive manual labeling.
Abstract: Supervised machine learning frameworks rely on extensive labeled datasets for robust performance on real-world tasks. However, there is a lack of large annotated datasets in audio and music domains, as annotating such recordings is resource-intensive, laborious, and often requires expert domain knowledge. In this work, we explore the use of label propagation (LP), a graph-based semi-supervised learning technique, for automatically labeling the unlabeled set in an unsupervised manner. By constructing a similarity graph over audio embeddings, we propagate limited label information from a small annotated subset to a larger unlabeled corpus in a transductive, semi-supervised setting. We apply this method to two tasks in Indian Art Music (IAM): Raga identification and Instrument classification. For both these tasks, we integrate multiple public datasets along with additional recordings we acquire from Prasar Bharati Archives to perform LP. Our experiments demonstrate that LP significantly reduces labeling overhead and produces higher-quality annotations compared to conventional baseline methods, including those based on pretrained inductive models. These results highlight the potential of graph-based semi-supervised learning to democratize data annotation and accelerate progress in music information retrieval.
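As a compact stand-in for the pipeline, scikit-learn's transductive LabelPropagation reproduces the core mechanics: build a similarity graph over embeddings, then diffuse the few known labels. The kernel choice and gamma below are illustrative, not the paper's graph construction.

```python
import numpy as np
from sklearn.semi_supervised import LabelPropagation

def propagate_labels(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """X: (N, D) audio embeddings. y: integer labels, with -1 marking
    the unlabeled items. Returns labels for all N items (transductive)."""
    lp = LabelPropagation(kernel="rbf", gamma=20)
    lp.fit(X, y)             # builds the graph and iterates propagation
    return lp.transduction_
```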
[501] ReStyle-TTS: Relative and Continuous Style Control for Zero-Shot Speech Synthesis
Haitao Li, Chunxiang Jin, Chenglin Li, Wenhao Guan, Zhengxing Huang, Xie Chen
Main category: eess.AS
TL;DR: ReStyle-TTS enables continuous, reference-relative style control in zero-shot TTS by reducing implicit reference style dependence and adding explicit control mechanisms.
Details
Motivation: Current zero-shot TTS models inherit speaking style from reference audio, requiring careful reference selection. Existing controllable TTS methods use absolute style targets and discrete textual prompts, lacking continuous and reference-relative control.Method: Introduces Decoupled Classifier-Free Guidance (DCFG) to independently control text and reference guidance, reducing reference style reliance. Uses style-specific LoRAs with Orthogonal LoRA Fusion for continuous multi-attribute control, and Timbre Consistency Optimization to prevent timbre drift (one plausible form of the decoupled guidance is sketched after the abstract).
Result: Enables user-friendly continuous relative control over pitch, energy, and multiple emotions while maintaining intelligibility and speaker timbre. Performs robustly in challenging mismatched reference-target style scenarios.
Conclusion: ReStyle-TTS successfully addresses the limitation of zero-shot TTS by enabling continuous, reference-relative style control through decoupled guidance and specialized optimization techniques.
Abstract: Zero-shot text-to-speech models can clone a speaker’s timbre from a short reference audio, but they also strongly inherit the speaking style present in the reference. As a result, synthesizing speech with a desired style often requires carefully selecting reference audio, which is impractical when only limited or mismatched references are available. While recent controllable TTS methods attempt to address this issue, they typically rely on absolute style targets and discrete textual prompts, and therefore do not support continuous and reference-relative style control. We propose ReStyle-TTS, a framework that enables continuous and reference-relative style control in zero-shot TTS. Our key insight is that effective style control requires first reducing the model’s implicit dependence on reference style before introducing explicit control mechanisms. To this end, we introduce Decoupled Classifier-Free Guidance (DCFG), which independently controls text and reference guidance, reducing reliance on reference style while preserving text fidelity. On top of this, we apply style-specific LoRAs together with Orthogonal LoRA Fusion to enable continuous and disentangled multi-attribute control, and introduce a Timbre Consistency Optimization module to mitigate timbre drift caused by weakened reference guidance. Experiments show that ReStyle-TTS enables user-friendly, continuous, and relative control over pitch, energy, and multiple emotions while maintaining intelligibility and speaker timbre, and performs robustly in challenging mismatched reference-target style scenarios.
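Standard CFG uses a single guidance scale for the joint condition; DCFG splits it in two. The decomposition below is one plausible form consistent with the abstract — the prediction names and exact term ordering are assumptions, not ReStyle-TTS's published equations.

```python
def decoupled_cfg(eps_uncond, eps_text, eps_text_ref,
                  w_text: float, w_ref: float):
    """eps_uncond: model prediction with no conditioning;
    eps_text: text-only conditioning; eps_text_ref: text + reference audio.
    w_text preserves transcript fidelity; lowering w_ref weakens the
    reference style's pull, making room for explicit style control."""
    return (eps_uncond
            + w_text * (eps_text - eps_uncond)
            + w_ref * (eps_text_ref - eps_text))
```

Lowering w_ref while keeping w_text high is what would let the style LoRAs apply relative adjustments without fighting the reference.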
[502] TellWhisper: Tell Whisper Who Speaks When
Yifan Hu, Peiji Yang, Zhisheng Wang, Yicheng Zhong, Rui Liu
Main category: eess.AS
TL;DR: TellWhisper is a unified framework for multi-speaker ASR that jointly models speaker identity and temporal information using TS-RoPE (time-speaker rotary positional encoding) and Hyper-SD for speaker activity estimation in hyperbolic space.
Details
Motivation: Existing MASR approaches decouple temporal modeling and speaker modeling, causing irreversible information loss or entangled representations, which leads to degraded performance under rapid turn-taking and overlapping speech.Method: Proposes TellWhisper with two key components: 1) TS-RoPE (time-speaker rotary positional encoding) that derives time coordinates from frame indices and speaker coordinates from speaker activity/pause cues, enabling attention to simultaneously attend to “when” and “who”; 2) Hyper-SD that casts speaker classification in hyperbolic space to enhance inter-class separation and refine speaker activity estimates (a simplified two-coordinate rotation is sketched after the abstract).
Result: Extensive experiments demonstrate the effectiveness of the proposed approach, showing improved performance in multi-speaker ASR tasks.
Conclusion: TellWhisper provides a unified framework that jointly models speaker identity and temporal information within the speech encoder, addressing limitations of decoupled approaches and improving performance in challenging scenarios like rapid turn-taking and overlapping speech.
Abstract: Multi-speaker automatic speech recognition (MASR) aims to predict ‘‘who spoke when and what’’ from multi-speaker speech, a key technology for multi-party dialogue understanding. However, most existing approaches decouple temporal modeling and speaker modeling when addressing ‘‘when’’ and ‘‘who’’: some inject speaker cues before encoding (e.g., speaker masking), which can cause irreversible information loss; others fuse identity by mixing speaker posteriors after encoding, which may entangle acoustic content with speaker identity. This separation is brittle under rapid turn-taking and overlapping speech, often leading to degraded performance. To address these limitations, we propose TellWhisper, a unified framework that jointly models speaker identity and temporal information within the speech encoder. Specifically, we design TS-RoPE, a time-speaker rotary positional encoding: time coordinates are derived from frame indices, while speaker coordinates are derived from speaker activity and pause cues. By applying region-specific rotation angles, the model explicitly captures per-speaker continuity, speaker-turn transitions, and state dynamics, enabling the attention mechanism to simultaneously attend to ‘‘when’’ and ‘‘who’’. Moreover, to estimate frame-level speaker activity, we develop Hyper-SD, which casts speaker classification in hyperbolic space to enhance inter-class separation and refine speaker-activity estimates. Extensive experiments demonstrate the effectiveness of the proposed approach.
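The sketch below shows the mechanical core of a two-coordinate rotary encoding: half of the channel pairs rotate by frame index, half by a per-frame speaker coordinate. The even split and frequency schedule are assumptions; the paper's region-specific rotation angles are more structured than this.

```python
import torch

def ts_rope(x: torch.Tensor, frame_idx: torch.Tensor,
            spk_coord: torch.Tensor) -> torch.Tensor:
    """x: (T, D) features, D divisible by 4. frame_idx: (T,) frame indices.
    spk_coord: (T,) scalar per frame, derived from activity/pause cues."""
    T, D = x.shape
    half = D // 2

    def rotate(v: torch.Tensor, pos: torch.Tensor) -> torch.Tensor:
        dim = v.shape[-1]
        freqs = 1.0 / (10000 ** (torch.arange(0, dim, 2) / dim))
        ang = pos[:, None] * freqs[None, :]
        cos, sin = ang.cos(), ang.sin()
        out = torch.empty_like(v)
        out[:, 0::2] = v[:, 0::2] * cos - v[:, 1::2] * sin
        out[:, 1::2] = v[:, 0::2] * sin + v[:, 1::2] * cos
        return out

    time_part = rotate(x[:, :half], frame_idx.float())   # attends to "when"
    spk_part = rotate(x[:, half:], spk_coord.float())    # attends to "who"
    return torch.cat([time_part, spk_part], dim=-1)

x = torch.randn(6, 16)
out = ts_rope(x, torch.arange(6), torch.tensor([0., 0., 1., 1., 1., 0.]))
```

Because attention scores depend on relative rotations, frames sharing a speaker coordinate align in the speaker sub-space while staying ordered in the time sub-space.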
[503] Sound Event Detection with Boundary-Aware Optimization and Inference
Florian Schmid, Chi Ian Tang, Sanjeel Parekh, Vamsi Krishna Ithapu, Juan Azcarreta Ortiz, Giacomo Ferroni, Yijun Qian, Arnoldas Jasonas, Cosmin Frateanu, Camilla Clark, Gerhard Widmer, Çağdaş Bilen
Main category: eess.AS
TL;DR: Proposes a new temporal event detection approach with explicit onset/offset modeling, boundary-aware optimization, and specialized layers (RED & EPN) that achieves state-of-the-art SED performance on AudioSet without post-processing tuning.
Details
Motivation: Temporal detection problems in various fields (time-series, activity recognition, SED) need better event boundary modeling. Traditional frame-wise SED models with post-processing have limitations in precise temporal event detection and require hyperparameter tuning.Method: Introduces explicit modeling of event onsets and offsets with boundary-aware optimization and inference. Proposes two new temporal modeling layers: Recurrent Event Detection (RED) and Event Proposal Network (EPN), along with tailored loss functions for precise temporal event detection (a simple boundary decode is sketched after the abstract).
Result: Outperforms traditional frame-wise SED models with state-of-the-art post-processing on AudioSet Strong subset. Removes need for post-processing hyperparameter tuning. Achieves new state-of-the-art performance across all AudioSet Strong classes.
Conclusion: The proposed approach with explicit event boundary modeling and specialized layers provides more effective and precise temporal event detection, setting new benchmarks for SED while eliminating post-processing complexity.
Abstract: Temporal detection problems appear in many fields including time-series estimation, activity recognition and sound event detection (SED). In this work, we propose a new approach to temporal event modeling by explicitly modeling event onsets and offsets, and by introducing boundary-aware optimization and inference strategies that substantially enhance temporal event detection. The presented methodology incorporates new temporal modeling layers - Recurrent Event Detection (RED) and Event Proposal Network (EPN) - which, together with tailored loss functions, enable more effective and precise temporal event detection. We evaluate the proposed method in the SED domain using a subset of the temporally-strongly annotated portion of AudioSet. Experimental results show that our approach not only outperforms traditional frame-wise SED models with state-of-the-art post-processing, but also removes the need for post-processing hyperparameter tuning, and scales to achieve new state-of-the-art performance across all AudioSet Strong classes.
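At inference time, boundary-aware modeling turns two per-frame probability tracks into discrete events. The decode below is a deliberately simple pairing rule, included only to show the output format; RED and EPN are learned layers, not this heuristic.

```python
from typing import List, Tuple

def decode_events(onset_prob: List[float], offset_prob: List[float],
                  thr: float = 0.5) -> List[Tuple[int, int]]:
    """Pair each onset peak with the next offset peak into (start, end)."""
    events: List[Tuple[int, int]] = []
    open_onset = None
    for t, (p_on, p_off) in enumerate(zip(onset_prob, offset_prob)):
        if open_onset is None and p_on > thr:
            open_onset = t
        elif open_onset is not None and p_off > thr:
            events.append((open_onset, t))
            open_onset = None
    return events

print(decode_events([0.9, 0.1, 0.1, 0.8, 0.1], [0.0, 0.1, 0.9, 0.0, 0.7]))
# [(0, 2), (3, 4)]
```

Note there is no median filter or threshold sweep left to tune — the appeal of predicting boundaries directly.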
eess.IV
[504] Edit2Restore: Few-Shot Image Restoration via Parameter-Efficient Adaptation of Pre-trained Editing Models
M. Akın Yılmaz, Ahmet Bilican, Burak Can Biner, A. Murat Tekalp
Main category: eess.IV
TL;DR: Pre-trained text-conditioned image editing models (FLUX.1 Kontext) can be adapted for multiple restoration tasks using LoRA fine-tuning with only 16-128 images per task, guided by text prompts.
Details
Motivation: Traditional image restoration requires training specialized models on thousands of paired examples per degradation type. This work challenges that paradigm by leveraging pre-trained models' rich visual priors to dramatically reduce data requirements.Method: Fine-tune LoRA adapters on FLUX.1 Kontext (12B parameter flow matching model) using only 16-128 paired images per task. Use text prompts to specify restoration operations (denoising, deraining, dehazing). A single unified adapter handles multiple degradations (a generic LoRA-wrapping sketch follows the abstract).
Result: The approach maintains high perceptual quality while using far fewer training examples than traditional methods. Comprehensive ablation studies analyze training set size impact, task-specific vs unified adapters, text encoder fine-tuning, and zero-shot performance.
Conclusion: Pre-trained image editing models, when properly adapted with LoRA and text prompts, offer a compelling data-efficient alternative to traditional restoration approaches, enabling few-shot, prompt-guided image enhancement.
Abstract: Image restoration has traditionally required training specialized models on thousands of paired examples per degradation type. We challenge this paradigm by demonstrating that powerful pre-trained text-conditioned image editing models can be efficiently adapted for multiple restoration tasks through parameter-efficient fine-tuning with remarkably few examples. Our approach fine-tunes LoRA adapters on FLUX.1 Kontext, a state-of-the-art 12B parameter flow matching model for image-to-image translation, using only 16-128 paired images per task, guided by simple text prompts that specify the restoration operation. Unlike existing methods that train specialized restoration networks from scratch with thousands of samples, we leverage the rich visual priors already encoded in large-scale pre-trained editing models, dramatically reducing data requirements while maintaining high perceptual quality. A single unified LoRA adapter, conditioned on task-specific text prompts, effectively handles multiple degradations including denoising, deraining, and dehazing. Through comprehensive ablation studies, we analyze: (i) the impact of training set size on restoration quality, (ii) trade-offs between task-specific versus unified multi-task adapters, (iii) the role of text encoder fine-tuning, and (iv) zero-shot baseline performance. While our method prioritizes perceptual quality over pixel-perfect reconstruction metrics like PSNR/SSIM, our results demonstrate that pre-trained image editing models, when properly adapted, offer a compelling and data-efficient alternative to traditional image restoration approaches, opening new avenues for few-shot, prompt-guided image enhancement. The code to reproduce our results is available at: https://github.com/makinyilmaz/Edit2Restore
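Wiring LoRA adapters into a pre-trained backbone takes a few lines with the peft library. The snippet below wraps a stand-in module; with FLUX.1 Kontext the same call would target the transformer's attention projections (module names there differ, and this is a sketch, not the authors' training script).

```python
import torch.nn as nn
from peft import LoraConfig, get_peft_model

# Stand-in for the 12B flow-matching transformer.
backbone = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64))

lora_cfg = LoraConfig(r=16, lora_alpha=32, target_modules=["0", "2"])
model = get_peft_model(backbone, lora_cfg)
model.print_trainable_parameters()  # only the low-rank adapters are trainable
```

Training then pairs each degraded image with a text prompt naming the operation (e.g., deraining or dehazing) and its clean target — 16-128 such pairs per task, per the paper.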
[505] GeoDiff-SAR: A Geometric Prior Guided Diffusion Model for SAR Image Generation
Fan Zhang, Xuanting Wu, Fei Ma, Qiang Yin, Yuxin Hu
Main category: eess.IV
TL;DR: GeoDiff-SAR is a physics-guided diffusion model that generates high-fidelity SAR images by incorporating explicit geometric information through SAR point cloud simulation and multi-modal feature fusion.
Details
Motivation: Existing SAR image generation methods operate only in the image domain without explicit geometric information, leading to poor quality and inability to control critical parameters like azimuth angles.Method: Three key components: 1) SAR point cloud simulation for geometric guidance, 2) FiLM-based feature fusion gating network for multi-modal information integration, 3) LoRA fine-tuning of Stable Diffusion 3.5 for SAR domain adaptation (a minimal FiLM gate is sketched after the abstract).
Result: GeoDiff-SAR generates high-fidelity SAR images that significantly improve downstream classification accuracy, especially for recognition across different azimuth angles.
Conclusion: Physics-guided generation with explicit geometric information is superior for SAR image synthesis, enabling precise parameter control and enhanced downstream task performance.
Abstract: Synthetic Aperture Radar (SAR) imaging results are highly sensitive to observation geometries and the geometric parameters of targets. However, existing generative methods primarily operate within the image domain, neglecting explicit geometric information. This limitation often leads to unsatisfactory generation quality and the inability to precisely control critical parameters such as azimuth angles. To address these challenges, we propose GeoDiff-SAR, a geometric prior guided diffusion model for high-fidelity SAR image generation. Specifically, GeoDiff-SAR first efficiently simulates the geometric structures and scattering relationships inherent in real SAR imaging by calculating SAR point clouds at specific azimuths, which serves as a robust physical guidance. Secondly, to effectively fuse multi-modal information, we employ a feature fusion gating network based on Feature-wise Linear Modulation (FiLM) to dynamically regulate the weight distribution of 3D physical information, image control parameters, and textual description parameters. Thirdly, we utilize the Low-Rank Adaptation (LoRA) architecture to perform lightweight fine-tuning on the advanced Stable Diffusion 3.5 (SD3.5) model, enabling it to rapidly adapt to the distribution characteristics of the SAR domain. To validate the effectiveness of GeoDiff-SAR, extensive comparative experiments were conducted on real-world SAR datasets. The results demonstrate that data generated by GeoDiff-SAR exhibits high fidelity and effectively enhances the accuracy of downstream classification tasks. In particular, it significantly improves recognition performance across different azimuth angles, thereby underscoring the superiority of physics-guided generation.
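FiLM conditioning is a per-channel affine modulation predicted from a conditioning vector. A minimal sketch, assuming a fused conditioning vector standing in for the point-cloud, control, and text parameters:

```python
import torch
import torch.nn as nn

class FiLMGate(nn.Module):
    """Conditioning vector -> per-channel scale (gamma) and shift (beta)
    applied to the image features. Shows only the modulation primitive."""
    def __init__(self, cond_dim: int, n_channels: int):
        super().__init__()
        self.to_gamma_beta = nn.Linear(cond_dim, 2 * n_channels)

    def forward(self, feats: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W), cond: (B, cond_dim)
        gamma, beta = self.to_gamma_beta(cond).chunk(2, dim=-1)
        return feats * (1 + gamma[:, :, None, None]) + beta[:, :, None, None]

gate = FiLMGate(cond_dim=32, n_channels=64)
y = gate(torch.randn(2, 64, 16, 16), torch.randn(2, 32))  # (2, 64, 16, 16)
```

The gating network described in the paper additionally learns to weight the three modalities against each other; that dynamic weighting sits on top of this primitive.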
[506] Ensemble Models for Predicting Treatment Response in Pediatric Low-Grade Glioma Managed with Chemotherapy
Max Bengtsson, Elif Keles, Angela J. Waanders, Ulas Bagci
Main category: eess.IV
TL;DR: A novel pipeline combining MRI segmentation, radiomics, and clinical data with Swin UNETR and XGBoost ensemble for predicting chemotherapy response in pediatric brain tumors not amenable to complete surgical resection.
Details
Motivation: Pediatric brain tumors that cannot be completely surgically removed have lower progression-free survival and rely on chemotherapy as primary treatment. There is a need for non-invasive methods to predict chemotherapy response to enable personalized therapy for this challenging population.Method: Integrates a pediatric brain tumor segmentation framework to delineate four tumor subregions (enhancing tumor, non-enhancing tumor, cystic component, edema), extracts radiomic features, and combines them with clinical data. Uses an ensemble of a Swin UNETR encoder (for MRI classification) and an XGBoost classifier (for radiomics and clinical variables) to predict chemotherapy response (the fusion step is sketched after the abstract).
Result: The Swin-Ensemble achieved the best performance: precision for non-effective cases=0.68, recall for non-effective cases=0.85, precision for chemotherapy-effective cases=0.64, overall accuracy=0.69. It outperformed the Mamba-FeatureFuse, Swin UNETR encoder, and Swin-FeatureFuse models.
Conclusion: The ensemble framework represents a promising step toward personalized therapy response prediction for pediatric low-grade glioma patients who need chemotherapy but are not suitable for complete surgical resection, potentially improving treatment outcomes for this high-risk population.
Abstract: In this paper, we introduce a novel pipeline for predicting chemotherapy response in pediatric brain tumors that are not amenable to complete surgical resection, using pre-treatment magnetic resonance imaging combined with clinical information. Our method integrates a state-of-the-art pediatric brain tumor segmentation framework with radiomic feature extraction and clinical data through an ensemble of a Swin UNETR encoder and XGBoost classifier. The segmentation model delineates four tumor subregions (enhancing tumor, non-enhancing tumor, cystic component, and edema), which are used to extract imaging biomarkers and generate predictive features. The Swin UNETR network classifies the response to treatment directly from these segmented MRI scans, while XGBoost predicts response using radiomics and clinical variables including legal sex, ethnicity, race, age at event (in days), molecular subtype, tumor locations, initial surgery status, metastatic status, metastasis location, chemotherapy type, protocol name and chemotherapy agents. The ensemble output provides a non-invasive estimate of chemotherapy response in this historically challenging population characterized by lower progression-free survival. Among compared approaches, our Swin-Ensemble achieved the best performance (precision for non-effective cases=0.68, recall for non-effective cases=0.85, precision for chemotherapy-effective cases=0.64 and overall accuracy=0.69), outperforming Mamba-FeatureFuse, Swin UNETR encoder, and Swin-FeatureFuse models. Our findings suggest that this ensemble framework represents a promising step toward personalized therapy response prediction for pediatric low-grade glioma patients in need of chemotherapy treatment who are not suitable for complete surgical resection, a population with significantly lower progression-free survival and for whom chemotherapy remains the primary treatment option.
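The fusion step itself reduces to combining probabilities from two heterogeneous predictors. A sketch of that final stage (the weighting is illustrative; the abstract does not specify how the two outputs are combined):

```python
import numpy as np

def ensemble_response(p_swin: np.ndarray, p_xgb: np.ndarray,
                      w: float = 0.5) -> np.ndarray:
    """p_swin: per-patient response probability from the Swin UNETR image
    classifier; p_xgb: probability from XGBoost on radiomics + clinical
    variables. Returns the blended probability (weight w is assumed)."""
    return w * p_swin + (1.0 - w) * p_xgb

print(ensemble_response(np.array([0.7, 0.2]), np.array([0.5, 0.4])))
# [0.6 0.3]
```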
[507] A low-complexity method for efficient depth-guided image deblurring
Ziyao Yi, Diego Valsesia, Tiziano Bianchi, Enrico Magli
Main category: eess.IV
TL;DR: A low-complexity neural network using wavelet transform and depth guidance for efficient image deblurring, achieving competitive quality with 100x lower complexity.
Details
Motivation: Image deblurring is computationally intensive for deep learning models, making them impractical for mobile/edge devices. Mobile Lidars provide depth maps that can enhance deblurring quality while reducing computational complexity.Method: A novel neural network using the wavelet transform to separate structural details and reduce spatial redundancy, combined with efficient feature conditioning on depth information from mobile Lidars (the subband decomposition is sketched after the abstract).
Result: Achieves competitive image quality against recent state-of-the-art models while reducing computational complexity by up to two orders of magnitude (100x reduction).
Conclusion: Wavelet transform for detail separation and efficient depth feature conditioning are essential for developing low-complexity depth-guided image deblurring models suitable for mobile/edge deployment.
Abstract: Image deblurring is a challenging problem in imaging due to its highly ill-posed nature. Deep learning models have shown great success in tackling this problem but the quest for the best image quality has brought their computational complexity up, making them impractical on anything but powerful servers. Meanwhile, recent works have shown that mobile Lidars can provide complementary information in the form of depth maps that enhance deblurring quality. In this paper, we introduce a novel low-complexity neural network for depth-guided image deblurring. We show that the use of the wavelet transform to separate structural details and reduce spatial redundancy as well as efficient feature conditioning on the depth information are essential ingredients in developing a low-complexity model. Experimental results show competitive image quality against recent state-of-the-art models while reducing complexity by up to two orders of magnitude.
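The wavelet trick is what buys the complexity reduction: one DWT level quarters the spatial size of every subband the network must process, and the transform is exactly invertible. A sketch with PyWavelets (the network that refines the subbands under depth guidance is, of course, learned):

```python
import numpy as np
import pywt

img = np.random.rand(256, 256).astype(np.float32)   # stand-in blurry image
LL, (LH, HL, HH) = pywt.dwt2(img, "haar")           # four 128x128 subbands
# ... a compact depth-conditioned network would refine LH/HL/HH here ...
restored = pywt.idwt2((LL, (LH, HL, HH)), "haar")   # exact reconstruction
assert np.allclose(restored, img, atol=1e-5)
```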
[508] Scanner-Induced Domain Shifts Undermine the Robustness of Pathology Foundation Models
Erik Thiringer, Fredrik K. Gustafsson, Kajsa Ledesma Eriksson, Mattias Rantalainen
Main category: eess.IV
TL;DR: PFMs show poor robustness to scanner variability despite good benchmark performance, with scanner-dependent bias affecting embedding spaces and calibration that could impact clinical reliability.
Details
Motivation: To systematically evaluate the robustness of pathology foundation models to real-world technical domain shifts, specifically scanner-induced variability, which remains poorly understood despite strong benchmark performance.Method: Evaluated 14 PFMs using a multiscanner dataset of 384 breast cancer WSIs from 5 devices, isolating scanner effects from biological/laboratory confounders. Used complementary unsupervised embedding analyses and clinicopathological supervised prediction tasks.
Result: Current PFMs are not invariant to scanner-induced domain shifts - most encode pronounced scanner-specific variability. While AUC often remains stable, this masks a critical failure: scanner variability systematically alters the embedding space and impacts calibration, causing scanner-dependent bias. Robustness is not correlated with training data scale, model size, or recency.
Conclusion: PFM development/evaluation must move beyond accuracy-centric benchmarks to explicitly evaluate and optimize embedding stability and calibration under realistic acquisition variability for reliable clinical use.
Abstract: Pathology foundation models (PFMs) have become central to computational pathology, aiming to offer general encoders for feature extraction from whole-slide images (WSIs). Despite strong benchmark performance, PFM robustness to real-world technical domain shifts, such as variability from whole-slide scanner devices, remains poorly understood. We systematically evaluated the robustness of 14 PFMs to scanner-induced variability, including state-of-the-art models, earlier self-supervised models, and a baseline trained on natural images. Using a multiscanner dataset of 384 breast cancer WSIs scanned on five devices, we isolated scanner effects independently from biological and laboratory confounders. Robustness is assessed via complementary unsupervised embedding analyses and a set of clinicopathological supervised prediction tasks. Our results demonstrate that current PFMs are not invariant to scanner-induced domain shifts. Most models encode pronounced scanner-specific variability in their embedding spaces. While AUC often remains stable, this masks a critical failure mode: scanner variability systematically alters the embedding space and impacts calibration of downstream model predictions, resulting in scanner-dependent bias that can impact reliability in clinical use cases. We further show that robustness is not a simple function of training data scale, model size, or model recency. None of the models provided reliable robustness against scanner-induced variability. While the models trained on the most diverse data, here represented by vision-language models, appear to have an advantage with respect to robustness, they underperformed on downstream supervised tasks. We conclude that development and evaluation of PFMs requires moving beyond accuracy-centric benchmarks toward explicit evaluation and optimisation of embedding stability and calibration under realistic acquisition variability.
[509] Learn2Reg 2024: New Benchmark Datasets Driving Progress on New Challenges
Lasse Hansen, Wiebke Heyer, Christoph Großbröhmer, Frederic Madesta, Thilo Sentker, Wang Jiazheng, Yuxi Zhang, Hang Zhang, Min Liu, Junyi Wang, Xi Zhu, Yuhua Li, Liwen Wang, Daniil Morozov, Nazim Haouchine, Joel Honkamaa, Pekka Marttinen, Yichao Zhou, Zuopeng Tan, Zhuoyuan Wang, Yi Wang, Hongchao Zhou, Shunbo Hu, Yi Zhang, Qian Tao, Lukas Förner, Thomas Wendler, Bailiang Jian, Christian Wachinger, Jin Kim, Dan Ruan, Marek Wodzinski, Henning Müller, Tony C. W. Mok, Xi Jia, Jinming Duan, Mikael Brudfors, Seyed-Ahmad Ahmadi, Yunzheng Zhu, William Hsu, Tina Kapur, William M. Wells, Alexandra Golby, Aaron Carass, Harrison Bai, Yihao Liu, Perrine Paul-Gilloteaux, Joakim Lindblad, Nataša Sladoje, Andreas Walter, Junyu Chen, Reuben Dorent, Alessa Hering, Mattias P. Heinrich
Main category: eess.IV
TL;DR: The Learn2Reg 2024 challenge expands medical image registration benchmarking with new tasks covering multi-modal registration, inter-subject brain registration, and microscopy, while inspiring new method developments.
Details
Motivation: To provide fair benchmarking for medical image registration methods and monitor progress in the field by expanding the scope to cover more diverse registration scenarios.Method: Building on previous Learn2Reg challenges (2020-2023), the 2024 edition introduces three new tasks: large-scale multi-modal registration, unsupervised inter-subject brain registration, and the first microscopy-focused benchmark. The challenge uses established metrics and complementary datasets.
Result: The expanded challenge scope covers wider modality diversity and task complexity. New datasets inspired method developments including invertibility constraints, pyramid features, keypoints alignment, and instance optimization.
Conclusion: The Learn2Reg 2024 challenge advances medical image registration benchmarking by addressing more complex and diverse clinical scenarios, while driving innovation in registration methods through new datasets and tasks.
Abstract: Medical image registration is critical for clinical applications, and fair benchmarking of different methods is essential for monitoring ongoing progress in the field. To date, the Learn2Reg 2020-2023 challenges have released several complementary datasets and established metrics for evaluations. Building on this foundation, the 2024 edition expands the challenge’s scope to cover a wider range of registration scenarios, particularly in terms of modality diversity and task complexity, by introducing three new tasks, including large-scale multi-modal registration and unsupervised inter-subject brain registration, as well as the first microscopy-focused benchmark within Learn2Reg. The new datasets also inspired new method developments, including invertibility constraints, pyramid features, keypoints alignment and instance optimisation. Visit Learn2Reg at https://learn2reg.grand-challenge.org.
[510] Staged Voxel-Level Deep Reinforcement Learning for 3D Medical Image Segmentation with Noisy Annotations
Yuyang Fu, Xiuzhen Guo, Ji Shi
Main category: eess.IV
TL;DR: SVL-DRL is an end-to-end staged voxel-level deep reinforcement learning framework for robust medical image segmentation under noisy annotations, achieving state-of-the-art performance with 3% average improvement in Dice and IoU scores.
Details
Motivation: Noisy annotations in medical image segmentation due to complex organ morphology and inter-annotator variability significantly limit segmentation model efficacy. The paper is motivated by how medical imaging annotators can correct labeling errors using prior knowledge.Method: Proposes SVL-DRL framework with: 1) Formulating noisy annotations as voxel-dependent problem using staged reinforcement learning; 2) Voxel-level asynchronous advantage actor-critic (vA3C) module treating each voxel as autonomous agent; 3) Novel action space and composite reward function combining Dice value and spatial continuity metric.
Result: Achieves state-of-the-art performance on three public medical image datasets under various experimental settings, with average improvement of over 3% in both Dice and IoU scores.
Conclusion: SVL-DRL provides an effective end-to-end solution for robust medical image segmentation under noisy annotations, automatically mitigating erroneous label impact without manual intervention through dynamic iterative update strategy.
Abstract: Deep learning has achieved significant advancements in medical image segmentation. Currently, obtaining accurate segmentation outcomes is critically reliant on large-scale datasets with high-quality annotations. However, noisy annotations are frequently encountered owing to the complex morphological structures of organs in medical images and variations among different annotators, which can substantially limit the efficacy of segmentation models. Motivated by the fact that medical imaging annotators can correct labeling errors during segmentation based on prior knowledge, we propose an end-to-end Staged Voxel-Level Deep Reinforcement Learning (SVL-DRL) framework for robust medical image segmentation under noisy annotations. This framework employs a dynamic iterative update strategy to automatically mitigate the impact of erroneous labels without requiring manual intervention. The key advancements of SVL-DRL over existing works include: i) formulating noisy annotations as a voxel-dependent problem and addressing it through a novel staged reinforcement learning framework which guarantees robust model convergence; ii) incorporating a voxel-level asynchronous advantage actor-critic (vA3C) module that conceptualizes each voxel as an autonomous agent, which allows each agent to dynamically refine its own state representation during training, thereby directly mitigating the influence of erroneous labels; iii) designing a novel action space for the agents, along with a composite reward function that strategically combines the Dice value and a spatial continuity metric to significantly boost segmentation accuracy while maintaining semantic integrity. Experiments on three public medical image datasets demonstrate State-of-The-Art (SoTA) performance under various experimental settings, with an average improvement of over 3% in both Dice and IoU scores.
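A plausible form of the composite reward, with the Dice term exact and the continuity term a simple 6-neighborhood agreement measure — the weighting and the specific continuity metric are assumptions, not the paper's definitions:

```python
import numpy as np

def composite_reward(pred: np.ndarray, label: np.ndarray,
                     lam: float = 0.5) -> float:
    """pred, label: boolean 3D volumes. Dice rewards overlap; the
    continuity term penalizes scattered, spatially isolated voxels."""
    inter = np.logical_and(pred, label).sum()
    dice = 2.0 * inter / (pred.sum() + label.sum() + 1e-8)
    # fraction of predicted voxels with at least one predicted 6-neighbor
    # (np.roll wraps at borders; ignored here for brevity)
    neighbors = sum(np.roll(pred, s, axis=a)
                    for a in range(3) for s in (-1, 1))
    cont = np.logical_and(pred, neighbors > 0).sum() / (pred.sum() + 1e-8)
    return float(dice + lam * cont)
```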