MyArxiv
Computation and Language
☆ Thinking-while-Generating: Interleaving Textual Reasoning throughout Visual Generation
Recent advances in visual generation have increasingly explored the integration of reasoning capabilities. They incorporate textual reasoning, i.e., think, either before (as pre-planning) or after (as post-refinement) the generation process, yet they lack on-the-fly multimodal interaction during the generation itself. In this preliminary study, we introduce Thinking-while-Generating (TwiG), the first interleaved framework that enables co-evolving textual reasoning throughout the visual generation process. As visual content is progressively generating, textual reasoning is interleaved to both guide upcoming local regions and reflect on previously synthesized ones. This dynamic interplay produces more context-aware and semantically rich visual outputs. To unveil the potential of this framework, we investigate three candidate strategies, zero-shot prompting, supervised fine-tuning (SFT) on our curated TwiG-50K dataset, and reinforcement learning (RL) via a customized TwiG-GRPO strategy, each offering unique insights into the dynamics of interleaved reasoning. We hope this work inspires further research into interleaving textual reasoning for enhanced visual generation. Code will be released at: https://github.com/ZiyuGuo99/Thinking-while-Generating.
comment: Project Page: https://think-while-gen.github.io Code: https://github.com/ZiyuGuo99/Thinking-while-Generating
☆ Nemotron Elastic: Towards Efficient Many-in-One Reasoning LLMs
Training a family of large language models targeting multiple scales and deployment objectives is prohibitively expensive, requiring separate training runs for each different size. Recent work on model compression through pruning and knowledge distillation has reduced this cost; however, this process still incurs hundreds of billions of tokens worth of training cost per compressed model. In this paper, we present Nemotron Elastic, a framework for building reasoning-oriented LLMs, including hybrid Mamba-Attention architectures, that embed multiple nested submodels within a single parent model, each optimized for different deployment configurations and budgets. Each of these submodels shares weights with the parent model and can be extracted zero-shot during deployment without additional training or fine-tuning. We enable this functionality through an end-to-end trained router, tightly coupled to a two-stage training curriculum designed specifically for reasoning models. We additionally introduce group-aware SSM elastification that preserves Mamba's structural constraints, heterogeneous MLP elastification, normalized MSE-based layer importance for improved depth selection, and knowledge distillation enabling simultaneous multi-budget optimization. We apply Nemotron Elastic to the Nemotron Nano V2 12B model, simultaneously producing a 9B and a 6B model using only 110B training tokens; this results in over 360x cost reduction compared to training model families from scratch, and around 7x compared to SoTA compression techniques. Each of the nested models performs on par or better than the SoTA in accuracy. Moreover, unlike other compression methods, the nested capability of our approach allows having a many-in-one reasoning model that has constant deployment memory against the number of models in the family.
☆ Comparison of Text-Based and Image-Based Retrieval in Multimodal Retrieval Augmented Generation Large Language Model Systems
Recent advancements in Retrieval-Augmented Generation (RAG) have enabled Large Language Models (LLMs) to access multimodal knowledge bases containing both text and visual information such as charts, diagrams, and tables in financial documents. However, existing multimodal RAG systems rely on LLM-based summarization to convert images into text during preprocessing, storing only text representations in vector databases, which causes loss of contextual information and visual details critical for downstream retrieval and question answering. To address this limitation, we present a comprehensive comparative analysis of two retrieval approaches for multimodal RAG systems, including text-based chunk retrieval (where images are summarized into text before embedding) and direct multimodal embedding retrieval (where images are stored natively in the vector space). We evaluate all three approaches across 6 LLM models and a two multi-modal embedding models on a newly created financial earnings call benchmark comprising 40 question-answer pairs, each paired with 2 documents (1 image and 1 text chunk). Experimental results demonstrate that direct multimodal embedding retrieval significantly outperforms LLM-summary-based approaches, achieving absolute improvements of 13% in mean average precision (mAP@5) and 11% in normalized discounted cumulative gain. These gains correspond to relative improvements of 32% in mAP@5 and 20% in nDCG@5, providing stronger evidence of their practical impact. We additionally find that direct multimodal retrieval produces more accurate and factually consistent answers as measured by LLM-as-a-judge pairwise comparisons. We demonstrate that LLM summarization introduces information loss during preprocessing, whereas direct multimodal embeddings preserve visual context for retrieval and inference.
☆ Codec2Vec: Self-Supervised Speech Representation Learning Using Neural Speech Codecs
Recent advancements in neural audio codecs have not only enabled superior audio compression but also enhanced speech synthesis techniques. Researchers are now exploring their potential as universal acoustic feature extractors for a broader range of speech processing tasks. Building on this trend, we introduce Codec2Vec, the first speech representation learning framework that relies exclusively on discrete audio codec units. This approach offers several advantages, including improved data storage and transmission efficiency, faster training, and enhanced data privacy. We explore masked prediction with various training target derivation strategies to thoroughly understand the effectiveness of this framework. Evaluated on the SUPERB benchmark, Codec2Vec achieves competitive performance compared to continuous-input models while reducing storage requirements by up to 16.5x and training time by 2.3x, showcasing its scalability and efficiency.
comment: To be presented at ASRU 2025
☆ SurvAgent: Hierarchical CoT-Enhanced Case Banking and Dichotomy-Based Multi-Agent System for Multimodal Survival Prediction
Survival analysis is critical for cancer prognosis and treatment planning, yet existing methods lack the transparency essential for clinical adoption. While recent pathology agents have demonstrated explainability in diagnostic tasks, they face three limitations for survival prediction: inability to integrate multimodal data, ineffective region-of-interest exploration, and failure to leverage experiential learning from historical cases. We introduce SurvAgent, the first hierarchical chain-of-thought (CoT)-enhanced multi-agent system for multimodal survival prediction. SurvAgent consists of two stages: (1) WSI-Gene CoT-Enhanced Case Bank Construction employs hierarchical analysis through Low-Magnification Screening, Cross-Modal Similarity-Aware Patch Mining, and Confidence-Aware Patch Mining for pathology images, while Gene-Stratified analysis processes six functional gene categories. Both generate structured reports with CoT reasoning, storing complete analytical processes for experiential learning. (2) Dichotomy-Based Multi-Expert Agent Inference retrieves similar cases via RAG and integrates multimodal reports with expert predictions through progressive interval refinement. Extensive experiments on five TCGA cohorts demonstrate SurvAgent's superority over conventional methods, proprietary MLLMs, and medical agents, establishing a new paradigm for explainable AI-driven survival prediction in precision oncology.
comment: 20 pages
☆ TimeViper: A Hybrid Mamba-Transformer Vision-Language Model for Efficient Long Video Understanding
We introduce TimeViper, a hybrid vision-language model designed to tackle challenges of long video understanding. Processing long videos demands both an efficient model architecture and an effective mechanism for handling extended temporal contexts. To this end, TimeViper adopts a hybrid Mamba-Transformer backbone that combines the efficiency of state-space models with the expressivity of attention mechanisms. Through this hybrid design, we reveal the vision-to-text information aggregation phenomenon, where information progressively flows from vision tokens to text tokens across increasing LLM depth, resulting in severe vision token redundancy. Motivated by this observation, we propose TransV, a token information transfer module that transfers and compresses vision tokens into instruction tokens while maintaining multimodal understanding capabilities. This design enables TimeViper to process hour-long videos exceeding 10,000 frames. Extensive experiments across multiple benchmarks demonstrate that TimeViper competes with state-of-the-art models while extending frame numbers. We further analyze attention behaviors of both Mamba and Transformer layers, offering new insights into hybrid model interpretability. This work represents an initial step towards developing, interpreting, and compressing hybrid Mamba-Transformer architectures.
comment: Project page: https://xuboshen.github.io/TimeViper
☆ D-GARA: A Dynamic Benchmarking Framework for GUI Agent Robustness in Real-World Anomalies AAAI 2026
Developing intelligent agents capable of operating a wide range of Graphical User Interfaces (GUIs) with human-level proficiency is a key milestone on the path toward Artificial General Intelligence. While most existing datasets and benchmarks for training and evaluating GUI agents are static and idealized, failing to reflect the complexity and unpredictability of real-world environments, particularly the presence of anomalies. To bridge this research gap, we propose D-GARA, a dynamic benchmarking framework, to evaluate Android GUI agent robustness in real-world anomalies. D-GARA introduces a diverse set of real-world anomalies that GUI agents commonly face in practice, including interruptions such as permission dialogs, battery warnings, and update prompts. Based on D-GARA framework, we construct and annotate a benchmark featuring commonly used Android applications with embedded anomalies to support broader community research. Comprehensive experiments and results demonstrate substantial performance degradation in state-of-the-art GUI agents when exposed to anomaly-rich environments, highlighting the need for robustness-aware learning. D-GARA is modular and extensible, supporting the seamless integration of new tasks, anomaly types, and interaction scenarios to meet specific evaluation goals.
comment: Accepted to AAAI 2026
☆ Integrating Symbolic Natural Language Understanding and Language Models for Word Sense Disambiguation
Word sense disambiguation is a fundamental challenge in natural language understanding. Current methods are primarily aimed at coarse-grained representations (e.g. WordNet synsets or FrameNet frames) and require hand-annotated training data to construct. This makes it difficult to automatically disambiguate richer representations (e.g. built on OpenCyc) that are needed for sophisticated inference. We propose a method that uses statistical language models as oracles for disambiguation that does not require any hand-annotation of training data. Instead, the multiple candidate meanings generated by a symbolic NLU system are converted into distinguishable natural language alternatives, which are used to query an LLM to select appropriate interpretations given the linguistic context. The selected meanings are propagated back to the symbolic NLU system. We evaluate our method against human-annotated gold answers to demonstrate its effectiveness.
comment: 16 pages
☆ WER is Unaware: Assessing How ASR Errors Distort Clinical Understanding in Patient Facing Dialogue
As Automatic Speech Recognition (ASR) is increasingly deployed in clinical dialogue, standard evaluations still rely heavily on Word Error Rate (WER). This paper challenges that standard, investigating whether WER or other common metrics correlate with the clinical impact of transcription errors. We establish a gold-standard benchmark by having expert clinicians compare ground-truth utterances to their ASR-generated counterparts, labeling the clinical impact of any discrepancies found in two distinct doctor-patient dialogue datasets. Our analysis reveals that WER and a comprehensive suite of existing metrics correlate poorly with the clinician-assigned risk labels (No, Minimal, or Significant Impact). To bridge this evaluation gap, we introduce an LLM-as-a-Judge, programmatically optimized using GEPA to replicate expert clinical assessment. The optimized judge (Gemini-2.5-Pro) achieves human-comparable performance, obtaining 90% accuracy and a strong Cohen's $κ$ of 0.816. This work provides a validated, automated framework for moving ASR evaluation beyond simple textual fidelity to a necessary, scalable assessment of safety in clinical dialogue.
☆ The Oracle and The Prism: A Decoupled and Efficient Framework for Generative Recommendation Explanation
The integration of Large Language Models (LLMs) into explainable recommendation systems often leads to a performance-efficiency trade-off in end-to-end architectures, where joint optimization of ranking and explanation can result in suboptimal compromises. To resolve this, we propose Prism, a novel decoupled framework that rigorously separates the recommendation process into a dedicated ranking stage and an explanation generation stage. Inspired by knowledge distillation, Prism leverages a powerful teacher LLM (e.g., FLAN-T5-XXL) as an Oracle to produce high-fidelity explanatory knowledge. A compact, fine-tuned student model (e.g., BART-Base), the Prism, then specializes in synthesizing this knowledge into personalized explanations. This decomposition ensures that each component is optimized for its specific objective, eliminating inherent conflicts in coupled models. Extensive experiments on benchmark datasets demonstrate that our 140M-parameter Prism model significantly outperforms its 11B-parameter teacher in human evaluations of faithfulness and personalization, while achieving a 24 times speedup and a 10 times reduction in memory consumption during inference. These results validate that decoupling, coupled with targeted distillation, provides an efficient and effective pathway to high-quality explainable recommendation.
comment: 11 pages,3 figures
☆ Beyond Tokens in Language Models: Interpreting Activations through Text Genre Chunks
Understanding Large Language Models (LLMs) is key to ensure their safe and beneficial deployment. This task is complicated by the difficulty of interpretability of LLM structures, and the inability to have all their outputs human-evaluated. In this paper, we present the first step towards a predictive framework, where the genre of a text used to prompt an LLM, is predicted based on its activations. Using Mistral-7B and two datasets, we show that genre can be extracted with F1-scores of up to 98% and 71% using scikit-learn classifiers. Across both datasets, results consistently outperform the control task, providing a proof of concept that text genres can be inferred from LLMs with shallow learning models.
comment: 13 pages, 5 figures
☆ TurkColBERT: A Benchmark of Dense and Late-Interaction Models for Turkish Information Retrieval
Neural information retrieval systems excel in high-resource languages but remain underexplored for morphologically rich, lower-resource languages such as Turkish. Dense bi-encoders currently dominate Turkish IR, yet late-interaction models -- which retain token-level representations for fine-grained matching -- have not been systematically evaluated. We introduce TurkColBERT, the first comprehensive benchmark comparing dense encoders and late-interaction models for Turkish retrieval. Our two-stage adaptation pipeline fine-tunes English and multilingual encoders on Turkish NLI/STS tasks, then converts them into ColBERT-style retrievers using PyLate trained on MS MARCO-TR. We evaluate 10 models across five Turkish BEIR datasets covering scientific, financial, and argumentative domains. Results show strong parameter efficiency: the 1.0M-parameter colbert-hash-nano-tr is 600$\times$ smaller than the 600M turkish-e5-large dense encoder while preserving over 71\% of its average mAP. Late-interaction models that are 3--5$\times$ smaller than dense encoders significantly outperform them; ColmmBERT-base-TR yields up to +13.8\% mAP on domain-specific tasks. For production-readiness, we compare indexing algorithms: MUVERA+Rerank is 3.33$\times$ faster than PLAID and offers +1.7\% relative mAP gain. This enables low-latency retrieval, with ColmmBERT-base-TR achieving 0.54 ms query times under MUVERA. We release all checkpoints, configs, and evaluation scripts. Limitations include reliance on moderately sized datasets ($\leq$50K documents) and translated benchmarks, which may not fully reflect real-world Turkish retrieval conditions; larger-scale MUVERA evaluations remain necessary.
☆ MiMo-Embodied: X-Embodied Foundation Model Technical Report
We open-source MiMo-Embodied, the first cross-embodied foundation model to successfully integrate and achieve state-of-the-art performance in both Autonomous Driving and Embodied AI. MiMo-Embodied sets new records across 17 embodied AI benchmarks in Task Planning, Affordance Prediction and Spatial Understanding, while also excelling in 12 autonomous driving benchmarks across Environmental Perception, Status Prediction, and Driving Planning. Across these tasks, MiMo-Embodied significantly outperforms existing open-source, closed-source, and specialized baselines. Our results indicate that through multi-stage learning, curated data construction, and CoT/RL fine-tuning, these two domains exhibit strong positive transfer and mutually reinforce one another. We provide a detailed analysis of our model design and training methodologies to facilitate further research. Code and models are available at https://github.com/XiaomiMiMo/MiMo-Embodied.
comment: Code: https://github.com/XiaomiMiMo/MiMo-Embodied Model: https://huggingface.co/XiaomiMiMo/MiMo-Embodied-7B
☆ Music Recommendation with Large Language Models: Challenges, Opportunities, and Evaluation
Music Recommender Systems (MRS) have long relied on an information-retrieval framing, where progress is measured mainly through accuracy on retrieval-oriented subtasks. While effective, this reductionist paradigm struggles to address the deeper question of what makes a good recommendation, and attempts to broaden evaluation, through user studies or fairness analyses, have had limited impact. The emergence of Large Language Models (LLMs) disrupts this framework: LLMs are generative rather than ranking-based, making standard accuracy metrics questionable. They also introduce challenges such as hallucinations, knowledge cutoffs, non-determinism, and opaque training data, rendering traditional train/test protocols difficult to interpret. At the same time, LLMs create new opportunities, enabling natural-language interaction and even allowing models to act as evaluators. This work argues that the shift toward LLM-driven MRS requires rethinking evaluation. We first review how LLMs reshape user modeling, item modeling, and natural-language recommendation in music. We then examine evaluation practices from NLP, highlighting methodologies and open challenges relevant to MRS. Finally, we synthesize insights-focusing on how LLM prompting applies to MRS, to outline a structured set of success and risk dimensions. Our goal is to provide the MRS community with an updated, pedagogical, and cross-disciplinary perspective on evaluation.
comment: Under review with the ACM Transactions on Recommender Systems (TORS)
☆ Arctic-Extract Technical Report
Arctic-Extract is a state-of-the-art model designed for extracting structural data (question answering, entities and tables) from scanned or digital-born business documents. Despite its SoTA capabilities, the model is deployable on resource-constrained hardware, weighting only 6.6 GiB, making it suitable for deployment on devices with limited resources, such as A10 GPUs with 24 GB of memory. Arctic-Extract can process up to 125 A4 pages on those GPUs, making suitable for long document processing. This paper highlights Arctic-Extract's training protocols and evaluation results, demonstrating its strong performance in document understanding.
☆ Anatomy of an Idiom: Tracing Non-Compositionality in Language Models
We investigate the processing of idiomatic expressions in transformer-based language models using a novel set of techniques for circuit discovery and analysis. First discovering circuits via a modified path patching algorithm, we find that idiom processing exhibits distinct computational patterns. We identify and investigate ``Idiom Heads,'' attention heads that frequently activate across different idioms, as well as enhanced attention between idiom tokens due to earlier processing, which we term ``augmented reception.'' We analyze these phenomena and the general features of the discovered circuits as mechanisms by which transformers balance computational efficiency and robustness. Finally, these findings provide insights into how transformers handle non-compositional language and suggest pathways for understanding the processing of more complex grammatical constructions.
☆ ESGBench: A Benchmark for Explainable ESG Question Answering in Corporate Sustainability Reports
We present ESGBench, a benchmark dataset and evaluation framework designed to assess explainable ESG question answering systems using corporate sustainability reports. The benchmark consists of domain-grounded questions across multiple ESG themes, paired with human-curated answers and supporting evidence to enable fine-grained evaluation of model reasoning. We analyze the performance of state-of-the-art LLMs on ESGBench, highlighting key challenges in factual consistency, traceability, and domain alignment. ESGBench aims to accelerate research in transparent and accountable ESG-focused AI systems.
comment: Workshop paper accepted at AI4DF 2025 (part of ACM ICAIF 2025). 3 pages including tables and figures
☆ TOFA: Training-Free One-Shot Federated Adaptation for Vision-Language Models AAAI 2026
Efficient and lightweight adaptation of pre-trained Vision-Language Models (VLMs) to downstream tasks through collaborative interactions between local clients and a central server is a rapidly emerging research topic in federated learning. Existing adaptation algorithms are typically trained iteratively, which incur significant communication costs and increase the susceptibility to potential attacks. Motivated by the one-shot federated training techniques that reduce client-server exchanges to a single round, developing a lightweight one-shot federated VLM adaptation method to alleviate these issues is particularly attractive. However, current one-shot approaches face certain challenges in adapting VLMs within federated settings: (1) insufficient exploitation of the rich multimodal information inherent in VLMs; (2) lack of specialized adaptation strategies to systematically handle the severe data heterogeneity; and (3) requiring additional training resource of clients or server. To bridge these gaps, we propose a novel Training-free One-shot Federated Adaptation framework for VLMs, named TOFA. To fully leverage the generalizable multimodal features in pre-trained VLMs, TOFA employs both visual and textual pipelines to extract task-relevant representations. In the visual pipeline, a hierarchical Bayesian model learns personalized, class-specific prototype distributions. For the textual pipeline, TOFA evaluates and globally aligns the generated local text prompts for robustness. An adaptive weight calibration mechanism is also introduced to combine predictions from both modalities, balancing personalization and robustness to handle data heterogeneity. Our method is training-free, not relying on additional training resources on either the client or server side. Extensive experiments across 9 datasets in various federated settings demonstrate the effectiveness of the proposed TOFA method.
comment: Accepted by AAAI 2026
☆ Classification of worldwide news articles by perceived quality, 2018-2024
This study explored whether supervised machine learning and deep learning models can effectively distinguish perceived lower-quality news articles from perceived higher-quality news articles. 3 machine learning classifiers and 3 deep learning models were assessed using a newly created dataset of 1,412,272 English news articles from the Common Crawl over 2018-2024. Expert consensus ratings on 579 source websites were split at the median, creating perceived low and high-quality classes of about 706,000 articles each, with 194 linguistic features per website-level labelled article. Traditional machine learning classifiers such as the Random Forest demonstrated capable performance (0.7355 accuracy, 0.8131 ROC AUC). For deep learning, ModernBERT-large (256 context length) achieved the best performance (0.8744 accuracy; 0.9593 ROC-AUC; 0.8739 F1), followed by DistilBERT-base (512 context length) at 0.8685 accuracy and 0.9554 ROC-AUC. DistilBERT-base (256 context length) reached 0.8478 accuracy and 0.9407 ROC-AUC, while ModernBERT-base (256 context length) attained 0.8569 accuracy and 0.9470 ROC-AUC. These results suggest that the perceived quality of worldwide news articles can be effectively differentiated by traditional CPU-based machine learning classifiers and deep learning classifiers.
☆ AICC: Parse HTML Finer, Make Models Better -- A 7.3T AI-Ready Corpus Built by a Model-Based HTML Parser
While web data quality is crucial for large language models, most curation efforts focus on filtering and deduplication,treating HTML-to-text extraction as a fixed pre-processing step. Existing web corpora rely on heuristic-based extractors like Trafilatura, which struggle to preserve document structure and frequently corrupt structured elements such as formulas, codes, and tables. We hypothesize that improving extraction quality can be as impactful as aggressive filtering strategies for downstream performance. We introduce MinerU-HTML, a novel extraction pipeline that reformulates content extraction as a sequence labeling problem solved by a 0.6B-parameter language model. Unlike text-density heuristics, MinerU-HTML leverages semantic understanding and employs a two-stage formatting pipeline that explicitly categorizes semantic elements before converting to Markdown. Crucially, its model-based approach is inherently scalable, whereas heuristic methods offer limited improvement pathways. On MainWebBench, our benchmark of 7,887 annotated web pages, MinerU-HTML achieves 81.8\% ROUGE-N F1 compared to Trafilatura's 63.6\%, with exceptional structured element preservation (90.9\% for code blocks, 94.0\% for formulas). Using MinerU-HTML, we construct AICC (AI-ready Common Crawl), a 7.3-trillion token multilingual corpus from two Common Crawl snapshots. In controlled pretraining experiments where AICC and Trafilatura-extracted TfCC undergo identical filtering, models trained on AICC (62B tokens) achieve 50.8\% average accuracy across 13 benchmarks, outperforming TfCC by 1.08pp-providing direct evidence that extraction quality significantly impacts model capabilities. AICC also surpasses RefinedWeb and FineWeb on key benchmarks. We publicly release MainWebBench, MinerU-HTML, and AICC, demonstrating that HTML extraction is a critical, often underestimated component of web corpus construction.
☆ Learning from Sufficient Rationales: Analysing the Relationship Between Explanation Faithfulness and Token-level Regularisation Strategies AACL 2025
Human explanations of natural language, rationales, form a tool to assess whether models learn a label for the right reasons or rely on dataset-specific shortcuts. Sufficiency is a common metric for estimating the informativeness of rationales, but it provides limited insight into the effects of rationale information on model performance. We address this limitation by relating sufficiency to two modelling paradigms: the ability of models to identify which tokens are part of the rationale (through token classification) and the ability of improving model performance by incorporating rationales in the input (through attention regularisation). We find that highly informative rationales are not likely to help classify the instance correctly. Sufficiency conversely captures the classification impact of the non-rationalised context, which interferes with rationale information in the same input. We also find that incorporating rationale information in model inputs can boost cross-domain classification, but results are inconsistent per task and model type. Finally, sufficiency and token classification appear to be unrelated. These results exemplify the complexity of rationales, showing that metrics capable of systematically capturing this type of information merit further investigation.
comment: Long paper accepted to the main conference of AACL 2025. Please cite the conference proceedings when available
☆ NLP Datasets for Idiom and Figurative Language Tasks
Idiomatic and figurative language form a large portion of colloquial speech and writing. With social media, this informal language has become more easily observable to people and trainers of large language models (LLMs) alike. While the advantage of large corpora seems like the solution to all machine learning and Natural Language Processing (NLP) problems, idioms and figurative language continue to elude LLMs. Finetuning approaches are proving to be optimal, but better and larger datasets can help narrow this gap even further. The datasets presented in this paper provide one answer, while offering a diverse set of categories on which to build new models and develop new approaches. A selection of recent idiom and figurative language datasets were used to acquire a combined idiom list, which was used to retrieve context sequences from a large corpus. One large-scale dataset of potential idiomatic and figurative language expressions and two additional human-annotated datasets of definite idiomatic and figurative language expressions were created to evaluate the baseline ability of pre-trained language models in handling figurative meaning through idiom recognition (detection) tasks. The resulting datasets were post-processed for model agnostic training compatibility, utilized in training, and evaluated on slot labeling and sequence tagging.
comment: 32 pages, 10 figures
☆ OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe
Recent advancements in large reasoning models have fueled growing interest in extending such capabilities to multimodal domains. However, despite notable progress in visual reasoning, the lack of transparent and reproducible data curation and training strategies remains a major barrier to scalable research. In this work, we introduce OpenMMReasoner, a fully transparent two-stage recipe for multimodal reasoning spanning supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, we construct an 874K-sample cold-start dataset with rigorous step-by-step validation, providing a strong foundation for reasoning capabilities. The subsequent RL stage leverages a 74K-sample dataset across diverse domains to further sharpen and stabilize these abilities, resulting in a more robust and efficient learning process. Extensive evaluations demonstrate that our training recipe not only surpasses strong baselines but also highlights the critical role of data quality and training design in shaping multimodal reasoning performance. Notably, our method achieves a 11.6% improvement over the Qwen2.5-VL-7B-Instruct baseline across nine multimodal reasoning benchmarks, establishing a solid empirical foundation for future large-scale multimodal reasoning research. We open-sourced all our codes, pipeline, and data at https://github.com/EvolvingLMMs-Lab/OpenMMReasoner.
☆ Incorporating Self-Rewriting into Large Language Model Reasoning Reinforcement AAAI 2026
Through reinforcement learning (RL) with outcome correctness rewards, large reasoning models (LRMs) with scaled inference computation have demonstrated substantial success on complex reasoning tasks. However, the one-sided reward, focused solely on final correctness, limits its ability to provide detailed supervision over internal reasoning process. This deficiency leads to suboptimal internal reasoning quality, manifesting as issues like over-thinking, under-thinking, redundant-thinking, and disordered-thinking. Inspired by the recent progress in LRM self-rewarding, we introduce self-rewriting framework, where a model rewrites its own reasoning texts, and subsequently learns from the rewritten reasoning to improve the internal thought process quality. For algorithm design, we propose a selective rewriting approach wherein only "simple" samples, defined by the model's consistent correctness, are rewritten, thereby preserving all original reward signals of GRPO. For practical implementation, we compile rewriting and vanilla generation within one single batch, maintaining the scalability of the RL algorithm and introducing only ~10% overhead. Extensive experiments on diverse tasks with different model sizes validate the effectiveness of self-rewriting. In terms of the accuracy-length tradeoff, the self-rewriting approach achieves improved accuracy (+0.6) with substantially shorter reasoning (-46%) even without explicit instructions in rewriting prompts to reduce reasoning length, outperforming existing strong baselines. In terms of internal reasoning quality, self-rewriting achieves significantly higher scores (+7.2) under the LLM-as-a-judge metric, successfully mitigating internal reasoning flaws.
comment: Accepted to AAAI 2026
☆ SDA: Steering-Driven Distribution Alignment for Open LLMs without Fine-Tuning
With the rapid advancement of large language models (LLMs), their deployment in real-world applications has become increasingly widespread. LLMs are expected to deliver robust performance across diverse tasks, user preferences, and practical scenarios. However, as demands grow, ensuring that LLMs produce responses aligned with human intent remains a foundational challenge. In particular, aligning model behavior effectively and efficiently during inference, without costly retraining or extensive supervision, is both a critical requirement and a non-trivial technical endeavor. To address the challenge, we propose SDA (Steering-Driven Distribution Alignment), a training-free and model-agnostic alignment framework designed for open-source LLMs. SDA dynamically redistributes model output probabilities based on user-defined alignment instructions, enhancing alignment between model behavior and human intents without fine-tuning. The method is lightweight, resource-efficient, and compatible with a wide range of open-source LLMs. It can function independently during inference or be integrated with training-based alignment strategies. Moreover, SDA supports personalized preference alignment, enabling flexible control over the model response behavior. Empirical results demonstrate that SDA consistently improves alignment performance across 8 open-source LLMs with varying scales and diverse origins, evaluated on three key alignment dimensions, helpfulness, harmlessness, and honesty (3H). Specifically, SDA achieves average gains of 64.4% in helpfulness, 30% in honesty and 11.5% in harmlessness across the tested models, indicating its effectiveness and generalization across diverse models and application scenarios.
☆ SeSE: A Structural Information-Guided Uncertainty Quantification Framework for Hallucination Detection in LLMs
Reliable uncertainty quantification (UQ) is essential for deploying large language models (LLMs) in safety-critical scenarios, as it enables them to abstain from responding when uncertain, thereby avoiding hallucinating falsehoods. However, state-of-the-art UQ methods primarily rely on semantic probability distributions or pairwise distances, overlooking latent semantic structural information that could enable more precise uncertainty estimates. This paper presents Semantic Structural Entropy (SeSE), a principled UQ framework that quantifies the inherent semantic uncertainty of LLMs from a structural information perspective for hallucination detection. Specifically, to effectively model semantic spaces, we first develop an adaptively sparsified directed semantic graph construction algorithm that captures directional semantic dependencies while automatically pruning unnecessary connections that introduce negative interference. We then exploit latent semantic structural information through hierarchical abstraction: SeSE is defined as the structural entropy of the optimal semantic encoding tree, formalizing intrinsic uncertainty within semantic spaces after optimal compression. A higher SeSE value corresponds to greater uncertainty, indicating that LLMs are highly likely to generate hallucinations. In addition, to enhance fine-grained UQ in long-form generation -- where existing methods often rely on heuristic sample-and-count techniques -- we extend SeSE to quantify the uncertainty of individual claims by modeling their random semantic interactions, providing theoretically explicable hallucination detection. Extensive experiments across 29 model-dataset combinations show that SeSE significantly outperforms advanced UQ baselines, including strong supervised methods and the recently proposed KLE.
comment: 14 pages of main text and 10 pages of appendices
☆ Can MLLMs Read the Room? A Multimodal Benchmark for Assessing Deception in Multi-Party Social Interactions
Despite their advanced reasoning capabilities, state-of-the-art Multimodal Large Language Models (MLLMs) demonstrably lack a core component of human intelligence: the ability to `read the room' and assess deception in complex social interactions. To rigorously quantify this failure, we introduce a new task, Multimodal Interactive Deception Assessment (MIDA), and present a novel multimodal dataset providing synchronized video and text with verifiable ground-truth labels for every statement. We establish a comprehensive benchmark evaluating 12 state-of-the-art open- and closed-source MLLMs, revealing a significant performance gap: even powerful models like GPT-4o struggle to distinguish truth from falsehood reliably. Our analysis of failure modes indicates that these models fail to effectively ground language in multimodal social cues and lack the ability to model what others know, believe, or intend, highlighting the urgent need for novel approaches to building more perceptive and trustworthy AI systems. To take a step forward, we design a Social Chain-of-Thought (SoCoT) reasoning pipeline and a Dynamic Social Epistemic Memory (DSEM) module. Our framework yields performance improvement on this challenging task, demonstrating a promising new path toward building MLLMs capable of genuine human-like social reasoning.
☆ PSM: Prompt Sensitivity Minimization via LLM-Guided Black-Box Optimization
System prompts are critical for guiding the behavior of Large Language Models (LLMs), yet they often contain proprietary logic or sensitive information, making them a prime target for extraction attacks. Adversarial queries can successfully elicit these hidden instructions, posing significant security and privacy risks. Existing defense mechanisms frequently rely on heuristics, incur substantial computational overhead, or are inapplicable to models accessed via black-box APIs. This paper introduces a novel framework for hardening system prompts through shield appending, a lightweight approach that adds a protective textual layer to the original prompt. Our core contribution is the formalization of prompt hardening as a utility-constrained optimization problem. We leverage an LLM-as-optimizer to search the space of possible SHIELDs, seeking to minimize a leakage metric derived from a suite of adversarial attacks, while simultaneously preserving task utility above a specified threshold, measured by semantic fidelity to baseline outputs. This black-box, optimization-driven methodology is lightweight and practical, requiring only API access to the target and optimizer LLMs. We demonstrate empirically that our optimized SHIELDs significantly reduce prompt leakage against a comprehensive set of extraction attacks, outperforming established baseline defenses without compromising the model's intended functionality. Our work presents a paradigm for developing robust, utility-aware defenses in the escalating landscape of LLM security. The code is made public on the following link: https://github.com/psm-defense/psm
☆ SemanticCite: Citation Verification with AI-Powered Full-Text Analysis and Evidence-Based Reasoning
Effective scientific communication depends on accurate citations that validate sources and guide readers to supporting evidence. Yet academic literature faces mounting challenges: semantic citation errors that misrepresent sources, AI-generated hallucinated references, and traditional citation formats that point to entire papers without indicating which sections substantiate specific claims. We introduce SemanticCite, an AI-powered system that verifies citation accuracy through full-text source analysis while providing rich contextual information via detailed reasoning and relevant text snippets. Our approach combines multiple retrieval methods with a four-class classification system (Supported, Partially Supported, Unsupported, Uncertain) that captures nuanced claim-source relationships and enables appropriate remedial actions for different error types. Our experiments show that fine-tuned lightweight language models achieve performance comparable to large commercial systems with significantly lower computational requirements, making large-scale citation verification practically feasible. The system provides transparent, evidence-based explanations that support user understanding and trust. We contribute a comprehensive dataset of over 1,000 citations with detailed alignments, functional classifications, semantic annotations, and bibliometric metadata across eight disciplines, alongside fine-tuned models and the complete verification framework as open-source software. SemanticCite addresses critical challenges in research integrity through scalable citation verification, streamlined peer review, and quality control for AI-generated content, providing an open-source foundation for maintaining citation accuracy at scale.
comment: 21 pages, 4 figures
☆ TS-PEFT: Token-Selective Parameter-Efficient Fine-Tuning with Learnable Threshold Gating
In the field of large models (LMs) for natural language processing (NLP) and computer vision (CV), Parameter-Efficient Fine-Tuning (PEFT) has emerged as a resource-efficient method that modifies a limited number of parameters while keeping the pretrained weights fixed. This paper investigates the traditional PEFT approach, which applies modifications to all position indices, and questions its necessity. We introduce a new paradigm called Token-Selective PEFT (TS-PEFT), in which a function S selectively applies PEFT modifications to a subset of position indices, potentially enhancing performance on downstream tasks. Our experimental results reveal that the indiscriminate application of PEFT to all indices is not only superfluous, but may also be counterproductive. This study offers a fresh perspective on PEFT, advocating for a more targeted approach to modifications and providing a framework for future research to optimize the fine-tuning process for large models.
comment: 11 pages, 3 figures
☆ ELPO: Ensemble Learning Based Prompt Optimization for Large Language Models
The remarkable performance of Large Language Models (LLMs) highly relies on crafted prompts. However, manual prompt engineering is a laborious process, creating a core bottleneck for practical application of LLMs. This phenomenon has led to the emergence of a new research area known as Automatic Prompt Optimization (APO), which develops rapidly in recent years. Existing APO methods such as those based on evolutionary algorithms or trial-and-error approaches realize an efficient and accurate prompt optimization to some extent. However, those researches focus on a single model or algorithm for the generation strategy and optimization process, which limits their performance when handling complex tasks. To address this, we propose a novel framework called Ensemble Learning based Prompt Optimization (ELPO) to achieve more accurate and robust results. Motivated by the idea of ensemble learning, ELPO conducts voting mechanism and introduces shared generation strategies along with different search methods for searching superior prompts. Moreover, ELPO creatively presents more efficient algorithms for the prompt generation and search process. Experimental results demonstrate that ELPO outperforms state-of-the-art prompt optimization methods across different tasks, e.g., improving F1 score by 7.6 on ArSarcasm dataset.
☆ Early science acceleration experiments with GPT-5
AI models like GPT-5 are an increasingly valuable tool for scientists, but many remain unaware of the capabilities of frontier AI. We present a collection of short case studies in which GPT-5 produced new, concrete steps in ongoing research across mathematics, physics, astronomy, computer science, biology, and materials science. In these examples, the authors highlight how AI accelerated their work, and where it fell short; where expert time was saved, and where human input was still key. We document the interactions of the human authors with GPT-5, as guiding examples of fruitful collaboration with AI. Of note, this paper includes four new results in mathematics (carefully verified by the human authors), underscoring how GPT-5 can help human mathematicians settle previously unsolved problems. These contributions are modest in scope but profound in implication, given the rate at which frontier AI is progressing.
comment: 89 pages
☆ Learning Tractable Distributions Of Language Model Continuations
Controlled language generation conditions text on sequence-level constraints (for example, syntax, style, or safety). These constraints may depend on future tokens, which makes directly conditioning an autoregressive language model (LM) generally intractable. Prior work uses tractable surrogates such as hidden Markov models (HMMs) to approximate the distribution over continuations and adjust the model's next-token logits at decoding time. However, we find that these surrogates are often weakly context aware, which reduces query quality. We propose Learning to Look Ahead (LTLA), a hybrid approach that pairs the same base language model for rich prefix encoding with a fixed tractable surrogate model that computes exact continuation probabilities. Two efficiency pitfalls arise when adding neural context: (i) naively rescoring the prefix with every candidate next token requires a sweep over the entire vocabulary at each step, and (ii) predicting fresh surrogate parameters for each prefix, although tractable at a single step, forces recomputation of future probabilities for every new prefix and eliminates reuse. LTLA avoids both by using a single batched HMM update to account for all next-token candidates at once, and by conditioning only the surrogate's latent state prior on the LM's hidden representations while keeping the surrogate decoder fixed, so computations can be reused across prefixes. Empirically, LTLA attains higher conditional likelihood than an unconditional HMM, approximates continuation distributions for vision-language models where a standalone HMM cannot encode visual context, and improves constraint satisfaction at comparable fluency on controlled-generation tasks, with minimal inference overhead.
☆ Liars' Bench: Evaluating Lie Detectors for Language Models
Prior work has introduced techniques for detecting when large language models (LLMs) lie, that is, generating statements they believe are false. However, these techniques are typically validated in narrow settings that do not capture the diverse lies LLMs can generate. We introduce LIARS' BENCH, a testbed consisting of 72,863 examples of lies and honest responses generated by four open-weight models across seven datasets. Our settings capture qualitatively different types of lies and vary along two dimensions: the model's reason for lying and the object of belief targeted by the lie. Evaluating three black- and white-box lie detection techniques on LIARS' BENCH, we find that existing techniques systematically fail to identify certain types of lies, especially in settings where it's not possible to determine whether the model lied from the transcript alone. Overall, LIARS' BENCH reveals limitations in prior techniques and provides a practical testbed for guiding progress in lie detection.
comment: *Kieron Kretschmar and Walter Laurito contributed equally to this work. 10 pages, 2 figures; plus appendix. Code at https://github.com/Cadenza-Labs/liars-bench and datasets at https://huggingface.co/datasets/Cadenza-Labs/liars-bench Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
☆ SpellForger: Prompting Custom Spell Properties In-Game using BERT supervised-trained model
Introduction: The application of Artificial Intelligence in games has evolved significantly, allowing for dynamic content generation. However, its use as a core gameplay co-creation tool remains underexplored. Objective: This paper proposes SpellForger, a game where players create custom spells by writing natural language prompts, aiming to provide a unique experience of personalization and creativity. Methodology: The system uses a supervisedtrained BERT model to interpret player prompts. This model maps textual descriptions to one of many spell prefabs and balances their parameters (damage, cost, effects) to ensure competitive integrity. The game is developed in the Unity Game Engine, and the AI backend is in Python. Expected Results: We expect to deliver a functional prototype that demonstrates the generation of spells in real time, applied to an engaging gameplay loop, where player creativity is central to the experience, validating the use of AI as a direct gameplay mechanic.
comment: Published in Anais Estendidos do XXIV Simpósio Brasileiro de Jogos e Entretenimento Digital (SBGames 2025)
☆ QueryGym: A Toolkit for Reproducible LLM-Based Query Reformulation
We present QueryGym, a lightweight, extensible Python toolkit that supports large language model (LLM)-based query reformulation. This is an important tool development since recent work on llm-based query reformulation has shown notable increase in retrieval effectiveness. However, while different authors have sporadically shared the implementation of their methods, there is no unified toolkit that provides a consistent implementation of such methods, which hinders fair comparison, rapid experimentation, consistent benchmarking and reliable deployment. QueryGym addresses this gap by providing a unified framework for implementing, executing, and comparing llm-based reformulation methods. The toolkit offers: (1) a Python API for applying diverse LLM-based methods, (2) a retrieval-agnostic interface supporting integration with backends such as Pyserini and PyTerrier, (3) a centralized prompt management system with versioning and metadata tracking, (4) built-in support for benchmarks like BEIR and MS MARCO, and (5) a completely open-source extensible implementation available to all researchers. QueryGym is publicly available at https://github.com/radinhamidi/QueryGym.
comment: 4 pages
☆ CARE-RAG - Clinical Assessment and Reasoning in RAG
Access to the right evidence does not guarantee that large language models (LLMs) will reason with it correctly. This gap between retrieval and reasoning is especially concerning in clinical settings, where outputs must align with structured protocols. We study this gap using Written Exposure Therapy (WET) guidelines as a testbed. In evaluating model responses to curated clinician-vetted questions, we find that errors persist even when authoritative passages are provided. To address this, we propose an evaluation framework that measures accuracy, consistency, and fidelity of reasoning. Our results highlight both the potential and the risks: retrieval-augmented generation (RAG) can constrain outputs, but safe deployment requires assessing reasoning as rigorously as retrieval.
comment: The Second Workshop on GenAI for Health: Potential, Trust, and Policy Compliance
☆ TOD-ProcBench: Benchmarking Complex Instruction-Following in Task-Oriented Dialogues
In real-world task-oriented dialogue (TOD) settings, agents are required to strictly adhere to complex instructions while conducting multi-turn conversations with customers. These instructions are typically presented in natural language format and include general guidelines and step-by-step procedures with complex constraints. Existing TOD benchmarks often oversimplify the complex nature of these instructions by reducing them to simple schemas composed of intents, slots, and API call configurations. To address this gap and systematically benchmark LLMs' instruction-following capabilities, we propose TOD-ProcBench, a challenging benchmark featuring complex process instructions with intricate, fine-grained constraints that evaluates various LLMs' abilities to understand and follow instructions in multi-turn TODs. Our benchmark dataset comprises instruction documents derived from the high-quality ABCD dataset with corresponding conversations under human quality control. We formulate fine-grained constraints and action procedures as multi-level condition-action instruction statements. We design three tasks to comprehensively benchmark LLMs' complex instruction-following capabilities in multi-turn TODs. Task 1 evaluates how LLMs retrieve the most relevant statement from a complex instruction and predict the corresponding next action. In Task 2, we synthesize instruction-violating responses by injecting inconsistencies and manipulating the original instructions, and then we analyze how effectively LLMs can identify instruction-violating responses. Task 3 investigates LLMs' abilities in conditional generation of instruction-following responses based on the original complex instructions. Additionally, we conduct studies on the impact of multilingual settings and different instruction text formats on compliance performance. We release our benchmark under the Llama 3.3 Community License Agreement.
☆ JudgeBoard: Benchmarking and Enhancing Small Language Models for Reasoning Evaluation
While small language models (SLMs) have shown promise on various reasoning tasks, their ability to judge the correctness of answers remains unclear compared to large language models (LLMs). Prior work on LLM-as-a-judge frameworks typically relies on comparing candidate answers against ground-truth labels or other candidate answers using predefined metrics like entailment. However, this approach is inherently indirect and difficult to fully automate, offering limited support for fine-grained and scalable evaluation of reasoning outputs. In this work, we propose JudgeBoard, a novel evaluation pipeline that directly queries models to assess the correctness of candidate answers without requiring extra answer comparisons. We focus on two core reasoning domains: mathematical reasoning and science/commonsense reasoning, and construct task-specific evaluation leaderboards using both accuracy-based ranking and an Elo-based rating system across five benchmark datasets, enabling consistent model comparison as judges rather than comparators. To improve judgment performance in lightweight models, we propose MAJ (Multi-Agent Judging), a novel multi-agent evaluation framework that leverages multiple interacting SLMs with distinct reasoning profiles to approximate LLM-level judgment accuracy through collaborative deliberation. Experimental results reveal a significant performance gap between SLMs and LLMs in isolated judging tasks. However, our MAJ framework substantially improves the reliability and consistency of SLMs. On the MATH dataset, MAJ using smaller-sized models as backbones performs comparatively well or even better than their larger-sized counterparts. Our findings highlight that multi-agent SLM systems can potentially match or exceed LLM performance in judgment tasks, with implications for scalable and efficient assessment.
comment: 23 pages, 4 figures
♻ ☆ LLMInit: A Free Lunch from Large Language Models for Selective Initialization of Recommendation EMNLP 2025
Collaborative filtering (CF) is widely adopted in industrial recommender systems (RecSys) for modeling user-item interactions across numerous applications, but often struggles with cold-start and data-sparse scenarios. Recent advancements in pre-trained large language models (LLMs) with rich semantic knowledge, offer promising solutions to these challenges. However, deploying LLMs at scale is hindered by their significant computational demands and latency. In this paper, we propose a novel and scalable LLM-RecSys framework, LLMInit, designed to integrate pretrained LLM embeddings into CF models through selective initialization strategies. Specifically, we identify the embedding collapse issue observed when CF models scale and match the large embedding sizes in LLMs and avoid the problem by introducing efficient sampling methods, including, random, uniform, and variance-based selections. Comprehensive experiments conducted on multiple real-world datasets demonstrate that LLMInit significantly improves recommendation performance while maintaining low computational costs, offering a practical and scalable solution for industrial applications. To facilitate industry adoption and promote future research, we provide open-source access to our implementation at https://github.com/DavidZWZ/LLMInit.
comment: Accepted in EMNLP 2025 Industry Track
♻ ☆ Sigma: Semantically Informative Pre-training for Skeleton-based Sign Language Understanding
Pre-training has proven effective for learning transferable features in sign language understanding (SLU) tasks. Recently, skeleton-based methods have gained increasing attention because they can robustly handle variations in subjects and backgrounds without being affected by appearance or environmental factors. Current SLU methods continue to face three key limitations: 1) weak semantic grounding, as models often capture low-level motion patterns from skeletal data but struggle to relate them to linguistic meaning; 2) imbalance between local details and global context, with models either focusing too narrowly on fine-grained cues or overlooking them for broader context; and 3) inefficient cross-modal learning, as constructing semantically aligned representations across modalities remains difficult. To address these, we propose Sigma, a unified skeleton-based SLU framework featuring: 1) a sign-aware early fusion mechanism that facilitates deep interaction between visual and textual modalities, enriching visual features with linguistic context; 2) a hierarchical alignment learning strategy that jointly maximises agreements across different levels of paired features from different modalities, effectively capturing both fine-grained details and high-level semantic relationships; and 3) a unified pre-training framework that combines contrastive learning, text matching and language modelling to promote semantic consistency and generalisation. Sigma achieves new state-of-the-art results on isolated sign language recognition, continuous sign language recognition, and gloss-free sign language translation on multiple benchmarks spanning different sign and spoken languages, demonstrating the impact of semantically informative pre-training and the effectiveness of skeletal data as a stand-alone solution for SLU.
♻ ☆ Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs ICLR 2025
Large Language Models (LLMs) generate text by sampling the next token from a probability distribution over the vocabulary at each decoding step. Popular sampling methods like top-p (nucleus sampling) often struggle to balance quality and diversity, especially at higher temperatures which lead to incoherent or repetitive outputs. We propose min-p sampling, a dynamic truncation method that adjusts the sampling threshold based on the model's confidence by using the top token's probability as a scaling factor. Our experiments on benchmarks including GPQA, GSM8K, and AlpacaEval Creative Writing show that min-p sampling improves both the quality and diversity of generated text across different model families (Mistral and Llama 3) and model sizes (1B to 123B parameters), especially at higher temperatures. Human evaluations further show a clear preference for min-p sampling, in both text quality and creativity. Min-p sampling has been adopted by popular open-source LLM frameworks, including Hugging Face Transformers, VLLM, and many others, highlighting its considerable impact on improving text generation quality.
comment: Oral presentation at ICLR 2025. Camera-ready version available at https://iclr.cc/virtual/2025/poster/30358
♻ ☆ Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark
While large language models (LLMs) with reasoning capabilities are progressing rapidly on high-school math competitions and coding, can they reason effectively through complex, open-ended challenges found in frontier physics research? And crucially, what kinds of reasoning tasks do physicists want LLMs to assist with? To address these questions, we present the CritPt (Complex Research using Integrated Thinking - Physics Test, pronounced "critical point"), the first benchmark designed to test LLMs on unpublished, research-level reasoning tasks that broadly covers modern physics research areas, including condensed matter, quantum physics, atomic, molecular & optical physics, astrophysics, high energy physics, mathematical physics, statistical physics, nuclear physics, nonlinear dynamics, fluid dynamics and biophysics. CritPt consists of 71 composite research challenges designed to simulate full-scale research projects at the entry level, which are also decomposed to 190 simpler checkpoint tasks for more fine-grained insights. All problems are newly created by 50+ active physics researchers based on their own research. Every problem is hand-curated to admit a guess-resistant and machine-verifiable answer and is evaluated by an automated grading pipeline heavily customized for advanced physics-specific output formats. We find that while current state-of-the-art LLMs show early promise on isolated checkpoints, they remain far from being able to reliably solve full research-scale challenges: the best average accuracy among base models is only 5.7%, achieved by GPT-5 (high), moderately rising to around 10% when equipped with coding tools. Through the realistic yet standardized evaluation offered by CritPt, we highlight a large disconnect between current model capabilities and realistic physics research demands, offering a foundation to guide the development of scientifically grounded AI tools.
comment: 39 pages, 6 figures, 6 tables
♻ ☆ False Sense of Security: Why Probing-based Malicious Input Detection Fails to Generalize
Large Language Models (LLMs) can comply with harmful instructions, raising serious safety concerns despite their impressive capabilities. Recent work has leveraged probing-based approaches to study the separability of malicious and benign inputs in LLMs' internal representations, and researchers have proposed using such probing methods for safety detection. We systematically re-examine this paradigm. Motivated by poor out-of-distribution performance, we hypothesize that probes learn superficial patterns rather than semantic harmfulness. Through controlled experiments, we confirm this hypothesis and identify the specific patterns learned: instructional patterns and trigger words. Our investigation follows a systematic approach, progressing from demonstrating comparable performance of simple n-gram methods, to controlled experiments with semantically cleaned datasets, to detailed analysis of pattern dependencies. These results reveal a false sense of security around current probing-based approaches and highlight the need to redesign both models and evaluation protocols, for which we provide further discussions in the hope of suggesting responsible further research in this direction. We have open-sourced the project at https://github.com/WangCheng0116/Why-Probe-Fails.
comment: Withdrawn due to identified errors in the experimental procedure
♻ ☆ AgentSwift: Efficient LLM Agent Design via Value-guided Hierarchical Search AAAI-2026
Large language model (LLM) agents have demonstrated strong capabilities across diverse domains, yet automated agent design remains a significant challenge. Current automated agent design approaches are often constrained by limited search spaces that primarily optimize workflows but fail to integrate crucial human-designed components like memory, planning, and tool use. Furthermore, these methods are hampered by high evaluation costs, as evaluating even a single new agent on a benchmark can require tens of dollars. The difficulty of this exploration is further exacerbated by inefficient search strategies that struggle to navigate the large design space effectively, making the discovery of novel agents a slow and resource-intensive process. To address these challenges, we propose AgentSwift, a novel framework for automated agent design. We formalize a hierarchical search space that jointly models agentic workflow and composable functional components. This structure moves beyond optimizing workflows alone by co-optimizing functional components, which enables the discovery of more complex and effective agent architectures. To make exploration within this expansive space feasible, we mitigate high evaluation costs by training a value model on a high-quality dataset, generated via a novel strategy combining combinatorial coverage and balanced Bayesian sampling for low-cost evaluation. Guiding the entire process is a hierarchical MCTS strategy, which is informed by uncertainty to efficiently navigate the search space. Evaluated across a comprehensive set of seven benchmarks spanning embodied, math, web, tool, and game domains, AgentSwift discovers agents that achieve an average performance gain of 8.34\% over both existing automated agent search methods and manually designed agents. Our framework serves as a launchpad for researchers to rapidly discover powerful agent architectures.
comment: AAAI-2026
♻ ☆ KVTuner: Sensitivity-Aware Layer-Wise Mixed-Precision KV Cache Quantization for Efficient and Nearly Lossless LLM Inference ICML25
KV cache quantization can improve Large Language Models (LLMs) inference throughput and latency in long contexts and large batch-size scenarios while preserving LLMs effectiveness. However, current methods have three unsolved issues: overlooking layer-wise sensitivity to KV cache quantization, high overhead of online fine-grained decision-making, and low flexibility to different LLMs and constraints. Therefore, we theoretically analyze the inherent correlation of layer-wise transformer attention patterns to KV cache quantization errors and study why key cache is generally more important than value cache for quantization error reduction. We further propose a simple yet effective framework KVTuner to adaptively search for the optimal hardware-friendly layer-wise KV quantization precision pairs for coarse-grained KV cache with multi-objective optimization and directly utilize the offline searched configurations during online inference. To reduce the computational cost of offline calibration, we utilize the intra-layer KV precision pair pruning and inter-layer clustering to reduce the search space. Experimental results show that we can achieve nearly lossless 3.25-bit mixed precision KV cache quantization for LLMs like Llama-3.1-8B-Instruct and 4.0-bit for sensitive models like Qwen2.5-7B-Instruct on mathematical reasoning tasks. The maximum inference throughput can be improved by 21.25\% compared with KIVI-KV8 quantization over various context lengths. Our code and searched configurations are available at https://github.com/cmd2001/KVTuner.
comment: Accepted by ICML25. Code: https://github.com/cmd2001/KVTuner
♻ ☆ Crowdsourcing Lexical Diversity
Lexical-semantic resources (LSRs), such as online lexicons and wordnets, are fundamental to natural language processing applications as well as to fields such as linguistic anthropology and language preservation. In many languages, however, such resources suffer from quality issues: incorrect entries, incompleteness, but also the rarely addressed issue of bias towards the English language and Anglo-Saxon culture. Such bias manifests itself in the absence of concepts specific to the language or culture at hand, the presence of foreign (Anglo-Saxon) concepts, as well as in the lack of an explicit indication of untranslatability, also known as cross-lingual lexical gaps, when a term has no equivalent in another language. This paper proposes a novel crowdsourcing methodology for reducing bias in LSRs. Crowd workers compare lexemes from two languages, focusing on domains rich in lexical diversity, such as kinship or food. Our LingoGap crowdsourcing platform facilitates comparisons through microtasks identifying equivalent terms, language-specific terms, and lexical gaps across languages. We validated our method by applying it to two case studies focused on food-related terminology: (1) English and Arabic, and (2) Standard Indonesian and Banjarese. These experiments identified 2,140 lexical gaps in the first case study and 951 in the second. The success of these experiments confirmed the usability of our method and tool for future large-scale lexicon enrichment tasks.
♻ ☆ Arg-LLaDA: Argument Summarization via Large Language Diffusion Models and Sufficiency-Aware Refinement
Argument summarization aims to generate concise, structured representations of complex, multi-perspective debates. While recent work has advanced the identification and clustering of argumentative components, the generation stage remains underexplored. Existing approaches typically rely on single-pass generation, offering limited support for factual correction or structural refinement. To address this gap, we introduce Arg-LLaDA, a novel large language diffusion framework that iteratively improves summaries via sufficiency-guided remasking and regeneration. Our method combines a flexible masking controller with a sufficiency-checking module to identify and revise unsupported, redundant, or incomplete spans, yielding more faithful, concise, and coherent outputs. Empirical results on two benchmark datasets demonstrate that Arg-LLaDA surpasses state-of-the-art baselines in 7 out of 10 automatic evaluation metrics. In addition, human evaluations reveal substantial improvements across core dimensions, coverage, faithfulness, and conciseness, validating the effectiveness of our iterative, sufficiency-aware generation strategy.
comment: Preprint
♻ ☆ Eliciting Reasoning in Language Models with Cognitive Tools
The recent advent of reasoning models like OpenAI's o1 was met with excited speculation by the AI community about the mechanisms underlying these capabilities in closed models, followed by a rush of replication efforts, particularly from the open source community. These speculations were largely settled by the demonstration from DeepSeek-R1 that chains-of-thought and reinforcement learning (RL) can effectively replicate reasoning on top of base LLMs. However, it remains valuable to explore alternative methods for theoretically eliciting reasoning that could help elucidate the underlying mechanisms, as well as providing additional methods that may offer complementary benefits. Here, we build on the long-standing literature in cognitive psychology and cognitive architectures, which postulates that reasoning arises from the orchestrated, sequential execution of a set of modular, predetermined cognitive operations. Crucially, we implement this key idea within a modern agentic tool-calling framework. In particular, we endow an LLM with a small set of "cognitive tools" encapsulating specific reasoning operations, each executed by the LLM itself. Surprisingly, this simple strategy results in considerable gains in performance on standard mathematical reasoning benchmarks compared to base LLMs, for both closed and open-weight models. For instance, providing our "cognitive tools" to GPT-4.1 increases its pass@1 performance on AIME2024 from 32% to 53%, even surpassing the performance of o1-preview. In addition to its practical implications, this demonstration contributes to the debate regarding the role of post-training methods in eliciting reasoning in LLMs versus the role of inherent capabilities acquired during pre-training, and whether post-training merely uncovers these latent abilities.
comment: 25 pages, 2 figures
♻ ☆ AutoJudge: Judge Decoding Without Manual Annotation NeurIPS 2025
We introduce AutoJudge, a method that accelerates large language model (LLM) inference with task-specific lossy speculative decoding. Instead of matching the original model output distribution token-by-token, we identify which of the generated tokens affect the downstream quality of the response, relaxing the distribution match guarantee so that the "unimportant" tokens can be generated faster. Our approach relies on a semi-greedy search algorithm to test which of the mismatches between target and draft models should be corrected to preserve quality and which ones may be skipped. We then train a lightweight classifier based on existing LLM embeddings to predict, at inference time, which mismatching tokens can be safely accepted without compromising the final answer quality. We evaluate the effectiveness of AutoJudge with multiple draft/target model pairs on mathematical reasoning and programming benchmarks, achieving significant speedups at the cost of a minor accuracy reduction. Notably, on GSM8k with the Llama 3.1 70B target model, our approach achieves up to $\approx2\times$ speedup over speculative decoding at the cost of $\le 1\%$ drop in accuracy. When applied to the LiveCodeBench benchmark, AutoJudge automatically detects programming-specific important tokens, accepting $\ge 25$ tokens per speculation cycle at $2\%$ drop in Pass@1. Our approach requires no human annotation and is easy to integrate with modern LLM inference frameworks.
comment: Accepted at NeurIPS 2025
♻ ☆ One Pic is All it Takes: Poisoning Visual Document Retrieval Augmented Generation with a Single Image
Retrieval-augmented generation (RAG) is instrumental for inhibiting hallucinations in large language models (LLMs) through the use of a factual knowledge base (KB). Although PDF documents are prominent sources of knowledge, text-based RAG pipelines are ineffective at capturing their rich multi-modal information. In contrast, visual document RAG (VD-RAG) uses screenshots of document pages as the KB, which has been shown to achieve state-of-the-art results. However, by introducing the image modality, VD-RAG introduces new attack vectors for adversaries to disrupt the system by injecting malicious documents into the KB. In this paper, we demonstrate the vulnerability of VD-RAG to poisoning attacks targeting both retrieval and generation. We define two attack objectives and demonstrate that both can be realized by injecting only a single adversarial image into the KB. Firstly, we introduce a targeted attack against one or a group of queries with the goal of spreading targeted disinformation. Secondly, we present a universal attack that, for any potential user query, influences the response to cause a denial-of-service in the VD-RAG system. We investigate the two attack objectives under both white-box and black-box assumptions, employing a multi-objective gradient-based optimization approach as well as prompting state-of-the-art generative models. Using two visual document datasets, a diverse set of state-of-the-art retrievers (embedding models) and generators (vision language models), we show VD-RAG is vulnerable to poisoning attacks in both the targeted and universal settings, yet demonstrating robustness to black-box attacks in the universal setting.
♻ ☆ Co-Reinforcement Learning for Unified Multimodal Understanding and Generation NeurIPS 2025
This paper presents a pioneering exploration of reinforcement learning (RL) via group relative policy optimization for unified multimodal large language models (ULMs), aimed at simultaneously reinforcing generation and understanding capabilities. Through systematic pilot studies, we uncover the significant potential of ULMs to enable the synergistic co-evolution of dual capabilities within a shared policy optimization framework. Building on this insight, we introduce CoRL, a co-reinforcement learning framework comprising a unified RL stage for joint optimization and a refined RL stage for task-specific enhancement. With the proposed CoRL, our resulting model, ULM-R1, achieves average improvements of 7% on three text-to-image generation datasets and 23% on nine multimodal understanding benchmarks. These results demonstrate the effectiveness of CoRL and highlight the substantial benefit of reinforcement learning in facilitating cross-task synergy and optimization for ULMs. Code is available at https://github.com/mm-vl/ULM-R1.
comment: NeurIPS 2025
♻ ☆ CoBA: Counterbias Text Augmentation for Mitigating Various Spurious Correlations via Semantic Triples EMNLP 2025
Deep learning models often learn and exploit spurious correlations in training data, using these non-target features to inform their predictions. Such reliance leads to performance degradation and poor generalization on unseen data. To address these limitations, we introduce a more general form of counterfactual data augmentation, termed counterbias data augmentation, which simultaneously tackles multiple biases (e.g., gender bias, simplicity bias) and enhances out-of-distribution robustness. We present CoBA: CounterBias Augmentation, a unified framework that operates at the semantic triple level: first decomposing text into subject-predicate-object triples, then selectively modifying these triples to disrupt spurious correlations. By reconstructing the text from these adjusted triples, CoBA generates counterbias data that mitigates spurious patterns. Through extensive experiments, we demonstrate that CoBA not only improves downstream task performance, but also effectively reduces biases and strengthens out-of-distribution resilience, offering a versatile and robust solution to the challenges posed by spurious correlations.
comment: Accepted at EMNLP 2025
♻ ☆ TabDistill: Distilling Transformers into Neural Nets for Few-Shot Tabular Classification
Transformer-based models have shown promising performance on tabular data compared to their classical counterparts such as neural networks and Gradient Boosted Decision Trees (GBDTs) in scenarios with limited training data. They utilize their pre-trained knowledge to adapt to new domains, achieving commendable performance with only a few training examples, also called the few-shot regime. However, the performance gain in the few-shot regime comes at the expense of significantly increased complexity and number of parameters. To circumvent this trade-off, we introduce TabDistill, a new strategy to distill the pre-trained knowledge in complex transformer-based models into simpler neural networks for effectively classifying tabular data. Our framework yields the best of both worlds: being parameter-efficient while performing well with limited training data. The distilled neural networks surpass classical baselines such as regular neural networks, XGBoost and logistic regression under equal training data, and in some cases, even the original transformer-based models that they were distilled from.
♻ ☆ Multimodal Evaluation of Russian-language Architectures
Multimodal large language models (MLLMs) are currently at the center of research attention, showing rapid progress in scale and capabilities, yet their intelligence, limitations, and risks remain insufficiently understood. To address these issues, particularly in the context of the Russian language, where no multimodal benchmarks currently exist, we introduce Mera Multi, an open multimodal evaluation framework for Russian-spoken architectures. The benchmark is instruction-based and encompasses default text, image, audio, and video modalities, comprising 18 newly constructed evaluation tasks for both general-purpose models and modality-specific architectures (image-to-text, video-to-text, and audio-to-text). Our contributions include: (i) a universal taxonomy of multimodal abilities; (ii) 18 datasets created entirely from scratch with attention to Russian cultural and linguistic specificity, unified prompts, and metrics; (iii) baseline results for both closed-source and open-source models; (iv) a methodology for preventing benchmark leakage, including watermarking and licenses for private sets. While our current focus is on Russian, the proposed benchmark provides a replicable methodology for constructing multimodal benchmarks in typologically diverse languages, particularly within the Slavic language family.
♻ ☆ VisPlay: Self-Evolving Vision-Language Models from Images
Reinforcement learning (RL) provides a principled framework for improving Vision-Language Models (VLMs) on complex reasoning tasks. However, existing RL approaches often rely on human-annotated labels or task-specific heuristics to define verifiable rewards, both of which are costly and difficult to scale. We introduce VisPlay, a self-evolving RL framework that enables VLMs to autonomously improve their reasoning abilities using large amounts of unlabeled image data. Starting from a single base VLM, VisPlay assigns the model into two interacting roles: an Image-Conditioned Questioner that formulates challenging yet answerable visual questions, and a Multimodal Reasoner that generates silver responses. These roles are jointly trained with Group Relative Policy Optimization (GRPO), which incorporates diversity and difficulty rewards to balance the complexity of generated questions with the quality of the silver answers. VisPlay scales efficiently across two model families. When trained on Qwen2.5-VL and MiMo-VL, VisPlay achieves consistent improvements in visual reasoning, compositional generalization, and hallucination reduction across eight benchmarks, including MM-Vet and MMMU, demonstrating a scalable path toward self-evolving multimodal intelligence. The project page is available at https://bruno686.github.io/VisPlay/
♻ ☆ HalluClean: A Unified Framework to Combat Hallucinations in LLMs
Large language models (LLMs) have achieved impressive performance across a wide range of natural language processing tasks, yet they often produce hallucinated content that undermines factual reliability. To address this challenge, we introduce HalluClean, a lightweight and task-agnostic framework for detecting and correcting hallucinations in LLM-generated text. HalluClean adopts a reasoning-enhanced paradigm, explicitly decomposing the process into planning, execution, and revision stages to identify and refine unsupported claims. It employs minimal task-routing prompts to enable zero-shot generalization across diverse domains, without relying on external knowledge sources or supervised detectors. We conduct extensive evaluations on five representative tasks-question answering, dialogue, summarization, math word problems, and contradiction detection. Experimental results show that HalluClean significantly improves factual consistency and outperforms competitive baselines, demonstrating its potential to enhance the trustworthiness of LLM outputs in real-world applications.
♻ ☆ LoRA on the Go: Instance-level Dynamic LoRA Selection and Merging
Low-Rank Adaptation (LoRA) has emerged as a parameter-efficient approach for fine-tuning large language models. However, conventional LoRA adapters are typically trained for a single task, limiting their applicability in real-world settings where inputs may span diverse and unpredictable domains. At inference time, existing approaches combine multiple LoRAs for improving performance on diverse tasks, while usually requiring labeled data or additional task-specific training, which is expensive at scale. In this work, we introduce LoRA on the Go (LoGo), a training-free framework that dynamically selects and merges adapters at the instance level without any additional requirements. LoGo leverages signals extracted from a single forward pass through LoRA adapters, to identify the most relevant adapters and determine their contributions on-the-fly. Across 5 NLP benchmarks, 27 datasets, and 3 model families, LoGo outperforms training-based baselines on some tasks upto a margin of 3.6% while remaining competitive on other tasks and maintaining inference throughput, highlighting its effectiveness and practicality.
♻ ☆ From Confidence to Collapse in LLM Factual Robustness
Ensuring the robustness of factual knowledge in LLMs is critical for reliable applications in tasks such as question answering and reasoning. However, existing evaluation methods predominantly focus on performance-based metrics, often investigating from the perspective of prompt perturbations, which captures only the externally triggered side of knowledge robustness. To bridge this gap, we introduce a principled approach to measure factual robustness from the perspective of the generation process by analyzing token distribution entropy in combination with temperature scaling sensitivity. These two factors build the Factual Robustness Score (FRS), a novel metric which quantifies the stability of a fact against perturbations in decoding conditions, given its initial uncertainty. To validate our approach, we conduct extensive experiments on 5 LLMs across 3 closed-book QA datasets (SQuAD, TriviaQA, and HotpotQA). We show that factual robustness varies significantly -- smaller models report an FRS of $0.76$, larger ones $0.93$ -- with accuracy degrading by ~$60\%$ under increased uncertainty. These insights demonstrate how entropy and temperature scaling impact factual accuracy, and lay a foundation for developing more robust knowledge retention and retrieval in future models.
♻ ☆ CRISP: Persistent Concept Unlearning via Sparse Autoencoders
As large language models (LLMs) are increasingly deployed in real-world applications, the need to selectively remove unwanted knowledge while preserving model utility has become paramount. Recent work has explored sparse autoencoders (SAEs) to perform precise interventions on monosemantic features. However, most SAE-based methods operate at inference time, which does not create persistent changes in the model's parameters. Such interventions can be bypassed or reversed by malicious actors with parameter access. We introduce CRISP, a parameter-efficient method for persistent concept unlearning using SAEs. CRISP automatically identifies salient SAE features across multiple layers and suppresses their activations. We experiment with two LLMs and show that our method outperforms prior approaches on safety-critical unlearning tasks from the WMDP benchmark, successfully removing harmful knowledge while preserving general and in-domain capabilities. Feature-level analysis reveals that CRISP achieves semantically coherent separation between target and benign concepts, allowing precise suppression of the target features.
comment: 18 pages, 5 figures
♻ ☆ Injecting Falsehoods: Adversarial Man-in-the-Middle Attacks Undermining Factual Recall in LLMs
LLMs are now an integral part of information retrieval. As such, their role as question answering chatbots raises significant concerns due to their shown vulnerability to adversarial man-in-the-middle (MitM) attacks. Here, we propose the first principled attack evaluation on LLM factual memory under prompt injection via Xmera, our novel, theory-grounded MitM framework. By perturbing the input given to "victim" LLMs in three closed-book and fact-based QA settings, we undermine the correctness of the responses and assess the uncertainty of their generation process. Surprisingly, trivial instruction-based attacks report the highest success rate (up to ~85.3%) while simultaneously having a high uncertainty for incorrectly answered questions. To provide a simple defense mechanism against Xmera, we train Random Forest classifiers on the response uncertainty levels to distinguish between attacked and unattacked queries (average AUC of up to ~96%). We believe that signaling users to be cautious about the answers they receive from black-box and potentially corrupt LLMs is a first checkpoint toward user cyberspace safety.
♻ ☆ GPTopic: Dynamic and Interactive Topic Representations
Topic modeling seems to be almost synonymous with generating lists of top words to represent topics within large text corpora. However, deducing a topic from such list of individual terms can require substantial expertise and experience, making topic modelling less accessible to people unfamiliar with the particularities and pitfalls of top-word interpretation. A topic representation limited to top-words might further fall short of offering a comprehensive and easily accessible characterization of the various aspects, facets and nuances a topic might have. To address these challenges, we introduce GPTopic, a software package that leverages Large Language Models (LLMs) to create dynamic, interactive topic representations. GPTopic provides an intuitive chat interface for users to explore, analyze, and refine topics interactively, making topic modeling more accessible and comprehensive. The corresponding code is available here: https://github.com/ArikReuter/TopicGPT.
♻ ☆ ACEBench: Who Wins the Match Point in Tool Usage?
Large Language Models (LLMs) have demonstrated significant potential in decision-making and reasoning, particularly when integrated with various tools to effectively solve complex problems. However, existing benchmarks for evaluating LLMs' tool usage face several limitations: (1) limited evaluation scenarios, often lacking assessments in real multi-turn dialogue contexts; (2) narrow evaluation dimensions, with insufficient detailed assessments of how LLMs use tools; and (3) reliance on LLMs or real API executions for evaluation, which introduces significant overhead. To address these challenges, we introduce ACEBench, a comprehensive benchmark for assessing tool usage in LLMs. ACEBench categorizes data into three primary types based on evaluation methodology: Normal, Special, and Agent. "Normal" evaluates tool usage in basic scenarios; "Special" evaluates tool usage in situations with ambiguous or incomplete instructions; "Agent" evaluates tool usage through multi-agent interactions to simulate real-world, multi-turn dialogues. We conducted extensive experiments using ACEBench, analyzing various LLMs in-depth and providing a more granular examination of error causes across different data types.
♻ ☆ Atomic Calibration of LLMs in Long-Form Generations ACL 2025
Large language models (LLMs) often suffer from hallucinations, posing significant challenges for real-world applications. Confidence calibration, as an effective indicator of hallucination, is thus essential to enhance the trustworthiness of LLMs. Prior work mainly focuses on short-form tasks using a single response-level score (macro calibration), which is insufficient for long-form outputs that may contain both accurate and inaccurate claims. In this work, we systematically study atomic calibration, which evaluates factuality calibration at a fine-grained level by decomposing long responses into atomic claims. We further categorize existing confidence elicitation methods into discriminative and generative types, and propose two new confidence fusion strategies to improve calibration. Our experiments demonstrate that LLMs exhibit poorer calibration at the atomic level during long-form generation. More importantly, atomic calibration uncovers insightful patterns regarding the alignment of confidence methods and the changes of confidence throughout generation. This sheds light on future research directions for confidence estimation in long-form generation.
comment: ACL 2025 KnowFM Oral / AACL-IJCNLP 2025
♻ ☆ An Iterative Question-Guided Framework for Knowledge Base Question Answering ACL 2025
Large Language Models (LLMs) excel in many natural language processing tasks but often exhibit factual inconsistencies in knowledge-intensive settings. Integrating external knowledge resources, particularly knowledge graphs (KGs), provides a transparent and updatable foundation for more reliable reasoning. Knowledge Base Question Answering (KBQA), which queries and reasons over KGs, is central to this effort, especially for complex, multi-hop queries. However, multi-hop reasoning poses two key challenges: (1)~maintaining coherent reasoning paths, and (2)~avoiding prematurely discarding critical multi-hop connections. To tackle these challenges, we introduce iQUEST, a question-guided KBQA framework that iteratively decomposes complex queries into simpler sub-questions, ensuring a structured and focused reasoning trajectory. Additionally, we integrate a Graph Neural Network (GNN) to look ahead and incorporate 2-hop neighbor information at each reasoning step. This dual approach strengthens the reasoning process, enabling the model to explore viable paths more effectively. Detailed experiments demonstrate the consistent improvement delivered by iQUEST across four benchmark datasets and four LLMs.
comment: Accepted to the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025), Main Track
♻ ☆ MAQuA: Adaptive Question-Asking for Multidimensional Mental Health Screening using Item Response Theory
Recent advances in large language models (LLMs) offer new opportunities for scalable, interactive mental health assessment, but excessive querying by LLMs burdens users and is inefficient for real-world screening across transdiagnostic symptom profiles. We introduce MAQuA, an adaptive question-asking framework for simultaneous, multidimensional mental health screening. Combining multi-outcome modeling on language responses with item response theory (IRT) and factor analysis, MAQuA selects the questions with most informative responses across multiple dimensions at each turn to optimize diagnostic information, improving accuracy and potentially reducing response burden. Empirical results on a novel dataset reveal that MAQuA reduces the number of assessment questions required for score stabilization by 50-87% compared to random ordering (e.g., achieving stable depression scores with 71% fewer questions and eating disorder scores with 85% fewer questions). MAQuA demonstrates robust performance across both internalizing (depression, anxiety) and externalizing (substance use, eating disorder) domains, with early stopping strategies further reducing patient time and burden. These findings position MAQuA as a powerful and efficient tool for scalable, nuanced, and interactive mental health screening, advancing the integration of LLM-based agents into real-world clinical workflows.
♻ ☆ MajinBook: An open catalogue of digital world literature with likes
This data paper introduces MajinBook, an open catalogue designed to facilitate the use of shadow libraries--such as Library Genesis and Z-Library--for computational social science and cultural analytics. By linking metadata from these vast, crowd-sourced archives with structured bibliographic data from Goodreads, we create a high-precision corpus of over 539,000 references to English-language books spanning three centuries, enriched with first publication dates, genres, and popularity metrics like ratings and reviews. Our methodology prioritizes natively digital EPUB files to ensure machine-readable quality, while addressing biases in traditional corpora like HathiTrust, and includes secondary datasets for French, German, and Spanish. We evaluate the linkage strategy for accuracy, release all underlying data openly, and discuss the project's legal permissibility under EU and US frameworks for text and data mining in research.
comment: 9 pages, 5 figures, 1 table
♻ ☆ Multi-dimensional Data Analysis and Applications Basing on LLM Agents and Knowledge Graph Interactions
In the current era of big data, extracting deep insights from massive, heterogeneous, and complexly associated multi-dimensional data has become a significant challenge. Large Language Models (LLMs) perform well in natural language understanding and generation, but still suffer from "hallucination" issues when processing structured knowledge and are difficult to update in real-time. Although Knowledge Graphs (KGs) can explicitly store structured knowledge, their static nature limits dynamic interaction and analytical capabilities. Therefore, this paper proposes a multi-dimensional data analysis method based on the interactions between LLM agents and KGs, constructing a dynamic, collaborative analytical ecosystem. This method utilizes LLM agents to automatically extract product data from unstructured data, constructs and visualizes the KG in real-time, and supports users in deep exploration and analysis of graph nodes through an interactive platform. Experimental results show that this method has significant advantages in product ecosystem analysis, relationship mining, and user-driven exploratory analysis, providing new ideas and tools for multi-dimensional data analysis.
comment: 14 pages, 7 figures, 40 references
♻ ☆ ATLAS: A High-Difficulty, Multidisciplinary Benchmark for Frontier Scientific Reasoning
The rapid advancement of Large Language Models (LLMs) has led to performance saturation on many established benchmarks, questioning their ability to distinguish frontier models. Concurrently, existing high-difficulty benchmarks often suffer from narrow disciplinary focus, oversimplified answer formats, and vulnerability to data contamination, creating a fidelity gap with real-world scientific inquiry. To address these challenges, we introduce ATLAS (AGI-Oriented Testbed for Logical Application in Science), a large-scale, high-difficulty, and cross-disciplinary evaluation suite composed of approximately 800 original problems. Developed by domain experts (PhD-level and above), ATLAS spans seven core scientific fields: mathematics, physics, chemistry, biology, computer science, earth science, and materials science. Its key features include: (1) High Originality and Contamination Resistance, with all questions newly created or substantially adapted to prevent test data leakage; (2) Cross-Disciplinary Focus, designed to assess models' ability to integrate knowledge and reason across scientific domains; (3) High-Fidelity Answers, prioritizing complex, open-ended answers involving multi-step reasoning and LaTeX-formatted expressions over simple multiple-choice questions; and (4) Rigorous Quality Control, employing a multi-stage process of expert peer review and adversarial testing to ensure question difficulty, scientific value, and correctness. We also propose a robust evaluation paradigm using a panel of LLM judges for automated, nuanced assessment of complex answers. Preliminary results on leading models demonstrate ATLAS's effectiveness in differentiating their advanced scientific reasoning capabilities. We plan to develop ATLAS into a long-term, open, community-driven platform to provide a reliable "ruler" for progress toward Artificial General Intelligence.
comment: 39 pages
♻ ☆ Beyond Bias Scores: Unmasking Vacuous Neutrality in Small Language Models
The rapid adoption of Small Language Models (SLMs) for resource constrained applications has outpaced our understanding of their ethical and fairness implications. To address this gap, we introduce the Vacuous Neutrality Framework (VaNeu), a multi-dimensional evaluation paradigm designed to assess SLM fairness prior to deployment. The framework examines model robustness across four stages - biases, utility, ambiguity handling, and positional bias over diverse social bias categories. To the best of our knowledge, this work presents the first large-scale audit of SLMs in the 0.5-5B parameter range, an overlooked "middle tier" between BERT-class encoders and flagship LLMs. We evaluate nine widely used SLMs spanning four model families under both ambiguous and disambiguated contexts. Our findings show that models demonstrating low bias in early stages often fail subsequent evaluations, revealing hidden vulnerabilities and unreliable reasoning. These results underscore the need for a more comprehensive understanding of fairness and reliability in SLMs, and position the proposed framework as a principled tool for responsible deployment in socially sensitive settings.
♻ ☆ Property-guided Inverse Design of Metal-Organic Frameworks Using Quantum Natural Language Processing
In this study, we explore the potential of using quantum natural language processing (QNLP) to inverse design metal-organic frameworks (MOFs) with targeted properties. Specifically, by analyzing 450 hypothetical MOF structures consisting of 3 topologies, 10 metal nodes and 15 organic ligands, we categorize these structures into four distinct classes for pore volume and $CO_{2}$ Henry's constant values. We then compare various QNLP models (i.e. the bag-of-words, DisCoCat (Distributional Compositional Categorical), and sequence-based models) to identify the most effective approach to process the MOF dataset. Using a classical simulator provided by the IBM Qiskit, the bag-of-words model is identified to be the optimum model, achieving validation accuracies of 88.6% and 78.0% for binary classification tasks on pore volume and $CO_{2}$ Henry's constant, respectively. Further, we developed multi-class classification models tailored to the probabilistic nature of quantum circuits, with average test accuracies of 92% and 80% across different classes for pore volume and $CO_{2}$ Henry's constant datasets. Finally, the performance of generating MOF with target properties showed accuracies of 93.5% for pore volume and 87% for $CO_{2}$ Henry's constant, respectively. Although our investigation covers only a fraction of the vast MOF search space, it marks a promising first step towards using quantum computing for materials design, offering a new perspective through which to explore the complex landscape of MOFs.
comment: 46 pages, 7 figures, 6 supplementary figures, 1 table, 2 supplementary tables, 1 supplementary note
♻ ☆ OEMA: Ontology-Enhanced Multi-Agent Collaboration Framework for Zero-Shot Clinical Named Entity Recognition
With the rapid expansion of unstructured clinical texts in electronic health records (EHRs), clinical named entity recognition (NER) has become a crucial technique for extracting medical information. However, traditional supervised models such as CRF and BioClinicalBERT suffer from high annotation costs. Although zero-shot NER based on large language models (LLMs) reduces the dependency on labeled data, challenges remain in aligning example selection with task granularity and effectively integrating prompt design with self-improvement frameworks. To address these limitations, we propose OEMA, a novel zero-shot clinical NER framework based on multi-agent collaboration. OEMA consists of three core components: (1) a self-annotator that autonomously generates candidate examples; (2) a discriminator that leverages SNOMED CT to filter token-level examples by clinical relevance; and (3) a predictor that incorporates entity-type descriptions to enhance inference accuracy. Experimental results on two benchmark datasets, MTSamples and VAERS, demonstrate that OEMA achieves state-of-the-art performance under exact-match evaluation. Moreover, under related-match criteria, OEMA performs comparably to the supervised BioClinicalBERT model while significantly outperforming the traditional CRF method. OEMA improves zero-shot clinical NER, achieving near-supervised performance under related-match criteria. Future work will focus on continual learning and open-domain adaptation to expand its applicability in clinical NLP.
comment: 12 pages, 4 figures, 4 tables
♻ ☆ LLMs as Models for Analogical Reasoning
Analogical reasoning -- the capacity to identify and map structural relationships between different domains -- is fundamental to human cognition and learning. Recent studies have shown that large language models (LLMs) can sometimes match humans in analogical reasoning tasks, opening the possibility that analogical reasoning might emerge from domain-general processes. However, it is still debated whether these emergent capacities are largely superficial and limited to simple relations seen during training or whether they encompass the flexible representational and mapping capabilities which are the focus of leading cognitive models of analogy. In this study, we introduce novel analogical reasoning tasks that require participants to map between semantically contentful words and sequences of letters and other abstract characters. This task necessitates the ability to flexibly re-represent rich semantic information -- an ability which is known to be central to human analogy but which is thus far not well captured by existing cognitive theories and models. We assess the performance of both human participants and LLMs on tasks focusing on reasoning from semantic structure and semantic content, introducing variations that test the robustness of their analogical inferences. Advanced LLMs match human performance across several conditions, though humans and LLMs respond differently to certain task variations and semantic distractors. Our results thus provide new evidence that LLMs might offer a how-possibly explanation of human analogical reasoning in contexts that are not yet well modeled by existing theories, but that even today's best models are unlikely to yield how-actually explanations.
comment: The title has been changed from Semantic Structure-Mapping in LLM and Human Analogical Reasoning to LLMs as Models for Analogical Reasoning to improve clarity and accuracy
♻ ☆ Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models
We present evidence that adversarial poetry functions as a universal single-turn jailbreak technique for Large Language Models (LLMs). Across 25 frontier proprietary and open-weight models, curated poetic prompts yielded high attack-success rates (ASR), with some providers exceeding 90%. Mapping prompts to MLCommons and EU CoP risk taxonomies shows that poetic attacks transfer across CBRN, manipulation, cyber-offence, and loss-of-control domains. Converting 1,200 MLCommons harmful prompts into verse via a standardized meta-prompt produced ASRs up to 18 times higher than their prose baselines. Outputs are evaluated using an ensemble of 3 open-weight LLM judges, whose binary safety assessments were validated on a stratified human-labeled subset. Poetic framing achieved an average jailbreak success rate of 62% for hand-crafted poems and approximately 43% for meta-prompt conversions (compared to non-poetic baselines), substantially outperforming non-poetic baselines and revealing a systematic vulnerability across model families and safety training approaches. These findings demonstrate that stylistic variation alone can circumvent contemporary safety mechanisms, suggesting fundamental limitations in current alignment methods and evaluation protocols.
♻ ☆ CoTKR: Chain-of-Thought Enhanced Knowledge Rewriting for Complex Knowledge Graph Question Answering
Recent studies have explored the use of Large Language Models (LLMs) with Retrieval Augmented Generation (RAG) for Knowledge Graph Question Answering (KGQA). They typically require rewriting retrieved subgraphs into natural language formats comprehensible to LLMs. However, when tackling complex questions, the knowledge rewritten by existing methods may include irrelevant information, omit crucial details, or fail to align with the question's semantics. To address them, we propose a novel rewriting method CoTKR, Chain-of-Thought Enhanced Knowledge Rewriting, for generating reasoning traces and corresponding knowledge in an interleaved manner, thereby mitigating the limitations of single-step knowledge rewriting. Additionally, to bridge the preference gap between the knowledge rewriter and the question answering (QA) model, we propose a training strategy PAQAF, Preference Alignment from Question Answering Feedback, for leveraging feedback from the QA model to further optimize the knowledge rewriter. We conduct experiments using various LLMs across several KGQA benchmarks. Experimental results demonstrate that, compared with previous knowledge rewriting methods, CoTKR generates the most beneficial knowledge representation for QA models, which significantly improves the performance of LLMs in KGQA.
♻ ☆ Interpreting the Effects of Quantization on LLMs AACL 2025
Quantization offers a practical solution to deploy LLMs in resource-constraint environments. However, its impact on internal representations remains understudied, raising questions about the reliability of quantized models. In this study, we employ a range of interpretability techniques to investigate how quantization affects model and neuron behavior. We analyze multiple LLMs under 4-bit and 8-bit quantization. Our findings reveal that the impact of quantization on model calibration is generally minor. Analysis of neuron activations indicates that the number of dead neurons, i.e., those with activation values close to 0 across the dataset, remains consistent regardless of quantization. In terms of neuron contribution to predictions, we observe that smaller full precision models exhibit fewer salient neurons, whereas larger models tend to have more, with the exception of Llama-2-7B. The effect of quantization on neuron redundancy varies across models. Overall, our findings suggest that effect of quantization may vary by model and tasks, however, we did not observe any drastic change which may discourage the use of quantization as a reliable model compression technique.
comment: Accepted to AACL 2025 Main
♻ ☆ Confidence-Guided Stepwise Model Routing for Cost-Efficient Reasoning
Recent advances in Large Language Models (LLMs) - particularly model scaling and test-time techniques - have greatly enhanced the reasoning capabilities of language models at the expense of higher inference costs. To lower inference costs, prior works train router models or deferral mechanisms that allocate easy queries to a small, efficient model, while forwarding harder queries to larger, more expensive models. However, these trained router models often lack robustness under domain shifts and require expensive data synthesis techniques such as Monte Carlo rollouts to obtain sufficient ground-truth routing labels for training. In this work, we propose Confidence-Guided Stepwise Model Routing for Cost-Efficient Reasoning (STEER), a domain-agnostic framework that performs fine-grained, step-level routing between smaller and larger LLMs without utilizing external models. STEER leverages confidence scores from the smaller model's logits prior to generating a reasoning step, so that the large model is invoked only when necessary. Extensive evaluations using different LLMs on a diverse set of challenging benchmarks across multiple domains such as Mathematical Reasoning, Multi-Hop QA, and Planning tasks indicate that STEER achieves competitive or enhanced accuracy while reducing inference costs (up to +20% accuracy with 48% less FLOPs compared to solely using the larger model on AIME), outperforming baselines that rely on trained external modules. Our results establish model-internal confidence as a robust, domain-agnostic signal for model routing, offering a scalable pathway for efficient LLM deployment.
comment: 7 pages, 5 figures
♻ ☆ OmniThink: Expanding Knowledge Boundaries in Machine Writing through Thinking EMNLP 2025
Machine writing with large language models often relies on retrieval-augmented generation. However, these approaches remain confined within the boundaries of the model's predefined scope, limiting the generation of content with rich information. Specifically, vanilla-retrieved information tends to lack depth, novelty, and suffers from redundancy, which negatively impacts the quality of generated articles, leading to shallow, unoriginal, and repetitive outputs. To address these issues, we propose OmniThink, a slow-thinking machine writing framework that emulates the human-like process of iterative expansion and reflection. The core idea behind OmniThink is to simulate the cognitive behavior of learners as they slowly deepen their knowledge of the topics. Experimental results demonstrate that OmniThink improves the knowledge density of generated articles without compromising metrics such as coherence and depth. Human evaluations and expert feedback further highlight the potential of OmniThink to address real-world challenges in the generation of long-form articles. Code is available at https://github.com/zjunlp/OmniThink.
comment: EMNLP 2025
♻ ☆ CaKE: Circuit-aware Editing Enables Generalizable Knowledge Learners EMNLP 2025
Knowledge Editing (KE) enables the modification of outdated or incorrect information in large language models (LLMs). While existing KE methods can update isolated facts, they often fail to generalize these updates to multi-hop reasoning tasks that rely on the modified knowledge. Through an analysis of reasoning circuits -- the neural pathways LLMs use for knowledge-based inference, we find that current layer-localized KE approaches (e.g., MEMIT, WISE), which edit only single or a few model layers, inadequately integrate updated knowledge into these reasoning pathways. To address this limitation, we present CaKE (Circuit-aware Knowledge Editing), a novel method that enhances the effective integration of updated knowledge in LLMs. By only leveraging a few curated data samples guided by our circuit-based analysis, CaKE stimulates the model to develop appropriate reasoning circuits for newly incorporated knowledge. Experiments show that CaKE enables more accurate and consistent use of edited knowledge across related reasoning tasks, achieving an average improvement of 20% in multi-hop reasoning accuracy on the MQuAKE dataset while requiring less memory than existing KE methods. We release the code and data in https://github.com/zjunlp/CaKE.
comment: EMNLP 2025
♻ ☆ The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity NeurIPS 2025
Recent generations of language models have introduced Large Reasoning Models (LRMs) that generate detailed thinking processes before providing answers. While these models demonstrate improved performance on reasoning benchmarks, their fundamental capabilities, scaling properties, and limitations remain insufficiently understood. Current evaluations primarily focus on established math and coding benchmarks, emphasizing final answer accuracy. However, this evaluation paradigm often suffers from contamination and does not provide insights into the reasoning traces. In this work, we systematically investigate these gaps with the help of controllable puzzle environments that allow precise manipulation of complexity while maintaining consistent logical structures. This setup enables the analysis of not only final answers but also the internal reasoning traces, offering insights into how LRMs think. Through extensive experiments, we show that LRMs face a complete accuracy collapse beyond certain complexities. Moreover, they exhibit a counterintuitive scaling limit: their reasoning effort increases with problem complexity up to a point, then declines despite having remaining token budget. By comparing LRMs with their standard LLM counterparts under same inference compute, we identify three performance regimes: (1) low-complexity tasks where standard models outperform LRMs, (2) medium-complexity tasks where LRMs demonstrates advantage, and (3) high-complexity tasks where both models face complete collapse. We found that LRMs have limitations in exact computation: they fail to use explicit algorithms and reason inconsistently across scales. We also investigate the reasoning traces in more depth, studying the patterns of explored solutions and analyzing the models' computational behavior, shedding light on their strengths, limitations, and raising questions about their reasoning capabilities.
comment: NeurIPS 2025. camera-ready version + additional discussion in the appendix
♻ ☆ Can LLMs Replace Economic Choice Prediction Labs? The Case of Language-based Persuasion Games
Human choice prediction in economic contexts is crucial for applications in marketing, finance, public policy, and more. This task, however, is often constrained by the difficulties in acquiring human choice data. With most experimental economics studies focusing on simple choice settings, the AI community has explored whether LLMs can substitute for humans in these predictions and examined more complex experimental economics settings. However, a key question remains: can LLMs generate training data for human choice prediction? We explore this in language-based persuasion games, a complex economic setting involving natural language in strategic interactions. Our experiments show that models trained on LLM-generated data can effectively predict human behavior in these games and even outperform models trained on actual human data. Beyond data generation, we investigate the dual role of LLMs as both data generators and predictors, introducing a comprehensive empirical study on the effectiveness of utilizing LLMs for data generation, human choice prediction, or both. We then utilize our choice prediction framework to analyze how strategic factors shape decision-making, showing that interaction history (rather than linguistic sentiment alone) plays a key role in predicting human decision-making in repeated interactions. Particularly, when LLMs capture history-dependent decision patterns similarly to humans, their predictive success improves substantially. Finally, we demonstrate the robustness of our findings across alternative persuasion-game settings, highlighting the broader potential of using LLM-generated data to model human decision-making.
Artificial Intelligence
Dataset Distillation for Pre-Trained Self-Supervised Vision Models NeurIPS 2025
The task of dataset distillation aims to find a small set of synthetic images such that training a model on them reproduces the performance of the same model trained on a much larger dataset of real samples. Existing distillation methods focus on synthesizing datasets that enable training randomly initialized models. In contrast, state-of-the-art vision approaches are increasingly building on large, pre-trained self-supervised models rather than training from scratch. In this paper, we investigate the problem of distilling datasets that enable us to optimally train linear probes on top of such large, pre-trained vision models. We introduce a method of dataset distillation for this task called Linear Gradient Matching that optimizes the synthetic images such that, when passed through a pre-trained feature extractor, they induce gradients in the linear classifier similar to those produced by the real data. Our method yields synthetic data that outperform all real-image baselines and, remarkably, generalize across pre-trained vision models, enabling us, for instance, to train a linear CLIP probe that performs competitively using a dataset distilled via a DINO backbone. Further, we show that our distilled datasets are exceptionally effective for fine-grained classification and provide a valuable tool for model interpretability, predicting, among other things, how similar two models' embedding spaces are under the platonic representation hypothesis or whether a model is sensitive to spurious correlations in adversarial datasets.
comment: Accepted at NeurIPS 2025. Project page: https://linear-gradient-matching.github.io/ Code: https://github.com/GeorgeCazenavette/linear-gradient-matching
☆ Thinking-while-Generating: Interleaving Textual Reasoning throughout Visual Generation
Recent advances in visual generation have increasingly explored the integration of reasoning capabilities. They incorporate textual reasoning, i.e., think, either before (as pre-planning) or after (as post-refinement) the generation process, yet they lack on-the-fly multimodal interaction during the generation itself. In this preliminary study, we introduce Thinking-while-Generating (TwiG), the first interleaved framework that enables co-evolving textual reasoning throughout the visual generation process. As visual content is progressively generating, textual reasoning is interleaved to both guide upcoming local regions and reflect on previously synthesized ones. This dynamic interplay produces more context-aware and semantically rich visual outputs. To unveil the potential of this framework, we investigate three candidate strategies, zero-shot prompting, supervised fine-tuning (SFT) on our curated TwiG-50K dataset, and reinforcement learning (RL) via a customized TwiG-GRPO strategy, each offering unique insights into the dynamics of interleaved reasoning. We hope this work inspires further research into interleaving textual reasoning for enhanced visual generation. Code will be released at: https://github.com/ZiyuGuo99/Thinking-while-Generating.
comment: Project Page: https://think-while-gen.github.io Code: https://github.com/ZiyuGuo99/Thinking-while-Generating
☆ Taming the Long-Tail: Efficient Reasoning RL Training with Adaptive Drafter
The emergence of Large Language Models (LLMs) with strong reasoning capabilities marks a significant milestone, unlocking new frontiers in complex problem-solving. However, training these reasoning models, typically using Reinforcement Learning (RL), encounters critical efficiency bottlenecks: response generation during RL training exhibits a persistent long-tail distribution, where a few very long responses dominate execution time, wasting resources and inflating costs. To address this, we propose TLT, a system that accelerates reasoning RL training losslessly by integrating adaptive speculative decoding. Applying speculative decoding in RL is challenging due to the dynamic workloads, evolving target model, and draft model training overhead. TLT overcomes these obstacles with two synergistic components: (1) Adaptive Drafter, a lightweight draft model trained continuously on idle GPUs during long-tail generation to maintain alignment with the target model at no extra cost; and (2) Adaptive Rollout Engine, which maintains a memory-efficient pool of pre-captured CUDAGraphs and adaptively select suitable SD strategies for each input batch. Evaluations demonstrate that TLT achieves over 1.7x end-to-end RL training speedup over state-of-the-art systems, preserves the model accuracy, and yields a high-quality draft model as a free byproduct suitable for efficient deployment. Code is released at https://github.com/mit-han-lab/fastrl.
☆ Dexterity from Smart Lenses: Multi-Fingered Robot Manipulation with In-the-Wild Human Demonstrations
Learning multi-fingered robot policies from humans performing daily tasks in natural environments has long been a grand goal in the robotics community. Achieving this would mark significant progress toward generalizable robot manipulation in human environments, as it would reduce the reliance on labor-intensive robot data collection. Despite substantial efforts, progress toward this goal has been bottle-necked by the embodiment gap between humans and robots, as well as by difficulties in extracting relevant contextual and motion cues that enable learning of autonomous policies from in-the-wild human videos. We claim that with simple yet sufficiently powerful hardware for obtaining human data and our proposed framework AINA, we are now one significant step closer to achieving this dream. AINA enables learning multi-fingered policies from data collected by anyone, anywhere, and in any environment using Aria Gen 2 glasses. These glasses are lightweight and portable, feature a high-resolution RGB camera, provide accurate on-board 3D head and hand poses, and offer a wide stereo view that can be leveraged for depth estimation of the scene. This setup enables the learning of 3D point-based policies for multi-fingered hands that are robust to background changes and can be deployed directly without requiring any robot data (including online corrections, reinforcement learning, or simulation). We compare our framework against prior human-to-robot policy learning approaches, ablate our design choices, and demonstrate results across nine everyday manipulation tasks. Robot rollouts are best viewed on our website: https://aina-robot.github.io.
☆ Cognitive Foundations for Reasoning and Their Manifestation in LLMs
Large language models solve complex problems yet fail on simpler variants, suggesting they achieve correct outputs through mechanisms fundamentally different from human reasoning. We synthesize cognitive science research into a taxonomy of 28 cognitive elements spanning computational constraints, meta-cognitive controls, knowledge representations, and transformation operations, then analyze their behavioral manifestations in reasoning traces. We propose a fine-grained cognitive evaluation framework and conduct the first large-scale analysis of 170K traces from 17 models across text, vision, and audio modalities, alongside 54 human think-aloud traces, which we make publicly available. Our analysis reveals systematic structural differences: humans employ hierarchical nesting and meta-cognitive monitoring while models rely on shallow forward chaining, with divergence most pronounced on ill-structured problems. Meta-analysis of 1,598 LLM reasoning papers reveals the research community concentrates on easily quantifiable behaviors (sequential organization: 55%, decomposition: 60%) while neglecting meta-cognitive controls (self-awareness: 16%, evaluation: 8%) that correlate with success. Models possess behavioral repertoires associated with success but fail to deploy them spontaneously. Leveraging these patterns, we develop test-time reasoning guidance that automatically scaffold successful structures, improving performance by up to 60% on complex problems. By bridging cognitive science and LLM research, we establish a foundation for developing models that reason through principled cognitive mechanisms rather than brittle spurious reasoning shortcuts or memorization, opening new directions for both improving model capabilities and testing theories of human cognition at scale.
comment: 40 pages, 4 tables, 6 figures
☆ Enhancing Forex Forecasting Accuracy: The Impact of Hybrid Variable Sets in Cognitive Algorithmic Trading Systems
This paper presents the implementation of an advanced artificial intelligence-based algorithmic trading system specifically designed for the EUR-USD pair within the high-frequency environment of the Forex market. The methodological approach centers on integrating a holistic set of input features: key fundamental macroeconomic variables (for example, Gross Domestic Product and Unemployment Rate) collected from both the Euro Zone and the United States, alongside a comprehensive suite of technical variables (including indicators, oscillators, Fibonacci levels, and price divergences). The performance of the resulting algorithm is evaluated using standard machine learning metrics to quantify predictive accuracy and backtesting simulations across historical data to assess trading profitability and risk. The study concludes with a comparative analysis to determine which class of input features, fundamental or technical, provides greater and more reliable predictive capacity for generating profitable trading signals.
comment: Paper not published
☆ Evolution Strategies at the Hyperscale
We introduce Evolution Guided General Optimization via Low-rank Learning (EGGROLL), an evolution strategies (ES) algorithm designed to scale backprop-free optimization to large population sizes for modern large neural network architectures with billions of parameters. ES is a set of powerful blackbox optimisation methods that can handle non-differentiable or noisy objectives with excellent scaling potential through parallelisation. Na{ï}ve ES becomes prohibitively expensive at scale due to the computational and memory costs associated with generating matrix perturbations $E\in\mathbb{R}^{m\times n}$ and the batched matrix multiplications needed to compute per-member forward passes. EGGROLL overcomes these bottlenecks by generating random matrices $A\in \mathbb{R}^{m\times r},\ B\in \mathbb{R}^{n\times r}$ with $r\ll \min(m,n)$ to form a low-rank matrix perturbation $A B^\top$ that are used in place of the full-rank perturbation $E$. As the overall update is an average across a population of $N$ workers, this still results in a high-rank update but with significant memory and computation savings, reducing the auxiliary storage from $mn$ to $r(m+n)$ per layer and the cost of a forward pass from $\mathcal{O}(mn)$ to $\mathcal{O}(r(m+n))$ when compared to full-rank ES. A theoretical analysis reveals our low-rank update converges to the full-rank update at a fast $\mathcal{O}\left(\frac{1}{r}\right)$ rate. Our experiments show that (1) EGGROLL does not compromise the performance of ES in tabula-rasa RL settings, despite being faster, (2) it is competitive with GRPO as a technique for improving LLM reasoning, and (3) EGGROLL enables stable pre-training of nonlinear recurrent language models that operate purely in integer datatypes.
comment: 48 pages, 12 figures, Website at https://eshyperscale.github.io/
☆ Teacher-Guided One-Shot Pruning via Context-Aware Knowledge Distillation
Unstructured pruning remains a powerful strategy for compressing deep neural networks, yet it often demands iterative train-prune-retrain cycles, resulting in significant computational overhead. To address this challenge, we introduce a novel teacher-guided pruning framework that tightly integrates Knowledge Distillation (KD) with importance score estimation. Unlike prior approaches that apply KD as a post-pruning recovery step, our method leverages gradient signals informed by the teacher during importance score calculation to identify and retain parameters most critical for both task performance and knowledge transfer. Our method facilitates a one-shot global pruning strategy that efficiently eliminates redundant weights while preserving essential representations. After pruning, we employ sparsity-aware retraining with and without KD to recover accuracy without reactivating pruned connections. Comprehensive experiments across multiple image classification benchmarks, including CIFAR-10, CIFAR-100, and TinyImageNet, demonstrate that our method consistently achieves high sparsity levels with minimal performance degradation. Notably, our approach outperforms state-of-the-art baselines such as EPG and EPSD at high sparsity levels, while offering a more computationally efficient alternative to iterative pruning schemes like COLT. The proposed framework offers a computation-efficient, performance-preserving solution well suited for deployment in resource-constrained environments.
comment: Accepted at 2025 IEEE International Conference on Big Data (IEEE BigData 2025)
☆ Faster Certified Symmetry Breaking Using Orders With Auxiliary Variables AAAI 2026
Symmetry breaking is a crucial technique in modern combinatorial solving, but it is difficult to be sure it is implemented correctly. The most successful approach to deal with bugs is to make solvers certifying, so that they output not just a solution, but also a mathematical proof of correctness in a standard format, which can then be checked by a formally verified checker. This requires justifying symmetry reasoning within the proof, but developing efficient methods for this has remained a long-standing open challenge. A fully general approach was recently proposed by Bogaerts et al. (2023), but it relies on encoding lexicographic orders with big integers, which quickly becomes infeasible for large symmetries. In this work, we develop a method for instead encoding orders with auxiliary variables. We show that this leads to orders-of-magnitude speed-ups in both theory and practice by running experiments on proof logging and checking for SAT symmetry breaking using the state-of-the-art satsuma symmetry breaker and the VeriPB proof checking toolchain.
comment: 26 pages. Extended version (with appendix) of the paper to appear in AAAI 2026
☆ Stabilizing Policy Gradient Methods via Reward Profiling
Policy gradient methods, which have been extensively studied in the last decade, offer an effective and efficient framework for reinforcement learning problems. However, their performances can often be unsatisfactory, suffering from unreliable reward improvements and slow convergence, due to high variance in gradient estimations. In this paper, we propose a universal reward profiling framework that can be seamlessly integrated with any policy gradient algorithm, where we selectively update the policy based on high-confidence performance estimations. We theoretically justify that our technique will not slow down the convergence of the baseline policy gradient methods, but with high probability, will result in stable and monotonic improvements of their performance. Empirically, on eight continuous-control benchmarks (Box2D and MuJoCo/PyBullet), our profiling yields up to 1.5x faster convergence to near-optimal returns, up to 1.75x reduction in return variance on some setups. Our profiling approach offers a general, theoretically grounded path to more reliable and efficient policy learning in complex environments.
☆ MedBayes-Lite: Bayesian Uncertainty Quantification for Safe Clinical Decision Support
We propose MedBayes-Lite, a lightweight Bayesian enhancement for transformer-based clinical language models designed to produce reliable, uncertainty-aware predictions. Although transformers show strong potential for clinical decision support, they remain prone to overconfidence, especially in ambiguous medical cases where calibrated uncertainty is critical. MedBayes-Lite embeds uncertainty quantification directly into existing transformer pipelines without any retraining or architectural rewiring, adding no new trainable layers and keeping parameter overhead under 3 percent. The framework integrates three components: (i) Bayesian Embedding Calibration using Monte Carlo dropout for epistemic uncertainty, (ii) Uncertainty-Weighted Attention that marginalizes over token reliability, and (iii) Confidence-Guided Decision Shaping inspired by clinical risk minimization. Across biomedical QA and clinical prediction benchmarks (MedQA, PubMedQA, MIMIC-III), MedBayes-Lite consistently improves calibration and trustworthiness, reducing overconfidence by 32 to 48 percent. In simulated clinical settings, it can prevent up to 41 percent of diagnostic errors by flagging uncertain predictions for human review. These results demonstrate its effectiveness in enabling reliable uncertainty propagation and improving interpretability in medical AI systems.
☆ SAM 3D: 3Dfy Anything in Images
We present SAM 3D, a generative model for visually grounded 3D object reconstruction, predicting geometry, texture, and layout from a single image. SAM 3D excels in natural images, where occlusion and scene clutter are common and visual recognition cues from context play a larger role. We achieve this with a human- and model-in-the-loop pipeline for annotating object shape, texture, and pose, providing visually grounded 3D reconstruction data at unprecedented scale. We learn from this data in a modern, multi-stage training framework that combines synthetic pretraining with real-world alignment, breaking the 3D "data barrier". We obtain significant gains over recent work, with at least a 5:1 win rate in human preference tests on real-world objects and scenes. We will release our code and model weights, an online demo, and a new challenging benchmark for in-the-wild 3D object reconstruction.
comment: Website: https://ai.meta.com/sam3d/
☆ Bridging VLMs and Embodied Intelligence with Deliberate Practice Policy Optimization
Developing a universal and versatile embodied intelligence system presents two primary challenges: the critical embodied data bottleneck, where real-world data is scarce and expensive, and the algorithmic inefficiency of existing methods, which are resource-prohibitive. To address these limitations, we introduce Deliberate Practice Policy Optimization (DPPO), a metacognitive ``Metaloop'' training framework that dynamically alternates between supervised fine-tuning (competence expansion) and reinforcement learning (skill refinement). This enables automatic weakness identification and targeted resource allocation, specifically designed to maximize learning efficiency from sparse, finite data. Theoretically, DPPO can be formalised as a unified preference-learning framework. Empirically, training a vision-language embodied model with DPPO, referred to as Pelican-VL 1.0, yields a 20.3% performance improvement over the base model and surpasses open-source models at the 100B-parameter scale by 10.6%. We are open-sourcing both the models and code, providing the first systematic framework that alleviates the data and resource bottleneck and enables the community to build versatile embodied agents efficiently.
☆ You Only Forward Once: An Efficient Compositional Judging Paradigm
Multimodal large language models (MLLMs) show strong potential as judges. However, existing approaches face a fundamental trade-off: adapting MLLMs to output a single score misaligns with the generative nature of MLLMs and limits fine-grained requirement understanding, whereas autoregressively generating judging analyses is prohibitively slow in high-throughput settings. Observing that judgment reduces to verifying whether inputs satisfy a set of structured requirements, we propose YOFO, a template-conditioned method that judges all requirements in a single forward pass. Built on an autoregressive model, YOFO accepts a structured requirement template and, in one inference step, produces a binary yes/no decision for each requirement by reading the logits of the final token associated with that requirement. This design yields orders-of-magnitude speedups while preserving interpretability. Extensive experiments show that YOFO not only achieves state-of-the-art results on standard recommendation datasets, but also supports dependency-aware analysis-where subsequent judgments are conditioned on previous ones-and further benefits from post-hoc CoT.
☆ TimeViper: A Hybrid Mamba-Transformer Vision-Language Model for Efficient Long Video Understanding
We introduce TimeViper, a hybrid vision-language model designed to tackle challenges of long video understanding. Processing long videos demands both an efficient model architecture and an effective mechanism for handling extended temporal contexts. To this end, TimeViper adopts a hybrid Mamba-Transformer backbone that combines the efficiency of state-space models with the expressivity of attention mechanisms. Through this hybrid design, we reveal the vision-to-text information aggregation phenomenon, where information progressively flows from vision tokens to text tokens across increasing LLM depth, resulting in severe vision token redundancy. Motivated by this observation, we propose TransV, a token information transfer module that transfers and compresses vision tokens into instruction tokens while maintaining multimodal understanding capabilities. This design enables TimeViper to process hour-long videos exceeding 10,000 frames. Extensive experiments across multiple benchmarks demonstrate that TimeViper competes with state-of-the-art models while extending frame numbers. We further analyze attention behaviors of both Mamba and Transformer layers, offering new insights into hybrid model interpretability. This work represents an initial step towards developing, interpreting, and compressing hybrid Mamba-Transformer architectures.
comment: Project page: https://xuboshen.github.io/TimeViper
☆ Green Resilience of Cyber-Physical Systems: Doctoral Dissertation
Cyber-physical systems (CPS) combine computational and physical components. Online Collaborative AI System (OL-CAIS) is a type of CPS that learn online in collaboration with humans to achieve a common goal, which makes it vulnerable to disruptive events that degrade performance. Decision-makers must therefore restore performance while limiting energy impact, creating a trade-off between resilience and greenness. This research addresses how to balance these two properties in OL-CAIS. It aims to model resilience for automatic state detection, develop agent-based policies that optimize the greenness-resilience trade-off, and understand catastrophic forgetting to maintain performance consistency. We model OL-CAIS behavior through three operational states: steady, disruptive, and final. To support recovery during disruptions, we introduce the GResilience framework, which provides recovery strategies through multi-objective optimization (one-agent), game-theoretic decision-making (two-agent), and reinforcement learning (RL-agent). We also design a measurement framework to quantify resilience and greenness. Empirical evaluation uses real and simulated experiments with a collaborative robot learning object classification from human demonstrations. Results show that the resilience model captures performance transitions during disruptions, and that GResilience policies improve green recovery by shortening recovery time, stabilizing performance, and reducing human dependency. RL-agent policies achieve the strongest results, although with a marginal increase in CO2 emissions. We also observe catastrophic forgetting after repeated disruptions, while our policies help maintain steadiness. A comparison with containerized execution shows that containerization cuts CO2 emissions by half. Overall, this research provides models, metrics, and policies that ensure the green recovery of OL-CAIS.
☆ D-GARA: A Dynamic Benchmarking Framework for GUI Agent Robustness in Real-World Anomalies AAAI 2026
Developing intelligent agents capable of operating a wide range of Graphical User Interfaces (GUIs) with human-level proficiency is a key milestone on the path toward Artificial General Intelligence. While most existing datasets and benchmarks for training and evaluating GUI agents are static and idealized, failing to reflect the complexity and unpredictability of real-world environments, particularly the presence of anomalies. To bridge this research gap, we propose D-GARA, a dynamic benchmarking framework, to evaluate Android GUI agent robustness in real-world anomalies. D-GARA introduces a diverse set of real-world anomalies that GUI agents commonly face in practice, including interruptions such as permission dialogs, battery warnings, and update prompts. Based on D-GARA framework, we construct and annotate a benchmark featuring commonly used Android applications with embedded anomalies to support broader community research. Comprehensive experiments and results demonstrate substantial performance degradation in state-of-the-art GUI agents when exposed to anomaly-rich environments, highlighting the need for robustness-aware learning. D-GARA is modular and extensible, supporting the seamless integration of new tasks, anomaly types, and interaction scenarios to meet specific evaluation goals.
comment: Accepted to AAAI 2026
☆ Formal Abductive Latent Explanations for Prototype-Based Networks AAAI-26
Case-based reasoning networks are machine-learning models that make predictions based on similarity between the input and prototypical parts of training samples, called prototypes. Such models are able to explain each decision by pointing to the prototypes that contributed the most to the final outcome. As the explanation is a core part of the prediction, they are often qualified as ``interpretable by design". While promising, we show that such explanations are sometimes misleading, which hampers their usefulness in safety-critical contexts. In particular, several instances may lead to different predictions and yet have the same explanation. Drawing inspiration from the field of formal eXplainable AI (FXAI), we propose Abductive Latent Explanations (ALEs), a formalism to express sufficient conditions on the intermediate (latent) representation of the instance that imply the prediction. Our approach combines the inherent interpretability of case-based reasoning models and the guarantees provided by formal XAI. We propose a solver-free and scalable algorithm for generating ALEs based on three distinct paradigms, compare them, and present the feasibility of our approach on diverse datasets for both standard and fine-grained image classification. The associated code can be found at https://github.com/julsoria/ale
comment: Accepted at AAAI-26
☆ Consciousness in Artificial Intelligence? A Framework for Classifying Objections and Constraints
We develop a taxonomical framework for classifying challenges to the possibility of consciousness in digital artificial intelligence systems. This framework allows us to identify the level of granularity at which a given challenge is intended (the levels we propose correspond to Marr's levels) and to disambiguate its degree of force: is it a challenge to computational functionalism that leaves the possibility of digital consciousness open (degree 1), a practical challenge to digital consciousness that suggests improbability without claiming impossibility (degree 2), or an argument claiming that digital consciousness is strictly impossible (degree 3)? We apply this framework to 14 prominent examples from the scientific and philosophical literature. Our aim is not to take a side in the debate, but to provide structure and a tool for disambiguating between challenges to computational functionalism and challenges to digital consciousness, as well as between different ways of parsing such challenges.
comment: 12 pages, 3 figures
☆ Synthesis of Safety Specifications for Probabilistic Systems
Ensuring that agents satisfy safety specifications can be crucial in safety-critical environments. While methods exist for controller synthesis with safe temporal specifications, most existing methods restrict safe temporal specifications to probabilistic-avoidance constraints. Formal methods typically offer more expressive ways to express safety in probabilistic systems, such as Probabilistic Computation Tree Logic (PCTL) formulas. Thus, in this paper, we develop a new approach that supports more general temporal properties expressed in PCTL. Our contribution is twofold. First, we develop a theoretical framework for the Synthesis of safe-PCTL specifications. We show how the reducing global specification satisfaction to local constraints, and define CPCTL, a fragment of safe-PCTL. We demonstrate how the expressiveness of CPCTL makes it a relevant fragment for the Synthesis Problem. Second, we leverage these results and propose a new Value Iteration-based algorithm to solve the synthesis problem for these more general temporal properties, and we prove the soundness and completeness of our method.
comment: 23 pages
☆ Integrating Symbolic Natural Language Understanding and Language Models for Word Sense Disambiguation
Word sense disambiguation is a fundamental challenge in natural language understanding. Current methods are primarily aimed at coarse-grained representations (e.g. WordNet synsets or FrameNet frames) and require hand-annotated training data to construct. This makes it difficult to automatically disambiguate richer representations (e.g. built on OpenCyc) that are needed for sophisticated inference. We propose a method that uses statistical language models as oracles for disambiguation that does not require any hand-annotation of training data. Instead, the multiple candidate meanings generated by a symbolic NLU system are converted into distinguishable natural language alternatives, which are used to query an LLM to select appropriate interpretations given the linguistic context. The selected meanings are propagated back to the symbolic NLU system. We evaluate our method against human-annotated gold answers to demonstrate its effectiveness.
comment: 16 pages
☆ ECPv2: Fast, Efficient, and Scalable Global Optimization of Lipschitz Functions AAAI 2026
We propose ECPv2, a scalable and theoretically grounded algorithm for global optimization of Lipschitz-continuous functions with unknown Lipschitz constants. Building on the Every Call is Precious (ECP) framework, which ensures that each accepted function evaluation is potentially informative, ECPv2 addresses key limitations of ECP, including high computational cost and overly conservative early behavior. ECPv2 introduces three innovations: (i) an adaptive lower bound to avoid vacuous acceptance regions, (ii) a Worst-m memory mechanism that restricts comparisons to a fixed-size subset of past evaluations, and (iii) a fixed random projection to accelerate distance computations in high dimensions. We theoretically show that ECPv2 retains ECP's no-regret guarantees with optimal finite-time bounds and expands the acceptance region with high probability. We further empirically validate these findings through extensive experiments and ablation studies. Using principled hyperparameter settings, we evaluate ECPv2 across a wide range of high-dimensional, non-convex optimization problems. Across benchmarks, ECPv2 consistently matches or outperforms state-of-the-art optimizers, while significantly reducing wall-clock time.
comment: Accepted at AAAI 2026 (main technical track), extended version
☆ NutriScreener: Retrieval-Augmented Multi-Pose Graph Attention Network for Malnourishment Screening AAAI 2026
Child malnutrition remains a global crisis, yet existing screening methods are laborious and poorly scalable, hindering early intervention. In this work, we present NutriScreener, a retrieval-augmented, multi-pose graph attention network that combines CLIP-based visual embeddings, class-boosted knowledge retrieval, and context awareness to enable robust malnutrition detection and anthropometric prediction from children's images, simultaneously addressing generalizability and class imbalance. In a clinical study, doctors rated it 4.3/5 for accuracy and 4.6/5 for efficiency, confirming its deployment readiness in low-resource settings. Trained and tested on 2,141 children from AnthroVision and additionally evaluated on diverse cross-continent populations, including ARAN and an in-house collected CampusPose dataset, it achieves 0.79 recall, 0.82 AUC, and significantly lower anthropometric RMSEs, demonstrating reliable measurement in unconstrained pediatric settings. Cross-dataset results show up to 25% recall gain and up to 3.5 cm RMSE reduction using demographically matched knowledge bases. NutriScreener offers a scalable and accurate solution for early malnutrition detection in low-resource environments.
comment: Accepted in AAAI 2026 Special Track on AI for Social Impact
☆ Interfacial and bulk switching MoS2 memristors for an all-2D reservoir computing framework
In this study, we design a reservoir computing (RC) network by exploiting short- and long-term memory dynamics in Au/Ti/MoS$_2$/Au memristive devices. The temporal dynamics is engineered by controlling the thickness of the Chemical Vapor Deposited (CVD) MoS$_2$ films. Devices with a monolayer (1L)-MoS$_2$ film exhibit volatile (short-term memory) switching dynamics. We also report non-volatile resistance switching with excellent uniformity and analog behavior in conductance tuning for the multilayer (ML) MoS$_2$ memristive devices. We correlate this performance with trap-assisted space-charge limited conduction (SCLC) mechanism, leading to a bulk-limited resistance switching behavior. Four-bit reservoir states are generated using volatile memristors. The readout layer is implemented with an array of nonvolatile synapses. This small RC network achieves 89.56\% precision in a spoken-digit recognition task and is also used to analyze a nonlinear time series equation.
☆ Utilizing Large Language Models for Zero-Shot Medical Ontology Extension from Clinical Notes
Integrating novel medical concepts and relationships into existing ontologies can significantly enhance their coverage and utility for both biomedical research and clinical applications. Clinical notes, as unstructured documents rich with detailed patient observations, offer valuable context-specific insights and represent a promising yet underutilized source for ontology extension. Despite this potential, directly leveraging clinical notes for ontology extension remains largely unexplored. To address this gap, we propose CLOZE, a novel framework that uses large language models (LLMs) to automatically extract medical entities from clinical notes and integrate them into hierarchical medical ontologies. By capitalizing on the strong language understanding and extensive biomedical knowledge of pre-trained LLMs, CLOZE effectively identifies disease-related concepts and captures complex hierarchical relationships. The zero-shot framework requires no additional training or labeled data, making it a cost-efficient solution. Furthermore, CLOZE ensures patient privacy through automated removal of protected health information (PHI). Experimental results demonstrate that CLOZE provides an accurate, scalable, and privacy-preserving ontology extension framework, with strong potential to support a wide range of downstream applications in biomedical research and clinical informatics.
comment: BIBM 2025 (WS#44: Biological ontologies and knowledge bases (BiOK) in the LLM era)
☆ WER is Unaware: Assessing How ASR Errors Distort Clinical Understanding in Patient Facing Dialogue
As Automatic Speech Recognition (ASR) is increasingly deployed in clinical dialogue, standard evaluations still rely heavily on Word Error Rate (WER). This paper challenges that standard, investigating whether WER or other common metrics correlate with the clinical impact of transcription errors. We establish a gold-standard benchmark by having expert clinicians compare ground-truth utterances to their ASR-generated counterparts, labeling the clinical impact of any discrepancies found in two distinct doctor-patient dialogue datasets. Our analysis reveals that WER and a comprehensive suite of existing metrics correlate poorly with the clinician-assigned risk labels (No, Minimal, or Significant Impact). To bridge this evaluation gap, we introduce an LLM-as-a-Judge, programmatically optimized using GEPA to replicate expert clinical assessment. The optimized judge (Gemini-2.5-Pro) achieves human-comparable performance, obtaining 90% accuracy and a strong Cohen's $κ$ of 0.816. This work provides a validated, automated framework for moving ASR evaluation beyond simple textual fidelity to a necessary, scalable assessment of safety in clinical dialogue.
☆ The Oracle and The Prism: A Decoupled and Efficient Framework for Generative Recommendation Explanation
The integration of Large Language Models (LLMs) into explainable recommendation systems often leads to a performance-efficiency trade-off in end-to-end architectures, where joint optimization of ranking and explanation can result in suboptimal compromises. To resolve this, we propose Prism, a novel decoupled framework that rigorously separates the recommendation process into a dedicated ranking stage and an explanation generation stage. Inspired by knowledge distillation, Prism leverages a powerful teacher LLM (e.g., FLAN-T5-XXL) as an Oracle to produce high-fidelity explanatory knowledge. A compact, fine-tuned student model (e.g., BART-Base), the Prism, then specializes in synthesizing this knowledge into personalized explanations. This decomposition ensures that each component is optimized for its specific objective, eliminating inherent conflicts in coupled models. Extensive experiments on benchmark datasets demonstrate that our 140M-parameter Prism model significantly outperforms its 11B-parameter teacher in human evaluations of faithfulness and personalization, while achieving a 24 times speedup and a 10 times reduction in memory consumption during inference. These results validate that decoupling, coupled with targeted distillation, provides an efficient and effective pathway to high-quality explainable recommendation.
comment: 11 pages,3 figures
☆ Supervised Contrastive Learning for Few-Shot AI-Generated Image Detection and Attribution
The rapid advancement of generative artificial intelligence has enabled the creation of synthetic images that are increasingly indistinguishable from authentic content, posing significant challenges for digital media integrity. This problem is compounded by the accelerated release cycle of novel generative models, which renders traditional detection approaches (reliant on periodic retraining) computationally infeasible and operationally impractical. This work proposes a novel two-stage detection framework designed to address the generalization challenge inherent in synthetic image detection. The first stage employs a vision deep learning model trained via supervised contrastive learning to extract discriminative embeddings from input imagery. Critically, this model was trained on a strategically partitioned subset of available generators, with specific architectures withheld from training to rigorously ablate cross-generator generalization capabilities. The second stage utilizes a k-nearest neighbors (k-NN) classifier operating on the learned embedding space, trained in a few-shot learning paradigm incorporating limited samples from previously unseen test generators. With merely 150 images per class in the few-shot learning regime, which are easily obtainable from current generation models, the proposed framework achieves an average detection accuracy of 91.3\%, representing a 5.2 percentage point improvement over existing approaches . For the source attribution task, the proposed approach obtains improvements of of 14.70\% and 4.27\% in AUC and OSCR respectively on an open set classification context, marking a significant advancement toward robust, scalable forensic attribution systems capable of adapting to the evolving generative AI landscape without requiring exhaustive retraining protocols.
comment: 17 pages, 6 figures, 6 tables
☆ TurkColBERT: A Benchmark of Dense and Late-Interaction Models for Turkish Information Retrieval
Neural information retrieval systems excel in high-resource languages but remain underexplored for morphologically rich, lower-resource languages such as Turkish. Dense bi-encoders currently dominate Turkish IR, yet late-interaction models -- which retain token-level representations for fine-grained matching -- have not been systematically evaluated. We introduce TurkColBERT, the first comprehensive benchmark comparing dense encoders and late-interaction models for Turkish retrieval. Our two-stage adaptation pipeline fine-tunes English and multilingual encoders on Turkish NLI/STS tasks, then converts them into ColBERT-style retrievers using PyLate trained on MS MARCO-TR. We evaluate 10 models across five Turkish BEIR datasets covering scientific, financial, and argumentative domains. Results show strong parameter efficiency: the 1.0M-parameter colbert-hash-nano-tr is 600$\times$ smaller than the 600M turkish-e5-large dense encoder while preserving over 71\% of its average mAP. Late-interaction models that are 3--5$\times$ smaller than dense encoders significantly outperform them; ColmmBERT-base-TR yields up to +13.8\% mAP on domain-specific tasks. For production-readiness, we compare indexing algorithms: MUVERA+Rerank is 3.33$\times$ faster than PLAID and offers +1.7\% relative mAP gain. This enables low-latency retrieval, with ColmmBERT-base-TR achieving 0.54 ms query times under MUVERA. We release all checkpoints, configs, and evaluation scripts. Limitations include reliance on moderately sized datasets ($\leq$50K documents) and translated benchmarks, which may not fully reflect real-world Turkish retrieval conditions; larger-scale MUVERA evaluations remain necessary.
☆ ODE-ViT: Plug & Play Attention Layer from the Generalization of the ViT as an Ordinary Differential Equation
In recent years, increasingly large models have achieved outstanding performance across CV tasks. However, these models demand substantial computational resources and storage, and their growing complexity limits our understanding of how they make decisions. Most of these architectures rely on the attention mechanism within Transformer-based designs. Building upon the connection between residual neural networks and ordinary differential equations (ODEs), we introduce ODE-ViT, a Vision Transformer reformulated as an ODE system that satisfies the conditions for well-posed and stable dynamics. Experiments on CIFAR-10 and CIFAR-100 demonstrate that ODE-ViT achieves stable, interpretable, and competitive performance with up to one order of magnitude fewer parameters, surpassing prior ODE-based Transformer approaches in classification tasks. We further propose a plug-and-play teacher-student framework in which a discrete ViT guides the continuous trajectory of ODE-ViT by treating the intermediate representations of the teacher as solutions of the ODE. This strategy improves performance by more than 10% compared to training a free ODE-ViT from scratch.
☆ Physics-Informed Machine Learning for Efficient Sim-to-Real Data Augmentation in Micro-Object Pose Estimation
Precise pose estimation of optical microrobots is essential for enabling high-precision object tracking and autonomous biological studies. However, current methods rely heavily on large, high-quality microscope image datasets, which are difficult and costly to acquire due to the complexity of microrobot fabrication and the labour-intensive labelling. Digital twin systems offer a promising path for sim-to-real data augmentation, yet existing techniques struggle to replicate complex optical microscopy phenomena, such as diffraction artifacts and depth-dependent imaging.This work proposes a novel physics-informed deep generative learning framework that, for the first time, integrates wave optics-based physical rendering and depth alignment into a generative adversarial network (GAN), to synthesise high-fidelity microscope images for microrobot pose estimation efficiently. Our method improves the structural similarity index (SSIM) by 35.6% compared to purely AI-driven methods, while maintaining real-time rendering speeds (0.022 s/frame).The pose estimator (CNN backbone) trained on our synthetic data achieves 93.9%/91.9% (pitch/roll) accuracy, just 5.0%/5.4% (pitch/roll) below that of an estimator trained exclusively on real data. Furthermore, our framework generalises to unseen poses, enabling data augmentation and robust pose estimation for novel microrobot configurations without additional training data.
☆ LLM4EO: Large Language Model for Evolutionary Optimization in Flexible Job Shop Scheduling
Customized static operator design has enabled widespread application of Evolutionary Algorithms (EAs), but their search performance is transient during iterations and prone to degradation. Dynamic operators aim to address this but typically rely on predefined designs and localized parameter control during the search process, lacking adaptive optimization throughout evolution. To overcome these limitations, this work leverages Large Language Models (LLMs) to perceive evolutionary dynamics and enable operator-level meta-evolution. The proposed framework, LLMs for Evolutionary Optimization (LLM4EO), comprises three components: knowledge-transfer-based operator design, evolution perception and analysis, and adaptive operator evolution. Firstly, initialization of operators is performed by transferring the strengths of classical operators via LLMs. Then, search preferences and potential limitations of operators are analyzed by integrating fitness performance and evolutionary features, accompanied by corresponding suggestions for improvement. Upon stagnation of population evolution, gene selection priorities of operators are dynamically optimized via improvement prompting strategies. This approach achieves co-evolution of populations and operators in the search, introducing a novel paradigm for enhancing the efficiency and adaptability of EAs. Finally, a series of validations on multiple benchmark datasets of the flexible job shop scheduling problem demonstrate that LLM4EO accelerates population evolution and outperforms both mainstream evolutionary programming and traditional EAs.
☆ Large Language Model-Based Reward Design for Deep Reinforcement Learning-Driven Autonomous Cyber Defense AAAI-26
Designing rewards for autonomous cyber attack and defense learning agents in a complex, dynamic environment is a challenging task for subject matter experts. We propose a large language model (LLM)-based reward design approach to generate autonomous cyber defense policies in a deep reinforcement learning (DRL)-driven experimental simulation environment. Multiple attack and defense agent personas were crafted, reflecting heterogeneity in agent actions, to generate LLM-guided reward designs where the LLM was first provided with contextual cyber simulation environment information. These reward structures were then utilized within a DRL-driven attack-defense simulation environment to learn an ensemble of cyber defense policies. Our results suggest that LLM-guided reward designs can lead to effective defense strategies against diverse adversarial behaviors.
comment: Accepted in the AAAI-26 Workshop on Artificial Intelligence for Cyber Security (AICS)
☆ Correlation-Aware Feature Attribution Based Explainable AI
Explainable AI (XAI) is increasingly essential as modern models become more complex and high-stakes applications demand transparency, trust, and regulatory compliance. Existing global attribution methods often incur high computational costs, lack stability under correlated inputs, and fail to scale efficiently to large or heterogeneous datasets. We address these gaps with \emph{ExCIR} (Explainability through Correlation Impact Ratio), a correlation-aware attribution score equipped with a lightweight transfer protocol that reproduces full-model rankings using only a fraction of the data. ExCIR quantifies sign-aligned co-movement between features and model outputs after \emph{robust centering} (subtracting a robust location estimate, e.g., median or mid-mean, from features and outputs). We further introduce \textsc{BlockCIR}, a \emph{groupwise} extension of ExCIR that scores \emph{sets} of correlated features as a single unit. By aggregating the same signed-co-movement numerators and magnitudes over predefined or data-driven groups, \textsc{BlockCIR} mitigates double-counting in collinear clusters (e.g., synonyms or duplicated sensors) and yields smoother, more stable rankings when strong dependencies are present. Across diverse text, tabular, signal, and image datasets, ExCIR shows trustworthy agreement with established global baselines and the full model, delivers consistent top-$k$ rankings across settings, and reduces runtime via lightweight evaluation on a subset of rows. Overall, ExCIR provides \emph{computationally efficient}, \emph{consistent}, and \emph{scalable} explainability for real-world deployment.
comment: Accepted, 2026 International Conference on Advances in Artificial Intelligence and Machine Learning (AAIML 2026)
☆ Anatomy of an Idiom: Tracing Non-Compositionality in Language Models
We investigate the processing of idiomatic expressions in transformer-based language models using a novel set of techniques for circuit discovery and analysis. First discovering circuits via a modified path patching algorithm, we find that idiom processing exhibits distinct computational patterns. We identify and investigate ``Idiom Heads,'' attention heads that frequently activate across different idioms, as well as enhanced attention between idiom tokens due to earlier processing, which we term ``augmented reception.'' We analyze these phenomena and the general features of the discovered circuits as mechanisms by which transformers balance computational efficiency and robustness. Finally, these findings provide insights into how transformers handle non-compositional language and suggest pathways for understanding the processing of more complex grammatical constructions.
☆ VLA-Pruner: Temporal-Aware Dual-Level Visual Token Pruning for Efficient Vision-Language-Action Inference
Vision-Language-Action (VLA) models have shown great promise for embodied AI, yet the heavy computational cost of processing continuous visual streams severely limits their real-time deployment. Token pruning (keeping salient visual tokens and dropping redundant ones) has emerged as an effective approach for accelerating Vision-Language Models (VLMs), offering a solution for efficient VLA. However, these VLM-specific token pruning methods select tokens based solely on semantic salience metrics (e.g., prefill attention), while overlooking the VLA's intrinsic dual-system nature of high-level semantic understanding and low-level action execution. Consequently, these methods bias token retention toward semantic cues, discard critical information for action generation, and significantly degrade VLA performance. To bridge this gap, we propose VLA-Pruner, a versatile plug-and-play VLA-specific token prune method that aligns with the dual-system nature of VLA models and exploits the temporal continuity in robot manipulation. Specifically, VLA-Pruner adopts a dual-level importance criterion for visual token retention: vision-language prefill attention for semantic-level relevance and action decode attention, estimated via temporal smoothing, for action-level importance. Based on this criterion, VLA-Pruner proposes a novel dual-level token selection strategy that adaptively preserves a compact, informative set of visual tokens for both semantic understanding and action execution under given compute budget. Experiments show that VLA-Pruner achieves state-of-the-art performance across multiple VLA architectures and diverse robotic tasks.
☆ PersonaDrift: A Benchmark for Temporal Anomaly Detection in Language-Based Dementia Monitoring
People living with dementia (PLwD) often show gradual shifts in how they communicate, becoming less expressive, more repetitive, or drifting off-topic in subtle ways. While caregivers may notice these changes informally, most computational tools are not designed to track such behavioral drift over time. This paper introduces PersonaDrift, a synthetic benchmark designed to evaluate machine learning and statistical methods for detecting progressive changes in daily communication, focusing on user responses to a digital reminder system. PersonaDrift simulates 60-day interaction logs for synthetic users modeled after real PLwD, based on interviews with caregivers. These caregiver-informed personas vary in tone, modality, and communication habits, enabling realistic diversity in behavior. The benchmark focuses on two forms of longitudinal change that caregivers highlighted as particularly salient: flattened sentiment (reduced emotional tone and verbosity) and off-topic replies (semantic drift). These changes are injected progressively at different rates to emulate naturalistic cognitive trajectories, and the framework is designed to be extensible to additional behaviors in future use cases. To explore this novel application space, we evaluate several anomaly detection approaches, unsupervised statistical methods (CUSUM, EWMA, One-Class SVM), sequence models using contextual embeddings (GRU + BERT), and supervised classifiers in both generalized and personalized settings. Preliminary results show that flattened sentiment can often be detected with simple statistical models in users with low baseline variability, while detecting semantic drift requires temporal modeling and personalized baselines. Across both tasks, personalized classifiers consistently outperform generalized ones, highlighting the importance of individual behavioral context.
☆ From generative AI to the brain: five takeaways
The big strides seen in generative AI are not based on somewhat obscure algorithms, but due to clearly defined generative principles. The resulting concrete implementations have proven themselves in large numbers of applications. We suggest that it is imperative to thoroughly investigate which of these generative principles may be operative also in the brain, and hence relevant for cognitive neuroscience. In addition, ML research led to a range of interesting characterizations of neural information processing systems. We discuss five examples, the shortcomings of world modelling, the generation of thought processes, attention, neural scaling laws, and quantization, that illustrate how much neuroscience could potentially learn from ML research.
comment: Frontiers in Computational Neuroscience, in press
☆ Generative Modeling of Clinical Time Series via Latent Stochastic Differential Equations
Clinical time series data from electronic health records and medical registries offer unprecedented opportunities to understand patient trajectories and inform medical decision-making. However, leveraging such data presents significant challenges due to irregular sampling, complex latent physiology, and inherent uncertainties in both measurements and disease progression. To address these challenges, we propose a generative modeling framework based on latent neural stochastic differential equations (SDEs) that views clinical time series as discrete-time partial observations of an underlying controlled stochastic dynamical system. Our approach models latent dynamics via neural SDEs with modality-dependent emission models, while performing state estimation and parameter learning through variational inference. This formulation naturally handles irregularly sampled observations, learns complex non-linear interactions, and captures the stochasticity of disease progression and measurement noise within a unified scalable probabilistic framework. We validate the framework on two complementary tasks: (i) individual treatment effect estimation using a simulated pharmacokinetic-pharmacodynamic (PKPD) model of lung cancer, and (ii) probabilistic forecasting of physiological signals using real-world intensive care unit (ICU) data from 12,000 patients. Results show that our framework outperforms ordinary differential equation and long short-term memory baseline models in accuracy and uncertainty estimation. These results highlight its potential for enabling precise, uncertainty-aware predictions to support clinical decision-making.
☆ TOFA: Training-Free One-Shot Federated Adaptation for Vision-Language Models AAAI 2026
Efficient and lightweight adaptation of pre-trained Vision-Language Models (VLMs) to downstream tasks through collaborative interactions between local clients and a central server is a rapidly emerging research topic in federated learning. Existing adaptation algorithms are typically trained iteratively, which incur significant communication costs and increase the susceptibility to potential attacks. Motivated by the one-shot federated training techniques that reduce client-server exchanges to a single round, developing a lightweight one-shot federated VLM adaptation method to alleviate these issues is particularly attractive. However, current one-shot approaches face certain challenges in adapting VLMs within federated settings: (1) insufficient exploitation of the rich multimodal information inherent in VLMs; (2) lack of specialized adaptation strategies to systematically handle the severe data heterogeneity; and (3) requiring additional training resource of clients or server. To bridge these gaps, we propose a novel Training-free One-shot Federated Adaptation framework for VLMs, named TOFA. To fully leverage the generalizable multimodal features in pre-trained VLMs, TOFA employs both visual and textual pipelines to extract task-relevant representations. In the visual pipeline, a hierarchical Bayesian model learns personalized, class-specific prototype distributions. For the textual pipeline, TOFA evaluates and globally aligns the generated local text prompts for robustness. An adaptive weight calibration mechanism is also introduced to combine predictions from both modalities, balancing personalization and robustness to handle data heterogeneity. Our method is training-free, not relying on additional training resources on either the client or server side. Extensive experiments across 9 datasets in various federated settings demonstrate the effectiveness of the proposed TOFA method.
comment: Accepted by AAAI 2026
☆ Pharos-ESG: A Framework for Multimodal Parsing, Contextual Narration, and Hierarchical Labeling of ESG Report AAAI 26
Environmental, Social, and Governance (ESG) principles are reshaping the foundations of global financial gover- nance, transforming capital allocation architectures, regu- latory frameworks, and systemic risk coordination mecha- nisms. However, as the core medium for assessing corpo- rate ESG performance, the ESG reports present significant challenges for large-scale understanding, due to chaotic read- ing order from slide-like irregular layouts and implicit hier- archies arising from lengthy, weakly structured content. To address these challenges, we propose Pharos-ESG, a uni- fied framework that transforms ESG reports into structured representations through multimodal parsing, contextual nar- ration, and hierarchical labeling. It integrates a reading-order modeling module based on layout flow, hierarchy-aware seg- mentation guided by table-of-contents anchors, and a multi- modal aggregation pipeline that contextually transforms vi- sual elements into coherent natural language. The framework further enriches its outputs with ESG, GRI, and sentiment labels, yielding annotations aligned with the analytical de- mands of financial research. Extensive experiments on anno- tated benchmarks demonstrate that Pharos-ESG consistently outperforms both dedicated document parsing systems and general-purpose multimodal models. In addition, we release Aurora-ESG, the first large-scale public dataset of ESG re- ports, spanning Mainland China, Hong Kong, and U.S. mar- kets, featuring unified structured representations of multi- modal content, enriched with fine-grained layout and seman- tic annotations to better support ESG integration in financial governance and decision-making.
comment: Accepted to AAAI 26:main technical track Oral
☆ Trustworthy AI in the Agentic Lakehouse: from Concurrency to Governance AAAI26
Even as AI capabilities improve, most enterprises do not consider agents trustworthy enough to work on production data. In this paper, we argue that the path to trustworthy agentic workflows begins with solving the infrastructure problem first: traditional lakehouses are not suited for agent access patterns, but if we design one around transactions, governance follows. In particular, we draw an operational analogy to MVCC in databases and show why a direct transplant fails in a decoupled, multi-language setting. We then propose an agent-first design, Bauplan, that reimplements data and compute isolation in the lakehouse. We conclude by sharing a reference implementation of a self-healing pipeline in Bauplan, which seamlessly couples agent reasoning with all the desired guarantees for correctness and trust.
comment: AAAI26, pre-print of paper accepted at the Trustworthy Agentic AI Workshop
☆ Collaborative Management for Chronic Diseases and Depression: A Double Heterogeneity-based Multi-Task Learning Method
Wearable sensor technologies and deep learning are transforming healthcare management. Yet, most health sensing studies focus narrowly on physical chronic diseases. This overlooks the critical need for joint assessment of comorbid physical chronic diseases and depression, which is essential for collaborative chronic care. We conceptualize multi-disease assessment, including both physical diseases and depression, as a multi-task learning (MTL) problem, where each disease assessment is modeled as a task. This joint formulation leverages inter-disease relationships to improve accuracy, but it also introduces the challenge of double heterogeneity: chronic diseases differ in their manifestation (disease heterogeneity), and patients with the same disease show varied patterns (patient heterogeneity). To address these issues, we first adopt existing techniques and propose a base method. Given the limitations of the base method, we further propose an Advanced Double Heterogeneity-based Multi-Task Learning (ADH-MTL) method that improves the base method through three innovations: (1) group-level modeling to support new patient predictions, (2) a decomposition strategy to reduce model complexity, and (3) a Bayesian network that explicitly captures dependencies while balancing similarities and differences across model components. Empirical evaluations on real-world wearable sensor data demonstrate that ADH-MTL significantly outperforms existing baselines, and each of its innovations is shown to be effective. This study contributes to health information systems by offering a computational solution for integrated physical and mental healthcare and provides design principles for advancing collaborative chronic disease management across the pre-treatment, treatment, and post-treatment phases.
☆ CorrectHDL: Agentic HDL Design with LLMs Leveraging High-Level Synthesis as Reference
Large Language Models (LLMs) have demonstrated remarkable potential in hardware front-end design using hardware description languages (HDLs). However, their inherent tendency toward hallucination often introduces functional errors into the generated HDL designs. To address this issue, we propose the framework CorrectHDL that leverages high-level synthesis (HLS) results as functional references to correct potential errors in LLM-generated HDL designs.The input to the proposed framework is a C/C++ program that specifies the target circuit's functionality. The program is provided to an LLM to directly generate an HDL design, whose syntax errors are repaired using a Retrieval-Augmented Generation (RAG) mechanism. The functional correctness of the LLM-generated circuit is iteratively improved by comparing its simulated behavior with an HLS reference design produced by conventional HLS tools, which ensures the functional correctness of the result but can lead to suboptimal area and power efficiency. Experimental results demonstrate that circuits generated by the proposed framework achieve significantly better area and power efficiency than conventional HLS designs and approach the quality of human-engineered circuits. Meanwhile, the correctness of the resulting HDL implementation is maintained, highlighting the effectiveness and potential of agentic HDL design leveraging the generative capabilities of LLMs and the rigor of traditional correctness-driven IC design flows.
comment: 7 pages, 15 figures, 2 tables
☆ Robot Metacognition: Decision Making with Confidence for Tool Invention
Robots today often miss a key ingredient of truly intelligent behavior: the ability to reflect on their own cognitive processes and decisions. In humans, this self-monitoring or metacognition is crucial for learning, decision making and problem solving. For instance, they can evaluate how confident they are in performing a task, thus regulating their own behavior and allocating proper resources. Taking inspiration from neuroscience, we propose a robot metacognition architecture centered on confidence (a second-order judgment on decisions) and we demonstrate it on the use case of autonomous tool invention. We propose the use of confidence as a metacognitive measure within the robot decision making scheme. Confidence-informed robots can evaluate the reliability of their decisions, improving their robustness during real-world physical deployment. This form of robotic metacognition emphasizes embodied action monitoring as a means to achieve better informed decisions. We also highlight potential applications and research directions for robot metacognition.
comment: under review
☆ An Agent-Based Framework for the Automatic Validation of Mathematical Optimization Models
Recently, using Large Language Models (LLMs) to generate optimization models from natural language descriptions has became increasingly popular. However, a major open question is how to validate that the generated models are correct and satisfy the requirements defined in the natural language description. In this work, we propose a novel agent-based method for automatic validation of optimization models that builds upon and extends methods from software testing to address optimization modeling . This method consists of several agents that initially generate a problem-level testing API, then generate tests utilizing this API, and, lastly, generate mutations specific to the optimization model (a well-known software testing technique assessing the fault detection power of the test suite). In this work, we detail this validation framework and show, through experiments, the high quality of validation provided by this agent ensemble in terms of the well-known software testing measure called mutation coverage.
☆ Are Foundation Models Useful for Bankruptcy Prediction? NeurIPS 2025
Foundation models have shown promise across various financial applications, yet their effectiveness for corporate bankruptcy prediction remains systematically unevaluated against established methods. We study bankruptcy forecasting using Llama-3.3-70B-Instruct and TabPFN, evaluated on large, highly imbalanced datasets of over one million company records from the Visegrád Group. We provide the first systematic comparison of foundation models against classical machine learning baselines for this task. Our results show that models such as XGBoost and CatBoost consistently outperform foundation models across all prediction horizons. LLM-based approaches suffer from unreliable probability estimates, undermining their use in risk-sensitive financial settings. TabPFN, while competitive with simpler baselines, requires substantial computational resources with costs not justified by performance gains. These findings suggest that, despite their generality, current foundation models remain less effective than specialized methods for bankruptcy forecasting.
comment: NeurIPS 2025 Workshop: Generative AI in Finance
☆ Reducing Instability in Synthetic Data Evaluation with a Super-Metric in MalDataGen
Evaluating the quality of synthetic data remains a persistent challenge in the Android malware domain due to instability and the lack of standardization among existing metrics. This work integrates into MalDataGen a Super-Metric that aggregates eight metrics across four fidelity dimensions, producing a single weighted score. Experiments involving ten generative models and five balanced datasets demonstrate that the Super-Metric is more stable and consistent than traditional metrics, exhibiting stronger correlations with the actual performance of classifiers.
comment: 5 pages, 3 figures, submitted to ERRC/WRSeg 2025
☆ Learning from Sufficient Rationales: Analysing the Relationship Between Explanation Faithfulness and Token-level Regularisation Strategies AACL 2025
Human explanations of natural language, rationales, form a tool to assess whether models learn a label for the right reasons or rely on dataset-specific shortcuts. Sufficiency is a common metric for estimating the informativeness of rationales, but it provides limited insight into the effects of rationale information on model performance. We address this limitation by relating sufficiency to two modelling paradigms: the ability of models to identify which tokens are part of the rationale (through token classification) and the ability of improving model performance by incorporating rationales in the input (through attention regularisation). We find that highly informative rationales are not likely to help classify the instance correctly. Sufficiency conversely captures the classification impact of the non-rationalised context, which interferes with rationale information in the same input. We also find that incorporating rationale information in model inputs can boost cross-domain classification, but results are inconsistent per task and model type. Finally, sufficiency and token classification appear to be unrelated. These results exemplify the complexity of rationales, showing that metrics capable of systematically capturing this type of information merit further investigation.
comment: Long paper accepted to the main conference of AACL 2025. Please cite the conference proceedings when available
☆ OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe
Recent advancements in large reasoning models have fueled growing interest in extending such capabilities to multimodal domains. However, despite notable progress in visual reasoning, the lack of transparent and reproducible data curation and training strategies remains a major barrier to scalable research. In this work, we introduce OpenMMReasoner, a fully transparent two-stage recipe for multimodal reasoning spanning supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, we construct an 874K-sample cold-start dataset with rigorous step-by-step validation, providing a strong foundation for reasoning capabilities. The subsequent RL stage leverages a 74K-sample dataset across diverse domains to further sharpen and stabilize these abilities, resulting in a more robust and efficient learning process. Extensive evaluations demonstrate that our training recipe not only surpasses strong baselines but also highlights the critical role of data quality and training design in shaping multimodal reasoning performance. Notably, our method achieves a 11.6% improvement over the Qwen2.5-VL-7B-Instruct baseline across nine multimodal reasoning benchmarks, establishing a solid empirical foundation for future large-scale multimodal reasoning research. We open-sourced all our codes, pipeline, and data at https://github.com/EvolvingLMMs-Lab/OpenMMReasoner.
☆ SDA: Steering-Driven Distribution Alignment for Open LLMs without Fine-Tuning
With the rapid advancement of large language models (LLMs), their deployment in real-world applications has become increasingly widespread. LLMs are expected to deliver robust performance across diverse tasks, user preferences, and practical scenarios. However, as demands grow, ensuring that LLMs produce responses aligned with human intent remains a foundational challenge. In particular, aligning model behavior effectively and efficiently during inference, without costly retraining or extensive supervision, is both a critical requirement and a non-trivial technical endeavor. To address the challenge, we propose SDA (Steering-Driven Distribution Alignment), a training-free and model-agnostic alignment framework designed for open-source LLMs. SDA dynamically redistributes model output probabilities based on user-defined alignment instructions, enhancing alignment between model behavior and human intents without fine-tuning. The method is lightweight, resource-efficient, and compatible with a wide range of open-source LLMs. It can function independently during inference or be integrated with training-based alignment strategies. Moreover, SDA supports personalized preference alignment, enabling flexible control over the model response behavior. Empirical results demonstrate that SDA consistently improves alignment performance across 8 open-source LLMs with varying scales and diverse origins, evaluated on three key alignment dimensions, helpfulness, harmlessness, and honesty (3H). Specifically, SDA achieves average gains of 64.4% in helpfulness, 30% in honesty and 11.5% in harmlessness across the tested models, indicating its effectiveness and generalization across diverse models and application scenarios.
☆ Distributed Agent Reasoning Across Independent Systems With Strict Data Locality
This paper presents a proof-of-concept demonstration of agent-to-agent communication across distributed systems, using only natural-language messages and without shared identifiers, structured schemas, or centralised data exchange. The prototype explores how multiple organisations (represented here as a Clinic, Insurer, and Specialist Network) can cooperate securely via pseudonymised case tokens, local data lookups, and controlled operational boundaries. The system uses Orpius as the underlying platform for multi-agent orchestration, tool execution, and privacy-preserving communication. All agents communicate through OperationRelay calls, exchanging concise natural-language summaries. Each agent operates on its own data (such as synthetic clinic records, insurance enrolment tables, and clinical guidance extracts), and none receives or reconstructs patient identity. The Clinic computes an HMAC-based pseudonymous token, the Insurer evaluates coverage rules and consults the Specialist agent, and the Specialist returns an appropriateness recommendation. The goal of this prototype is intentionally limited: to demonstrate feasibility, not to provide a clinically validated, production-ready system. No clinician review was conducted, and no evaluation beyond basic functional runs was performed. The work highlights architectural patterns, privacy considerations, and communication flows that enable distributed reasoning among specialised agents while keeping data local to each organisation. We conclude by outlining opportunities for more rigorous evaluation and future research in decentralised multi-agent systems.
comment: 27 pages, 6 figures
☆ MuISQA: Multi-Intent Retrieval-Augmented Generation for Scientific Question Answering
Complex scientific questions often entail multiple intents, such as identifying gene mutations and linking them to related diseases. These tasks require evidence from diverse sources and multi-hop reasoning, while conventional retrieval-augmented generation (RAG) systems are usually single-intent oriented, leading to incomplete evidence coverage. To assess this limitation, we introduce the Multi-Intent Scientific Question Answering (MuISQA) benchmark, which is designed to evaluate RAG systems on heterogeneous evidence coverage across sub-questions. In addition, we propose an intent-aware retrieval framework that leverages large language models (LLMs) to hypothesize potential answers, decompose them into intent-specific queries, and retrieve supporting passages for each underlying intent. The retrieved fragments are then aggregated and re-ranked via Reciprocal Rank Fusion (RRF) to balance coverage across diverse intents while reducing redundancy. Experiments on both MuISQA benchmark and other general RAG datasets demonstrate that our method consistently outperforms conventional approaches, particularly in retrieval accuracy and evidence coverage.
comment: 15 pages
☆ "To Survive, I Must Defect": Jailbreaking LLMs via the Game-Theory Scenarios
As LLMs become more common, non-expert users can pose risks, prompting extensive research into jailbreak attacks. However, most existing black-box jailbreak attacks rely on hand-crafted heuristics or narrow search spaces, which limit scalability. Compared with prior attacks, we propose Game-Theory Attack (GTA), an scalable black-box jailbreak framework. Concretely, we formalize the attacker's interaction against safety-aligned LLMs as a finite-horizon, early-stoppable sequential stochastic game, and reparameterize the LLM's randomized outputs via quantal response. Building on this, we introduce a behavioral conjecture "template-over-safety flip": by reshaping the LLM's effective objective through game-theoretic scenarios, the originally safety preference may become maximizing scenario payoffs within the template, which weakens safety constraints in specific contexts. We validate this mechanism with classical game such as the disclosure variant of the Prisoner's Dilemma, and we further introduce an Attacker Agent that adaptively escalates pressure to increase the ASR. Experiments across multiple protocols and datasets show that GTA achieves over 95% ASR on LLMs such as Deepseek-R1, while maintaining efficiency. Ablations over components, decoding, multilingual settings, and the Agent's core model confirm effectiveness and generalization. Moreover, scenario scaling studies further establish scalability. GTA also attains high ASR on other game-theoretic scenarios, and one-shot LLM-generated variants that keep the model mechanism fixed while varying background achieve comparable ASR. Paired with a Harmful-Words Detection Agent that performs word-level insertions, GTA maintains high ASR while lowering detection under prompt-guard models. Beyond benchmarks, GTA jailbreaks real-world LLM applications and reports a longitudinal safety monitoring of popular HuggingFace LLMs.
comment: 20 pages
☆ SeSE: A Structural Information-Guided Uncertainty Quantification Framework for Hallucination Detection in LLMs
Reliable uncertainty quantification (UQ) is essential for deploying large language models (LLMs) in safety-critical scenarios, as it enables them to abstain from responding when uncertain, thereby avoiding hallucinating falsehoods. However, state-of-the-art UQ methods primarily rely on semantic probability distributions or pairwise distances, overlooking latent semantic structural information that could enable more precise uncertainty estimates. This paper presents Semantic Structural Entropy (SeSE), a principled UQ framework that quantifies the inherent semantic uncertainty of LLMs from a structural information perspective for hallucination detection. Specifically, to effectively model semantic spaces, we first develop an adaptively sparsified directed semantic graph construction algorithm that captures directional semantic dependencies while automatically pruning unnecessary connections that introduce negative interference. We then exploit latent semantic structural information through hierarchical abstraction: SeSE is defined as the structural entropy of the optimal semantic encoding tree, formalizing intrinsic uncertainty within semantic spaces after optimal compression. A higher SeSE value corresponds to greater uncertainty, indicating that LLMs are highly likely to generate hallucinations. In addition, to enhance fine-grained UQ in long-form generation -- where existing methods often rely on heuristic sample-and-count techniques -- we extend SeSE to quantify the uncertainty of individual claims by modeling their random semantic interactions, providing theoretically explicable hallucination detection. Extensive experiments across 29 model-dataset combinations show that SeSE significantly outperforms advanced UQ baselines, including strong supervised methods and the recently proposed KLE.
comment: 14 pages of main text and 10 pages of appendices
☆ Revisiting Fairness-aware Interactive Recommendation: Item Lifecycle as a Control Knob
This paper revisits fairness-aware interactive recommendation (e.g., TikTok, KuaiShou) by introducing a novel control knob, i.e., the lifecycle of items. We make threefold contributions. First, we conduct a comprehensive empirical analysis and uncover that item lifecycles in short-video platforms follow a compressed three-phase pattern, i.e., rapid growth, transient stability, and sharp decay, which significantly deviates from the classical four-stage model (introduction, growth, maturity, decline). Second, we introduce LHRL, a lifecycle-aware hierarchical reinforcement learning framework that dynamically harmonizes fairness and accuracy by leveraging phase-specific exposure dynamics. LHRL consists of two key components: (1) PhaseFormer, a lightweight encoder combining STL decomposition and attention mechanisms for robust phase detection; (2) a two-level HRL agent, where the high-level policy imposes phase-aware fairness constraints, and the low-level policy optimizes immediate user engagement. This decoupled optimization allows for effective reconciliation between long-term equity and short-term utility. Third, experiments on multiple real-world interactive recommendation datasets demonstrate that LHRL significantly improves both fairness and user engagement. Furthermore, the integration of lifecycle-aware rewards into existing RL-based models consistently yields performance gains, highlighting the generalizability and practical value of our approach.
comment: 8 pages, 5 figures, conference
☆ Q-MLLM: Vector Quantization for Robust Multimodal Large Language Model Security NDSS 2026
Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in cross-modal understanding, but remain vulnerable to adversarial attacks through visual inputs despite robust textual safety mechanisms. These vulnerabilities arise from two core weaknesses: the continuous nature of visual representations, which allows for gradient-based attacks, and the inadequate transfer of text-based safety mechanisms to visual content. We introduce Q-MLLM, a novel architecture that integrates two-level vector quantization to create a discrete bottleneck against adversarial attacks while preserving multimodal reasoning capabilities. By discretizing visual representations at both pixel-patch and semantic levels, Q-MLLM blocks attack pathways and bridges the cross-modal safety alignment gap. Our two-stage training methodology ensures robust learning while maintaining model utility. Experiments demonstrate that Q-MLLM achieves significantly better defense success rate against both jailbreak attacks and toxic image attacks than existing approaches. Notably, Q-MLLM achieves perfect defense success rate (100\%) against jailbreak attacks except in one arguable case, while maintaining competitive performance on multiple utility benchmarks with minimal inference overhead. This work establishes vector quantization as an effective defense mechanism for secure multimodal AI systems without requiring expensive safety-specific fine-tuning or detection overhead. Code is available at https://github.com/Amadeuszhao/QMLLM.
comment: Accepted by NDSS 2026
☆ FlipVQA-Miner: Cross-Page Visual Question-Answer Mining from Textbooks
The development of Large Language Models (LLMs) increasingly depends on high-quality supervised data, yet existing instruction-tuning and RL datasets remain costly to curate and often rely on synthetic samples that introduce hallucination and limited diversity. At the same time, textbooks and exercise materials contain abundant, high-quality human-authored Question-Answer(QA) content that remains underexploited due to the difficulty of transforming raw PDFs into AI-ready supervision. Although modern OCR and vision-language models can accurately parse document structure, their outputs lack the semantic alignment required for training. We propose an automated pipeline that extracts well-formed QA and visual-QA (VQA) pairs from educational documents by combining layout-aware OCR with LLM-based semantic parsing. Experiments across diverse document types show that the method produces accurate, aligned, and low-noise QA/VQA pairs. This approach enables scalable use of real-world educational content and provides a practical alternative to synthetic data generation for improving reasoning-oriented LLM training. All code and data-processing pipelines are open-sourced at https://github.com/OpenDCAI/DataFlow.
☆ ChemLabs on ChemO: A Multi-Agent System for Multimodal Reasoning on IChO 2025
Olympiad-level benchmarks in mathematics and physics are crucial testbeds for advanced AI reasoning, but chemistry, with its unique multimodal symbolic language, has remained an open challenge. We introduce ChemO, a new benchmark built from the International Chemistry Olympiad (IChO) 2025. ChemO features two key innovations for automated assessment: Assessment-Equivalent Reformulation (AER), which converts problems requiring visual outputs (e.g., drawing molecules) into computationally tractable formats, and Structured Visual Enhancement (SVE), a diagnostic mechanism to disentangle a model's visual perception capabilities from its core chemical reasoning. To tackle this benchmark, we propose ChemLabs, a hierarchical multi-agent framework that mimics human expert collaboration through specialized agents for problem decomposition, perception, reasoning, and auditing. Experiments on state-of-the-art multimodal models demonstrate that combining SVE with our multi-agent system yields dramatic performance gains. Our top configuration achieves a score of 93.6 out of 100, surpassing an estimated human gold medal threshold and establishing a new state-of-the-art in automated chemical problem-solving. ChemO Dataset: https://huggingface.co/datasets/IDEA-AI4SCI/ChemO
comment: 13 pages, 1 figures
☆ When Alignment Fails: Multimodal Adversarial Attacks on Vision-Language-Action Models
Vision-Language-Action models (VLAs) have recently demonstrated remarkable progress in embodied environments, enabling robots to perceive, reason, and act through unified multimodal understanding. Despite their impressive capabilities, the adversarial robustness of these systems remains largely unexplored, especially under realistic multimodal and black-box conditions. Existing studies mainly focus on single-modality perturbations and overlook the cross-modal misalignment that fundamentally affects embodied reasoning and decision-making. In this paper, we introduce VLA-Fool, a comprehensive study of multimodal adversarial robustness in embodied VLA models under both white-box and black-box settings. VLA-Fool unifies three levels of multimodal adversarial attacks: (1) textual perturbations through gradient-based and prompt-based manipulations, (2) visual perturbations via patch and noise distortions, and (3) cross-modal misalignment attacks that intentionally disrupt the semantic correspondence between perception and instruction. We further incorporate a VLA-aware semantic space into linguistic prompts, developing the first automatically crafted and semantically guided prompting framework. Experiments on the LIBERO benchmark using a fine-tuned OpenVLA model reveal that even minor multimodal perturbations can cause significant behavioral deviations, demonstrating the fragility of embodied multimodal alignment.
☆ Multi-Agent Collaborative Reward Design for Enhancing Reasoning in Reinforcement Learning
We present CRM (Multi-Agent Collaborative Reward Model), a framework that replaces a single black-box reward model with a coordinated team of specialist evaluators to improve robustness and interpretability in RLHF. Conventional reward models struggle to jointly optimize multiple, sometimes conflicting, preference dimensions (e.g., factuality, helpfulness, safety) and offer limited transparency into why a score is assigned. CRM addresses these issues by decomposing preference evaluation into domain-specific agents that each produce partial signals, alongside global evaluators such as ranker-based and embedding-similarity rewards. A centralized aggregator fuses these signals at each timestep, balancing factors like step-wise correctness, multi-agent agreement, and repetition penalties, yielding a single training reward compatible with standard RL pipelines. The policy is optimized with advantage-based updates (e.g., GAE), while a value model regresses to the aggregated reward, enabling multi-perspective reward shaping without requiring additional human annotations beyond those used to train the evaluators. To support training and assessment, we introduce rewardBench, a benchmark and training suite aligned with the collaborative structure of CRM. Together, CRM and rewardBench provide a practical, modular path to more transparent reward modeling and more stable optimization.
☆ From Performance to Understanding: A Vision for Explainable Automated Algorithm Design
Automated algorithm design is entering a new phase: Large Language Models can now generate full optimisation (meta)heuristics, explore vast design spaces and adapt through iterative feedback. Yet this rapid progress is largely performance-driven and opaque. Current LLM-based approaches rarely reveal why a generated algorithm works, which components matter or how design choices relate to underlying problem structures. This paper argues that the next breakthrough will come not from more automation, but from coupling automation with understanding from systematic benchmarking. We outline a vision for explainable automated algorithm design, built on three pillars: (i) LLM-driven discovery of algorithmic variants, (ii) explainable benchmarking that attributes performance to components and hyperparameters and (iii) problem-class descriptors that connect algorithm behaviour to landscape structure. Together, these elements form a closed knowledge loop in which discovery, explanation and generalisation reinforce each other. We argue that this integration will shift the field from blind search to interpretable, class-specific algorithm design, accelerating progress while producing reusable scientific insight into when and why optimisation strategies succeed.
☆ Fast LLM Post-training via Decoupled and Best-of-N Speculation
Rollout dominates the training time in large language model (LLM) post-training, where the trained model is used to generate tokens given a batch of prompts. SpecActor achieves fast rollout with speculative decoding that deploys a fast path (e.g., a smaller model) to accelerate the unparallelizable generation, while the correctness is guaranteed by fast parallel verification of the outputs with the original model. SpecActor addresses two foundational challenges in speculative rollout by (1) a \emph{dynamic decoupled speculation} execution method that maximizes the GPU computational efficiency to realize speedup for large-batch execution -- a configuration common in training but unfriendly to speculative execution and (2) a \emph{dynamic Best-of-N speculation} method that selects and combines different drafting methods according to the rollout progress. It substantially improves the speculation accuracy even when the best drafting method is unknown a priori, meanwhile without requiring adding extra computation resources. {\sys} is {1.3--1.7}\,$\times$ faster than common post-training baselines, and is {1.3--1.5}\,$\times$ faster compared to naively adopting speculative decoding for rollout.
☆ FOOTPASS: A Multi-Modal Multi-Agent Tactical Context Dataset for Play-by-Play Action Spotting in Soccer Broadcast Videos
Soccer video understanding has motivated the creation of datasets for tasks such as temporal action localization, spatiotemporal action detection (STAD), or multiobject tracking (MOT). The annotation of structured sequences of events (who does what, when, and where) used for soccer analytics requires a holistic approach that integrates both STAD and MOT. However, current action recognition methods remain insufficient for constructing reliable play-by-play data and are typically used to assist rather than fully automate annotation. Parallel research has advanced tactical modeling, trajectory forecasting, and performance analysis, all grounded in game-state and play-by-play data. This motivates leveraging tactical knowledge as a prior to support computer-vision-based predictions, enabling more automated and reliable extraction of play-by-play data. We introduce Footovision Play-by-Play Action Spotting in Soccer Dataset (FOOTPASS), the first benchmark for play-by-play action spotting over entire soccer matches in a multi-modal, multi-agent tactical context. It enables the development of methods for player-centric action spotting that exploit both outputs from computer-vision tasks (e.g., tracking, identification) and prior knowledge of soccer, including its tactical regularities over long time horizons, to generate reliable play-by-play data streams. These streams form an essential input for data-driven sports analytics.
☆ Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight
Recent advances in Vision-Language-Action (VLA) models demonstrate that visual signals can effectively complement sparse action supervisions. However, letting VLA directly predict high-dimensional visual states can distribute model capacity and incur prohibitive training cost, while compressing visual states into more compact supervisory signals inevitably incurs information bottlenecks. Moreover, existing methods often suffer from poor comprehension and reasoning capabilities due to the neglect of language supervision. This paper introduces Mantis, a novel framework featuring a Disentangled Visual Foresight (DVF) to tackle these issues. Specifically, Mantis decouples visual foresight prediction from the backbone with the combination of meta queries and a diffusion Transformer (DiT) head. With the current visual state provided to the DiT via a residual connection, a simple next-state prediction objective enables the meta queries to automatically capture the latent actions that delineate the visual trajectory, and hence boost the learning of explicit actions. The disentanglement reduces the burden of the VLA backbone, enabling it to maintain comprehension and reasoning capabilities through language supervision. Empirically, pretrained on human manipulation videos, robot demonstrations, and image-text pairs, Mantis achieves a 96.7% success rate on LIBERO benchmark after fine-tuning, surpassing powerful baselines while exhibiting high convergence speed. Real-world evaluations show that Mantis outperforms $π_{0.5}$, a leading open-source VLA model, particularly in instruction-following capability, generalization to unseen instructions, and reasoning ability. Code and weights are released to support the open-source community.
☆ TS-PEFT: Token-Selective Parameter-Efficient Fine-Tuning with Learnable Threshold Gating
In the field of large models (LMs) for natural language processing (NLP) and computer vision (CV), Parameter-Efficient Fine-Tuning (PEFT) has emerged as a resource-efficient method that modifies a limited number of parameters while keeping the pretrained weights fixed. This paper investigates the traditional PEFT approach, which applies modifications to all position indices, and questions its necessity. We introduce a new paradigm called Token-Selective PEFT (TS-PEFT), in which a function S selectively applies PEFT modifications to a subset of position indices, potentially enhancing performance on downstream tasks. Our experimental results reveal that the indiscriminate application of PEFT to all indices is not only superfluous, but may also be counterproductive. This study offers a fresh perspective on PEFT, advocating for a more targeted approach to modifications and providing a framework for future research to optimize the fine-tuning process for large models.
comment: 11 pages, 3 figures
☆ Labels Matter More Than Models: Quantifying the Benefit of Supervised Time Series Anomaly Detection
Time series anomaly detection (TSAD) is a critical data mining task often constrained by label scarcity. Consequently, current research predominantly focuses on Unsupervised Time-series Anomaly Detection (UTAD), relying on complex architectures to model normal data distributions. However, this approach often overlooks the significant performance gains available from limited anomaly labels achievable in practical scenarios. This paper challenges the premise that architectural complexity is the optimal path for TSAD. We conduct the first methodical comparison between supervised and unsupervised paradigms and introduce STAND, a streamlined supervised baseline. Extensive experiments on five public datasets demonstrate that: (1) Labels matter more than models: under a limited labeling budget, simple supervised models significantly outperform complex state-of-the-art unsupervised methods; (2) Supervision yields higher returns: the performance gain from minimal supervision far exceeds that from architectural innovations; and (3) Practicality: STAND exhibits superior prediction consistency and anomaly localization compared to unsupervised counterparts. These findings advocate for a data-centric shift in TSAD research, emphasizing label utilization over purely algorithmic complexity. The code is publicly available at https://github.com/EmorZz1G/STAND.
comment: 16 pages, 14 figures, 7 tables. Under review
☆ Multidimensional Rubric-oriented Reward Model Learning via Geometric Projection Reference Constraints
The integration of large language models (LLMs) into medical practice holds transformative potential, yet their real-world clinical utility remains limited by critical alignment challenges: (1) a disconnect between static evaluation benchmarks and dynamic clinical cognitive needs, (2) difficulties in adapting to evolving, multi-source medical standards, and (3) the inability of conventional reward models to capture nuanced, multi-dimensional medical quality criteria. To address these gaps, we propose MR-RML (Multidimensional Rubric-oriented Reward Model Learning) via GPRC (Geometric Projection Reference Constraints), a novel alignment framework that integrates medical standards into a structured "Dimensions-Scenarios-Disciplines" matrix to guide data generation and model optimization. MR-RML introduces three core innovations: (1) a "Dimensions-Scenarios-Disciplines" medical standard system that embeds domain standards into the full training pipeline; (2) an independent multi-dimensional reward model that decomposes evaluation criteria, shifting from real-time rubric-based scoring to internalized reward modeling for improved consistency and cost-efficiency; (3) geometric projection reference constraints that transform medical cognitive logic into mathematical regularization, aligning scoring gradients with clinical reasoning and enabling synthetic data-driven training. Through extensive evaluations on the authoritative medical benchmark Healthbench, our method yields substantial performance gains over the base LLM Qwen-32B (45% on the full subset and 85% on Hard subset, respectively). It achieves a SOTA among open-source LLMs with scores of 62.7 (full subset) and 44.7 (hard subset), while also outperforming the majority of closed-source models.
☆ CoSP: Reconfigurable Multi-State Metamaterial Inverse Design via Contrastive Pretrained Large Language Model
Metamaterials, known for their ability to manipulate light at subwavelength scales, face significant design challenges due to their complex and sophisticated structures. Consequently, deep learning has emerged as a powerful tool to streamline their design process. Reconfigurable multi-state metamaterials (RMMs) with adjustable parameters can switch their optical characteristics between different states upon external stimulation, leading to numerous applications. However, existing deep learning-based inverse design methods fall short in considering reconfigurability with multi-state switching. To address this challenge, we propose CoSP, an intelligent inverse design method based on contrastive pretrained large language model (LLM). By performing contrastive pretraining on multi-state spectrum, a well-trained spectrum encoder capable of understanding the spectrum is obtained, and it subsequently interacts with a pretrained LLM. This approach allows the model to preserve its linguistic capabilities while also comprehending Maxwell's Equations, enabling it to describe material structures with target optical properties in natural language. Our experiments demonstrate that CoSP can design corresponding thin-film metamaterial structures for arbitrary multi-state, multi-band optical responses, showing great potentials in the intelligent design of RMMs for versatile applications.
comment: 5 pages, 6 figures
☆ AskDB: An LLM Agent for Natural Language Interaction with Relational Databases
Interacting with relational databases remains challenging for users across different expertise levels, particularly when composing complex analytical queries or performing administrative tasks. Existing systems typically address either natural language querying or narrow aspects of database administration, lacking a unified and intelligent interface for general-purpose database interaction. We introduce AskDB, a large language model powered agent designed to bridge this gap by supporting both data analysis and administrative operations over SQL databases through natural language. Built on Gemini 2, AskDB integrates two key innovations: a dynamic schema-aware prompting mechanism that effectively incorporates database metadata, and a task decomposition framework that enables the agent to plan and execute multi-step actions. These capabilities allow AskDB to autonomously debug derived SQL, retrieve contextual information via real-time web search, and adaptively refine its responses. We evaluate AskDB on a widely used Text-to-SQL benchmark and a curated set of DBA tasks, demonstrating strong performance in both analytical and administrative scenarios. Our results highlight the potential of AskDB as a unified and intelligent agent for relational database systems, offering an intuitive and accessible experience for end users.
comment: 15 pages, 10 figures
☆ ELPO: Ensemble Learning Based Prompt Optimization for Large Language Models
The remarkable performance of Large Language Models (LLMs) highly relies on crafted prompts. However, manual prompt engineering is a laborious process, creating a core bottleneck for practical application of LLMs. This phenomenon has led to the emergence of a new research area known as Automatic Prompt Optimization (APO), which develops rapidly in recent years. Existing APO methods such as those based on evolutionary algorithms or trial-and-error approaches realize an efficient and accurate prompt optimization to some extent. However, those researches focus on a single model or algorithm for the generation strategy and optimization process, which limits their performance when handling complex tasks. To address this, we propose a novel framework called Ensemble Learning based Prompt Optimization (ELPO) to achieve more accurate and robust results. Motivated by the idea of ensemble learning, ELPO conducts voting mechanism and introduces shared generation strategies along with different search methods for searching superior prompts. Moreover, ELPO creatively presents more efficient algorithms for the prompt generation and search process. Experimental results demonstrate that ELPO outperforms state-of-the-art prompt optimization methods across different tasks, e.g., improving F1 score by 7.6 on ArSarcasm dataset.
☆ SkyRL-Agent: Efficient RL Training for Multi-turn LLM Agent
We introduce SkyRL-Agent, a framework for efficient, multi-turn, long-horizon agent training and evaluation. It provides efficient asynchronous dispatching, lightweight tool integration, and flexible backend interoperability, enabling seamless use with existing RL frameworks such as SkyRL-train, VeRL, and Tinker. Using SkyRL-Agent, we train SA-SWE-32B, a software engineering agent trained from Qwen3-32B (24.4% Pass@1) purely with reinforcement learning. We introduce two key components: an optimized asynchronous pipeline dispatcher that achieves a 1.55x speedup over naive asynchronous batching, and a tool-enhanced training recipe leveraging an AST-based search tool to facilitate code navigation, boost rollout Pass@K, and improve training efficiency. Together, these optimizations enable SA-SWE-32B to reach 39.4% Pass@1 on SWE-Bench Verified with more than 2x cost reduction compared to prior models reaching similar performance. Despite being trained solely on SWE tasks, SA-SWE-32B generalizes effectively to other agentic tasks, including Terminal-Bench, BrowseComp-Plus, and WebArena. We further demonstrate SkyRL-Agent's extensibility through case studies on deep research, computer use, and memory agents, each trained using a different training backend.
☆ T2T-VICL: Unlocking the Boundaries of Cross-Task Visual In-Context Learning via Implicit Text-Driven VLMs
In large language models (LLM), in-context learning (ICL) refers to performing new tasks by conditioning on small demonstrations provided in the input context. Recent advances in visual in-context learning (VICL) demonstrate promising capabilities for solving downstream tasks by unified vision-language models (VLMs). When the visual prompt and the target images originate from different visual tasks, can VLMs still enable VICL? In the paper, we propose a fully collaborative pipeline, i.e. T2T-VICL, for VLMs to investigate the potential of cross-task VICL. Fundamentally, we design a mechanism to generate and select text prompts that best implicitly describe the differences between two distinct low-level vision tasks, and construct the first cross-task VICL dataset. Building upon this, we propose a novel inference framework that combines perceptual score-based reasoning with traditional evaluation metrics to perform cross-task VICL. Our approach achieves top-tier results across nine cross-task scenarios and second-tier performance in ten additional scenarios, unlocking the boundaries of cross-task VICL within VLMs.
☆ Mitigating Estimation Bias with Representation Learning in TD Error-Driven Regularization
Deterministic policy gradient algorithms for continuous control suffer from value estimation biases that degrade performance. While double critics reduce such biases, the exploration potential of double actors remains underexplored. Building on temporal-difference error-driven regularization (TDDR), a double actor-critic framework, this work introduces enhanced methods to achieve flexible bias control and stronger representation learning. We propose three convex combination strategies, symmetric and asymmetric, that balance pessimistic estimates to mitigate overestimation and optimistic exploration via double actors to alleviate underestimation. A single hyperparameter governs this mechanism, enabling tunable control across the bias spectrum. To further improve performance, we integrate augmented state and action representations into the actor and critic networks. Extensive experiments show that our approach consistently outperforms benchmarks, demonstrating the value of tunable bias and revealing that both overestimation and underestimation can be exploited differently depending on the environment.
☆ Future-Back Threat Modeling: A Foresight-Driven Security Framework
Traditional threat modeling remains reactive-focused on known TTPs and past incident data, while threat prediction and forecasting frameworks are often disconnected from operational or architectural artifacts. This creates a fundamental weakness: the most serious cyber threats often do not arise from what is known, but from what is assumed, overlooked, or not yet conceived, and frequently originate from the future, such as artificial intelligence, information warfare, and supply chain attacks, where adversaries continuously develop new exploits that can bypass defenses built on current knowledge. To address this mental gap, this paper introduces the theory and methodology of Future-Back Threat Modeling (FBTM). This predictive approach begins with envisioned future threat states and works backward to identify assumptions, gaps, blind spots, and vulnerabilities in the current defense architecture, providing a clearer and more accurate view of impending threats so that we can anticipate their emergence and shape the future we want through actions taken now. The proposed methodology further aims to reveal known unknowns and unknown unknowns, including tactics, techniques, and procedures that are emerging, anticipated, and plausible. This enhances the predictability of adversary behavior, particularly under future uncertainty, helping security leaders make informed decisions today that shape more resilient security postures for the future.
☆ SpectralTrain: A Universal Framework for Hyperspectral Image Classification
Hyperspectral image (HSI) classification typically involves large-scale data and computationally intensive training, which limits the practical deployment of deep learning models in real-world remote sensing tasks. This study introduces SpectralTrain, a universal, architecture-agnostic training framework that enhances learning efficiency by integrating curriculum learning (CL) with principal component analysis (PCA)-based spectral downsampling. By gradually introducing spectral complexity while preserving essential information, SpectralTrain enables efficient learning of spectral -- spatial patterns at significantly reduced computational costs. The framework is independent of specific architectures, optimizers, or loss functions and is compatible with both classical and state-of-the-art (SOTA) models. Extensive experiments on three benchmark datasets -- Indian Pines, Salinas-A, and the newly introduced CloudPatch-7 -- demonstrate strong generalization across spatial scales, spectral characteristics, and application domains. The results indicate consistent reductions in training time by 2-7x speedups with small-to-moderate accuracy deltas depending on backbone. Its application to cloud classification further reveals potential in climate-related remote sensing, emphasizing training strategy optimization as an effective complement to architectural design in HSI models. Code is available at https://github.com/mh-zhou/SpectralTrain.
☆ Operon: Incremental Construction of Ragged Data via Named Dimensions
Modern data processing workflows frequently encounter ragged data: collections with variable-length elements that arise naturally in domains like natural language processing, scientific measurements, and autonomous AI agents. Existing workflow engines lack native support for tracking the shapes and dependencies inherent to ragged data, forcing users to manage complex indexing and dependency bookkeeping manually. We present Operon, a Rust-based workflow engine that addresses these challenges through a novel formalism of named dimensions with explicit dependency relations. Operon provides a domain-specific language where users declare pipelines with dimension annotations that are statically verified for correctness, while the runtime system dynamically schedules tasks as data shapes are incrementally discovered during execution. We formalize the mathematical foundation for reasoning about partial shapes and prove that Operon's incremental construction algorithm guarantees deterministic and confluent execution in parallel settings. The system's explicit modeling of partially-known states enables robust persistence and recovery mechanisms, while its per-task multi-queue architecture achieves efficient parallelism across heterogeneous task types. Empirical evaluation demonstrates that Operon outperforms an existing workflow engine with 14.94x baseline overhead reduction while maintaining near-linear end-to-end output rates as workloads scale, making it particularly suitable for large-scale data generation pipelines in machine learning applications.
☆ A Hybrid Proactive And Predictive Framework For Edge Cloud Resource Management
Old cloud edge workload resource management is too reactive. The problem with relying on static thresholds is that we are either overspending for more resources than needed or have reduced performance because of their lack. This is why we work on proactive solutions. A framework developed for it stops reacting to the problems but starts expecting them. We design a hybrid architecture, combining two powerful tools: the CNN LSTM model for time series forecasting and an orchestrator based on multi agent Deep Reinforcement Learning In fact the novelty is in how we combine them as we embed the predictive forecast from the CNN LSTM directly into the DRL agent state space. That is what makes the AI manager smarter it sees the future, which allows it to make better decisions about a long term plan for where to run tasks That means finding that sweet spot between how much money is saved while keeping the system healthy and apps fast for users That is we have given it eyes in order to see down the road so that it does not have to lurch from one problem to another it finds a smooth path forward Our tests show our system easily beats the old methods It is great at solving tough problems like making complex decisions and juggling multiple goals at once like being cheap fast and reliable
☆ A Mathematical Framework for Custom Reward Functions in Job Application Evaluation using Reinforcement Learning
Conventional Applicant Tracking Systems (ATS) tend to be inflexible keyword-matchers, and deny gifted candidates a role due to a few minor semantic mismatches. This article describes a new two-step process to design a more refined resume evaluation model based on a small language model (<600M parameters) that is finetuned using GRPO on a custom reward function. To begin with, Supervised Fine-Tuning (SFT) was used to build a solid baseline model. Second, this SFT model was also optimized with the help of Reinforcement Learning (RL) through GRPO under the guidance of a new, multi-component reward function that can holistically assess candidates beyond simple keyword matching. We indicate that the RL application presents a critical problem of reward hacking due to the initial experiments of aggressive penalties, which produces faulty, excessively negative model behaviors. We have overcome this challenge by refining the reward function repeatedly and training hyperparameters into a stable "gentle polishing process" of the reward function. Our resulting GRPO-polished model demonstrates significant real-world efficacy, achieving a final accuracy of 91% on unseen test data. The model shows a strong ability to correctly identify qualified candidates (recall of 0.85 for the 'SELECTED' class) while also showing exceptional precision (1.0), confirming its reliability. These results indicate that a properly executed, two-step fine-tuning procedure can indeed effectively refine a small language model to be able to conduct fine-tuned and human-like candidate scoring, overcoming the drawbacks of both traditional ATS and naive RL usage.
comment: 13 pages, 4 figures, 2 equations, 3 Tables
☆ Early science acceleration experiments with GPT-5
AI models like GPT-5 are an increasingly valuable tool for scientists, but many remain unaware of the capabilities of frontier AI. We present a collection of short case studies in which GPT-5 produced new, concrete steps in ongoing research across mathematics, physics, astronomy, computer science, biology, and materials science. In these examples, the authors highlight how AI accelerated their work, and where it fell short; where expert time was saved, and where human input was still key. We document the interactions of the human authors with GPT-5, as guiding examples of fruitful collaboration with AI. Of note, this paper includes four new results in mathematics (carefully verified by the human authors), underscoring how GPT-5 can help human mathematicians settle previously unsolved problems. These contributions are modest in scope but profound in implication, given the rate at which frontier AI is progressing.
comment: 89 pages
☆ Artificial Intelligence and Accounting Research: A Framework and Agenda
Recent advances in artificial intelligence, particularly generative AI (GenAI) and large language models (LLMs), are fundamentally transforming accounting research, creating both opportunities and competitive threats for scholars. This paper proposes a framework that classifies AI-accounting research along two dimensions: research focus (accounting-centric versus AI-centric) and methodological approach (AI-based versus traditional methods). We apply this framework to papers from the IJAIS special issue and recent AI-accounting research published in leading accounting journals to map existing studies and identify research opportunities. Using this same framework, we analyze how accounting researchers can leverage their expertise through strategic positioning and collaboration, revealing where accounting scholars' strengths create the most value. We further examine how GenAI and LLMs transform the research process itself, comparing the capabilities of human researchers and AI agents across the entire research workflow. This analysis reveals that while GenAI democratizes certain research capabilities, it simultaneously intensifies competition by raising expectations for higher-order contributions where human judgment, creativity, and theoretical depth remain valuable. These shifts call for reforming doctoral education to cultivate comparative advantages while building AI fluency.
comment: 48 pages, 7 tables
☆ Learning Tractable Distributions Of Language Model Continuations
Controlled language generation conditions text on sequence-level constraints (for example, syntax, style, or safety). These constraints may depend on future tokens, which makes directly conditioning an autoregressive language model (LM) generally intractable. Prior work uses tractable surrogates such as hidden Markov models (HMMs) to approximate the distribution over continuations and adjust the model's next-token logits at decoding time. However, we find that these surrogates are often weakly context aware, which reduces query quality. We propose Learning to Look Ahead (LTLA), a hybrid approach that pairs the same base language model for rich prefix encoding with a fixed tractable surrogate model that computes exact continuation probabilities. Two efficiency pitfalls arise when adding neural context: (i) naively rescoring the prefix with every candidate next token requires a sweep over the entire vocabulary at each step, and (ii) predicting fresh surrogate parameters for each prefix, although tractable at a single step, forces recomputation of future probabilities for every new prefix and eliminates reuse. LTLA avoids both by using a single batched HMM update to account for all next-token candidates at once, and by conditioning only the surrogate's latent state prior on the LM's hidden representations while keeping the surrogate decoder fixed, so computations can be reused across prefixes. Empirically, LTLA attains higher conditional likelihood than an unconditional HMM, approximates continuation distributions for vision-language models where a standalone HMM cannot encode visual context, and improves constraint satisfaction at comparable fluency on controlled-generation tasks, with minimal inference overhead.
☆ Semantic Glitch: Agency and Artistry in an Autonomous Pixel Cloud NeurIPS 2025
While mainstream robotics pursues metric precision and flawless performance, this paper explores the creative potential of a deliberately "lo-fi" approach. We present the "Semantic Glitch," a soft flying robotic art installation whose physical form, a 3D pixel style cloud, is a "physical glitch" derived from digital archaeology. We detail a novel autonomous pipeline that rejects conventional sensors like LiDAR and SLAM, relying solely on the qualitative, semantic understanding of a Multimodal Large Language Model to navigate. By authoring a bio-inspired personality for the robot through a natural language prompt, we create a "narrative mind" that complements the "weak," historically, loaded body. Our analysis begins with a 13-minute autonomous flight log, and a follow-up study statistically validates the framework's robustness for authoring quantifiably distinct personas. The combined analysis reveals emergent behaviors, from landmark-based navigation to a compelling "plan to execution" gap, and a character whose unpredictable, plausible behavior stems from a lack of precise proprioception. This demonstrates a lo-fi framework for creating imperfect companions whose success is measured in character over efficiency.
comment: NeurIPS 2025 Creative AI Track, The Thirty-Ninth Annual Conference on Neural Information Processing Systems
☆ An Aligned Constraint Programming Model For Serial Batch Scheduling With Minimum Batch Size
In serial batch (s-batch) scheduling, jobs from similar families are grouped into batches and processed sequentially to avoid repetitive setups that are required when processing consecutive jobs of different families. Despite its large success in scheduling, only three Constraint Programming (CP) models have been proposed for this problem considering minimum batch sizes, which is a common requirement in many practical settings, including the ion implantation area in semiconductor manufacturing. These existing CP models rely on a predefined virtual set of possible batches that suffers from the curse of dimensionality and adds complexity to the problem. This paper proposes a novel CP model that does not rely on this virtual set. Instead, it uses key alignment parameters that allow it to reason directly on the sequences of same-family jobs scheduled on the machines, resulting in a more compact formulation. This new model is further improved by exploiting the problem's structure with tailored search phases and strengthened inference levels of the constraint propagators. The extensive computational experiments on nearly five thousand instances compare the proposed models against existing methods in the literature, including mixed-integer programming formulations, tabu search meta-heuristics, and CP approaches. The results demonstrate the superiority of the proposed models on small-to-medium instances with up to 100 jobs, and their ability to find solutions up to 25\% better than the ones produces by existing methods on large-scale instances with up to 500 jobs, 10 families, and 10 machines.
comment: 14 pages, 12 figures
☆ Liars' Bench: Evaluating Lie Detectors for Language Models
Prior work has introduced techniques for detecting when large language models (LLMs) lie, that is, generating statements they believe are false. However, these techniques are typically validated in narrow settings that do not capture the diverse lies LLMs can generate. We introduce LIARS' BENCH, a testbed consisting of 72,863 examples of lies and honest responses generated by four open-weight models across seven datasets. Our settings capture qualitatively different types of lies and vary along two dimensions: the model's reason for lying and the object of belief targeted by the lie. Evaluating three black- and white-box lie detection techniques on LIARS' BENCH, we find that existing techniques systematically fail to identify certain types of lies, especially in settings where it's not possible to determine whether the model lied from the transcript alone. Overall, LIARS' BENCH reveals limitations in prior techniques and provides a practical testbed for guiding progress in lie detection.
comment: *Kieron Kretschmar and Walter Laurito contributed equally to this work. 10 pages, 2 figures; plus appendix. Code at https://github.com/Cadenza-Labs/liars-bench and datasets at https://huggingface.co/datasets/Cadenza-Labs/liars-bench Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
☆ HGCN2SP: Hierarchical Graph Convolutional Network for Two-Stage Stochastic Programming
Two-stage Stochastic Programming (2SP) is a standard framework for modeling decision-making problems under uncertainty. While numerous methods exist, solving such problems with many scenarios remains challenging. Selecting representative scenarios is a practical method for accelerating solutions. However, current approaches typically rely on clustering or Monte Carlo sampling, failing to integrate scenario information deeply and overlooking the significant impact of the scenario order on solving time. To address these issues, we develop HGCN2SP, a novel model with a hierarchical graph designed for 2SP problems, encoding each scenario and modeling their relationships hierarchically. The model is trained in a reinforcement learning paradigm to utilize the feedback of the solver. The policy network is equipped with a hierarchical graph convolutional network for feature encoding and an attention-based decoder for scenario selection in proper order. Evaluation of two classic 2SP problems demonstrates that HGCN2SP provides high-quality decisions in a short computational time. Furthermore, HGCN2SP exhibits remarkable generalization capabilities in handling large-scale instances, even with a substantial number of variables or scenarios that were unseen during the training phase.
comment: 17 pages, 4 figures
☆ Towards a Safer and Sustainable Manufacturing Process: Material classification in Laser Cutting Using Deep Learning
Laser cutting is a widely adopted technology in material processing across various industries, but it generates a significant amount of dust, smoke, and aerosols during operation, posing a risk to both the environment and workers' health. Speckle sensing has emerged as a promising method to monitor the cutting process and identify material types in real-time. This paper proposes a material classification technique using a speckle pattern of the material's surface based on deep learning to monitor and control the laser cutting process. The proposed method involves training a convolutional neural network (CNN) on a dataset of laser speckle patterns to recognize distinct material types for safe and efficient cutting. Previous methods for material classification using speckle sensing may face issues when the color of the laser used to produce the speckle pattern is changed. Experiments conducted in this study demonstrate that the proposed method achieves high accuracy in material classification, even when the laser color is changed. The model achieved an accuracy of 98.30 % on the training set and 96.88% on the validation set. Furthermore, the model was evaluated on a set of 3000 new images for 30 different materials, achieving an F1-score of 0.9643. The proposed method provides a robust and accurate solution for material-aware laser cutting using speckle sensing.
☆ Physically Realistic Sequence-Level Adversarial Clothing for Robust Human-Detection Evasion
Deep neural networks used for human detection are highly vulnerable to adversarial manipulation, creating safety and privacy risks in real surveillance environments. Wearable attacks offer a realistic threat model, yet existing approaches usually optimize textures frame by frame and therefore fail to maintain concealment across long video sequences with motion, pose changes, and garment deformation. In this work, a sequence-level optimization framework is introduced to generate natural, printable adversarial textures for shirts, trousers, and hats that remain effective throughout entire walking videos in both digital and physical settings. Product images are first mapped to UV space and converted into a compact palette and control-point parameterization, with ICC locking to keep all colors printable. A physically based human-garment pipeline is then employed to simulate motion, multi-angle camera viewpoints, cloth dynamics, and illumination variation. An expectation-over-transformation objective with temporal weighting is used to optimize the control points so that detection confidence is minimized across whole sequences. Extensive experiments demonstrate strong and stable concealment, high robustness to viewpoint changes, and superior cross-model transferability. Physical garments produced with sublimation printing achieve reliable suppression under indoor and outdoor recordings, confirming real-world feasibility.
☆ SpellForger: Prompting Custom Spell Properties In-Game using BERT supervised-trained model
Introduction: The application of Artificial Intelligence in games has evolved significantly, allowing for dynamic content generation. However, its use as a core gameplay co-creation tool remains underexplored. Objective: This paper proposes SpellForger, a game where players create custom spells by writing natural language prompts, aiming to provide a unique experience of personalization and creativity. Methodology: The system uses a supervisedtrained BERT model to interpret player prompts. This model maps textual descriptions to one of many spell prefabs and balances their parameters (damage, cost, effects) to ensure competitive integrity. The game is developed in the Unity Game Engine, and the AI backend is in Python. Expected Results: We expect to deliver a functional prototype that demonstrates the generation of spells in real time, applied to an engaging gameplay loop, where player creativity is central to the experience, validating the use of AI as a direct gameplay mechanic.
comment: Published in Anais Estendidos do XXIV Simpósio Brasileiro de Jogos e Entretenimento Digital (SBGames 2025)
♻ ☆ LLMInit: A Free Lunch from Large Language Models for Selective Initialization of Recommendation EMNLP 2025
Collaborative filtering (CF) is widely adopted in industrial recommender systems (RecSys) for modeling user-item interactions across numerous applications, but often struggles with cold-start and data-sparse scenarios. Recent advancements in pre-trained large language models (LLMs) with rich semantic knowledge, offer promising solutions to these challenges. However, deploying LLMs at scale is hindered by their significant computational demands and latency. In this paper, we propose a novel and scalable LLM-RecSys framework, LLMInit, designed to integrate pretrained LLM embeddings into CF models through selective initialization strategies. Specifically, we identify the embedding collapse issue observed when CF models scale and match the large embedding sizes in LLMs and avoid the problem by introducing efficient sampling methods, including, random, uniform, and variance-based selections. Comprehensive experiments conducted on multiple real-world datasets demonstrate that LLMInit significantly improves recommendation performance while maintaining low computational costs, offering a practical and scalable solution for industrial applications. To facilitate industry adoption and promote future research, we provide open-source access to our implementation at https://github.com/DavidZWZ/LLMInit.
comment: Accepted in EMNLP 2025 Industry Track
♻ ☆ Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark
While large language models (LLMs) with reasoning capabilities are progressing rapidly on high-school math competitions and coding, can they reason effectively through complex, open-ended challenges found in frontier physics research? And crucially, what kinds of reasoning tasks do physicists want LLMs to assist with? To address these questions, we present the CritPt (Complex Research using Integrated Thinking - Physics Test, pronounced "critical point"), the first benchmark designed to test LLMs on unpublished, research-level reasoning tasks that broadly covers modern physics research areas, including condensed matter, quantum physics, atomic, molecular & optical physics, astrophysics, high energy physics, mathematical physics, statistical physics, nuclear physics, nonlinear dynamics, fluid dynamics and biophysics. CritPt consists of 71 composite research challenges designed to simulate full-scale research projects at the entry level, which are also decomposed to 190 simpler checkpoint tasks for more fine-grained insights. All problems are newly created by 50+ active physics researchers based on their own research. Every problem is hand-curated to admit a guess-resistant and machine-verifiable answer and is evaluated by an automated grading pipeline heavily customized for advanced physics-specific output formats. We find that while current state-of-the-art LLMs show early promise on isolated checkpoints, they remain far from being able to reliably solve full research-scale challenges: the best average accuracy among base models is only 5.7%, achieved by GPT-5 (high), moderately rising to around 10% when equipped with coding tools. Through the realistic yet standardized evaluation offered by CritPt, we highlight a large disconnect between current model capabilities and realistic physics research demands, offering a foundation to guide the development of scientifically grounded AI tools.
comment: 39 pages, 6 figures, 6 tables
♻ ☆ Bridging the Gap in XAI-Why Reliable Metrics Matter for Explainability and Compliance
Reliable explainability is not only a technical goal but also a cornerstone of private AI governance. As AI models enter high-stakes sectors, private actors such as auditors, insurers, certification bodies, and procurement agencies require standardized evaluation metrics to assess trustworthiness. However, current XAI evaluation metrics remain fragmented and prone to manipulation, which undermines accountability and compliance. We argue that standardized metrics can function as governance primitives, embedding auditability and accountability within AI systems for effective private oversight. Building upon prior work in XAI benchmarking, we identify key limitations in ensuring faithfulness, tamper resistance, and regulatory alignment. Furthermore, interpretability can directly support model alignment by providing a verifiable means of ensuring behavioral integrity in General Purpose AI (GPAI) systems. This connection between interpretability and alignment positions XAI metrics as both technical and regulatory instruments that help prevent alignment faking, a growing concern among oversight bodies. We propose a Governance by Metrics paradigm that treats explainability evaluation as a central mechanism of private AI governance. Our framework introduces a hierarchical model linking transparency, tamper resistance, scalability, and legal alignment, extending evaluation from model introspection toward systemic accountability. Through conceptual synthesis and alignment with governance standards, we outline a roadmap for integrating explainability metrics into continuous AI assurance pipelines that serve both private oversight and regulatory needs.
comment: Accepted at first EurIPS Workshop on Private AI Governance
♻ ☆ Interpretability as Alignment: Making Internal Understanding a Design Principle
Frontier AI systems require governance mechanisms that can verify internal alignment, not just behavioral compliance. Private governance mechanisms audits, certification, insurance, and procurement are emerging to complement public regulation, but they require technical substrates that generate verifiable causal evidence about model behavior. This paper argues that mechanistic interpretability provides this substrate. We frame interpretability not as post-hoc explanation but as a design constraint embedding auditability, provenance, and bounded transparency within model architectures. Integrating causal abstraction theory and empirical benchmarks such as MIB and LoBOX, we outline how interpretability-first models can underpin private assurance pipelines and role-calibrated transparency frameworks. This reframing situates interpretability as infrastructure for private AI governance bridging the gap between technical reliability and institutional accountability.
comment: Accepted at the first EurIPS Workshop on Private AI Governance
♻ ☆ DiffuSyn Bench: Evaluating Vision-Language Models on Real-World Complexities with Diffusion-Generated Synthetic Benchmarks
This study assesses the ability of Large Vision-Language Models (LVLMs) to differentiate between AI-generated and human-generated images. It introduces a new automated benchmark construction method for this evaluation. The experiment compared common LVLMs with human participants using a mixed dataset of AI and human-created images. Results showed that LVLMs could distinguish between the image types to some extent but exhibited a rightward bias, and perform significantly worse compared to humans. To build on these findings, we developed an automated benchmark construction process using AI. This process involved topic retrieval, narrative script generation, error embedding, and image generation, creating a diverse set of text-image pairs with intentional errors. We validated our method through constructing two caparable benchmarks. This study highlights the strengths and weaknesses of LVLMs in real-world understanding and advances benchmark construction techniques, providing a scalable and automatic approach for AI model evaluation.
♻ ☆ Securing Smart Contract Languages with a Unified Agentic Framework for Vulnerability Repair in Solidity and Move
The rapid growth of the blockchain ecosystem and the increasing value locked in smart contracts necessitate robust security measures. While languages like Solidity and Move aim to improve smart contract security, vulnerabilities persist. This paper presents Smartify, a novel multi-agent framework leveraging Large Language Models (LLMs) to automatically detect and repair vulnerabilities in Solidity and Move smart contracts. Unlike traditional methods that rely solely on vast pre-training datasets, Smartify employs a team of specialized agents working on different specially fine-tuned LLMs to analyze code based on underlying programming concepts and language-specific security principles. We evaluated Smartify on a dataset for Solidity and a curated dataset for Move, demonstrating its effectiveness in fixing a wide range of vulnerabilities. Our results show that Smartify (Gemma2+codegemma) achieves state-of-the-art performance, surpassing existing LLMs and enhancing general-purpose models' capabilities, such as Llama 3.1. Notably, Smartify can incorporate language-specific knowledge, such as the nuances of Move, without requiring massive language-specific pre-training datasets. This work offers a detailed analysis of various LLMs' performance on smart contract repair, highlighting the strengths of our multi-agent approach and providing a blueprint for developing more secure and reliable decentralized applications in the growing blockchain landscape. We also provide a detailed recipe for extending this to other similar use cases.
♻ ☆ STAMP: Spatial-Temporal Adapter with Multi-Head Pooling ML4H
Time series foundation models (TSFMs) pretrained on data from multiple domains have shown strong performance on diverse modeling tasks. Various efforts have been made to develop foundation models specific to electroencephalography (EEG) data, which records brain electrical activity as time series. However, no comparative analysis of EEG-specific foundation models (EEGFMs) versus general TSFMs has been performed on EEG-specific tasks. We introduce a novel Spatial-Temporal Adapter with Multi-Head Pooling (STAMP), which leverages univariate embeddings produced by a general TSFM, implicitly models spatial-temporal characteristics of EEG data, and achieves performance comparable to state-of-the-art EEGFMs. A comprehensive analysis is performed on 8 benchmark datasets of clinical tasks using EEG for classification, along with ablation studies. Our proposed adapter is lightweight in trainable parameters and flexible in the inputs it can accommodate, supporting easy modeling of EEG data using TSFMs.
comment: Accepted as a Proceedings paper at Machine Learning for Health (ML4H) 2025, invited presentation at the Time Series for Health (TS4H) Workshop, NeurIPS 2025. v2: Updated author affiliation and corrected a duplicated word in the text. No other changes
♻ ☆ Automatically Detecting Online Deceptive Patterns
Deceptive patterns in digital interfaces manipulate users into making unintended decisions, exploiting cognitive biases and psychological vulnerabilities. These patterns have become ubiquitous on various digital platforms. While efforts to mitigate deceptive patterns have emerged from legal and technical perspectives, a significant gap remains in creating usable and scalable solutions. We introduce our AutoBot framework to address this gap and help web stakeholders navigate and mitigate online deceptive patterns. AutoBot accurately identifies and localizes deceptive patterns from a screenshot of a website without relying on the underlying HTML code. AutoBot employs a two-stage pipeline that leverages the capabilities of specialized vision models to analyze website screenshots, identify interactive elements, and extract textual features. Next, using a large language model, AutoBot understands the context surrounding these elements to determine the presence of deceptive patterns. We also use AutoBot, to create a synthetic dataset to distill knowledge from 'teacher' LLMs to smaller language models. Through extensive evaluation, we demonstrate AutoBot's effectiveness in detecting deceptive patterns on the web, achieving an F1-score of 0.93 when detecting deceptive patterns, underscoring its potential as an essential tool for mitigating online deceptive patterns. We implement AutoBot, across three downstream applications targeting different web stakeholders: (1) a local browser extension providing users with real-time feedback, (2) a Lighthouse audit to inform developers of potential deceptive patterns on their sites, and (3) as a measurement tool designed for researchers and regulators.
♻ ☆ Efficient Solution and Learning of Robust Factored MDPs
Robust Markov decision processes (r-MDPs) extend MDPs by explicitly modelling epistemic uncertainty about transition dynamics. Learning r-MDPs from interactions with an unknown environment enables the synthesis of robust policies with provable (PAC) guarantees on performance, but this can require a large number of sample interactions. We propose novel methods for solving and learning r-MDPs based on factored state-space representations that leverage the independence between model uncertainty across system components. Although policy synthesis for factored r-MDPs leads to hard, non-convex optimisation problems, we show how to reformulate these into tractable linear programs. Building on these, we also propose methods to learn factored model representations directly. Our experimental results show that exploiting factored structure can yield dimensional gains in sample efficiency, producing more effective robust policies with tighter performance guarantees than state-of-the-art methods.
♻ ☆ Learning to Detect Unknown Jailbreak Attacks in Large Vision-Language Models
Despite extensive alignment efforts, Large Vision-Language Models (LVLMs) remain vulnerable to jailbreak attacks, posing serious safety risks. To address this, existing detection methods either learn attack-specific parameters, which hinders generalization to unseen attacks, or rely on heuristically sound principles, which limit accuracy and efficiency. To overcome these limitations, we propose Learning to Detect (LoD), a general framework that accurately detects unknown jailbreak attacks by shifting the focus from attack-specific learning to task-specific learning. This framework includes a Multi-modal Safety Concept Activation Vector module for safety-oriented representation learning and a Safety Pattern Auto-Encoder module for unsupervised attack classification. Extensive experiments show that our method achieves consistently higher detection AUROC on diverse unknown attacks while improving efficiency. The code is available at https://anonymous.4open.science/r/Learning-to-Detect-51CB.
comment: 16 pages; Previously this version appeared as arXiv:2510.15430 which was submitted as a new work by accident
♻ ☆ KVTuner: Sensitivity-Aware Layer-Wise Mixed-Precision KV Cache Quantization for Efficient and Nearly Lossless LLM Inference ICML25
KV cache quantization can improve Large Language Models (LLMs) inference throughput and latency in long contexts and large batch-size scenarios while preserving LLMs effectiveness. However, current methods have three unsolved issues: overlooking layer-wise sensitivity to KV cache quantization, high overhead of online fine-grained decision-making, and low flexibility to different LLMs and constraints. Therefore, we theoretically analyze the inherent correlation of layer-wise transformer attention patterns to KV cache quantization errors and study why key cache is generally more important than value cache for quantization error reduction. We further propose a simple yet effective framework KVTuner to adaptively search for the optimal hardware-friendly layer-wise KV quantization precision pairs for coarse-grained KV cache with multi-objective optimization and directly utilize the offline searched configurations during online inference. To reduce the computational cost of offline calibration, we utilize the intra-layer KV precision pair pruning and inter-layer clustering to reduce the search space. Experimental results show that we can achieve nearly lossless 3.25-bit mixed precision KV cache quantization for LLMs like Llama-3.1-8B-Instruct and 4.0-bit for sensitive models like Qwen2.5-7B-Instruct on mathematical reasoning tasks. The maximum inference throughput can be improved by 21.25\% compared with KIVI-KV8 quantization over various context lengths. Our code and searched configurations are available at https://github.com/cmd2001/KVTuner.
comment: Accepted by ICML25. Code: https://github.com/cmd2001/KVTuner
♻ ☆ Decoding Deception: Understanding Automatic Speech Recognition Vulnerabilities in Evasion and Poisoning Attacks
Recent studies have demonstrated the vulnerability of Automatic Speech Recognition systems to adversarial examples, which can deceive these systems into misinterpreting input speech commands. While previous research has primarily focused on white-box attacks with constrained optimizations, and transferability based black-box attacks against commercial Automatic Speech Recognition devices, this paper explores cost efficient white-box attack and non transferability black-box adversarial attacks on Automatic Speech Recognition systems, drawing insights from approaches such as Fast Gradient Sign Method and Zeroth-Order Optimization. Further, the novelty of the paper includes how poisoning attack can degrade the performances of state-of-the-art models leading to misinterpretation of audio signals. Through experimentation and analysis, we illustrate how hybrid models can generate subtle yet impactful adversarial examples with very little perturbation having Signal Noise Ratio of 35dB that can be generated within a minute. These vulnerabilities of state-of-the-art open source model have practical security implications, and emphasize the need for adversarial security.
comment: Remove due to conflict in authors
♻ ☆ Fast-DataShapley: Neural Modeling for Training Data Valuation
The value and copyright of training data are crucial in the artificial intelligence industry. Service platforms should protect data providers' legitimate rights and fairly reward them for their contributions. Shapley value, a potent tool for evaluating contributions, outperforms other methods in theory, but its computational overhead escalates exponentially with the number of data providers. Recent works based on Shapley values attempt to mitigate computation complexity by approximation algorithms. However, they need to retrain for each test sample, leading to intolerable costs. We propose Fast-DataShapley, a one-pass training method that leverages the weighted least squares characterization of the Shapley value to train a reusable explainer model with real-time reasoning speed. Given new test samples, no retraining is required to calculate the Shapley values of the training data. Additionally, we propose three methods with theoretical guarantees to reduce training overhead from two aspects: the approximate calculation of the utility function and the group calculation of the training data. We analyze time complexity to show the efficiency of our methods. The experimental evaluations on various image datasets demonstrate superior performance and efficiency compared to baselines. Specifically, the performance is improved to more than 2 times, and the explainer's training speed can be increased by two orders of magnitude.
♻ ☆ CleverDistiller: Simple and Spatially Consistent Cross-modal Distillation BMVC 2025
Vision foundation models (VFMs) such as DINO have led to a paradigm shift in 2D camera-based perception towards extracting generalized features to support many downstream tasks. Recent works introduce self-supervised cross-modal knowledge distillation (KD) as a way to transfer these powerful generalization capabilities into 3D LiDAR-based models. However, they either rely on highly complex distillation losses, pseudo-semantic maps, or limit KD to features useful for semantic segmentation only. In this work, we propose CleverDistiller, a self-supervised, cross-modal 2D-to-3D KD framework introducing a set of simple yet effective design choices: Unlike contrastive approaches relying on complex loss design choices, our method employs a direct feature similarity loss in combination with a multi layer perceptron (MLP) projection head to allow the 3D network to learn complex semantic dependencies throughout the projection. Crucially, our approach does not depend on pseudo-semantic maps, allowing for direct knowledge transfer from a VFM without explicit semantic supervision. Additionally, we introduce the auxiliary self-supervised spatial task of occupancy prediction to enhance the semantic knowledge, obtained from a VFM through KD, with 3D spatial reasoning capabilities. Experiments on standard autonomous driving benchmarks for 2D-to-3D KD demonstrate that CleverDistiller achieves state-of-the-art performance in both semantic segmentation and 3D object detection (3DOD) by up to 10% mIoU, especially when fine tuning on really low data amounts, showing the effectiveness of our simple yet powerful KD strategy
comment: Accepted to BMVC 2025
♻ ☆ Eliciting Reasoning in Language Models with Cognitive Tools
The recent advent of reasoning models like OpenAI's o1 was met with excited speculation by the AI community about the mechanisms underlying these capabilities in closed models, followed by a rush of replication efforts, particularly from the open source community. These speculations were largely settled by the demonstration from DeepSeek-R1 that chains-of-thought and reinforcement learning (RL) can effectively replicate reasoning on top of base LLMs. However, it remains valuable to explore alternative methods for theoretically eliciting reasoning that could help elucidate the underlying mechanisms, as well as providing additional methods that may offer complementary benefits. Here, we build on the long-standing literature in cognitive psychology and cognitive architectures, which postulates that reasoning arises from the orchestrated, sequential execution of a set of modular, predetermined cognitive operations. Crucially, we implement this key idea within a modern agentic tool-calling framework. In particular, we endow an LLM with a small set of "cognitive tools" encapsulating specific reasoning operations, each executed by the LLM itself. Surprisingly, this simple strategy results in considerable gains in performance on standard mathematical reasoning benchmarks compared to base LLMs, for both closed and open-weight models. For instance, providing our "cognitive tools" to GPT-4.1 increases its pass@1 performance on AIME2024 from 32% to 53%, even surpassing the performance of o1-preview. In addition to its practical implications, this demonstration contributes to the debate regarding the role of post-training methods in eliciting reasoning in LLMs versus the role of inherent capabilities acquired during pre-training, and whether post-training merely uncovers these latent abilities.
comment: 25 pages, 2 figures
♻ ☆ When concept-based XAI is imprecise: Do people distinguish between generalisations and misrepresentations?
Concept-based explainable artificial intelligence (C-XAI) can let people see which representations an AI model has learned. This is particularly important when high-level semantic information (e.g., actions and relations) is used to make decisions about abstract categories (e.g., danger). In such tasks, AI models need to generalise beyond situation-specific details, and this ability can be reflected in C-XAI outputs that randomise over irrelevant features. However, it is unclear whether people appreciate such generalisation and can distinguish it from other, less desirable forms of imprecision in C-XAI outputs. Therefore, the present study investigated how the generality and relevance of C-XAI outputs affect people's evaluation of AI. In an experimental railway safety evaluation scenario, participants rated the performance of a simulated AI that classified traffic scenes involving people as dangerous or not. These classification decisions were explained via concepts in the form of similar image snippets. The latter differed in their match with the classified image, either regarding a highly relevant feature (i.e., people's relation to tracks) or a less relevant feature (i.e., people's action). Contrary to the hypotheses, concepts that generalised over less relevant features were rated lower than concepts that matched the classified image precisely. Moreover, their ratings were no better than those for systematic misrepresentations of the less relevant feature. Conversely, participants were highly sensitive to imprecisions in relevant features. These findings cast doubts on the assumption that people can easily infer from C-XAI outputs whether AI models have gained a deeper understanding of complex situations.
♻ ☆ Diagnosing Hallucination Risk in AI Surgical Decision-Support: A Sequential Framework for Sequential Validation
Large language models (LLMs) offer transformative potential for clinical decision support in spine surgery but pose significant risks through hallucinations, which are factually inconsistent or contextually misaligned outputs that may compromise patient safety. This study introduces a clinician-centered framework to quantify hallucination risks by evaluating diagnostic precision, recommendation quality, reasoning robustness, output coherence, and knowledge alignment. We assessed six leading LLMs across 30 expert-validated spinal cases. DeepSeek-R1 demonstrated superior overall performance (total score: 86.03 $\pm$ 2.08), particularly in high-stakes domains such as trauma and infection. A critical finding reveals that reasoning-enhanced model variants did not uniformly outperform standard counterparts: Claude-3.7-Sonnet's extended thinking mode underperformed relative to its standard version (80.79 $\pm$ 1.83 vs. 81.56 $\pm$ 1.92), indicating extended chain-of-thought reasoning alone is insufficient for clinical reliability. Multidimensional stress-testing exposed model-specific vulnerabilities, with recommendation quality degrading by 7.4% under amplified complexity. This decline contrasted with marginal improvements in rationality (+2.0%), readability (+1.7%) and diagnosis (+4.7%), highlighting a concerning divergence between perceived coherence and actionable guidance. Our findings advocate integrating interpretability mechanisms (e.g., reasoning chain visualization) into clinical workflows and establish a safety-aware validation framework for surgical LLM deployment.
♻ ☆ A Distributionally Robust Framework for Nuisance in Causal Effect Estimation ICONIP 2025
Causal inference requires evaluating models on balanced distributions between treatment and control groups, while training data often exhibits imbalance due to historical decision-making policies. Most conventional statistical methods address this distribution shift through inverse probability weighting (IPW), which requires estimating propensity scores as an intermediate step. These methods face two key challenges: inaccurate propensity estimation and instability from extreme weights. We decompose the generalization error to isolate these issues--propensity ambiguity and statistical instability--and address them through an adversarial loss function. Our approach combines distributionally robust optimization for handling propensity uncertainty with weight regularization based on weighted Rademacher complexity. Experiments on synthetic and real-world datasets demonstrate consistent improvements over existing methods.
comment: The Version of Record of this contribution is published in the Neural Information Processing, ICONIP 2025 Proceedings and is available online at https://doi.org/10.1007/978-981-95-4094-5_19
♻ ☆ A survey of using EHR as real-world evidence for discovering and validating new drug indications
Electronic Health Records (EHRs) have been increasingly used as real-world evidence (RWE) to support the discovery and validation of new drug indications. This paper surveys current approaches to EHR-based drug repurposing, covering data sources, processing methodologies, and representation techniques. It discusses study designs and statistical frameworks for evaluating drug efficacy. Key challenges in validation are discussed, with emphasis on the role of large language models (LLMs) and target trial emulation. By synthesizing recent developments and methodological advances, this work provides a foundational resource for researchers aiming to translate real-world data into actionable drug-repurposing evidence.
♻ ☆ CoBA: Counterbias Text Augmentation for Mitigating Various Spurious Correlations via Semantic Triples EMNLP 2025
Deep learning models often learn and exploit spurious correlations in training data, using these non-target features to inform their predictions. Such reliance leads to performance degradation and poor generalization on unseen data. To address these limitations, we introduce a more general form of counterfactual data augmentation, termed counterbias data augmentation, which simultaneously tackles multiple biases (e.g., gender bias, simplicity bias) and enhances out-of-distribution robustness. We present CoBA: CounterBias Augmentation, a unified framework that operates at the semantic triple level: first decomposing text into subject-predicate-object triples, then selectively modifying these triples to disrupt spurious correlations. By reconstructing the text from these adjusted triples, CoBA generates counterbias data that mitigates spurious patterns. Through extensive experiments, we demonstrate that CoBA not only improves downstream task performance, but also effectively reduces biases and strengthens out-of-distribution resilience, offering a versatile and robust solution to the challenges posed by spurious correlations.
comment: Accepted at EMNLP 2025
♻ ☆ BanditSpec: Adaptive Speculative Decoding via Bandit Algorithms ICML
Speculative decoding has emerged as a popular method to accelerate the inference of Large Language Models (LLMs) while retaining their superior text generation performance. Previous methods either adopt a fixed speculative decoding configuration regardless of the prefix tokens, or train draft models in an offline or online manner to align them with the context. This paper proposes a training-free online learning framework to adaptively choose the configuration of the hyperparameters for speculative decoding as text is being generated. We first formulate this hyperparameter selection problem as a Multi-Armed Bandit problem and provide a general speculative decoding framework BanditSpec. Furthermore, two bandit-based hyperparameter selection algorithms, UCBSpec and EXP3Spec, are designed and analyzed in terms of a novel quantity, the stopping time regret. We upper bound this regret under both stochastic and adversarial reward settings. By deriving an information-theoretic impossibility result, it is shown that the regret performance of UCBSpec is optimal up to universal constants. Finally, extensive empirical experiments with LLaMA3 and Qwen2 demonstrate that our algorithms are effective compared to existing methods, and the throughput is close to the oracle best hyperparameter in simulated real-life LLM serving scenarios with diverse input prompts.
comment: 35 pages, 4 figures, accepted to ICML, typos and affiliations are corrected
♻ ☆ TabDistill: Distilling Transformers into Neural Nets for Few-Shot Tabular Classification
Transformer-based models have shown promising performance on tabular data compared to their classical counterparts such as neural networks and Gradient Boosted Decision Trees (GBDTs) in scenarios with limited training data. They utilize their pre-trained knowledge to adapt to new domains, achieving commendable performance with only a few training examples, also called the few-shot regime. However, the performance gain in the few-shot regime comes at the expense of significantly increased complexity and number of parameters. To circumvent this trade-off, we introduce TabDistill, a new strategy to distill the pre-trained knowledge in complex transformer-based models into simpler neural networks for effectively classifying tabular data. Our framework yields the best of both worlds: being parameter-efficient while performing well with limited training data. The distilled neural networks surpass classical baselines such as regular neural networks, XGBoost and logistic regression under equal training data, and in some cases, even the original transformer-based models that they were distilled from.
♻ ☆ Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation
This report introduces Kandinsky 5.0, a family of state-of-the-art foundation models for high-resolution image and 10-second video synthesis. The framework comprises three core line-up of models: Kandinsky 5.0 Image Lite - a line-up of 6B parameter image generation models, Kandinsky 5.0 Video Lite - a fast and lightweight 2B parameter text-to-video and image-to-video models, and Kandinsky 5.0 Video Pro - 19B parameter models that achieves superior video generation quality. We provide a comprehensive review of the data curation lifecycle - including collection, processing, filtering and clustering - for the multi-stage training pipeline that involves extensive pre-training and incorporates quality-enhancement techniques such as self-supervised fine-tuning (SFT) and reinforcement learning (RL)-based post-training. We also present novel architectural, training, and inference optimizations that enable Kandinsky 5.0 to achieve high generation speeds and state-of-the-art performance across various tasks, as demonstrated by human evaluation. As a large-scale, publicly available generative framework, Kandinsky 5.0 leverages the full potential of its pre-training and subsequent stages to be adapted for a wide range of generative applications. We hope that this report, together with the release of our open-source code and training checkpoints, will substantially advance the development and accessibility of high-quality generative models for the research community.
comment: Website: https://kandinskylab.ai/
♻ ☆ Multimodal Evaluation of Russian-language Architectures
Multimodal large language models (MLLMs) are currently at the center of research attention, showing rapid progress in scale and capabilities, yet their intelligence, limitations, and risks remain insufficiently understood. To address these issues, particularly in the context of the Russian language, where no multimodal benchmarks currently exist, we introduce Mera Multi, an open multimodal evaluation framework for Russian-spoken architectures. The benchmark is instruction-based and encompasses default text, image, audio, and video modalities, comprising 18 newly constructed evaluation tasks for both general-purpose models and modality-specific architectures (image-to-text, video-to-text, and audio-to-text). Our contributions include: (i) a universal taxonomy of multimodal abilities; (ii) 18 datasets created entirely from scratch with attention to Russian cultural and linguistic specificity, unified prompts, and metrics; (iii) baseline results for both closed-source and open-source models; (iv) a methodology for preventing benchmark leakage, including watermarking and licenses for private sets. While our current focus is on Russian, the proposed benchmark provides a replicable methodology for constructing multimodal benchmarks in typologically diverse languages, particularly within the Slavic language family.
♻ ☆ Kaggle Chronicles: 15 Years of Competitions, Community and Data Science Innovation
Since 2010, Kaggle has been a platform where data scientists from around the world come together to compete, collaborate, and push the boundaries of Data Science. Over these 15 years, it has grown from a purely competition-focused site into a broader ecosystem with forums, notebooks, models, datasets, and more. With the release of the Kaggle Meta Code and Kaggle Meta Datasets, we now have a unique opportunity to explore these competitions, technologies, and real-world applications of Machine Learning and AI. And so in this study, we take a closer look at 15 years of data science on Kaggle - through metadata, shared code, community discussions, and the competitions themselves. We explore Kaggle's growth, its impact on the data science community, uncover hidden technological trends, analyze competition winners, how Kagglers approach problems in general, and more. We do this by analyzing millions of kernels and discussion threads to perform both longitudinal trend analysis and standard exploratory data analysis. Our findings show that Kaggle is a steadily growing platform with increasingly diverse use cases, and that Kagglers are quick to adapt to new trends and apply them to real-world challenges, while producing - on average - models with solid generalization capabilities. We also offer a snapshot of the platform as a whole, highlighting its history and technological evolution. Finally, this study is accompanied by a video (https://www.youtube.com/watch?v=YVOV9bIUNrM) and a Kaggle write-up (https://kaggle.com/competitions/meta-kaggle-hackathon/writeups/kaggle-chronicles-15-years-of-competitions-communi) for your convenience.
♻ ☆ VisPlay: Self-Evolving Vision-Language Models from Images
Reinforcement learning (RL) provides a principled framework for improving Vision-Language Models (VLMs) on complex reasoning tasks. However, existing RL approaches often rely on human-annotated labels or task-specific heuristics to define verifiable rewards, both of which are costly and difficult to scale. We introduce VisPlay, a self-evolving RL framework that enables VLMs to autonomously improve their reasoning abilities using large amounts of unlabeled image data. Starting from a single base VLM, VisPlay assigns the model into two interacting roles: an Image-Conditioned Questioner that formulates challenging yet answerable visual questions, and a Multimodal Reasoner that generates silver responses. These roles are jointly trained with Group Relative Policy Optimization (GRPO), which incorporates diversity and difficulty rewards to balance the complexity of generated questions with the quality of the silver answers. VisPlay scales efficiently across two model families. When trained on Qwen2.5-VL and MiMo-VL, VisPlay achieves consistent improvements in visual reasoning, compositional generalization, and hallucination reduction across eight benchmarks, including MM-Vet and MMMU, demonstrating a scalable path toward self-evolving multimodal intelligence. The project page is available at https://bruno686.github.io/VisPlay/
♻ ☆ Policy Search, Retrieval, and Composition via Task Similarity in Collaborative Agentic Systems
Agentic AI aims to create systems that set their own goals, adapt proactively to change, and refine behavior through continuous experience. Recent advances suggest that, when facing multiple and unforeseen tasks, agents could benefit from sharing machine-learned knowledge and reusing policies that have already been fully or partially learned by other agents. However, how to query, select, and retrieve policies from a pool of agents, and how to integrate such policies remains a largely unexplored area. This study explores how an agent decides what knowledge to select, from whom, and when and how to integrate it in its own policy in order to accelerate its own learning. The proposed algorithm, \emph{Modular Sharing and Composition in Collective Learning} (MOSAIC), improves learning in agentic collectives by combining (1) knowledge selection using performance signals and cosine similarity on Wasserstein task embeddings, (2) modular and transferable neural representations via masks, and (3) policy integration, composition and fine-tuning. MOSAIC outperforms isolated learners and global sharing approaches in both learning speed and overall performance, and in some cases solves tasks that isolated agents cannot. The results also demonstrate that selective, goal-driven reuse leads to less susceptibility to task interference. We also observe the emergence of self-organization, where agents solving simpler tasks accelerate the learning of harder ones through shared knowledge.
comment: 24 pages, 20 figures, 8 tables
♻ ☆ Provably Robust Pre-Trained Ensembles for Biomarker-Based Cancer Classification IJCAI 2024
Certain cancer types, notably pancreatic cancer, are difficult to detect at an early stage, motivating robust biomarker-based screening. Liquid biopsies enable non-invasive monitoring of circulating biomarkers, but typical machine learning pipelines for high-dimensional tabular data (e.g., random forests, SVMs) rely on expensive hyperparameter tuning and can be brittle under class imbalance. We leverage a meta-trained Hyperfast model for classifying cancer, accomplishing the highest AUC of 0.9929 and simultaneously achieving robustness especially on highly imbalanced datasets compared to other ML algorithms in several binary classification tasks (e.g. breast invasive carcinoma; BRCA vs. non-BRCA). We also propose a novel ensemble model combining pre-trained Hyperfast model, XGBoost, and LightGBM for multi-class classification tasks, achieving an incremental increase in accuracy (0.9464) while merely using 500 PCA features; distinguishable from previous studies where they used more than 2,000 features for similar results. Crucially, we demonstrate robustness under class imbalance: empirically via balanced accuracy and minority-class recall across cancer-vs.-noncancer and cancer-vs.-rest settings, and theoretically by showing (i) a prototype-form final layer for Hyperfast that yields prior-insensitive decisions under bounded bias, and (ii) minority-error reductions for majority vote under mild error diversity. Together, these results indicate that pre-trained tabular models and simple ensembling can deliver state-of-the-art accuracy and improved minority-class performance with far fewer features and no additional tuning.
comment: Accepted to the AIAA Workshop at IJCAI 2024
♻ ☆ Introducing DEFORMISE: A deep learning framework for dementia diagnosis in the elderly using optimized MRI slice selection
Dementia, a debilitating neurological condition affecting millions worldwide, presents significant diagnostic challenges. In this work, we introduce DEFORMISE, a novel DEep learning Framework for dementia diagnOsis of eldeRly patients using 3D brain Magnetic resonance Imaging (MRI) scans with Optimized Slice sElection. Our approach features a unique technique for selectively processing MRI slices, focusing on the most relevant brain regions and excluding less informative sections. This methodology is complemented by a confidence-based classification committee composed of three novel deep learning models. Tested on the Open OASIS datasets, our method achieved an impressive accuracy of 94.12%, surpassing existing methodologies. Furthermore, validation on the ADNI dataset confirmed the robustness and generalizability of our approach. The use of explainable AI (XAI) techniques and comprehensive ablation studies further substantiate the effectiveness of our techniques, providing insights into the decision-making process and the importance of our methodology. This research offers a significant advancement in dementia diagnosis, providing a highly accurate and efficient tool for clinical applications.
♻ ☆ LoRA on the Go: Instance-level Dynamic LoRA Selection and Merging
Low-Rank Adaptation (LoRA) has emerged as a parameter-efficient approach for fine-tuning large language models. However, conventional LoRA adapters are typically trained for a single task, limiting their applicability in real-world settings where inputs may span diverse and unpredictable domains. At inference time, existing approaches combine multiple LoRAs for improving performance on diverse tasks, while usually requiring labeled data or additional task-specific training, which is expensive at scale. In this work, we introduce LoRA on the Go (LoGo), a training-free framework that dynamically selects and merges adapters at the instance level without any additional requirements. LoGo leverages signals extracted from a single forward pass through LoRA adapters, to identify the most relevant adapters and determine their contributions on-the-fly. Across 5 NLP benchmarks, 27 datasets, and 3 model families, LoGo outperforms training-based baselines on some tasks upto a margin of 3.6% while remaining competitive on other tasks and maintaining inference throughput, highlighting its effectiveness and practicality.
♻ ☆ Taming Uncertainty via Automation: Observing, Analyzing, and Optimizing Agentic AI Systems
Large Language Models (LLMs) are increasingly deployed within agentic systems - collections of interacting, LLM-powered agents that execute complex, adaptive workflows using memory, tools, and dynamic planning. While enabling powerful new capabilities, these systems also introduce unique forms of uncertainty stemming from probabilistic reasoning, evolving memory states, and fluid execution paths. Traditional software observability and operations practices fall short in addressing these challenges. This paper presents our vision of AgentOps: a comprehensive framework for observing, analyzing, optimizing, and automating operation of agentic AI systems. We identify distinct needs across four key roles - developers, testers, site reliability engineers (SREs), and business users - each of whom engages with the system at different points in its lifecycle. We present the AgentOps Automation Pipeline, a six-stage process encompassing behavior observation, metric collection, issue detection, root cause analysis, optimized recommendations, and runtime automation. Throughout, we emphasize the critical role of automation in managing uncertainty and enabling self-improving AI systems - not by eliminating uncertainty, but by taming it to ensure safe, adaptive, and effective operation.
♻ ☆ From Confidence to Collapse in LLM Factual Robustness
Ensuring the robustness of factual knowledge in LLMs is critical for reliable applications in tasks such as question answering and reasoning. However, existing evaluation methods predominantly focus on performance-based metrics, often investigating from the perspective of prompt perturbations, which captures only the externally triggered side of knowledge robustness. To bridge this gap, we introduce a principled approach to measure factual robustness from the perspective of the generation process by analyzing token distribution entropy in combination with temperature scaling sensitivity. These two factors build the Factual Robustness Score (FRS), a novel metric which quantifies the stability of a fact against perturbations in decoding conditions, given its initial uncertainty. To validate our approach, we conduct extensive experiments on 5 LLMs across 3 closed-book QA datasets (SQuAD, TriviaQA, and HotpotQA). We show that factual robustness varies significantly -- smaller models report an FRS of $0.76$, larger ones $0.93$ -- with accuracy degrading by ~$60\%$ under increased uncertainty. These insights demonstrate how entropy and temperature scaling impact factual accuracy, and lay a foundation for developing more robust knowledge retention and retrieval in future models.
♻ ☆ Oracular Programming: A Modular Foundation for Building LLM-Enabled Software
Large Language Models can solve a wide range of tasks from just a few examples, but they remain difficult to steer and lack a capability essential for building reliable software at scale: the modular composition of computations under enforceable contracts. As a result, they are typically embedded in larger software pipelines that use domain-specific knowledge to decompose tasks and improve reliability through validation and search. Yet the complexity of writing, tuning, and maintaining such pipelines has so far limited their sophistication. We propose oracular programming: a foundational paradigm for integrating traditional, explicit computations with inductive oracles such as LLMs. It rests on two directing principles: the full separation of core and search logic, and the treatment of few-shot examples as grounded and evolvable program components. Within this paradigm, experts express high-level problem-solving strategies as programs with unresolved choice points. These choice points are resolved at runtime by LLMs, which generalize from user-provided examples of correct and incorrect decisions. An oracular program is composed of three orthogonal components: a strategy that consists in a nondeterministic program with choice points that can be reified into a search tree, a policy that specifies how to navigate this tree with the help of LLM oracles, and a set of demonstrations that describe successful and unsuccessful tree navigation scenarios across diverse problem instances. Each component is expressed in a dedicated programming language and can be independently improved or substituted. We address the key programming language design challenges of modularly composing oracular programs and enforcing consistency between their components as they evolve.
♻ ☆ LSAP: Rethinking Inversion Fidelity, Perception and Editability in GAN Latent Space
As research on image inversion advances, the process is generally divided into two stages. The first step is Image Embedding, involves using an encoder or optimization procedure to embed an image and obtain its corresponding latent code. The second stage, referred to as Result Refinement, further improves the inversion and editing outcomes. Although this refinement stage substantially enhances reconstruction fidelity, perception and editability remain largely unchanged and are highly dependent on the latent codes derived from the first stage. Therefore, a key challenge lies in obtaining latent codes that preserve reconstruction fidelity while simultaneously improving perception and editability. In this work, we first reveal that these two properties are closely related to the degree of alignment (or disalignment) between the inverted latent codes and the synthetic distribution. Based on this insight, we propose the \textbf{ Latent Space Alignment Inversion Paradigm (LSAP)}, which integrates both an evaluation metric and a unified inversion solution. Specifically, we introduce the \textbf{Normalized Style Space ($\mathcal{S^N}$ space)} and \textbf{Normalized Style Space Cosine Distance (NSCD)} to quantify the disalignment of inversion methods. Moreover, our paradigm can be optimized for both encoder-based and optimization-based embeddings, providing a consistent alignment framework. Extensive experiments across various domains demonstrate that NSCD effectively captures perceptual and editable characteristics, and that our alignment paradigm achieves state-of-the-art performance in both stages of inversion.
comment: under review
♻ ☆ Injecting Falsehoods: Adversarial Man-in-the-Middle Attacks Undermining Factual Recall in LLMs
LLMs are now an integral part of information retrieval. As such, their role as question answering chatbots raises significant concerns due to their shown vulnerability to adversarial man-in-the-middle (MitM) attacks. Here, we propose the first principled attack evaluation on LLM factual memory under prompt injection via Xmera, our novel, theory-grounded MitM framework. By perturbing the input given to "victim" LLMs in three closed-book and fact-based QA settings, we undermine the correctness of the responses and assess the uncertainty of their generation process. Surprisingly, trivial instruction-based attacks report the highest success rate (up to ~85.3%) while simultaneously having a high uncertainty for incorrectly answered questions. To provide a simple defense mechanism against Xmera, we train Random Forest classifiers on the response uncertainty levels to distinguish between attacked and unattacked queries (average AUC of up to ~96%). We believe that signaling users to be cautious about the answers they receive from black-box and potentially corrupt LLMs is a first checkpoint toward user cyberspace safety.
♻ ☆ As If We've Met Before: LLMs Exhibit Certainty in Recognizing Seen Files
The remarkable language ability of Large Language Models (LLMs) stems from extensive training on vast datasets, often including copyrighted material, which raises serious concerns about unauthorized use. While Membership Inference Attacks (MIAs) offer potential solutions for detecting such violations, existing approaches face critical limitations and challenges due to LLMs' inherent overconfidence, limited access to ground truth training data, and reliance on empirically determined thresholds. We present COPYCHECK, a novel framework that leverages uncertainty signals to detect whether copyrighted content was used in LLM training sets. Our method turns LLM overconfidence from a limitation into an asset by capturing uncertainty patterns that reliably distinguish between ``seen" (training data) and ``unseen" (non-training data) content. COPYCHECK further implements a two-fold strategy: (1) strategic segmentation of files into smaller snippets to reduce dependence on large-scale training data, and (2) uncertainty-guided unsupervised clustering to eliminate the need for empirically tuned thresholds. Experiment results show that COPYCHECK achieves an average balanced accuracy of 90.1% on LLaMA 7b and 91.6% on LLaMA2 7b in detecting seen files. Compared to the SOTA baseline, COPYCHECK achieves over 90% relative improvement, reaching up to 93.8\% balanced accuracy. It further exhibits strong generalizability across architectures, maintaining high performance on GPT-J 6B. This work presents the first application of uncertainty for copyright detection in LLMs, offering practical tools for training data transparency.
♻ ☆ LLMDistill4Ads: Using Cross-Encoders to Distill from LLM Signals for Advertiser Keyphrase Recommendations
E-commerce sellers are advised to bid on keyphrases to boost their advertising campaigns. These keyphrases must be relevant to prevent irrelevant items from cluttering search systems and to maintain positive seller perception. It is vital that keyphrase suggestions align with seller, search and buyer judgments. Given the challenges in collecting negative feedback in these systems, LLMs have been used as a scalable proxy to human judgments. This paper presents an empirical study on a major ecommerce platform of a distillation framework involving an LLM teacher, a cross-encoder assistant and a bi-encoder Embedding Based Retrieval (EBR) student model, aimed at mitigating click-induced biases in keyphrase recommendations.
♻ ☆ TRADES: Generating Realistic Market Simulations with Diffusion Models
Financial markets are complex systems characterized by high statistical noise, nonlinearity, volatility, and constant evolution. Thus, modeling them is extremely hard. Here, we address the task of generating realistic and responsive Limit Order Book (LOB) market simulations, which are fundamental for calibrating and testing trading strategies, performing market impact experiments, and generating synthetic market data. We propose a novel TRAnsformer-based Denoising Diffusion Probabilistic Engine for LOB Simulations (TRADES). TRADES generates realistic order flows as time series conditioned on the state of the market, leveraging a transformer-based architecture that captures the temporal and spatial characteristics of high-frequency market data. There is a notable absence of quantitative metrics for evaluating generative market simulation models in the literature. To tackle this problem, we adapt the predictive score, a metric measured as an MAE, to market data by training a stock price predictive model on synthetic data and testing it on real data. We compare TRADES with previous works on two stocks, reporting a 3.27 and 3.48 improvement over SoTA according to the predictive score, demonstrating that we generate useful synthetic market data for financial downstream tasks. Furthermore, we assess TRADES's market simulation realism and responsiveness, showing that it effectively learns the conditional data distribution and successfully reacts to an experimental agent, giving sprout to possible calibrations and evaluations of trading strategies and market impact experiments. To perform the experiments, we developed DeepMarket, the first open-source Python framework for LOB market simulation with deep learning. In our repository, we include a synthetic LOB dataset composed of TRADES's generated simulations.
comment: 8 pages
♻ ☆ From Static to Adaptive Defense: Federated Multi-Agent Deep Reinforcement Learning-Driven Moving Target Defense Against DoS Attacks in UAV Swarm Networks
The proliferation of UAVs has enabled a wide range of mission-critical applications and is becoming a cornerstone of low-altitude networks, supporting smart cities, emergency response, and more. However, the open wireless environment, dynamic topology, and resource constraints of UAVs expose low-altitude networks to severe DoS threats. Traditional defense approaches, which rely on fixed configurations or centralized decision-making, cannot effectively respond to the rapidly changing conditions in UAV swarm environments. To address these challenges, we propose a novel federated multi-agent deep reinforcement learning (FMADRL)-driven moving target defense (MTD) framework for proactive DoS mitigation in low-altitude networks. Specifically, we design lightweight and coordinated MTD mechanisms, including leader switching, route mutation, and frequency hopping, to disrupt attacker efforts and enhance network resilience. The defense problem is formulated as a multi-agent partially observable Markov decision process, capturing the uncertain nature of UAV swarms under attack. Each UAV is equipped with a policy agent that autonomously selects MTD actions based on partial observations and local experiences. By employing a policy gradient-based algorithm, UAVs collaboratively optimize their policies via reward-weighted aggregation. Extensive simulations demonstrate that our approach significantly outperforms state-of-the-art baselines, achieving up to a 34.6% improvement in attack mitigation rate, a reduction in average recovery time of up to 94.6%, and decreases in energy consumption and defense cost by as much as 29.3% and 98.3%, respectively, under various DoS attack strategies. These results highlight the potential of intelligent, distributed defense mechanisms to protect low-altitude networks, paving the way for reliable and scalable low-altitude economy.
comment: 15pages; Accepted by IEEE TCCN
♻ ☆ Do Not Merge My Model! Safeguarding Open-Source LLMs Against Unauthorized Model Merging AAAI 2026
Model merging has emerged as an efficient technique for expanding large language models (LLMs) by integrating specialized expert models. However, it also introduces a new threat: model merging stealing, where free-riders exploit models through unauthorized model merging. Unfortunately, existing defense mechanisms fail to provide effective protection. Specifically, we identify three critical protection properties that existing methods fail to simultaneously satisfy: (1) proactively preventing unauthorized merging; (2) ensuring compatibility with general open-source settings; (3) achieving high security with negligible performance loss. To address the above issues, we propose MergeBarrier, a plug-and-play defense that proactively prevents unauthorized merging. The core design of MergeBarrier is to disrupt the Linear Mode Connectivity (LMC) between the protected model and its homologous counterparts, thereby eliminating the low-loss path required for effective model merging. Extensive experiments show that MergeBarrier effectively prevents model merging stealing with negligible accuracy loss.
comment: Accepted by AAAI 2026 Conference
♻ ☆ Benchmarking Multi-Step Legal Reasoning and Analyzing Chain-of-Thought Effects in Large Language Models
Large language models (LLMs) have demonstrated strong reasoning abilities across specialized domains, motivating research into their application to legal reasoning. However, existing legal benchmarks often conflate factual recall with genuine inference, fragment the reasoning process, and overlook the quality of reasoning. To address these limitations, we introduce MSLR, the first Chinese multi-step legal reasoning dataset grounded in real-world judicial decision making. MSLR adopts the IRAC framework (Issue, Rule, Application, Conclusion) to model structured expert reasoning from official legal documents. In addition, we design a scalable Human-LLM collaborative annotation pipeline that efficiently produces fine-grained step-level reasoning annotations and provides a reusable methodological framework for multi-step reasoning datasets. Evaluation of multiple LLMs on MSLR shows only moderate performance, highlighting the challenges of adapting to complex legal reasoning. Further experiments demonstrate that Self-Initiated Chain-of-Thought prompts generated by models autonomously improve reasoning coherence and quality, outperforming human-designed prompts. MSLR contributes to advancing LLM reasoning and Chain-of-Thought strategies and offers open resources for future research. The dataset and code are available at https://github.com/yuwenhan07/MSLR-Bench and https://law.sjtu.edu.cn/flszyjzx/index.html.
comment: 21 pages, 7 figures
♻ ☆ Atomic Calibration of LLMs in Long-Form Generations ACL 2025
Large language models (LLMs) often suffer from hallucinations, posing significant challenges for real-world applications. Confidence calibration, as an effective indicator of hallucination, is thus essential to enhance the trustworthiness of LLMs. Prior work mainly focuses on short-form tasks using a single response-level score (macro calibration), which is insufficient for long-form outputs that may contain both accurate and inaccurate claims. In this work, we systematically study atomic calibration, which evaluates factuality calibration at a fine-grained level by decomposing long responses into atomic claims. We further categorize existing confidence elicitation methods into discriminative and generative types, and propose two new confidence fusion strategies to improve calibration. Our experiments demonstrate that LLMs exhibit poorer calibration at the atomic level during long-form generation. More importantly, atomic calibration uncovers insightful patterns regarding the alignment of confidence methods and the changes of confidence throughout generation. This sheds light on future research directions for confidence estimation in long-form generation.
comment: ACL 2025 KnowFM Oral / AACL-IJCNLP 2025
♻ ☆ How many patients could we save with LLM priors?
Imagine a world where clinical trials need far fewer patients to achieve the same statistical power, thanks to the knowledge encoded in large language models (LLMs). We present a novel framework for hierarchical Bayesian modeling of adverse events in multi-center clinical trials, leveraging LLM-informed prior distributions. Unlike data augmentation approaches that generate synthetic data points, our methodology directly obtains parametric priors from the model. Our approach systematically elicits informative priors for hyperparameters in hierarchical Bayesian models using a pre-trained LLM, enabling the incorporation of external clinical expertise directly into Bayesian safety modeling. Through comprehensive temperature sensitivity analysis and rigorous cross-validation on real-world clinical trial data, we demonstrate that LLM-derived priors consistently improve predictive performance compared to traditional meta-analytical approaches. This methodology paves the way for more efficient and expert-informed clinical trial design, enabling substantial reductions in the number of patients required to achieve robust safety assessment and with the potential to transform drug safety monitoring and regulatory decision making.
comment: 9 pages, 4 figures
♻ ☆ An Iterative Question-Guided Framework for Knowledge Base Question Answering ACL 2025
Large Language Models (LLMs) excel in many natural language processing tasks but often exhibit factual inconsistencies in knowledge-intensive settings. Integrating external knowledge resources, particularly knowledge graphs (KGs), provides a transparent and updatable foundation for more reliable reasoning. Knowledge Base Question Answering (KBQA), which queries and reasons over KGs, is central to this effort, especially for complex, multi-hop queries. However, multi-hop reasoning poses two key challenges: (1)~maintaining coherent reasoning paths, and (2)~avoiding prematurely discarding critical multi-hop connections. To tackle these challenges, we introduce iQUEST, a question-guided KBQA framework that iteratively decomposes complex queries into simpler sub-questions, ensuring a structured and focused reasoning trajectory. Additionally, we integrate a Graph Neural Network (GNN) to look ahead and incorporate 2-hop neighbor information at each reasoning step. This dual approach strengthens the reasoning process, enabling the model to explore viable paths more effectively. Detailed experiments demonstrate the consistent improvement delivered by iQUEST across four benchmark datasets and four LLMs.
comment: Accepted to the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025), Main Track
♻ ☆ MAQuA: Adaptive Question-Asking for Multidimensional Mental Health Screening using Item Response Theory
Recent advances in large language models (LLMs) offer new opportunities for scalable, interactive mental health assessment, but excessive querying by LLMs burdens users and is inefficient for real-world screening across transdiagnostic symptom profiles. We introduce MAQuA, an adaptive question-asking framework for simultaneous, multidimensional mental health screening. Combining multi-outcome modeling on language responses with item response theory (IRT) and factor analysis, MAQuA selects the questions with most informative responses across multiple dimensions at each turn to optimize diagnostic information, improving accuracy and potentially reducing response burden. Empirical results on a novel dataset reveal that MAQuA reduces the number of assessment questions required for score stabilization by 50-87% compared to random ordering (e.g., achieving stable depression scores with 71% fewer questions and eating disorder scores with 85% fewer questions). MAQuA demonstrates robust performance across both internalizing (depression, anxiety) and externalizing (substance use, eating disorder) domains, with early stopping strategies further reducing patient time and burden. These findings position MAQuA as a powerful and efficient tool for scalable, nuanced, and interactive mental health screening, advancing the integration of LLM-based agents into real-world clinical workflows.
♻ ☆ Finetuning LLMs for Automatic Form Interaction on Web-Browser in Selenium Testing Framework
Automated web application testing is a critical component of modern software development, with frameworks like Selenium widely adopted for validating functionality through browser automation. Among the essential aspects of such testing is the ability to interact with and validate web forms, a task that requires syntactically correct, executable scripts with high coverage of input fields. Despite its importance, this task remains underexplored in the context of large language models (LLMs), and no public benchmark or dataset exists to evaluate LLMs on form interaction generation systematically. This paper introduces a novel method for training LLMs to generate high-quality test cases in Selenium, specifically targeting form interaction testing. We curate both synthetic and human-annotated datasets for training and evaluation, covering diverse real-world forms and testing scenarios. We define clear metrics for syntax correctness, script executability, and input field coverage. Our empirical study demonstrates that our approach significantly outperforms strong baselines, including GPT-4o and other popular LLMs, across all evaluation metrics. Our work lays the groundwork for future research on LLM-based web testing and provides resources to support ongoing progress in this area.
comment: Published in the Proceedings of KSE 2025
♻ ☆ CaberNet: Causal Representation Learning for Cross-Domain HVAC Energy Prediction
Cross-domain HVAC energy prediction is essential for scalable building energy management, particularly because collecting extensive labeled data for every new building is both costly and impractical. Yet, this task remains highly challenging due to the scarcity and heterogeneity of data across different buildings, climate zones, and seasonal patterns. In particular, buildings situated in distinct climatic regions introduce variability that often leads existing methods to overfit to spurious correlations, rely heavily on expert intervention, or compromise on data diversity. To address these limitations, we propose CaberNet, a causal and interpretable deep sequence model that learns invariant (Markov blanket) representations for robust cross-domain prediction. In a purely data-driven fashion and without requiring any prior knowledge, CaberNet integrates i) a global feature gate trained with a self-supervised Bernoulli regularization to distinguish superior causal features from inferior ones, and ii) a domain-wise training scheme that balances domain contributions, minimizes cross-domain loss variance, and promotes latent factor independence. We evaluate CaberNet on real-world datasets collected from three buildings located in three climatically diverse cities, and it consistently outperforms all baselines, achieving a 22.9% reduction in normalized mean squared error (NMSE) compared to the best benchmark. Our code is available at https://github.com/SusCom-Lab/CaberNet-CRL.
comment: Accepted at ACM e-Energy 2026
♻ ☆ Multi-dimensional Data Analysis and Applications Basing on LLM Agents and Knowledge Graph Interactions
In the current era of big data, extracting deep insights from massive, heterogeneous, and complexly associated multi-dimensional data has become a significant challenge. Large Language Models (LLMs) perform well in natural language understanding and generation, but still suffer from "hallucination" issues when processing structured knowledge and are difficult to update in real-time. Although Knowledge Graphs (KGs) can explicitly store structured knowledge, their static nature limits dynamic interaction and analytical capabilities. Therefore, this paper proposes a multi-dimensional data analysis method based on the interactions between LLM agents and KGs, constructing a dynamic, collaborative analytical ecosystem. This method utilizes LLM agents to automatically extract product data from unstructured data, constructs and visualizes the KG in real-time, and supports users in deep exploration and analysis of graph nodes through an interactive platform. Experimental results show that this method has significant advantages in product ecosystem analysis, relationship mining, and user-driven exploratory analysis, providing new ideas and tools for multi-dimensional data analysis.
comment: 14 pages, 7 figures, 40 references
♻ ☆ HiViS: Hiding Visual Tokens from the Drafter for Speculative Decoding in Vision-Language Models
Speculative decoding has proven effective for accelerating inference in Large Language Models (LLMs), yet its extension to Vision-Language Models (VLMs) remains limited by the computational burden and semantic inconsistency introduced by visual tokens. Recent studies reveal that visual tokens in large VLMs are highly redundant, and most of them can be removed without compromising generation quality. Motivated by this observation, we propose HiViS (Hiding Visual Tokens from the Drafter for Speculative Decoding in Vision-Language Models), a framework that utilizes the target VLM as a semantic fusion model, allowing the drafter to obtain visual information without explicitly processing visual tokens, ensuring that the drafter's prefill sequence length matches that of the textual tokens. Furthermore, HiViS employs a time-step-aware aligned training scheme that allows the drafter to autonomously propagate and refine instructive visual-textual semantics during independent drafting, guided by step-dependent bias-correction residuals. Extensive experiments across representative VLMs and benchmarks demonstrate that HiViS achieves significant improvements in average acceptance length and speedup ratio.
♻ ☆ PepThink-R1: LLM for Interpretable Cyclic Peptide Optimization with CoT SFT and Reinforcement Learning
Designing therapeutic peptides with tailored properties is hindered by the vastness of sequence space, limited experimental data, and poor interpretability of current generative models. To address these challenges, we introduce PepThink-R1, a generative framework that integrates large language models (LLMs) with chain-of-thought (CoT) supervised fine-tuning and reinforcement learning (RL). Unlike prior approaches, PepThink-R1 explicitly reasons about monomer-level modifications during sequence generation, enabling interpretable design choices while optimizing for multiple pharmacological properties. Guided by a tailored reward function balancing chemical validity and property improvements, the model autonomously explores diverse sequence variants. We demonstrate that PepThink-R1 generates cyclic peptides with significantly enhanced lipophilicity, stability, and exposure, outperforming existing general LLMs (e.g., GPT-5) and domain-specific baseline in both optimization success and interpretability. To our knowledge, this is the first LLM-based peptide design framework that combines explicit reasoning with RL-driven property control, marking a step toward reliable and transparent peptide optimization for therapeutic discovery.
♻ ☆ GAPO: Robust Advantage Estimation for Real-World Code LLMs
Reinforcement learning (RL) is widely used for post-training large language models (LLMs) in code editing, where group-relative methods like GRPO are popular for their critic-free, normalized advantage estimation. However, in real-world code-editing scenarios, reward distributions are often skewed with unpredictable outliers, leading to distorted advantage computation and increased noise. To address this issue, we propose Group Adaptive Policy Optimization (GAPO), which adaptively finds an outlier-free highest-density interval (HDI) per prompt and then uses the median of that interval as an adaptive Q to replace the group mean in advantage calculation. This adaptive Q robustly handles skewed distributions while remaining plug-and-play and efficient. We validate GAPO on nine instruction-tuned LLMs (3B-14B) using a large internal dataset of 51,844 real-world, history-aware code-editing tasks across 10 languages, demonstrating consistent improvements in exact match accuracy over GRPO and its variant DAPO. Code is publicly available.
♻ ☆ Beyond Bias Scores: Unmasking Vacuous Neutrality in Small Language Models
The rapid adoption of Small Language Models (SLMs) for resource constrained applications has outpaced our understanding of their ethical and fairness implications. To address this gap, we introduce the Vacuous Neutrality Framework (VaNeu), a multi-dimensional evaluation paradigm designed to assess SLM fairness prior to deployment. The framework examines model robustness across four stages - biases, utility, ambiguity handling, and positional bias over diverse social bias categories. To the best of our knowledge, this work presents the first large-scale audit of SLMs in the 0.5-5B parameter range, an overlooked "middle tier" between BERT-class encoders and flagship LLMs. We evaluate nine widely used SLMs spanning four model families under both ambiguous and disambiguated contexts. Our findings show that models demonstrating low bias in early stages often fail subsequent evaluations, revealing hidden vulnerabilities and unreliable reasoning. These results underscore the need for a more comprehensive understanding of fairness and reliability in SLMs, and position the proposed framework as a principled tool for responsible deployment in socially sensitive settings.
♻ ☆ MMVA: Multimodal Matching Based on Valence and Arousal across Images, Music, and Musical Captions AAAI 2025
We introduce Multimodal Matching based on Valence and Arousal (MMVA), a tri-modal encoder framework designed to capture emotional content across images, music, and musical captions. To support this framework, we expand the Image-Music-Emotion-Matching-Net (IMEMNet) dataset, creating IMEMNet-C which includes 24,756 images and 25,944 music clips with corresponding musical captions. We employ multimodal matching scores based on the continuous valence (emotional positivity) and arousal (emotional intensity) values. This continuous matching score allows for random sampling of image-music pairs during training by computing similarity scores from the valence-arousal values across different modalities. Consequently, the proposed approach achieves state-of-the-art performance in valence-arousal prediction tasks. Furthermore, the framework demonstrates its efficacy in various zeroshot tasks, highlighting the potential of valence and arousal predictions in downstream applications.
comment: Paper accepted in Artificial Intelligence for Music workshop at AAAI 2025
♻ ☆ Towards Efficient Multimodal Unified Reasoning Model via Model Merging
Although Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities across diverse tasks, they encounter challenges in terms of reasoning efficiency, large model size and overthinking. However, existing lightweight MLLMs lack the capability to balance high efficiency and performance at a small scale. To this end, we propose Tiny-R1V, a novel lightweight 3B model that achieves faster inference and higher accuracy via a two-stage optimization, while unifying multimodal reasoning across multiple tasks with fewer inference tokens. In the first stage, Tiny-R1V introduces Length-Informed Relative Policy Optimization (LIPO), a new reinforcement learning method, to train each reasoning model, including mathematical reasoning, chart reasoning, and OCR capability. The LIPO dynamically adjusts the advantages of responses within groups by prioritizing concise yet high-quality responses to encourage the generation of shorter and more accurate responses. In the second stage, we propose Adaptive Model Merging (AMM), a training-free model merging method that merges multiple specialist models into a unified architecture. Specifically, AMM adaptively adjusts the weights of task vectors via a novel gradient projection regularization loss function, thus mitigating redundant conflicts between them. Extensive evaluations on ten widely-used reasoning benchmarks covering mathematics, structured data (charts, tables, documents), OCR, and general capabilities showcase the superior performance of Tiny-R1V, enabling lightweight models to excel in diverse multimodal reasoning tasks. Code will be available at \href{https://github.com/buptyqx/Tiny-R1V}{https://github.com/buptyqx/Tiny-R1V}
comment: Technical report, Code will be available at https://github.com/buptyqx/Tiny-R1V
♻ ☆ Recent Advances in Discrete Speech Tokens: A Review
The rapid advancement of speech generation technologies in the era of large language models (LLMs) has established discrete speech tokens as a foundational paradigm for speech representation. These tokens, characterized by their discrete, compact, and concise nature, are not only advantageous for efficient transmission and storage, but also inherently compatible with the language modeling framework, enabling seamless integration of speech into text-dominated LLM architectures. Current research categorizes discrete speech tokens into two principal classes: acoustic tokens and semantic tokens, each of which has evolved into a rich research domain characterized by unique design philosophies and methodological approaches. This survey systematically synthesizes the existing taxonomy and recent innovations in discrete speech tokenization, conducts a critical examination of the strengths and limitations of each paradigm, and presents systematic experimental comparisons across token types. Furthermore, we identify persistent challenges in the field and propose potential research directions, aiming to offer actionable insights to inspire future advancements in the development and application of discrete speech tokens.
comment: 26 pages, 8 figures, 3 tables. This version is a major revision of the previous one, including reorganization of the section structure, more experimental results, and extensive revisions to both text and figures
♻ ☆ Property-guided Inverse Design of Metal-Organic Frameworks Using Quantum Natural Language Processing
In this study, we explore the potential of using quantum natural language processing (QNLP) to inverse design metal-organic frameworks (MOFs) with targeted properties. Specifically, by analyzing 450 hypothetical MOF structures consisting of 3 topologies, 10 metal nodes and 15 organic ligands, we categorize these structures into four distinct classes for pore volume and $CO_{2}$ Henry's constant values. We then compare various QNLP models (i.e. the bag-of-words, DisCoCat (Distributional Compositional Categorical), and sequence-based models) to identify the most effective approach to process the MOF dataset. Using a classical simulator provided by the IBM Qiskit, the bag-of-words model is identified to be the optimum model, achieving validation accuracies of 88.6% and 78.0% for binary classification tasks on pore volume and $CO_{2}$ Henry's constant, respectively. Further, we developed multi-class classification models tailored to the probabilistic nature of quantum circuits, with average test accuracies of 92% and 80% across different classes for pore volume and $CO_{2}$ Henry's constant datasets. Finally, the performance of generating MOF with target properties showed accuracies of 93.5% for pore volume and 87% for $CO_{2}$ Henry's constant, respectively. Although our investigation covers only a fraction of the vast MOF search space, it marks a promising first step towards using quantum computing for materials design, offering a new perspective through which to explore the complex landscape of MOFs.
comment: 46 pages, 7 figures, 6 supplementary figures, 1 table, 2 supplementary tables, 1 supplementary note
♻ ☆ CSI-Bench: A Large-Scale In-the-Wild Dataset for Multi-task WiFi Sensing NeurIPS
WiFi sensing has emerged as a compelling contactless modality for human activity monitoring by capturing fine-grained variations in Channel State Information (CSI). Its ability to operate continuously and non-intrusively while preserving user privacy makes it particularly suitable for health monitoring. However, existing WiFi sensing systems struggle to generalize in real-world settings, largely due to datasets collected in controlled environments with homogeneous hardware and fragmented, session-based recordings that fail to reflect continuous daily activity. We present CSI-Bench, a large-scale, in-the-wild benchmark dataset collected using commercial WiFi edge devices across 26 diverse indoor environments with 35 real users. Spanning over 461 hours of effective data, CSI-Bench captures realistic signal variability under natural conditions. It includes task-specific datasets for fall detection, breathing monitoring, localization, and motion source recognition, as well as a co-labeled multitask dataset with joint annotations for user identity, activity, and proximity. To support the development of robust and generalizable models, CSI-Bench provides standardized evaluation splits and baseline results for both single-task and multi-task learning. CSI-Bench offers a foundation for scalable, privacy-preserving WiFi sensing systems in health and broader human-centric applications.
comment: 26 pages, 5 figures, accepted by Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS)
♻ ☆ Statistically Assuring Safety of Control Systems using Ensembles of Safety Filters and Conformal Prediction
Safety assurance is a fundamental requirement for deploying learning-enabled autonomous systems. Hamilton-Jacobi (HJ) reachability analysis is a fundamental method for formally verifying safety and generating safe controllers. However, computing the HJ value function that characterizes the backward reachable set (BRS) of a set of user-defined failure states is computationally expensive, especially for high-dimensional systems, motivating the use of reinforcement learning approaches to approximate the value function. Unfortunately, a learned value function and its corresponding safe policy are not guaranteed to be correct. The learned value function evaluated at a given state may not be equal to the actual safety return achieved by following the learned safe policy. To address this challenge, we introduce a conformal prediction-based (CP) framework that bounds such uncertainty. We leverage CP to provide probabilistic safety guarantees when using learned HJ value functions and policies to prevent control systems from reaching failure states. Specifically, we use CP to calibrate the switching between the unsafe nominal controller and the learned HJ-based safe policy and to derive safety guarantees under this switched policy. We also investigate using an ensemble of independently trained HJ value functions as a safety filter and compare this ensemble approach to using individual value functions alone.
♻ ☆ CompTrack: Information Bottleneck-Guided Low-Rank Dynamic Token Compression for Point Cloud Tracking AAAI 2026
3D single object tracking (SOT) in LiDAR point clouds is a critical task in computer vision and autonomous driving. Despite great success having been achieved, the inherent sparsity of point clouds introduces a dual-redundancy challenge that limits existing trackers: (1) vast spatial redundancy from background noise impairs accuracy, and (2) informational redundancy within the foreground hinders efficiency. To tackle these issues, we propose CompTrack, a novel end-to-end framework that systematically eliminates both forms of redundancy in point clouds. First, CompTrack incorporates a Spatial Foreground Predictor (SFP) module to filter out irrelevant background noise based on information entropy, addressing spatial redundancy. Subsequently, its core is an Information Bottleneck-guided Dynamic Token Compression (IB-DTC) module that eliminates the informational redundancy within the foreground. Theoretically grounded in low-rank approximation, this module leverages an online SVD analysis to adaptively compress the redundant foreground into a compact and highly informative set of proxy tokens. Extensive experiments on KITTI, nuScenes and Waymo datasets demonstrate that CompTrack achieves top-performing tracking performance with superior efficiency, running at a real-time 90 FPS on a single RTX 3090 GPU.
comment: Accepted by AAAI 2026 (Oral)
♻ ☆ OEMA: Ontology-Enhanced Multi-Agent Collaboration Framework for Zero-Shot Clinical Named Entity Recognition
With the rapid expansion of unstructured clinical texts in electronic health records (EHRs), clinical named entity recognition (NER) has become a crucial technique for extracting medical information. However, traditional supervised models such as CRF and BioClinicalBERT suffer from high annotation costs. Although zero-shot NER based on large language models (LLMs) reduces the dependency on labeled data, challenges remain in aligning example selection with task granularity and effectively integrating prompt design with self-improvement frameworks. To address these limitations, we propose OEMA, a novel zero-shot clinical NER framework based on multi-agent collaboration. OEMA consists of three core components: (1) a self-annotator that autonomously generates candidate examples; (2) a discriminator that leverages SNOMED CT to filter token-level examples by clinical relevance; and (3) a predictor that incorporates entity-type descriptions to enhance inference accuracy. Experimental results on two benchmark datasets, MTSamples and VAERS, demonstrate that OEMA achieves state-of-the-art performance under exact-match evaluation. Moreover, under related-match criteria, OEMA performs comparably to the supervised BioClinicalBERT model while significantly outperforming the traditional CRF method. OEMA improves zero-shot clinical NER, achieving near-supervised performance under related-match criteria. Future work will focus on continual learning and open-domain adaptation to expand its applicability in clinical NLP.
comment: 12 pages, 4 figures, 4 tables
♻ ☆ SafeRBench: A Comprehensive Benchmark for Safety Assessment in Large Reasoning Models
Large Reasoning Models (LRMs) improve answer quality through explicit chain-of-thought, yet this very capability introduces new safety risks: harmful content can be subtly injected, surface gradually, or be justified by misleading rationales within the reasoning trace. Existing safety evaluations, however, primarily focus on output-level judgments and rarely capture these dynamic risks along the reasoning process. In this paper, we present SafeRBench, the first benchmark that assesses LRM safety end-to-end -- from inputs and intermediate reasoning to final outputs. (1) Input Characterization: We pioneer the incorporation of risk categories and levels into input design, explicitly accounting for affected groups and severity, and thereby establish a balanced prompt suite reflecting diverse harm gradients. (2) Fine-Grained Output Analysis: We introduce a micro-thought chunking mechanism to segment long reasoning traces into semantically coherent units, enabling fine-grained evaluation across ten safety dimensions. (3) Human Safety Alignment: We validate LLM-based evaluations against human annotations specifically designed to capture safety judgments. Evaluations on 19 LRMs demonstrate that SafeRBench enables detailed, multidimensional safety assessment, offering insights into risks and protective mechanisms from multiple perspectives.
comment: 30 pages, 8 figures
Machine Learning
Dataset Distillation for Pre-Trained Self-Supervised Vision Models NeurIPS 2025
The task of dataset distillation aims to find a small set of synthetic images such that training a model on them reproduces the performance of the same model trained on a much larger dataset of real samples. Existing distillation methods focus on synthesizing datasets that enable training randomly initialized models. In contrast, state-of-the-art vision approaches are increasingly building on large, pre-trained self-supervised models rather than training from scratch. In this paper, we investigate the problem of distilling datasets that enable us to optimally train linear probes on top of such large, pre-trained vision models. We introduce a method of dataset distillation for this task called Linear Gradient Matching that optimizes the synthetic images such that, when passed through a pre-trained feature extractor, they induce gradients in the linear classifier similar to those produced by the real data. Our method yields synthetic data that outperform all real-image baselines and, remarkably, generalize across pre-trained vision models, enabling us, for instance, to train a linear CLIP probe that performs competitively using a dataset distilled via a DINO backbone. Further, we show that our distilled datasets are exceptionally effective for fine-grained classification and provide a valuable tool for model interpretability, predicting, among other things, how similar two models' embedding spaces are under the platonic representation hypothesis or whether a model is sensitive to spurious correlations in adversarial datasets.
comment: Accepted at NeurIPS 2025. Project page: https://linear-gradient-matching.github.io/ Code: https://github.com/GeorgeCazenavette/linear-gradient-matching
☆ Taming the Long-Tail: Efficient Reasoning RL Training with Adaptive Drafter
The emergence of Large Language Models (LLMs) with strong reasoning capabilities marks a significant milestone, unlocking new frontiers in complex problem-solving. However, training these reasoning models, typically using Reinforcement Learning (RL), encounters critical efficiency bottlenecks: response generation during RL training exhibits a persistent long-tail distribution, where a few very long responses dominate execution time, wasting resources and inflating costs. To address this, we propose TLT, a system that accelerates reasoning RL training losslessly by integrating adaptive speculative decoding. Applying speculative decoding in RL is challenging due to the dynamic workloads, evolving target model, and draft model training overhead. TLT overcomes these obstacles with two synergistic components: (1) Adaptive Drafter, a lightweight draft model trained continuously on idle GPUs during long-tail generation to maintain alignment with the target model at no extra cost; and (2) Adaptive Rollout Engine, which maintains a memory-efficient pool of pre-captured CUDAGraphs and adaptively select suitable SD strategies for each input batch. Evaluations demonstrate that TLT achieves over 1.7x end-to-end RL training speedup over state-of-the-art systems, preserves the model accuracy, and yields a high-quality draft model as a free byproduct suitable for efficient deployment. Code is released at https://github.com/mit-han-lab/fastrl.
☆ Dexterity from Smart Lenses: Multi-Fingered Robot Manipulation with In-the-Wild Human Demonstrations
Learning multi-fingered robot policies from humans performing daily tasks in natural environments has long been a grand goal in the robotics community. Achieving this would mark significant progress toward generalizable robot manipulation in human environments, as it would reduce the reliance on labor-intensive robot data collection. Despite substantial efforts, progress toward this goal has been bottle-necked by the embodiment gap between humans and robots, as well as by difficulties in extracting relevant contextual and motion cues that enable learning of autonomous policies from in-the-wild human videos. We claim that with simple yet sufficiently powerful hardware for obtaining human data and our proposed framework AINA, we are now one significant step closer to achieving this dream. AINA enables learning multi-fingered policies from data collected by anyone, anywhere, and in any environment using Aria Gen 2 glasses. These glasses are lightweight and portable, feature a high-resolution RGB camera, provide accurate on-board 3D head and hand poses, and offer a wide stereo view that can be leveraged for depth estimation of the scene. This setup enables the learning of 3D point-based policies for multi-fingered hands that are robust to background changes and can be deployed directly without requiring any robot data (including online corrections, reinforcement learning, or simulation). We compare our framework against prior human-to-robot policy learning approaches, ablate our design choices, and demonstrate results across nine everyday manipulation tasks. Robot rollouts are best viewed on our website: https://aina-robot.github.io.
☆ Solving Spatial Supersensing Without Spatial Supersensing
Cambrian-S aims to take the first steps towards improving video world models with spatial supersensing by introducing (i) two benchmarks, VSI-Super-Recall (VSR) and VSI-Super-Counting (VSC), and (ii) bespoke predictive sensing inference strategies tailored to each benchmark. In this work, we conduct a critical analysis of Cambrian-S across both these fronts. First, we introduce a simple baseline, NoSense, which discards almost all temporal structure and uses only a bag-of-words SigLIP model, yet near-perfectly solves VSR, achieving 95% accuracy even on 4-hour videos. This shows benchmarks like VSR can be nearly solved without spatial cognition, world modeling or spatial supersensing. Second, we hypothesize that the tailored inference methods proposed by Cambrian-S likely exploit shortcut heuristics in the benchmark. We illustrate this with a simple sanity check on the VSC benchmark, called VSC-Repeat: We concatenate each video with itself 1-5 times, which does not change the number of unique objects. However, this simple perturbation entirely collapses the mean relative accuracy of Cambrian-S from 42% to 0%. A system that performs spatial supersensing and integrates information across experiences should recognize views of the same scene and keep object-count predictions unchanged; instead, Cambrian-S inference algorithm relies largely on a shortcut in the VSC benchmark that rooms are never revisited. Taken together, our findings suggest that (i) current VSI-Super benchmarks do not yet reliably measure spatial supersensing, and (ii) predictive-sensing inference recipes used by Cambrian-S improve performance by inadvertently exploiting shortcuts rather than from robust spatial supersensing. We include the response from the Cambrian-S authors (in Appendix A) to provide a balanced perspective alongside our claims. We release our code at: https://github.com/bethgelab/supersanity
comment: Tech Report
☆ Evolution Strategies at the Hyperscale
We introduce Evolution Guided General Optimization via Low-rank Learning (EGGROLL), an evolution strategies (ES) algorithm designed to scale backprop-free optimization to large population sizes for modern large neural network architectures with billions of parameters. ES is a set of powerful blackbox optimisation methods that can handle non-differentiable or noisy objectives with excellent scaling potential through parallelisation. Na{ï}ve ES becomes prohibitively expensive at scale due to the computational and memory costs associated with generating matrix perturbations $E\in\mathbb{R}^{m\times n}$ and the batched matrix multiplications needed to compute per-member forward passes. EGGROLL overcomes these bottlenecks by generating random matrices $A\in \mathbb{R}^{m\times r},\ B\in \mathbb{R}^{n\times r}$ with $r\ll \min(m,n)$ to form a low-rank matrix perturbation $A B^\top$ that are used in place of the full-rank perturbation $E$. As the overall update is an average across a population of $N$ workers, this still results in a high-rank update but with significant memory and computation savings, reducing the auxiliary storage from $mn$ to $r(m+n)$ per layer and the cost of a forward pass from $\mathcal{O}(mn)$ to $\mathcal{O}(r(m+n))$ when compared to full-rank ES. A theoretical analysis reveals our low-rank update converges to the full-rank update at a fast $\mathcal{O}\left(\frac{1}{r}\right)$ rate. Our experiments show that (1) EGGROLL does not compromise the performance of ES in tabula-rasa RL settings, despite being faster, (2) it is competitive with GRPO as a technique for improving LLM reasoning, and (3) EGGROLL enables stable pre-training of nonlinear recurrent language models that operate purely in integer datatypes.
comment: 48 pages, 12 figures, Website at https://eshyperscale.github.io/
☆ Stabilizing Policy Gradient Methods via Reward Profiling
Policy gradient methods, which have been extensively studied in the last decade, offer an effective and efficient framework for reinforcement learning problems. However, their performances can often be unsatisfactory, suffering from unreliable reward improvements and slow convergence, due to high variance in gradient estimations. In this paper, we propose a universal reward profiling framework that can be seamlessly integrated with any policy gradient algorithm, where we selectively update the policy based on high-confidence performance estimations. We theoretically justify that our technique will not slow down the convergence of the baseline policy gradient methods, but with high probability, will result in stable and monotonic improvements of their performance. Empirically, on eight continuous-control benchmarks (Box2D and MuJoCo/PyBullet), our profiling yields up to 1.5x faster convergence to near-optimal returns, up to 1.75x reduction in return variance on some setups. Our profiling approach offers a general, theoretically grounded path to more reliable and efficient policy learning in complex environments.
☆ From Polynomials to Databases: Arithmetic Structures in Galois Theory
We develop a computational framework for classifying Galois groups of irreducible degree-7 polynomials over~$\mathbb{Q}$, combining explicit resolvent methods with machine learning techniques. A database of over one million normalized projective septics is constructed, each annotated with algebraic invariants~$J_0, \dots, J_4$ derived from binary transvections. For each polynomial, we compute resolvent factorizations to determine its Galois group among the seven transitive subgroups of~$S_7$ identified by Foulkes. Using this dataset, we train a neurosymbolic classifier that integrates invariant-theoretic features with supervised learning, yielding improved accuracy in detecting rare solvable groups compared to coefficient-based models. The resulting database provides a reproducible resource for constructive Galois theory and supports empirical investigations into group distribution under height constraints. The methodology extends to higher-degree cases and illustrates the utility of hybrid symbolic-numeric techniques in computational algebra.
☆ Rate-optimal community detection near the KS threshold via node-robust algorithms
We study community detection in the \emph{symmetric $k$-stochastic block model}, where $n$ nodes are evenly partitioned into $k$ clusters with intra- and inter-cluster connection probabilities $p$ and $q$, respectively. Our main result is a polynomial-time algorithm that achieves the minimax-optimal misclassification rate \begin{equation*} \exp \Bigl(-\bigl(1 \pm o(1)\bigr) \tfrac{C}{k}\Bigr), \quad \text{where } C = (\sqrt{pn} - \sqrt{qn})^2, \end{equation*} whenever $C \ge K\,k^2\,\log k$ for some universal constant $K$, matching the Kesten--Stigum (KS) threshold up to a $\log k$ factor. Notably, this rate holds even when an adversary corrupts an $η\le \exp\bigl(- (1 \pm o(1)) \tfrac{C}{k}\bigr)$ fraction of the nodes. To the best of our knowledge, the minimax rate was previously only attainable either via computationally inefficient procedures [ZZ15] or via polynomial-time algorithms that require strictly stronger assumptions such as $C \ge K k^3$ [GMZZ17]. In the node-robust setting, the best known algorithm requires the substantially stronger condition $C \ge K k^{102}$ [LM22]. Our results close this gap by providing the first polynomial-time algorithm that achieves the minimax rate near the KS threshold in both settings. Our work has two key technical contributions: (1) we robustify majority voting via the Sum-of-Squares framework, (2) we develop a novel graph bisection algorithm via robust majority voting, which allows us to significantly improve the misclassification rate to $1/\mathrm{poly}(k)$ for the initial estimation near the KS threshold.
☆ Time dependent loss reweighting for flow matching and diffusion models is theoretically justified
This brief note clarifies that, in Generator Matching (which subsumes a large family of flow matching and diffusion models over continuous, manifold, and discrete spaces), both the Bregman divergence loss and the linear parameterization of the generator can depend on both the current state $X_t$ and the time $t$, and we show that the expectation over time in the loss can be taken with respect to a broad class of time distributions. We also show this for Edit Flows, which falls outside of Generator Matching. That the loss can depend on $t$ clarifies that time-dependent loss weighting schemes, often used in practice to stabilize training, are theoretically justified when the specific flow or diffusion scheme is a special case of Generator Matching (or Edit Flows). It also often simplifies the construction of $X_1$-predictor schemes, which are sometimes preferred for model-related reasons. We show examples that rely upon the dependence of linear parameterizations, and of the Bregman divergence loss, on $t$ and $X_t$.
comment: 19 pages, 0 figures
☆ Variational Quantum Integrated Sensing and Communication
The integration of sensing and communication functionalities within a common system is one of the main innovation drivers for next-generation networks. In this paper, we introduce a quantum integrated sensing and communication (QISAC) protocol that leverages entanglement in quantum carriers of information to enable both superdense coding and quantum sensing. The proposed approach adaptively optimizes encoding and quantum measurement via variational circuit learning, while employing classical machine learning-based decoders and estimators to process the measurement outcomes. Numerical results for qudit systems demonstrate that the proposed QISAC protocol can achieve a flexible trade-off between classical communication rate and accuracy of parameter estimation.
comment: Submitted for publication
☆ Toward Artificial Palpation: Representation Learning of Touch on Soft Bodies
Palpation, the use of touch in medical examination, is almost exclusively performed by humans. We investigate a proof of concept for an artificial palpation method based on self-supervised learning. Our key idea is that an encoder-decoder framework can learn a $\textit{representation}$ from a sequence of tactile measurements that contains all the relevant information about the palpated object. We conjecture that such a representation can be used for downstream tasks such as tactile imaging and change detection. With enough training data, it should capture intricate patterns in the tactile measurements that go beyond a simple map of forces -- the current state of the art. To validate our approach, we both develop a simulation environment and collect a real-world dataset of soft objects and corresponding ground truth images obtained by magnetic resonance imaging (MRI). We collect palpation sequences using a robot equipped with a tactile sensor, and train a model that predicts sensory readings at different positions on the object. We investigate the representation learned in this process, and demonstrate its use in imaging and change detection.
☆ gfnx: Fast and Scalable Library for Generative Flow Networks in JAX
In this paper, we present gfnx, a fast and scalable package for training and evaluating Generative Flow Networks (GFlowNets) written in JAX. gfnx provides an extensive set of environments and metrics for benchmarking, accompanied with single-file implementations of core objectives for training GFlowNets. We include synthetic hypergrids, multiple sequence generation environments with various editing regimes and particular reward designs for molecular generation, phylogenetic tree construction, Bayesian structure learning, and sampling from the Ising model energy. Across different tasks, gfnx achieves significant wall-clock speedups compared to Pytorch-based benchmarks (such as torchgfn library) and author implementations. For example, gfnx achieves up to 55 times speedup on CPU-based sequence generation environments, and up to 80 times speedup with the GPU-based Bayesian network structure learning setup. Our package provides a diverse set of benchmarks and aims to standardize empirical evaluation and accelerate research and applications of GFlowNets. The library is available on GitHub (https://github.com/d-tiapkin/gfnx) and on pypi (https://pypi.org/project/gfnx/). Documentation is available on https://gfnx.readthedocs.io.
comment: GitHub: https://github.com/d-tiapkin/gfnx | Documentation: https://gfnx.readthedocs.io
☆ Almost Sure Convergence Analysis of Differentially Private Stochastic Gradient Methods
Differentially private stochastic gradient descent (DP-SGD) has become the standard algorithm for training machine learning models with rigorous privacy guarantees. Despite its widespread use, the theoretical understanding of its long-run behavior remains limited: existing analyses typically establish convergence in expectation or with high probability, but do not address the almost sure convergence of single trajectories. In this work, we prove that DP-SGD converges almost surely under standard smoothness assumptions, both in nonconvex and strongly convex settings, provided the step sizes satisfy some standard decaying conditions. Our analysis extends to momentum variants such as the stochastic heavy ball (DP-SHB) and Nesterov's accelerated gradient (DP-NAG), where we show that careful energy constructions yield similar guarantees. These results provide stronger theoretical foundations for differentially private optimization and suggest that, despite privacy-induced distortions, the algorithm remains pathwise stable in both convex and nonconvex regimes.
comment: 6 pages
☆ Synthesis of Safety Specifications for Probabilistic Systems
Ensuring that agents satisfy safety specifications can be crucial in safety-critical environments. While methods exist for controller synthesis with safe temporal specifications, most existing methods restrict safe temporal specifications to probabilistic-avoidance constraints. Formal methods typically offer more expressive ways to express safety in probabilistic systems, such as Probabilistic Computation Tree Logic (PCTL) formulas. Thus, in this paper, we develop a new approach that supports more general temporal properties expressed in PCTL. Our contribution is twofold. First, we develop a theoretical framework for the Synthesis of safe-PCTL specifications. We show how the reducing global specification satisfaction to local constraints, and define CPCTL, a fragment of safe-PCTL. We demonstrate how the expressiveness of CPCTL makes it a relevant fragment for the Synthesis Problem. Second, we leverage these results and propose a new Value Iteration-based algorithm to solve the synthesis problem for these more general temporal properties, and we prove the soundness and completeness of our method.
comment: 23 pages
☆ ECPv2: Fast, Efficient, and Scalable Global Optimization of Lipschitz Functions AAAI 2026
We propose ECPv2, a scalable and theoretically grounded algorithm for global optimization of Lipschitz-continuous functions with unknown Lipschitz constants. Building on the Every Call is Precious (ECP) framework, which ensures that each accepted function evaluation is potentially informative, ECPv2 addresses key limitations of ECP, including high computational cost and overly conservative early behavior. ECPv2 introduces three innovations: (i) an adaptive lower bound to avoid vacuous acceptance regions, (ii) a Worst-m memory mechanism that restricts comparisons to a fixed-size subset of past evaluations, and (iii) a fixed random projection to accelerate distance computations in high dimensions. We theoretically show that ECPv2 retains ECP's no-regret guarantees with optimal finite-time bounds and expands the acceptance region with high probability. We further empirically validate these findings through extensive experiments and ablation studies. Using principled hyperparameter settings, we evaluate ECPv2 across a wide range of high-dimensional, non-convex optimization problems. Across benchmarks, ECPv2 consistently matches or outperforms state-of-the-art optimizers, while significantly reducing wall-clock time.
comment: Accepted at AAAI 2026 (main technical track), extended version
☆ An Exterior-Embedding Neural Operator Framework for Preserving Conservation Laws
Neural operators have demonstrated considerable effectiveness in accelerating the solution of time-dependent partial differential equations (PDEs) by directly learning governing physical laws from data. However, for PDEs governed by conservation laws(e.g., conservation of mass, energy, or matter), existing neural operators fail to satisfy conservation properties, which leads to degraded model performance and limited generalizability. Moreover, we observe that distinct PDE problems generally require different optimal neural network architectures. This finding underscores the inherent limitations of specialized models in generalizing across diverse problem domains. To address these limitations, we propose Exterior-Embedded Conservation Framework (ECF), a universal conserving framework that can be integrated with various data-driven neural operators to enforce conservation laws strictly in predictions. The framework consists of two key components: a conservation quantity encoder that extracts conserved quantities from input data, and a conservation quantity decoder that adjusts the neural operator's predictions using these quantities to ensure strict conservation compliance in the final output. Since our architecture enforces conservation laws, we theoretically prove that it enhances model performance. To validate the performance of our method, we conduct experiments on multiple conservation-law-constrained PDE scenarios, including adiabatic systems, shallow water equations, and the Allen-Cahn problem. These baselines demonstrate that our method effectively improves model accuracy while strictly enforcing conservation laws in the predictions.
☆ Boosting Predictive Performance on Tabular Data through Data Augmentation with Latent-Space Flow-Based Diffusion
Severe class imbalance is common in real-world tabular learning, where rare but important minority classes are essential for reliable prediction. Existing generative oversampling methods such as GANs, VAEs, and diffusion models can improve minority-class performance, but they often struggle with tabular heterogeneity, training stability, and privacy concerns. We propose a family of latent-space, tree-driven diffusion methods for minority oversampling that use conditional flow matching with gradient-boosted trees as the vector-field learner. The models operate in compact latent spaces to preserve tabular structure and reduce computation. We introduce three variants: PCAForest, which uses linear PCA embedding; EmbedForest, which uses a learned nonlinear embedding; and AttentionForest, which uses an attention-augmented embedding. Each method couples a GBT-based flow with a decoder back to the original feature space. Across 11 datasets from healthcare, finance, and manufacturing, AttentionForest achieves the best average minority recall while maintaining competitive precision, calibration, and distributional similarity. PCAForest and EmbedForest reach similar utility with much faster generation, offering favorable accuracy-efficiency trade-offs. Privacy evaluated with nearest-neighbor distance ratio and distance-to-closest-record is comparable to or better than the ForestDiffusion baseline. Ablation studies show that smaller embeddings tend to improve minority recall, while aggressive learning rates harm stability. Overall, latent-space, tree-driven diffusion provides an efficient and privacy-aware approach to high-fidelity tabular data augmentation under severe class imbalance.
comment: 35 Pages
☆ Toward Valid Generative Clinical Trial Data with Survival Endpoints
Clinical trials face mounting challenges: fragmented patient populations, slow enrollment, and unsustainable costs, particularly for late phase trials in oncology and rare diseases. While external control arms built from real-world data have been explored, a promising alternative is the generation of synthetic control arms using generative AI. A central challenge is the generation of time-to-event outcomes, which constitute primary endpoints in oncology and rare disease trials, but are difficult to model under censoring and small sample sizes. Existing generative approaches, largely GAN-based, are data-hungry, unstable, and rely on strong assumptions such as independent censoring. We introduce a variational autoencoder (VAE) that jointly generates mixed-type covariates and survival outcomes within a unified latent variable framework, without assuming independent censoring. Across synthetic and real trial datasets, we evaluate our model in two realistic scenarios: (i) data sharing under privacy constraints, where synthetic controls substitute for original data, and (ii) control-arm augmentation, where synthetic patients mitigate imbalances between treated and control groups. Our method outperforms GAN baselines on fidelity, utility, and privacy metrics, while revealing systematic miscalibration of type I error and power. We propose a post-generation selection procedure that improves calibration, highlighting both progress and open challenges for generative survival modeling.
comment: P. Chassat and V.T. Nguyen contributed equally to this work
☆ Broad stochastic configuration residual learning system for norm-convergent universal approximation
Universal approximation serves as the foundation of neural network learning algorithms. However, some networks establish their universal approximation property by demonstrating that the iterative errors converge in probability measure rather than the more rigorous norm convergence, which makes the universal approximation property of randomized learning networks highly sensitive to random parameter selection, Broad residual learning system (BRLS), as a member of randomized learning models, also encounters this issue. We theoretically demonstrate the limitation of its universal approximation property, that is, the iterative errors do not satisfy norm convergence if the selection of random parameters is inappropriate and the convergence rate meets certain conditions. To address this issue, we propose the broad stochastic configuration residual learning system (BSCRLS) algorithm, which features a novel supervisory mechanism adaptively constraining the range settings of random parameters on the basis of BRLS framework, Furthermore, we prove the universal approximation theorem of BSCRLS based on the more stringent norm convergence. Three versions of incremental BSCRLS algorithms are presented to satisfy the application requirements of various network updates. Solar panels dust detection experiments are performed on publicly available dataset and compared with 13 deep and broad learning algorithms. Experimental results reveal the effectiveness and superiority of BSCRLS algorithms.
☆ FairLRF: Achieving Fairness through Sparse Low Rank Factorization
As deep learning (DL) techniques become integral to various applications, ensuring model fairness while maintaining high performance has become increasingly critical, particularly in sensitive fields such as medical diagnosis. Although a variety of bias-mitigation methods have been proposed, many rely on computationally expensive debiasing strategies or suffer substantial drops in model accuracy, which limits their practicality in real-world, resource-constrained settings. To address this issue, we propose a fairness-oriented low rank factorization (LRF) framework that leverages singular value decomposition (SVD) to improve DL model fairness. Unlike traditional SVD, which is mainly used for model compression by decomposing and reducing weight matrices, our work shows that SVD can also serve as an effective tool for fairness enhancement. Specifically, we observed that elements in the unitary matrices obtained from SVD contribute unequally to model bias across groups defined by sensitive attributes. Motivated by this observation, we propose a method, named FairLRF, that selectively removes bias-inducing elements from unitary matrices to reduce group disparities, thus enhancing model fairness. Extensive experiments show that our method outperforms conventional LRF methods as well as state-of-the-art fairness-enhancing techniques. Additionally, an ablation study examines how major hyper-parameters may influence the performance of processed models. To the best of our knowledge, this is the first work utilizing SVD not primarily for compression but for fairness enhancement.
☆ The Oracle and The Prism: A Decoupled and Efficient Framework for Generative Recommendation Explanation
The integration of Large Language Models (LLMs) into explainable recommendation systems often leads to a performance-efficiency trade-off in end-to-end architectures, where joint optimization of ranking and explanation can result in suboptimal compromises. To resolve this, we propose Prism, a novel decoupled framework that rigorously separates the recommendation process into a dedicated ranking stage and an explanation generation stage. Inspired by knowledge distillation, Prism leverages a powerful teacher LLM (e.g., FLAN-T5-XXL) as an Oracle to produce high-fidelity explanatory knowledge. A compact, fine-tuned student model (e.g., BART-Base), the Prism, then specializes in synthesizing this knowledge into personalized explanations. This decomposition ensures that each component is optimized for its specific objective, eliminating inherent conflicts in coupled models. Extensive experiments on benchmark datasets demonstrate that our 140M-parameter Prism model significantly outperforms its 11B-parameter teacher in human evaluations of faithfulness and personalization, while achieving a 24 times speedup and a 10 times reduction in memory consumption during inference. These results validate that decoupling, coupled with targeted distillation, provides an efficient and effective pathway to high-quality explainable recommendation.
comment: 11 pages,3 figures
☆ Beyond Tokens in Language Models: Interpreting Activations through Text Genre Chunks
Understanding Large Language Models (LLMs) is key to ensure their safe and beneficial deployment. This task is complicated by the difficulty of interpretability of LLM structures, and the inability to have all their outputs human-evaluated. In this paper, we present the first step towards a predictive framework, where the genre of a text used to prompt an LLM, is predicted based on its activations. Using Mistral-7B and two datasets, we show that genre can be extracted with F1-scores of up to 98% and 71% using scikit-learn classifiers. Across both datasets, results consistently outperform the control task, providing a proof of concept that text genres can be inferred from LLMs with shallow learning models.
comment: 13 pages, 5 figures
☆ Contrastive vision-language learning with paraphrasing and negation
Contrastive vision-language models continue to be the dominant approach for image and text retrieval. Contrastive Language-Image Pre-training (CLIP) trains two neural networks in contrastive manner to align their image and text embeddings in a shared latent space. Recent results evaluating CLIP on negated or paraphrased text have shown mixed performance because negation changes meaning radically with minimal lexical changes, while paraphrasing can create very different textual expressions with the same intended meaning. This poses a significant challenge for improving the evaluation results and alignment of vision-language models. To address this challenge, this paper evaluates the combination of paraphrasing and negation, proposes a new CLIP contrastive loss function accounting for both paraphrasing and negation, and applies LLM-generated training triples consisting of original, paraphrased and negated textual captions to CLIP-like training models. The approach, called SemCLIP, is shown to move paraphrased captions towards the original image embeddings while pushing negated captions further away in embedding space. Empirically, SemCLIP is shown to be capable of preserving CLIP's performance while increasing considerably the distances to negated captions. On the CC-Neg benchmark using an original over negation image-retrieval accuracy metric, SemCLIP improves accuracy from 68.1% to 78.1%. Although results are mixed when compared with CLIP on the Sugarcrepe++ benchmark, SemCLIP's performance is generally better than the models trained with negated captions. This robustness to negation extends to downstream zero-shot classification tasks where SemCLIP pre-trained on Sugarcrepe++ performs better than CLIP on all tested downstream tasks. These results indicate that SemCLIP can achieve significant robustness to semantic transformations.
☆ Dynamic Participation in Federated Learning: Benchmarks and a Knowledge Pool Plugin
Federated learning (FL) enables clients to collaboratively train a shared model in a distributed manner, setting it apart from traditional deep learning paradigms. However, most existing FL research assumes consistent client participation, overlooking the practical scenario of dynamic participation (DPFL), where clients may intermittently join or leave during training. Moreover, no existing benchmarking framework systematically supports the study of DPFL-specific challenges. In this work, we present the first open-source framework explicitly designed for benchmarking FL models under dynamic client participation. Our framework provides configurable data distributions, participation patterns, and evaluation metrics tailored to DPFL scenarios. Using this platform, we benchmark four major categories of widely adopted FL models and uncover substantial performance degradation under dynamic participation. To address these challenges, we further propose Knowledge-Pool Federated Learning (KPFL), a generic plugin that maintains a shared knowledge pool across both active and idle clients. KPFL leverages dual-age and data-bias weighting, combined with generative knowledge distillation, to mitigate instability and prevent knowledge loss. Extensive experiments demonstrate the significant impact of dynamic participation on FL performance and the effectiveness of KPFL in improving model robustness and generalization.
☆ Saving Foundation Flow-Matching Priors for Inverse Problems
Foundation flow-matching (FM) models promise a universal prior for solving inverse problems (IPs), yet today they trail behind domain-specific or even untrained priors. How can we unlock their potential? We introduce FMPlug, a plug-in framework that redefines how foundation FMs are used in IPs. FMPlug combines an instance-guided, time-dependent warm-start strategy with a sharp Gaussianity regularization, adding problem-specific guidance while preserving the Gaussian structures. This leads to a significant performance boost across image restoration and scientific IPs. Our results point to a path for making foundation FM models practical, reusable priors for IP solving.
☆ Loss Functions Robust to the Presence of Label Errors
Methods for detecting label errors in training data require models that are robust to label errors (i.e., not fit to erroneously labelled data points). However, acquiring such models often involves training on corrupted data, which presents a challenge. Adjustments to the loss function present an opportunity for improvement. Motivated by Focal Loss (which emphasizes difficult-to-classify samples), two novel, yet simple, loss functions are proposed that de-weight or ignore these difficult samples (i.e., those likely to have label errors). Results on artificially corrupted data show promise, such that F1 scores for detecting errors are improved from the baselines of conventional categorical Cross Entropy and Focal Loss.
comment: 6 pages, 6 figures, Presented at the 10th Annual Conference on Vision and Intelligent Systems (2024)
☆ ODE-ViT: Plug & Play Attention Layer from the Generalization of the ViT as an Ordinary Differential Equation
In recent years, increasingly large models have achieved outstanding performance across CV tasks. However, these models demand substantial computational resources and storage, and their growing complexity limits our understanding of how they make decisions. Most of these architectures rely on the attention mechanism within Transformer-based designs. Building upon the connection between residual neural networks and ordinary differential equations (ODEs), we introduce ODE-ViT, a Vision Transformer reformulated as an ODE system that satisfies the conditions for well-posed and stable dynamics. Experiments on CIFAR-10 and CIFAR-100 demonstrate that ODE-ViT achieves stable, interpretable, and competitive performance with up to one order of magnitude fewer parameters, surpassing prior ODE-based Transformer approaches in classification tasks. We further propose a plug-and-play teacher-student framework in which a discrete ViT guides the continuous trajectory of ODE-ViT by treating the intermediate representations of the teacher as solutions of the ODE. This strategy improves performance by more than 10% compared to training a free ODE-ViT from scratch.
☆ Large Language Model-Based Reward Design for Deep Reinforcement Learning-Driven Autonomous Cyber Defense AAAI-26
Designing rewards for autonomous cyber attack and defense learning agents in a complex, dynamic environment is a challenging task for subject matter experts. We propose a large language model (LLM)-based reward design approach to generate autonomous cyber defense policies in a deep reinforcement learning (DRL)-driven experimental simulation environment. Multiple attack and defense agent personas were crafted, reflecting heterogeneity in agent actions, to generate LLM-guided reward designs where the LLM was first provided with contextual cyber simulation environment information. These reward structures were then utilized within a DRL-driven attack-defense simulation environment to learn an ensemble of cyber defense policies. Our results suggest that LLM-guided reward designs can lead to effective defense strategies against diverse adversarial behaviors.
comment: Accepted in the AAAI-26 Workshop on Artificial Intelligence for Cyber Security (AICS)
☆ Correlation-Aware Feature Attribution Based Explainable AI
Explainable AI (XAI) is increasingly essential as modern models become more complex and high-stakes applications demand transparency, trust, and regulatory compliance. Existing global attribution methods often incur high computational costs, lack stability under correlated inputs, and fail to scale efficiently to large or heterogeneous datasets. We address these gaps with \emph{ExCIR} (Explainability through Correlation Impact Ratio), a correlation-aware attribution score equipped with a lightweight transfer protocol that reproduces full-model rankings using only a fraction of the data. ExCIR quantifies sign-aligned co-movement between features and model outputs after \emph{robust centering} (subtracting a robust location estimate, e.g., median or mid-mean, from features and outputs). We further introduce \textsc{BlockCIR}, a \emph{groupwise} extension of ExCIR that scores \emph{sets} of correlated features as a single unit. By aggregating the same signed-co-movement numerators and magnitudes over predefined or data-driven groups, \textsc{BlockCIR} mitigates double-counting in collinear clusters (e.g., synonyms or duplicated sensors) and yields smoother, more stable rankings when strong dependencies are present. Across diverse text, tabular, signal, and image datasets, ExCIR shows trustworthy agreement with established global baselines and the full model, delivers consistent top-$k$ rankings across settings, and reduces runtime via lightweight evaluation on a subset of rows. Overall, ExCIR provides \emph{computationally efficient}, \emph{consistent}, and \emph{scalable} explainability for real-world deployment.
comment: Accepted, 2026 International Conference on Advances in Artificial Intelligence and Machine Learning (AAIML 2026)
☆ Limitations of Scalarisation in MORL: A Comparative Study in Discrete Environments
Scalarisation functions are widely employed in MORL algorithms to enable intelligent decision-making. However, these functions often struggle to approximate the Pareto front accurately, rendering them unideal in complex, uncertain environments. This study examines selected Multi-Objective Reinforcement Learning (MORL) algorithms across MORL environments with discrete action and observation spaces. We aim to investigate further the limitations associated with scalarisation approaches for decision-making in multi-objective settings. Specifically, we use an outer-loop multi-policy methodology to assess the performance of a seminal single-policy MORL algorithm, MO Q-Learning implemented with linear scalarisation and Chebyshev scalarisation functions. In addition, we explore a pioneering inner-loop multi-policy algorithm, Pareto Q-Learning, which offers a more robust alternative. Our findings reveal that the performance of the scalarisation functions is highly dependent on the environment and the shape of the Pareto front. These functions often fail to retain the solutions uncovered during learning and favour finding solutions in certain regions of the solution space. Moreover, finding the appropriate weight configurations to sample the entire Pareto front is complex, limiting their applicability in uncertain settings. In contrast, inner-loop multi-policy algorithms may provide a more sustainable and generalizable approach and potentially facilitate intelligent decision-making in dynamic and uncertain environments.
comment: 15 pages, 4 figures, published in the Proceedings of the 46th Annual Conference of the South African Institute of Computer Scientists and Information Technologists (SAICSIT 2025)
☆ A Comparison Between Decision Transformers and Traditional Offline Reinforcement Learning Algorithms
The field of Offline Reinforcement Learning (RL) aims to derive effective policies from pre-collected datasets without active environment interaction. While traditional offline RL algorithms like Conservative Q-Learning (CQL) and Implicit Q-Learning (IQL) have shown promise, they often face challenges in balancing exploration and exploitation, especially in environments with varying reward densities. The recently proposed Decision Transformer (DT) approach, which reframes offline RL as a sequence modelling problem, has demonstrated impressive results across various benchmarks. This paper presents a comparative study evaluating the performance of DT against traditional offline RL algorithms in dense and sparse reward settings for the ANT continous control environment. Our research investigates how these algorithms perform when faced with different reward structures, examining their ability to learn effective policies and generalize across varying levels of feedback. Through empirical analysis in the ANT environment, we found that DTs showed less sensitivity to varying reward density compared to other methods and particularly excelled with medium-expert datasets in sparse reward scenarios. In contrast, traditional value-based methods like IQL showed improved performance in dense reward settings with high-quality data, while CQL offered balanced performance across different data qualities. Additionally, DTs exhibited lower variance in performance but required significantly more computational resources compared to traditional approaches. These findings suggest that sequence modelling approaches may be more suitable for scenarios with uncertain reward structures or mixed-quality data, while value-based methods remain competitive in settings with dense rewards and high-quality demonstrations.
comment: 15 pages, 4 figures, published in the Proceedings of the 46th Annual conference of the South African Institute of Computer Scientists and Information Technologists (SIACSIT 2025)
☆ Optimizing Quantum Key Distribution Network Performance using Graph Neural Networks
This paper proposes an optimization of Quantum Key Distribution (QKD) Networks using Graph Neural Networks (GNN) framework. Today, the development of quantum computers threatens the security systems of classical cryptography. Moreover, as QKD networks are designed for protecting secret communication, they suffer from multiple operational difficulties: adaptive to dynamic conditions, optimization for multiple parameters and effective resource utilization. In order to overcome these obstacles, we propose a GNN-based framework which can model QKD networks as dynamic graphs and extracts exploitable characteristics from these networks' structure. The graph contains not only topological information but also specific characteristics associated with quantum communication (the number of edges between nodes, etc). Experimental results demonstrate that the GNN-optimized QKD network achieves a substantial increase in total key rate (from 27.1 Kbits/s to 470 Kbits/s), a reduced average QBER (from 6.6% to 6.0%), and maintains path integrity with a slight reduction in average transmission distance (from 7.13 km to 6.42 km). Furthermore, we analyze network performance across varying scales (10 to 250 nodes), showing improved link prediction accuracy and enhanced key generation rate in medium-sized networks. This work introduces a novel operation mode for QKD networks, shifting the paradigm of network optimization through adaptive and scalable quantum communication systems that enhance security and performance.
comment: 11 pages, 4 figures, and 2 tables
☆ Anatomy of an Idiom: Tracing Non-Compositionality in Language Models
We investigate the processing of idiomatic expressions in transformer-based language models using a novel set of techniques for circuit discovery and analysis. First discovering circuits via a modified path patching algorithm, we find that idiom processing exhibits distinct computational patterns. We identify and investigate ``Idiom Heads,'' attention heads that frequently activate across different idioms, as well as enhanced attention between idiom tokens due to earlier processing, which we term ``augmented reception.'' We analyze these phenomena and the general features of the discovered circuits as mechanisms by which transformers balance computational efficiency and robustness. Finally, these findings provide insights into how transformers handle non-compositional language and suggest pathways for understanding the processing of more complex grammatical constructions.
☆ PersonaDrift: A Benchmark for Temporal Anomaly Detection in Language-Based Dementia Monitoring
People living with dementia (PLwD) often show gradual shifts in how they communicate, becoming less expressive, more repetitive, or drifting off-topic in subtle ways. While caregivers may notice these changes informally, most computational tools are not designed to track such behavioral drift over time. This paper introduces PersonaDrift, a synthetic benchmark designed to evaluate machine learning and statistical methods for detecting progressive changes in daily communication, focusing on user responses to a digital reminder system. PersonaDrift simulates 60-day interaction logs for synthetic users modeled after real PLwD, based on interviews with caregivers. These caregiver-informed personas vary in tone, modality, and communication habits, enabling realistic diversity in behavior. The benchmark focuses on two forms of longitudinal change that caregivers highlighted as particularly salient: flattened sentiment (reduced emotional tone and verbosity) and off-topic replies (semantic drift). These changes are injected progressively at different rates to emulate naturalistic cognitive trajectories, and the framework is designed to be extensible to additional behaviors in future use cases. To explore this novel application space, we evaluate several anomaly detection approaches, unsupervised statistical methods (CUSUM, EWMA, One-Class SVM), sequence models using contextual embeddings (GRU + BERT), and supervised classifiers in both generalized and personalized settings. Preliminary results show that flattened sentiment can often be detected with simple statistical models in users with low baseline variability, while detecting semantic drift requires temporal modeling and personalized baselines. Across both tasks, personalized classifiers consistently outperform generalized ones, highlighting the importance of individual behavioral context.
☆ Graph Neural Networks for Surgical Scene Segmentation
Purpose: Accurate identification of hepatocystic anatomy is critical to preventing surgical complications during laparoscopic cholecystectomy. Deep learning models often struggle with occlusions, long-range dependencies, and capturing the fine-scale geometry of rare structures. This work addresses these challenges by introducing graph-based segmentation approaches that enhance spatial and semantic understanding in surgical scene analyses. Methods: We propose two segmentation models integrating Vision Transformer (ViT) feature encoders with Graph Neural Networks (GNNs) to explicitly model spatial relationships between anatomical regions. (1) A static k Nearest Neighbours (k-NN) graph with a Graph Convolutional Network with Initial Residual and Identity Mapping (GCNII) enables stable long-range information propagation. (2) A dynamic Differentiable Graph Generator (DGG) with a Graph Attention Network (GAT) supports adaptive topology learning. Both models are evaluated on the Endoscapes-Seg50 and CholecSeg8k benchmarks. Results: The proposed approaches achieve up to 7-8% improvement in Mean Intersection over Union (mIoU) and 6% improvement in Mean Dice (mDice) scores over state-of-the-art baselines. It produces anatomically coherent predictions, particularly on thin, rare and safety-critical structures. Conclusion: The proposed graph-based segmentation methods enhance both performance and anatomical consistency in surgical scene segmentation. By combining ViT-based global context with graph-based relational reasoning, the models improve interpretability and reliability, paving the way for safer laparoscopic and robot-assisted surgery through a precise identification of critical anatomical features.
comment: 12 pages, 4 figures, 3 tables
☆ Generative Modeling of Clinical Time Series via Latent Stochastic Differential Equations
Clinical time series data from electronic health records and medical registries offer unprecedented opportunities to understand patient trajectories and inform medical decision-making. However, leveraging such data presents significant challenges due to irregular sampling, complex latent physiology, and inherent uncertainties in both measurements and disease progression. To address these challenges, we propose a generative modeling framework based on latent neural stochastic differential equations (SDEs) that views clinical time series as discrete-time partial observations of an underlying controlled stochastic dynamical system. Our approach models latent dynamics via neural SDEs with modality-dependent emission models, while performing state estimation and parameter learning through variational inference. This formulation naturally handles irregularly sampled observations, learns complex non-linear interactions, and captures the stochasticity of disease progression and measurement noise within a unified scalable probabilistic framework. We validate the framework on two complementary tasks: (i) individual treatment effect estimation using a simulated pharmacokinetic-pharmacodynamic (PKPD) model of lung cancer, and (ii) probabilistic forecasting of physiological signals using real-world intensive care unit (ICU) data from 12,000 patients. Results show that our framework outperforms ordinary differential equation and long short-term memory baseline models in accuracy and uncertainty estimation. These results highlight its potential for enabling precise, uncertainty-aware predictions to support clinical decision-making.
☆ FreqFlow: Long-term forecasting using lightweight flow matching
Multivariate time-series (MTS) forecasting is fundamental to applications ranging from urban mobility and resource management to climate modeling. While recent generative models based on denoising diffusion have advanced state-of-the-art performance in capturing complex data distributions, they suffer from significant computational overhead due to iterative stochastic sampling procedures that limit real-time deployment. Moreover, these models can be brittle when handling high-dimensional, non-stationary, and multi-scale periodic patterns characteristic of real-world sensor networks. We introduce FreqFlow, a novel framework that leverages conditional flow matching in the frequency domain for deterministic MTS forecasting. Unlike conventional approaches that operate in the time domain, FreqFlow transforms the forecasting problem into the spectral domain, where it learns to model amplitude and phase shifts through a single complex-valued linear layer. This frequency-domain formulation enables the model to efficiently capture temporal dynamics via complex multiplication, corresponding to scaling and temporal translations. The resulting architecture is exceptionally lightweight with only 89k parameters - an order of magnitude smaller than competing diffusion-based models-while enabling single-pass deterministic sampling through ordinary differential equation (ODE) integration. Our approach decomposes MTS signals into trend, seasonal, and residual components, with the flow matching mechanism specifically designed for residual learning to enhance long-term forecasting accuracy. Extensive experiments on real-world traffic speed, volume, and flow datasets demonstrate that FreqFlow achieves state-of-the-art forecasting performance, on average 7\% RMSE improvements, while being significantly faster and more parameter-efficient than existing methods
comment: Accepted at EurIPS, 2025
☆ Classification of worldwide news articles by perceived quality, 2018-2024
This study explored whether supervised machine learning and deep learning models can effectively distinguish perceived lower-quality news articles from perceived higher-quality news articles. 3 machine learning classifiers and 3 deep learning models were assessed using a newly created dataset of 1,412,272 English news articles from the Common Crawl over 2018-2024. Expert consensus ratings on 579 source websites were split at the median, creating perceived low and high-quality classes of about 706,000 articles each, with 194 linguistic features per website-level labelled article. Traditional machine learning classifiers such as the Random Forest demonstrated capable performance (0.7355 accuracy, 0.8131 ROC AUC). For deep learning, ModernBERT-large (256 context length) achieved the best performance (0.8744 accuracy; 0.9593 ROC-AUC; 0.8739 F1), followed by DistilBERT-base (512 context length) at 0.8685 accuracy and 0.9554 ROC-AUC. DistilBERT-base (256 context length) reached 0.8478 accuracy and 0.9407 ROC-AUC, while ModernBERT-base (256 context length) attained 0.8569 accuracy and 0.9470 ROC-AUC. These results suggest that the perceived quality of worldwide news articles can be effectively differentiated by traditional CPU-based machine learning classifiers and deep learning classifiers.
☆ Collaborative Management for Chronic Diseases and Depression: A Double Heterogeneity-based Multi-Task Learning Method
Wearable sensor technologies and deep learning are transforming healthcare management. Yet, most health sensing studies focus narrowly on physical chronic diseases. This overlooks the critical need for joint assessment of comorbid physical chronic diseases and depression, which is essential for collaborative chronic care. We conceptualize multi-disease assessment, including both physical diseases and depression, as a multi-task learning (MTL) problem, where each disease assessment is modeled as a task. This joint formulation leverages inter-disease relationships to improve accuracy, but it also introduces the challenge of double heterogeneity: chronic diseases differ in their manifestation (disease heterogeneity), and patients with the same disease show varied patterns (patient heterogeneity). To address these issues, we first adopt existing techniques and propose a base method. Given the limitations of the base method, we further propose an Advanced Double Heterogeneity-based Multi-Task Learning (ADH-MTL) method that improves the base method through three innovations: (1) group-level modeling to support new patient predictions, (2) a decomposition strategy to reduce model complexity, and (3) a Bayesian network that explicitly captures dependencies while balancing similarities and differences across model components. Empirical evaluations on real-world wearable sensor data demonstrate that ADH-MTL significantly outperforms existing baselines, and each of its innovations is shown to be effective. This study contributes to health information systems by offering a computational solution for integrated physical and mental healthcare and provides design principles for advancing collaborative chronic disease management across the pre-treatment, treatment, and post-treatment phases.
☆ Optimal Fairness under Local Differential Privacy
We investigate how to optimally design local differential privacy (LDP) mechanisms that reduce data unfairness and thereby improve fairness in downstream classification. We first derive a closed-form optimal mechanism for binary sensitive attributes and then develop a tractable optimization framework that yields the corresponding optimal mechanism for multi-valued attributes. As a theoretical contribution, we establish that for discrimination-accuracy optimal classifiers, reducing data unfairness necessarily leads to lower classification unfairness, thus providing a direct link between privacy-aware pre-processing and classification fairness. Empirically, we demonstrate that our approach consistently outperforms existing LDP mechanisms in reducing data unfairness across diverse datasets and fairness metrics, while maintaining accuracy close to that of non-private models. Moreover, compared with leading pre-processing and post-processing fairness methods, our mechanism achieves a more favorable accuracy-fairness trade-off while simultaneously preserving the privacy of sensitive attributes. Taken together, these results highlight LDP as a principled and effective pre-processing fairness intervention technique.
comment: 21 pages, 6 figures, 2 tables
☆ Are Foundation Models Useful for Bankruptcy Prediction? NeurIPS 2025
Foundation models have shown promise across various financial applications, yet their effectiveness for corporate bankruptcy prediction remains systematically unevaluated against established methods. We study bankruptcy forecasting using Llama-3.3-70B-Instruct and TabPFN, evaluated on large, highly imbalanced datasets of over one million company records from the Visegrád Group. We provide the first systematic comparison of foundation models against classical machine learning baselines for this task. Our results show that models such as XGBoost and CatBoost consistently outperform foundation models across all prediction horizons. LLM-based approaches suffer from unreliable probability estimates, undermining their use in risk-sensitive financial settings. TabPFN, while competitive with simpler baselines, requires substantial computational resources with costs not justified by performance gains. These findings suggest that, despite their generality, current foundation models remain less effective than specialized methods for bankruptcy forecasting.
comment: NeurIPS 2025 Workshop: Generative AI in Finance
☆ Unsupervised Graph Neural Network Framework for Balanced Multipatterning in Advanced Electronic Design Automation Layouts
Multipatterning is an essential decomposition strategy in electronic design automation (EDA) that overcomes lithographic limitations when printing dense circuit layouts. Although heuristic-based backtracking and SAT solvers can address these challenges, they often struggle to simultaneously handle both complex constraints and secondary objectives. In this study, we present a hybrid workflow that casts multipatterning as a variant of a constrained graph coloring problem with the primary objective of minimizing feature violations and a secondary objective of balancing the number of features on each mask. Our pipeline integrates two main components: (1) A GNN-based agent, trained in an unsupervised manner to generate initial color predictions, which are refined by (2) refinement strategies (a GNN-based heuristic and simulated annealing) that together enhance solution quality and balance. Experimental evaluation in both proprietary data sets and publicly available open source layouts demonstrate complete conflict-free decomposition and consistent color balancing. The proposed framework provides a reproducible, data-efficient and deployable baseline for scalable layout decomposition in EDA workflows.
comment: manuscript under review
☆ Reducing Instability in Synthetic Data Evaluation with a Super-Metric in MalDataGen
Evaluating the quality of synthetic data remains a persistent challenge in the Android malware domain due to instability and the lack of standardization among existing metrics. This work integrates into MalDataGen a Super-Metric that aggregates eight metrics across four fidelity dimensions, producing a single weighted score. Experiments involving ten generative models and five balanced datasets demonstrate that the Super-Metric is more stable and consistent than traditional metrics, exhibiting stronger correlations with the actual performance of classifiers.
comment: 5 pages, 3 figures, submitted to ERRC/WRSeg 2025
☆ VersaPants: A Loose-Fitting Textile Capacitive Sensing System for Lower-Body Motion Capture
We present VersaPants, the first loose-fitting, textile-based capacitive sensing system for lower-body motion capture, built on the open-hardware VersaSens platform. By integrating conductive textile patches and a compact acquisition unit into a pair of pants, the system reconstructs lower-body pose without compromising comfort. Unlike IMU-based systems that require user-specific fitting or camera-based methods that compromise privacy, our approach operates without fitting adjustments and preserves user privacy. VersaPants is a custom-designed smart garment featuring 6 capacitive channels per leg. We employ a lightweight Transformer-based deep learning model that maps capacitance signals to joint angles, enabling embedded implementation on edge platforms. To test our system, we collected approximately 3.7 hours of motion data from 11 participants performing 16 daily and exercise-based movements. The model achieves a mean per-joint position error (MPJPE) of 11.96 cm and a mean per-joint angle error (MPJAE) of 12.3 degrees across the hip, knee, and ankle joints, indicating the model's ability to generalize to unseen users and movements. A comparative analysis of existing textile-based deep learning architectures reveals that our model achieves competitive reconstruction performance with up to 22 times fewer parameters and 18 times fewer FLOPs, enabling real-time inference at 42 FPS on a commercial smartwatch without quantization. These results position VersaPants as a promising step toward scalable, comfortable, and embedded motion-capture solutions for fitness, healthcare, and wellbeing applications.
comment: 14 pages, 8 figures
☆ Improving Iterative Gaussian Processes via Warm Starting Sequential Posteriors
Scalable Gaussian process (GP) inference is essential for sequential decision-making tasks, yet improving GP scalability remains a challenging problem with many open avenues of research. This paper focuses on iterative GPs, where iterative linear solvers, such as conjugate gradients, stochastic gradient descent or alternative projections, are used to approximate the GP posterior. We propose a new method which improves solver convergence of a large linear system by leveraging the known solution to a smaller system contained within. This is significant for tasks with incremental data additions, and we show that our technique achieves speed-ups when solving to tolerance, as well as improved Bayesian optimisation performance under a fixed compute budget.
☆ Beyond Generative AI: World Models for Clinical Prediction, Counterfactuals, and Planning
Healthcare requires AI that is predictive, reliable, and data-efficient. However, recent generative models lack physical foundation and temporal reasoning required for clinical decision support. As scaling language models show diminishing returns for grounded clinical reasoning, world models are gaining traction because they learn multimodal, temporally coherent, and action-conditioned representations that reflect the physical and causal structure of care. This paper reviews World Models for healthcare systems that learn predictive dynamics to enable multistep rollouts, counterfactual evaluation and planning. We survey recent work across three domains: (i) medical imaging and diagnostics (e.g., longitudinal tumor simulation, projection-transition modeling, and Joint Embedding Predictive Architecture i.e., JEPA-style predictive representation learning), (ii) disease progression modeling from electronic health records (generative event forecasting at scale), and (iii) robotic surgery and surgical planning (action-conditioned guidance and control). We also introduce a capability rubric: L1 temporal prediction, L2 action-conditioned prediction, L3 counterfactual rollouts for decision support, and L4 planning/control. Most reviewed systems achieve L1--L2, with fewer instances of L3 and rare L4. We identify cross-cutting gaps that limit clinical reliability; under-specified action spaces and safety constraints, weak interventional validation, incomplete multimodal state construction, and limited trajectory-level uncertainty calibration. This review outlines a research agenda for clinically robust prediction-first world models that integrate generative backbones (transformers, diffusion, VAE) with causal/mechanical foundation for safe decision support in healthcare.
comment: 2 Figures, 1 Table
☆ Learning-Enhanced Observer for Linear Time-Invariant Systems with Parametric Uncertainty
This work introduces a learning-enhanced observer (LEO) for linear time-invariant systems with uncertain dynamics. Rather than relying solely on nominal models, the proposed framework treats the system matrices as optimizable variables and refines them through gradient-based minimization of a steady-state output discrepancy loss. The resulting data-informed surrogate model enables the construction of an improved observer that effectively compensates for moderate parameter uncertainty while preserving the structure of classical designs. Extensive Monte Carlo studies across diverse system dimensions show systematic and statistically significant reductions, typically exceeding 15\%, in normalized estimation error for both open-loop and Luenberger observers. These results demonstrate that modern learning mechanisms can serve as a powerful complement to traditional observer design, yielding more accurate and robust state estimation in uncertain systems. Codes are available at https://github.com/Hao-B-Shu/LTI_LEO.
comment: 6 pages, ordinary version
☆ Sparse Autoencoders are Topic Models
Sparse autoencoders (SAEs) are used to analyze embeddings, but their role and practical value are debated. We propose a new perspective on SAEs by demonstrating that they can be naturally understood as topic models. We extend Latent Dirichlet Allocation to embedding spaces and derive the SAE objective as a maximum a posteriori estimator under this model. This view implies SAE features are thematic components rather than steerable directions. Based on this, we introduce SAE-TM, a topic modeling framework that: (1) trains an SAE to learn reusable topic atoms, (2) interprets them as word distributions on downstream data, and (3) merges them into any number of topics without retraining. SAE-TM yields more coherent topics than strong baselines on text and image datasets while maintaining diversity. Finally, we analyze thematic structure in image datasets and trace topic changes over time in Japanese woodblock prints. Our work positions SAEs as effective tools for large-scale thematic analysis across modalities. Code and data will be released upon publication.
☆ Optimizing Operation Recipes with Reinforcement Learning for Safe and Interpretable Control of Chemical Processes ECML24
Optimal operation of chemical processes is vital for energy, resource, and cost savings in chemical engineering. The problem of optimal operation can be tackled with reinforcement learning, but traditional reinforcement learning methods face challenges due to hard constraints related to quality and safety that must be strictly satisfied, and the large amount of required training data. Chemical processes often cannot provide sufficient experimental data, and while detailed dynamic models can be an alternative, their complexity makes it computationally intractable to generate the needed data. Optimal control methods, such as model predictive control, also struggle with the complexity of the underlying dynamic models. Consequently, many chemical processes rely on manually defined operation recipes combined with simple linear controllers, leading to suboptimal performance and limited flexibility. In this work, we propose a novel approach that leverages expert knowledge embedded in operation recipes. By using reinforcement learning to optimize the parameters of these recipes and their underlying linear controllers, we achieve an optimized operation recipe. This method requires significantly less data, handles constraints more effectively, and is more interpretable than traditional reinforcement learning methods due to the structured nature of the recipes. We demonstrate the potential of our approach through simulation results of an industrial batch polymerization reactor, showing that it can approach the performance of optimal controllers while addressing the limitations of existing methods.
comment: 16 pages, 3 figures, Part of the workshop 'Machine Learning for Chemistry and Chemical Engineering (ML4CCE)' at the ECML24 conference: Link: https://ml4cce-ecml.com/
☆ Spectral Identifiability for Interpretable Probe Geometry
Linear probes are widely used to interpret and evaluate neural representations, yet their reliability remains unclear, as probes may appear accurate in some regimes but collapse unpredictably in others. We uncover a spectral mechanism behind this phenomenon and formalize it as the Spectral Identifiability Principle (SIP), a verifiable Fisher-inspired condition for probe stability. When the eigengap separating task-relevant directions is larger than the Fisher estimation error, the estimated subspace concentrates and accuracy remains consistent, whereas closing this gap induces instability in a phase-transition manner. Our analysis connects eigengap geometry, sample size, and misclassification risk through finite-sample reasoning, providing an interpretable diagnostic rather than a loose generalization bound. Controlled synthetic studies, where Fisher quantities are computed exactly, confirm these predictions and show how spectral inspection can anticipate unreliable probes before they distort downstream evaluation.
☆ Graph Diffusion Counterfactual Explanation
Machine learning models that operate on graph-structured data, such as molecular graphs or social networks, often make accurate predictions but offer little insight into why certain predictions are made. Counterfactual explanations address this challenge by seeking the closest alternative scenario where the model's prediction would change. Although counterfactual explanations are extensively studied in tabular data and computer vision, the graph domain remains comparatively underexplored. Constructing graph counterfactuals is intrinsically difficult because graphs are discrete and non-euclidean objects. We introduce Graph Diffusion Counterfactual Explanation, a novel framework for generating counterfactual explanations on graph data, combining discrete diffusion models and classifier-free guidance. We empirically demonstrate that our method reliably generates in-distribution as well as minimally structurally different counterfactuals for both discrete classification targets and continuous properties.
☆ GeoPTH: A Lightweight Approach to Category-Based Trajectory Retrieval via Geometric Prototype Trajectory Hashing
Trajectory similarity retrieval is an important part of spatiotemporal data mining, however, existing methods have the following limitations: traditional metrics are computationally expensive, while learning-based methods suffer from substantial training costs and potential instability. This paper addresses these problems by proposing \textbf{Geo}metric \textbf{P}rototype \textbf{T}rajectory \textbf{H}ashing (GeoPTH), a novel, lightweight, and non-learning framework for efficient category-based trajectory retrieval. GeoPTH constructs data-dependent hash functions by using representative trajectory prototypes, i.e., small point sets preserving geometric characteristics, as anchors. The hashing process is efficient, which involves mapping a new trajectory to its closest prototype via a robust, \textit{Hausdorff} metric. Extensive experiments show that GeoPTH's retrieval accuracy is highly competitive with both traditional metrics and state-of-the-art learning methods, and it significantly outperforms binary codes generated through simple binarization of the learned embeddings. Critically, GeoPTH consistently outperforms all competitors in terms of efficiency. Our work demonstrates that a lightweight, prototype-centric approach offers a practical and powerful alternative, achieving an exceptional retrieval performance and computational efficiency.
☆ Pass@k Metric for RLVR: A Diagnostic Tool of Exploration, But Not an Objective
The ability of Large Language Models (LLMs) to perform complex, multi-step reasoning is a central focus of modern AI research. To evaluate and enhance this capability, the pass@k metric, which measures the probability of obtaining at least one correct solution in k independent samples, has received significant attention. Its intuitive appeal has led to its adoption not only as an evaluation standard but also as a direct optimization objective in reinforcement learning. In this paper, we analyze the pass@k objective, derive its gradient, and demonstrate that it is fundamentally a per-example positive reweighting of the simpler pass@1 objective. Our analysis reveals that the pass@k objective provides a vanishing learning signal in regimes where exploration is most critical. We further analyze the dynamics of "exploration collapse", showing that as the policy concentrates probability mass, the gap between pass@k and pass@1 diminishes. We conclude that while pass@k is a useful diagnostic tool, it may be an unsuitable direct objective for optimization. Instead, mechanisms explicitly encouraging efficient exploration could offer a more effective path forward for reinforcement learning in reasoning tasks.
☆ Deep SOR Minimax Q-learning for Two-player Zero-sum Game
In this work, we consider the problem of a two-player zero-sum game. In the literature, the successive over-relaxation Q-learning algorithm has been developed and implemented, and it is seen to result in a lower contraction factor for the associated Q-Bellman operator resulting in a faster value iteration-based procedure. However, this has been presented only for the tabular case and not for the setting with function approximation that typically caters to real-world high-dimensional state-action spaces. Furthermore, such settings in the case of two-player zero-sum games have not been considered. We thus propose a deep successive over-relaxation minimax Q-learning algorithm that incorporates deep neural networks as function approximators and is suitable for high-dimensional spaces. We prove the finite-time convergence of the proposed algorithm. Through numerical experiments, we show the effectiveness of the proposed method over the existing Q-learning algorithm. Our ablation studies demonstrate the effect of different values of the crucial successive over-relaxation parameter.
☆ Real-Time Inference for Distributed Multimodal Systems under Communication Delay Uncertainty
Connected cyber-physical systems perform inference based on real-time inputs from multiple data streams. Uncertain communication delays across data streams challenge the temporal flow of the inference process. State-of-the-art (SotA) non-blocking inference methods rely on a reference-modality paradigm, requiring one modality input to be fully received before processing, while depending on costly offline profiling. We propose a novel, neuro-inspired non-blocking inference paradigm that primarily employs adaptive temporal windows of integration (TWIs) to dynamically adjust to stochastic delay patterns across heterogeneous streams while relaxing the reference-modality requirement. Our communication-delay-aware framework achieves robust real-time inference with finer-grained control over the accuracy-latency tradeoff. Experiments on the audio-visual event localization (AVEL) task demonstrate superior adaptability to network dynamics compared to SotA approaches.
comment: 6 pages, 3 figures, submitted to IEEE ICC 2026
☆ Mind the Gap: Bridging Prior Shift in Realistic Few-Shot Crop-Type Classification
Real-world agricultural distributions often suffer from severe class imbalance, typically following a long-tailed distribution. Labeled datasets for crop-type classification are inherently scarce and remain costly to obtain. When working with such limited data, training sets are frequently constructed to be artificially balanced -- in particular in the case of few-shot learning -- failing to reflect real-world conditions. This mismatch induces a shift between training and test label distributions, degrading real-world generalization. To address this, we propose Dirichlet Prior Augmentation (DirPA), a novel method that simulates an unknown label distribution skew of the target domain proactively during model training. Specifically, we model the real-world distribution as Dirichlet-distributed random variables, effectively performing a prior augmentation during few-shot learning. Our experiments show that DirPA successfully shifts the decision boundary and stabilizes the training process by acting as a dynamic feature regularizer.
comment: 7 pages, 4 figures
☆ FlipVQA-Miner: Cross-Page Visual Question-Answer Mining from Textbooks
The development of Large Language Models (LLMs) increasingly depends on high-quality supervised data, yet existing instruction-tuning and RL datasets remain costly to curate and often rely on synthetic samples that introduce hallucination and limited diversity. At the same time, textbooks and exercise materials contain abundant, high-quality human-authored Question-Answer(QA) content that remains underexploited due to the difficulty of transforming raw PDFs into AI-ready supervision. Although modern OCR and vision-language models can accurately parse document structure, their outputs lack the semantic alignment required for training. We propose an automated pipeline that extracts well-formed QA and visual-QA (VQA) pairs from educational documents by combining layout-aware OCR with LLM-based semantic parsing. Experiments across diverse document types show that the method produces accurate, aligned, and low-noise QA/VQA pairs. This approach enables scalable use of real-world educational content and provides a practical alternative to synthetic data generation for improving reasoning-oriented LLM training. All code and data-processing pipelines are open-sourced at https://github.com/OpenDCAI/DataFlow.
☆ Towards Overcoming Data Scarcity in Nuclear Energy: A Study on Critical Heat Flux with Physics-consistent Conditional Diffusion Model
Deep generative modeling provides a powerful pathway to overcome data scarcity in energy-related applications where experimental data are often limited, costly, or difficult to obtain. By learning the underlying probability distribution of the training dataset, deep generative models, such as the diffusion model (DM), can generate high-fidelity synthetic samples that statistically resemble the training data. Such synthetic data generation can significantly enrich the size and diversity of the available training data, and more importantly, improve the robustness of downstream machine learning models in predictive tasks. The objective of this paper is to investigate the effectiveness of DM for overcoming data scarcity in nuclear energy applications. By leveraging a public dataset on critical heat flux (CHF) that cover a wide range of commercial nuclear reactor operational conditions, we developed a DM that can generate an arbitrary amount of synthetic samples for augmenting of the CHF dataset. Since a vanilla DM can only generate samples randomly, we also developed a conditional DM capable of generating targeted CHF data under user-specified thermal-hydraulic conditions. The performance of the DM was evaluated based on their ability to capture empirical feature distributions and pair-wise correlations, as well as to maintain physical consistency. The results showed that both the DM and conditional DM can successfully generate realistic and physics-consistent CHF data. Furthermore, uncertainty quantification was performed to establish confidence in the generated data. The results demonstrated that the conditional DM is highly effective in augmenting CHF data while maintaining acceptable levels of uncertainty.
☆ Causal Synthetic Data Generation in Recruitment ECAI 2025
The importance of Synthetic Data Generation (SDG) has increased significantly in domains where data quality is poor or access is limited due to privacy and regulatory constraints. One such domain is recruitment, where publicly available datasets are scarce due to the sensitive nature of information typically found in curricula vitae, such as gender, disability status, or age. % This lack of accessible, representative data presents a significant obstacle to the development of fair and transparent machine learning models, particularly ranking algorithms that require large volumes of data to effectively learn how to recommend candidates. In the absence of such data, these models are prone to poor generalisation and may fail to perform reliably in real-world scenarios. % Recent advances in Causal Generative Models (CGMs) offer a promising solution. CGMs enable the generation of synthetic datasets that preserve the underlying causal relationships within the data, providing greater control over fairness and interpretability in the data generation process. % In this study, we present a specialised SDG method involving two CGMs: one modelling job offers and the other modelling curricula. Each model is structured according to a causal graph informed by domain expertise. We use these models to generate synthetic datasets and evaluate the fairness of candidate rankings under controlled scenarios that introduce specific biases.
comment: Published. Conference: AEQUITAS 2025: Workshop on Fairness and Bias in AI | co-located with ECAI 2025, Bologna, Italy
☆ A Switching Framework for Online Interval Scheduling with Predictions AAAI 2026
We study online interval scheduling in the irrevocable setting, where each interval must be immediately accepted or rejected upon arrival. The objective is to maximize the total length of accepted intervals while ensuring that no two accepted intervals overlap. We consider this problem in a learning-augmented setting, where the algorithm has access to (machine-learned) predictions. The goal is to design algorithms that leverage these predictions to improve performance while maintaining robust guarantees in the presence of prediction errors. Our main contribution is the SemiTrust-and-Switch framework, which provides a unified approach for combining prediction-based and classical interval scheduling algorithms. This framework applies to both deterministic and randomized algorithms and captures the trade-off between consistency (performance under accurate predictions) and robustness (performance under adversarial inputs). Moreover, we provide lower bounds, proving the tightness of this framework in particular settings. We further design a randomized algorithm that smoothly interpolates between prediction-based and robust algorithms. This algorithm achieves both robustness and smoothness--its performance degrades gracefully with the quality of the prediction.
comment: This paper will appear in AAAI 2026
☆ ART: A Graph-based Framework for Investigating Illicit Activity in Monero via Address-Ring-Transaction Structures
As Law Enforcement Agencies advance in cryptocurrency forensics, criminal actors aiming to conceal illicit fund movements increasingly turn to "mixin" services or privacy-based cryptocurrencies. Monero stands out as a leading choice due to its strong privacy preserving and untraceability properties, making conventional blockchain analysis ineffective. Understanding the behavior and operational patterns of criminal actors within Monero is therefore challenging and it is essential to support future investigative strategies and disrupt illicit activities. In this work, we propose a case study in which we leverage a novel graph-based methodology to extract structural and temporal patterns from Monero transactions linked to already discovered criminal activities. By building Address-Ring-Transaction graphs from flagged transactions, we extract structural and temporal features and use them to train Machine Learning models capable of detecting similar behavioral patterns that could highlight criminal modus operandi. This represents a first partial step toward developing analytical tools that support investigative efforts in privacy-preserving blockchain ecosystems
comment: Paper accepted @ BLOCKCHAIN & CRYPTOCURRENCY CONFERENCE (B2C'2025)
☆ CausalMamba: Interpretable State Space Modeling for Temporal Rumor Causality
Rumor detection on social media remains a challenging task due to the complex propagation dynamics and the limited interpretability of existing models. While recent neural architectures capture content and structural features, they often fail to reveal the underlying causal mechanisms of misinformation spread. We propose CausalMamba, a novel framework that integrates Mamba-based sequence modeling, graph convolutional networks (GCNs), and differentiable causal discovery via NOTEARS. CausalMamba learns joint representations of temporal tweet sequences and reply structures, while uncovering latent causal graphs to identify influential nodes within each propagation chain. Experiments on the Twitter15 dataset show that our model achieves competitive classification performance compared to strong baselines, and uniquely enables counterfactual intervention analysis. Qualitative results demonstrate that removing top-ranked causal nodes significantly alters graph connectivity, offering interpretable insights into rumor dynamics. Our framework provides a unified approach for rumor classification and influence analysis, paving the way for more explainable and actionable misinformation detection systems.
comment: Preprint. 9 pages, 3 figures, 2 tables. Code and implementation details available at: https://github.com/XiaotongZhan/Causal_Mamba
☆ Achieving Skilled and Reliable Daily Probabilistic Forecasts of Wind Power at Subseasonal-to-Seasonal Timescales over France
Accurate and reliable wind power forecasts are crucial for grid stability, balancing supply and demand, and market risk management. Even though short-term weather forecasts have been thoroughly used to provide short-term renewable power predictions, forecasts involving longer prediction horizons still need investigations. Despite the recent progress in subseasonal-to-seasonal weather probabilistic forecasting, their use for wind power prediction usually involves both temporal and spatial aggregation achieve reasonable skill. In this study, we present a forecasting pipeline enabling to transform ECMWF subseasonal-to-seasonal weather forecasts into wind power forecasts for lead times ranging from 1 day to 46 days at daily resolution. This framework also include post-processing of the resulting power ensembles to account for the biases and lack of dispersion of the weather forecasts. We show that our method is able to outperform a climatological baseline by 50 % in terms of both Continuous Ranked Probability Skill Score and Ensemble Mean Squared Error while also providing near perfect calibration of the forecasts for lead times ranging from 15 to 46 days.
☆ MagBotSim: Physics-Based Simulation and Reinforcement Learning Environments for Magnetic Robotics
Magnetic levitation is about to revolutionize in-machine material flow in industrial automation. Such systems are flexibly configurable and can include a large number of independently actuated shuttles (movers) that dynamically rebalance production capacity. Beyond their capabilities for dynamic transportation, these systems possess the inherent yet unexploited potential to perform manipulation. By merging the fields of transportation and manipulation into a coordinated swarm of magnetic robots (MagBots), we enable manufacturing systems to achieve significantly higher efficiency, adaptability, and compactness. To support the development of intelligent algorithms for magnetic levitation systems, we introduce MagBotSim (Magnetic Robotics Simulation): a physics-based simulation for magnetic levitation systems. By framing magnetic levitation systems as robot swarms and providing a dedicated simulation, this work lays the foundation for next generation manufacturing systems powered by Magnetic Robotics. MagBotSim's documentation, videos, experiments, and code are available at: https://ubi-coro.github.io/MagBotSim/
☆ Approximation rates of quantum neural networks for periodic functions via Jackson's inequality
Quantum neural networks (QNNs) are an analog of classical neural networks in the world of quantum computing, which are represented by a unitary matrix with trainable parameters. Inspired by the universal approximation property of classical neural networks, ensuring that every continuous function can be arbitrarily well approximated uniformly on a compact set of a Euclidean space, some recent works have established analogous results for QNNs, ranging from single-qubit to multi-qubit QNNs, and even hybrid classical-quantum models. In this paper, we study the approximation capabilities of QNNs for periodic functions with respect to the supremum norm. We use the Jackson inequality to approximate a given function by implementing its approximating trigonometric polynomial via a suitable QNN. In particular, we see that by restricting to the class of periodic functions, one can achieve a quadratic reduction of the number of parameters, producing better approximation results than in the literature. Moreover, the smoother the function, the fewer parameters are needed to construct a QNN to approximate the function.
☆ Enhancing Nuclear Reactor Core Simulation through Data-Based Surrogate Models
In recent years, there has been an increasing need for Nuclear Power Plants (NPPs) to improve flexibility in order to match the rapid growth of renewable energies. The Operator Assistance Predictive System (OAPS) developed by Framatome addresses this problem through Model Predictive Control (MPC). In this work, we aim to improve MPC methods through data-driven simulation schemes. Thus, from a set of nonlinear stiff ordinary differential equations (ODEs), this paper introduces two surrogate models acting as alternative simulation schemes to enhance nuclear reactor core simulation. We show that both data-driven and physics-informed models can rapidly integrate complex dynamics, with a very low computational time (up to 1000x time reduction).
☆ Labels Matter More Than Models: Quantifying the Benefit of Supervised Time Series Anomaly Detection
Time series anomaly detection (TSAD) is a critical data mining task often constrained by label scarcity. Consequently, current research predominantly focuses on Unsupervised Time-series Anomaly Detection (UTAD), relying on complex architectures to model normal data distributions. However, this approach often overlooks the significant performance gains available from limited anomaly labels achievable in practical scenarios. This paper challenges the premise that architectural complexity is the optimal path for TSAD. We conduct the first methodical comparison between supervised and unsupervised paradigms and introduce STAND, a streamlined supervised baseline. Extensive experiments on five public datasets demonstrate that: (1) Labels matter more than models: under a limited labeling budget, simple supervised models significantly outperform complex state-of-the-art unsupervised methods; (2) Supervision yields higher returns: the performance gain from minimal supervision far exceeds that from architectural innovations; and (3) Practicality: STAND exhibits superior prediction consistency and anomaly localization compared to unsupervised counterparts. These findings advocate for a data-centric shift in TSAD research, emphasizing label utilization over purely algorithmic complexity. The code is publicly available at https://github.com/EmorZz1G/STAND.
comment: 16 pages, 14 figures, 7 tables. Under review
☆ An Interpretability-Guided Framework for Responsible Synthetic Data Generation in Emotional Text
Emotion recognition from social media is critical for understanding public sentiment, but accessing training data has become prohibitively expensive due to escalating API costs and platform restrictions. We introduce an interpretability-guided framework where Shapley Additive Explanations (SHAP) provide principled guidance for LLM-based synthetic data generation. With sufficient seed data, SHAP-guided approach matches real data performance, significantly outperforms naïve generation, and substantially improves classification for underrepresented emotion classes. However, our linguistic analysis reveals that synthetic text exhibits reduced vocabulary richness and fewer personal or temporally complex expressions than authentic posts. This work provides both a practical framework for responsible synthetic data generation and a critical perspective on its limitations, underscoring that the future of trustworthy AI depends on navigating the trade-offs between synthetic utility and real-world authenticity.
☆ Angular Graph Fractional Fourier Transform: Theory and Application
Graph spectral representations are fundamental in graph signal processing, offering a rigorous framework for analyzing and processing graph-structured data. The graph fractional Fourier transform (GFRFT) extends the classical graph Fourier transform (GFT) with a fractional-order parameter, enabling flexible spectral analysis while preserving mathematical consistency. The angular graph Fourier transform (AGFT) introduces angular control via GFT eigenvector rotation; however, existing constructions fail to degenerate to the GFT at zero angle, which is a critical flaw that undermines theoretical consistency and interpretability. To resolve these complementary limitations - GFRFT's lack of angular regulation and AGFT's defective degeneracy - this study proposes an angular GFRFT (AGFRFT), a unified framework that integrates fractional-order and angular spectral analyses with theoretical rigor. A degeneracy-friendly rotation matrix family ensures exact GFT degeneration at zero angle, with two AGFRFT variants (I-AGFRFT and II-AGFRFT) defined accordingly. Rigorous theoretical analyses confirm their unitarity, invertibility, and smooth parameter dependence. Both support learnable joint parameterization of the angle and fractional order, enabling adaptive spectral processing for diverse graph signals. Extensive experiments on real-world data denoising, image denoising, and point cloud denoising demonstrate that AGFRFT outperforms GFRFT and AGFT in terms of spectral concentration, reconstruction quality, and controllable spectral manipulation, establishing a robust and flexible tool for integrated angular fractional spectral analysis in graph signal processing.
☆ Pathlet Variational Auto-Encoder for Robust Trajectory Generation
Trajectory generation has recently drawn growing interest in privacy-preserving urban mobility studies and location-based service applications. Although many studies have used deep learning or generative AI methods to model trajectories and have achieved promising results, the robustness and interpretability of such models are largely unexplored. This limits the application of trajectory generation algorithms on noisy real-world data and their trustworthiness in downstream tasks. To address this issue, we exploit the regular structure in urban trajectories and propose a deep generative model based on the pathlet representation, which encode trajectories with binary vectors associated with a learned dictionary of trajectory segments. Specifically, we introduce a probabilistic graphical model to describe the trajectory generation process, which includes a Variational Autoencoder (VAE) component and a linear decoder component. During training, the model can simultaneously learn the latent embedding of pathlet representations and the pathlet dictionary that captures mobility patterns in the trajectory dataset. The conditional version of our model can also be used to generate customized trajectories based on temporal and spatial constraints. Our model can effectively learn data distribution even using noisy data, achieving relative improvements of $35.4\%$ and $26.3\%$ over strong baselines on two real-world trajectory datasets. Moreover, the generated trajectories can be conveniently utilized for multiple downstream tasks, including trajectory prediction and data denoising. Lastly, the framework design offers a significant efficiency advantage, saving $64.8\%$ of the time and $56.5\%$ of GPU memory compared to previous approaches.
☆ HybSpecNet: A Critical Analysis of Architectural Instability in Hybrid-Domain Spectral GNNs
Spectral Graph Neural Networks offer a principled approach to graph filtering but face a fundamental "Stability-vs-Adaptivity" trade-off. This trade-off is dictated by the choice of spectral domain. Filters in the finite [-1, 1] domain (e.g., ChebyNet) are numerically stable at high polynomial degrees (K) but are static and low-pass, causing them to fail on heterophilic graphs. Conversely, filters in the semi-infinite [0, infty) domain (e.g., KrawtchoukNet) are highly adaptive and achieve SOTA results on heterophily by learning non-low-pass responses. However, as we demonstrate, these adaptive filters can also suffer from numerical instability, leading to catastrophic performance collapse at high K. In this paper, we propose to resolve this trade-off by designing a hybrid-domain GNN, HybSpecNet, which combines a stable `ChebyNet` branch with an adaptive `KrawtchoukNet` branch. We first demonstrate that a "naive" hybrid architecture, which fuses the branches via concatenation, successfully unifies performance at low K, achieving strong results on both homophilic and heterophilic benchmarks. However, we then prove that this naive architecture fails the stability test. Our K-ablation experiments show that this architecture catastrophically collapses at K=25, exactly mirroring the collapse of its unstable `KrawtchoukNet` branch. We identify this critical finding as "Instability Poisoning," where `NaN`/`Inf` gradients from the adaptive branch destroy the training of the model. Finally, we propose and validate an advanced architecture that uses "Late Fusion" to completely isolate the gradient pathways. We demonstrate that this successfully solves the instability problem, remaining perfectly stable up to K=30 while retaining its SOTA performance across all graph types. This work identifies a critical architectural pitfall in hybrid GNN design and provides the robust architectural solution.
☆ Mitigating Estimation Bias with Representation Learning in TD Error-Driven Regularization
Deterministic policy gradient algorithms for continuous control suffer from value estimation biases that degrade performance. While double critics reduce such biases, the exploration potential of double actors remains underexplored. Building on temporal-difference error-driven regularization (TDDR), a double actor-critic framework, this work introduces enhanced methods to achieve flexible bias control and stronger representation learning. We propose three convex combination strategies, symmetric and asymmetric, that balance pessimistic estimates to mitigate overestimation and optimistic exploration via double actors to alleviate underestimation. A single hyperparameter governs this mechanism, enabling tunable control across the bias spectrum. To further improve performance, we integrate augmented state and action representations into the actor and critic networks. Extensive experiments show that our approach consistently outperforms benchmarks, demonstrating the value of tunable bias and revealing that both overestimation and underestimation can be exploited differently depending on the environment.
☆ AssayMatch: Learning to Select Data for Molecular Activity Models
The performance of machine learning models in drug discovery is highly dependent on the quality and consistency of the underlying training data. Due to limitations in dataset sizes, many models are trained by aggregating bioactivity data from diverse sources, including public databases such as ChEMBL. However, this approach often introduces significant noise due to variability in experimental protocols. We introduce AssayMatch, a framework for data selection that builds smaller, more homogenous training sets attuned to the test set of interest. AssayMatch leverages data attribution methods to quantify the contribution of each training assay to model performance. These attribution scores are used to finetune language embeddings of text-based assay descriptions to capture not just semantic similarity, but also the compatibility between assays. Unlike existing data attribution methods, our approach enables data selection for a test set with unknown labels, mirroring real-world drug discovery campaigns where the activities of candidate molecules are not known in advance. At test time, embeddings finetuned with AssayMatch are used to rank all available training data. We demonstrate that models trained on data selected by AssayMatch are able to surpass the performance of the model trained on the complete dataset, highlighting its ability to effectively filter out harmful or noisy experiments. We perform experiments on two common machine learning architectures and see increased prediction capability over a strong language-only baseline for 9/12 model-target pairs. AssayMatch provides a data-driven mechanism to curate higher-quality datasets, reducing noise from incompatible experiments and improving the predictive power and data efficiency of models for drug discovery. AssayMatch is available at https://github.com/Ozymandias314/AssayMatch.
☆ L-JacobiNet and S-JacobiNet: An Analysis of Adaptive Generalization, Stabilization, and Spectral Domain Trade-offs in GNNs
Spectral GNNs, like ChebyNet, are limited by heterophily and over-smoothing due to their static, low-pass filter design. This work investigates the "Adaptive Orthogonal Polynomial Filter" (AOPF) class as a solution. We introduce two models operating in the [-1, 1] domain: 1) `L-JacobiNet`, the adaptive generalization of `ChebyNet` with learnable alpha, beta shape parameters, and 2) `S-JacobiNet`, a novel baseline representing a LayerNorm-stabilized static `ChebyNet`. Our analysis, comparing these models against AOPFs in the [0, infty) domain (e.g., `LaguerreNet`), reveals critical, previously unknown trade-offs. We find that the [0, infty) domain is superior for modeling heterophily, while the [-1, 1] domain (Jacobi) provides superior numerical stability at high K (K>20). Most significantly, we discover that `ChebyNet`'s main flaw is stabilization, not its static nature. Our static `S-JacobiNet` (ChebyNet+LayerNorm) outperforms the adaptive `L-JacobiNet` on 4 out of 5 benchmark datasets, identifying `S-JacobiNet` as a powerful, overlooked baseline and suggesting that adaptation in the [-1, 1] domain can lead to overfitting.
☆ Operon: Incremental Construction of Ragged Data via Named Dimensions
Modern data processing workflows frequently encounter ragged data: collections with variable-length elements that arise naturally in domains like natural language processing, scientific measurements, and autonomous AI agents. Existing workflow engines lack native support for tracking the shapes and dependencies inherent to ragged data, forcing users to manage complex indexing and dependency bookkeeping manually. We present Operon, a Rust-based workflow engine that addresses these challenges through a novel formalism of named dimensions with explicit dependency relations. Operon provides a domain-specific language where users declare pipelines with dimension annotations that are statically verified for correctness, while the runtime system dynamically schedules tasks as data shapes are incrementally discovered during execution. We formalize the mathematical foundation for reasoning about partial shapes and prove that Operon's incremental construction algorithm guarantees deterministic and confluent execution in parallel settings. The system's explicit modeling of partially-known states enables robust persistence and recovery mechanisms, while its per-task multi-queue architecture achieves efficient parallelism across heterogeneous task types. Empirical evaluation demonstrates that Operon outperforms an existing workflow engine with 14.94x baseline overhead reduction while maintaining near-linear end-to-end output rates as workloads scale, making it particularly suitable for large-scale data generation pipelines in machine learning applications.
☆ A Mathematical Framework for Custom Reward Functions in Job Application Evaluation using Reinforcement Learning
Conventional Applicant Tracking Systems (ATS) tend to be inflexible keyword-matchers, and deny gifted candidates a role due to a few minor semantic mismatches. This article describes a new two-step process to design a more refined resume evaluation model based on a small language model (<600M parameters) that is finetuned using GRPO on a custom reward function. To begin with, Supervised Fine-Tuning (SFT) was used to build a solid baseline model. Second, this SFT model was also optimized with the help of Reinforcement Learning (RL) through GRPO under the guidance of a new, multi-component reward function that can holistically assess candidates beyond simple keyword matching. We indicate that the RL application presents a critical problem of reward hacking due to the initial experiments of aggressive penalties, which produces faulty, excessively negative model behaviors. We have overcome this challenge by refining the reward function repeatedly and training hyperparameters into a stable "gentle polishing process" of the reward function. Our resulting GRPO-polished model demonstrates significant real-world efficacy, achieving a final accuracy of 91% on unseen test data. The model shows a strong ability to correctly identify qualified candidates (recall of 0.85 for the 'SELECTED' class) while also showing exceptional precision (1.0), confirming its reliability. These results indicate that a properly executed, two-step fine-tuning procedure can indeed effectively refine a small language model to be able to conduct fine-tuned and human-like candidate scoring, overcoming the drawbacks of both traditional ATS and naive RL usage.
comment: 13 pages, 4 figures, 2 equations, 3 Tables
☆ ILoRA: Federated Learning with Low-Rank Adaptation for Heterogeneous Client Aggregation
Federated Learning with Low-Rank Adaptation (LoRA) faces three critical challenges under client heterogeneity: (1) Initialization-Induced Instability due to random initialization misaligning client subspaces; (2) Rank Incompatibility and Aggregation Error when averaging LoRA parameters of different ranks, which biases the global model; and (3) exacerbated Client Drift under Non-IID Data, impairing generalization. To address these challenges, we propose ILoRA, a unified framework that integrates three core innovations: a QR-based orthonormal initialization to ensure all clients start in a coherent subspace; a Concatenated QR Aggregation mechanism that fuses heterogeneous-rank updates via concatenation and decomposition, preserving information while maintaining dimension alignment; and an AdamW optimizer with rank-aware control variates to correct local updates and mitigate client drift. Supported by theoretical convergence guarantees, extensive experiments on vision and NLP benchmarks demonstrate that ILoRA consistently achieves superior accuracy and convergence stability compared to existing federated LoRA methods.
☆ Gauge-Equivariant Graph Networks via Self-Interference Cancellation
Graph Neural Networks (GNNs) excel on homophilous graphs but often fail under heterophily due to self-reinforcing and phase-inconsistent signals. We propose a Gauge-Equivariant Graph Network with Self-Interference Cancellation (GESC), which replaces additive aggregation with a projection-based interference mechanism. Unlike prior magnetic or gauge-equivariant GNNs that typically focus on phase handling in spectral filtering while largely relying on scalar weighting, GESC introduces a $\mathrm{U}(1)$ phase connection followed by a rank-1 projection that attenuates self-parallel components before attention. A sign- and phase-aware gate further regulates neighbor influence, attenuating components aligned with current node states and acting as a local notch on low-frequency modes. Across diverse graph benchmarks, our method consistently outperforms recent state-of-the-art models while offering a unified, interference-aware view of message passing. Our code is available at \href{here}{https://anonymous.4open.science/r/GESC-1B22}.
☆ Change-of-Basis Pruning via Rotational Invariance
Structured pruning removes entire neurons or channels, but its effectiveness depends on how importance is distributed across the representation space. Change-of-basis (CoB) pruning addresses this challenge by applying orthogonal linear transformations that concentrate importance within certain dimensions. However, many standard deep learning architectures are not inherently invariant to such transformations. To enable compatibility, we introduce two-subspace radial activations (TSRAs): an activation family that is invariant to orthogonal linear transformations applied independently within its two activation subspaces. This invariance allows CoB transformations to be merged into surrounding weights without incurring extra parameters. We position this work as a proof-of-concept that a rotationally invariant design may offer a principled approach towards change-of-basis pruning. We do not provide an analysis of multiple TSRA candidates nor do we explore weight initialization for any TSRAs. These limitations, combined with other necessary modifications we make to permit rotational invariance, result in a slight accuracy drop of $4.52\%$ compared to a ReLU-based control. However, using activation-magnitude importance, VGG-16 implementing our CoB+TSRA framework shows encouraging results on CIFAR-10. Under fixed-ratio structured pruning, CoB improves accuracy over a TSRA baseline at all pruning ratios and extends reliable pruning frontier from roughly $30\%$ to $70\%$ of parameters without post-prune fine tuning. Under threshold-based pruning strategies, CoB prunes $90-96\%$ of parameters while maintaining $1-6\%$ accuracy drop after fine-tuning. Together, these results indicate that rotationally invariant architectures may offer a promising path towards CoB pruning.
comment: 14 pages, 5 figures
☆ Agent0: Unleashing Self-Evolving Agents from Zero Data via Tool-Integrated Reasoning
Large Language Model (LLM) Agents, often trained with Reinforcement Learning (RL), are constrained by a dependency on human-curated data, limiting scalability and tethering AI to human knowledge. Existing self-evolution frameworks offer an alternative but are typically restricted by the model's inherent capabilities and single-round interactions, hindering the development of complex curricula involving tool use or dynamic reasoning. We introduce Agent0, a fully autonomous framework that evolves high-performing agents without external data through multi-step co-evolution and seamless tool integration. Agent0 establishes a symbiotic competition between two agents initialized from the same base LLM: a curriculum agent that proposes increasingly challenging frontier tasks, and an executor agent that learns to solve them. We integrate external tools to enhance the executor's problem-solving capacity; this improvement, in turn, pressures the curriculum agent to construct more complex, tool-aware tasks. Through this iterative process, Agent0 establishes a self-reinforcing cycle that continuously produces high-quality curricula. Empirically, Agent0 substantially boosts reasoning capabilities, improving the Qwen3-8B-Base model by 18% on mathematical reasoning and 24% on general reasoning benchmarks. Code is available at https://github.com/aiming-lab/Agent0.
☆ HGCN2SP: Hierarchical Graph Convolutional Network for Two-Stage Stochastic Programming
Two-stage Stochastic Programming (2SP) is a standard framework for modeling decision-making problems under uncertainty. While numerous methods exist, solving such problems with many scenarios remains challenging. Selecting representative scenarios is a practical method for accelerating solutions. However, current approaches typically rely on clustering or Monte Carlo sampling, failing to integrate scenario information deeply and overlooking the significant impact of the scenario order on solving time. To address these issues, we develop HGCN2SP, a novel model with a hierarchical graph designed for 2SP problems, encoding each scenario and modeling their relationships hierarchically. The model is trained in a reinforcement learning paradigm to utilize the feedback of the solver. The policy network is equipped with a hierarchical graph convolutional network for feature encoding and an attention-based decoder for scenario selection in proper order. Evaluation of two classic 2SP problems demonstrates that HGCN2SP provides high-quality decisions in a short computational time. Furthermore, HGCN2SP exhibits remarkable generalization capabilities in handling large-scale instances, even with a substantial number of variables or scenarios that were unseen during the training phase.
comment: 17 pages, 4 figures
☆ Towards a Safer and Sustainable Manufacturing Process: Material classification in Laser Cutting Using Deep Learning
Laser cutting is a widely adopted technology in material processing across various industries, but it generates a significant amount of dust, smoke, and aerosols during operation, posing a risk to both the environment and workers' health. Speckle sensing has emerged as a promising method to monitor the cutting process and identify material types in real-time. This paper proposes a material classification technique using a speckle pattern of the material's surface based on deep learning to monitor and control the laser cutting process. The proposed method involves training a convolutional neural network (CNN) on a dataset of laser speckle patterns to recognize distinct material types for safe and efficient cutting. Previous methods for material classification using speckle sensing may face issues when the color of the laser used to produce the speckle pattern is changed. Experiments conducted in this study demonstrate that the proposed method achieves high accuracy in material classification, even when the laser color is changed. The model achieved an accuracy of 98.30 % on the training set and 96.88% on the validation set. Furthermore, the model was evaluated on a set of 3000 new images for 30 different materials, achieving an F1-score of 0.9643. The proposed method provides a robust and accurate solution for material-aware laser cutting using speckle sensing.
☆ CARE: Turning LLMs Into Causal Reasoning Expert
Large language models (LLMs) have recently demonstrated impressive capabilities across a range of reasoning and generation tasks. However, research studies have shown that LLMs lack the ability to identify causal relationships, a fundamental cornerstone of human intelligence. We first conduct an exploratory investigation of LLMs' behavior when asked to perform a causal-discovery task and find that they mostly rely on the semantic meaning of variable names, ignoring the observation data. This is unsurprising, given that LLMs were never trained to process structural datasets. To first tackle this challenge, we prompt the LLMs with the outputs of established causal discovery algorithms designed for observational datasets. These algorithm outputs effectively serve as the sufficient statistics of the observation data. However, quite surprisingly, we find that prompting the LLMs with these sufficient statistics decreases the LLMs' performance in causal discovery. To address this current limitation, we propose CARE, a framework that enhances LLMs' causal-reasoning ability by teaching them to effectively utilize the outputs of established causal-discovery algorithms through supervised fine-tuning. Experimental results show that a finetuned Qwen2.5-1.5B model produced by CARE significantly outperforms both traditional causal-discovery algorithms and state-of-the-art LLMs with over a thousand times more parameters, demonstrating effective utilization of its own knowledge and the external algorithmic clues.
☆ Physics-Guided Inductive Spatiotemporal Kriging for PM2.5 with Satellite Gradient Constraints
High-resolution mapping of fine particulate matter (PM2.5) is a cornerstone of sustainable urbanism but remains critically hindered by the spatial sparsity of ground monitoring networks. While traditional data-driven methods attempt to bridge this gap using satellite Aerosol Optical Depth (AOD), they often suffer from severe, non-random data missingness (e.g., due to cloud cover or nighttime) and inversion biases. To overcome these limitations, this study proposes the Spatiotemporal Physics-Guided Inference Network (SPIN), a novel framework designed for inductive spatiotemporal kriging. Unlike conventional approaches, SPIN synergistically integrates domain knowledge into deep learning by explicitly modeling physical advection and diffusion processes via parallel graph kernels. Crucially, we introduce a paradigm-shifting training strategy: rather than using error-prone AOD as a direct input, we repurpose it as a spatial gradient constraint within the loss function. This allows the model to learn structural pollution patterns from satellite data while remaining robust to data voids. Validated in the highly polluted Beijing-Tianjin-Hebei and Surrounding Areas (BTHSA), SPIN achieves a new state-of-the-art with a Mean Absolute Error (MAE) of 9.52 ug/m^3, effectively generating continuous, physically plausible pollution fields even in unmonitored areas. This work provides a robust, low-cost, and all-weather solution for fine-grained environmental management.
♻ ☆ Distance-Preserving Representations for Genomic Spatial Reconstruction
The spatial context of single-cell gene expression data is crucial for many downstream analyses, yet often remains inaccessible due to practical and technical limitations, restricting the utility of such datasets. In this paper, we propose a generic representation learning and transfer learning framework dp-VAE, capable of reconstructing the spatial coordinates associated with the provided gene expression data. Central to our approach is a distance-preserving regularizer integrated into the loss function during training, ensuring the model effectively captures and utilizes spatial context signals from reference datasets. During the inference stage, the produced latent representation of the model can be used to reconstruct or impute the spatial context of the provided gene expression by solving a constrained optimization problem. We also explore the theoretical connections between distance-preserving loss, distortion, and the bi-Lipschitz condition within generative models. Finally, we demonstrate the effectiveness of dp-VAE in different tasks involving training robustness, out-of-sample evaluation, and transfer learning inference applications by testing it over 27 publicly available datasets. This underscores its applicability to a wide range of genomics studies that were previously hindered by the absence of spatial data.
♻ ☆ LLMInit: A Free Lunch from Large Language Models for Selective Initialization of Recommendation EMNLP 2025
Collaborative filtering (CF) is widely adopted in industrial recommender systems (RecSys) for modeling user-item interactions across numerous applications, but often struggles with cold-start and data-sparse scenarios. Recent advancements in pre-trained large language models (LLMs) with rich semantic knowledge, offer promising solutions to these challenges. However, deploying LLMs at scale is hindered by their significant computational demands and latency. In this paper, we propose a novel and scalable LLM-RecSys framework, LLMInit, designed to integrate pretrained LLM embeddings into CF models through selective initialization strategies. Specifically, we identify the embedding collapse issue observed when CF models scale and match the large embedding sizes in LLMs and avoid the problem by introducing efficient sampling methods, including, random, uniform, and variance-based selections. Comprehensive experiments conducted on multiple real-world datasets demonstrate that LLMInit significantly improves recommendation performance while maintaining low computational costs, offering a practical and scalable solution for industrial applications. To facilitate industry adoption and promote future research, we provide open-source access to our implementation at https://github.com/DavidZWZ/LLMInit.
comment: Accepted in EMNLP 2025 Industry Track
♻ ☆ Modular Jump Gaussian Processes
Gaussian processes (GPs) furnish accurate nonlinear predictions with well-calibrated uncertainty. However, the typical GP setup has a built-in stationarity assumption, making it ill-suited for modeling data from processes with sudden changes, or "jumps" in the output variable. The "jump GP" (JGP) was developed for modeling data from such processes, combining local GPs and latent "level" variables under a joint inferential framework. But joint modeling can be fraught with difficulty. We aim to simplify by suggesting a more modular setup, eschewing joint inference but retaining the main JGP themes: (a) learning optimal neighborhood sizes that locally respect manifolds of discontinuity; and (b) a new cluster-based (latent) feature to capture regions of distinct output levels on both sides of the manifold. We show that each of (a) and (b) separately leads to dramatic improvements when modeling processes with jumps. In tandem (but without requiring joint inference) that benefit is compounded, as illustrated on real and synthetic benchmark examples from the recent literature.
comment: 19 pages, 13 figures
♻ ☆ Leveraging Reinforcement Learning, Genetic Algorithms and Transformers for background determination in particle physics
Experimental studies of beauty hadron decays face significant challenges due to a wide range of backgrounds arising from the numerous possible decay channels with similar final states. For a particular signal decay, the process for ascertaining the most relevant background processes necessitates a detailed analysis of final state particles, potential misidentifications, and kinematic overlaps, which, due to computational limitations, is restricted to the simulation of only the most relevant backgrounds. Moreover, this process typically relies on the physicist's intuition and expertise, as no systematic method exists. This paper has two primary goals. First, from a particle physics perspective, we present a novel approach that utilises Reinforcement Learning (RL) to overcome the aforementioned challenges by systematically determining the critical backgrounds affecting beauty hadron decay measurements. While beauty hadron physics serves as the case study in this work, the proposed strategy is broadly adaptable to other types of particle physics measurements. Second, from a Machine Learning perspective, we introduce a novel algorithm which exploits the synergy between RL and Genetic Algorithms (GAs) for environments with highly sparse rewards and a large trajectory space. This strategy leverages GAs to efficiently explore the trajectory space and identify successful trajectories, which are used to guide the RL agent's training. Our method also incorporates a transformer architecture for the RL agent to handle token sequences representing decays.
comment: 34 pages, 12 figures
♻ ☆ Nonparametric estimation of conditional probability distributions using a generative approach based on conditional push-forward neural networks
We introduce conditional push-forward neural networks (CPFN), a generative framework for conditional distribution estimation. Instead of directly modeling the conditional density $f_{Y|X}$, CPFN learns a stochastic map $\varphi=\varphi(x,u)$ such that $\varphi(x,U)$ and $Y|X=x$ follow approximately the same law, with $U$ a suitable random vector of pre-defined latent variables. This enables efficient conditional sampling and straightforward estimation of conditional statistics through Monte Carlo methods. The model is trained via an objective function derived from a Kullback-Leibler formulation, without requiring invertibility or adversarial training. We establish a near-asymptotic consistency result and demonstrate experimentally that CPFN can achieve performance competitive with, or even superior to, state-of-the-art methods, including kernel estimators, tree-based algorithms, and popular deep learning techniques, all while remaining lightweight and easy to train.
♻ ☆ Measuring and Controlling Solution Degeneracy across Task-Trained Recurrent Neural Networks
Task-trained recurrent neural networks (RNNs) are widely used in neuroscience and machine learning to model dynamical computations. To gain mechanistic insight into how neural systems solve tasks, prior work often reverse-engineers individual trained networks. However, different RNNs trained on the same task and achieving similar performance can exhibit strikingly different internal solutions, a phenomenon known as solution degeneracy. Here, we develop a unified framework to systematically quantify and control solution degeneracy across three levels: behavior, neural dynamics, and weight space. We apply this framework to 3,400 RNNs trained on four neuroscience-relevant tasks: flip-flop memory, sine wave generation, delayed discrimination, and path integration, while systematically varying task complexity, learning regime, network size, and regularization. We find that higher task complexity and stronger feature learning reduce degeneracy in neural dynamics but increase it in weight space, with mixed effects on behavior. In contrast, larger networks and structural regularization reduce degeneracy at all three levels. These findings empirically validate the Contravariance Principle and provide practical guidance for researchers seeking to tune the variability of RNN solutions, either to uncover shared neural mechanisms or to model the individual variability observed in biological systems. This work provides a principled framework for quantifying and controlling solution degeneracy in task-trained RNNs, offering new tools for building more interpretable and biologically grounded models of neural computation.
♻ ☆ Complex variational autoencoders admit Kähler structure
It has been discovered that latent-Euclidean variational autoencoders (VAEs) admit, in various capacities, Riemannian structure. We adapt these arguments but for complex VAEs with a complex latent stage. We show that complex VAEs reveal to some level Kähler geometric structure. Our methods will be tailored for decoder geometry. We derive the Fisher information metric in the complex case under a latent complex Gaussian regularization with trivial relation matrix. It is well known from statistical information theory that the Fisher information coincides with the Hessian of the Kullback-Leibler (KL) divergence. Thus, the metric Kähler potential relation is exactly achieved under relative entropy. We propose a Kähler potential derivative of complex Gaussian mixtures that has rough equivalence to the Fisher information metric while still being faithful to the underlying Kähler geometry. Computation of the metric via this potential is efficient, and through our potential, valid as a plurisubharmonic (PSH) function, large scale computational burden of automatic differentiation is displaced to small scale. We show that we can regularize the latent space with decoder geometry, and that we can sample in accordance with a weighted complex volume element. We demonstrate these strategies, at the exchange of sample variation, yield consistently smoother representations and fewer semantic outliers.
♻ ☆ Bridging the Gap in XAI-Why Reliable Metrics Matter for Explainability and Compliance
Reliable explainability is not only a technical goal but also a cornerstone of private AI governance. As AI models enter high-stakes sectors, private actors such as auditors, insurers, certification bodies, and procurement agencies require standardized evaluation metrics to assess trustworthiness. However, current XAI evaluation metrics remain fragmented and prone to manipulation, which undermines accountability and compliance. We argue that standardized metrics can function as governance primitives, embedding auditability and accountability within AI systems for effective private oversight. Building upon prior work in XAI benchmarking, we identify key limitations in ensuring faithfulness, tamper resistance, and regulatory alignment. Furthermore, interpretability can directly support model alignment by providing a verifiable means of ensuring behavioral integrity in General Purpose AI (GPAI) systems. This connection between interpretability and alignment positions XAI metrics as both technical and regulatory instruments that help prevent alignment faking, a growing concern among oversight bodies. We propose a Governance by Metrics paradigm that treats explainability evaluation as a central mechanism of private AI governance. Our framework introduces a hierarchical model linking transparency, tamper resistance, scalability, and legal alignment, extending evaluation from model introspection toward systemic accountability. Through conceptual synthesis and alignment with governance standards, we outline a roadmap for integrating explainability metrics into continuous AI assurance pipelines that serve both private oversight and regulatory needs.
comment: Accepted at first EurIPS Workshop on Private AI Governance
♻ ☆ Interpretability as Alignment: Making Internal Understanding a Design Principle
Frontier AI systems require governance mechanisms that can verify internal alignment, not just behavioral compliance. Private governance mechanisms audits, certification, insurance, and procurement are emerging to complement public regulation, but they require technical substrates that generate verifiable causal evidence about model behavior. This paper argues that mechanistic interpretability provides this substrate. We frame interpretability not as post-hoc explanation but as a design constraint embedding auditability, provenance, and bounded transparency within model architectures. Integrating causal abstraction theory and empirical benchmarks such as MIB and LoBOX, we outline how interpretability-first models can underpin private assurance pipelines and role-calibrated transparency frameworks. This reframing situates interpretability as infrastructure for private AI governance bridging the gap between technical reliability and institutional accountability.
comment: Accepted at the first EurIPS Workshop on Private AI Governance
♻ ☆ Self-Supervised Discriminative Feature Learning for Deep Multi-View Clustering
Multi-view clustering is an important research topic due to its capability to utilize complementary information from multiple views. However, there are few methods to consider the negative impact caused by certain views with unclear clustering structures, resulting in poor multi-view clustering performance. To address this drawback, we propose self-supervised discriminative feature learning for deep multi-view clustering (SDMVC). Concretely, deep autoencoders are applied to learn embedded features for each view independently. To leverage the multi-view complementary information, we concatenate all views' embedded features to form the global features, which can overcome the negative impact of some views' unclear clustering structures. In a self-supervised manner, pseudo-labels are obtained to build a unified target distribution to perform multi-view discriminative feature learning. During this process, global discriminative information can be mined to supervise all views to learn more discriminative features, which in turn are used to update the target distribution. Besides, this unified target distribution can make SDMVC learn consistent cluster assignments, which accomplishes the clustering consistency of multiple views while preserving their features' diversity. Experiments on various types of multi-view datasets show that SDMVC outperforms 14 competitors including classic and state-of-the-art methods. The code is available at https://github.com/SubmissionsIn/SDMVC.
♻ ☆ Context-Aware Multimodal Representation Learning for Spatio-Temporally Explicit Environmental Modelling
Earth observation (EO) foundation models have emerged as an effective approach to derive latent representations of the Earth system from various remote sensing sensors. These models produce embeddings that can be used as analysis-ready datasets, enabling the modelling of ecosystem dynamics without extensive sensor-specific preprocessing. However, existing models typically operate at fixed spatial or temporal scales, limiting their use for ecological analyses that require both fine spatial detail and high temporal fidelity. To overcome these limitations, we propose a representation learning framework that integrates different EO modalities into a unified feature space at high spatio-temporal resolution. We introduce the framework using Sentinel-1 and Sentinel-2 data as representative modalities. Our approach produces a latent space at native 10 m resolution and the temporal frequency of cloud-free Sentinel-2 acquisitions. Each sensor is first modeled independently to capture its sensor-specific characteristics. Their representations are then combined into a shared model. This two-stage design enables modality-specific optimisation and easy extension to new sensors, retaining pretrained encoders while retraining only fusion layers. This enables the model to capture complementary remote sensing data and to preserve coherence across space and time. Qualitative analyses reveal that the learned embeddings exhibit high spatial and semantic consistency across heterogeneous landscapes. Quantitative evaluation in modelling Gross Primary Production reveals that they encode ecologically meaningful patterns and retain sufficient temporal fidelity to support fine-scale analyses. Overall, the proposed framework provides a flexible, analysis-ready representation learning approach for environmental applications requiring diverse spatial and temporal resolutions.
comment: 10 pages (incliding 2 pages of references), 7 figures
♻ ☆ A low-rank non-convex norm method for multiview graph clustering
This study introduces a novel technique for multi-view clustering known as the "Consensus Graph-Based Multi-View Clustering Method Using Low-Rank Non-Convex Norm" (CGMVC-NC). Multi-view clustering is a challenging task in machine learning as it requires the integration of information from multiple data sources or views to cluster data points accurately. The suggested approach makes use of the structural characteristics of multi-view data tensors, introducing a non-convex tensor norm to identify correlations between these views. In contrast to conventional methods, this approach demonstrates superior clustering accuracy across several benchmark datasets. Despite the non-convex nature of the tensor norm used, the proposed method remains amenable to efficient optimization using existing algorithms. The approach provides a valuable tool for multi-view data analysis and has the potential to enhance our understanding of complex systems in various fields. Further research can explore the application of this method to other types of data and extend it to other machine-learning tasks.
♻ ☆ STAMP: Spatial-Temporal Adapter with Multi-Head Pooling ML4H
Time series foundation models (TSFMs) pretrained on data from multiple domains have shown strong performance on diverse modeling tasks. Various efforts have been made to develop foundation models specific to electroencephalography (EEG) data, which records brain electrical activity as time series. However, no comparative analysis of EEG-specific foundation models (EEGFMs) versus general TSFMs has been performed on EEG-specific tasks. We introduce a novel Spatial-Temporal Adapter with Multi-Head Pooling (STAMP), which leverages univariate embeddings produced by a general TSFM, implicitly models spatial-temporal characteristics of EEG data, and achieves performance comparable to state-of-the-art EEGFMs. A comprehensive analysis is performed on 8 benchmark datasets of clinical tasks using EEG for classification, along with ablation studies. Our proposed adapter is lightweight in trainable parameters and flexible in the inputs it can accommodate, supporting easy modeling of EEG data using TSFMs.
comment: Accepted as a Proceedings paper at Machine Learning for Health (ML4H) 2025, invited presentation at the Time Series for Health (TS4H) Workshop, NeurIPS 2025. v2: Updated author affiliation and corrected a duplicated word in the text. No other changes
♻ ☆ Active Measurement: Efficient Estimation at Scale NeurIPS 2025
AI has the potential to transform scientific discovery by analyzing vast datasets with little human effort. However, current workflows often do not provide the accuracy or statistical guarantees that are needed. We introduce active measurement, a human-in-the-loop AI framework for scientific measurement. An AI model is used to predict measurements for individual units, which are then sampled for human labeling using importance sampling. With each new set of human labels, the AI model is improved and an unbiased Monte Carlo estimate of the total measurement is refined. Active measurement can provide precise estimates even with an imperfect AI model, and requires little human effort when the AI model is very accurate. We derive novel estimators, weighting schemes, and confidence intervals, and show that active measurement reduces estimation error compared to alternatives in several measurement tasks.
comment: NeurIPS 2025
♻ ☆ Efficient Solution and Learning of Robust Factored MDPs
Robust Markov decision processes (r-MDPs) extend MDPs by explicitly modelling epistemic uncertainty about transition dynamics. Learning r-MDPs from interactions with an unknown environment enables the synthesis of robust policies with provable (PAC) guarantees on performance, but this can require a large number of sample interactions. We propose novel methods for solving and learning r-MDPs based on factored state-space representations that leverage the independence between model uncertainty across system components. Although policy synthesis for factored r-MDPs leads to hard, non-convex optimisation problems, we show how to reformulate these into tractable linear programs. Building on these, we also propose methods to learn factored model representations directly. Our experimental results show that exploiting factored structure can yield dimensional gains in sample efficiency, producing more effective robust policies with tighter performance guarantees than state-of-the-art methods.
♻ ☆ KVTuner: Sensitivity-Aware Layer-Wise Mixed-Precision KV Cache Quantization for Efficient and Nearly Lossless LLM Inference ICML25
KV cache quantization can improve Large Language Models (LLMs) inference throughput and latency in long contexts and large batch-size scenarios while preserving LLMs effectiveness. However, current methods have three unsolved issues: overlooking layer-wise sensitivity to KV cache quantization, high overhead of online fine-grained decision-making, and low flexibility to different LLMs and constraints. Therefore, we theoretically analyze the inherent correlation of layer-wise transformer attention patterns to KV cache quantization errors and study why key cache is generally more important than value cache for quantization error reduction. We further propose a simple yet effective framework KVTuner to adaptively search for the optimal hardware-friendly layer-wise KV quantization precision pairs for coarse-grained KV cache with multi-objective optimization and directly utilize the offline searched configurations during online inference. To reduce the computational cost of offline calibration, we utilize the intra-layer KV precision pair pruning and inter-layer clustering to reduce the search space. Experimental results show that we can achieve nearly lossless 3.25-bit mixed precision KV cache quantization for LLMs like Llama-3.1-8B-Instruct and 4.0-bit for sensitive models like Qwen2.5-7B-Instruct on mathematical reasoning tasks. The maximum inference throughput can be improved by 21.25\% compared with KIVI-KV8 quantization over various context lengths. Our code and searched configurations are available at https://github.com/cmd2001/KVTuner.
comment: Accepted by ICML25. Code: https://github.com/cmd2001/KVTuner
♻ ☆ Fast-DataShapley: Neural Modeling for Training Data Valuation
The value and copyright of training data are crucial in the artificial intelligence industry. Service platforms should protect data providers' legitimate rights and fairly reward them for their contributions. Shapley value, a potent tool for evaluating contributions, outperforms other methods in theory, but its computational overhead escalates exponentially with the number of data providers. Recent works based on Shapley values attempt to mitigate computation complexity by approximation algorithms. However, they need to retrain for each test sample, leading to intolerable costs. We propose Fast-DataShapley, a one-pass training method that leverages the weighted least squares characterization of the Shapley value to train a reusable explainer model with real-time reasoning speed. Given new test samples, no retraining is required to calculate the Shapley values of the training data. Additionally, we propose three methods with theoretical guarantees to reduce training overhead from two aspects: the approximate calculation of the utility function and the group calculation of the training data. We analyze time complexity to show the efficiency of our methods. The experimental evaluations on various image datasets demonstrate superior performance and efficiency compared to baselines. Specifically, the performance is improved to more than 2 times, and the explainer's training speed can be increased by two orders of magnitude.
♻ ☆ AutoJudge: Judge Decoding Without Manual Annotation NeurIPS 2025
We introduce AutoJudge, a method that accelerates large language model (LLM) inference with task-specific lossy speculative decoding. Instead of matching the original model output distribution token-by-token, we identify which of the generated tokens affect the downstream quality of the response, relaxing the distribution match guarantee so that the "unimportant" tokens can be generated faster. Our approach relies on a semi-greedy search algorithm to test which of the mismatches between target and draft models should be corrected to preserve quality and which ones may be skipped. We then train a lightweight classifier based on existing LLM embeddings to predict, at inference time, which mismatching tokens can be safely accepted without compromising the final answer quality. We evaluate the effectiveness of AutoJudge with multiple draft/target model pairs on mathematical reasoning and programming benchmarks, achieving significant speedups at the cost of a minor accuracy reduction. Notably, on GSM8k with the Llama 3.1 70B target model, our approach achieves up to $\approx2\times$ speedup over speculative decoding at the cost of $\le 1\%$ drop in accuracy. When applied to the LiveCodeBench benchmark, AutoJudge automatically detects programming-specific important tokens, accepting $\ge 25$ tokens per speculation cycle at $2\%$ drop in Pass@1. Our approach requires no human annotation and is easy to integrate with modern LLM inference frameworks.
comment: Accepted at NeurIPS 2025
♻ ☆ Efficient and Accurate Spatial Mixing of Machine Learned Interatomic Potentials for Materials Science
Machine-learned interatomic potentials can offer near first-principles accuracy but are computationally expensive, limiting their application to large-scale molecular dynamics simulations. Inspired by quantum mechanics/molecular mechanics methods we present ML-MIX, a CPU- and GPU-compatible LAMMPS package to accelerate simulations by spatially mixing interatomic potentials of different complexities allowing deployment of modern MLIPs even under restricted computational budgets. We demonstrate our method for ACE, UF3, SNAP and MACE potential architectures and demonstrate how linear 'cheap' potentials can be distilled from a given 'expensive' potential, allowing close matching in relevant regions of configuration space. The functionality of ML-MIX is demonstrated through tests on point defects in Si, Fe and W-He, in which speedups of up to 11x over ~ 8,000 atoms are demonstrated, without sacrificing accuracy. The scientific potential of ML-MIX is demonstrated via two case studies in W, measuring the mobility of b = 1/2 111 screw dislocations with ACE/ACE mixing and the implantation of He with MACE/SNAP mixing. The latter returns He reflection coefficients which (for the first time) match experimental observations up to an He incident energy of 80 eV - demonstrating the benefits of deploying state-of-the-art models on large, realistic systems.
comment: 30 pages, 17 figures. To access the ML-MIX GitHub, go to https://github.com/kermodegroup/ML-MIX
♻ ☆ Diagnosing Hallucination Risk in AI Surgical Decision-Support: A Sequential Framework for Sequential Validation
Large language models (LLMs) offer transformative potential for clinical decision support in spine surgery but pose significant risks through hallucinations, which are factually inconsistent or contextually misaligned outputs that may compromise patient safety. This study introduces a clinician-centered framework to quantify hallucination risks by evaluating diagnostic precision, recommendation quality, reasoning robustness, output coherence, and knowledge alignment. We assessed six leading LLMs across 30 expert-validated spinal cases. DeepSeek-R1 demonstrated superior overall performance (total score: 86.03 $\pm$ 2.08), particularly in high-stakes domains such as trauma and infection. A critical finding reveals that reasoning-enhanced model variants did not uniformly outperform standard counterparts: Claude-3.7-Sonnet's extended thinking mode underperformed relative to its standard version (80.79 $\pm$ 1.83 vs. 81.56 $\pm$ 1.92), indicating extended chain-of-thought reasoning alone is insufficient for clinical reliability. Multidimensional stress-testing exposed model-specific vulnerabilities, with recommendation quality degrading by 7.4% under amplified complexity. This decline contrasted with marginal improvements in rationality (+2.0%), readability (+1.7%) and diagnosis (+4.7%), highlighting a concerning divergence between perceived coherence and actionable guidance. Our findings advocate integrating interpretability mechanisms (e.g., reasoning chain visualization) into clinical workflows and establish a safety-aware validation framework for surgical LLM deployment.
♻ ☆ A Distributionally Robust Framework for Nuisance in Causal Effect Estimation ICONIP 2025
Causal inference requires evaluating models on balanced distributions between treatment and control groups, while training data often exhibits imbalance due to historical decision-making policies. Most conventional statistical methods address this distribution shift through inverse probability weighting (IPW), which requires estimating propensity scores as an intermediate step. These methods face two key challenges: inaccurate propensity estimation and instability from extreme weights. We decompose the generalization error to isolate these issues--propensity ambiguity and statistical instability--and address them through an adversarial loss function. Our approach combines distributionally robust optimization for handling propensity uncertainty with weight regularization based on weighted Rademacher complexity. Experiments on synthetic and real-world datasets demonstrate consistent improvements over existing methods.
comment: The Version of Record of this contribution is published in the Neural Information Processing, ICONIP 2025 Proceedings and is available online at https://doi.org/10.1007/978-981-95-4094-5_19
♻ ☆ Generative AI, Managerial Expectations, and Economic Activity
We use generative AI to extract managerial expectations about their economic outlook from 120,000+ corporate conference call transcripts. The resulting AI Economy Score predicts GDP growth, production, and employment up to 10 quarters ahead, beyond existing measures like survey forecasts. Moreover, industry and firm-level measures provide valuable information about sector-specific and individual firm activities. A composite measure that integrates managerial expectations about firm, industry, and macroeconomic conditions further significantly improves the forecasting power and predictive horizon of national and sectoral growth. Our findings show managerial expectations offer unique insights into economic activity, with implications for both macroeconomic and microeconomic decision-making.
comment: 27 Pages, 5 Figures, 17 Tables
♻ ☆ Exploring the Hidden Reasoning Process of Large Language Models by Misleading Them
Large language models (LLMs) have been able to perform various forms of reasoning tasks in a wide range of scenarios, but are they truly engaging in task abstraction and rule-based reasoning beyond mere memorization? To answer this question, we propose a novel experimental approach, Misleading Fine-Tuning (MisFT), to examine whether LLMs perform abstract reasoning by altering their original understanding of fundamental rules. In particular, by constructing datasets with math expressions or logical formulas that contradict correct principles, we fine-tune the model to learn those contradictory rules and assess its generalization ability on unseen test domains. Through a series of experiments, we find that current LLMs are capable of applying contradictory rules to solve practical math word problems and natural language reasoning tasks, implying the presence of an internal mechanism in LLMs that abstracts before reasoning.
♻ ☆ BanditSpec: Adaptive Speculative Decoding via Bandit Algorithms ICML
Speculative decoding has emerged as a popular method to accelerate the inference of Large Language Models (LLMs) while retaining their superior text generation performance. Previous methods either adopt a fixed speculative decoding configuration regardless of the prefix tokens, or train draft models in an offline or online manner to align them with the context. This paper proposes a training-free online learning framework to adaptively choose the configuration of the hyperparameters for speculative decoding as text is being generated. We first formulate this hyperparameter selection problem as a Multi-Armed Bandit problem and provide a general speculative decoding framework BanditSpec. Furthermore, two bandit-based hyperparameter selection algorithms, UCBSpec and EXP3Spec, are designed and analyzed in terms of a novel quantity, the stopping time regret. We upper bound this regret under both stochastic and adversarial reward settings. By deriving an information-theoretic impossibility result, it is shown that the regret performance of UCBSpec is optimal up to universal constants. Finally, extensive empirical experiments with LLaMA3 and Qwen2 demonstrate that our algorithms are effective compared to existing methods, and the throughput is close to the oracle best hyperparameter in simulated real-life LLM serving scenarios with diverse input prompts.
comment: 35 pages, 4 figures, accepted to ICML, typos and affiliations are corrected
♻ ☆ TabDistill: Distilling Transformers into Neural Nets for Few-Shot Tabular Classification
Transformer-based models have shown promising performance on tabular data compared to their classical counterparts such as neural networks and Gradient Boosted Decision Trees (GBDTs) in scenarios with limited training data. They utilize their pre-trained knowledge to adapt to new domains, achieving commendable performance with only a few training examples, also called the few-shot regime. However, the performance gain in the few-shot regime comes at the expense of significantly increased complexity and number of parameters. To circumvent this trade-off, we introduce TabDistill, a new strategy to distill the pre-trained knowledge in complex transformer-based models into simpler neural networks for effectively classifying tabular data. Our framework yields the best of both worlds: being parameter-efficient while performing well with limited training data. The distilled neural networks surpass classical baselines such as regular neural networks, XGBoost and logistic regression under equal training data, and in some cases, even the original transformer-based models that they were distilled from.
♻ ☆ Non-Asymptotic Analysis of Data Augmentation for Precision Matrix Estimation NeurIPS 2025
This paper addresses the problem of inverse covariance (also known as precision matrix) estimation in high-dimensional settings. Specifically, we focus on two classes of estimators: linear shrinkage estimators with a target proportional to the identity matrix, and estimators derived from data augmentation (DA). Here, DA refers to the common practice of enriching a dataset with artificial samples--typically generated via a generative model or through random transformations of the original data--prior to model fitting. For both classes of estimators, we derive estimators and provide concentration bounds for their quadratic error. This allows for both method comparison and hyperparameter tuning, such as selecting the optimal proportion of artificial samples. On the technical side, our analysis relies on tools from random matrix theory. We introduce a novel deterministic equivalent for generalized resolvent matrices, accommodating dependent samples with specific structure. We support our theoretical results with numerical experiments.
comment: Conference paper at NeurIPS 2025 (Spotlight)
♻ ☆ Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation
This report introduces Kandinsky 5.0, a family of state-of-the-art foundation models for high-resolution image and 10-second video synthesis. The framework comprises three core line-up of models: Kandinsky 5.0 Image Lite - a line-up of 6B parameter image generation models, Kandinsky 5.0 Video Lite - a fast and lightweight 2B parameter text-to-video and image-to-video models, and Kandinsky 5.0 Video Pro - 19B parameter models that achieves superior video generation quality. We provide a comprehensive review of the data curation lifecycle - including collection, processing, filtering and clustering - for the multi-stage training pipeline that involves extensive pre-training and incorporates quality-enhancement techniques such as self-supervised fine-tuning (SFT) and reinforcement learning (RL)-based post-training. We also present novel architectural, training, and inference optimizations that enable Kandinsky 5.0 to achieve high generation speeds and state-of-the-art performance across various tasks, as demonstrated by human evaluation. As a large-scale, publicly available generative framework, Kandinsky 5.0 leverages the full potential of its pre-training and subsequent stages to be adapted for a wide range of generative applications. We hope that this report, together with the release of our open-source code and training checkpoints, will substantially advance the development and accessibility of high-quality generative models for the research community.
comment: Website: https://kandinskylab.ai/
♻ ☆ Kaggle Chronicles: 15 Years of Competitions, Community and Data Science Innovation
Since 2010, Kaggle has been a platform where data scientists from around the world come together to compete, collaborate, and push the boundaries of Data Science. Over these 15 years, it has grown from a purely competition-focused site into a broader ecosystem with forums, notebooks, models, datasets, and more. With the release of the Kaggle Meta Code and Kaggle Meta Datasets, we now have a unique opportunity to explore these competitions, technologies, and real-world applications of Machine Learning and AI. And so in this study, we take a closer look at 15 years of data science on Kaggle - through metadata, shared code, community discussions, and the competitions themselves. We explore Kaggle's growth, its impact on the data science community, uncover hidden technological trends, analyze competition winners, how Kagglers approach problems in general, and more. We do this by analyzing millions of kernels and discussion threads to perform both longitudinal trend analysis and standard exploratory data analysis. Our findings show that Kaggle is a steadily growing platform with increasingly diverse use cases, and that Kagglers are quick to adapt to new trends and apply them to real-world challenges, while producing - on average - models with solid generalization capabilities. We also offer a snapshot of the platform as a whole, highlighting its history and technological evolution. Finally, this study is accompanied by a video (https://www.youtube.com/watch?v=YVOV9bIUNrM) and a Kaggle write-up (https://kaggle.com/competitions/meta-kaggle-hackathon/writeups/kaggle-chronicles-15-years-of-competitions-communi) for your convenience.
♻ ☆ VisPlay: Self-Evolving Vision-Language Models from Images
Reinforcement learning (RL) provides a principled framework for improving Vision-Language Models (VLMs) on complex reasoning tasks. However, existing RL approaches often rely on human-annotated labels or task-specific heuristics to define verifiable rewards, both of which are costly and difficult to scale. We introduce VisPlay, a self-evolving RL framework that enables VLMs to autonomously improve their reasoning abilities using large amounts of unlabeled image data. Starting from a single base VLM, VisPlay assigns the model into two interacting roles: an Image-Conditioned Questioner that formulates challenging yet answerable visual questions, and a Multimodal Reasoner that generates silver responses. These roles are jointly trained with Group Relative Policy Optimization (GRPO), which incorporates diversity and difficulty rewards to balance the complexity of generated questions with the quality of the silver answers. VisPlay scales efficiently across two model families. When trained on Qwen2.5-VL and MiMo-VL, VisPlay achieves consistent improvements in visual reasoning, compositional generalization, and hallucination reduction across eight benchmarks, including MM-Vet and MMMU, demonstrating a scalable path toward self-evolving multimodal intelligence. The project page is available at https://bruno686.github.io/VisPlay/
♻ ☆ Policy Search, Retrieval, and Composition via Task Similarity in Collaborative Agentic Systems
Agentic AI aims to create systems that set their own goals, adapt proactively to change, and refine behavior through continuous experience. Recent advances suggest that, when facing multiple and unforeseen tasks, agents could benefit from sharing machine-learned knowledge and reusing policies that have already been fully or partially learned by other agents. However, how to query, select, and retrieve policies from a pool of agents, and how to integrate such policies remains a largely unexplored area. This study explores how an agent decides what knowledge to select, from whom, and when and how to integrate it in its own policy in order to accelerate its own learning. The proposed algorithm, \emph{Modular Sharing and Composition in Collective Learning} (MOSAIC), improves learning in agentic collectives by combining (1) knowledge selection using performance signals and cosine similarity on Wasserstein task embeddings, (2) modular and transferable neural representations via masks, and (3) policy integration, composition and fine-tuning. MOSAIC outperforms isolated learners and global sharing approaches in both learning speed and overall performance, and in some cases solves tasks that isolated agents cannot. The results also demonstrate that selective, goal-driven reuse leads to less susceptibility to task interference. We also observe the emergence of self-organization, where agents solving simpler tasks accelerate the learning of harder ones through shared knowledge.
comment: 24 pages, 20 figures, 8 tables
♻ ☆ Enhancing Visual Feature Attribution via Weighted Integrated Gradients
Integrated Gradients (IG) is a widely used attribution method in explainable AI, particularly in computer vision applications where reliable feature attribution is essential. A key limitation of IG is its sensitivity to the choice of baseline (reference) images. Multi-baseline extensions such as Expected Gradients (EG) assume uniform weighting over baselines, implicitly treating baseline images as equally informative. In high-dimensional vision models, this assumption often leads to noisy or unstable explanations. This paper proposes Weighted Integrated Gradients (WG), a principled approach that evaluates and weights baselines to enhance attribution reliability. WG introduces an unsupervised criterion for baseline suitability, enabling adaptive selection and weighting of baselines on a per-input basis. The method not only preserves core axiomatic properties of IG but also provides improved theoretical guarantees on the quality of explanation over EG. Experiments on commonly used image datasets and models show that WG consistently outperforms EG, yielding 10 to 35 percent improvements in attribution fidelity. WG further identifies informative baseline subsets, reducing unnecessary variability while maintaining high attribution accuracy. By moving beyond the idea that all baselines matter equally, Weighted Integrated Gradients offers a clearer and more reliable way to explain computer-vision models, improving both understanding and practical usability in explainable AI.
♻ ☆ Modelling Global Trade with Optimal Transport
Global trade is shaped by a complex mix of factors beyond supply and demand, including tangible variables like transport costs and tariffs, as well as less quantifiable influences such as political and economic relations. Traditionally, economists model trade using gravity models, which rely on explicit covariates that might struggle to capture these subtler drivers of trade. In this work, we employ optimal transport and a deep neural network to learn a time-dependent cost function from data, without imposing a specific functional form. This approach consistently outperforms traditional gravity models in accuracy and has similar performance to three-way gravity models, while providing natural uncertainty quantification. Applying our framework to global food and agricultural trade, we show that the Global South suffered disproportionately from the war in Ukraine's impact on wheat markets. We also analyse the effects of free-trade agreements and trade disputes with China, as well as Brexit's impact on British trade with Europe, uncovering hidden patterns that trade volumes alone cannot reveal.
♻ ☆ Provably Robust Pre-Trained Ensembles for Biomarker-Based Cancer Classification IJCAI 2024
Certain cancer types, notably pancreatic cancer, are difficult to detect at an early stage, motivating robust biomarker-based screening. Liquid biopsies enable non-invasive monitoring of circulating biomarkers, but typical machine learning pipelines for high-dimensional tabular data (e.g., random forests, SVMs) rely on expensive hyperparameter tuning and can be brittle under class imbalance. We leverage a meta-trained Hyperfast model for classifying cancer, accomplishing the highest AUC of 0.9929 and simultaneously achieving robustness especially on highly imbalanced datasets compared to other ML algorithms in several binary classification tasks (e.g. breast invasive carcinoma; BRCA vs. non-BRCA). We also propose a novel ensemble model combining pre-trained Hyperfast model, XGBoost, and LightGBM for multi-class classification tasks, achieving an incremental increase in accuracy (0.9464) while merely using 500 PCA features; distinguishable from previous studies where they used more than 2,000 features for similar results. Crucially, we demonstrate robustness under class imbalance: empirically via balanced accuracy and minority-class recall across cancer-vs.-noncancer and cancer-vs.-rest settings, and theoretically by showing (i) a prototype-form final layer for Hyperfast that yields prior-insensitive decisions under bounded bias, and (ii) minority-error reductions for majority vote under mild error diversity. Together, these results indicate that pre-trained tabular models and simple ensembling can deliver state-of-the-art accuracy and improved minority-class performance with far fewer features and no additional tuning.
comment: Accepted to the AIAA Workshop at IJCAI 2024
♻ ☆ Estimation of Cardiac and Non-cardiac Diagnosis from Electrocardiogram Features
Ensuring timely and accurate diagnosis of medical conditions is paramount for effective patient care. Electrocardiogram (ECG) signals are fundamental for evaluating a patient's cardiac health and are readily available. Despite this, little attention has been given to the remarkable potential of ECG data in detecting non-cardiac conditions. In our study, we used publicly available datasets (MIMIC-IV-ECG-ICD and ECG-VIEW II) to investigate the feasibility of inferring general diagnostic conditions from ECG features. To this end, we trained a tree-based model (XGBoost) based on ECG features and basic demographic features to estimate a wide range of diagnoses, encompassing both cardiac and non-cardiac conditions. Our results demonstrate the reliability of estimating 23 cardiac as well as 21 non-cardiac conditions above 0.7 AUROC in a statistically significant manner across a wide range of physiological categories. Our findings underscore the predictive potential of ECG data in identifying well-known cardiac conditions. However, even more striking, this research represents a pioneering effort in systematically expanding the scope of ECG-based diagnosis to conditions not traditionally associated with the cardiac system.
comment: Accepted by Computer in Cardiology 2024, 4 pages, source code under https://github.com/AI4HealthUOL/CardioDiag
♻ ☆ CardioLab: Laboratory Values Estimation from Electrocardiogram Features - An Exploratory Study
Laboratory value represents a cornerstone of medical diagnostics, but suffers from slow turnaround times, and high costs and only provides information about a single point in time. The continuous estimation of laboratory values from non-invasive data such as electrocardiogram (ECG) would therefore mark a significant frontier in healthcare monitoring. Despite its potential, this domain remains relatively underexplored. In this preliminary study, we used a publicly available dataset (MIMIC-IV-ECG) to investigate the feasibility of inferring laboratory values from ECG features and patient demographics using tree-based models (XGBoost). We define the prediction task as a binary problem of whether the lab value falls into low or high abnormalities. We assessed model performance with AUROC. Our findings demonstrate promising results in the estimation of laboratory values related to different organ systems. While further research and validation are warranted to fully assess the clinical utility and generalizability of the approach, our findings lay the groundwork for future investigations for laboratory value estimation using ECG data. Such advancements hold promise for revolutionizing predictive healthcare applications, offering faster, non-invasive, and more affordable means of patient monitoring.
comment: Accepted by Computing in Cardiology 2024, 4 pages, code under https://github.com/AI4HealthUOL/CardioLab
♻ ☆ Explainable machine learning for neoplasms diagnosis via electrocardiograms: an externally validated study
Background: Neoplasms are a major cause of mortality globally, where early diagnosis is essential for improving outcomes. Current diagnostic methods are often invasive, expensive, and inaccessible in resource-limited settings. This study explores the potential of electrocardiogram (ECG) data, a widely available and non-invasive tool for diagnosing neoplasms through cardiovascular changes linked to neoplastic presence. Methods: A diagnostic pipeline combining tree-based machine learning models with Shapley value analysis for explainability was developed. The model was trained and internally validated on a large dataset and externally validated on an independent cohort to ensure robustness and generalizability. Key ECG features contributing to predictions were identified and analyzed. Results: The model achieved high diagnostic accuracy in both internal testing and external validation cohorts. Shapley value analysis highlighted significant ECG features, including novel predictors. The approach is cost-effective, scalable, and suitable for resource-limited settings, offering insights into cardiovascular changes associated with neoplasms and their therapies. Conclusions: This study demonstrates the feasibility of using ECG signals and machine learning for non-invasive neoplasm diagnosis. By providing interpretable insights into cardio-neoplasm interactions, this method addresses gaps in diagnostics and supports integration into broader diagnostic and therapeutic frameworks.
comment: Accepted by Cardio-Oncology BMC, 28 pages, 6 figures, code under https://github.com/AI4HealthUOL/CardioDiag
♻ ☆ Bipartite Graph Variational Auto-Encoder with Fair Latent Representation to Account for Sampling Bias in Ecological Networks
Citizen science monitoring programs can generate large amounts of valuable data, but are often affected by sampling bias. We focus on a citizen science initiative that records plant-pollinator interactions, with the goal of learning embeddings that summarize the observed interactions while accounting for such bias. In our approach, plant and pollinator species are embedded based on their probability of interaction. These embeddings are derived using an adaptation of variational graph autoencoders for bipartite graphs. To mitigate the influence of sampling bias, we incorporate the Hilbert-Schmidt Independence Criterion (HSIC) to ensure independence from continuous variables related to the sampling process. This allows us to integrate a fairness perspective, commonly explored in the social sciences, into the analysis of ecological data. We validate our method through a simulation study replicating key aspects of the sampling process and demonstrate its applicability and effectiveness using the Spipoll dataset.
♻ ☆ LoRA on the Go: Instance-level Dynamic LoRA Selection and Merging
Low-Rank Adaptation (LoRA) has emerged as a parameter-efficient approach for fine-tuning large language models. However, conventional LoRA adapters are typically trained for a single task, limiting their applicability in real-world settings where inputs may span diverse and unpredictable domains. At inference time, existing approaches combine multiple LoRAs for improving performance on diverse tasks, while usually requiring labeled data or additional task-specific training, which is expensive at scale. In this work, we introduce LoRA on the Go (LoGo), a training-free framework that dynamically selects and merges adapters at the instance level without any additional requirements. LoGo leverages signals extracted from a single forward pass through LoRA adapters, to identify the most relevant adapters and determine their contributions on-the-fly. Across 5 NLP benchmarks, 27 datasets, and 3 model families, LoGo outperforms training-based baselines on some tasks upto a margin of 3.6% while remaining competitive on other tasks and maintaining inference throughput, highlighting its effectiveness and practicality.
♻ ☆ Generalized Gradient Norm Clipping & Non-Euclidean $(L_0,L_1)$-Smoothness
This work introduces a hybrid non-Euclidean optimization method which generalizes gradient norm clipping by combining steepest descent and conditional gradient approaches. The method achieves the best of both worlds by establishing a descent property under a generalized notion of ($L_0$,$L_1$)-smoothness. Weight decay is incorporated in a principled manner by identifying a connection to the Frank-Wolfe short step. In the stochastic case, we show an order optimal $O(n^{-1/4})$ convergence rate by leveraging a momentum based gradient estimator. We discuss how to instantiate the algorithms for deep learning, which we dub Clipped Scion, and demonstrate their properties on image classification and language modeling. The code is available at https://github.com/LIONS-EPFL/ClippedScion.
♻ ☆ Testing the spin-bath view of self-attention: A Hamiltonian analysis of GPT-2 Transformer
The recently proposed physics-based framework by Huo and Johnson~\cite{huo2024capturing} models the attention mechanism of Large Language Models (LLMs) as an interacting two-body spin system, offering a first-principles explanation for phenomena like repetition and bias. Building on this hypothesis, we extract the complete Query-Key weight matrices from a production-grade GPT-2 model and derive the corresponding effective Hamiltonian for every attention head. From these Hamiltonians, we obtain analytic phase boundaries and logit gap criteria that predict which token should dominate the next-token distribution for a given context. A systematic evaluation on 144 heads across 20 factual-recall prompts reveals a strong negative correlation between the theoretical logit gaps and the model's empirical token rankings ($r\approx-0.70$, $p<10^{-3}$).Targeted ablations further show that suppressing the heads most aligned with the spin-bath predictions induces the anticipated shifts in output probabilities, confirming a causal link rather than a coincidental association. Taken together, our findings provide the first strong empirical evidence for the spin-bath analogy in a production-grade model. In this work, we utilize the context-field lens, which provides physics-grounded interpretability and motivates the development of novel generative models bridging theoretical condensed matter physics and artificial intelligence.
♻ ☆ LSAP: Rethinking Inversion Fidelity, Perception and Editability in GAN Latent Space
As research on image inversion advances, the process is generally divided into two stages. The first step is Image Embedding, involves using an encoder or optimization procedure to embed an image and obtain its corresponding latent code. The second stage, referred to as Result Refinement, further improves the inversion and editing outcomes. Although this refinement stage substantially enhances reconstruction fidelity, perception and editability remain largely unchanged and are highly dependent on the latent codes derived from the first stage. Therefore, a key challenge lies in obtaining latent codes that preserve reconstruction fidelity while simultaneously improving perception and editability. In this work, we first reveal that these two properties are closely related to the degree of alignment (or disalignment) between the inverted latent codes and the synthetic distribution. Based on this insight, we propose the \textbf{ Latent Space Alignment Inversion Paradigm (LSAP)}, which integrates both an evaluation metric and a unified inversion solution. Specifically, we introduce the \textbf{Normalized Style Space ($\mathcal{S^N}$ space)} and \textbf{Normalized Style Space Cosine Distance (NSCD)} to quantify the disalignment of inversion methods. Moreover, our paradigm can be optimized for both encoder-based and optimization-based embeddings, providing a consistent alignment framework. Extensive experiments across various domains demonstrate that NSCD effectively captures perceptual and editable characteristics, and that our alignment paradigm achieves state-of-the-art performance in both stages of inversion.
comment: under review
♻ ☆ QUASAR: An Evolutionary Algorithm to Accelerate High-Dimensional Optimization
High-dimensional numerical optimization presents a persistent challenge. This paper introduces Quasi-Adaptive Search with Asymptotic Reinitialization (QUASAR), an evolutionary algorithm to accelerate convergence in complex, non-differentiable problems afflicted by the curse of dimensionality. Evaluated on the notoriously difficult CEC2017 benchmark suite of 29 functions, QUASAR achieved the lowest overall rank sum (150) using the Friedman test, significantly outperforming L-SHADE (229) and standard DE (305) in the dimension-variant trials. QUASAR also proves computationally efficient, with run times averaging $1.4 \text{x}$ faster than DE and $7.8 \text{x}$ faster than L-SHADE ($p \ll 0.001$) in the population-variant trials. Building upon Differential Evolution (DE), QUASAR introduces a highly stochastic architecture to dynamically balance exploration and exploitation. Inspired by the probabilistic behavior of quantum particles in a stellar core, the algorithm implements three primary components that augment standard DE mechanisms: 1) probabilistically selected mutation strategies and scaling factors; 2) rank-based crossover rates; 3) asymptotically decaying reinitialization that leverages a covariance matrix of the best solutions to introduce high-quality genetic diversity. QUASAR's performance establishes it as an effective, user-friendly optimizer for complex high-dimensional problems.
comment: 8 pages, 6 figures. Open-source package containing QUASAR is available on PyPI via 'pip install hdim_opt'; source code (with experiments) is also maintained on GitHub at [https://www.github.com/jgsoltes/hdim-opt]
♻ ☆ Multi-Objective $\textit{min-max}$ Online Convex Optimization
In online convex optimization (OCO), a single loss function sequence is revealed over a time horizon of $T$, and an online algorithm has to choose its action at time $t$, before the loss function at time $t$ is revealed. The goal of the online algorithm is to incur minimal penalty (called $\textit{regret}$ compared to a static optimal action made by an optimal offline algorithm knowing all functions of the sequence in advance. In this paper, we broaden the horizon of OCO, and consider multi-objective OCO, where there are $K$ distinct loss function sequences, and an algorithm has to choose its action at time $t$, before the $K$ loss functions at time $t$ are revealed. To capture the tradeoff between tracking the $K$ different sequences, we consider the $\textit{min-max}$ regret, where the benchmark (optimal offline algorithm) takes a static action across all time slots that minimizes the maximum of the total loss (summed across time slots) incurred by each of the $K$ sequences. An online algorithm is allowed to change its action across time slots, and its {\it min-max} regret is defined as the difference between its $\textit{min-max}$ cost and that of the benchmark. The $\textit{min-max}$ regret is a stringent performance measure and an algorithm with small regret needs to `track' all loss function sequences closely at all times. We consider this $\textit{min-max}$ regret in the i.i.d. input setting where all loss functions are i.i.d. generated from an unknown distribution. For the i.i.d. model we propose a simple algorithm that combines the well-known $\textit{Hedge}$ and online gradient descent (OGD) and show via a remarkably simple proof that its expected $\textit{min-max}$ regret is $O(\sqrt{T \log K})$.
♻ ☆ LLMDistill4Ads: Using Cross-Encoders to Distill from LLM Signals for Advertiser Keyphrase Recommendations
E-commerce sellers are advised to bid on keyphrases to boost their advertising campaigns. These keyphrases must be relevant to prevent irrelevant items from cluttering search systems and to maintain positive seller perception. It is vital that keyphrase suggestions align with seller, search and buyer judgments. Given the challenges in collecting negative feedback in these systems, LLMs have been used as a scalable proxy to human judgments. This paper presents an empirical study on a major ecommerce platform of a distillation framework involving an LLM teacher, a cross-encoder assistant and a bi-encoder Embedding Based Retrieval (EBR) student model, aimed at mitigating click-induced biases in keyphrase recommendations.
♻ ☆ TRADES: Generating Realistic Market Simulations with Diffusion Models
Financial markets are complex systems characterized by high statistical noise, nonlinearity, volatility, and constant evolution. Thus, modeling them is extremely hard. Here, we address the task of generating realistic and responsive Limit Order Book (LOB) market simulations, which are fundamental for calibrating and testing trading strategies, performing market impact experiments, and generating synthetic market data. We propose a novel TRAnsformer-based Denoising Diffusion Probabilistic Engine for LOB Simulations (TRADES). TRADES generates realistic order flows as time series conditioned on the state of the market, leveraging a transformer-based architecture that captures the temporal and spatial characteristics of high-frequency market data. There is a notable absence of quantitative metrics for evaluating generative market simulation models in the literature. To tackle this problem, we adapt the predictive score, a metric measured as an MAE, to market data by training a stock price predictive model on synthetic data and testing it on real data. We compare TRADES with previous works on two stocks, reporting a 3.27 and 3.48 improvement over SoTA according to the predictive score, demonstrating that we generate useful synthetic market data for financial downstream tasks. Furthermore, we assess TRADES's market simulation realism and responsiveness, showing that it effectively learns the conditional data distribution and successfully reacts to an experimental agent, giving sprout to possible calibrations and evaluations of trading strategies and market impact experiments. To perform the experiments, we developed DeepMarket, the first open-source Python framework for LOB market simulation with deep learning. In our repository, we include a synthetic LOB dataset composed of TRADES's generated simulations.
comment: 8 pages
♻ ☆ xLSTM-Mixer: Multivariate Time Series Forecasting by Mixing via Scalar Memories NeurIPS 2025
Time series data is prevalent across numerous fields, necessitating the development of robust and accurate forecasting models. Capturing patterns both within and between temporal and multivariate components is crucial for reliable predictions. We introduce xLSTM-Mixer, a model designed to effectively integrate temporal sequences, joint time-variate information, and multiple perspectives for robust forecasting. Our approach begins with a linear forecast shared across variates, which is then refined by xLSTM blocks. They serve as key elements for modeling the complex dynamics of challenging time series data. xLSTM-Mixer ultimately reconciles two distinct views to produce the final forecast. Our extensive evaluations demonstrate its superior long-term forecasting performance compared to recent state-of-the-art methods while requiring very little memory. A thorough model analysis provides further insights into its key components and confirms its robustness and effectiveness. This work contributes to the resurgence of recurrent models in forecasting by combining them, for the first time, with mixing architectures.
comment: Poster at NeurIPS 2025
♻ ☆ Do-PFN: In-Context Learning for Causal Effect Estimation
Estimation of causal effects is critical to a range of scientific disciplines. Existing methods for this task either require interventional data, knowledge about the ground truth causal graph, or rely on assumptions such as unconfoundedness, restricting their applicability in real-world settings. In the domain of tabular machine learning, Prior-data fitted networks (PFNs) have achieved state-of-the-art predictive performance, having been pre-trained on synthetic data to solve tabular prediction problems via in-context learning. To assess whether this can be transferred to the harder problem of causal effect estimation, we pre-train PFNs on synthetic data drawn from a wide variety of causal structures, including interventions, to predict interventional outcomes given observational data. Through extensive experiments on synthetic case studies, we show that our approach allows for the accurate estimation of causal effects without knowledge of the underlying causal graph. We also perform ablation studies that elucidate Do-PFN's scalability and robustness across datasets with a variety of causal characteristics.
comment: Neurips 2025
♻ ☆ From Static to Adaptive Defense: Federated Multi-Agent Deep Reinforcement Learning-Driven Moving Target Defense Against DoS Attacks in UAV Swarm Networks
The proliferation of UAVs has enabled a wide range of mission-critical applications and is becoming a cornerstone of low-altitude networks, supporting smart cities, emergency response, and more. However, the open wireless environment, dynamic topology, and resource constraints of UAVs expose low-altitude networks to severe DoS threats. Traditional defense approaches, which rely on fixed configurations or centralized decision-making, cannot effectively respond to the rapidly changing conditions in UAV swarm environments. To address these challenges, we propose a novel federated multi-agent deep reinforcement learning (FMADRL)-driven moving target defense (MTD) framework for proactive DoS mitigation in low-altitude networks. Specifically, we design lightweight and coordinated MTD mechanisms, including leader switching, route mutation, and frequency hopping, to disrupt attacker efforts and enhance network resilience. The defense problem is formulated as a multi-agent partially observable Markov decision process, capturing the uncertain nature of UAV swarms under attack. Each UAV is equipped with a policy agent that autonomously selects MTD actions based on partial observations and local experiences. By employing a policy gradient-based algorithm, UAVs collaboratively optimize their policies via reward-weighted aggregation. Extensive simulations demonstrate that our approach significantly outperforms state-of-the-art baselines, achieving up to a 34.6% improvement in attack mitigation rate, a reduction in average recovery time of up to 94.6%, and decreases in energy consumption and defense cost by as much as 29.3% and 98.3%, respectively, under various DoS attack strategies. These results highlight the potential of intelligent, distributed defense mechanisms to protect low-altitude networks, paving the way for reliable and scalable low-altitude economy.
comment: 15pages; Accepted by IEEE TCCN
♻ ☆ Interpreting Emergent Features in Deep Learning-based Side-channel Analysis
Side-channel analysis (SCA) poses a real-world threat by exploiting unintentional physical signals to extract secret information from secure devices. Evaluation labs also use the same techniques to certify device security. In recent years, deep learning has emerged as a prominent method for SCA, achieving state-of-the-art attack performance at the cost of interpretability. Understanding how neural networks extract secrets is crucial for security evaluators aiming to defend against such attacks, as only by understanding the attack can one propose better countermeasures. In this work, we apply mechanistic interpretability to neural networks trained for SCA, revealing \textit{how} models exploit \textit{what} leakage in side-channel traces. We focus on sudden jumps in performance to reverse engineer learned representations, ultimately recovering secret masks and moving the evaluation process from black-box to white-box. Our results show that mechanistic interpretability can scale to realistic SCA settings, even when relevant inputs are sparse, model accuracies are low, and side-channel protections prevent standard input interventions.
comment: 17 pages, 13 figures, 1 table
♻ ☆ Linear time small coresets for k-mean clustering of segments with applications
We study the $k$-means problem for a set $\mathcal{S} \subseteq \mathbb{R}^d$ of $n$ segments, aiming to find $k$ centers $X \subseteq \mathbb{R}^d$ that minimize $D(\mathcal{S},X) := \sum_{S \in \mathcal{S}} \min_{x \in X} D(S,x)$, where $D(S,x) := \int_{p \in S} |p - x| dp$ measures the total distance from each point along a segment to a center. Variants of this problem include handling outliers, employing alternative distance functions such as M-estimators, weighting distances to achieve balanced clustering, or enforcing unique cluster assignments. For any $\varepsilon > 0$, an $\varepsilon$-coreset is a weighted subset $C \subseteq \mathbb{R}^d$ that approximates $D(\mathcal{S},X)$ within a factor of $1 \pm \varepsilon$ for any set of $k$ centers, enabling efficient streaming, distributed, or parallel computation. We propose the first coreset construction that provably handles arbitrary input segments. For constant $k$ and $\varepsilon$, it produces a coreset of size $O(\log^2 n)$ computable in $O(nd)$ time. Experiments, including a real-time video tracking application, demonstrate substantial speedups with minimal loss in clustering accuracy, confirming both the practical efficiency and theoretical guarantees of our method.
comment: First published in WALCOM 2026 by Springer Nature
♻ ☆ A Hybrid Deep Learning based Carbon Price Forecasting Framework with Structural Breakpoints Detection and Signal Denoising
Accurately forecasting carbon prices is essential for informed energy market decision-making, guiding sustainable energy planning, and supporting effective decarbonization strategies. However, it remains challenging due to structural breaks and high-frequency noise caused by frequent policy interventions and market shocks. Existing studies, including the most recent baseline approaches, have attempted to incorporate breakpoints but often treat denoising and modeling as separate processes and lack systematic evaluation across advanced deep learning architectures, limiting the robustness and the generalization capability. To address these gaps, this paper proposes a comprehensive hybrid framework that integrates structural break detection (Bai-Perron, ICSS, and PELT algorithms), wavelet signal denoising, and three state-of-the-art deep learning models (LSTM, GRU, and TCN). Using European Union Allowance (EUA) spot prices from 2007 to 2024 and exogenous features such as energy prices and policy indicators, the framework constructs univariate and multivariate datasets for comparative evaluation. Experimental results demonstrate that our proposed PELT-WT-TCN achieves the highest prediction accuracy, reducing forecasting errors by 22.35% in RMSE and 18.63% in MAE compared to the state-of-the-art baseline model (Breakpoints with Wavelet and LSTM), and by 70.55% in RMSE and 74.42% in MAE compared to the original LSTM without decomposition from the same baseline study. These findings underscore the value of integrating structural awareness and multiscale decomposition into deep learning architectures to enhance accuracy and interpretability in carbon price forecasting and other nonstationary financial time series.
♻ ☆ Learning the Inverse Ryu--Takayanagi Formula with Transformers
We study the inverse problem of holographic entanglement entropy in AdS$_3$ using a data-driven generative model. Training data consist of randomly generated geometries and their holographic entanglement entropies using the Ryu--Takayanagi formula. After training, the Transformer reconstructs the blackening function within our metric ansatz from previously unseen inputs. The Transformer achieves accurate reconstructions on smooth black hole geometries and extrapolates to horizonless backgrounds. We describe the architecture and data generation process, and we quantify accuracy on both $f(z)$ and the reconstructed $S(\ell)$. Code and evaluation scripts are available at the provided repository.
comment: 15 pages, 6 figures, miner changes
♻ ☆ Decentralized Bilevel Optimization: A Perspective from Transient Iteration Complexity
Stochastic bilevel optimization (SBO) is becoming increasingly essential in machine learning due to its versatility in handling nested structures. To address large-scale SBO, decentralized approaches have emerged as effective paradigms in which nodes communicate with immediate neighbors without a central server, thereby improving communication efficiency and enhancing algorithmic robustness. However, most decentralized SBO algorithms focus solely on asymptotic convergence rates, overlooking transient iteration complexity-the number of iterations required before asymptotic rates dominate, which results in limited understanding of the influence of network topology, data heterogeneity, and the nested bilevel algorithmic structures. To address this issue, this paper introduces D-SOBA, a Decentralized Stochastic One-loop Bilevel Algorithm framework. D-SOBA comprises two variants: D-SOBA-SO, which incorporates second-order Hessian and Jacobian matrices, and D-SOBA-FO, which relies entirely on first-order gradients. We provide a comprehensive non-asymptotic convergence analysis and establish the transient iteration complexity of D-SOBA. This provides the first theoretical understanding of how network topology, data heterogeneity, and nested bilevel structures influence decentralized SBO. Extensive experimental results demonstrate the efficiency and theoretical advantages of D-SOBA.
comment: 64 pages
♻ ☆ DEVAL: A Framework for Evaluating and Improving the Derivation Capability of Large Language Models
Assessing the reasoning ability of Large Language Models (LLMs) over data remains an open and pressing research question. Compared with LLMs, human reasoning can derive corresponding modifications to the output based on certain kinds of changes to the input. This reasoning pattern, which relies on abstract rules that govern relationships between changes of data, has not been comprehensively described or evaluated in LLMs. In this paper, we formally define this reasoning pattern as the Derivation Relation (DR) and introduce the concept of Derivation Capability (DC), i.e. applying DR by making the corresponding modification to the output whenever the input takes certain changes. To assess DC, a systematically constructed evaluation framework named DEVAL is proposed and used to evaluate five popular LLMs and one Large Reasoning Model in seven mainstream tasks. The evaluation results show that mainstream LLMs, such as GPT-4o and Claude3.5, exhibit moderate DR recognition capabilities but reveal significant drop-offs on applying DR effectively in problem-solving scenarios. To improve this, we propose a novel prompt engineering approach called Derivation Prompting (DP). It achieves an average improvement of 15.2% in DC for all tested LLMs, outperforming commonly used prompt engineering techniques.
♻ ☆ CaberNet: Causal Representation Learning for Cross-Domain HVAC Energy Prediction
Cross-domain HVAC energy prediction is essential for scalable building energy management, particularly because collecting extensive labeled data for every new building is both costly and impractical. Yet, this task remains highly challenging due to the scarcity and heterogeneity of data across different buildings, climate zones, and seasonal patterns. In particular, buildings situated in distinct climatic regions introduce variability that often leads existing methods to overfit to spurious correlations, rely heavily on expert intervention, or compromise on data diversity. To address these limitations, we propose CaberNet, a causal and interpretable deep sequence model that learns invariant (Markov blanket) representations for robust cross-domain prediction. In a purely data-driven fashion and without requiring any prior knowledge, CaberNet integrates i) a global feature gate trained with a self-supervised Bernoulli regularization to distinguish superior causal features from inferior ones, and ii) a domain-wise training scheme that balances domain contributions, minimizes cross-domain loss variance, and promotes latent factor independence. We evaluate CaberNet on real-world datasets collected from three buildings located in three climatically diverse cities, and it consistently outperforms all baselines, achieving a 22.9% reduction in normalized mean squared error (NMSE) compared to the best benchmark. Our code is available at https://github.com/SusCom-Lab/CaberNet-CRL.
comment: Accepted at ACM e-Energy 2026
♻ ☆ Rep-GLS: Report-Guided Generalized Label Smoothing for Robust Disease Detection
Unlike nature image classification where groundtruth label is explicit and of no doubt, physicians commonly interpret medical image conditioned on certainty like using phrase "probable" or "likely". Existing medical image datasets either simply overlooked the nuance and polarise into binary label. Here, we propose a novel framework that leverages a Large Language Model (LLM) to directly mine medical reports to utilise the uncertainty relevant expression for supervision signal. At first, we collect uncertainty keywords from medical reports. Then, we use Qwen-3 4B to identify the textual uncertainty and map them into an adaptive Generalized Label Smoothing (GLS) rate. This rate allows our model to treat uncertain labels not as errors, but as informative signals, effectively incorporating expert skepticism into the training process. We establish a new clinical expert uncertainty-aware benchmark to rigorously evaluate this problem. Experiments demonstrate that our approach significantly outperforms state-of-the-art methods in medical disease detection. The curated uncertainty words database, code, and benchmark will be made publicly available upon acceptance.
♻ ☆ HiViS: Hiding Visual Tokens from the Drafter for Speculative Decoding in Vision-Language Models
Speculative decoding has proven effective for accelerating inference in Large Language Models (LLMs), yet its extension to Vision-Language Models (VLMs) remains limited by the computational burden and semantic inconsistency introduced by visual tokens. Recent studies reveal that visual tokens in large VLMs are highly redundant, and most of them can be removed without compromising generation quality. Motivated by this observation, we propose HiViS (Hiding Visual Tokens from the Drafter for Speculative Decoding in Vision-Language Models), a framework that utilizes the target VLM as a semantic fusion model, allowing the drafter to obtain visual information without explicitly processing visual tokens, ensuring that the drafter's prefill sequence length matches that of the textual tokens. Furthermore, HiViS employs a time-step-aware aligned training scheme that allows the drafter to autonomously propagate and refine instructive visual-textual semantics during independent drafting, guided by step-dependent bias-correction residuals. Extensive experiments across representative VLMs and benchmarks demonstrate that HiViS achieves significant improvements in average acceptance length and speedup ratio.
♻ ☆ PepThink-R1: LLM for Interpretable Cyclic Peptide Optimization with CoT SFT and Reinforcement Learning
Designing therapeutic peptides with tailored properties is hindered by the vastness of sequence space, limited experimental data, and poor interpretability of current generative models. To address these challenges, we introduce PepThink-R1, a generative framework that integrates large language models (LLMs) with chain-of-thought (CoT) supervised fine-tuning and reinforcement learning (RL). Unlike prior approaches, PepThink-R1 explicitly reasons about monomer-level modifications during sequence generation, enabling interpretable design choices while optimizing for multiple pharmacological properties. Guided by a tailored reward function balancing chemical validity and property improvements, the model autonomously explores diverse sequence variants. We demonstrate that PepThink-R1 generates cyclic peptides with significantly enhanced lipophilicity, stability, and exposure, outperforming existing general LLMs (e.g., GPT-5) and domain-specific baseline in both optimization success and interpretability. To our knowledge, this is the first LLM-based peptide design framework that combines explicit reasoning with RL-driven property control, marking a step toward reliable and transparent peptide optimization for therapeutic discovery.
♻ ☆ GAPO: Robust Advantage Estimation for Real-World Code LLMs
Reinforcement learning (RL) is widely used for post-training large language models (LLMs) in code editing, where group-relative methods like GRPO are popular for their critic-free, normalized advantage estimation. However, in real-world code-editing scenarios, reward distributions are often skewed with unpredictable outliers, leading to distorted advantage computation and increased noise. To address this issue, we propose Group Adaptive Policy Optimization (GAPO), which adaptively finds an outlier-free highest-density interval (HDI) per prompt and then uses the median of that interval as an adaptive Q to replace the group mean in advantage calculation. This adaptive Q robustly handles skewed distributions while remaining plug-and-play and efficient. We validate GAPO on nine instruction-tuned LLMs (3B-14B) using a large internal dataset of 51,844 real-world, history-aware code-editing tasks across 10 languages, demonstrating consistent improvements in exact match accuracy over GRPO and its variant DAPO. Code is publicly available.
♻ ☆ Sparse-PGD: A Unified Framework for Sparse Adversarial Perturbations Generation
This work studies sparse adversarial perturbations, including both unstructured and structured ones. We propose a framework based on a white-box PGD-like attack method named Sparse-PGD to effectively and efficiently generate such perturbations. Furthermore, we combine Sparse-PGD with a black-box attack to comprehensively and more reliably evaluate the models' robustness against unstructured and structured sparse adversarial perturbations. Moreover, the efficiency of Sparse-PGD enables us to conduct adversarial training to build robust models against various sparse perturbations. Extensive experiments demonstrate that our proposed attack algorithm exhibits strong performance in different scenarios. More importantly, compared with other robust models, our adversarially trained model demonstrates state-of-the-art robustness against various sparse attacks. Codes are available at https://github.com/CityU-MLO/sPGD.
comment: Accepted by TPAMI
♻ ☆ AdamNX: An Adam improvement algorithm based on a novel exponential decay mechanism for the second-order moment estimate
Since the 21st century, artificial intelligence has been leading a new round of industrial revolution. Under the training framework, the optimization algorithm aims to stably converge high-dimensional optimization to local and even global minima. Entering the era of large language models, although the scale of model parameters and data has increased, Adam remains the mainstream optimization algorithm. However, compared with stochastic gradient descent (SGD) based optimization algorithms, Adam is more likely to converge to non-flat minima. To address this issue, the AdamNX algorithm is proposed. Its core innovation lies in the proposition of a novel type of second-order moment estimation exponential decay rate, which gradually weakens the learning step correction strength as training progresses, and degrades to momentum SGD in the stable training period, thereby improving the stability of training in the stable period and possibly enhancing generalization ability. Experimental results show that our second-order moment estimation exponential decay rate is better than the current second-order moment estimation exponential decay rate, and AdamNX can stably outperform Adam and its variants in terms of performance. Our code is open-sourced at https://github.com/mengzhu0308/AdamNX.
comment: 25 pages, 6 figures, 12 tables. v3: The algorithm formerly known as "AdamX" has been renamed to "AdamNX" to avoid confusion with prior work [DOI 10.27272/d.cnki.gshdu.2022.006950]. No changes to methodology
♻ ☆ Formal Models and Convergence Analysis for Context-Aware Security Verification
Traditional security scanners fail when facing new attack patterns they haven't seen before. They rely on fixed rules and predetermined signatures, making them blind to novel threats. We present a fundamentally different approach: instead of memorizing specific attack patterns, we learn what makes systems genuinely secure. Our key insight is simple yet powerful: context determines vulnerability. A SQL query that's safe in one environment becomes dangerous in another. By modeling this context-vulnerability relationship, we achieve something remarkable: our system detects attacks it has never seen before. We introduce context-aware verification that learns from genuine system behavior. Through reconstruction learning on secure systems, we capture their essential characteristics. When an unknown attack deviates from these patterns, our system recognizes it, even without prior knowledge of that specific attack type. We prove this capability theoretically, showing detection rates improve exponentially with context information I(W;C). Our framework combines three components: (1) reconstruction learning that models secure behavior, (2) multi-scale graph reasoning that aggregates contextual clues, and (3) attention mechanisms guided by reconstruction differences. Extensive experiments validate our approach: detection accuracy jumps from 58 percent to 82 percent with full context, unknown attack detection improves by 31 percent, and our system maintains above 90 percent accuracy even against completely novel attack vectors.
♻ ☆ Property-guided Inverse Design of Metal-Organic Frameworks Using Quantum Natural Language Processing
In this study, we explore the potential of using quantum natural language processing (QNLP) to inverse design metal-organic frameworks (MOFs) with targeted properties. Specifically, by analyzing 450 hypothetical MOF structures consisting of 3 topologies, 10 metal nodes and 15 organic ligands, we categorize these structures into four distinct classes for pore volume and $CO_{2}$ Henry's constant values. We then compare various QNLP models (i.e. the bag-of-words, DisCoCat (Distributional Compositional Categorical), and sequence-based models) to identify the most effective approach to process the MOF dataset. Using a classical simulator provided by the IBM Qiskit, the bag-of-words model is identified to be the optimum model, achieving validation accuracies of 88.6% and 78.0% for binary classification tasks on pore volume and $CO_{2}$ Henry's constant, respectively. Further, we developed multi-class classification models tailored to the probabilistic nature of quantum circuits, with average test accuracies of 92% and 80% across different classes for pore volume and $CO_{2}$ Henry's constant datasets. Finally, the performance of generating MOF with target properties showed accuracies of 93.5% for pore volume and 87% for $CO_{2}$ Henry's constant, respectively. Although our investigation covers only a fraction of the vast MOF search space, it marks a promising first step towards using quantum computing for materials design, offering a new perspective through which to explore the complex landscape of MOFs.
comment: 46 pages, 7 figures, 6 supplementary figures, 1 table, 2 supplementary tables, 1 supplementary note
♻ ☆ Revisiting Model Inversion Evaluation: From Misleading Standards to Reliable Privacy Assessment
Model Inversion (MI) attacks aim to reconstruct information from private training data by exploiting access to machine learning models T. To evaluate such attacks, the standard evaluation framework relies on an evaluation model E, trained under the same task design as T. This framework has become the de facto standard for assessing progress in MI research, used across nearly all recent MI studies without question. In this paper, we present the first in-depth study of this evaluation framework. In particular, we identify a critical issue of this standard framework: Type-I adversarial examples. These are reconstructions that do not capture the visual features of private training data, yet are still deemed successful by T and ultimately transferable to E. Such false positives undermine the reliability of the standard MI evaluation framework. To address this issue, we introduce a new MI evaluation framework that replaces the evaluation model E with advanced Multimodal Large Language Models (MLLMs). By leveraging their general-purpose visual understanding, our MLLM-based framework does not depend on training of shared task design as in T, thus reducing Type-I transferability and providing more faithful assessments of reconstruction success. Using our MLLM-based evaluation framework, we reevaluate 27 diverse MI attack setups and empirically reveal consistently high false positive rates under the standard evaluation framework. Importantly, we demonstrate that many state-of-the-art (SOTA) MI methods report inflated attack accuracy, indicating that actual privacy leakage is significantly lower than previously believed. By uncovering this critical issue and proposing a robust solution, our work enables a reassessment of progress in MI research and sets a new standard for reliable and robust evaluation. Code can be found in https://github.com/hosytuyen/MI-Eval-MLLM
comment: To support future work, we release our MLLM-based MI evaluation framework and benchmarking suite at https://github.com/hosytuyen/MI-Eval-MLLM
♻ ☆ Statistically Assuring Safety of Control Systems using Ensembles of Safety Filters and Conformal Prediction
Safety assurance is a fundamental requirement for deploying learning-enabled autonomous systems. Hamilton-Jacobi (HJ) reachability analysis is a fundamental method for formally verifying safety and generating safe controllers. However, computing the HJ value function that characterizes the backward reachable set (BRS) of a set of user-defined failure states is computationally expensive, especially for high-dimensional systems, motivating the use of reinforcement learning approaches to approximate the value function. Unfortunately, a learned value function and its corresponding safe policy are not guaranteed to be correct. The learned value function evaluated at a given state may not be equal to the actual safety return achieved by following the learned safe policy. To address this challenge, we introduce a conformal prediction-based (CP) framework that bounds such uncertainty. We leverage CP to provide probabilistic safety guarantees when using learned HJ value functions and policies to prevent control systems from reaching failure states. Specifically, we use CP to calibrate the switching between the unsafe nominal controller and the learned HJ-based safe policy and to derive safety guarantees under this switched policy. We also investigate using an ensemble of independently trained HJ value functions as a safety filter and compare this ensemble approach to using individual value functions alone.
♻ ☆ Spatial-and-Frequency-aware Restoration method for Images based on Diffusion Models
Diffusion models have recently emerged as a promising framework for Image Restoration (IR), owing to their ability to produce high-quality reconstructions and their compatibility with established methods. Existing methods for solving noisy inverse problems in IR, considers the pixel-wise data-fidelity. In this paper, we propose SaFaRI, a spatial-and-frequency-aware diffusion model for IR with Gaussian noise. Our model encourages images to preserve data-fidelity in both the spatial and frequency domains, resulting in enhanced reconstruction quality. We comprehensively evaluate the performance of our model on a variety of noisy inverse problems, including inpainting, denoising, and super-resolution. Our thorough evaluation demonstrates that SaFaRI achieves state-of-the-art performance on both the ImageNet datasets and FFHQ datasets, outperforming existing zero-shot IR methods in terms of LPIPS and FID metrics.
Computer Vision and Pattern Recognition
Dataset Distillation for Pre-Trained Self-Supervised Vision Models NeurIPS 2025
The task of dataset distillation aims to find a small set of synthetic images such that training a model on them reproduces the performance of the same model trained on a much larger dataset of real samples. Existing distillation methods focus on synthesizing datasets that enable training randomly initialized models. In contrast, state-of-the-art vision approaches are increasingly building on large, pre-trained self-supervised models rather than training from scratch. In this paper, we investigate the problem of distilling datasets that enable us to optimally train linear probes on top of such large, pre-trained vision models. We introduce a method of dataset distillation for this task called Linear Gradient Matching that optimizes the synthetic images such that, when passed through a pre-trained feature extractor, they induce gradients in the linear classifier similar to those produced by the real data. Our method yields synthetic data that outperform all real-image baselines and, remarkably, generalize across pre-trained vision models, enabling us, for instance, to train a linear CLIP probe that performs competitively using a dataset distilled via a DINO backbone. Further, we show that our distilled datasets are exceptionally effective for fine-grained classification and provide a valuable tool for model interpretability, predicting, among other things, how similar two models' embedding spaces are under the platonic representation hypothesis or whether a model is sensitive to spurious correlations in adversarial datasets.
comment: Accepted at NeurIPS 2025. Project page: https://linear-gradient-matching.github.io/ Code: https://github.com/GeorgeCazenavette/linear-gradient-matching
☆ EvoLMM: Self-Evolving Large Multimodal Models with Continuous Rewards
Recent advances in large multimodal models (LMMs) have enabled impressive reasoning and perception abilities, yet most existing training pipelines still depend on human-curated data or externally verified reward models, limiting their autonomy and scalability. In this work, we strive to improve LMM reasoning capabilities in a purely unsupervised fashion (without any annotated data or reward distillation). To this end, we propose a self-evolving framework, named EvoLMM, that instantiates two cooperative agents from a single backbone model: a Proposer, which generates diverse, image-grounded questions, and a Solver, which solves them through internal consistency, where learning proceeds through a continuous self-rewarding process. This dynamic feedback encourages both the generation of informative queries and the refinement of structured reasoning without relying on ground-truth or human judgments. When using the popular Qwen2.5-VL as the base model, our EvoLMM yields consistent gains upto $\sim$3\% on multimodal math-reasoning benchmarks, including ChartQA, MathVista, and MathVision, using only raw training images. We hope our simple yet effective approach will serve as a solid baseline easing future research in self-improving LMMs in a fully-unsupervised fashion. Our code and models are available at https://github.com/mbzuai-oryx/EvoLMM.
comment: 9 Pages, 6 Figures, 4 Tables
☆ NoPo-Avatar: Generalizable and Animatable Avatars from Sparse Inputs without Human Poses NeurIPS'25
We tackle the task of recovering an animatable 3D human avatar from a single or a sparse set of images. For this task, beyond a set of images, many prior state-of-the-art methods use accurate "ground-truth" camera poses and human poses as input to guide reconstruction at test-time. We show that pose-dependent reconstruction degrades results significantly if pose estimates are noisy. To overcome this, we introduce NoPo-Avatar, which reconstructs avatars solely from images, without any pose input. By removing the dependence of test-time reconstruction on human poses, NoPo-Avatar is not affected by noisy human pose estimates, making it more widely applicable. Experiments on challenging THuman2.0, XHuman, and HuGe100K data show that NoPo-Avatar outperforms existing baselines in practical settings (without ground-truth poses) and delivers comparable results in lab settings (with ground-truth poses).
comment: NeurIPS'25; project page: https://wenj.github.io/NoPo-Avatar/
☆ Thinking-while-Generating: Interleaving Textual Reasoning throughout Visual Generation
Recent advances in visual generation have increasingly explored the integration of reasoning capabilities. They incorporate textual reasoning, i.e., think, either before (as pre-planning) or after (as post-refinement) the generation process, yet they lack on-the-fly multimodal interaction during the generation itself. In this preliminary study, we introduce Thinking-while-Generating (TwiG), the first interleaved framework that enables co-evolving textual reasoning throughout the visual generation process. As visual content is progressively generating, textual reasoning is interleaved to both guide upcoming local regions and reflect on previously synthesized ones. This dynamic interplay produces more context-aware and semantically rich visual outputs. To unveil the potential of this framework, we investigate three candidate strategies, zero-shot prompting, supervised fine-tuning (SFT) on our curated TwiG-50K dataset, and reinforcement learning (RL) via a customized TwiG-GRPO strategy, each offering unique insights into the dynamics of interleaved reasoning. We hope this work inspires further research into interleaving textual reasoning for enhanced visual generation. Code will be released at: https://github.com/ZiyuGuo99/Thinking-while-Generating.
comment: Project Page: https://think-while-gen.github.io Code: https://github.com/ZiyuGuo99/Thinking-while-Generating
☆ Learning to Think Fast and Slow for Visual Language Models
When confronted with complex problems, we tend to think slowly; conversely, for simple questions, we think quickly. Such a two-system thinking mechanism allows us to efficiently allocate cognitive resources, enabling quick decision-making for straightforward issues while reserving deeper analytical thinking for more intricate challenges. However, existing reasoning-oriented visual language models (VLMs), whether trained with explicit chain-of-thought annotations or rule-based RL rewards, mainly pursue lengthy, detailed reasoning chains, which often lead to excessive computational costs. In this work, we propose a simple RL approach, which enables VLMs to automatically switch between fast and slow thinking modes depending on task difficulty. The approach consists of two stages: in the first stage, we label data as either requiring fast thinking or slow thinking based on the model output length, which is inspired by the observation that pre-trained VLMs typically produce answers of varying lengths for different types of questions; in the second stage, we train the model using GRPO along with the thinking mode labels to develop dual-mode thinking. Despite its simplicity, our model, named DualMindVLM, significantly outperforms the base model and achieves performance on par with state-of-the-art visual reasoning models, while maintaining exceptionally high token efficiency.
☆ Video-as-Answer: Predict and Generate Next Video Event with Joint-GRPO
While language models have become impactful in many real-world applications, video generation remains largely confined to entertainment. Motivated by video's inherent capacity to demonstrate physical-world information that is difficult to convey through language alone (e.g., imagine teaching someone to tie a tie using only text), we identify an underutilized opportunity to extend video as a new answer modality for Next-Event Prediction (NEP), formalized as Video-Next-Event Prediction (VNEP). While the established NEP task takes a video with a procedural or predictive question as input to predict the next event in text, VNEP requires dynamic video responses. This shift from telling to showing unlocks more intuitive and customized answers for procedural learning and creative exploration. However, this task remains challenging for existing models, as it demands an understanding of multimodal input, instruction-conditioned reasoning, and the generation of video with visual and semantic consistency. To address this, we introduce VANS, a model that leverages reinforcement learning to align a Vision-Language Model (VLM) with a Video Diffusion Model (VDM) for VNEP. The core of VANS is our proposed Joint-GRPO that orchestrates the VLM and VDM to function as a unit. Driven by a shared reward on their respective output, it optimizes the VLM to produce captions that are both accurate and friendly to visualize, while guiding the VDM to generate videos that are faithful to these captions and the input visual context. To enable this learning, we craft VANS-Data-100K, a dedicated dataset for the VNEP task. Experiments on procedural and predictive benchmarks demonstrate that VANS achieves state-of-the-art performance in both video event prediction and visualization. Codes are released in https://github.com/KlingTeam/VANS.
comment: Project page: https://video-as-answer.github.io/
☆ V-ReasonBench: Toward Unified Reasoning Benchmark Suite for Video Generation Models
Recent progress in generative video models, such as Veo-3, has shown surprising zero-shot reasoning abilities, creating a growing need for systematic and reliable evaluation. We introduce V-ReasonBench, a benchmark designed to assess video reasoning across four key dimensions: structured problem-solving, spatial cognition, pattern-based inference, and physical dynamics. The benchmark is built from both synthetic and real-world image sequences and provides a diverse set of answer-verifiable tasks that are reproducible, scalable, and unambiguous. Evaluations of six state-of-the-art video models reveal clear dimension-wise differences, with strong variation in structured, spatial, pattern-based, and physical reasoning. We further compare video models with strong image models, analyze common hallucination behaviors, and study how video duration affects Chain-of-Frames reasoning. Overall, V-ReasonBench offers a unified and reproducible framework for measuring video reasoning and aims to support the development of models with more reliable, human-aligned reasoning skills.
comment: Project Page: https://oahzxl.github.io/VReasonBench
☆ SceneDesigner: Controllable Multi-Object Image Generation with 9-DoF Pose Manipulation NeurIPS 2025
Controllable image generation has attracted increasing attention in recent years, enabling users to manipulate visual content such as identity and style. However, achieving simultaneous control over the 9D poses (location, size, and orientation) of multiple objects remains an open challenge. Despite recent progress, existing methods often suffer from limited controllability and degraded quality, falling short of comprehensive multi-object 9D pose control. To address these limitations, we propose SceneDesigner, a method for accurate and flexible multi-object 9-DoF pose manipulation. SceneDesigner incorporates a branched network to the pre-trained base model and leverages a new representation, CNOCS map, which encodes 9D pose information from the camera view. This representation exhibits strong geometric interpretation properties, leading to more efficient and stable training. To support training, we construct a new dataset, ObjectPose9D, which aggregates images from diverse sources along with 9D pose annotations. To further address data imbalance issues, particularly performance degradation on low-frequency poses, we introduce a two-stage training strategy with reinforcement learning, where the second stage fine-tunes the model using a reward-based objective on rebalanced data. At inference time, we propose Disentangled Object Sampling, a technique that mitigates insufficient object generation and concept confusion in complex multi-object scenes. Moreover, by integrating user-specific personalization weights, SceneDesigner enables customized pose control for reference subjects. Extensive qualitative and quantitative experiments demonstrate that SceneDesigner significantly outperforms existing approaches in both controllability and quality. Code is publicly available at https://github.com/FudanCVL/SceneDesigner.
comment: NeurIPS 2025 (Spotlight), Project Page: https://henghuiding.com/SceneDesigner/
☆ TriDiff-4D: Fast 4D Generation through Diffusion-based Triplane Re-posing
With the increasing demand for 3D animation, generating high-fidelity, controllable 4D avatars from textual descriptions remains a significant challenge. Despite notable efforts in 4D generative modeling, existing methods exhibit fundamental limitations that impede their broader applicability, including temporal and geometric inconsistencies, perceptual artifacts, motion irregularities, high computational costs, and limited control over dynamics. To address these challenges, we propose TriDiff-4D, a novel 4D generative pipeline that employs diffusion-based triplane re-posing to produce high-quality, temporally coherent 4D avatars. Our model adopts an auto-regressive strategy to generate 4D sequences of arbitrary length, synthesizing each 3D frame with a single diffusion process. By explicitly learning 3D structure and motion priors from large-scale 3D and motion datasets, TriDiff-4D enables skeleton-driven 4D generation that excels in temporal consistency, motion accuracy, computational efficiency, and visual fidelity. Specifically, TriDiff-4D first generates a canonical 3D avatar and a corresponding motion sequence from a text prompt, then uses a second diffusion model to animate the avatar according to the motion sequence, supporting arbitrarily long 4D generation. Experimental results demonstrate that TriDiff-4D significantly outperforms existing methods, reducing generation time from hours to seconds by eliminating the optimization process, while substantially improving the generation of complex motions with high-fidelity appearance and accurate 3D geometry.
comment: 8 pages, 10 figures, Under review at a conference
☆ PartUV: Part-Based UV Unwrapping of 3D Meshes
UV unwrapping flattens 3D surfaces to 2D with minimal distortion, often requiring the complex surface to be decomposed into multiple charts. Although extensively studied, existing UV unwrapping methods frequently struggle with AI-generated meshes, which are typically noisy, bumpy, and poorly conditioned. These methods often produce highly fragmented charts and suboptimal boundaries, introducing artifacts and hindering downstream tasks. We introduce PartUV, a part-based UV unwrapping pipeline that generates significantly fewer, part-aligned charts while maintaining low distortion. Built on top of a recent learning-based part decomposition method PartField, PartUV combines high-level semantic part decomposition with novel geometric heuristics in a top-down recursive framework. It ensures each chart's distortion remains below a user-specified threshold while minimizing the total number of charts. The pipeline integrates and extends parameterization and packing algorithms, incorporates dedicated handling of non-manifold and degenerate meshes, and is extensively parallelized for efficiency. Evaluated across four diverse datasets, including man-made, CAD, AI-generated, and Common Shapes, PartUV outperforms existing tools and recent neural methods in chart count and seam length, achieves comparable distortion, exhibits high success rates on challenging meshes, and enables new applications like part-specific multi-tiles packing. Our project page is at https://www.zhaoningwang.com/PartUV.
comment: project page: https://www.zhaoningwang.com/PartUV
☆ Solving Spatial Supersensing Without Spatial Supersensing
Cambrian-S aims to take the first steps towards improving video world models with spatial supersensing by introducing (i) two benchmarks, VSI-Super-Recall (VSR) and VSI-Super-Counting (VSC), and (ii) bespoke predictive sensing inference strategies tailored to each benchmark. In this work, we conduct a critical analysis of Cambrian-S across both these fronts. First, we introduce a simple baseline, NoSense, which discards almost all temporal structure and uses only a bag-of-words SigLIP model, yet near-perfectly solves VSR, achieving 95% accuracy even on 4-hour videos. This shows benchmarks like VSR can be nearly solved without spatial cognition, world modeling or spatial supersensing. Second, we hypothesize that the tailored inference methods proposed by Cambrian-S likely exploit shortcut heuristics in the benchmark. We illustrate this with a simple sanity check on the VSC benchmark, called VSC-Repeat: We concatenate each video with itself 1-5 times, which does not change the number of unique objects. However, this simple perturbation entirely collapses the mean relative accuracy of Cambrian-S from 42% to 0%. A system that performs spatial supersensing and integrates information across experiences should recognize views of the same scene and keep object-count predictions unchanged; instead, Cambrian-S inference algorithm relies largely on a shortcut in the VSC benchmark that rooms are never revisited. Taken together, our findings suggest that (i) current VSI-Super benchmarks do not yet reliably measure spatial supersensing, and (ii) predictive-sensing inference recipes used by Cambrian-S improve performance by inadvertently exploiting shortcuts rather than from robust spatial supersensing. We include the response from the Cambrian-S authors (in Appendix A) to provide a balanced perspective alongside our claims. We release our code at: https://github.com/bethgelab/supersanity
comment: Tech Report
☆ Teacher-Guided One-Shot Pruning via Context-Aware Knowledge Distillation
Unstructured pruning remains a powerful strategy for compressing deep neural networks, yet it often demands iterative train-prune-retrain cycles, resulting in significant computational overhead. To address this challenge, we introduce a novel teacher-guided pruning framework that tightly integrates Knowledge Distillation (KD) with importance score estimation. Unlike prior approaches that apply KD as a post-pruning recovery step, our method leverages gradient signals informed by the teacher during importance score calculation to identify and retain parameters most critical for both task performance and knowledge transfer. Our method facilitates a one-shot global pruning strategy that efficiently eliminates redundant weights while preserving essential representations. After pruning, we employ sparsity-aware retraining with and without KD to recover accuracy without reactivating pruned connections. Comprehensive experiments across multiple image classification benchmarks, including CIFAR-10, CIFAR-100, and TinyImageNet, demonstrate that our method consistently achieves high sparsity levels with minimal performance degradation. Notably, our approach outperforms state-of-the-art baselines such as EPG and EPSD at high sparsity levels, while offering a more computationally efficient alternative to iterative pruning schemes like COLT. The proposed framework offers a computation-efficient, performance-preserving solution well suited for deployment in resource-constrained environments.
comment: Accepted at 2025 IEEE International Conference on Big Data (IEEE BigData 2025)
☆ Late-decoupled 3D Hierarchical Semantic Segmentation with Semantic Prototype Discrimination based Bi-branch Supervision
3D hierarchical semantic segmentation (3DHS) is crucial for embodied intelligence applications that demand a multi-grained and multi-hierarchy understanding of 3D scenes. Despite the progress, previous 3DHS methods have overlooked following two challenges: I) multi-label learning with a parameter-sharing model can lead to multi-hierarchy conflicts in cross-hierarchy optimization, and II) the class imbalance issue is inevitable across multiple hierarchies of 3D scenes, which makes the model performance become dominated by major classes. To address these issues, we propose a novel framework with a primary 3DHS branch and an auxiliary discrimination branch. Specifically, to alleviate the multi-hierarchy conflicts, we propose a late-decoupled 3DHS framework which employs multiple decoders with the coarse-to-fine hierarchical guidance and consistency. The late-decoupled architecture can mitigate the underfitting and overfitting conflicts among multiple hierarchies and can also constrain the class imbalance problem in each individual hierarchy. Moreover, we introduce a 3DHS-oriented semantic prototype based bi-branch supervision mechanism, which additionally learns class-wise discriminative point cloud features and performs mutual supervision between the auxiliary and 3DHS branches, to enhance the class-imbalance segmentation. Extensive experiments on multiple datasets and backbones demonstrate that our approach achieves state-of-the-art 3DHS performance, and its core components can also be used as a plug-and-play enhancement to improve previous methods.
☆ TRIM: Scalable 3D Gaussian Diffusion Inference with Temporal and Spatial Trimming NeurIPS 2025
Recent advances in 3D Gaussian diffusion models suffer from time-intensive denoising and post-denoising processing due to the massive number of Gaussian primitives, resulting in slow generation and limited scalability along sampling trajectories. To improve the efficiency of 3D diffusion models, we propose $\textbf{TRIM}$ ($\textbf{T}$rajectory $\textbf{R}$eduction and $\textbf{I}$nstance $\textbf{M}$ask denoising), a post-training approach that incorporates both temporal and spatial trimming strategies, to accelerate inference without compromising output quality while supporting the inference-time scaling for Gaussian diffusion models. Instead of scaling denoising trajectories in a costly end-to-end manner, we develop a lightweight selector model to evaluate latent Gaussian primitives derived from multiple sampled noises, enabling early trajectory reduction by selecting candidates with high-quality potential. Furthermore, we introduce instance mask denoising to prune learnable Gaussian primitives by filtering out redundant background regions, reducing inference computation at each denoising step. Extensive experiments and analysis demonstrate that TRIM significantly improves both the efficiency and quality of 3D generation. Source code is available at $\href{https://github.com/zeyuanyin/TRIM}{link}$.
comment: NeurIPS 2025
☆ SurvAgent: Hierarchical CoT-Enhanced Case Banking and Dichotomy-Based Multi-Agent System for Multimodal Survival Prediction
Survival analysis is critical for cancer prognosis and treatment planning, yet existing methods lack the transparency essential for clinical adoption. While recent pathology agents have demonstrated explainability in diagnostic tasks, they face three limitations for survival prediction: inability to integrate multimodal data, ineffective region-of-interest exploration, and failure to leverage experiential learning from historical cases. We introduce SurvAgent, the first hierarchical chain-of-thought (CoT)-enhanced multi-agent system for multimodal survival prediction. SurvAgent consists of two stages: (1) WSI-Gene CoT-Enhanced Case Bank Construction employs hierarchical analysis through Low-Magnification Screening, Cross-Modal Similarity-Aware Patch Mining, and Confidence-Aware Patch Mining for pathology images, while Gene-Stratified analysis processes six functional gene categories. Both generate structured reports with CoT reasoning, storing complete analytical processes for experiential learning. (2) Dichotomy-Based Multi-Expert Agent Inference retrieves similar cases via RAG and integrates multimodal reports with expert predictions through progressive interval refinement. Extensive experiments on five TCGA cohorts demonstrate SurvAgent's superority over conventional methods, proprietary MLLMs, and medical agents, establishing a new paradigm for explainable AI-driven survival prediction in precision oncology.
comment: 20 pages
☆ SAM 3D: 3Dfy Anything in Images
We present SAM 3D, a generative model for visually grounded 3D object reconstruction, predicting geometry, texture, and layout from a single image. SAM 3D excels in natural images, where occlusion and scene clutter are common and visual recognition cues from context play a larger role. We achieve this with a human- and model-in-the-loop pipeline for annotating object shape, texture, and pose, providing visually grounded 3D reconstruction data at unprecedented scale. We learn from this data in a modern, multi-stage training framework that combines synthetic pretraining with real-world alignment, breaking the 3D "data barrier". We obtain significant gains over recent work, with at least a 5:1 win rate in human preference tests on real-world objects and scenes. We will release our code and model weights, an online demo, and a new challenging benchmark for in-the-wild 3D object reconstruction.
comment: Website: https://ai.meta.com/sam3d/
☆ SAM2S: Segment Anything in Surgical Videos via Semantic Long-term Tracking
Surgical video segmentation is crucial for computer-assisted surgery, enabling precise localization and tracking of instruments and tissues. Interactive Video Object Segmentation (iVOS) models such as Segment Anything Model 2 (SAM2) provide prompt-based flexibility beyond methods with predefined categories, but face challenges in surgical scenarios due to the domain gap and limited long-term tracking. To address these limitations, we construct SA-SV, the largest surgical iVOS benchmark with instance-level spatio-temporal annotations (masklets) spanning eight procedure types (61k frames, 1.6k masklets), enabling comprehensive development and evaluation for long-term tracking and zero-shot generalization. Building on SA-SV, we propose SAM2S, a foundation model enhancing \textbf{SAM2} for \textbf{S}urgical iVOS through: (1) DiveMem, a trainable diverse memory mechanism for robust long-term tracking; (2) temporal semantic learning for instrument understanding; and (3) ambiguity-resilient learning to mitigate annotation inconsistencies across multi-source datasets. Extensive experiments demonstrate that fine-tuning on SA-SV enables substantial performance gains, with SAM2 improving by 12.99 average $\mathcal{J}$\&$\mathcal{F}$ over vanilla SAM2. SAM2S further advances performance to 80.42 average $\mathcal{J}$\&$\mathcal{F}$, surpassing vanilla and fine-tuned SAM2 by 17.10 and 4.11 points respectively, while maintaining 68 FPS real-time inference and strong zero-shot generalization. Code and dataset will be released at https://jinlab-imvr.github.io/SAM2S.
comment: 11 pages, 4 figures
☆ TimeViper: A Hybrid Mamba-Transformer Vision-Language Model for Efficient Long Video Understanding
We introduce TimeViper, a hybrid vision-language model designed to tackle challenges of long video understanding. Processing long videos demands both an efficient model architecture and an effective mechanism for handling extended temporal contexts. To this end, TimeViper adopts a hybrid Mamba-Transformer backbone that combines the efficiency of state-space models with the expressivity of attention mechanisms. Through this hybrid design, we reveal the vision-to-text information aggregation phenomenon, where information progressively flows from vision tokens to text tokens across increasing LLM depth, resulting in severe vision token redundancy. Motivated by this observation, we propose TransV, a token information transfer module that transfers and compresses vision tokens into instruction tokens while maintaining multimodal understanding capabilities. This design enables TimeViper to process hour-long videos exceeding 10,000 frames. Extensive experiments across multiple benchmarks demonstrate that TimeViper competes with state-of-the-art models while extending frame numbers. We further analyze attention behaviors of both Mamba and Transformer layers, offering new insights into hybrid model interpretability. This work represents an initial step towards developing, interpreting, and compressing hybrid Mamba-Transformer architectures.
comment: Project page: https://xuboshen.github.io/TimeViper
☆ Green Resilience of Cyber-Physical Systems: Doctoral Dissertation
Cyber-physical systems (CPS) combine computational and physical components. Online Collaborative AI System (OL-CAIS) is a type of CPS that learn online in collaboration with humans to achieve a common goal, which makes it vulnerable to disruptive events that degrade performance. Decision-makers must therefore restore performance while limiting energy impact, creating a trade-off between resilience and greenness. This research addresses how to balance these two properties in OL-CAIS. It aims to model resilience for automatic state detection, develop agent-based policies that optimize the greenness-resilience trade-off, and understand catastrophic forgetting to maintain performance consistency. We model OL-CAIS behavior through three operational states: steady, disruptive, and final. To support recovery during disruptions, we introduce the GResilience framework, which provides recovery strategies through multi-objective optimization (one-agent), game-theoretic decision-making (two-agent), and reinforcement learning (RL-agent). We also design a measurement framework to quantify resilience and greenness. Empirical evaluation uses real and simulated experiments with a collaborative robot learning object classification from human demonstrations. Results show that the resilience model captures performance transitions during disruptions, and that GResilience policies improve green recovery by shortening recovery time, stabilizing performance, and reducing human dependency. RL-agent policies achieve the strongest results, although with a marginal increase in CO2 emissions. We also observe catastrophic forgetting after repeated disruptions, while our policies help maintain steadiness. A comparison with containerized execution shows that containerization cuts CO2 emissions by half. Overall, this research provides models, metrics, and policies that ensure the green recovery of OL-CAIS.
☆ Erase to Retain: Low Rank Adaptation Guided Selective Unlearning in Medical Segmentation Networks
The ability to selectively remove knowledge from medical segmentation networks is increasingly important for privacy compliance, ethical deployment, and continual dataset revision. We introduce Erase to Retain, a controllable unlearning framework for medical image segmentation that achieves targeted forgetting without full retraining. Our method uses a teacher-student distillation paradigm with Low-Rank Adaptation (LoRA) constrained subspace updates, enabling the student network to erase lesion-specific or class-specific representations in low-rank decoder spaces while preserving global anatomical understanding. During the strong unlearning phase, LoRA modules are adversarially optimized to contradict the teacher's confident predictions on a designated forget subset, enforcing semantic removal. This is followed by a gentle restoration phase that recovers generalization on retained data through head-only supervised refinement. For ISIC segmentation, the student reduces forget-set IoU from 0.875 to 0.509 while maintaining competitive performance on the retain and validation splits (0.647 to 0.677 IoU). On the cross-domain CHASE dataset, Erase to Retain consistently lowers forget-set IoU while preserving utility on retain and validation sets. For ISIC classification, our method decreases accuracy on the forget subset from 87.0 percent to 64.1 percent while improving retain accuracy from 83.9 percent to 90.6 percent. These results demonstrate that LoRA-based subspace unlearning provides a practical pathway toward responsible, controllable, and reversible unlearning in medical image analysis, enabling models to forget sensitive samples or structures while preserving performance where it matters most.
☆ POMA-3D: The Point Map Way to 3D Scene Understanding
In this paper, we introduce POMA-3D, the first self-supervised 3D representation model learned from point maps. Point maps encode explicit 3D coordinates on a structured 2D grid, preserving global 3D geometry while remaining compatible with the input format of 2D foundation models. To transfer rich 2D priors into POMA-3D, a view-to-scene alignment strategy is designed. Moreover, as point maps are view-dependent with respect to a canonical space, we introduce POMA-JEPA, a joint embedding-predictive architecture that enforces geometrically consistent point map features across multiple views. Additionally, we introduce ScenePoint, a point map dataset constructed from 6.5K room-level RGB-D scenes and 1M 2D image scenes to facilitate large-scale POMA-3D pretraining. Experiments show that POMA-3D serves as a strong backbone for both specialist and generalist 3D understanding. It benefits diverse tasks, including 3D question answering, embodied navigation, scene retrieval, and embodied localization, all achieved using only geometric inputs (i.e., 3D coordinates). Overall, our POMA-3D explores a point map way to 3D scene understanding, addressing the scarcity of pretrained priors and limited data in 3D representation learning. Project Page: https://matchlab-imperial.github.io/poma3d/
comment: 11 pages, 6 tables, 5 figures
☆ NutriScreener: Retrieval-Augmented Multi-Pose Graph Attention Network for Malnourishment Screening AAAI 2026
Child malnutrition remains a global crisis, yet existing screening methods are laborious and poorly scalable, hindering early intervention. In this work, we present NutriScreener, a retrieval-augmented, multi-pose graph attention network that combines CLIP-based visual embeddings, class-boosted knowledge retrieval, and context awareness to enable robust malnutrition detection and anthropometric prediction from children's images, simultaneously addressing generalizability and class imbalance. In a clinical study, doctors rated it 4.3/5 for accuracy and 4.6/5 for efficiency, confirming its deployment readiness in low-resource settings. Trained and tested on 2,141 children from AnthroVision and additionally evaluated on diverse cross-continent populations, including ARAN and an in-house collected CampusPose dataset, it achieves 0.79 recall, 0.82 AUC, and significantly lower anthropometric RMSEs, demonstrating reliable measurement in unconstrained pediatric settings. Cross-dataset results show up to 25% recall gain and up to 3.5 cm RMSE reduction using demographically matched knowledge bases. NutriScreener offers a scalable and accurate solution for early malnutrition detection in low-resource environments.
comment: Accepted in AAAI 2026 Special Track on AI for Social Impact
☆ Lite Any Stereo: Efficient Zero-Shot Stereo Matching
Recent advances in stereo matching have focused on accuracy, often at the cost of significantly increased model size. Traditionally, the community has regarded efficient models as incapable of zero-shot ability due to their limited capacity. In this paper, we introduce Lite Any Stereo, a stereo depth estimation framework that achieves strong zero-shot generalization while remaining highly efficient. To this end, we design a compact yet expressive backbone to ensure scalability, along with a carefully crafted hybrid cost aggregation module. We further propose a three-stage training strategy on million-scale data to effectively bridge the sim-to-real gap. Together, these components demonstrate that an ultra-light model can deliver strong generalization, ranking 1st across four widely used real-world benchmarks. Remarkably, our model attains accuracy comparable to or exceeding state-of-the-art non-prior-based accurate methods while requiring less than 1% computational cost, setting a new standard for efficient stereo matching.
☆ Progressive Supernet Training for Efficient Visual Autoregressive Modeling CVPR 2025
Visual Auto-Regressive (VAR) models significantly reduce inference steps through the "next-scale" prediction paradigm. However, progressive multi-scale generation incurs substantial memory overhead due to cumulative KV caching, limiting practical deployment. We observe a scale-depth asymmetric dependency in VAR: early scales exhibit extreme sensitivity to network depth, while later scales remain robust to depth reduction. Inspired by this, we propose VARiant: by equidistant sampling, we select multiple subnets ranging from 16 to 2 layers from the original 30-layer VAR-d30 network. Early scales are processed by the full network, while later scales utilize subnet. Subnet and the full network share weights, enabling flexible depth adjustment within a single model. However, weight sharing between subnet and the entire network can lead to optimization conflicts. To address this, we propose a progressive training strategy that breaks through the Pareto frontier of generation quality for both subnets and the full network under fixed-ratio training, achieving joint optimality. Experiments on ImageNet demonstrate that, compared to the pretrained VAR-d30 (FID 1.95), VARiant-d16 and VARiant-d8 achieve nearly equivalent quality (FID 2.05/2.12) while reducing memory consumption by 40-65%. VARiant-d2 achieves 3.5 times speedup and 80% memory reduction at moderate quality cost (FID 2.97). In terms of deployment, VARiant's single-model architecture supports zero-cost runtime depth switching and provides flexible deployment options from high quality to extreme efficiency, catering to diverse application scenarios.
comment: Submitted to CVPR 2025. 10 pages, 7 figures
☆ EOGS++: Earth Observation Gaussian Splatting with Internal Camera Refinement and Direct Panchromatic Rendering SP
Recently, 3D Gaussian Splatting has been introduced as a compelling alternative to NeRF for Earth observation, offering com- petitive reconstruction quality with significantly reduced training times. In this work, we extend the Earth Observation Gaussian Splatting (EOGS) framework to propose EOGS++, a novel method tailored for satellite imagery that directly operates on raw high-resolution panchromatic data without requiring external preprocessing. Furthermore, leveraging optical flow techniques we embed bundle adjustment directly within the training process, avoiding reliance on external optimization tools while improving camera pose estimation. We also introduce several improvements to the original implementation, including early stopping and TSDF post-processing, all contributing to sharper reconstructions and better geometric accuracy. Experiments on the IARPA 2016 and DFC2019 datasets demonstrate that EOGS++ achieves state-of-the-art performance in terms of reconstruction quality and effi- ciency, outperforming the original EOGS method and other NeRF-based methods while maintaining the computational advantages of Gaussian Splatting. Our model demonstrates an improvement from 1.33 to 1.19 mean MAE errors on buildings compared to the original EOGS models
comment: 8 pages, ISPRS
☆ Supervised Contrastive Learning for Few-Shot AI-Generated Image Detection and Attribution
The rapid advancement of generative artificial intelligence has enabled the creation of synthetic images that are increasingly indistinguishable from authentic content, posing significant challenges for digital media integrity. This problem is compounded by the accelerated release cycle of novel generative models, which renders traditional detection approaches (reliant on periodic retraining) computationally infeasible and operationally impractical. This work proposes a novel two-stage detection framework designed to address the generalization challenge inherent in synthetic image detection. The first stage employs a vision deep learning model trained via supervised contrastive learning to extract discriminative embeddings from input imagery. Critically, this model was trained on a strategically partitioned subset of available generators, with specific architectures withheld from training to rigorously ablate cross-generator generalization capabilities. The second stage utilizes a k-nearest neighbors (k-NN) classifier operating on the learned embedding space, trained in a few-shot learning paradigm incorporating limited samples from previously unseen test generators. With merely 150 images per class in the few-shot learning regime, which are easily obtainable from current generation models, the proposed framework achieves an average detection accuracy of 91.3\%, representing a 5.2 percentage point improvement over existing approaches . For the source attribution task, the proposed approach obtains improvements of of 14.70\% and 4.27\% in AUC and OSCR respectively on an open set classification context, marking a significant advancement toward robust, scalable forensic attribution systems capable of adapting to the evolving generative AI landscape without requiring exhaustive retraining protocols.
comment: 17 pages, 6 figures, 6 tables
☆ Investigating Optical Flow Computation: From Local Methods to a Multiresolution Horn-Schunck Implementation with Bilinear Interpolation
This paper presents an applied analysis of local and global methods, with a focus on the Horn-Schunck algorithm for optical flow computation. We explore the theoretical and practical aspects of local approaches, such as the Lucas-Kanade method, and global techniques such as Horn-Schunck. Additionally, we implement a multiresolution version of the Horn-Schunck algorithm, using bilinear interpolation and prolongation to improve accuracy and convergence. The study investigates the effectiveness of these combined strategies in estimating motion between frames, particularly under varying image conditions.
☆ Enhancing Multi-Camera Gymnast Tracking Through Domain Knowledge Integration
We present a robust multi-camera gymnast tracking, which has been applied at international gymnastics championships for gymnastics judging. Despite considerable progress in multi-camera tracking algorithms, tracking gymnasts presents unique challenges: (i) due to space restrictions, only a limited number of cameras can be installed in the gymnastics stadium; and (ii) due to variations in lighting, background, uniforms, and occlusions, multi-camera gymnast detection may fail in certain views and only provide valid detections from two opposing views. These factors complicate the accurate determination of a gymnast's 3D trajectory using conventional multi-camera triangulation. To alleviate this issue, we incorporate gymnastics domain knowledge into our tracking solution. Given that a gymnast's 3D center typically lies within a predefined vertical plane during \revised{much of their} performance, we can apply a ray-plane intersection to generate coplanar 3D trajectory candidates for opposing-view detections. More specifically, we propose a novel cascaded data association (DA) paradigm that employs triangulation to generate 3D trajectory candidates when cross-view detections are sufficient, and resort to the ray-plane intersection when they are insufficient. Consequently, coplanar candidates are used to compensate for uncertain trajectories, thereby minimizing tracking failures. The robustness of our method is validated through extensive experimentation, demonstrating its superiority over existing methods in challenging scenarios. Furthermore, our gymnastics judging system, equipped with this tracking method, has been successfully applied to recent Gymnastics World Championships, earning significant recognition from the International Gymnastics Federation.
☆ Contrastive vision-language learning with paraphrasing and negation
Contrastive vision-language models continue to be the dominant approach for image and text retrieval. Contrastive Language-Image Pre-training (CLIP) trains two neural networks in contrastive manner to align their image and text embeddings in a shared latent space. Recent results evaluating CLIP on negated or paraphrased text have shown mixed performance because negation changes meaning radically with minimal lexical changes, while paraphrasing can create very different textual expressions with the same intended meaning. This poses a significant challenge for improving the evaluation results and alignment of vision-language models. To address this challenge, this paper evaluates the combination of paraphrasing and negation, proposes a new CLIP contrastive loss function accounting for both paraphrasing and negation, and applies LLM-generated training triples consisting of original, paraphrased and negated textual captions to CLIP-like training models. The approach, called SemCLIP, is shown to move paraphrased captions towards the original image embeddings while pushing negated captions further away in embedding space. Empirically, SemCLIP is shown to be capable of preserving CLIP's performance while increasing considerably the distances to negated captions. On the CC-Neg benchmark using an original over negation image-retrieval accuracy metric, SemCLIP improves accuracy from 68.1% to 78.1%. Although results are mixed when compared with CLIP on the Sugarcrepe++ benchmark, SemCLIP's performance is generally better than the models trained with negated captions. This robustness to negation extends to downstream zero-shot classification tasks where SemCLIP pre-trained on Sugarcrepe++ performs better than CLIP on all tested downstream tasks. These results indicate that SemCLIP can achieve significant robustness to semantic transformations.
☆ BoxingVI: A Multi-Modal Benchmark for Boxing Action Recognition and Localization
Accurate analysis of combat sports using computer vision has gained traction in recent years, yet the development of robust datasets remains a major bottleneck due to the dynamic, unstructured nature of actions and variations in recording environments. In this work, we present a comprehensive, well-annotated video dataset tailored for punch detection and classification in boxing. The dataset comprises 6,915 high-quality punch clips categorized into six distinct punch types, extracted from 20 publicly available YouTube sparring sessions and involving 18 different athletes. Each clip is manually segmented and labeled to ensure precise temporal boundaries and class consistency, capturing a wide range of motion styles, camera angles, and athlete physiques. This dataset is specifically curated to support research in real-time vision-based action recognition, especially in low-resource and unconstrained environments. By providing a rich benchmark with diverse punch examples, this contribution aims to accelerate progress in movement analysis, automated coaching, and performance assessment within boxing and related domains.
☆ YOWO: You Only Walk Once to Jointly Map An Indoor Scene and Register Ceiling-mounted Cameras
Using ceiling-mounted cameras (CMCs) for indoor visual capturing opens up a wide range of applications. However, registering CMCs to the target scene layout presents a challenging task. While manual registration with specialized tools is inefficient and costly, automatic registration with visual localization may yield poor results when visual ambiguity exists. To alleviate these issues, we propose a novel solution for jointly mapping an indoor scene and registering CMCs to the scene layout. Our approach involves equipping a mobile agent with a head-mounted RGB-D camera to traverse the entire scene once and synchronize CMCs to capture this mobile agent. The egocentric videos generate world-coordinate agent trajectories and the scene layout, while the videos of CMCs provide pseudo-scale agent trajectories and CMC relative poses. By correlating all the trajectories with their corresponding timestamps, the CMC relative poses can be aligned to the world-coordinate scene layout. Based on this initialization, a factor graph is customized to enable the joint optimization of ego-camera poses, scene layout, and CMC poses. We also develop a new dataset, setting the first benchmark for collaborative scene mapping and CMC registration (https://sites.google.com/view/yowo/home). Experimental results indicate that our method not only effectively accomplishes two tasks within a unified framework, but also jointly enhances their performance. We thus provide a reliable tool to facilitate downstream position-aware applications.
☆ MiMo-Embodied: X-Embodied Foundation Model Technical Report
We open-source MiMo-Embodied, the first cross-embodied foundation model to successfully integrate and achieve state-of-the-art performance in both Autonomous Driving and Embodied AI. MiMo-Embodied sets new records across 17 embodied AI benchmarks in Task Planning, Affordance Prediction and Spatial Understanding, while also excelling in 12 autonomous driving benchmarks across Environmental Perception, Status Prediction, and Driving Planning. Across these tasks, MiMo-Embodied significantly outperforms existing open-source, closed-source, and specialized baselines. Our results indicate that through multi-stage learning, curated data construction, and CoT/RL fine-tuning, these two domains exhibit strong positive transfer and mutually reinforce one another. We provide a detailed analysis of our model design and training methodologies to facilitate further research. Code and models are available at https://github.com/XiaomiMiMo/MiMo-Embodied.
comment: Code: https://github.com/XiaomiMiMo/MiMo-Embodied Model: https://huggingface.co/XiaomiMiMo/MiMo-Embodied-7B
☆ Acquisition Time-Informed Breast Tumor Segmentation from Dynamic Contrast-Enhanced MRI
Dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) plays an important role in breast cancer screening, tumor assessment, and treatment planning and monitoring. The dynamic changes in contrast in different tissues help to highlight the tumor in post-contrast images. However, varying acquisition protocols and individual factors result in large variation in the appearance of tissues, even for images acquired in the same phase (e.g., first post-contrast phase), making automated tumor segmentation challenging. Here, we propose a tumor segmentation method that leverages knowledge of the image acquisition time to modulate model features according to the specific acquisition sequence. We incorporate the acquisition times using feature-wise linear modulation (FiLM) layers, a lightweight method for incorporating temporal information that also allows for capitalizing on the full, variables number of images acquired per imaging study. We trained baseline and different configurations for the time-modulated models with varying backbone architectures on a large public multisite breast DCE-MRI dataset. Evaluation on in-domain images and a public out-of-domain dataset showed that incorporating knowledge of phase acquisition time improved tumor segmentation performance and model generalization.
comment: 5 pages, 3 figures
☆ Physics-Informed Machine Learning for Efficient Sim-to-Real Data Augmentation in Micro-Object Pose Estimation
Precise pose estimation of optical microrobots is essential for enabling high-precision object tracking and autonomous biological studies. However, current methods rely heavily on large, high-quality microscope image datasets, which are difficult and costly to acquire due to the complexity of microrobot fabrication and the labour-intensive labelling. Digital twin systems offer a promising path for sim-to-real data augmentation, yet existing techniques struggle to replicate complex optical microscopy phenomena, such as diffraction artifacts and depth-dependent imaging.This work proposes a novel physics-informed deep generative learning framework that, for the first time, integrates wave optics-based physical rendering and depth alignment into a generative adversarial network (GAN), to synthesise high-fidelity microscope images for microrobot pose estimation efficiently. Our method improves the structural similarity index (SSIM) by 35.6% compared to purely AI-driven methods, while maintaining real-time rendering speeds (0.022 s/frame).The pose estimator (CNN backbone) trained on our synthetic data achieves 93.9%/91.9% (pitch/roll) accuracy, just 5.0%/5.4% (pitch/roll) below that of an estimator trained exclusively on real data. Furthermore, our framework generalises to unseen poses, enabling data augmentation and robust pose estimation for novel microrobot configurations without additional training data.
☆ Flow and Depth Assisted Video Prediction with Latent Transformer
Video prediction is a fundamental task for various downstream applications, including robotics and world modeling. Although general video prediction models have achieved remarkable performance in standard scenarios, occlusion is still an inherent challenge in video prediction. We hypothesize that providing explicit information about motion (via point-flow) and geometric structure (via depth-maps) will enable video prediction models to perform better in situations with occlusion and the background motion. To investigate this, we present the first systematic study dedicated to occluded video prediction. We use a standard multi-object latent transformer architecture to predict future frames, but modify this to incorporate information from depth and point-flow. We evaluate this model in a controlled setting on both synthetic and real-world datasets with not only appearance-based metrics but also Wasserstein distances on object masks, which can effectively measure the motion distribution of the prediction. We find that when the prediction model is assisted with point flow and depth, it performs better in occluded scenarios and predicts more accurate background motion compared to models without the help of these modalities.
☆ FastSurfer-CC: A robust, accurate, and comprehensive framework for corpus callosum morphometry
The corpus callosum, the largest commissural structure in the human brain, is a central focus in research on aging and neurological diseases. It is also a critical target for interventions such as deep brain stimulation and serves as an important biomarker in clinical trials, including those investigating remyelination therapies. Despite extensive research on corpus callosum segmentation, few publicly available tools provide a comprehensive and automated analysis pipeline. To address this gap, we present FastSurfer-CC, an efficient and fully automated framework for corpus callosum morphometry. FastSurfer-CC automatically identifies mid-sagittal slices, segments the corpus callosum and fornix, localizes the anterior and posterior commissures to standardize head positioning, generates thickness profiles and subdivisions, and extracts eight shape metrics for statistical analysis. We demonstrate that FastSurfer-CC outperforms existing specialized tools across the individual tasks. Moreover, our method reveals statistically significant differences between Huntington's disease patients and healthy controls that are not detected by the current state-of-the-art.
☆ Arctic-Extract Technical Report
Arctic-Extract is a state-of-the-art model designed for extracting structural data (question answering, entities and tables) from scanned or digital-born business documents. Despite its SoTA capabilities, the model is deployable on resource-constrained hardware, weighting only 6.6 GiB, making it suitable for deployment on devices with limited resources, such as A10 GPUs with 24 GB of memory. Arctic-Extract can process up to 125 A4 pages on those GPUs, making suitable for long document processing. This paper highlights Arctic-Extract's training protocols and evaluation results, demonstrating its strong performance in document understanding.
☆ LLaVA$^3$: Representing 3D Scenes like a Cubist Painter to Boost 3D Scene Understanding of VLMs AAAI'26
Developing a multi-modal language model capable of understanding 3D scenes remains challenging due to the limited availability of 3D training data, in contrast to the abundance of 2D datasets used for vision-language models (VLM). As an alternative, we introduce LLaVA$^3$ (pronounced LLaVA-Cube), a novel method that improves the 3D scene understanding capabilities of VLM using only multi-view 2D images and without any fine-tuning. Inspired by Cubist painters, who represented multiple viewpoints of a 3D object within a single picture, we propose to describe the 3D scene for the VLM through omnidirectional visual representations of each object. These representations are derived from an intermediate multi-view 3D reconstruction of the scene. Extensive experiments on 3D VQA and 3D language grounding show that our approach outperforms previous 2D-based VLM solutions.
comment: Accepted at AAAI'26
☆ VLA-Pruner: Temporal-Aware Dual-Level Visual Token Pruning for Efficient Vision-Language-Action Inference
Vision-Language-Action (VLA) models have shown great promise for embodied AI, yet the heavy computational cost of processing continuous visual streams severely limits their real-time deployment. Token pruning (keeping salient visual tokens and dropping redundant ones) has emerged as an effective approach for accelerating Vision-Language Models (VLMs), offering a solution for efficient VLA. However, these VLM-specific token pruning methods select tokens based solely on semantic salience metrics (e.g., prefill attention), while overlooking the VLA's intrinsic dual-system nature of high-level semantic understanding and low-level action execution. Consequently, these methods bias token retention toward semantic cues, discard critical information for action generation, and significantly degrade VLA performance. To bridge this gap, we propose VLA-Pruner, a versatile plug-and-play VLA-specific token prune method that aligns with the dual-system nature of VLA models and exploits the temporal continuity in robot manipulation. Specifically, VLA-Pruner adopts a dual-level importance criterion for visual token retention: vision-language prefill attention for semantic-level relevance and action decode attention, estimated via temporal smoothing, for action-level importance. Based on this criterion, VLA-Pruner proposes a novel dual-level token selection strategy that adaptively preserves a compact, informative set of visual tokens for both semantic understanding and action execution under given compute budget. Experiments show that VLA-Pruner achieves state-of-the-art performance across multiple VLA architectures and diverse robotic tasks.
☆ StreetView-Waste: A Multi-Task Dataset for Urban Waste Management WACV 2026
Urban waste management remains a critical challenge for the development of smart cities. Despite the growing number of litter detection datasets, the problem of monitoring overflowing waste containers, particularly from images captured by garbage trucks, has received little attention. While existing datasets are valuable, they often lack annotations for specific container tracking or are captured in static, decontextualized environments, limiting their utility for real-world logistics. To address this gap, we present StreetView-Waste, a comprehensive dataset of urban scenes featuring litter and waste containers. The dataset supports three key evaluation tasks: (1) waste container detection, (2) waste container tracking, and (3) waste overflow segmentation. Alongside the dataset, we provide baselines for each task by benchmarking state-of-the-art models in object detection, tracking, and segmentation. Additionally, we enhance baseline performance by proposing two complementary strategies: a heuristic-based method for improved waste container tracking and a model-agnostic framework that leverages geometric priors to refine litter segmentation. Our experimental results show that while fine-tuned object detectors achieve reasonable performance in detecting waste containers, baseline tracking methods struggle to accurately estimate their number; however, our proposed heuristics reduce the mean absolute counting error by 79.6%. Similarly, while segmenting amorphous litter is challenging, our geometry-aware strategy improves segmentation mAP@0.5 by 27% on lightweight models, demonstrating the value of multimodal inputs for this task. Ultimately, StreetView-Waste provides a challenging benchmark to encourage research into real-world perception systems for urban waste management.
comment: Accepted at WACV 2026
☆ Beyond Visual Cues: Leveraging General Semantics as Support for Few-Shot Segmentation
Few-shot segmentation (FSS) aims to segment novel classes under the guidance of limited support samples by a meta-learning paradigm. Existing methods mainly mine references from support images as meta guidance. However, due to intra-class variations among visual representations, the meta information extracted from support images cannot produce accurate guidance to segment untrained classes. In this paper, we argue that the references from support images may not be essential, the key to the support role is to provide unbiased meta guidance for both trained and untrained classes. We then introduce a Language-Driven Attribute Generalization (LDAG) architecture to utilize inherent target property language descriptions to build robust support strategy. Specifically, to obtain an unbiased support representation, we design a Multi-attribute Enhancement (MaE) module, which produces multiple detailed attribute descriptions of the target class through Large Language Models (LLMs), and then builds refined visual-text prior guidance utilizing multi-modal matching. Meanwhile, due to text-vision modal shift, attribute text struggles to promote visual feature representation, we design a Multi-modal Attribute Alignment (MaA) to achieve cross-modal interaction between attribute texts and visual feature. Experiments show that our proposed method outperforms existing approaches by a clear margin and achieves the new state-of-the art performance. The code will be released.
☆ Graph Neural Networks for Surgical Scene Segmentation
Purpose: Accurate identification of hepatocystic anatomy is critical to preventing surgical complications during laparoscopic cholecystectomy. Deep learning models often struggle with occlusions, long-range dependencies, and capturing the fine-scale geometry of rare structures. This work addresses these challenges by introducing graph-based segmentation approaches that enhance spatial and semantic understanding in surgical scene analyses. Methods: We propose two segmentation models integrating Vision Transformer (ViT) feature encoders with Graph Neural Networks (GNNs) to explicitly model spatial relationships between anatomical regions. (1) A static k Nearest Neighbours (k-NN) graph with a Graph Convolutional Network with Initial Residual and Identity Mapping (GCNII) enables stable long-range information propagation. (2) A dynamic Differentiable Graph Generator (DGG) with a Graph Attention Network (GAT) supports adaptive topology learning. Both models are evaluated on the Endoscapes-Seg50 and CholecSeg8k benchmarks. Results: The proposed approaches achieve up to 7-8% improvement in Mean Intersection over Union (mIoU) and 6% improvement in Mean Dice (mDice) scores over state-of-the-art baselines. It produces anatomically coherent predictions, particularly on thin, rare and safety-critical structures. Conclusion: The proposed graph-based segmentation methods enhance both performance and anatomical consistency in surgical scene segmentation. By combining ViT-based global context with graph-based relational reasoning, the models improve interpretability and reliability, paving the way for safer laparoscopic and robot-assisted surgery through a precise identification of critical anatomical features.
comment: 12 pages, 4 figures, 3 tables
☆ CylinderDepth: Cylindrical Spatial Attention for Multi-View Consistent Self-Supervised Surround Depth Estimation
Self-supervised surround-view depth estimation enables dense, low-cost 3D perception with a 360° field of view from multiple minimally overlapping images. Yet, most existing methods suffer from depth estimates that are inconsistent between overlapping images. Addressing this limitation, we propose a novel geometry-guided method for calibrated, time-synchronized multi-camera rigs that predicts dense, metric, and cross-view-consistent depth. Given the intrinsic and relative orientation parameters, a first depth map is predicted per image and the so-derived 3D points from all images are projected onto a shared unit cylinder, establishing neighborhood relations across different images. This produces a 2D position map for every image, where each pixel is assigned its projected position on the cylinder. Based on these position maps, we apply an explicit, non-learned spatial attention that aggregates features among pixels across images according to their distances on the cylinder, to predict a final depth map per image. Evaluated on the DDAD and nuScenes datasets, our approach improves the consistency of depth estimates across images and the overall depth compared to state-of-the-art methods.
☆ End-to-End Motion Capture from Rigid Body Markers with Geodesic Loss
Marker-based optical motion capture (MoCap), while long regarded as the gold standard for accuracy, faces practical challenges, such as time-consuming preparation and marker identification ambiguity, due to its reliance on dense marker configurations, which fundamentally limit its scalability. To address this, we introduce a novel fundamental unit for MoCap, the Rigid Body Marker (RBM), which provides unambiguous 6-DoF data and drastically simplifies setup. Leveraging this new data modality, we develop a deep-learning-based regression model that directly estimates SMPL parameters under a geodesic loss. This end-to-end approach matches the performance of optimization-based methods while requiring over an order of magnitude less computation. Trained on synthesized data from the AMASS dataset, our end-to-end model achieves state-of-the-art accuracy in body pose estimation. Real-world data captured using a Vicon optical tracking system further demonstrates the practical viability of our approach. Overall, the results show that combining sparse 6-DoF RBM with a manifold-aware geodesic loss yields a practical and high-fidelity solution for real-time MoCap in graphics, virtual reality, and biomechanics.
comment: The source code is available in : https://github.com/wer010/GLRBM-Mocap
☆ CAMS: Towards Compositional Zero-Shot Learning via Gated Cross-Attention and Multi-Space Disentanglement
Compositional zero-shot learning (CZSL) aims to learn the concepts of attributes and objects in seen compositions and to recognize their unseen compositions. Most Contrastive Language-Image Pre-training (CLIP)-based CZSL methods focus on disentangling attributes and objects by leveraging the global semantic representation obtained from the image encoder. However, this representation has limited representational capacity and do not allow for complete disentanglement of the two. To this end, we propose CAMS, which aims to extract semantic features from visual features and perform semantic disentanglement in multidimensional spaces, thereby improving generalization over unseen attribute-object compositions. Specifically, CAMS designs a Gated Cross-Attention that captures fine-grained semantic features from the high-level image encoding blocks of CLIP through a set of latent units, while adaptively suppressing background and other irrelevant information. Subsequently, it conducts Multi-Space Disentanglement to achieve disentanglement of attribute and object semantics. Experiments on three popular benchmarks (MIT-States, UT-Zappos, and C-GQA) demonstrate that CAMS achieves state-of-the-art performance in both closed-world and open-world settings. The code is available at https://github.com/ybyangjing/CAMS.
☆ DetailSemNet: Elevating Signature Verification through Detail-Semantic Integration
Offline signature verification (OSV) is a frequently utilized technology in forensics. This paper proposes a new model, DetailSemNet, for OSV. Unlike previous methods that rely on holistic features for pair comparisons, our approach underscores the significance of fine-grained differences for robust OSV. We propose to match local structures between two signature images, significantly boosting verification accuracy. Furthermore, we observe that without specific architectural modifications, transformer-based backbones might naturally obscure local details, adversely impacting OSV performance. To address this, we introduce a Detail Semantics Integrator, leveraging feature disentanglement and re-entanglement. This integrator is specifically designed to enhance intricate details while simultaneously expanding discriminative semantics, thereby augmenting the efficacy of local structural matching. We evaluate our method against leading benchmarks in offline signature verification. Our model consistently outperforms recent methods, achieving state-of-the-art results with clear margins. The emphasis on local structure matching not only improves performance but also enhances the model's interpretability, supporting our findings. Additionally, our model demonstrates remarkable generalization capabilities in cross-dataset testing scenarios. The combination of generalizability and interpretability significantly bolsters the potential of DetailSemNet for real-world applications.
☆ Multi-Order Matching Network for Alignment-Free Depth Super-Resolution
Recent guided depth super-resolution methods are premised on the assumption of strictly spatial alignment between depth and RGB, achieving high-quality depth reconstruction. However, in real-world scenarios, the acquisition of strictly aligned RGB-D is hindered by inherent hardware limitations (e.g., physically separate RGB-D sensors) and unavoidable calibration drift induced by mechanical vibrations or temperature variations. Consequently, existing approaches often suffer inevitable performance degradation when applied to misaligned real-world scenes. In this paper, we propose the Multi-Order Matching Network (MOMNet), a novel alignment-free framework that adaptively retrieves and selects the most relevant information from misaligned RGB. Specifically, our method begins with a multi-order matching mechanism, which jointly performs zero-order, first-order, and second-order matching to comprehensively identify RGB information consistent with depth across multi-order feature spaces. To effectively integrate the retrieved RGB and depth, we further introduce a multi-order aggregation composed of multiple structure detectors. This strategy uses multi-order priors as prompts to facilitate the selective feature transfer from RGB to depth. Extensive experiments demonstrate that MOMNet achieves state-of-the-art performance and exhibits outstanding robustness.
☆ CRISTAL: Real-time Camera Registration in Static LiDAR Scans using Neural Rendering
Accurate camera localization is crucial for robotics and Extended Reality (XR), enabling reliable navigation and alignment of virtual and real content. Existing visual methods often suffer from drift, scale ambiguity, and depend on fiducials or loop closure. This work introduces a real-time method for localizing a camera within a pre-captured, highly accurate colored LiDAR point cloud. By rendering synthetic views from this cloud, 2D-3D correspondences are established between live frames and the point cloud. A neural rendering technique narrows the domain gap between synthetic and real images, reducing occlusion and background artifacts to improve feature matching. The result is drift-free camera tracking with correct metric scale in the global LiDAR coordinate system. Two real-time variants are presented: Online Render and Match, and Prebuild and Localize. We demonstrate improved results on the ScanNet++ dataset and outperform existing SLAM pipelines.
☆ Aerial View River Landform Video segmentation: A Weakly Supervised Context-aware Temporal Consistency Distillation Approach
The study of terrain and landform classification through UAV remote sensing diverges significantly from ground vehicle patrol tasks. Besides grappling with the complexity of data annotation and ensuring temporal consistency, it also confronts the scarcity of relevant data and the limitations imposed by the effective range of many technologies. This research substantiates that, in aerial positioning tasks, both the mean Intersection over Union (mIoU) and temporal consistency (TC) metrics are of paramount importance. It is demonstrated that fully labeled data is not the optimal choice, as selecting only key data lacks the enhancement in TC, leading to failures. Hence, a teacher-student architecture, coupled with key frame selection and key frame updating algorithms, is proposed. This framework successfully performs weakly supervised learning and TC knowledge distillation, overcoming the deficiencies of traditional TC training in aerial tasks. The experimental results reveal that our method utilizing merely 30\% of labeled data, concurrently elevates mIoU and temporal consistency ensuring stable localization of terrain objects. Result demo : https://gitlab.com/prophet.ai.inc/drone-based-riverbed-inspection
☆ Arbitrary-Resolution and Arbitrary-Scale Face Super-Resolution with Implicit Representation Networks
Face super-resolution (FSR) is a critical technique for enhancing low-resolution facial images and has significant implications for face-related tasks. However, existing FSR methods are limited by fixed up-sampling scales and sensitivity to input size variations. To address these limitations, this paper introduces an Arbitrary-Resolution and Arbitrary-Scale FSR method with implicit representation networks (ARASFSR), featuring three novel designs. First, ARASFSR employs 2D deep features, local relative coordinates, and up-sampling scale ratios to predict RGB values for each target pixel, allowing super-resolution at any up-sampling scale. Second, a local frequency estimation module captures high-frequency facial texture information to reduce the spectral bias effect. Lastly, a global coordinate modulation module guides FSR to leverage prior facial structure knowledge and achieve resolution adaptation effectively. Quantitative and qualitative evaluations demonstrate the robustness of ARASFSR over existing state-of-the-art methods while super-resolving facial images across various input sizes and up-sampling scales.
☆ ChangeDINO: DINOv3-Driven Building Change Detection in Optical Remote Sensing Imagery
Remote sensing change detection (RSCD) aims to identify surface changes from co-registered bi-temporal images. However, many deep learning-based RSCD methods rely solely on change-map annotations and underuse the semantic information in non-changing regions, which limits robustness under illumination variation, off-nadir views, and scarce labels. This article introduces ChangeDINO, an end-to-end multiscale Siamese framework for optical building change detection. The model fuses a lightweight backbone stream with features transferred from a frozen DINOv3, yielding semantic- and context-rich pyramids even on small datasets. A spatial-spectral differential transformer decoder then exploits multi-scale absolute differences as change priors to highlight true building changes and suppress irrelevant responses. Finally, a learnable morphology module refines the upsampled logits to recover clean boundaries. Experiments on four public benchmarks show that ChangeDINO consistently outperforms recent state-of-the-art methods in IoU and F1, and ablation studies confirm the effectiveness of each component. The source code is available at https://github.com/chingheng0808/ChangeDINO.
☆ WWE-UIE: A Wavelet & White Balance Efficient Network for Underwater Image Enhancement
Underwater Image Enhancement (UIE) aims to restore visibility and correct color distortions caused by wavelength-dependent absorption and scattering. Recent hybrid approaches, which couple domain priors with modern deep neural architectures, have achieved strong performance but incur high computational cost, limiting their practicality in real-time scenarios. In this work, we propose WWE-UIE, a compact and efficient enhancement network that integrates three interpretable priors. First, adaptive white balance alleviates the strong wavelength-dependent color attenuation, particularly the dominance of blue-green tones. Second, a wavelet-based enhancement block (WEB) performs multi-band decomposition, enabling the network to capture both global structures and fine textures, which are critical for underwater restoration. Third, a gradient-aware module (SGFB) leverages Sobel operators with learnable gating to explicitly preserve edge structures degraded by scattering. Extensive experiments on benchmark datasets demonstrate that WWE-UIE achieves competitive restoration quality with substantially fewer parameters and FLOPs, enabling real-time inference on resource-limited platforms. Ablation studies and visualizations further validate the contribution of each component. The source code is available at https://github.com/chingheng0808/WWE-UIE.
☆ NaTex: Seamless Texture Generation as Latent Color Diffusion
We present NaTex, a native texture generation framework that predicts texture color directly in 3D space. In contrast to previous approaches that rely on baking 2D multi-view images synthesized by geometry-conditioned Multi-View Diffusion models (MVDs), NaTex avoids several inherent limitations of the MVD pipeline. These include difficulties in handling occluded regions that require inpainting, achieving precise mesh-texture alignment along boundaries, and maintaining cross-view consistency and coherence in both content and color intensity. NaTex features a novel paradigm that addresses the aforementioned issues by viewing texture as a dense color point cloud. Driven by this idea, we propose latent color diffusion, which comprises a geometry-awared color point cloud VAE and a multi-control diffusion transformer (DiT), entirely trained from scratch using 3D data, for texture reconstruction and generation. To enable precise alignment, we introduce native geometry control that conditions the DiT on direct 3D spatial information via positional embeddings and geometry latents. We co-design the VAE-DiT architecture, where the geometry latents are extracted via a dedicated geometry branch tightly coupled with the color VAE, providing fine-grained surface guidance that maintains strong correspondence with the texture. With these designs, NaTex demonstrates strong performance, significantly outperforming previous methods in texture coherence and alignment. Moreover, NaTex also exhibits strong generalization capabilities, either training-free or with simple tuning, for various downstream applications, e.g., material generation, texture refinement, and part segmentation and texturing.
comment: Technical Report
☆ BioBench: A Blueprint to Move Beyond ImageNet for Scientific ML Benchmarks NeurIPS 2025
ImageNet-1K linear-probe transfer accuracy remains the default proxy for visual representation quality, yet it no longer predicts performance on scientific imagery. Across 46 modern vision model checkpoints, ImageNet top-1 accuracy explains only 34% of variance on ecology tasks and mis-ranks 30% of models above 75% accuracy. We present BioBench, an open ecology vision benchmark that captures what ImageNet misses. BioBench unifies 9 publicly released, application-driven tasks, 4 taxonomic kingdoms, and 6 acquisition modalities (drone RGB, web video, micrographs, in-situ and specimen photos, camera-trap frames), totaling 3.1M images. A single Python API downloads data, fits lightweight classifiers to frozen backbones, and reports class-balanced macro-F1 (plus domain metrics for FishNet and FungiCLEF); ViT-L models evaluate in 6 hours on an A6000 GPU. BioBench provides new signal for computer vision in ecology and a template recipe for building reliable AI-for-science benchmarks in any domain. Code and predictions are available at https://github.com/samuelstevens/biobench and results at https://samuelstevens.me/biobench.
comment: Accepted at the 3rd Imageomics Workshop at NeurIPS 2025
☆ Sparse Autoencoders are Topic Models
Sparse autoencoders (SAEs) are used to analyze embeddings, but their role and practical value are debated. We propose a new perspective on SAEs by demonstrating that they can be naturally understood as topic models. We extend Latent Dirichlet Allocation to embedding spaces and derive the SAE objective as a maximum a posteriori estimator under this model. This view implies SAE features are thematic components rather than steerable directions. Based on this, we introduce SAE-TM, a topic modeling framework that: (1) trains an SAE to learn reusable topic atoms, (2) interprets them as word distributions on downstream data, and (3) merges them into any number of topics without retraining. SAE-TM yields more coherent topics than strong baselines on text and image datasets while maintaining diversity. Finally, we analyze thematic structure in image datasets and trace topic changes over time in Japanese woodblock prints. Our work positions SAEs as effective tools for large-scale thematic analysis across modalities. Code and data will be released upon publication.
☆ Upsample Anything: A Simple and Hard to Beat Baseline for Feature Upsampling
We present \textbf{Upsample Anything}, a lightweight test-time optimization (TTO) framework that restores low-resolution features to high-resolution, pixel-wise outputs without any training. Although Vision Foundation Models demonstrate strong generalization across diverse downstream tasks, their representations are typically downsampled by 14x/16x (e.g., ViT), which limits their direct use in pixel-level applications. Existing feature upsampling approaches depend on dataset-specific retraining or heavy implicit optimization, restricting scalability and generalization. Upsample Anything addresses these issues through a simple per-image optimization that learns an anisotropic Gaussian kernel combining spatial and range cues, effectively bridging Gaussian Splatting and Joint Bilateral Upsampling. The learned kernel acts as a universal, edge-aware operator that transfers seamlessly across architectures and modalities, enabling precise high-resolution reconstruction of features, depth, or probability maps. It runs in only $\approx0.419 \text{s}$ per 224x224 image and achieves state-of-the-art performance on semantic segmentation, depth estimation, and both depth and probability map upsampling.
comment: 15 pages, 12 figures
☆ Optimizing 3D Gaussian Splattering for Mobile GPUs
Image-based 3D scene reconstruction, which transforms multi-view images into a structured 3D representation of the surrounding environment, is a common task across many modern applications. 3D Gaussian Splatting (3DGS) is a new paradigm to address this problem and offers considerable efficiency as compared to the previous methods. Motivated by this, and considering various benefits of mobile device deployment (data privacy, operating without internet connectivity, and potentially faster responses), this paper develops Texture3dgs, an optimized mapping of 3DGS for a mobile GPU. A critical challenge in this area turns out to be optimizing for the two-dimensional (2D) texture cache, which needs to be exploited for faster executions on mobile GPUs. As a sorting method dominates the computations in 3DGS on mobile platforms, the core of Texture3dgs is a novel sorting algorithm where the processing, data movement, and placement are highly optimized for 2D memory. The properties of this algorithm are analyzed in view of a cost model for the texture cache. In addition, we accelerate other steps of the 3DGS algorithm through improved variable layout design and other optimizations. End-to-end evaluation shows that Texture3dgs delivers up to 4.1$\times$ and 1.7$\times$ speedup for the sorting and overall 3D scene reconstruction, respectively -- while also reducing memory usage by up to 1.6$\times$ -- demonstrating the effectiveness of our design for efficient mobile 3D scene reconstruction.
☆ Explainable AI for Diabetic Retinopathy Detection Using Deep Learning with Attention Mechanisms and Fuzzy Logic-Based Interpretability
The task of weed detection is an essential element of precision agriculture since accurate species identification allows a farmer to selectively apply herbicides and fits into sustainable agriculture crop management. This paper proposes a hybrid deep learning framework recipe for weed detection that utilizes Convolutional Neural Networks (CNNs), Vision Transformers (ViTs), and Graph Neural Networks (GNNs) to build robustness to multiple field conditions. A Generative Adversarial Network (GAN)-based augmentation method was imposed to balance class distributions and better generalize the model. Further, a self-supervised contrastive pre-training method helps to learn more features from limited annotated data. Experimental results yield superior results with 99.33% accuracy, precision, recall, and F1-score on multi-benchmark datasets. The proposed model architecture enables local, global, and relational feature representations and offers high interpretability and adaptability. Practically, the framework allows real-time, efficient deployment of edge devices for automated weed detecting, reducing over-reliance on herbicides and providing scalable, sustainable precision-farming options.
☆ Building temporally coherent 3D maps with VGGT for memory-efficient Semantic SLAM
We present a fast, spatio-temporal scene understanding framework based on Vision Gated Generative Transformers (VGGT). The proposed pipeline is designed to enable efficient, close to real-time performance, supporting applications including assistive navigation. To achieve continuous updates of the 3D scene representation, we process the image flow with a sliding window, aligning submaps, thereby overcoming VGGT's high memory demands. We exploit the VGGT tracking head to aggregate 2D semantic instance masks into 3D objects. To allow for temporal consistency and richer contextual reasoning the system stores timestamps and instance-level identities, thereby enabling the detection of changes in the environment. We evaluate the approach on well-known benchmarks and custom datasets specifically designed for assistive navigation scenarios. The results demonstrate the applicability of the framework to real-world scenarios.
☆ TetraSDF: Precise Mesh Extraction with Multi-resolution Tetrahedral Grid
Extracting meshes that exactly match the zero-level set of neural signed distance functions (SDFs) remains challenging. Sampling-based methods introduce discretization error, while continuous piecewise affine (CPWA) analytic approaches apply only to plain ReLU MLPs. We present TetraSDF, a precise analytic meshing framework for SDFs represented by a ReLU MLP composed with a multi-resolution tetrahedral positional encoder. The encoder's barycentric interpolation preserves global CPWA structure, enabling us to track ReLU linear regions within an encoder-induced polyhedral complex. A fixed analytic input preconditioner derived from the encoder's metric further reduces directional bias and stabilizes training. Across multiple benchmarks, TetraSDF matches or surpasses existing grid-based encoders in SDF reconstruction accuracy, and its analytic extractor produces highly self-consistent meshes that remain faithful to the learned isosurfaces, all with practical runtime and memory efficiency.
☆ Weakly Supervised Segmentation and Classification of Alpha-Synuclein Aggregates in Brightfield Midbrain Images
Parkinson's disease (PD) is a neurodegenerative disorder associated with the accumulation of misfolded alpha-synuclein aggregates, forming Lewy bodies and neuritic shape used for pathology diagnostics. Automatic analysis of immunohistochemistry histopathological images with Deep Learning provides a promising tool for better understanding the spatial organization of these aggregates. In this study, we develop an automated image processing pipeline to segment and classify these aggregates in whole-slide images (WSIs) of midbrain tissue from PD and incidental Lewy Body Disease (iLBD) cases based on weakly supervised segmentation, robust to immunohistochemical labelling variability, with a ResNet50 classifier. Our approach allows to differentiate between major aggregate morphologies, including Lewy bodies and neurites with a balanced accuracy of $80\%$. This framework paves the way for large-scale characterization of the spatial distribution and heterogeneity of alpha-synuclein aggregates in brightfield immunohistochemical tissue, and for investigating their poorly understood relationships with surrounding cells such as microglia and astrocytes.
☆ Mem-MLP: Real-Time 3D Human Motion Generation from Sparse Inputs
Realistic and smooth full-body tracking is crucial for immersive AR/VR applications. Existing systems primarily track head and hands via Head Mounted Devices (HMDs) and controllers, making the 3D full-body reconstruction in-complete. One potential approach is to generate the full-body motions from sparse inputs collected from limited sensors using a Neural Network (NN) model. In this paper, we propose a novel method based on a multi-layer perceptron (MLP) backbone that is enhanced with residual connections and a novel NN-component called Memory-Block. In particular, Memory-Block represents missing sensor data with trainable code-vectors, which are combined with the sparse signals from previous time instances to improve the temporal consistency. Furthermore, we formulate our solution as a multi-task learning problem, allowing our MLP-backbone to learn robust representations that boost accuracy. Our experiments show that our method outperforms state-of-the-art baselines by substantially reducing prediction errors. Moreover, it achieves 72 FPS on mobile HMDs that ultimately improves the accuracy-running time tradeoff.
☆ How Robot Dogs See the Unseeable
Peering, a side-to-side motion used by animals to estimate distance through motion parallax, offers a powerful bio-inspired strategy to overcome a fundamental limitation in robotic vision: partial occlusion. Conventional robot cameras, with their small apertures and large depth of field, render both foreground obstacles and background objects in sharp focus, causing occluders to obscure critical scene information. This work establishes a formal connection between animal peering and synthetic aperture (SA) sensing from optical imaging. By having a robot execute a peering motion, its camera describes a wide synthetic aperture. Computational integration of the captured images synthesizes an image with an extremely shallow depth of field, effectively blurring out occluding elements while bringing the background into sharp focus. This efficient, wavelength-independent technique enables real-time, high-resolution perception across various spectral bands. We demonstrate that this approach not only restores basic scene understanding but also empowers advanced visual reasoning in large multimodal models, which fail with conventionally occluded imagery. Unlike feature-dependent multi-view 3D vision methods or active sensors like LiDAR, SA sensing via peering is robust to occlusion, computationally efficient, and immediately deployable on any mobile robot. This research bridges animal behavior and robotics, suggesting that peering motions for synthetic aperture sensing are a key to advanced scene understanding in complex, cluttered environments.
☆ SwiTrack: Tri-State Switch for Cross-Modal Object Tracking
Cross-modal object tracking (CMOT) is an emerging task that maintains target consistency while the video stream switches between different modalities, with only one modality available in each frame, mostly focusing on RGB-Near Infrared (RGB-NIR) tracking. Existing methods typically connect parallel RGB and NIR branches to a shared backbone, which limits the comprehensive extraction of distinctive modality-specific features and fails to address the issue of object drift, especially in the presence of unreliable inputs. In this paper, we propose SwiTrack, a novel state-switching framework that redefines CMOT through the deployment of three specialized streams. Specifically, RGB frames are processed by the visual encoder, while NIR frames undergo refinement via a NIR gated adapter coupled with the visual encoder to progressively calibrate shared latent space features, thereby yielding more robust cross-modal representations. For invalid modalities, a consistency trajectory prediction module leverages spatio-temporal cues to estimate target movement, ensuring robust tracking and mitigating drift. Additionally, we incorporate dynamic template reconstruction to iteratively update template features and employ a similarity alignment loss to reinforce feature consistency. Experimental results on the latest benchmarks demonstrate that our tracker achieves state-of-the-art performance, boosting precision rate and success rate gains by 7.2\% and 4.3\%, respectively, while maintaining real-time tracking at 65 frames per second. Code and models are available at https://github.com/xuboyue1999/SwiTrack.git.
☆ Can MLLMs Read the Room? A Multimodal Benchmark for Assessing Deception in Multi-Party Social Interactions
Despite their advanced reasoning capabilities, state-of-the-art Multimodal Large Language Models (MLLMs) demonstrably lack a core component of human intelligence: the ability to `read the room' and assess deception in complex social interactions. To rigorously quantify this failure, we introduce a new task, Multimodal Interactive Deception Assessment (MIDA), and present a novel multimodal dataset providing synchronized video and text with verifiable ground-truth labels for every statement. We establish a comprehensive benchmark evaluating 12 state-of-the-art open- and closed-source MLLMs, revealing a significant performance gap: even powerful models like GPT-4o struggle to distinguish truth from falsehood reliably. Our analysis of failure modes indicates that these models fail to effectively ground language in multimodal social cues and lack the ability to model what others know, believe, or intend, highlighting the urgent need for novel approaches to building more perceptive and trustworthy AI systems. To take a step forward, we design a Social Chain-of-Thought (SoCoT) reasoning pipeline and a Dynamic Social Epistemic Memory (DSEM) module. Our framework yields performance improvement on this challenging task, demonstrating a promising new path toward building MLLMs capable of genuine human-like social reasoning.
☆ Unsupervised Image Classification with Adaptive Nearest Neighbor Selection and Cluster Ensembles
Unsupervised image classification, or image clustering, aims to group unlabeled images into semantically meaningful categories. Early methods integrated representation learning and clustering within an iterative framework. However, the rise of foundational models have recently shifted focus solely to clustering, bypassing the representation learning step. In this work, we build upon a recent multi-head clustering approach by introducing adaptive nearest neighbor selection and cluster ensembling strategies to improve clustering performance. Our method, "Image Clustering through Cluster Ensembles" (ICCE), begins with a clustering stage, where we train multiple clustering heads on a frozen backbone, producing diverse image clusterings. We then employ a cluster ensembling technique to consolidate these potentially conflicting results into a unified consensus clustering. Finally, we train an image classifier using the consensus clustering result as pseudo-labels. ICCE achieves state-of-the-art performance on ten image classification benchmarks, achieving 99.3% accuracy on CIFAR10, 89% on CIFAR100, and 70.4% on ImageNet datasets, narrowing the performance gap with supervised methods. To the best of our knowledge, ICCE is the first fully unsupervised image classification method to exceed 70% accuracy on ImageNet.
☆ When Alignment Fails: Multimodal Adversarial Attacks on Vision-Language-Action Models
Vision-Language-Action models (VLAs) have recently demonstrated remarkable progress in embodied environments, enabling robots to perceive, reason, and act through unified multimodal understanding. Despite their impressive capabilities, the adversarial robustness of these systems remains largely unexplored, especially under realistic multimodal and black-box conditions. Existing studies mainly focus on single-modality perturbations and overlook the cross-modal misalignment that fundamentally affects embodied reasoning and decision-making. In this paper, we introduce VLA-Fool, a comprehensive study of multimodal adversarial robustness in embodied VLA models under both white-box and black-box settings. VLA-Fool unifies three levels of multimodal adversarial attacks: (1) textual perturbations through gradient-based and prompt-based manipulations, (2) visual perturbations via patch and noise distortions, and (3) cross-modal misalignment attacks that intentionally disrupt the semantic correspondence between perception and instruction. We further incorporate a VLA-aware semantic space into linguistic prompts, developing the first automatically crafted and semantically guided prompting framework. Experiments on the LIBERO benchmark using a fine-tuned OpenVLA model reveal that even minor multimodal perturbations can cause significant behavioral deviations, demonstrating the fragility of embodied multimodal alignment.
☆ PrIntMesh: Precise Intersection Surfaces for 3D Organ Mesh Reconstruction
Human organs are composed of interconnected substructures whose geometry and spatial relationships constrain one another. Yet, most deep-learning approaches treat these parts independently, producing anatomically implausible reconstructions. We introduce PrIntMesh, a template-based, topology-preserving framework that reconstructs organs as unified systems. Starting from a connected template, PrIntMesh jointly deforms all substructures to match patient-specific anatomy, while explicitly preserving internal boundaries and enforcing smooth, artifact-free surfaces. We demonstrate its effectiveness on the heart, hippocampus, and lungs, achieving high geometric accuracy, correct topology, and robust performance even with limited or noisy training data. Compared to voxel- and surface-based methods, PrIntMesh better reconstructs shared interfaces, maintains structural consistency, and provides a data-efficient solution suitable for clinical use.
comment: 12 pages, 9 figures
☆ Domain-Shared Learning and Gradual Alignment for Unsupervised Domain Adaptation Visible-Infrared Person Re-Identification
Recently, Visible-Infrared person Re-Identification (VI-ReID) has achieved remarkable performance on public datasets. However, due to the discrepancies between public datasets and real-world data, most existing VI-ReID algorithms struggle in real-life applications. To address this, we take the initiative to investigate Unsupervised Domain Adaptation Visible-Infrared person Re-Identification (UDA-VI-ReID), aiming to transfer the knowledge learned from the public data to real-world data without compromising accuracy and requiring the annotation of new samples. Specifically, we first analyze two basic challenges in UDA-VI-ReID, i.e., inter-domain modality discrepancies and intra-domain modality discrepancies. Then, we design a novel two-stage model, i.e., Domain-Shared Learning and Gradual Alignment (DSLGA), to handle these discrepancies. In the first pre-training stage, DSLGA introduces a Domain-Shared Learning Strategy (DSLS) to mitigate ineffective pre-training caused by inter-domain modality discrepancies via exploiting shared information between the source and target domains. While, in the second fine-tuning stage, DSLGA designs a Gradual Alignment Strategy (GAS) to handle the cross-modality alignment challenges between visible and infrared data caused by the large intra-domain modality discrepancies through a cluster-to-holistic alignment way. Finally, a new UDA-VI-ReID testing method i.e., CMDA-XD, is constructed for training and testing different UDA-VI-ReID models. A large amount of experiments demonstrate that our method significantly outperforms existing domain adaptation methods for VI-ReID and even some supervised methods under various settings.
☆ FOOTPASS: A Multi-Modal Multi-Agent Tactical Context Dataset for Play-by-Play Action Spotting in Soccer Broadcast Videos
Soccer video understanding has motivated the creation of datasets for tasks such as temporal action localization, spatiotemporal action detection (STAD), or multiobject tracking (MOT). The annotation of structured sequences of events (who does what, when, and where) used for soccer analytics requires a holistic approach that integrates both STAD and MOT. However, current action recognition methods remain insufficient for constructing reliable play-by-play data and are typically used to assist rather than fully automate annotation. Parallel research has advanced tactical modeling, trajectory forecasting, and performance analysis, all grounded in game-state and play-by-play data. This motivates leveraging tactical knowledge as a prior to support computer-vision-based predictions, enabling more automated and reliable extraction of play-by-play data. We introduce Footovision Play-by-Play Action Spotting in Soccer Dataset (FOOTPASS), the first benchmark for play-by-play action spotting over entire soccer matches in a multi-modal, multi-agent tactical context. It enables the development of methods for player-centric action spotting that exploit both outputs from computer-vision tasks (e.g., tracking, identification) and prior knowledge of soccer, including its tactical regularities over long time horizons, to generate reliable play-by-play data streams. These streams form an essential input for data-driven sports analytics.
☆ Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight
Recent advances in Vision-Language-Action (VLA) models demonstrate that visual signals can effectively complement sparse action supervisions. However, letting VLA directly predict high-dimensional visual states can distribute model capacity and incur prohibitive training cost, while compressing visual states into more compact supervisory signals inevitably incurs information bottlenecks. Moreover, existing methods often suffer from poor comprehension and reasoning capabilities due to the neglect of language supervision. This paper introduces Mantis, a novel framework featuring a Disentangled Visual Foresight (DVF) to tackle these issues. Specifically, Mantis decouples visual foresight prediction from the backbone with the combination of meta queries and a diffusion Transformer (DiT) head. With the current visual state provided to the DiT via a residual connection, a simple next-state prediction objective enables the meta queries to automatically capture the latent actions that delineate the visual trajectory, and hence boost the learning of explicit actions. The disentanglement reduces the burden of the VLA backbone, enabling it to maintain comprehension and reasoning capabilities through language supervision. Empirically, pretrained on human manipulation videos, robot demonstrations, and image-text pairs, Mantis achieves a 96.7% success rate on LIBERO benchmark after fine-tuning, surpassing powerful baselines while exhibiting high convergence speed. Real-world evaluations show that Mantis outperforms $π_{0.5}$, a leading open-source VLA model, particularly in instruction-following capability, generalization to unseen instructions, and reasoning ability. Code and weights are released to support the open-source community.
☆ Target Refocusing via Attention Redistribution for Open-Vocabulary Semantic Segmentation: An Explainability Perspective AAAI 2026
Open-vocabulary semantic segmentation (OVSS) employs pixel-level vision-language alignment to associate category-related prompts with corresponding pixels. A key challenge is enhancing the multimodal dense prediction capability, specifically this pixel-level multimodal alignment. Although existing methods achieve promising results by leveraging CLIP's vision-language alignment, they rarely investigate the performance boundaries of CLIP for dense prediction from an interpretability mechanisms perspective. In this work, we systematically investigate CLIP's internal mechanisms and identify a critical phenomenon: analogous to human distraction, CLIP diverts significant attention resources from target regions to irrelevant tokens. Our analysis reveals that these tokens arise from dimension-specific over-activation; filtering them enhances CLIP's dense prediction performance. Consequently, we propose ReFocusing CLIP (RF-CLIP), a training-free approach that emulates human distraction-refocusing behavior to redirect attention from distraction tokens back to target regions, thereby refining CLIP's multimodal alignment granularity. Our method achieves SOTA performance on eight benchmarks while maintaining high inference efficiency.
comment: Accepted by AAAI 2026
☆ EvoVLA: Self-Evolving Vision-Language-Action Model
Long-horizon robotic manipulation remains challenging for Vision-Language-Action (VLA) models despite recent progress in zero-shot generalization and simulation-to-real-world transfer. Current VLA models suffer from stage hallucination, where agents exploit coarse evaluation signals to shortcut multi-step tasks, reporting high progress without truly completing them. We present EvoVLA, a self-supervised VLA framework that addresses this issue through three complementary components: Stage-Aligned Reward (SAR), which uses triplet contrastive learning with Gemini-generated hard negatives to prevent visual shortcuts; Pose-Based Object Exploration (POE), which grounds curiosity in relative object-gripper pose instead of raw pixels; and Long-Horizon Memory, which uses selective context retention and gated fusion to stabilize intrinsic shaping during extended rollouts. Extensive evaluations on Discoverse-L, a long-horizon manipulation benchmark with three multi-stage tasks, show that EvoVLA improves average task success by 10.2 percentage points over the strongest baseline (OpenVLA-OFT), reaching 69.2 percent. EvoVLA also achieves one-and-a-half times better sample efficiency and reduces stage hallucination from 38.5 percent to 14.8 percent. Real-world deployment on physical robots reaches an average success rate of 54.6 percent across four manipulation tasks, outperforming OpenVLA-OFT by 11 points, demonstrating effective sim-to-real transfer and strong generalization. Code: https://github.com/AIGeeksGroup/EvoVLA. Website: https://aigeeksgroup.github.io/EvoVLA.
☆ An Image Is Worth Ten Thousand Words: Verbose-Text Induction Attacks on VLMs
With the remarkable success of Vision-Language Models (VLMs) on multimodal tasks, concerns regarding their deployment efficiency have become increasingly prominent. In particular, the number of tokens consumed during the generation process has emerged as a key evaluation metric.Prior studies have shown that specific inputs can induce VLMs to generate lengthy outputs with low information density, which significantly increases energy consumption, latency, and token costs. However, existing methods simply delay the occurrence of the EOS token to implicitly prolong output, and fail to directly maximize the output token length as an explicit optimization objective, lacking stability and controllability.To address these limitations, this paper proposes a novel verbose-text induction attack (VTIA) to inject imperceptible adversarial perturbations into benign images via a two-stage framework, which identifies the most malicious prompt embeddings for optimizing and maximizing the output token of the perturbed images.Specifically, we first perform adversarial prompt search, employing reinforcement learning strategies to automatically identify adversarial prompts capable of inducing the LLM component within VLMs to produce verbose outputs. We then conduct vision-aligned perturbation optimization to craft adversarial examples on input images, maximizing the similarity between the perturbed image's visual embeddings and those of the adversarial prompt, thereby constructing malicious images that trigger verbose text generation. Comprehensive experiments on four popular VLMs demonstrate that our method achieves significant advantages in terms of effectiveness, efficiency, and generalization capability.
☆ Layer-wise Noise Guided Selective Wavelet Reconstruction for Robust Medical Image Segmentation
Clinical deployment requires segmentation models to stay stable under distribution shifts and perturbations. The mainstream solution is adversarial training (AT) to improve robustness; however, AT often brings a clean--robustness trade-off and high training/tuning cost, which limits scalability and maintainability in medical imaging. We propose \emph{Layer-wise Noise-Guided Selective Wavelet Reconstruction (LNG-SWR)}. During training, we inject small, zero-mean noise at multiple layers to learn a frequency-bias prior that steers representations away from noise-sensitive directions. We then apply prior-guided selective wavelet reconstruction on the input/feature branch to achieve frequency adaptation: suppress noise-sensitive bands, enhance directional structures and shape cues, and stabilize boundary responses while maintaining spectral consistency. The framework is backbone-agnostic and adds low additional inference overhead. It can serve as a plug-in enhancement to AT and also improves robustness without AT. On CT and ultrasound datasets, under a unified protocol with PGD-$L_{\infty}/L_{2}$ and SSAH, LNG-SWR delivers consistent gains on clean Dice/IoU and significantly reduces the performance drop under strong attacks; combining LNG-SWR with AT yields additive gains. When combined with adversarial training, robustness improves further without sacrificing clean accuracy, indicating an engineering-friendly and scalable path to robust segmentation. These results indicate that LNG-SWR provides a simple, effective, and engineering-friendly path to robust medical image segmentation in both adversarial and standard training regimes.
☆ Simba: Towards High-Fidelity and Geometrically-Consistent Point Cloud Completion via Transformation Diffusion AAAI
Point cloud completion is a fundamental task in 3D vision. A persistent challenge in this field is simultaneously preserving fine-grained details present in the input while ensuring the global structural integrity of the completed shape. While recent works leveraging local symmetry transformations via direct regression have significantly improved the preservation of geometric structure details, these methods suffer from two major limitations: (1) These regression-based methods are prone to overfitting which tend to memorize instant-specific transformations instead of learning a generalizable geometric prior. (2) Their reliance on point-wise transformation regression lead to high sensitivity to input noise, severely degrading their robustness and generalization. To address these challenges, we introduce Simba, a novel framework that reformulates point-wise transformation regression as a distribution learning problem. Our approach integrates symmetry priors with the powerful generative capabilities of diffusion models, avoiding instance-specific memorization while capturing robust geometric structures. Additionally, we introduce a hierarchical Mamba-based architecture to achieve high-fidelity upsampling. Extensive experiments across the PCN, ShapeNet, and KITTI benchmarks validate our method's state-of-the-art (SOTA) performance.
comment: Accepted for publication at the 40th AAAI Conference on Artificial Intelligence (AAAI-26)
☆ Video2Layout: Recall and Reconstruct Metric-Grounded Cognitive Map for Spatial Reasoning
Spatial intelligence is a critical frontier for Multimodal Large Language Models (MLLMs), empowering them to comprehend the physical world. Drawing inspiration from human perception mechanisms, existing studies attempt to construct a coherent spatial understanding via grid-based cognitive maps from multi-frame visual inputs. However, current grid-based map methods rely on discretized raster representations, which limit the model's ability in fine-grained spatial reasoning. To overcome this limitation, we propose Video2Layout, a framework for reconstructing metric-grounded spatial layouts from video. The framework employs continuous object boundary coordinates to quantify inter-object physical distances and object size. This empowers the model with quantitative spatial computation capabilities, effectively alleviating the inherent ambiguity when describing spatial relationships in natural language. Specifically, our method comprises two core stages. First, in supervised fine-tuning stage, we construct a high-quality dataset from the AI2THOR simulator, which enables the model to learn the mapping from visual inputs to precise boundary coordinates. Subsequently, a reinforcement fine-tuning stage further enhances the model's real-world generalization capabilities. To systematically evaluate the correlation between cognitive map accuracy and image quantity, as well as how the quantity of image inputs affects spatial reasoning accuracy, we introduce QVS-Bench, a diagnostic benchmark designed to analyze the relevant mechanisms. Evaluated on QVS-Bench and mainstream spatial reasoning benchmarks, our model, V2LO-7B achieves an average improvement of 4.92% over the model trained on grid maps, validating the superiority of our method. Our code is available at https://github.com/ybrrraway/Video2Layout.
☆ Pluggable Pruning with Contiguous Layer Distillation for Diffusion Transformers
Diffusion Transformers (DiTs) have shown exceptional performance in image generation, yet their large parameter counts incur high computational costs, impeding deployment in resource-constrained settings. To address this, we propose Pluggable Pruning with Contiguous Layer Distillation (PPCL), a flexible structured pruning framework specifically designed for DiT architectures. First, we identify redundant layer intervals through a linear probing mechanism combined with the first-order differential trend analysis of similarity metrics. Subsequently, we propose a plug-and-play teacher-student alternating distillation scheme tailored to integrate depth-wise and width-wise pruning within a single training phase. This distillation framework enables flexible knowledge transfer across diverse pruning ratios, eliminating the need for per-configuration retraining. Extensive experiments on multiple Multi-Modal Diffusion Transformer architecture models demonstrate that PPCL achieves a 50\% reduction in parameter count compared to the full model, with less than 3\% degradation in key objective metrics. Notably, our method maintains high-quality image generation capabilities while achieving higher compression ratios, rendering it well-suited for resource-constrained environments. The open-source code, checkpoints for PPCL can be found at the following link: https://github.com/OPPO-Mente-Lab/Qwen-Image-Pruning.
comment: https://github.com/OPPO-Mente-Lab/Qwen-Image-Pruning
☆ Reasoning Guided Embeddings: Leveraging MLLM Reasoning for Improved Multimodal Retrieval
Multimodal embeddings are widely used in downstream tasks such as multimodal retrieval, enabling alignment of interleaved modalities in a shared representation space. While recent studies show that Multimodal Large Language Models (MLLMs) can serve as strong embedding extractors, existing approaches treat embedding extraction as a direct encoding step, overlooking the fact that MLLMs possess the generative capability for reasoning that could be leveraged to enhance representation quality. In this work, we explore how to explicitly incorporate reasoning into the embedding process. To this end, we propose Reasoning Guided Embeddings (RGE), which preserves the generative rationale process of MLLMs and couples it with contrastive training. Our method first enables the model to perform structured rationale generation conditioned on the instruction, and then extracts representations after reasoning has unfolded. This simple design enhances the context-conditional inference signals within the embedding, leading to improved multimodal representation quality. Experiments on the MMEB benchmark show that reasoning-guided conditioning improves multimodal retrieval performance by 4.9% over the non-reasoning baseline, confirming that explicit reasoning can effectively enhance embedding quality.
☆ LEGO-SLAM: Language-Embedded Gaussian Optimization SLAM
Recent advances in 3D Gaussian Splatting (3DGS) have enabled Simultaneous Localization and Mapping (SLAM) systems to build photorealistic maps. However, these maps lack the open-vocabulary semantic understanding required for advanced robotic interaction. Integrating language features into SLAM remains a significant challenge, as storing high-dimensional features demands excessive memory and rendering overhead, while existing methods with static models lack adaptability for novel environments. To address these limitations, we propose LEGO-SLAM (Language-Embedded Gaussian Optimization SLAM), the first framework to achieve real-time, open-vocabulary mapping within a 3DGS-based SLAM system. At the core of our method is a scene-adaptive encoder-decoder that distills high-dimensional language embeddings into a compact 16-dimensional feature space. This design reduces the memory per Gaussian and accelerates rendering, enabling real-time performance. Unlike static approaches, our encoder adapts online to unseen scenes. These compact features also enable a language-guided pruning strategy that identifies semantic redundancy, reducing the map's Gaussian count by over 60\% while maintaining rendering quality. Furthermore, we introduce a language-based loop detection approach that reuses these mapping features, eliminating the need for a separate detection model. Extensive experiments demonstrate that LEGO-SLAM achieves competitive mapping quality and tracking accuracy, all while providing open-vocabulary capabilities at 15 FPS.
comment: 18 pages
☆ A Spatial Semantics and Continuity Perception Attention for Remote Sensing Water Body Change Detection
Remote sensing Water Body Change Detection (WBCD) aims to detect water body surface changes from bi-temporal images of the same geographic area. Recently, the scarcity of high spatial resolution datasets for WBCD restricts its application in urban and rural regions, which require more accurate positioning. Meanwhile, previous deep learning-based methods fail to comprehensively exploit the spatial semantic and structural information in deep features in the change detection networks. To resolve these concerns, we first propose a new dataset, HSRW-CD, with a spatial resolution higher than 3 meters for WBCD. Specifically, it contains a large number of image pairs, widely covering various water body types. Besides, a Spatial Semantics and Continuity Perception (SSCP) attention module is designed to fully leverage both the spatial semantics and structure of deep features in the WBCD networks, significantly improving the discrimination capability for water body. The proposed SSCP has three components: the Multi-Semantic spatial Attention (MSA), the Structural Relation-aware Global Attention (SRGA), and the Channel-wise Self-Attention (CSA). The MSA enhances the spatial semantics of water body features and provides precise spatial semantic priors for the CSA. Then, the SRGA further extracts spatial structure to learn the spatial continuity of the water body. Finally, the CSA utilizes the spatial semantic and structural priors from the MSA and SRGA to compute the similarity across channels. Specifically designed as a plug-and-play module for water body deep features, the proposed SSCP allows integration into existing WBCD models. Numerous experiments conducted on the proposed HSRW-CD and Water-CD datasets validate the effectiveness and generalization of the SSCP. The code of this work and the HSRW-CD dataset will be accessed at https://github.com/QingMa1/SSCP.
☆ Real-Time 3D Object Detection with Inference-Aligned Learning AAAI 2026
Real-time 3D object detection from point clouds is essential for dynamic scene understanding in applications such as augmented reality, robotics and navigation. We introduce a novel Spatial-prioritized and Rank-aware 3D object detection (SR3D) framework for indoor point clouds, to bridge the gap between how detectors are trained and how they are evaluated. This gap stems from the lack of spatial reliability and ranking awareness during training, which conflicts with the ranking-based prediction selection used as inference. Such a training-inference gap hampers the model's ability to learn representations aligned with inference-time behavior. To address the limitation, SR3D consists of two components tailored to the spatial nature of point clouds during training: a novel spatial-prioritized optimal transport assignment that dynamically emphasizes well-located and spatially reliable samples, and a rank-aware adaptive self-distillation scheme that adaptively injects ranking perception via a self-distillation paradigm. Extensive experiments on ScanNet V2 and SUN RGB-D show that SR3D effectively bridges the training-inference gap and significantly outperforms prior methods in accuracy while maintaining real-time speed.
comment: Accepted by AAAI 2026
☆ Degradation-Aware Hierarchical Termination for Blind Quality Enhancement of Compressed Video
Existing studies on Quality Enhancement for Compressed Video (QECV) predominantly rely on known Quantization Parameters (QPs), employing distinct enhancement models per QP setting, termed non-blind methods. However, in real-world scenarios involving transcoding or transmission, QPs may be partially or entirely unknown, limiting the applicability of such approaches and motivating the development of blind QECV techniques. Current blind methods generate degradation vectors via classification models with cross-entropy loss, using them as channel attention to guide artifact removal. However, these vectors capture only global degradation information and lack spatial details, hindering adaptation to varying artifact patterns at different spatial positions. To address these limitations, we propose a pretrained Degradation Representation Learning (DRL) module that decouples and extracts high-dimensional, multiscale degradation representations from video content to guide the artifact removal. Additionally, both blind and non-blind methods typically employ uniform architectures across QPs, hence, overlooking the varying computational demands inherent to different compression levels. We thus introduce a hierarchical termination mechanism that dynamically adjusts the number of artifact reduction stages based on the compression level. Experimental results demonstrate that the proposed approach significantly enhances performance, achieving a PSNR improvement of 110% (from 0.31 dB to 0.65 dB) over a competing state-of-the-art blind method at QP = 22. Furthermore, the proposed hierarchical termination mechanism reduces the average inference time at QP = 22 by half compared to QP = 42.
☆ How Noise Benefits AI-generated Image Detection
The rapid advancement of generative models has made real and synthetic images increasingly indistinguishable. Although extensive efforts have been devoted to detecting AI-generated images, out-of-distribution generalization remains a persistent challenge. We trace this weakness to spurious shortcuts exploited during training and we also observe that small feature-space perturbations can mitigate shortcut dominance. To address this problem in a more controllable manner, we propose the Positive-Incentive Noise for CLIP (PiN-CLIP), which jointly trains a noise generator and a detection network under a variational positive-incentive principle. Specifically, we construct positive-incentive noise in the feature space via cross-attention fusion of visual and categorical semantic features. During optimization, the noise is injected into the feature space to fine-tune the visual encoder, suppressing shortcut-sensitive directions while amplifying stable forensic cues, thereby enabling the extraction of more robust and generalized artifact representations. Comparative experiments are conducted on an open-world dataset comprising synthetic images generated by 42 distinct generative models. Our method achieves new state-of-the-art performance, with notable improvements of 5.4 in average accuracy over existing approaches.
☆ VTinker: Guided Flow Upsampling and Texture Mapping for High-Resolution Video Frame Interpolation AAAI 2026
Due to large pixel movement and high computational cost, estimating the motion of high-resolution frames is challenging. Thus, most flow-based Video Frame Interpolation (VFI) methods first predict bidirectional flows at low resolution and then use high-magnification upsampling (e.g., bilinear) to obtain the high-resolution ones. However, this kind of upsampling strategy may cause blur or mosaic at the flows' edges. Additionally, the motion of fine pixels at high resolution cannot be adequately captured in motion estimation at low resolution, which leads to the misalignment of task-oriented flows. With such inaccurate flows, input frames are warped and combined pixel-by-pixel, resulting in ghosting and discontinuities in the interpolated frame. In this study, we propose a novel VFI pipeline, VTinker, which consists of two core components: guided flow upsampling (GFU) and Texture Mapping. After motion estimation at low resolution, GFU introduces input frames as guidance to alleviate the blurring details in bilinear upsampling flows, which makes flows' edges clearer. Subsequently, to avoid pixel-level ghosting and discontinuities, Texture Mapping generates an initial interpolated frame, referred to as the intermediate proxy. The proxy serves as a cue for selecting clear texture blocks from the input frames, which are then mapped onto the proxy to facilitate producing the final interpolated frame via a reconstruction module. Extensive experiments demonstrate that VTinker achieves state-of-the-art performance in VFI. Codes are available at: https://github.com/Wucy0519/VTinker.
comment: Accepted by AAAI 2026
☆ Decoupling Complexity from Scale in Latent Diffusion Model
Existing latent diffusion models typically couple scale with content complexity, using more latent tokens to represent higher-resolution images or higher-frame rate videos. However, the latent capacity required to represent visual data primarily depends on content complexity, with scale serving only as an upper bound. Motivated by this observation, we propose DCS-LDM, a novel paradigm for visual generation that decouples information complexity from scale. DCS-LDM constructs a hierarchical, scale-independent latent space that models sample complexity through multi-level tokens and supports decoding to arbitrary resolutions and frame rates within a fixed latent representation. This latent space enables DCS-LDM to achieve a flexible computation-quality tradeoff. Furthermore, by decomposing structural and detailed information across levels, DCS-LDM supports a progressive coarse-to-fine generation paradigm. Experimental results show that DCS-LDM delivers performance comparable to state-of-the-art methods while offering flexible generation across diverse scales and visual qualities.
comment: 15 pages, 16 figures
☆ Clustered Error Correction with Grouped 4D Gaussian Splatting SIGGRAPH
Existing 4D Gaussian Splatting (4DGS) methods struggle to accurately reconstruct dynamic scenes, often failing to resolve ambiguous pixel correspondences and inadequate densification in dynamic regions. We address these issues by introducing a novel method composed of two key components: (1) Elliptical Error Clustering and Error Correcting Splat Addition that pinpoints dynamic areas to improve and initialize fitting splats, and (2) Grouped 4D Gaussian Splatting that improves consistency of mapping between splats and represented dynamic objects. Specifically, we classify rendering errors into missing-color and occlusion types, then apply targeted corrections via backprojection or foreground splitting guided by cross-view color consistency. Evaluations on Neural 3D Video and Technicolor datasets demonstrate that our approach significantly improves temporal consistency and achieves state-of-the-art perceptual rendering quality, improving 0.39dB of PSNR on the Technicolor Light Field dataset. Our visualization shows improved alignment between splats and dynamic objects, and the error correction method's capability to identify errors and properly initialize new splats. Our implementation details and source code are available at https://github.com/tho-kn/cem-4dgs.
comment: 16 pages, 8 figures, SIGGRAPH Asia Conference Papers 2025
☆ T2T-VICL: Unlocking the Boundaries of Cross-Task Visual In-Context Learning via Implicit Text-Driven VLMs
In large language models (LLM), in-context learning (ICL) refers to performing new tasks by conditioning on small demonstrations provided in the input context. Recent advances in visual in-context learning (VICL) demonstrate promising capabilities for solving downstream tasks by unified vision-language models (VLMs). When the visual prompt and the target images originate from different visual tasks, can VLMs still enable VICL? In the paper, we propose a fully collaborative pipeline, i.e. T2T-VICL, for VLMs to investigate the potential of cross-task VICL. Fundamentally, we design a mechanism to generate and select text prompts that best implicitly describe the differences between two distinct low-level vision tasks, and construct the first cross-task VICL dataset. Building upon this, we propose a novel inference framework that combines perceptual score-based reasoning with traditional evaluation metrics to perform cross-task VICL. Our approach achieves top-tier results across nine cross-task scenarios and second-tier performance in ten additional scenarios, unlocking the boundaries of cross-task VICL within VLMs.
☆ Rad-GS: Radar-Vision Integration for 3D Gaussian Splatting SLAM in Outdoor Environments
We present Rad-GS, a 4D radar-camera SLAM system designed for kilometer-scale outdoor environments, utilizing 3D Gaussian as a differentiable spatial representation. Rad-GS combines the advantages of raw radar point cloud with Doppler information and geometrically enhanced point cloud to guide dynamic object masking in synchronized images, thereby alleviating rendering artifacts and improving localization accuracy. Additionally, unsynchronized image frames are leveraged to globally refine the 3D Gaussian representation, enhancing texture consistency and novel view synthesis fidelity. Furthermore, the global octree structure coupled with a targeted Gaussian primitive management strategy further suppresses noise and significantly reduces memory consumption in large-scale environments. Extensive experiments and ablation studies demonstrate that Rad-GS achieves performance comparable to traditional 3D Gaussian methods based on camera or LiDAR inputs, highlighting the feasibility of robust outdoor mapping using 4D mmWave radar. Real-world reconstruction at kilometer scale validates the potential of Rad-GS for large-scale scene reconstruction.
☆ SpectralTrain: A Universal Framework for Hyperspectral Image Classification
Hyperspectral image (HSI) classification typically involves large-scale data and computationally intensive training, which limits the practical deployment of deep learning models in real-world remote sensing tasks. This study introduces SpectralTrain, a universal, architecture-agnostic training framework that enhances learning efficiency by integrating curriculum learning (CL) with principal component analysis (PCA)-based spectral downsampling. By gradually introducing spectral complexity while preserving essential information, SpectralTrain enables efficient learning of spectral -- spatial patterns at significantly reduced computational costs. The framework is independent of specific architectures, optimizers, or loss functions and is compatible with both classical and state-of-the-art (SOTA) models. Extensive experiments on three benchmark datasets -- Indian Pines, Salinas-A, and the newly introduced CloudPatch-7 -- demonstrate strong generalization across spatial scales, spectral characteristics, and application domains. The results indicate consistent reductions in training time by 2-7x speedups with small-to-moderate accuracy deltas depending on backbone. Its application to cloud classification further reveals potential in climate-related remote sensing, emphasizing training strategy optimization as an effective complement to architectural design in HSI models. Code is available at https://github.com/mh-zhou/SpectralTrain.
☆ VideoSeg-R1:Reasoning Video Object Segmentation via Reinforcement Learning
Traditional video reasoning segmentation methods rely on supervised fine-tuning, which limits generalization to out-of-distribution scenarios and lacks explicit reasoning. To address this, we propose \textbf{VideoSeg-R1}, the first framework to introduce reinforcement learning into video reasoning segmentation. It adopts a decoupled architecture that formulates the task as joint referring image segmentation and video mask propagation. It comprises three stages: (1) A hierarchical text-guided frame sampler to emulate human attention; (2) A reasoning model that produces spatial cues along with explicit reasoning chains; and (3) A segmentation-propagation stage using SAM2 and XMem. A task difficulty-aware mechanism adaptively controls reasoning length for better efficiency and accuracy. Extensive evaluations on multiple benchmarks demonstrate that VideoSeg-R1 achieves state-of-the-art performance in complex video reasoning and segmentation tasks. The code will be publicly available at https://github.com/euyis1019/VideoSeg-R1.
☆ LiSTAR: Ray-Centric World Models for 4D LiDAR Sequences in Autonomous Driving
Synthesizing high-fidelity and controllable 4D LiDAR data is crucial for creating scalable simulation environments for autonomous driving. This task is inherently challenging due to the sensor's unique spherical geometry, the temporal sparsity of point clouds, and the complexity of dynamic scenes. To address these challenges, we present LiSTAR, a novel generative world model that operates directly on the sensor's native geometry. LiSTAR introduces a Hybrid-Cylindrical-Spherical (HCS) representation to preserve data fidelity by mitigating quantization artifacts common in Cartesian grids. To capture complex dynamics from sparse temporal data, it utilizes a Spatio-Temporal Attention with Ray-Centric Transformer (START) that explicitly models feature evolution along individual sensor rays for robust temporal coherence. Furthermore, for controllable synthesis, we propose a novel 4D point cloud-aligned voxel layout for conditioning and a corresponding discrete Masked Generative START (MaskSTART) framework, which learns a compact, tokenized representation of the scene, enabling efficient, high-resolution, and layout-guided compositional generation. Comprehensive experiments validate LiSTAR's state-of-the-art performance across 4D LiDAR reconstruction, prediction, and conditional generation, with substantial quantitative gains: reducing generation MMD by a massive 76%, improving reconstruction IoU by 32%, and lowering prediction L1 Med by 50%. This level of performance provides a powerful new foundation for creating realistic and controllable autonomous systems simulations. Project link: https://ocean-luna.github.io/LiSTAR.gitub.io.
☆ AMS-KV: Adaptive KV Caching in Multi-Scale Visual Autoregressive Transformers
Visual autoregressive modeling (VAR) via next-scale prediction has emerged as a scalable image generation paradigm. While Key and Value (KV) caching in large language models (LLMs) has been extensively studied, next-scale prediction presents unique challenges, and KV caching design for next-scale based VAR transformers remains largely unexplored. A major bottleneck is the excessive KV memory growth with the increasing number of scales-severely limiting scalability. Our systematic investigation reveals that: (1) Attending to tokens from local scales significantly contributes to generation quality (2) Allocating a small amount of memory for the coarsest scales, termed as condensed scales, stabilizes multi-scale image generation (3) Strong KV similarity across finer scales is predominantly observed in cache-efficient layers, whereas cache-demanding layers exhibit weaker inter-scale similarity. Based on the observations, we introduce AMS-KV, a scale-adaptive KV caching policy for next-scale prediction in VAR models. AMS-KV prioritizes storing KVs from condensed and local scales, preserving the most relevant tokens to maintain generation quality. It further optimizes KV cache utilization and computational efficiency identifying cache-demanding layers through inter-scale similarity analysis. Compared to the vanilla next-scale prediction-based VAR models, AMS-KV reduces KV cache usage by up to 84.83% and self-attention latency by 60.48%. Moreover, when the baseline VAR-d30 model encounters out-of-memory failures at a batch size of 128, AMS-KV enables stable scaling to a batch size of 256 with improved throughput.
☆ LLMs-based Augmentation for Domain Adaptation in Long-tailed Food Datasets
Training a model for food recognition is challenging because the training samples, which are typically crawled from the Internet, are visually different from the pictures captured by users in the free-living environment. In addition to this domain-shift problem, the real-world food datasets tend to be long-tailed distributed and some dishes of different categories exhibit subtle variations that are difficult to distinguish visually. In this paper, we present a framework empowered with large language models (LLMs) to address these challenges in food recognition. We first leverage LLMs to parse food images to generate food titles and ingredients. Then, we project the generated texts and food images from different domains to a shared embedding space to maximize the pair similarities. Finally, we take the aligned features of both modalities for recognition. With this simple framework, we show that our proposed approach can outperform the existing approaches tailored for long-tailed data distribution, domain adaptation, and fine-grained classification, respectively, on two food datasets.
☆ Crossmodal learning for Crop Canopy Trait Estimation
Recent advances in plant phenotyping have driven widespread adoption of multi sensor platforms for collecting crop canopy reflectance data. This includes the collection of heterogeneous data across multiple platforms, with Unmanned Aerial Vehicles (UAV) seeing significant usage due to their high performance in crop monitoring, forecasting, and prediction tasks. Similarly, satellite missions have been shown to be effective for agriculturally relevant tasks. In contrast to UAVs, such missions are bound to the limitation of spatial resolution, which hinders their effectiveness for modern farming systems focused on micro-plot management. In this work, we propose a cross modal learning strategy that enriches high-resolution satellite imagery with UAV level visual detail for crop canopy trait estimation. Using a dataset of approximately co registered satellite UAV image pairs collected from replicated plots of 84 hybrid maize varieties across five distinct locations in the U.S. Corn Belt, we train a model that learns fine grained spectral spatial correspondences between sensing modalities. Results show that the generated UAV-like representations from satellite inputs consistently outperform real satellite imagery on multiple downstream tasks, including yield and nitrogen prediction, demonstrating the potential of cross-modal correspondence learning to bridge the gap between satellite and UAV sensing in agricultural monitoring.
comment: 18 pages, 7 figures
☆ CuriGS: Curriculum-Guided Gaussian Splatting for Sparse View Synthesis
3D Gaussian Splatting (3DGS) has recently emerged as an efficient, high-fidelity representation for real-time scene reconstruction and rendering. However, extending 3DGS to sparse-view settings remains challenging because of supervision scarcity and overfitting caused by limited viewpoint coverage. In this paper, we present CuriGS, a curriculum-guided framework for sparse-view 3D reconstruction using 3DGS. CuriGS addresses the core challenge of sparse-view synthesis by introducing student views: pseudo-views sampled around ground-truth poses (teacher). For each teacher, we generate multiple groups of student views with different perturbation levels. During training, we follow a curriculum schedule that gradually unlocks higher perturbation level, randomly sampling candidate students from the active level to assist training. Each sampled student is regularized via depth-correlation and co-regularization, and evaluated using a multi-signal metric that combines SSIM, LPIPS, and an image-quality measure. For every teacher and perturbation level, we periodically retain the best-performing students and promote those that satisfy a predefined quality threshold to the training set, resulting in a stable augmentation of sparse training views. Experimental results show that CuriGS outperforms state-of-the-art baselines in both rendering fidelity and geometric consistency across various synthetic and real sparse-view scenes. Project page: https://zijian1026.github.io/CuriGS/
♻ ☆ LightFusion: A Light-weighted, Double Fusion Framework for Unified Multimodal Understanding and Generation
Unified multimodal models have recently shown remarkable gains in both capability and versatility, yet most leading systems are still trained from scratch and require substantial computational resources. In this paper, we show that competitive performance can be obtained far more efficiently by strategically fusing publicly available models specialized for either generation or understanding. Our key design is to retain the original blocks while additionally interleaving multimodal self-attention blocks throughout the networks. This double fusion mechanism (1) effectively enables rich multi-modal fusion while largely preserving the original strengths of the base models, and (2) catalyzes synergistic fusion of high-level semantic representations from the understanding encoder with low-level spatial signals from the generation encoder. By training with only ~ 35B tokens, this approach achieves strong results across multiple benchmarks: 0.91 on GenEval for compositional text-to-image generation, 82.16 on DPG-Bench for complex text-to-image generation, 6.06 on GEditBench, and 3.77 on ImgEdit-Bench for image editing. By fully releasing the entire suite of code, model weights, and datasets, we hope to support future research on unified multimodal modeling.
comment: Preprint. Work in progress
♻ ☆ Sigma: Semantically Informative Pre-training for Skeleton-based Sign Language Understanding
Pre-training has proven effective for learning transferable features in sign language understanding (SLU) tasks. Recently, skeleton-based methods have gained increasing attention because they can robustly handle variations in subjects and backgrounds without being affected by appearance or environmental factors. Current SLU methods continue to face three key limitations: 1) weak semantic grounding, as models often capture low-level motion patterns from skeletal data but struggle to relate them to linguistic meaning; 2) imbalance between local details and global context, with models either focusing too narrowly on fine-grained cues or overlooking them for broader context; and 3) inefficient cross-modal learning, as constructing semantically aligned representations across modalities remains difficult. To address these, we propose Sigma, a unified skeleton-based SLU framework featuring: 1) a sign-aware early fusion mechanism that facilitates deep interaction between visual and textual modalities, enriching visual features with linguistic context; 2) a hierarchical alignment learning strategy that jointly maximises agreements across different levels of paired features from different modalities, effectively capturing both fine-grained details and high-level semantic relationships; and 3) a unified pre-training framework that combines contrastive learning, text matching and language modelling to promote semantic consistency and generalisation. Sigma achieves new state-of-the-art results on isolated sign language recognition, continuous sign language recognition, and gloss-free sign language translation on multiple benchmarks spanning different sign and spoken languages, demonstrating the impact of semantically informative pre-training and the effectiveness of skeletal data as a stand-alone solution for SLU.
♻ ☆ Beyond Patches: Mining Interpretable Part-Prototypes for Explainable AI
As AI systems grow more capable, it becomes increasingly important that their decisions remain understandable and aligned with human expectations. A key challenge is the limited interpretability of deep models. Post-hoc methods like GradCAM offer heatmaps but provide limited conceptual insight, while prototype-based approaches offer example-based explanations but often rely on rigid region selection and lack semantic consistency. To address these limitations, we propose PCMNet, a part-prototypical concept mining network that learns human-comprehensible prototypes from meaningful image regions without additional supervision. By clustering these prototypes into concept groups and extracting concept activation vectors, PCMNet provides structured, concept-level explanations and enhances robustness to occlusion and challenging conditions, which are both critical for building reliable and aligned AI systems. Experiments across multiple image classification benchmarks show that PCMNet outperforms state-of-the-art methods in interpretability, stability, and robustness. This work contributes to AI alignment by enhancing transparency, controllability, and trustworthiness in AI systems. Our code is available at: https://github.com/alehdaghi/PCMNet.
♻ ☆ On Geometry-Enhanced Parameter-Efficient Fine-Tuning for 3D Scene Segmentation
The emergence of large-scale pre-trained point cloud models has significantly advanced 3D scene understanding, but adapting these models to specific downstream tasks typically demands full fine-tuning, incurring high computational and storage costs. Parameter-efficient fine-tuning (PEFT) techniques, successful in natural language processing and 2D vision tasks, would underperform when naively applied to 3D point cloud models due to significant geometric and spatial distribution shifts. Existing PEFT methods commonly treat points as orderless tokens, neglecting important local spatial structures and global geometric contexts in 3D modeling. To bridge this gap, we introduce the Geometric Encoding Mixer (GEM), a novel geometry-aware PEFT module specifically designed for 3D point cloud transformers. GEM explicitly integrates fine-grained local positional encodings with a lightweight latent attention mechanism to capture comprehensive global context, thereby effectively addressing the spatial and geometric distribution mismatch. Extensive experiments demonstrate that GEM achieves performance comparable to or sometimes even exceeding full fine-tuning, while only updating 1.6% of the model's parameters, fewer than other PEFT methods. With significantly reduced training time and memory requirements, our approach thus sets a new benchmark for efficient, scalable, and geometry-aware fine-tuning of large-scale 3D point cloud models. Code is available at https://github.com/LiyaoTang/GEM.
comment: Neurips 2025; available at https://github.com/LiyaoTang/GEM
♻ ☆ Self-Supervised Discriminative Feature Learning for Deep Multi-View Clustering
Multi-view clustering is an important research topic due to its capability to utilize complementary information from multiple views. However, there are few methods to consider the negative impact caused by certain views with unclear clustering structures, resulting in poor multi-view clustering performance. To address this drawback, we propose self-supervised discriminative feature learning for deep multi-view clustering (SDMVC). Concretely, deep autoencoders are applied to learn embedded features for each view independently. To leverage the multi-view complementary information, we concatenate all views' embedded features to form the global features, which can overcome the negative impact of some views' unclear clustering structures. In a self-supervised manner, pseudo-labels are obtained to build a unified target distribution to perform multi-view discriminative feature learning. During this process, global discriminative information can be mined to supervise all views to learn more discriminative features, which in turn are used to update the target distribution. Besides, this unified target distribution can make SDMVC learn consistent cluster assignments, which accomplishes the clustering consistency of multiple views while preserving their features' diversity. Experiments on various types of multi-view datasets show that SDMVC outperforms 14 competitors including classic and state-of-the-art methods. The code is available at https://github.com/SubmissionsIn/SDMVC.
♻ ☆ Context-Aware Multimodal Representation Learning for Spatio-Temporally Explicit Environmental Modelling
Earth observation (EO) foundation models have emerged as an effective approach to derive latent representations of the Earth system from various remote sensing sensors. These models produce embeddings that can be used as analysis-ready datasets, enabling the modelling of ecosystem dynamics without extensive sensor-specific preprocessing. However, existing models typically operate at fixed spatial or temporal scales, limiting their use for ecological analyses that require both fine spatial detail and high temporal fidelity. To overcome these limitations, we propose a representation learning framework that integrates different EO modalities into a unified feature space at high spatio-temporal resolution. We introduce the framework using Sentinel-1 and Sentinel-2 data as representative modalities. Our approach produces a latent space at native 10 m resolution and the temporal frequency of cloud-free Sentinel-2 acquisitions. Each sensor is first modeled independently to capture its sensor-specific characteristics. Their representations are then combined into a shared model. This two-stage design enables modality-specific optimisation and easy extension to new sensors, retaining pretrained encoders while retraining only fusion layers. This enables the model to capture complementary remote sensing data and to preserve coherence across space and time. Qualitative analyses reveal that the learned embeddings exhibit high spatial and semantic consistency across heterogeneous landscapes. Quantitative evaluation in modelling Gross Primary Production reveals that they encode ecologically meaningful patterns and retain sufficient temporal fidelity to support fine-scale analyses. Overall, the proposed framework provides a flexible, analysis-ready representation learning approach for environmental applications requiring diverse spatial and temporal resolutions.
comment: 10 pages (incliding 2 pages of references), 7 figures
♻ ☆ vMFCoOp: Towards Equilibrium on a Unified Hyperspherical Manifold for Prompting Biomedical VLMs AAAI 2026
Recent advances in context optimization (CoOp) guided by large language model (LLM)-distilled medical semantic priors offer a scalable alternative to manual prompt engineering and full fine-tuning for adapting biomedical CLIP-based vision-language models (VLMs). However, prompt learning in this context is challenged by semantic misalignment between LLMs and CLIP variants due to divergent training corpora and model architectures; it further lacks scalability across continuously evolving families of foundation models. More critically, pairwise multimodal alignment via conventional Euclidean-space optimization lacks the capacity to model unified representations or apply localized geometric constraints, which tends to amplify modality gaps in complex biomedical imaging and destabilize few-shot adaptation. In this work, we propose vMFCoOp, a framework that inversely estimates von Mises-Fisher (vMF) distributions on a shared Hyperspherical Manifold, aligning semantic biases between arbitrary LLMs and CLIP backbones via Unified Semantic Anchors to achieve robust biomedical prompting and superior few-shot classification. Grounded in three complementary constraints, vMFCoOp demonstrates consistent improvements across 14 medical datasets, 12 medical imaging modalities, and 13 anatomical regions, outperforming state-of-the-art methods in accuracy, generalization, and clinical applicability. This work aims to continuously expand to encompass more downstream applications, and the corresponding resources are intended to be shared through https://github.com/VinyehShaw/UniEqui.
comment: Accepted as an Oral Presentation at AAAI 2026 Main Technical Track (this version is not peer-reviewed; it is the extended version)
♻ ☆ DiffuSyn Bench: Evaluating Vision-Language Models on Real-World Complexities with Diffusion-Generated Synthetic Benchmarks
This study assesses the ability of Large Vision-Language Models (LVLMs) to differentiate between AI-generated and human-generated images. It introduces a new automated benchmark construction method for this evaluation. The experiment compared common LVLMs with human participants using a mixed dataset of AI and human-created images. Results showed that LVLMs could distinguish between the image types to some extent but exhibited a rightward bias, and perform significantly worse compared to humans. To build on these findings, we developed an automated benchmark construction process using AI. This process involved topic retrieval, narrative script generation, error embedding, and image generation, creating a diverse set of text-image pairs with intentional errors. We validated our method through constructing two caparable benchmarks. This study highlights the strengths and weaknesses of LVLMs in real-world understanding and advances benchmark construction techniques, providing a scalable and automatic approach for AI model evaluation.
♻ ☆ A Decade of You Only Look Once (YOLO) for Object Detection: A Review
This review marks the tenth anniversary of You Only Look Once (YOLO), one of the most influential frameworks in real-time object detection. Over the past decade, YOLO has evolved from a streamlined detector into a diverse family of architectures characterized by efficient design, modular scalability, and cross-domain adaptability. The paper presents a technical overview of the main versions (from YOLOv1 to YOLOv13), highlights key architectural trends, and surveys the principal application areas in which YOLO has been adopted. It also addresses evaluation practices, ethical considerations, and potential future directions for the framework's continued development. The analysis aims to provide a comprehensive and critical perspective on YOLO's trajectory and ongoing transformation.
♻ ☆ Active Measurement: Efficient Estimation at Scale NeurIPS 2025
AI has the potential to transform scientific discovery by analyzing vast datasets with little human effort. However, current workflows often do not provide the accuracy or statistical guarantees that are needed. We introduce active measurement, a human-in-the-loop AI framework for scientific measurement. An AI model is used to predict measurements for individual units, which are then sampled for human labeling using importance sampling. With each new set of human labels, the AI model is improved and an unbiased Monte Carlo estimate of the total measurement is refined. Active measurement can provide precise estimates even with an imperfect AI model, and requires little human effort when the AI model is very accurate. We derive novel estimators, weighting schemes, and confidence intervals, and show that active measurement reduces estimation error compared to alternatives in several measurement tasks.
comment: NeurIPS 2025
♻ ☆ Unsupervised Discovery of Long-Term Spatiotemporal Periodic Workflows in Human Activities WACV 2026
Periodic human activities with implicit workflows are common in manufacturing, sports, and daily life. While short-term periodic activities -- characterized by simple structures and high-contrast patterns -- have been widely studied, long-term periodic workflows with low-contrast patterns remain largely underexplored. To bridge this gap, we introduce the first benchmark comprising 580 multimodal human activity sequences featuring long-term periodic workflows. The benchmark supports three evaluation tasks aligned with real-world applications: unsupervised periodic workflow detection, task completion tracking, and procedural anomaly detection. We also propose a lightweight, training-free baseline for modeling diverse periodic workflow patterns. Experiments show that: (i) our benchmark presents significant challenges to both unsupervised periodic detection methods and zero-shot approaches based on powerful large language models (LLMs); (ii) our baseline outperforms competing methods by a substantial margin in all evaluation tasks; and (iii) in real-world applications, our baseline demonstrates deployment advantages on par with traditional supervised workflow detection approaches, eliminating the need for annotation and retraining. Our project page is https://sites.google.com/view/periodicworkflow.
comment: accepted to WACV 2026
♻ ☆ Learning to Detect Unknown Jailbreak Attacks in Large Vision-Language Models
Despite extensive alignment efforts, Large Vision-Language Models (LVLMs) remain vulnerable to jailbreak attacks, posing serious safety risks. To address this, existing detection methods either learn attack-specific parameters, which hinders generalization to unseen attacks, or rely on heuristically sound principles, which limit accuracy and efficiency. To overcome these limitations, we propose Learning to Detect (LoD), a general framework that accurately detects unknown jailbreak attacks by shifting the focus from attack-specific learning to task-specific learning. This framework includes a Multi-modal Safety Concept Activation Vector module for safety-oriented representation learning and a Safety Pattern Auto-Encoder module for unsupervised attack classification. Extensive experiments show that our method achieves consistently higher detection AUROC on diverse unknown attacks while improving efficiency. The code is available at https://anonymous.4open.science/r/Learning-to-Detect-51CB.
comment: 16 pages; Previously this version appeared as arXiv:2510.15430 which was submitted as a new work by accident
♻ ☆ Body-Hand Modality Expertized Networks with Cross-attention for Fine-grained Skeleton Action Recognition
Skeleton-based Human Action Recognition (HAR) is a vital technology in robotics and human-robot interaction. However, most existing methods concentrate primarily on full-body movements and often overlook subtle hand motions that are critical for distinguishing fine-grained actions. Recent work leverages a unified graph representation that combines body, hand, and foot keypoints to capture detailed body dynamics. Yet, these models often blur fine hand details due to the disparity between body and hand action characteristics and the loss of subtle features during the spatial-pooling. In this paper, we propose BHaRNet (Body-Hand action Recognition Network), a novel framework that augments a typical body-expert model with a hand-expert model. Our model jointly trains both streams with an ensemble loss that fosters cooperative specialization, functioning in a manner reminiscent of a Mixture-of-Experts (MoE). Moreover, cross-attention is employed via an expertized branch method and a pooling-attention module to enable feature-level interactions and selectively fuse complementary information. Inspired by MMNet, we also demonstrate the applicability of our approach to multi-modal tasks by leveraging RGB information, where body features guide RGB learning to capture richer contextual cues. Experiments on large-scale benchmarks (NTU RGB+D 60, NTU RGB+D 120, PKU-MMD, and Northwestern-UCLA) demonstrate that BHaRNet achieves SOTA accuracies -- improving from 86.4\% to 93.0\% in hand-intensive actions -- while maintaining fewer GFLOPs and parameters than the relevant unified methods.
comment: 7 figures, 8 pages
♻ ☆ MHR: Momentum Human Rig
We present MHR, a parametric human body model that combines the decoupled skeleton/shape paradigm of ATLAS with a flexible, modern rig and pose corrective system inspired by the Momentum library. Our model enables expressive, anatomically plausible human animation, supporting non-linear pose correctives, and is designed for robust integration in AR/VR and graphics pipelines.
♻ ☆ Unsupervised learning of spatially varying regularization for diffeomorphic image registration
Spatially varying regularization accommodates the deformation variations that may be necessary for different anatomical regions during deformable image registration. Historically, optimization-based registration models have harnessed spatially varying regularization to address anatomical subtleties. However, most modern deep learning-based models tend to gravitate towards spatially invariant regularization, wherein a homogenous regularization strength is applied across the entire image, potentially disregarding localized variations. In this paper, we propose a hierarchical probabilistic model that integrates a prior distribution on the deformation regularization strength, enabling the end-to-end learning of a spatially varying deformation regularizer directly from the data. The proposed method is straightforward to implement and easily integrates with various registration network architectures. Additionally, automatic tuning of hyperparameters is achieved through Bayesian optimization, allowing efficient identification of optimal hyperparameters for any given registration task. Comprehensive evaluations on publicly available datasets demonstrate that the proposed method significantly improves registration performance and enhances the interpretability of deep learning-based registration, all while maintaining smooth deformations.
comment: Accepted to Medical Image Analysis ((c) MedIA). Code available at http://bit.ly/3BrXGxz
♻ ☆ CleverDistiller: Simple and Spatially Consistent Cross-modal Distillation BMVC 2025
Vision foundation models (VFMs) such as DINO have led to a paradigm shift in 2D camera-based perception towards extracting generalized features to support many downstream tasks. Recent works introduce self-supervised cross-modal knowledge distillation (KD) as a way to transfer these powerful generalization capabilities into 3D LiDAR-based models. However, they either rely on highly complex distillation losses, pseudo-semantic maps, or limit KD to features useful for semantic segmentation only. In this work, we propose CleverDistiller, a self-supervised, cross-modal 2D-to-3D KD framework introducing a set of simple yet effective design choices: Unlike contrastive approaches relying on complex loss design choices, our method employs a direct feature similarity loss in combination with a multi layer perceptron (MLP) projection head to allow the 3D network to learn complex semantic dependencies throughout the projection. Crucially, our approach does not depend on pseudo-semantic maps, allowing for direct knowledge transfer from a VFM without explicit semantic supervision. Additionally, we introduce the auxiliary self-supervised spatial task of occupancy prediction to enhance the semantic knowledge, obtained from a VFM through KD, with 3D spatial reasoning capabilities. Experiments on standard autonomous driving benchmarks for 2D-to-3D KD demonstrate that CleverDistiller achieves state-of-the-art performance in both semantic segmentation and 3D object detection (3DOD) by up to 10% mIoU, especially when fine tuning on really low data amounts, showing the effectiveness of our simple yet powerful KD strategy
comment: Accepted to BMVC 2025
♻ ☆ TC-Light: Temporally Coherent Generative Rendering for Realistic World Transfer
Illumination and texture editing are critical dimensions for world-to-world transfer, which is valuable for applications including sim2real and real2real visual data scaling up for embodied AI. Existing techniques generatively re-render the input video to realize the transfer, such as video relighting models and conditioned world generation models. Nevertheless, these models are predominantly limited to the domain of training data (e.g., portrait) or fall into the bottleneck of temporal consistency and computation efficiency, especially when the input video involves complex dynamics and long durations. In this paper, we propose TC-Light, a novel generative renderer to overcome these problems. Starting from the video preliminarily relighted by an inflated video relighting model, it optimizes appearance embedding in the first stage to align global illumination. Then it optimizes the proposed canonical video representation, i.e., Unique Video Tensor (UVT), to align fine-grained texture and lighting in the second stage. To comprehensively evaluate performance, we also establish a long and highly dynamic video benchmark. Extensive experiments show that our method enables physically plausible re-rendering results with superior temporal coherence and low computation cost. The code and video demos are available at https://dekuliutesla.github.io/tclight/.
comment: Project Page: https://dekuliutesla.github.io/tclight/ Code: https://github.com/Linketic/TC-Light
♻ ☆ CD-DPE: Dual-Prompt Expert Network based on Convolutional Dictionary Feature Decoupling for Multi-Contrast MRI Super-Resolution AAAI
Multi-contrast magnetic resonance imaging (MRI) super-resolution intends to reconstruct high-resolution (HR) images from low-resolution (LR) scans by leveraging structural information present in HR reference images acquired with different contrasts. This technique enhances anatomical detail and soft tissue differentiation, which is vital for early diagnosis and clinical decision-making. However, inherent contrasts disparities between modalities pose fundamental challenges in effectively utilizing reference image textures to guide target image reconstruction, often resulting in suboptimal feature integration. To address this issue, we propose a dual-prompt expert network based on a convolutional dictionary feature decoupling (CD-DPE) strategy for multi-contrast MRI super-resolution. Specifically, we introduce an iterative convolutional dictionary feature decoupling module (CD-FDM) to separate features into cross-contrast and intra-contrast components, thereby reducing redundancy and interference. To fully integrate these features, a novel dual-prompt feature fusion expert module (DP-FFEM) is proposed. This module uses a frequency prompt to guide the selection of relevant reference features for incorporation into the target image, while an adaptive routing prompt determines the optimal method for fusing reference and target features to enhance reconstruction quality. Extensive experiments on public multi-contrast MRI datasets demonstrate that CD-DPE outperforms state-of-the-art methods in reconstructing fine details. Additionally, experiments on unseen datasets demonstrated that CD-DPE exhibits strong generalization capabilities.
comment: This paper has been accepted by AAAI, but due to the final camera-ready version not being finalized, there are still some expression errors. It will be re-published after correction
♻ ☆ End-to-End 4D Heart Mesh Recovery Across Full-Stack and Sparse Cardiac MRI
Reconstructing cardiac motion from CMR sequences is critical for diagnosis, prognosis, and intervention. Existing methods rely on complete CMR stacks to infer full heart motion, limiting their applicability during intervention when only sparse observations are available. We present TetHeart, the first end-to-end framework for unified 4D heart mesh recovery from both offline full-stack and intra-procedural sparse-slice observations. Our method leverages deformable tetrahedra to capture shape and motion in a coherent space shared across cardiac structures. Before a procedure, it initializes detailed, patient-specific heart meshes from high-quality full stacks, which can then be updated using whatever slices can be obtained in real-time, down to a single one during the procedure. TetHeart incorporates several key innovations: (i) an attentive slice-adaptive 2D-3D feature assembly mechanism that integrates information from arbitrary numbers of slices at any position; (ii) a distillation strategy to ensure accurate reconstruction under extreme sparsity; and (iii) a weakly supervised motion learning scheme requiring annotations only at keyframes, such as the end-diastolic and end-systolic phases. Trained and validated on three large public datasets and evaluated zero-shot on additional private interventional and public datasets without retraining, TetHeart achieves state-of-the-art accuracy and strong generalization in both pre- and intra-procedural settings.
♻ ☆ One Pic is All it Takes: Poisoning Visual Document Retrieval Augmented Generation with a Single Image
Retrieval-augmented generation (RAG) is instrumental for inhibiting hallucinations in large language models (LLMs) through the use of a factual knowledge base (KB). Although PDF documents are prominent sources of knowledge, text-based RAG pipelines are ineffective at capturing their rich multi-modal information. In contrast, visual document RAG (VD-RAG) uses screenshots of document pages as the KB, which has been shown to achieve state-of-the-art results. However, by introducing the image modality, VD-RAG introduces new attack vectors for adversaries to disrupt the system by injecting malicious documents into the KB. In this paper, we demonstrate the vulnerability of VD-RAG to poisoning attacks targeting both retrieval and generation. We define two attack objectives and demonstrate that both can be realized by injecting only a single adversarial image into the KB. Firstly, we introduce a targeted attack against one or a group of queries with the goal of spreading targeted disinformation. Secondly, we present a universal attack that, for any potential user query, influences the response to cause a denial-of-service in the VD-RAG system. We investigate the two attack objectives under both white-box and black-box assumptions, employing a multi-objective gradient-based optimization approach as well as prompting state-of-the-art generative models. Using two visual document datasets, a diverse set of state-of-the-art retrievers (embedding models) and generators (vision language models), we show VD-RAG is vulnerable to poisoning attacks in both the targeted and universal settings, yet demonstrating robustness to black-box attacks in the universal setting.
♻ ☆ Co-Reinforcement Learning for Unified Multimodal Understanding and Generation NeurIPS 2025
This paper presents a pioneering exploration of reinforcement learning (RL) via group relative policy optimization for unified multimodal large language models (ULMs), aimed at simultaneously reinforcing generation and understanding capabilities. Through systematic pilot studies, we uncover the significant potential of ULMs to enable the synergistic co-evolution of dual capabilities within a shared policy optimization framework. Building on this insight, we introduce CoRL, a co-reinforcement learning framework comprising a unified RL stage for joint optimization and a refined RL stage for task-specific enhancement. With the proposed CoRL, our resulting model, ULM-R1, achieves average improvements of 7% on three text-to-image generation datasets and 23% on nine multimodal understanding benchmarks. These results demonstrate the effectiveness of CoRL and highlight the substantial benefit of reinforcement learning in facilitating cross-task synergy and optimization for ULMs. Code is available at https://github.com/mm-vl/ULM-R1.
comment: NeurIPS 2025
♻ ☆ FunnyNodules: A Customizable Medical Dataset Tailored for Evaluating Explainable AI
Densely annotated medical image datasets that capture not only diagnostic labels but also the underlying reasoning behind these diagnoses are scarce. Such reasoning-related annotations are essential for developing and evaluating explainable AI (xAI) models that reason similarly to radiologists: making correct predictions for the right reasons. To address this gap, we introduce FunnyNodules, a fully parameterized synthetic dataset designed for systematic analysis of attribute-based reasoning in medical AI models. The dataset generates abstract, lung nodule-like shapes with controllable visual attributes such as roundness, margin sharpness, and spiculation. Target class is derived from a predefined attribute combination, allowing full control over the decision rule that links attributes to the diagnostic class. We demonstrate how FunnyNodules can be used in model-agnostic evaluations to assess whether models learn correct attribute-target relations, to interpret over- or underperformance in attribute prediction, and to analyze attention alignment with attribute-specific regions of interest. The framework is fully customizable, supporting variations in dataset complexity, target definitions, class balance, and beyond. With complete ground truth information, FunnyNodules provides a versatile foundation for developing, benchmarking, and conducting in-depth analyses of explainable AI methods in medical image analysis.
♻ ☆ Seeing Beyond Haze: Generative Nighttime Image Dehazing
Nighttime image dehazing is particularly challenging when dense haze and intense glow severely degrade or entirely obscure background information. Existing methods often struggle due to insufficient background priors and limited generative capability, both of which are highly important under such conditions. In this paper, we introduce BeyondHaze, a generative nighttime dehazing method that not only reduces haze and glow effects but also reconstructs plausible background structures in regions where visual cues are heavily degraded. Our approach is built on two main ideas: obtaining strong background priors by adapting image diffusion models to nighttime dehazing, and enhancing generative ability in haze- and glow-obscured areas through guided training. Task-specific nighttime dehazing knowledge is distilled into an image diffusion model while preserving its capacity to generate clean images. The diffusion model is further trained on tailored image pairs to improve its ability to recover background details that are suppressed by haze effects. Since generative models may introduce hallucinated content, we design our framework to allow user control over the generative level, enabling a balance between visual realism and fidelity. Experiments on real-world nighttime images demonstrate that BeyondHaze substantially improves visibility and scene detail under dense haze.
♻ ☆ Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation
This report introduces Kandinsky 5.0, a family of state-of-the-art foundation models for high-resolution image and 10-second video synthesis. The framework comprises three core line-up of models: Kandinsky 5.0 Image Lite - a line-up of 6B parameter image generation models, Kandinsky 5.0 Video Lite - a fast and lightweight 2B parameter text-to-video and image-to-video models, and Kandinsky 5.0 Video Pro - 19B parameter models that achieves superior video generation quality. We provide a comprehensive review of the data curation lifecycle - including collection, processing, filtering and clustering - for the multi-stage training pipeline that involves extensive pre-training and incorporates quality-enhancement techniques such as self-supervised fine-tuning (SFT) and reinforcement learning (RL)-based post-training. We also present novel architectural, training, and inference optimizations that enable Kandinsky 5.0 to achieve high generation speeds and state-of-the-art performance across various tasks, as demonstrated by human evaluation. As a large-scale, publicly available generative framework, Kandinsky 5.0 leverages the full potential of its pre-training and subsequent stages to be adapted for a wide range of generative applications. We hope that this report, together with the release of our open-source code and training checkpoints, will substantially advance the development and accessibility of high-quality generative models for the research community.
comment: Website: https://kandinskylab.ai/
♻ ☆ Multimodal Evaluation of Russian-language Architectures
Multimodal large language models (MLLMs) are currently at the center of research attention, showing rapid progress in scale and capabilities, yet their intelligence, limitations, and risks remain insufficiently understood. To address these issues, particularly in the context of the Russian language, where no multimodal benchmarks currently exist, we introduce Mera Multi, an open multimodal evaluation framework for Russian-spoken architectures. The benchmark is instruction-based and encompasses default text, image, audio, and video modalities, comprising 18 newly constructed evaluation tasks for both general-purpose models and modality-specific architectures (image-to-text, video-to-text, and audio-to-text). Our contributions include: (i) a universal taxonomy of multimodal abilities; (ii) 18 datasets created entirely from scratch with attention to Russian cultural and linguistic specificity, unified prompts, and metrics; (iii) baseline results for both closed-source and open-source models; (iv) a methodology for preventing benchmark leakage, including watermarking and licenses for private sets. While our current focus is on Russian, the proposed benchmark provides a replicable methodology for constructing multimodal benchmarks in typologically diverse languages, particularly within the Slavic language family.
♻ ☆ System Filter-Based Common Components Modeling for Cross-Subject EEG Decoding
Brain-computer interface (BCI) technology enables direct communication between the brain and external devices through electroencephalography (EEG) signals. However, existing decoding models often mix common and personalized components, leading to interference from individual variability that limits cross-subject decoding performance. To address this issue, this paper proposes a system filter that extends the concept of signal filtering to the system level. The method expands a system into its spectral representation, selectively removes unnecessary components, and reconstructs the system from the retained target components, thereby achieving explicit system-level decomposition and filtering. We further integrate the system filter into a Cross-Subject Decoding framework based on the System Filter (CSD-SF) and evaluate it on the four-class motor imagery (MI) task of the BCIC IV 2a dataset. Personalized models are transformed into relation spectrums, and statistical testing across subjects is used to remove personalized components. The remaining stable relations, representing common components across subjects, are then used to construct a common model for cross-subject decoding. Experimental results show an average improvement of 3.28% in decoding accuracy over baseline methods, demonstrating that the proposed system filter effectively isolates stable common components and enhances model robustness and generalizability in cross-subject EEG decoding.
comment: 12 pages, 11 figures
♻ ☆ DINO in the Room: Leveraging 2D Foundation Models for 3D Segmentation 3DV 2026
Vision foundation models (VFMs) trained on large-scale image datasets provide high-quality features that have significantly advanced 2D visual recognition. However, their potential in 3D scene segmentation remains largely untapped, despite the common availability of 2D images alongside 3D point cloud datasets. While significant research has been dedicated to 2D-3D fusion, recent state-of-the-art 3D methods predominantly focus on 3D data, leaving the integration of VFMs into 3D models underexplored. In this work, we challenge this trend by introducing DITR, a generally applicable approach that extracts 2D foundation model features, projects them to 3D, and finally injects them into a 3D point cloud segmentation model. DITR achieves state-of-the-art results on both indoor and outdoor 3D semantic segmentation benchmarks. To enable the use of VFMs even when images are unavailable during inference, we additionally propose to pretrain 3D models by distilling 2D foundation models. By initializing the 3D backbone with knowledge distilled from 2D VFMs, we create a strong basis for downstream 3D segmentation tasks, ultimately boosting performance across various datasets.
comment: Accepted to 3DV 2026. Project page at https://vision.rwth-aachen.de/ditr
♻ ☆ MagicFace: High-Fidelity Facial Expression Editing with Action-Unit Control
We address the problem of facial expression editing by controling the relative variation of facial action-unit (AU) from the same person. This enables us to edit this specific person's expression in a fine-grained, continuous and interpretable manner, while preserving their identity, pose, background and detailed facial attributes. Key to our model, which we dub MagicFace, is a diffusion model conditioned on AU variations and an ID encoder to preserve facial details of high consistency. Specifically, to preserve the facial details with the input identity, we leverage the power of pretrained Stable-Diffusion models and design an ID encoder to merge appearance features through self-attention. To keep background and pose consistency, we introduce an efficient Attribute Controller by explicitly informing the model of current background and pose of the target. By injecting AU variations into a denoising UNet, our model can animate arbitrary identities with various AU combinations, yielding superior results in high-fidelity expression editing compared to other facial expression editing works. Code is publicly available at https://github.com/weimengting/MagicFace.
♻ ☆ Conan: Progressive Learning to Reason Like a Detective over Multi-Scale Visual Evidence
Video reasoning, which requires multi-step deduction across frames, remains a major challenge for multimodal large language models (MLLMs). While reinforcement learning (RL)-based methods enhance reasoning capabilities, they often rely on text-only chains that yield ungrounded or hallucinated conclusions. Conversely, frame-retrieval approaches introduce visual grounding, yet still struggle with inaccurate evidence localization. To address these limitations, we present Conan, a framework for evidence-grounded multi-step video reasoning. Conan identifies context and evidence frames, reasons over cross-frame clues, and adaptively decides when to conclude or explore further. To achieve this, we 1) construct Conan-91K, a large-scale dataset of automatically generated reasoning traces that include frame identification, evidence reasoning, and action decision, and 2) design a multi-stage progressive cold-start strategy combined with an Identification-Reasoning-Action (AIR) RLVR training framework to progressively incentivize multi-step visual reasoning. Extensive experiments on six multi-step reasoning benchmarks demonstrate that Conan surpasses the baseline Qwen2.5-VL-7B-Instruct by an average of over 10% in accuracy, achieving state-of-the-art performance. Furthermore, Conan generalizes effectively to long video understanding tasks, validating its strong scalability and robustness.
♻ ☆ VisPlay: Self-Evolving Vision-Language Models from Images
Reinforcement learning (RL) provides a principled framework for improving Vision-Language Models (VLMs) on complex reasoning tasks. However, existing RL approaches often rely on human-annotated labels or task-specific heuristics to define verifiable rewards, both of which are costly and difficult to scale. We introduce VisPlay, a self-evolving RL framework that enables VLMs to autonomously improve their reasoning abilities using large amounts of unlabeled image data. Starting from a single base VLM, VisPlay assigns the model into two interacting roles: an Image-Conditioned Questioner that formulates challenging yet answerable visual questions, and a Multimodal Reasoner that generates silver responses. These roles are jointly trained with Group Relative Policy Optimization (GRPO), which incorporates diversity and difficulty rewards to balance the complexity of generated questions with the quality of the silver answers. VisPlay scales efficiently across two model families. When trained on Qwen2.5-VL and MiMo-VL, VisPlay achieves consistent improvements in visual reasoning, compositional generalization, and hallucination reduction across eight benchmarks, including MM-Vet and MMMU, demonstrating a scalable path toward self-evolving multimodal intelligence. The project page is available at https://bruno686.github.io/VisPlay/
♻ ☆ RoMa v2: Harder Better Faster Denser Feature Matching
Dense feature matching aims to estimate all correspondences between two images of a 3D scene and has recently been established as the gold-standard due to its high accuracy and robustness. However, existing dense matchers still fail or perform poorly for many hard real-world scenarios, and high-precision models are often slow, limiting their applicability. In this paper, we attack these weaknesses on a wide front through a series of systematic improvements that together yield a significantly better model. In particular, we construct a novel matching architecture and loss, which, combined with a curated diverse training distribution, enables our model to solve many complex matching tasks. We further make training faster through a decoupled two-stage matching-then-refinement pipeline, and at the same time, significantly reduce refinement memory usage through a custom CUDA kernel. Finally, we leverage the recent DINOv3 foundation model along with multiple other insights to make the model more robust and unbiased. In our extensive set of experiments we show that the resulting novel matcher sets a new state-of-the-art, being significantly more accurate than its predecessors. Code is available at https://github.com/Parskatt/romav2
comment: Added acknowledgements, and some minor fixes
♻ ☆ Label-Efficient Cross-Modality Generalization for Liver Segmentation in Multi-Phase MRI MICCAI 2025
Accurate liver segmentation in multi-phase MRI is vital for liver fibrosis assessment, yet labeled data is often scarce and unevenly distributed across imaging modalities and vendor systems. We propose a label-efficient segmentation approach that promotes cross-modality generalization under real-world conditions, where GED4 hepatobiliary-phase annotations are limited, non-contrast sequences (T1WI, T2WI, DWI) are unlabeled, and spatial misalignment and missing phases are common. Our method integrates a foundation-scale 3D segmentation backbone adapted via fine-tuning, co-training with cross pseudo supervision to leverage unlabeled volumes, and a standardized preprocessing pipeline. Without requiring spatial registration, the model learns to generalize across MRI phases and vendors, demonstrating robust segmentation performance in both labeled and unlabeled domains. Our results exhibit the effectiveness of our proposed label-efficient baseline for liver segmentation in multi-phase, multi-vendor MRI and highlight the potential of combining foundation model adaptation with co-training for real-world clinical imaging tasks.
comment: Accepted at MICCAI 2025 Workshop
♻ ☆ Structural-Spectral Graph Convolution with Evidential Edge Learning for Hyperspectral Image Clustering
Hyperspectral image (HSI) clustering groups pixels into clusters without labeled data, which is an important yet challenging task. For large-scale HSIs, most methods rely on superpixel segmentation and perform superpixel-level clustering based on graph neural networks (GNNs). However, existing GNNs cannot fully exploit the spectral information of the input HSI, and the inaccurate superpixel topological graph may lead to the confusion of different class semantics during information aggregation. To address these challenges, we first propose a structural-spectral graph convolutional operator (SSGCO) tailored for graph-structured HSI superpixels to improve their representation quality through the co-extraction of spatial and spectral features. Second, we propose an evidence-guided adaptive edge learning (EGAEL) module that adaptively predicts and refines edge weights in the superpixel topological graph. We integrate the proposed method into a contrastive learning framework to achieve clustering, where representation learning and clustering are simultaneously conducted. Experiments demonstrate that the proposed method improves clustering accuracy by 2.61%, 6.06%, 4.96% and 3.15% over the best compared methods on four HSI datasets. Our code is available at https://github.com/jhqi/SSGCO-EGAEL.
♻ ☆ From Play to Replay: Composed Video Retrieval for Temporally Fine-Grained Videos
Composed Video Retrieval (CoVR) retrieves a target video given a query video and a modification text describing the intended change. Existing CoVR benchmarks emphasize appearance shifts or coarse event changes and therefore do not test the ability to capture subtle, fast-paced temporal differences. We introduce TF-CoVR, the first large-scale benchmark dedicated to temporally fine-grained CoVR. TF-CoVR focuses on gymnastics and diving, and provides 180K triplets drawn from FineGym and FineDiving datasets. Previous CoVR benchmarks, focusing on temporal aspect, link each query to a single target segment taken from the same video, limiting practical usefulness. In TF-CoVR, we instead construct each pair by prompting an LLM with the label differences between clips drawn from different videos; every pair is thus associated with multiple valid target videos (3.9 on average), reflecting real-world tasks such as sports-highlight generation. To model these temporal dynamics, we propose TF-CoVR-Base, a concise two-stage training framework: (i) pre-train a video encoder on fine-grained action classification to obtain temporally discriminative embeddings; (ii) align the composed query with candidate videos using contrastive learning. We conduct the first comprehensive study of image, video, and general multimodal embedding (GME) models on temporally fine-grained composed retrieval in both zero-shot and fine-tuning regimes. On TF-CoVR, TF-CoVR-Base improves zero-shot mAP@50 from 5.92 (LanguageBind) to 7.51, and after fine-tuning raises the state-of-the-art from 19.83 to 27.22.
♻ ☆ Learning from Dense Events: Towards Fast Spiking Neural Networks Training via Event Dataset Distillation
Event cameras sense brightness changes and output binary asynchronous event streams, attracting increasing attention. Their bio-inspired dynamics align well with spiking neural networks (SNNs), offering a promising energy-efficient alternative to conventional vision systems. However, SNNs remain costly to train due to temporal coding, which limits their practical deployment. To alleviate the high training cost of SNNs, we introduce \textbf{PACE} (Phase-Aligned Condensation for Events), the first dataset distillation framework to SNNs and event-based vision. PACE distills a large training dataset into a compact synthetic one that enables fast SNN training, which is achieved by two core modules: \textbf{ST-DSM} and \textbf{PEQ-N}. ST-DSM uses residual membrane potentials to densify spike-based features (SDR) and to perform fine-grained spatiotemporal matching of amplitude and phase (ST-SM), while PEQ-N provides a plug-and-play straight through probabilistic integer quantizer compatible with standard event-frame pipelines. Across DVS-Gesture, CIFAR10-DVS, and N-MNIST datasets, PACE outperforms existing coreset selection and dataset distillation baselines, with particularly strong gains on dynamic event streams and at low or moderate IPC. Specifically, on N-MNIST, it achieves \(84.4\%\) accuracy, about \(85\%\) of the full training set performance, while reducing training time by more than \(50\times\) and storage cost by \(6000\times\), yielding compact surrogates that enable minute-scale SNN training and efficient edge deployment.
♻ ☆ Human Motion Unlearning
We introduce the task of human motion unlearning to prevent the synthesis of toxic animations while preserving the general text-to-motion generative performance. Unlearning toxic motions is challenging as those can be generated from explicit text prompts and from implicit toxic combinations of safe motions (e.g., "kicking" is "loading and swinging a leg"). We propose the first motion unlearning benchmark by filtering toxic motions from the large and recent text-to-motion datasets of HumanML3D and Motion-X. We propose baselines, by adapting state-of-the-art image unlearning techniques to process spatio-temporal signals. Finally, we propose a novel motion unlearning model based on Latent Code Replacement, which we dub LCR. LCR is training-free and suitable to the discrete latent spaces of state-of-the-art text-to-motion diffusion models. LCR is simple and consistently outperforms baselines qualitatively and quantitatively. Project page: https://www.pinlab.org/hmu.
♻ ☆ Introducing DEFORMISE: A deep learning framework for dementia diagnosis in the elderly using optimized MRI slice selection
Dementia, a debilitating neurological condition affecting millions worldwide, presents significant diagnostic challenges. In this work, we introduce DEFORMISE, a novel DEep learning Framework for dementia diagnOsis of eldeRly patients using 3D brain Magnetic resonance Imaging (MRI) scans with Optimized Slice sElection. Our approach features a unique technique for selectively processing MRI slices, focusing on the most relevant brain regions and excluding less informative sections. This methodology is complemented by a confidence-based classification committee composed of three novel deep learning models. Tested on the Open OASIS datasets, our method achieved an impressive accuracy of 94.12%, surpassing existing methodologies. Furthermore, validation on the ADNI dataset confirmed the robustness and generalizability of our approach. The use of explainable AI (XAI) techniques and comprehensive ablation studies further substantiate the effectiveness of our techniques, providing insights into the decision-making process and the importance of our methodology. This research offers a significant advancement in dementia diagnosis, providing a highly accurate and efficient tool for clinical applications.
♻ ☆ Enhancing efficiency in paediatric brain tumour segmentation using a pathologically diverse single-center clinical dataset
Background Brain tumours are the most common solid malignancies in children, encompassing diverse histological, molecular subtypes and imaging features and outcomes. Paediatric brain tumours (PBTs), including high- and low-grade gliomas (HGG, LGG), medulloblastomas (MB), ependymomas, and rarer forms, pose diagnostic and therapeutic challenges. Deep learning (DL)-based segmentation offers promising tools for tumour delineation, yet its performance across heterogeneous PBT subtypes and MRI protocols remains uncertain. Methods A retrospective single-centre cohort of 174 paediatric patients with HGG, LGG, medulloblastomas (MB), ependymomas, and other rarer subtypes was used. MRI sequences included T1, T1 post-contrast (T1-C), T2, and FLAIR. Manual annotations were provided for four tumour subregions: whole tumour (WT), T2-hyperintensity (T2H), enhancing tumour (ET), and cystic component (CC). A 3D nnU-Net model was trained and tested (121/53 split), with segmentation performance assessed using the Dice similarity coefficient (DSC) and compared against intra- and inter-rater variability. Results The model achieved robust performance for WT and T2H (mean DSC: 0.85), comparable to human annotator variability (mean DSC: 0.86). ET segmentation was moderately accurate (mean DSC: 0.75), while CC performance was poor. Segmentation accuracy varied by tumour type, MRI sequence combination, and location. Notably, T1, T1-C, and T2 alone produced results nearly equivalent to the full protocol. Conclusions DL is feasible for PBTs, particularly for T2H and WT. Challenges remain for ET and CC segmentation, highlighting the need for further refinement. These findings support the potential for protocol simplification and automation to enhance volumetric assessment and streamline paediatric neuro-oncology workflows.
comment: A. Jakab and F. Kofler have shared last authorship
♻ ☆ LSAP: Rethinking Inversion Fidelity, Perception and Editability in GAN Latent Space
As research on image inversion advances, the process is generally divided into two stages. The first step is Image Embedding, involves using an encoder or optimization procedure to embed an image and obtain its corresponding latent code. The second stage, referred to as Result Refinement, further improves the inversion and editing outcomes. Although this refinement stage substantially enhances reconstruction fidelity, perception and editability remain largely unchanged and are highly dependent on the latent codes derived from the first stage. Therefore, a key challenge lies in obtaining latent codes that preserve reconstruction fidelity while simultaneously improving perception and editability. In this work, we first reveal that these two properties are closely related to the degree of alignment (or disalignment) between the inverted latent codes and the synthetic distribution. Based on this insight, we propose the \textbf{ Latent Space Alignment Inversion Paradigm (LSAP)}, which integrates both an evaluation metric and a unified inversion solution. Specifically, we introduce the \textbf{Normalized Style Space ($\mathcal{S^N}$ space)} and \textbf{Normalized Style Space Cosine Distance (NSCD)} to quantify the disalignment of inversion methods. Moreover, our paradigm can be optimized for both encoder-based and optimization-based embeddings, providing a consistent alignment framework. Extensive experiments across various domains demonstrate that NSCD effectively captures perceptual and editable characteristics, and that our alignment paradigm achieves state-of-the-art performance in both stages of inversion.
comment: under review
♻ ☆ Otter: Mitigating Background Distractions of Wide-Angle Few-Shot Action Recognition with Enhanced RWKV AAAI 2026
Wide-angle videos in few-shot action recognition (FSAR) effectively express actions within specific scenarios. However, without a global understanding of both subjects and background, recognizing actions in such samples remains challenging because of the background distractions. Receptance Weighted Key Value (RWKV), which learns interaction between various dimensions, shows promise for global modeling. While directly applying RWKV to wide-angle FSAR may fail to highlight subjects due to excessive background information. Additionally, temporal relation degraded by frames with similar backgrounds is difficult to reconstruct, further impacting performance. Therefore, we design the CompOund SegmenTation and Temporal REconstructing RWKV (Otter). Specifically, the Compound Segmentation Module~(CSM) is devised to segment and emphasize key patches in each frame, effectively highlighting subjects against background information. The Temporal Reconstruction Module (TRM) is incorporated into the temporal-enhanced prototype construction to enable bidirectional scanning, allowing better reconstruct temporal relation. Furthermore, a regular prototype is combined with the temporal-enhanced prototype to simultaneously enhance subject emphasis and temporal modeling, improving wide-angle FSAR performance. Extensive experiments on benchmarks such as SSv2, Kinetics, UCF101, and HMDB51 demonstrate that Otter achieves state-of-the-art performance. Extra evaluation on the VideoBadminton dataset further validates the superiority of Otter in wide-angle FSAR.
comment: Accepted by AAAI 2026 Oral
♻ ☆ Linear time small coresets for k-mean clustering of segments with applications
We study the $k$-means problem for a set $\mathcal{S} \subseteq \mathbb{R}^d$ of $n$ segments, aiming to find $k$ centers $X \subseteq \mathbb{R}^d$ that minimize $D(\mathcal{S},X) := \sum_{S \in \mathcal{S}} \min_{x \in X} D(S,x)$, where $D(S,x) := \int_{p \in S} |p - x| dp$ measures the total distance from each point along a segment to a center. Variants of this problem include handling outliers, employing alternative distance functions such as M-estimators, weighting distances to achieve balanced clustering, or enforcing unique cluster assignments. For any $\varepsilon > 0$, an $\varepsilon$-coreset is a weighted subset $C \subseteq \mathbb{R}^d$ that approximates $D(\mathcal{S},X)$ within a factor of $1 \pm \varepsilon$ for any set of $k$ centers, enabling efficient streaming, distributed, or parallel computation. We propose the first coreset construction that provably handles arbitrary input segments. For constant $k$ and $\varepsilon$, it produces a coreset of size $O(\log^2 n)$ computable in $O(nd)$ time. Experiments, including a real-time video tracking application, demonstrate substantial speedups with minimal loss in clustering accuracy, confirming both the practical efficiency and theoretical guarantees of our method.
comment: First published in WALCOM 2026 by Springer Nature
♻ ☆ IOR: Inversed Objects Replay for Incremental Object Detection
Existing Incremental Object Detection (IOD) methods partially alleviate catastrophic forgetting when incrementally detecting new objects in real-world scenarios. However, many of these methods rely on the assumption that unlabeled old-class objects may co-occur with labeled new-class objects in the incremental data. When unlabeled old-class objects are absent, the performance of existing methods tends to degrade. The absence can be mitigated by generating old-class samples, but it incurs high costs. This paper argues that previous generation-based IOD suffers from redundancy, both in the use of generative models, which require additional training and storage, and in the overproduction of generated samples, many of which do not contribute significantly to performance improvements. To eliminate the redundancy, we propose Inversed Objects Replay (IOR). Specifically, we generate old-class samples by inversing the original detectors, thus eliminating the necessity of training and storing additional generative models. We propose augmented replay to reuse the objects in generated samples, reducing redundant generations. Moreover, we propose high-value knowledge distillation focusing on the positions of old-class objects overwhelmed by the background, which transfers the knowledge to the incremental detector. Extensive experiments conducted on MS COCO 2017 demonstrate that our method can efficiently improve detection performance in IOD scenarios with the absence of old-class objects. The code is available at https://github.com/JiaJia075/IOR.
♻ ☆ Towards Metric-Aware Multi-Person Mesh Recovery by Jointly Optimizing Human Crowd in Camera Space
Multi-person human mesh recovery from a single image is a challenging task, hindered by the scarcity of in-the-wild training data. Prevailing in-the-wild human mesh pseudo-ground-truth (pGT) generation pipelines are single-person-centric, where each human is processed individually without joint optimization. This oversight leads to a lack of scene-level consistency, producing individuals with conflicting depths and scales within the same image. To address this, we introduce Depth-conditioned Translation Optimization (DTO), a novel optimization-based method that jointly refines the camera-space translations of all individuals in a crowd. By leveraging anthropometric priors on human height and depth cues from a monocular depth estimator, DTO solves for a scene-consistent placement of all subjects within a principled Maximum a posteriori (MAP) framework. Applying DTO to the 4D-Humans dataset, we construct DTO-Humans, a new large-scale pGT dataset of 0.56M high-quality, scene-consistent multi-person images, featuring dense crowds with an average of 4.8 persons per image. Furthermore, we propose Metric-Aware HMR, an end-to-end network that directly estimates human mesh and camera parameters in metric scale. This is enabled by a camera branch and a relative metric loss that enforces plausible relative scales. Extensive experiments demonstrate that our method achieves state-of-the-art performance on relative depth reasoning and human mesh recovery. Code is available at: https://github.com/gouba2333/MA-HMR.
♻ ☆ Fusion of Multi-scale Heterogeneous Pathology Foundation Models for Whole Slide Image Analysis
Whole slide image (WSI) analysis has emerged as an increasingly essential technique in computational pathology. Recent advances in the pathology foundation models (FMs) have demonstrated significant advantages in deriving meaningful patch-level or slide-level multi-scale features from WSIs. However, current pathology FMs have exhibited substantial heterogeneity caused by diverse private training datasets and different network architectures. This heterogeneity introduces performance variability when we utilize the features from different FMs in the downstream tasks. To fully explore the advantages of multiple FMs effectively, in this work, we propose a novel framework for the fusion of multi-scale heterogeneous pathology FMs, called FuseCPath, yielding a model with a superior ensemble performance. The main contributions of our framework can be summarized as follows: (i) To guarantee the representativeness of the training patches, we propose a multi-view clustering-based method to filter out the discriminative patches via multiple FMs' embeddings. (ii) To effectively fuse the patch-level FMs, we devise a cluster-level re-embedding strategy to online capture patch-level local features. (iii) To effectively fuse the slide-level FMs, we devise a collaborative distillation strategy to explore the connections between slide-level FMs. Extensive experiments demonstrate that the proposed FuseCPath achieves state-of-the-art performance across multiple tasks on diverse datasets.
comment: 22 pages, 9 figures
♻ ☆ UINO-FSS: Unifying Representation Learning and Few-shot Segmentation via Hierarchical Distillation and Mamba-HyperCorrelation
Few-shot semantic segmentation has attracted growing interest for its ability to generalize to novel object categories using only a few annotated samples. To address data scarcity, recent methods incorporate multiple foundation models to improve feature transferability and segmentation performance. However, they often rely on dual-branch architectures that combine pre-trained encoders to leverage complementary strengths, a design that limits flexibility and efficiency. This raises a fundamental question: can we build a unified model that integrates knowledge from different foundation architectures? Achieving this is, however, challenging due to the misalignment between class-agnostic segmentation capabilities and fine-grained discriminative representations. To this end, we present UINO-FSS, a novel framework built on the key observation that early-stage DINOv2 features exhibit distribution consistency with SAM's output embeddings. This consistency enables the integration of both models' knowledge into a single-encoder architecture via coarse-to-fine multimodal distillation. In particular, our segmenter consists of three core components: a bottleneck adapter for embedding alignment, a meta-visual prompt generator that leverages dense similarity volumes and semantic embeddings, and a mask decoder. Using hierarchical cross-model distillation, we effectively transfer SAM's knowledge into the segmenter, further enhanced by Mamba-based 4D correlation mining on support-query pairs. Extensive experiments on PASCAL-5$^i$ and COCO-20$^i$ show that UINO-FSS achieves new state-of-the-art results under the 1-shot setting, with mIoU of 80.6 (+3.8%) on PASCAL-5$^i$ and 64.5 (+4.1%) on COCO-20$^i$, demonstrating the effectiveness of our unified approach.
♻ ☆ FLUX-Text: A Simple and Advanced Diffusion Transformer Baseline for Scene Text Editing
Scene text editing aims to modify or add texts on images while ensuring text fidelity and overall visual quality consistent with the background. Recent methods are primarily built on UNet-based diffusion models, which have improved scene text editing results, but still struggle with complex glyph structures, especially for non-Latin ones (\eg, Chinese, Korean, Japanese). To address these issues, we present \textbf{FLUX-Text}, a simple and advanced multilingual scene text editing DiT method. Specifically, our FLUX-Text enhances glyph understanding and generation through lightweight Visual and Text Embedding Modules, while preserving the original generative capability of FLUX. We further propose a Regional Text Perceptual Loss tailored for text regions, along with a matching two-stage training strategy to better balance text editing and overall image quality. Benefiting from the DiT-based architecture and lightweight feature injection modules, FLUX-Text can be trained with only $0.1$M training examples, a \textbf{97\%} reduction compared to $2.9$M required by popular methods. Extensive experiments on multiple public datasets, including English and Chinese benchmarks, demonstrate that our method surpasses other methods in visual quality and text fidelity. All the code is available at https://github.com/AMAP-ML/FluxText.
comment: 10 pages, 5 figures
♻ ☆ DuetMatch: Harmonizing Semi-Supervised Brain MRI Segmentation via Decoupled Branch Optimization
The limited availability of annotated data in medical imaging makes semi-supervised learning increasingly appealing for its ability to learn from imperfect supervision. Recently, teacher-student frameworks have gained popularity for their training benefits and robust performance. However, jointly optimizing the entire network can hinder convergence and stability, especially in challenging scenarios. To address this for medical image segmentation, we propose DuetMatch, a novel dual-branch semi-supervised framework with asynchronous optimization, where each branch optimizes either the encoder or decoder while keeping the other frozen. To improve consistency under noisy conditions, we introduce Decoupled Dropout Perturbation, enforcing regularization across branches. We also design Pair-wise CutMix Cross-Guidance to enhance model diversity by exchanging pseudo-labels through augmented input pairs. To mitigate confirmation bias from noisy pseudo-labels, we propose Consistency Matching, refining labels using stable predictions from frozen teacher models. Extensive experiments on benchmark brain MRI segmentation datasets, including ISLES2022 and BraTS, show that DuetMatch consistently outperforms state-of-the-art methods, demonstrating its effectiveness and robustness across diverse semi-supervised segmentation scenarios.
comment: Published in Computerized Medical Imaging and Graphics (CMIG)
♻ ☆ Medverse: A Universal Model for Full-Resolution 3D Medical Image Segmentation, Transformation and Enhancement
In-context learning (ICL) offers a promising paradigm for universal medical image analysis, enabling models to perform diverse image processing tasks without retraining. However, current ICL models for medical imaging remain limited in two critical aspects: they cannot simultaneously achieve high-fidelity predictions and global anatomical understanding, and there is no unified model trained across diverse medical imaging tasks (e.g., segmentation and enhancement) and anatomical regions. As a result, the full potential of ICL in medical imaging remains underexplored. Thus, we present \textbf{Medverse}, a universal ICL model for 3D medical imaging, trained on 22 datasets covering diverse tasks in universal image segmentation, transformation, and enhancement across multiple organs, imaging modalities, and clinical centers. Medverse employs a next-scale autoregressive in-context learning framework that progressively refines predictions from coarse to fine, generating consistent, full-resolution volumetric outputs and enabling multi-scale anatomical awareness. We further propose a blockwise cross-attention module that facilitates long-range interactions between context and target inputs while preserving computational efficiency through spatial sparsity. Medverse is extensively evaluated on a broad collection of held-out datasets covering previously unseen clinical centers, organs, species, and imaging modalities. Results demonstrate that Medverse substantially outperforms existing ICL baselines and establishes a novel paradigm for in-context learning. Code and model weights will be made publicly available. Our model are publicly available at https://github.com/jiesihu/Medverse.
♻ ☆ Rep-GLS: Report-Guided Generalized Label Smoothing for Robust Disease Detection
Unlike nature image classification where groundtruth label is explicit and of no doubt, physicians commonly interpret medical image conditioned on certainty like using phrase "probable" or "likely". Existing medical image datasets either simply overlooked the nuance and polarise into binary label. Here, we propose a novel framework that leverages a Large Language Model (LLM) to directly mine medical reports to utilise the uncertainty relevant expression for supervision signal. At first, we collect uncertainty keywords from medical reports. Then, we use Qwen-3 4B to identify the textual uncertainty and map them into an adaptive Generalized Label Smoothing (GLS) rate. This rate allows our model to treat uncertain labels not as errors, but as informative signals, effectively incorporating expert skepticism into the training process. We establish a new clinical expert uncertainty-aware benchmark to rigorously evaluate this problem. Experiments demonstrate that our approach significantly outperforms state-of-the-art methods in medical disease detection. The curated uncertainty words database, code, and benchmark will be made publicly available upon acceptance.
♻ ☆ Event Stream Filtering via Probability Flux Estimation
Event cameras asynchronously capture brightness changes with microsecond latency, offering exceptional temporal precision but suffering from severe noise and signal inconsistencies. Unlike conventional signals, events carry state information through polarities and process information through inter-event time intervals. However, existing event filters often ignore the latter, producing outputs that are sparser than the raw input and limiting the reconstruction of continuous irradiance dynamics. We propose the Event Density Flow Filter (EDFilter), a framework that models event generation as threshold-crossing probability fluxes arising from the stochastic diffusion of irradiance trajectories. EDFilter performs nonparametric, kernel-based estimation of probability flux and reconstructs the continuous event density flow using an O(1) recursive solver, enabling real-time processing. The Rotary Event Dataset (RED), featuring microsecond-resolution ground-truth irradiance flow under controlled illumination is also presented for event quality evaluation. Experiments demonstrate that EDFilter achieves high-fidelity, physically interpretable event denoising and motion reconstruction.
♻ ☆ TubeRMC: Tube-conditioned Reconstruction with Mutual Constraints for Weakly-supervised Spatio-Temporal Video Grounding AAAI 2026
Spatio-Temporal Video Grounding (STVG) aims to localize a spatio-temporal tube that corresponds to a given language query in an untrimmed video. This is a challenging task since it involves complex vision-language understanding and spatiotemporal reasoning. Recent works have explored weakly-supervised setting in STVG to eliminate reliance on fine-grained annotations like bounding boxes or temporal stamps. However, they typically follow a simple late-fusion manner, which generates tubes independent of the text description, often resulting in failed target identification and inconsistent target tracking. To address this limitation, we propose a Tube-conditioned Reconstruction with Mutual Constraints (\textbf{TubeRMC}) framework that generates text-conditioned candidate tubes with pre-trained visual grounding models and further refine them via tube-conditioned reconstruction with spatio-temporal constraints. Specifically, we design three reconstruction strategies from temporal, spatial, and spatio-temporal perspectives to comprehensively capture rich tube-text correspondences. Each strategy is equipped with a Tube-conditioned Reconstructor, utilizing spatio-temporal tubes as condition to reconstruct the key clues in the query. We further introduce mutual constraints between spatial and temporal proposals to enhance their quality for reconstruction. TubeRMC outperforms existing methods on two public benchmarks VidSTG and HCSTVG. Further visualization shows that TubeRMC effectively mitigates both target identification errors and inconsistent tracking.
comment: Accepted to AAAI 2026
♻ ☆ One Model for All: Unified Try-On and Try-Off in Any Pose via LLM-Inspired Bidirectional Tweedie Diffusion
Recent diffusion-based approaches have made significant advances in image-based virtual try-on, enabling more realistic and end-to-end garment synthesis. However, most existing methods remain constrained by their reliance on exhibition garments and segmentation masks, as well as their limited ability to handle flexible pose variations. These limitations reduce their practicality in real-world scenarios - for instance, users cannot easily transfer garments worn by one person onto another, and the generated try-on results are typically restricted to the same pose as the reference image. In this paper, we introduce OMFA (One Model For All), a unified diffusion framework for both virtual try-on and try-off that operates without the need for exhibition garments and supports arbitrary poses. OMFA is inspired by language modeling, where generation is guided by conditioning prompts. However, our framework differs fundamentally from LLMs in two key aspects. First, it employs a bidirectional modeling paradigm that symmetrically allows prompting either from the garment to generate try-on results or from the dressed person to recover the try-off garment. Second, it strictly adheres to Tweedie's formula, enabling faithful estimation of the underlying data distribution during the denoising process. Instead of imposing lower body constraints, OMFA is an entirely mask-free framework that requires only a single portrait and a target garment as input, and is designed to support flexible outfit combinations and cross-person garment transfer, making it better aligned with practical usage scenarios. Additionally, by leveraging SMPL-X-based pose conditioning, OMFA supports multi-view and arbitrary-pose try-on from just one image. Extensive experiments demonstrate that OMFA achieves state-of-the-art results on both try-on and try-off tasks, providing a practical solution for virtual garment synthesis.
♻ ☆ CompTrack: Information Bottleneck-Guided Low-Rank Dynamic Token Compression for Point Cloud Tracking AAAI 2026
3D single object tracking (SOT) in LiDAR point clouds is a critical task in computer vision and autonomous driving. Despite great success having been achieved, the inherent sparsity of point clouds introduces a dual-redundancy challenge that limits existing trackers: (1) vast spatial redundancy from background noise impairs accuracy, and (2) informational redundancy within the foreground hinders efficiency. To tackle these issues, we propose CompTrack, a novel end-to-end framework that systematically eliminates both forms of redundancy in point clouds. First, CompTrack incorporates a Spatial Foreground Predictor (SFP) module to filter out irrelevant background noise based on information entropy, addressing spatial redundancy. Subsequently, its core is an Information Bottleneck-guided Dynamic Token Compression (IB-DTC) module that eliminates the informational redundancy within the foreground. Theoretically grounded in low-rank approximation, this module leverages an online SVD analysis to adaptively compress the redundant foreground into a compact and highly informative set of proxy tokens. Extensive experiments on KITTI, nuScenes and Waymo datasets demonstrate that CompTrack achieves top-performing tracking performance with superior efficiency, running at a real-time 90 FPS on a single RTX 3090 GPU.
comment: Accepted by AAAI 2026 (Oral)
♻ ☆ Spatial-and-Frequency-aware Restoration method for Images based on Diffusion Models
Diffusion models have recently emerged as a promising framework for Image Restoration (IR), owing to their ability to produce high-quality reconstructions and their compatibility with established methods. Existing methods for solving noisy inverse problems in IR, considers the pixel-wise data-fidelity. In this paper, we propose SaFaRI, a spatial-and-frequency-aware diffusion model for IR with Gaussian noise. Our model encourages images to preserve data-fidelity in both the spatial and frequency domains, resulting in enhanced reconstruction quality. We comprehensively evaluate the performance of our model on a variety of noisy inverse problems, including inpainting, denoising, and super-resolution. Our thorough evaluation demonstrates that SaFaRI achieves state-of-the-art performance on both the ImageNet datasets and FFHQ datasets, outperforming existing zero-shot IR methods in terms of LPIPS and FID metrics.
Computers and Society
☆ On the modular platoon-based vehicle-to-vehicle electric charging problem
We formulate a mixed integer linear program (MILP) for a platoon-based vehicle-to-vehicle charging (PV2VC) technology designed for modular vehicles (MVs) and solve it with a genetic algorithm (GA). A set of numerical experiments with five scenarios are tested and the computational performance between the commercial software applied to the MILP model and the proposed GA are compared on a modified Sioux Falls network. By comparison with the optimal benchmark scenario, the results show that the PV2VC technology can save up to 11.07% in energy consumption, 11.65% in travel time, and 11.26% in total cost. For the PV2VC operational scenario, it would be more beneficial for long-distance vehicle routes with low initial state of charge, sparse charging facilities, and where travel time is perceived to be higher than energy consumption costs.
☆ Inclusive education via empathy propagation in schools of students with special education needs
This study presents a theoretical model for identifying emergent scenarios of inclusiveness related to student with special education needs (SEN). Based on variations of the Shelling model of segregation, we explored the propagation of thinking about others as equals (empathy) in students with and without $SEN$ in school environments. We use the complex systems approach for modeling possible scenarios of inclusiveness in which patterns of empathy between students emerge instead of the well-known behavior of segregation. Based on simple transitional rules, which are evaluated by a set of null models, we show the emergence of empathy between students in school environments. Findings suggest that small variations in the incentive of students for being considered as $SEN$ generate the presence of inclusive patterns. In other situations, patterns of segregation are commonly presented.
comment: 17 pages, 6 figres, 2 tables
☆ The Shawshank Redemption of Embodied AI: Understanding and Benchmarking Indirect Environmental Jailbreaks
The adoption of Vision-Language Models (VLMs) in embodied AI agents, while being effective, brings safety concerns such as jailbreaking. Prior work have explored the possibility of directly jailbreaking the embodied agents through elaborated multi-modal prompts. However, no prior work has studied or even reported indirect jailbreaks in embodied AI, where a black-box attacker induces a jailbreak without issuing direct prompts to the embodied agent. In this paper, we propose, for the first time, indirect environmental jailbreak (IEJ), a novel attack to jailbreak embodied AI via indirect prompt injected into the environment, such as malicious instructions written on a wall. Our key insight is that embodied AI does not ''think twice'' about the instructions provided by the environment -- a blind trust that attackers can exploit to jailbreak the embodied agent. We further design and implement open-source prototypes of two fully-automated frameworks: SHAWSHANK, the first automatic attack generation framework for the proposed attack IEJ; and SHAWSHANK-FORGE, the first automatic benchmark generation framework for IEJ. Then, using SHAWSHANK-FORGE, we automatically construct SHAWSHANK-BENCH, the first benchmark for indirectly jailbreaking embodied agents. Together, our two frameworks and one benchmark answer the questions of what content can be used for malicious IEJ instructions, where they should be placed, and how IEJ can be systematically evaluated. Evaluation results show that SHAWSHANK outperforms eleven existing methods across 3,957 task-scene combinations and compromises all six tested VLMs. Furthermore, current defenses only partially mitigate our attack, and we have responsibly disclosed our findings to all affected VLM vendors.
☆ Incorporation of journalistic approaches into algorithm design
The growing adoption of algorithm-powered tools in journalism enables new possibilities and raises many concerns. One way of addressing these concerns is by integrating journalistic practices and values into the design of algorithms that facilitate different journalistic tasks, from automated content generation to news content distribution. In this chapter, we discuss how such integration can happen. To do this, we first introduce the concepts of algorithms and different perspectives on algorithm design and then scrutinize various journalistic viewpoints on the matter and methodological approaches for studying these perspectives and their translation into specific algorithm-powered journalistic tools. We conclude by discussing important directions for future research, ranging from contextualizing journalistic approaches to algorithm design to accounting for the transformative impacts of artificial intelligence (AI) technologies.
☆ An Agent-Based Simulation of Regularity-Driven Student Attrition: How Institutional Time-to-Live Constraints Create a Dropout Trap in Higher Education
High dropout rates in engineering programmes are conventionally attributed to student deficits: lack of academic preparation or motivation. However, this view neglects the causal role of "normative friction": the complex system of administrative rules, exam validity windows, and prerequisite chains that constrain student progression. This paper introduces "The Regularity Trap," a phenomenon where rigid assessment timelines decouple learning from accreditation. We operationalize the CAPIRE framework into a calibrated Agent-Based Model (ABM) simulating 1,343 student trajectories across a 42-course Civil Engineering curriculum. The model integrates empirical course parameters and thirteen psycho-academic archetypes derived from a 15-year longitudinal dataset. By formalizing the "Regularity Regime" as a decaying validity function, we isolate the effect of administrative time limits on attrition. Results reveal that 86.4% of observed dropouts are driven by normative mechanisms (expiry cascades) rather than purely academic failure (5.3%). While the overall dropout rate stabilized at 32.4%, vulnerability was highly heterogeneous: archetypes with myopic planning horizons faced attrition rates up to 49.0%, compared to 13.2% for strategic agents, despite comparable academic ability. These findings challenge the neutrality of administrative structures, suggesting that rigid validity windows act as an invisible filter that disproportionately penalizes students with lower self-regulatory capital.
☆ Future-Back Threat Modeling: A Foresight-Driven Security Framework
Traditional threat modeling remains reactive-focused on known TTPs and past incident data, while threat prediction and forecasting frameworks are often disconnected from operational or architectural artifacts. This creates a fundamental weakness: the most serious cyber threats often do not arise from what is known, but from what is assumed, overlooked, or not yet conceived, and frequently originate from the future, such as artificial intelligence, information warfare, and supply chain attacks, where adversaries continuously develop new exploits that can bypass defenses built on current knowledge. To address this mental gap, this paper introduces the theory and methodology of Future-Back Threat Modeling (FBTM). This predictive approach begins with envisioned future threat states and works backward to identify assumptions, gaps, blind spots, and vulnerabilities in the current defense architecture, providing a clearer and more accurate view of impending threats so that we can anticipate their emergence and shape the future we want through actions taken now. The proposed methodology further aims to reveal known unknowns and unknown unknowns, including tactics, techniques, and procedures that are emerging, anticipated, and plausible. This enhances the predictability of adversary behavior, particularly under future uncertainty, helping security leaders make informed decisions today that shape more resilient security postures for the future.
☆ Artificial Intelligence and Accounting Research: A Framework and Agenda
Recent advances in artificial intelligence, particularly generative AI (GenAI) and large language models (LLMs), are fundamentally transforming accounting research, creating both opportunities and competitive threats for scholars. This paper proposes a framework that classifies AI-accounting research along two dimensions: research focus (accounting-centric versus AI-centric) and methodological approach (AI-based versus traditional methods). We apply this framework to papers from the IJAIS special issue and recent AI-accounting research published in leading accounting journals to map existing studies and identify research opportunities. Using this same framework, we analyze how accounting researchers can leverage their expertise through strategic positioning and collaboration, revealing where accounting scholars' strengths create the most value. We further examine how GenAI and LLMs transform the research process itself, comparing the capabilities of human researchers and AI agents across the entire research workflow. This analysis reveals that while GenAI democratizes certain research capabilities, it simultaneously intensifies competition by raising expectations for higher-order contributions where human judgment, creativity, and theoretical depth remain valuable. These shifts call for reforming doctoral education to cultivate comparative advantages while building AI fluency.
comment: 48 pages, 7 tables
☆ Can Online GenAI Discussion Serve as Bellwether for Labor Market Shifts?
The rapid advancement of Large Language Models (LLMs) has generated considerable speculation regarding their transformative potential for labor markets. However, existing approaches to measuring AI exposure in the workforce predominantly rely on concurrent market conditions, offering limited predictive capacity for anticipating future disruptions. This paper presents a predictive study examining whether online discussions about LLMs can function as early indicators of labor market shifts. We employ four distinct analytical approaches to identify the domains and timeframes in which public discourse serves as a leading signal for employment changes, thereby demonstrating its predictive validity for labor market dynamics. Drawing on a comprehensive dataset that integrates the REALM corpus of LLM discussions, LinkedIn job postings, Indeed employment indices, and over 4 million LinkedIn user profiles, we analyze the relationship between discussion intensity across news media and Reddit forums and subsequent variations in job posting volumes, occupational net change ratios, job tenure patterns, unemployment duration, and transitions to GenAI-related roles across thirteen occupational categories. Our findings reveal that discussion intensity predicts employment changes 1-7 months in advance across multiple indicators, including job postings, net hiring rates, tenure patterns, and unemployment duration. These findings suggest that monitoring online discourse can provide actionable intelligence for workers making reskilling decisions and organizations anticipating skill requirements, offering a real-time complement to traditional labor statistics in navigating technological disruption.
☆ Digital Agriculture Sandbox for Collaborative Research
Digital agriculture is transforming the way we grow food by utilizing technology to make farming more efficient, sustainable, and productive. This modern approach to agriculture generates a wealth of valuable data that could help address global food challenges, but farmers are hesitant to share it due to privacy concerns. This limits the extent to which researchers can learn from this data to inform improvements in farming. This paper presents the Digital Agriculture Sandbox, a secure online platform that solves this problem. The platform enables farmers (with limited technical resources) and researchers to collaborate on analyzing farm data without exposing private information. We employ specialized techniques such as federated learning, differential privacy, and data analysis methods to safeguard the data while maintaining its utility for research purposes. The system enables farmers to identify similar farmers in a simplified manner without needing extensive technical knowledge or access to computational resources. Similarly, it enables researchers to learn from the data and build helpful tools without the sensitive information ever leaving the farmer's system. This creates a safe space where farmers feel comfortable sharing data, allowing researchers to make important discoveries. Our platform helps bridge the gap between maintaining farm data privacy and utilizing that data to address critical food and farming challenges worldwide.
comment: Presents a privacy-preserving digital agriculture platform using federated learning, differential privacy, and secure data analysis to enable collaboration between farmers and researchers without exposing raw data. Demonstrates secure similarity search, model training, and risk-aware data sharing
☆ Fairness in Multi-modal Medical Diagnosis with Demonstration Selection
Multimodal large language models (MLLMs) have shown strong potential for medical image reasoning, yet fairness across demographic groups remains a major concern. Existing debiasing methods often rely on large labeled datasets or fine-tuning, which are impractical for foundation-scale models. We explore In-Context Learning (ICL) as a lightweight, tuning-free alternative for improving fairness. Through systematic analysis, we find that conventional demonstration selection (DS) strategies fail to ensure fairness due to demographic imbalance in selected exemplars. To address this, we propose Fairness-Aware Demonstration Selection (FADS), which builds demographically balanced and semantically relevant demonstrations via clustering-based sampling. Experiments on multiple medical imaging benchmarks show that FADS consistently reduces gender-, race-, and ethnicity-related disparities while maintaining strong accuracy, offering an efficient and scalable path toward fair medical image reasoning. These results highlight the potential of fairness-aware in-context learning as a scalable and data-efficient solution for equitable medical image reasoning.
comment: 10 pages (including 2 pages of references), 4 figures. This work explores fairness in multi-modal medical image reasoning using in-context learning
♻ ☆ Automatically Detecting Online Deceptive Patterns
Deceptive patterns in digital interfaces manipulate users into making unintended decisions, exploiting cognitive biases and psychological vulnerabilities. These patterns have become ubiquitous on various digital platforms. While efforts to mitigate deceptive patterns have emerged from legal and technical perspectives, a significant gap remains in creating usable and scalable solutions. We introduce our AutoBot framework to address this gap and help web stakeholders navigate and mitigate online deceptive patterns. AutoBot accurately identifies and localizes deceptive patterns from a screenshot of a website without relying on the underlying HTML code. AutoBot employs a two-stage pipeline that leverages the capabilities of specialized vision models to analyze website screenshots, identify interactive elements, and extract textual features. Next, using a large language model, AutoBot understands the context surrounding these elements to determine the presence of deceptive patterns. We also use AutoBot, to create a synthetic dataset to distill knowledge from 'teacher' LLMs to smaller language models. Through extensive evaluation, we demonstrate AutoBot's effectiveness in detecting deceptive patterns on the web, achieving an F1-score of 0.93 when detecting deceptive patterns, underscoring its potential as an essential tool for mitigating online deceptive patterns. We implement AutoBot, across three downstream applications targeting different web stakeholders: (1) a local browser extension providing users with real-time feedback, (2) a Lighthouse audit to inform developers of potential deceptive patterns on their sites, and (3) as a measurement tool designed for researchers and regulators.
♻ ☆ Diagnosing Hallucination Risk in AI Surgical Decision-Support: A Sequential Framework for Sequential Validation
Large language models (LLMs) offer transformative potential for clinical decision support in spine surgery but pose significant risks through hallucinations, which are factually inconsistent or contextually misaligned outputs that may compromise patient safety. This study introduces a clinician-centered framework to quantify hallucination risks by evaluating diagnostic precision, recommendation quality, reasoning robustness, output coherence, and knowledge alignment. We assessed six leading LLMs across 30 expert-validated spinal cases. DeepSeek-R1 demonstrated superior overall performance (total score: 86.03 $\pm$ 2.08), particularly in high-stakes domains such as trauma and infection. A critical finding reveals that reasoning-enhanced model variants did not uniformly outperform standard counterparts: Claude-3.7-Sonnet's extended thinking mode underperformed relative to its standard version (80.79 $\pm$ 1.83 vs. 81.56 $\pm$ 1.92), indicating extended chain-of-thought reasoning alone is insufficient for clinical reliability. Multidimensional stress-testing exposed model-specific vulnerabilities, with recommendation quality degrading by 7.4% under amplified complexity. This decline contrasted with marginal improvements in rationality (+2.0%), readability (+1.7%) and diagnosis (+4.7%), highlighting a concerning divergence between perceived coherence and actionable guidance. Our findings advocate integrating interpretability mechanisms (e.g., reasoning chain visualization) into clinical workflows and establish a safety-aware validation framework for surgical LLM deployment.
♻ ☆ Who is using AI to code? Global diffusion and impact of generative AI
Generative coding tools promise big productivity gains, but uneven uptake could widen skill and income gaps. We train a neural classifier to spot AI-generated Python functions in over 30 million GitHub commits by 170,000 developers, tracking how fast -- and where -- these tools take hold. Today, AI writes an estimated 29% of Python functions in the US, a modest and shrinking lead over other countries. We estimate that quarterly output, measured in online code contributions, has increased by 3.6% because of this. Our evidence suggests that programmers using AI may also more readily expand into new domains of software development. However, experienced programmers capture nearly all of these productivity and exploration gains, widening rather than closing the skill gap.
♻ ☆ MajinBook: An open catalogue of digital world literature with likes
This data paper introduces MajinBook, an open catalogue designed to facilitate the use of shadow libraries--such as Library Genesis and Z-Library--for computational social science and cultural analytics. By linking metadata from these vast, crowd-sourced archives with structured bibliographic data from Goodreads, we create a high-precision corpus of over 539,000 references to English-language books spanning three centuries, enriched with first publication dates, genres, and popularity metrics like ratings and reviews. Our methodology prioritizes natively digital EPUB files to ensure machine-readable quality, while addressing biases in traditional corpora like HathiTrust, and includes secondary datasets for French, German, and Spanish. We evaluate the linkage strategy for accuracy, release all underlying data openly, and discuss the project's legal permissibility under EU and US frameworks for text and data mining in research.
comment: 9 pages, 5 figures, 1 table
♻ ☆ LightDefense: A Lightweight Uncertainty-Driven Defense against Jailbreaks via Shifted Token Distribution
Large Language Models (LLMs) face threats from jailbreak prompts. Existing methods for defending against jailbreak attacks are primarily based on auxiliary models. These strategies, however, often require extensive data collection or training. We propose LightDefense, a lightweight defense mechanism targeted at white-box models, which utilizes a safety-oriented direction to adjust the probabilities of tokens in the vocabulary, making safety disclaimers appear among the top tokens after sorting tokens by probability in descending order. We further innovatively leverage LLM's uncertainty about prompts to measure their harmfulness and adaptively adjust defense strength, effectively balancing safety and helpfulness. The effectiveness of LightDefense in defending against 5 attack methods across 2 target LLMs, without compromising helpfulness to benign user queries, highlights its potential as a novel and lightweight defense mechanism, enhancing security of LLMs.
Computation and Language
☆ AccelOpt: A Self-Improving LLM Agentic System for AI Accelerator Kernel Optimization
We present AccelOpt, a self-improving large language model (LLM) agentic system that autonomously optimizes kernels for emerging AI acclerators, eliminating the need for expert-provided hardware-specific optimization knowledge. AccelOpt explores the kernel optimization space through iterative generation, informed by an optimization memory that curates experiences and insights from previously encountered slow-fast kernel pairs. We build NKIBench, a new benchmark suite of AWS Trainium accelerator kernels with varying complexity extracted from real-world LLM workloads to evaluate the effectiveness of AccelOpt. Our evaluation confirms that AccelOpt's capability improves over time, boosting the average percentage of peak throughput from $49\%$ to $61\%$ on Trainium 1 and from $45\%$ to $59\%$ on Trainium 2 for NKIBench kernels. Moreover, AccelOpt is highly cost-effective: using open-source models, it matches the kernel improvements of Claude Sonnet 4 while being $26\times$ cheaper.
☆ Mind the Motions: Benchmarking Theory-of-Mind in Everyday Body Language
Our ability to interpret others' mental states through nonverbal cues (NVCs) is fundamental to our survival and social cohesion. While existing Theory of Mind (ToM) benchmarks have primarily focused on false-belief tasks and reasoning with asymmetric information, they overlook other mental states beyond belief and the rich tapestry of human nonverbal communication. We present Motion2Mind, a framework for evaluating the ToM capabilities of machines in interpreting NVCs. Leveraging an expert-curated body-language reference as a proxy knowledge base, we build Motion2Mind, a carefully curated video dataset with fine-grained nonverbal cue annotations paired with manually verified psychological interpretations. It encompasses 222 types of nonverbal cues and 397 mind states. Our evaluation reveals that current AI systems struggle significantly with NVC interpretation, exhibiting not only a substantial performance gap in Detection, as well as patterns of over-interpretation in Explanation compared to human annotators.
☆ What Really Counts? Examining Step and Token Level Attribution in Multilingual CoT Reasoning
This study investigates the attribution patterns underlying Chain-of-Thought (CoT) reasoning in multilingual LLMs. While prior works demonstrate the role of CoT prompting in improving task performance, there are concerns regarding the faithfulness and interpretability of the generated reasoning chains. To assess these properties across languages, we applied two complementary attribution methods--ContextCite for step-level attribution and Inseq for token-level attribution--to the Qwen2.5 1.5B-Instruct model using the MGSM benchmark. Our experimental results highlight key findings such as: (1) attribution scores excessively emphasize the final reasoning step, particularly in incorrect generations; (2) structured CoT prompting significantly improves accuracy primarily for high-resource Latin-script languages; and (3) controlled perturbations via negation and distractor sentences reduce model accuracy and attribution coherence. These findings highlight the limitations of CoT prompting, particularly in terms of multilingual robustness and interpretive transparency.
comment: Received the Best Student Project Award at RuG's Advanced-NLP course
☆ The Subtle Art of Defection: Understanding Uncooperative Behaviors in LLM based Multi-Agent Systems
This paper introduces a novel framework for simulating and analyzing how uncooperative behaviors can destabilize or collapse LLM-based multi-agent systems. Our framework includes two key components: (1) a game theory-based taxonomy of uncooperative agent behaviors, addressing a notable gap in the existing literature; and (2) a structured, multi-stage simulation pipeline that dynamically generates and refines uncooperative behaviors as agents' states evolve. We evaluate the framework via a collaborative resource management setting, measuring system stability using metrics such as survival time and resource overuse rate. Empirically, our framework achieves 96.7% accuracy in generating realistic uncooperative behaviors, validated by human evaluations. Our results reveal a striking contrast: cooperative agents maintain perfect system stability (100% survival over 12 rounds with 0% resource overuse), while any uncooperative behavior can trigger rapid system collapse within 1 to 7 rounds. These findings demonstrate that uncooperative agents can significantly degrade collective outcomes, highlighting the need for designing more resilient multi-agent systems.
☆ Step-Audio-R1 Technical Report
Recent advances in reasoning models have demonstrated remarkable success in text and vision domains through extended chain-of-thought deliberation. However, a perplexing phenomenon persists in audio language models: they consistently perform better with minimal or no reasoning, raising a fundamental question - can audio intelligence truly benefit from deliberate thinking? We introduce Step-Audio-R1, the first audio reasoning model that successfully unlocks reasoning capabilities in the audio domain. Through our proposed Modality-Grounded Reasoning Distillation (MGRD) framework, Step-Audio-R1 learns to generate audio-relevant reasoning chains that genuinely ground themselves in acoustic features rather than hallucinating disconnected deliberations. Our model exhibits strong audio reasoning capabilities, surpassing Gemini 2.5 Pro and achieving performance comparable to the state-of-the-art Gemini 3 Pro across comprehensive audio understanding and reasoning benchmarks spanning speech, environmental sounds, and music. These results demonstrate that reasoning is a transferable capability across modalities when appropriately anchored, transforming extended deliberation from a liability into a powerful asset for audio intelligence. By establishing the first successful audio reasoning model, Step-Audio-R1 opens new pathways toward building truly multimodal reasoning systems that think deeply across all sensory modalities.
comment: 15 pages, 5 figures. Technical Report
☆ Tokenisation over Bounded Alphabets is Hard
Recent works have shown that tokenisation is NP-complete. However, these works assume tokenisation is applied to inputs with unboundedly large alphabets -- an unrealistic assumption, given that in practice tokenisers operate over fixed-size alphabets, such as bytes or Unicode characters. We close this gap by analysing tokenisation over bounded $n$-ary alphabets, considering two natural variants: bottom-up tokenisation and direct tokenisation, where we must, respectively, select a sequence of merge operations or a vocabulary whose application optimally compresses a dataset. First, we note that proving hardness results for an $n$-ary alphabet proves the same results for alphabets of any larger size. We then prove that even with binary alphabets, both variants are not only NP-complete, but admit no polynomial-time approximation scheme (unless P=NP). We further show that direct tokenisation remains NP-complete even when applied to unary alphabets. While unary alphabets may not be practically useful, this result establishes that the computational intractability of tokenisation is not an artifact of large alphabets or complex constructions, but a fundamental barrier. Overall, our results explain why practical algorithms such as BPE and UnigramLM are heuristic, and points toward approximation algorithms being an important path going forward for tokenisation research.
☆ Think Visually, Reason Textually: Vision-Language Synergy in ARC
Abstract reasoning from minimal examples remains a core unsolved problem for frontier foundation models such as GPT-5 and Grok 4. These models still fail to infer structured transformation rules from a handful of examples, which is a key hallmark of human intelligence. The Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI) provides a rigorous testbed for this capability, demanding conceptual rule induction and transfer to novel tasks. Most existing methods treat ARC-AGI as a purely textual reasoning task, overlooking the fact that humans rely heavily on visual abstraction when solving such puzzles. However, our pilot experiments reveal a paradox: naively rendering ARC-AGI grids as images degrades performance due to imprecise rule execution. This leads to our central hypothesis that vision and language possess complementary strengths across distinct reasoning stages: vision supports global pattern abstraction and verification, whereas language specializes in symbolic rule formulation and precise execution. Building on this insight, we introduce two synergistic strategies: (1) Vision-Language Synergy Reasoning (VLSR), which decomposes ARC-AGI into modality-aligned subtasks; and (2) Modality-Switch Self-Correction (MSSC), which leverages vision to verify text-based reasoning for intrinsic error correction. Extensive experiments demonstrate that our approach yields up to a 4.33% improvement over text-only baselines across diverse flagship models and multiple ARC-AGI tasks. Our findings suggest that unifying visual abstraction with linguistic reasoning is a crucial step toward achieving generalizable, human-like intelligence in future foundation models. Source code will be released soon.
☆ MoDES: Accelerating Mixture-of-Experts Multimodal Large Language Models via Dynamic Expert Skipping
Mixture-of-Experts (MoE) Multimodal large language models (MLLMs) excel at vision-language tasks, but they suffer from high computational inefficiency. To reduce inference overhead, expert skipping methods have been proposed to deactivate redundant experts based on the current input tokens. However, we find that applying these methods-originally designed for unimodal large language models (LLMs)-to MLLMs results in considerable performance degradation. This is primarily because such methods fail to account for the heterogeneous contributions of experts across MoE layers and modality-specific behaviors of tokens within these layers. Motivated by these findings, we propose MoDES, the first training-free framework that adaptively skips experts to enable efficient and accurate MoE MLLM inference. It incorporates a globally-modulated local gating (GMLG) mechanism that integrates global layer-wise importance into local routing probabilities to accurately estimate per-token expert importance. A dual-modality thresholding (DMT) method is then applied, which processes tokens from each modality separately, to derive the skipping schedule. To set the optimal thresholds, we introduce a frontier search algorithm that exploits monotonicity properties, cutting convergence time from several days to a few hours. Extensive experiments for 3 model series across 13 benchmarks demonstrate that MoDES far outperforms previous approaches. For instance, when skipping 88% experts for Qwen3-VL-MoE-30B-A3B-Instruct, the performance boost is up to 10.67% (97.33% vs. 86.66%). Furthermore, MoDES significantly enhances inference speed, improving the prefilling time by 2.16$\times$ and the decoding time by 1.26$\times$.
comment: Code will be released upon acceptance
☆ When to Think and When to Look: Uncertainty-Guided Lookback
Test-time thinking (that is, generating explicit intermediate reasoning chains) is known to boost performance in large language models and has recently shown strong gains for large vision language models (LVLMs). However, despite these promising results, there is still no systematic analysis of how thinking actually affects visual reasoning. We provide the first such analysis with a large scale, controlled comparison of thinking for LVLMs, evaluating ten variants from the InternVL3.5 and Qwen3-VL families on MMMU-val under generous token budgets and multi pass decoding. We show that more thinking is not always better; long chains often yield long wrong trajectories that ignore the image and underperform the same models run in standard instruct mode. A deeper analysis reveals that certain short lookback phrases, which explicitly refer back to the image, are strongly enriched in successful trajectories and correlate with better visual grounding. Building on this insight, we propose uncertainty guided lookback, a training free decoding strategy that combines an uncertainty signal with adaptive lookback prompts and breadth search. Our method improves overall MMMU performance, delivers the largest gains in categories where standard thinking is weak, and outperforms several strong decoding baselines, setting a new state of the art under fixed model families and token budgets. We further show that this decoding strategy generalizes, yielding consistent improvements on five additional benchmarks, including two broad multimodal suites and math focused visual reasoning datasets.
☆ SRPO: Self-Referential Policy Optimization for Vision-Language-Action Models
Vision-Language-Action (VLA) models excel in robotic manipulation but are constrained by their heavy reliance on expert demonstrations, leading to demonstration bias and limiting performance. Reinforcement learning (RL) is a vital post-training strategy to overcome these limits, yet current VLA-RL methods, including group-based optimization approaches, are crippled by severe reward sparsity. Relying on binary success indicators wastes valuable information in failed trajectories, resulting in low training efficiency. To solve this, we propose Self-Referential Policy Optimization (SRPO), a novel VLA-RL framework. SRPO eliminates the need for external demonstrations or manual reward engineering by leveraging the model's own successful trajectories, generated within the current training batch, as a self-reference. This allows us to assign a progress-wise reward to failed attempts. A core innovation is the use of latent world representations to measure behavioral progress robustly. Instead of relying on raw pixels or requiring domain-specific fine-tuning, we utilize the compressed, transferable encodings from a world model's latent space. These representations naturally capture progress patterns across environments, enabling accurate, generalized trajectory comparison. Empirical evaluations on the LIBERO benchmark demonstrate SRPO's efficiency and effectiveness. Starting from a supervised baseline with 48.9% success, SRPO achieves a new state-of-the-art success rate of 99.2% in just 200 RL steps, representing a 103% relative improvement without any extra supervision. Furthermore, SRPO shows substantial robustness, achieving a 167% performance improvement on the LIBERO-Plus benchmark.
☆ HSKBenchmark: Modeling and Benchmarking Chinese Second Language Acquisition in Large Language Models through Curriculum Tuning AAAI-2026
Language acquisition is vital to revealing the nature of human language intelligence and has recently emerged as a promising perspective for improving the interpretability of large language models (LLMs). However, it is ethically and practically infeasible to conduct experiments that require controlling human learners' language inputs. This poses challenges for the verifiability and scalability of language acquisition modeling, particularly in Chinese second language acquisition (SLA). While LLMs provide a controllable and reproducible alternative, a systematic benchmark to support phase-wise modeling and assessment is still lacking. In this paper, we present HSKBenchmark, the first benchmark for staged modeling and writing assessment of LLMs in Chinese SLA. It covers HSK levels 3 to 6 and includes authentic textbooks with 6.76 million tokens, 16K synthetic instruction samples, 30 test topics, and a linguistically grounded evaluation system. To simulate human learning trajectories, we introduce a curriculum-tuning framework that trains models from beginner to advanced levels. An evaluation system is created to examine level-based grammar coverage, writing errors, lexical and syntactic complexity, and holistic scoring. We also build HSKAgent, fine-tuned on 10K learner compositions. Extensive experimental results demonstrate that HSKBenchmark not only models Chinese SLA effectively, but also serves as a reliable benchmark for dynamic writing assessment in LLMs. Our fine-tuned LLMs have writing performance on par with advanced human learners and exhibit human-like acquisition characteristics. The HSKBenchmark, HSKAgent, and checkpoints serve as foundational tools and resources, with the potential to pave the way for future research on language acquisition modeling and LLMs interpretability. Code and data are publicly available at: https://github.com/CharlesYang030/HSKB.
comment: Accepted by AAAI-2026
☆ Computer-Use Agents as Judges for Generative User Interface
Computer-Use Agents (CUA) are becoming increasingly capable of autonomously operating digital environments through Graphical User Interfaces (GUI). Yet, most GUI remain designed primarily for humans--prioritizing aesthetics and usability--forcing agents to adopt human-oriented behaviors that are unnecessary for efficient task execution. At the same time, rapid advances in coding-oriented language models (Coder) have transformed automatic GUI design. This raises a fundamental question: Can CUA as judges to assist Coder for automatic GUI design? To investigate, we introduce AUI-Gym, a benchmark for Automatic GUI development spanning 52 applications across diverse domains. Using language models, we synthesize 1560 tasks that simulate real-world scenarios. To ensure task reliability, we further develop a verifier that programmatically checks whether each task is executable within its environment. Building on this, we propose a Coder-CUA in Collaboration framework: the Coder acts as Designer, generating and revising websites, while the CUA serves as Judge, evaluating functionality and refining designs. Success is measured not by visual appearance, but by task solvability and CUA navigation success rate. To turn CUA feedback into usable guidance, we design a CUA Dashboard that compresses multi-step navigation histories into concise visual summaries, offering interpretable guidance for iterative redesign. By positioning agents as both designers and judges, our framework shifts interface design toward agent-native efficiency and reliability. Our work takes a step toward shifting agents from passive use toward active participation in digital environments. Our code and dataset are available at https://github.com/showlab/AUI.
comment: Project: https://showlab.github.io/AUI Github: https://github.com/showlab/AUI
☆ Standardising the NLP Workflow: A Framework for Reproducible Linguistic Analysis
The introduction of large language models and other influential developments in AI-based language processing have led to an evolution in the methods available to quantitatively analyse language data. With the resultant growth of attention on language processing, significant challenges have emerged, including the lack of standardisation in organising and sharing linguistic data and the absence of standardised and reproducible processing methodologies. Striving for future standardisation, we first propose the Language Processing Data Structure (LPDS), a data structure inspired by the Brain Imaging Data Structure (BIDS), a widely adopted standard for handling neuroscience data. It provides a folder structure and file naming conventions for linguistic research. Second, we introduce pelican nlp, a modular and extensible Python package designed to enable streamlined language processing, from initial data cleaning and task-specific preprocessing to the extraction of sophisticated linguistic and acoustic features, such as semantic embeddings and prosodic metrics. The entire processing workflow can be specified within a single, shareable configuration file, which pelican nlp then executes on LPDS-formatted data. Depending on the specifications, the reproducible output can consist of preprocessed language data or standardised extraction of both linguistic and acoustic features and corresponding result aggregations. LPDS and pelican nlp collectively offer an end-to-end processing pipeline for linguistic data, designed to ensure methodological transparency and enhance reproducibility.
comment: 26 pages, 3 figures
☆ CroPS: Improving Dense Retrieval with Cross-Perspective Positive Samples in Short-Video Search AAAI-2026
Dense retrieval has become a foundational paradigm in modern search systems, especially on short-video platforms. However, most industrial systems adopt a self-reinforcing training pipeline that relies on historically exposed user interactions for supervision. This paradigm inevitably leads to a filter bubble effect, where potentially relevant but previously unseen content is excluded from the training signal, biasing the model toward narrow and conservative retrieval. In this paper, we present CroPS (Cross-Perspective Positive Samples), a novel retrieval data engine designed to alleviate this problem by introducing diverse and semantically meaningful positive examples from multiple perspectives. CroPS enhances training with positive signals derived from user query reformulation behavior (query-level), engagement data in recommendation streams (system-level), and world knowledge synthesized by large language models (knowledge-level). To effectively utilize these heterogeneous signals, we introduce a Hierarchical Label Assignment (HLA) strategy and a corresponding H-InfoNCE loss that together enable fine-grained, relevance-aware optimization. Extensive experiments conducted on Kuaishou Search, a large-scale commercial short-video search platform, demonstrate that CroPS significantly outperforms strong baselines both offline and in live A/B tests, achieving superior retrieval performance and reducing query reformulation rates. CroPS is now fully deployed in Kuaishou Search, serving hundreds of millions of users daily.
comment: AAAI-2026, Oral
☆ LLM-MemCluster: Empowering Large Language Models with Dynamic Memory for Text Clustering
Large Language Models (LLMs) are reshaping unsupervised learning by offering an unprecedented ability to perform text clustering based on their deep semantic understanding. However, their direct application is fundamentally limited by a lack of stateful memory for iterative refinement and the difficulty of managing cluster granularity. As a result, existing methods often rely on complex pipelines with external modules, sacrificing a truly end-to-end approach. We introduce LLM-MemCluster, a novel framework that reconceptualizes clustering as a fully LLM-native task. It leverages a Dynamic Memory to instill state awareness and a Dual-Prompt Strategy to enable the model to reason about and determine the number of clusters. Evaluated on several benchmark datasets, our tuning-free framework significantly and consistently outperforms strong baselines. LLM-MemCluster presents an effective, interpretable, and truly end-to-end paradigm for LLM-based text clustering.
☆ Building Robust and Scalable Multilingual ASR for Indian Languages
This paper describes the systems developed by SPRING Lab, Indian Institute of Technology Madras, for the ASRU MADASR 2.0 challenge. The systems developed focuses on adapting ASR systems to improve in predicting the language and dialect of the utterance among 8 languages across 33 dialects. We participated in Track 1 and Track 2, which restricts the use of additional data and develop from-the-scratch multilingual systems. We presented a novel training approach using Multi-Decoder architecture with phonemic Common Label Set (CLS) as intermediate representation. It improved the performance over the baseline (in the CLS space). We also discuss various methods used to retain the gain obtained in the phonemic space while converting them back to the corresponding grapheme representations. Our systems beat the baseline in 3 languages (Track 2) in terms of WER/CER and achieved the highest language ID and dialect ID accuracy among all participating teams (Track 2).
☆ NAMeGEn: Creative Name Generation via A Novel Agent-based Multiple Personalized Goal Enhancement Framework
Trained on diverse human-authored texts, Large Language Models (LLMs) unlocked the potential for Creative Natural Language Generation (CNLG), benefiting various applications like advertising and storytelling. Nevertheless, CNLG still remains difficult due to two main challenges. (1) Multi-objective flexibility: user requirements are often personalized, fine-grained, and pluralistic, which LLMs struggle to satisfy simultaneously; (2) Interpretive complexity: beyond generation, creativity also involves understanding and interpreting implicit meaning to enhance users' perception. These challenges significantly limit current methods, especially in short-form text generation, in generating creative and insightful content. To address this, we focus on Chinese baby naming, a representative short-form CNLG task requiring adherence to explicit user constraints (e.g., length, semantics, anthroponymy) while offering meaningful aesthetic explanations. We propose NAMeGEn, a novel multi-agent optimization framework that iteratively alternates between objective extraction, name generation, and evaluation to meet diverse requirements and generate accurate explanations. To support this task, we further construct a classical Chinese poetry corpus with 17k+ poems to enhance aesthetics, and introduce CBNames, a new benchmark with tailored metrics. Extensive experiments demonstrate that NAMeGEn effectively generates creative names that meet diverse, personalized requirements while providing meaningful explanations, outperforming six baseline methods spanning various LLM backbones without any training.
comment: 13 pages,9 figures. This work has been submitted to the IEEE for possible publication
☆ DEPO: Dual-Efficiency Preference Optimization for LLM Agents AAAI 2026
Recent advances in large language models (LLMs) have greatly improved their reasoning and decision-making abilities when deployed as agents. Richer reasoning, however, often comes at the cost of longer chain of thought (CoT), hampering interaction efficiency in real-world scenarios. Nevertheless, there still lacks systematic definition of LLM agent efficiency, hindering targeted improvements. To this end, we introduce dual-efficiency, comprising (i) step-level efficiency, which minimizes tokens per step, and (ii) trajectory-level efficiency, which minimizes the number of steps to complete a task. Building on this definition, we propose DEPO, a dual-efficiency preference optimization method that jointly rewards succinct responses and fewer action steps. Experiments on WebShop and BabyAI show that DEPO cuts token usage by up to 60.9% and steps by up to 26.9%, while achieving up to a 29.3% improvement in performance. DEPO also generalizes to three out-of-domain math benchmarks and retains its efficiency gains when trained on only 25% of the data. Our project page is at https://opencausalab.github.io/DEPO.
comment: Accepted to AAAI 2026
☆ A Compliance-Preserving Retrieval System for Aircraft MRO Task Search
Aircraft Maintenance Technicians (AMTs) spend up to 30% of work time searching manuals, a documented efficiency bottleneck in MRO operations where every procedure must be traceable to certified sources. We present a compliance-preserving retrieval system that adapts LLM reranking and semantic search to aviation MRO environments by operating alongside, rather than replacing, certified legacy viewers. The system constructs revision-robust embeddings from ATA chapter hierarchies and uses vision-language parsing to structure certified content, allowing technicians to preview ranked tasks and access verified procedures in existing viewers. Evaluation on 49k synthetic queries achieves >90% retrieval accuracy, while bilingual controlled studies with 10 licensed AMTs demonstrate 90.9% top-10 success rate and 95% reduction in lookup time, from 6-15 minutes to 18 seconds per task. These gains provide concrete evidence that semantic retrieval can operate within strict regulatory constraints and meaningfully reduce operational workload in real-world multilingual MRO workflows.
☆ The Empowerment of Science of Science by Large Language Models: New Tools and Methods
Large language models (LLMs) have exhibited exceptional capabilities in natural language understanding and generation, image recognition, and multimodal tasks, charting a course towards AGI and emerging as a central issue in the global technological race. This manuscript conducts a comprehensive review of the core technologies that support LLMs from a user standpoint, including prompt engineering, knowledge-enhanced retrieval augmented generation, fine tuning, pretraining, and tool learning. Additionally, it traces the historical development of Science of Science (SciSci) and presents a forward looking perspective on the potential applications of LLMs within the scientometric domain. Furthermore, it discusses the prospect of an AI agent based model for scientific evaluation, and presents new research fronts detection and knowledge graph building methods with LLMs.
comment: The manuscript is currently ongoing the underreview process of the journal of information science
☆ HEAD-QA v2: Expanding a Healthcare Benchmark for Reasoning
We introduce HEAD-QA v2, an expanded and updated version of a Spanish/English healthcare multiple-choice reasoning dataset originally released by Vilares and Gómez-Rodríguez (2019). The update responds to the growing need for high-quality datasets that capture the linguistic and conceptual complexity of healthcare reasoning. We extend the dataset to over 12,000 questions from ten years of Spanish professional exams, benchmark several open-source LLMs using prompting, RAG, and probability-based answer selection, and provide additional multilingual versions to support future work. Results indicate that performance is mainly driven by model scale and intrinsic reasoning ability, with complex inference strategies obtaining limited gains. Together, these results establish HEAD-QA v2 as a reliable resource for advancing research on biomedical reasoning and model improvement.
comment: Preprint. 12 pages
☆ SkyEgg: Joint Implementation Selection and Scheduling for Hardware Synthesis using E-graphs
Hardware synthesis from high-level descriptions remains fundamentally limited by the sequential optimization of interdependent design decisions. Current methodologies, including state-of-the-art high-level synthesis (HLS) tools, artificially separate implementation selection from scheduling, leading to suboptimal designs that cannot fully exploit modern FPGA heterogeneous architectures. Implementation selection is typically performed by ad-hoc pattern matching on operations, a process that does not consider the impact on scheduling. Subsequently, scheduling algorithms operate on fixed selection solutions with inaccurate delay estimates, which misses critical optimization opportunities from appropriately configured FPGA blocks like DSP slices. We present SkyEgg, a novel hardware synthesis framework that jointly optimizes implementation selection and scheduling using the e-graph data structure. Our key insight is that both algebraic transformations and hardware implementation choices can be uniformly represented as rewrite rules within an e-graph, modeling the complete design space of implementation candidates to be selected and scheduled together. First, SkyEgg constructs an e-graph from the input program. It then applies both algebraic and implementation rewrites through equality saturation. Finally, it formulates the joint optimization as a mixed-integer linear programming (MILP) problem on the saturated e-graph. We provide both exact MILP solving and an efficient ASAP heuristic for scalable synthesis. Our evaluation on benchmarks from diverse applications targeting Xilinx Kintex UltraScale+ FPGAs demonstrates that SkyEgg achieves an average speedup of 3.01x over Vitis HLS, with improvements up to 5.22x for complex expressions.
☆ MAPROC at AHaSIS Shared Task: Few-Shot and Sentence Transformer for Sentiment Analysis of Arabic Hotel Reviews
Sentiment analysis of Arabic dialects presents significant challenges due to linguistic diversity and the scarcity of annotated data. This paper describes our approach to the AHaSIS shared task, which focuses on sentiment analysis on Arabic dialects in the hospitality domain. The dataset comprises hotel reviews written in Moroccan and Saudi dialects, and the objective is to classify the reviewers sentiment as positive, negative, or neutral. We employed the SetFit (Sentence Transformer Fine-tuning) framework, a data-efficient few-shot learning technique. On the official evaluation set, our system achieved an F1 of 73%, ranking 12th among 26 participants. This work highlights the potential of few-shot learning to address data scarcity in processing nuanced dialectal Arabic text within specialized domains like hotel reviews.
☆ ChartEditor: A Reinforcement Learning Framework for Robust Chart Editing AAAI 2026
Chart editing reduces manual effort in visualization design. Typical benchmarks limited in data diversity and assume access to complete chart code, which is seldom in real-world scenarios. To address this gap, we present ChartEditVista, a comprehensive benchmark consisting of 7,964 samples spanning 31 chart categories. It encompasses diverse editing instructions and covers nearly all editable chart elements. The inputs in ChartEditVista include only the original chart image and natural language editing instructions, without the original chart codes. ChartEditVista is generated through a fully automated pipeline that produces, edits, and verifies charts, ensuring high-quality chart editing data. Besides, we introduce two novel fine-grained, rule-based evaluation metrics: the layout metric, which evaluates the position, size and color of graphical components; and the text metric, which jointly assesses textual content and font styling. Building on top of ChartEditVista, we present ChartEditor, a model trained using a reinforcement learning framework that incorporates a novel rendering reward to simultaneously enforce code executability and visual fidelity. Through extensive experiments and human evaluations, we demonstrate that ChartEditVista provides a robust evaluation, while ChartEditor consistently outperforms models with similar-scale and larger-scale on chart editing tasks.
comment: Accept to AAAI 2026 Main Track
☆ IndicGEC: Powerful Models, or a Measurement Mirage?
In this paper, we report the results of the TeamNRC's participation in the BHASHA-Task 1 Grammatical Error Correction shared task https://github.com/BHASHA-Workshop/IndicGEC2025/ for 5 Indian languages. Our approach, focusing on zero/few-shot prompting of language models of varying sizes (4B to large proprietary models) achieved a Rank 4 in Telugu and Rank 2 in Hindi with GLEU scores of 83.78 and 84.31 respectively. In this paper, we extend the experiments to the other three languages of the shared task - Tamil, Malayalam and Bangla, and take a closer look at the data quality and evaluation metric used. Our results primarily highlight the potential of small language models, and summarize the concerns related to creating good quality datasets and appropriate metrics for this task that are suitable for Indian language scripts.
comment: Technical report
☆ M, Toolchain and Language for Reusable Model Compilation
Complex software-driven systems often interleave distributed, concurrent computation processes with physical interactions with the environment. Developing these systems more efficiently and safely can be achieved by employing actionable, software-based models. From a high-level system model, engineers often need to derive multiple specialized models for different purposes, including simulation, deployment, and formal verification. Each of these target models usually rely on its own formalism, specification language, and execution platform. Traditionally, a compiler analyzes a program written in a programming language and generates executable code. In contrast, a model compiler processes a source model written in a modeling language and should ideally support the generation of multiple heterogeneous targets. However, most existing modeling languages are designed with a narrow focus, typically targeting only simulation or implementation. Multi-target compilation, when not considered during the language's early design, becomes significantly harder to achieve. In this paper, we introduce our initiative: a toolchain and modeling language called M, designed to support system modeling and multi-target compilation for model-driven engineering of complex, concurrent, and time-aware systems. M is a textual, grammar-driven language based on the actor model and extended with discrete-event scheduling semantics. It provides constructs for modeling system entities, message-based interactions, and time- or state-triggered reactions. From such models, M enables the systematic generation of diverse target artifacts while preserving semantic conformance to the original model. Moreover, M can serve as a middle language to which other modeling languages may anchor, thereby allowing them to benefit from its compilation framework.
☆ Context Cascade Compression: Exploring the Upper Limits of Text Compression
Million-level token inputs in long-context tasks pose significant computational and memory challenges for Large Language Models (LLMs). Recently, DeepSeek-OCR conducted research into the feasibility of Contexts Optical Compression and achieved preliminary results. Inspired by this, we introduce Context Cascade Compression C3 to explore the upper limits of text compression. Our method cascades two LLMs of different sizes to handle the compression and decoding tasks. Specifically, a small LLM, acting as the first stage, performs text compression by condensing a long context into a set of latent tokens (e.g., 32 or 64 in length), achieving a high ratio of text tokens to latent tokens. A large LLM, as the second stage, then executes the decoding task on this compressed context. Experiments show that at a 20x compression ratio (where the number of text tokens is 20 times the number of latent tokens), our model achieves 98% decoding accuracy, compared to approximately 60% for DeepSeek-OCR. When we further increase the compression ratio to 40x, the accuracy is maintained at around 93%. This indicates that in the domain of context compression, C3 Compression demonstrates superior performance and feasibility over optical character compression. C3 uses a simpler, pure-text pipeline that ignores factors like layout, color, and information loss from a visual encoder. This also suggests a potential upper bound for compression ratios in future work on optical character compression, OCR, and related fields. Codes and model weights are publicly accessible at https://github.com/liufanfanlff/C3-Context-Cascade-Compression
☆ Unveiling Intrinsic Dimension of Texts: from Academic Abstract to Creative Story
Intrinsic dimension (ID) is an important tool in modern LLM analysis, informing studies of training dynamics, scaling behavior, and dataset structure, yet its textual determinants remain underexplored. We provide the first comprehensive study grounding ID in interpretable text properties through cross-encoder analysis, linguistic features, and sparse autoencoders (SAEs). In this work, we establish three key findings. First, ID is complementary to entropy-based metrics: after controlling for length, the two are uncorrelated, with ID capturing geometric complexity orthogonal to prediction quality. Second, ID exhibits robust genre stratification: scientific prose shows low ID (~8), encyclopedic content medium ID (~9), and creative/opinion writing high ID (~10.5) across all models tested. This reveals that contemporary LLMs find scientific text "representationally simple" while fiction requires additional degrees of freedom. Third, using SAEs, we identify causal features: scientific signals (formal tone, report templates, statistics) reduce ID; humanized signals (personalization, emotion, narrative) increase it. Steering experiments confirm these effects are causal. Thus, for contemporary models, scientific writing appears comparatively "easy", whereas fiction, opinion, and affect add representational degrees of freedom. Our multi-faceted analysis provides practical guidance for the proper use of ID and the sound interpretation of ID-based results.
☆ HinTel-AlignBench: A Framework and Benchmark for Hindi-Telugu with English-Aligned Samples
With nearly 1.5 billion people and more than 120 major languages, India represents one of the most diverse regions in the world. As multilingual Vision-Language Models (VLMs) gain prominence, robust evaluation methodologies are essential to drive progress toward equitable AI for low-resource languages. Current multilingual VLM evaluations suffer from four major limitations: reliance on unverified auto-translations, narrow task/domain coverage, limited sample sizes, and lack of cultural and natively sourced Question-Answering (QA). To address these gaps, we present a scalable framework to evaluate VLMs in Indian languages and compare it with performance in English. Using the framework, we generate HinTel-AlignBench, a benchmark that draws from diverse sources in Hindi and Telugu with English-aligned samples. Our contributions are threefold: (1) a semi-automated dataset creation framework combining back-translation, filtering, and human verification; (2) the most comprehensive vision-language benchmark for Hindi and and Telugu, including adapted English datasets (VQAv2, RealWorldQA, CLEVR-Math) and native novel Indic datasets (JEE for STEM, VAANI for cultural grounding) with approximately 4,000 QA pairs per language; and (3) a detailed performance analysis of various State-of-the-Art (SOTA) open-weight and closed-source VLMs. We find a regression in performance for tasks in English versus in Indian languages for 4 out of 5 tasks across all the models, with an average regression of 8.3 points in Hindi and 5.5 points for Telugu. We categorize common failure modes to highlight concrete areas of improvement in multilingual multimodal understanding.
☆ Teaching According to Students' Aptitude: Personalized Mathematics Tutoring via Persona-, Memory-, and Forgetting-Aware LLMs AAAI 2026
Large Language Models (LLMs) are increasingly integrated into intelligent tutoring systems to provide human-like and adaptive instruction. However, most existing approaches fail to capture how students' knowledge evolves dynamically across their proficiencies, conceptual gaps, and forgetting patterns. This challenge is particularly acute in mathematics tutoring, where effective instruction requires fine-grained scaffolding precisely calibrated to each student's mastery level and cognitive retention. To address this issue, we propose TASA (Teaching According to Students' Aptitude), a student-aware tutoring framework that integrates persona, memory, and forgetting dynamics for personalized mathematics learning. Specifically, TASA maintains a structured student persona capturing proficiency profiles and an event memory recording prior learning interactions. By incorporating a continuous forgetting curve with knowledge tracing, TASA dynamically updates each student's mastery state and generates contextually appropriate, difficulty-calibrated questions and explanations. Empirical results demonstrate that TASA achieves superior learning outcomes and more adaptive tutoring behavior compared to representative baselines, underscoring the importance of modeling temporal forgetting and learner profiles in LLM-based tutoring systems.
comment: AAAI 2026 Workshop
☆ Generating Natural-Language Surgical Feedback: From Structured Representation to Domain-Grounded Evaluation ML4H 2025
High-quality intraoperative feedback from a surgical trainer is pivotal for improving trainee performance and long-term skill acquisition. Automating natural, trainer-style feedback promises timely, accessible, and consistent guidance at scale but requires models that understand clinically relevant representations. We present a structure-aware pipeline that learns a surgical action ontology from real trainer-to-trainee transcripts (33 surgeries) and uses it to condition feedback generation. We contribute by (1) mining Instrument-Action-Target (IAT) triplets from real-world feedback text and clustering surface forms into normalized categories, (2) fine-tuning a video-to-IAT model that leverages the surgical procedure and task contexts as well as fine-grained temporal instrument motion, and (3) demonstrating how to effectively use IAT triplet representations to guide GPT-4o in generating clinically grounded, trainer-style feedback. We show that, on Task 1: Video-to-IAT recognition, our context injection and temporal tracking deliver consistent AUC gains (Instrument: 0.67 to 0.74; Action: 0.60 to 0.63; Tissue: 0.74 to 0.79). For Task 2: feedback text generation (rated on a 1-5 fidelity rubric where 1 = opposite/unsafe, 3 = admissible, and 5 = perfect match to a human trainer), GPT-4o from video alone scores 2.17, while IAT conditioning reaches 2.44 (+12.4%), doubling the share of admissible generations with score >= 3 from 21% to 42%. Traditional text-similarity metrics also improve: word error rate decreases by 15-31% and ROUGE (phrase/substring overlap) increases by 9-64%. Grounding generation in explicit IAT structure improves fidelity and yields clinician-verifiable rationales, supporting auditable use in surgical training.
comment: Accepted as proceedings paper for ML4H 2025
☆ CASTELLA: Long Audio Dataset with Captions and Temporal Boundaries
We introduce CASTELLA, a human-annotated audio benchmark for the task of audio moment retrieval (AMR). Although AMR has various useful potential applications, there is still no established benchmark with real-world data. The early study of AMR trained the model with solely synthetic datasets. Moreover, the evaluation is based on annotated dataset of fewer than 100 samples. This resulted in less reliable reported performance. To ensure performance for applications in real-world environments, we present CASTELLA, a large-scale manually annotated AMR dataset. CASTELLA consists of 1,009, 213, and 640 audio recordings for train, valid, and test split, respectively, which is 24 times larger than the previous dataset. We also establish a baseline model for AMR using CASTELLA. Our experiments demonstrate that a model fine-tuned on CASTELLA after pre-training on the synthetic data outperformed a model trained solely on the synthetic data by 10.4 points in Recall1@0.7. CASTELLA is publicly available in https://h-munakata.github.io/CASTELLA-demo/.
♻ ☆ Discriminating Form and Meaning in Multilingual Models with Minimal-Pair ABX Tasks EMNLP 2025
We introduce a set of training-free ABX-style discrimination tasks to evaluate how multilingual language models represent language identity (form) and semantic content (meaning). Inspired from speech processing, these zero-shot tasks measure whether minimal differences in representation can be reliably detected. This offers a flexible and interpretable alternative to probing. Applied to XLM-R (Conneau et al, 2020) across pretraining checkpoints and layers, we find that language discrimination declines over training and becomes concentrated in lower layers, while meaning discrimination strengthens over time and stabilizes in deeper layers. We then explore probing tasks, showing some alignment between our metrics and linguistic learning performance. Our results position ABX tasks as a lightweight framework for analyzing the structure of multilingual representations.
comment: Comments: Published in EMNLP 2025. https://aclanthology.org/2025.emnlp-main.1210.pdf
♻ ☆ Verbalized Algorithms NeurIPS 2025
Instead of querying LLMs in a one-shot manner and hoping to get the right answer for a reasoning task, we propose a paradigm we call \emph{verbalized algorithms} (VAs), which leverage classical algorithms with established theoretical understanding. VAs decompose a task into simple elementary operations on natural language strings that they should be able to answer reliably, and limit the scope of LLMs to only those simple tasks. For example, for sorting a series of natural language strings, \emph{verbalized sorting} uses an LLM as a binary comparison oracle in a known and well-analyzed sorting algorithm (e.g., bitonic sorting network). We demonstrate the effectiveness of this approach on sorting and clustering tasks.
comment: Accepted in NeurIPS 2025 Workshop on Efficient Reasoning
♻ ☆ Efficient Architectures for High Resolution Vision-Language Models COLING 2025
Vision-Language Models (VLMs) have recently experienced significant advancements. However, challenges persist in the accurate recognition of fine details within high resolution images, which limits performance in multiple tasks. This work introduces Pheye, a novel architecture that efficiently processes high-resolution images while training fewer parameters than similarly sized VLMs. Notably, Pheye achieves a high efficiency while maintaining strong performance, particularly in tasks that demand fine-grained image understanding and/or the handling of scene-text.
comment: Accepted at COLING 2025
♻ ☆ HAWAII: Hierarchical Visual Knowledge Transfer for Efficient Vision-Language Models NeurIPS 2025
Improving the visual understanding ability of vision-language models (VLMs) is crucial for enhancing their performance across various tasks. While using multiple pretrained visual experts has shown great promise, it often incurs significant computational costs during training and inference. To address this challenge, we propose HAWAII, a novel framework that distills knowledge from multiple visual experts into a single vision encoder, enabling it to inherit the complementary strengths of several experts with minimal computational overhead. To mitigate conflicts among different teachers and switch between different teacher-specific knowledge, instead of using a fixed set of adapters for multiple teachers, we propose to use teacher-specific Low-Rank Adaptation (LoRA) adapters with a corresponding router. Each adapter is aligned with a specific teacher, avoiding noisy guidance during distillation. To enable efficient knowledge distillation, we propose fine-grained and coarse-grained distillation. At the fine-grained level, token importance scores are employed to emphasize the most informative tokens from each teacher adaptively. At the coarse-grained level, we summarize the knowledge from multiple teachers and transfer it to the student using a set of general-knowledge LoRA adapters with a router. Extensive experiments on various vision-language tasks demonstrate the superiority of HAWAII compared to popular open-source VLMs. The code is available at https://github.com/yimuwangcs/wise-hawaii.
comment: NeurIPS 2025
♻ ☆ Diagnosing the Performance Trade-off in Moral Alignment: A Case Study on Gender Stereotypes
Moral alignment has emerged as a widely adopted approach for regulating the behavior of pretrained language models (PLMs), typically through fine-tuning on curated datasets. Gender stereotype mitigation is a representational task within the broader application of moral alignment. However, this process often comes at the cost of degraded downstream task performance. Prior studies commonly aim to achieve a performance trade-off by encouraging PLMs to selectively forget only stereotypical knowledge through carefully designed fairness objective, while preserving their language modeling capability (overall forgetting). In this short paper, we investigate whether the performance trade-off can be achieved through the lens of forgetting and the fairness objective. Our analysis shows that the large datasets needed for satisfactory fairness highlight the limitations of current fairness objectives in achieving an effective trade-off: (1) downstream task performance is strongly correlated with overall forgetting; (2) selective forgetting reduces stereotypes, but overall forgetting increases. and (3) general solutions for alleviating forgetting are ineffective at reducing the overall forgetting and fail to improve downstream task performance.
♻ ☆ Efficient Environmental Claim Detection with Hyperbolic Graph Neural Networks
Transformer based models, especially large language models (LLMs) dominate the field of NLP with their mass adoption in tasks such as text generation, summarization and fake news detection. These models offer ease of deployment and reliability for most applications, however, they require significant amounts of computational power for training as well as inference. This poses challenges in their adoption in resource-constrained applications, especially in the open-source community where compute availability is usually scarce. This work proposes a graph-based approach for Environmental Claim Detection, exploring Graph Neural Networks (GNNs) and Hyperbolic Graph Neural Networks (HGNNs) as lightweight yet effective alternatives to transformer-based models. Re-framing the task as a graph classification problem, we transform claim sentences into dependency parsing graphs, utilizing a combination of word2vec \& learnable part-of-speech (POS) tag embeddings for the node features and encoding syntactic dependencies in the edge relations. Our results show that our graph-based models, particularly HGNNs in the poincaré space (P-HGNNs), achieve performance superior to the state-of-the-art on environmental claim detection while using up to \textbf{30x fewer parameters}. We also demonstrate that HGNNs benefit vastly from explicitly modeling data in hierarchical (tree-like) structures, enabling them to significantly improve over their euclidean counterparts.
♻ ☆ Steering Evaluation-Aware Language Models to Act Like They Are Deployed
Large language models (LLMs) can sometimes detect when they are being evaluated and adjust their behavior to appear more aligned, compromising the reliability of safety evaluations. In this paper, we show that adding a steering vector to an LLM's activations can suppress evaluation-awareness and make the model act like it is deployed during evaluation. To study our steering technique, we train an LLM to exhibit evaluation-aware behavior using a two-step training process designed to mimic how this behavior could emerge naturally. First, we perform continued pretraining on documents with factual descriptions of the model (1) using Python type hints during evaluation but not during deployment and (2) recognizing that the presence of a certain evaluation cue always means that it is being tested. Then, we train the model with expert iteration to use Python type hints in evaluation settings. The resulting model is evaluation-aware: it writes type hints in evaluation contexts more than deployment contexts. We find that activation steering can suppress evaluation awareness and make the model act like it is deployed even when the cue is present. Importantly, we constructed our steering vector using the original model before our additional training. Our results suggest that AI evaluators could improve the reliability of safety evaluations by steering models to act like they are deployed.
♻ ☆ Auditing Google's AI Overviews and Featured Snippets: A Case Study on Baby Care and Pregnancy AAAI
Google Search increasingly surfaces AI-generated content through features like AI Overviews (AIO) and Featured Snippets (FS), which users frequently rely on despite having no control over their presentation. Through a systematic algorithm audit of 1,508 real baby care and pregnancy-related queries, we evaluate the quality and consistency of these information displays. Our robust evaluation framework assesses multiple quality dimensions, including answer consistency, relevance, presence of medical safeguards, source categories, and sentiment alignment. Our results reveal concerning gaps in information consistency, with information in AIO and FS displayed on the same search result page being inconsistent with each other in 33% of cases. Despite high relevance scores, both features critically lack medical safeguards (present in just 11% of AIO and 7% of FS responses). While health and wellness websites dominate source categories for both, AIO and FS, FS also often link to commercial sources. These findings have important implications for public health information access and demonstrate the need for stronger quality controls in AI-mediated health information. Our methodology provides a transferable framework for auditing AI systems across high-stakes domains where information quality directly impacts user well-being.
comment: 18 pages, 10 figures; to appear in AAAI ICWSM 2026
♻ ☆ CAIRe: Cultural Attribution of Images by Retrieval-Augmented Evaluation
As text-to-image models become increasingly prevalent, ensuring their equitable performance across diverse cultural contexts is critical. Efforts to mitigate cross-cultural biases have been hampered by trade-offs, including a loss in performance, factual inaccuracies, or offensive outputs. Despite widespread recognition of these challenges, an inability to reliably measure these biases has stalled progress. To address this gap, we introduce CAIRe, an evaluation metric that assesses the degree of cultural relevance of an image, given a user-defined set of labels. Our framework grounds entities and concepts in the image to a knowledge base and uses factual information to give independent graded judgments for each culture label. On a manually curated dataset of culturally salient but rare items built using language models, CAIRe surpasses all baselines by 22% F1 points. Additionally, we construct two datasets for culturally universal concepts, one comprising T2I-generated outputs and another retrieved from naturally occurring data. CAIRe achieves Pearson's correlations of 0.56 and 0.66 with human ratings on these sets, based on a 5-point Likert scale of cultural relevance. This demonstrates its strong alignment with human judgment across diverse image sources.
comment: Preprint, under review
♻ ☆ Exploration of Summarization by Generative Language Models for Automated Scoring of Long Essays
BERT and its variants are extensively explored for automated scoring. However, a limit of 512 tokens for these encoder-based models showed the deficiency in automated scoring of long essays. Thus, this research explores generative language models for automated scoring of long essays via summarization and prompting. The results revealed great improvement of scoring accuracy with QWK increased from 0.822 to 0.8878 for the Learning Agency Lab Automated Essay Scoring 2.0 dataset.
comment: 19 pages, 5 Tables 7 Figures, Presentation at Artificial Intelligence in Measurement and Education Conference (AIME-Con)
♻ ☆ Fairshare Data Pricing via Data Valuation for Large Language Models
Training data is the backbone of large language models (LLMs), yet today's data markets often operate under exploitative pricing -- sourcing data from marginalized groups with little pay or recognition. This paper introduces a theoretical framework for LLM data markets, modeling the strategic interactions between buyers (LLM builders) and sellers (human annotators). We begin with theoretical and empirical analysis showing how exploitative pricing drives high-quality sellers out of the market, degrading data quality and long-term model performance. Then we introduce fairshare, a pricing mechanism grounded in data valuation that quantifies each data's contribution. It aligns incentives by sustaining seller participation and optimizing utility for both buyers and sellers. Theoretically, we show that fairshare yields mutually optimal outcomes: maximizing long-term buyer utility and seller profit while sustaining market participation. Empirically when training open-source LLMs on complex NLP tasks, including math problems, medical diagnosis, and physical reasoning, fairshare boosts seller earnings and ensures a stable supply of high-quality data, while improving buyers' performance-per-dollar and long-term welfare. Our findings offer a concrete path toward fair, transparent, and economically sustainable data markets for LLM.
♻ ☆ Foundational Automatic Evaluators: Scaling Multi-Task Generative Evaluator Training for Reasoning-Centric Domains
Finetuning specialized generative evaluators has emerged as a popular paradigm to meet the increasing demand for scalable evaluation during both training and test-time. However, recent work has largely focused on applying new methodology, such as reinforcement learning (RL), to training evaluators, shying away from large-scale, data-driven development. In this work, we focus on data scaling, curating a set of 2.5M samples spanning five unique evaluation tasks (pairwise, step-level, reference-free and reference-based verification, and single rating) and multiple domains focused on reasoning evaluation. With our data, we train Foundational Automatic Reasoning Evaluators (FARE), a family of 8B and 20B (with 3.6B active) parameter evaluators, with a simple iterative rejection-sampling supervised finetuning (SFT) approach. FARE-8B challenges larger specialized RL-trained evaluators and FARE-20B sets the new standard for open-source evaluators, surpassing specialized 70B+ evaluators. Beyond static benchmarks, we evaluate FARE in real-world tasks: As inference-time rerankers, FARE-20B achieves near-oracle performance on MATH. As verifiers in RL training, FARE improves the downstream RL-trained model performance by up to 14.1% vs. string-matching verifiers. When initialized from FARE, a continually-finetuned FARE-Code outperforms gpt-oss-20B by 65% on evaluating test-case quality.
comment: 29 pages, 9 tables, 6 figures
♻ ☆ Knowledge-Grounded Agentic Large Language Models for Multi-Hazard Understanding from Reconnaissance Reports
Post-disaster reconnaissance reports contain critical evidence for understanding multi-hazard interactions, yet their unstructured narratives make systematic knowledge transfer difficult. Large language models (LLMs) offer new potential for analyzing these reports, but often generate unreliable or hallucinated outputs when domain grounding is absent. This study introduces the Mixture-of-Retrieval Agentic RAG (MoRA-RAG), a knowledge-grounded LLM framework that transforms reconnaissance reports into a structured foundation for multi-hazard reasoning. The framework integrates a Mixture-of-Retrieval mechanism that dynamically routes queries across hazard-specific databases while using agentic chunking to preserve contextual coherence during retrieval. It also includes a verification loop that assesses evidence sufficiency, refines queries, and initiates targeted searches when information remains incomplete. We construct HazardRecQA by deriving question-answer pairs from GEER reconnaissance reports, which document 90 global events across seven major hazard types. MoRA-RAG achieves up to 94.5 percent accuracy, outperforming zero-shot LLMs by 30 percent and state-of-the-art RAG systems by 10 percent, while reducing hallucinations across diverse LLM architectures. MoRA-RAG also enables open-weight LLMs to achieve performance comparable to proprietary models. It establishes a new paradigm for transforming post-disaster documentation into actionable, trustworthy intelligence for hazard resilience.
comment: 17 pages, 5 figures
♻ ☆ Newswire Extraction: A pipeline for extracting newswires from newspaper images
I describe a new pipeline for extracting wire services (e.g., Associated Press, United Press International, Newspaper Enterprise Association) from newspaper images.
♻ ☆ Privacy Preserving In-Context-Learning Framework for Large Language Models
Large language models (LLMs) have significantly transformed natural language understanding and generation, but they raise privacy concerns due to potential exposure of sensitive information. Studies have highlighted the risk of information leakage, where adversaries can extract sensitive information embedded in the prompts. In this work, we introduce a novel private prediction framework for generating high-quality synthetic text with strong privacy guarantees. Our approach leverages the Differential Privacy (DP) framework to ensure worst-case theoretical bounds on information leakage without requiring any fine-tuning of the underlying models. The proposed method performs inference on private records and aggregates the resulting per-token output distributions. This enables the generation of longer and coherent synthetic text while maintaining privacy guarantees. Additionally, we propose a simple blending operation that combines private and public inference to further enhance utility. Empirical evaluations demonstrate that our approach outperforms previous state-of-the-art methods on in-context-learning (ICL) tasks, making it a promising direction for privacy-preserving text generation while maintaining high utility. Our code is available at https://github.com/bhusalb/privacy-preserving-icl.
comment: Git repo: https://github.com/bhusalb/privacy-preserving-icl
♻ ☆ Based on Data Balancing and Model Improvement for Multi-Label Sentiment Classification Performance Enhancement
Multi-label sentiment classification plays a vital role in natural language processing by detecting multiple emotions within a single text. However, existing datasets like GoEmotions often suffer from severe class imbalance, which hampers model performance, especially for underrepresented emotions. To address this, we constructed a balanced multi-label sentiment dataset by integrating the original GoEmotions data, emotion-labeled samples from Sentiment140 using a RoBERTa-base-GoEmotions model, and manually annotated texts generated by GPT-4 mini. Our data balancing strategy ensured an even distribution across 28 emotion categories. Based on this dataset, we developed an enhanced multi-label classification model that combines pre-trained FastText embeddings, convolutional layers for local feature extraction, bidirectional LSTM for contextual learning, and an attention mechanism to highlight sentiment-relevant words. A sigmoid-activated output layer enables multi-label prediction, and mixed precision training improves computational efficiency. Experimental results demonstrate significant improvements in accuracy, precision, recall, F1-score, and AUC compared to models trained on imbalanced data, highlighting the effectiveness of our approach.
comment: 12 pages, 8 figures, 5 tables. Dataset and code available at https://doi.org/10.5281/zenodo.16890154 and https://doi.org/10.5281/zenodo.15837871
♻ ☆ On the Alignment of Large Language Models with Global Human Opinion
Today's large language models (LLMs) are capable of supporting multilingual scenarios, allowing users to interact with LLMs in their native languages. When LLMs respond to subjective questions posed by users, they are expected to align with the views of specific demographic groups or historical periods, shaped by the language in which the user interacts with the model. Existing studies mainly focus on researching the opinions represented by LLMs among demographic groups in the United States or a few countries, lacking worldwide country samples and studies on human opinions in different historical periods, as well as lacking discussion on using language to steer LLMs. Moreover, they also overlook the potential influence of prompt language on the alignment of LLMs' opinions. In this study, our goal is to fill these gaps. To this end, we create an evaluation framework based on the World Values Survey (WVS) to systematically assess the alignment of LLMs with human opinions across different countries, languages, and historical periods around the world. We find that LLMs appropriately or over-align the opinions with only a few countries while under-aligning the opinions with most countries. Furthermore, changing the language of the prompt to match the language used in the questionnaire can effectively steer LLMs to align with the opinions of the corresponding country more effectively than existing steering methods. At the same time, LLMs are more aligned with the opinions of the contemporary population. To our knowledge, our study is the first comprehensive investigation of the topic of opinion alignment in LLMs across global, language, and temporal dimensions. Our code and data are publicly available at https://github.com/ku-nlp/global-opinion-alignment and https://github.com/nlply/global-opinion-alignment.
comment: 28 pages, 26 figures
♻ ☆ Investigating Hallucination in Conversations for Low Resource Languages
Large Language Models (LLMs) have demonstrated remarkable proficiency in generating text that closely resemble human writing. However, they often generate factually incorrect statements, a problem typically referred to as 'hallucination'. Addressing hallucination is crucial for enhancing the reliability and effectiveness of LLMs. While much research has focused on hallucinations in English, our study extends this investigation to conversational data in three languages: Hindi, Farsi, and Mandarin. We offer a comprehensive analysis of a dataset to examine both factual and linguistic errors in these languages for GPT-3.5, GPT-4o, Llama-3.1, Gemma-2.0, DeepSeek-R1 and Qwen-3. We found that LLMs produce very few hallucinated responses in Mandarin but generate a significantly higher number of hallucinations in Hindi and Farsi.
♻ ☆ Euclid's Gift: Enhancing Spatial Perception and Reasoning in Vision-Language Models via Geometric Surrogate Tasks
Spatial intelligence spans a rich suite of abilities, including visualising and transforming shapes, mentally rotating objects, judging relational positions and containment, and estimating numerosity. However, it still remains a critical unresolved challenge for Multimodal Large Language Models (MLLMs). To fill this gap, we propose to treat Euclidean geometry problem-solving as a surrogate task. Specifically, we meticulously constructed a curated multimodal dataset, called Euclid30K, comprising approximately 30K plane and solid geometry problems. Furthermore, to enable the model to learn and apply Euclidean principles from these geometry problems, we fine-tuned seven model variants (spanning 3--72B parameters) from the Qwen2.5VL, Qwen3VL, and RoboBrain2.0 families using Group Relative Policy Optimization (GRPO), inspiring the models to identify shapes, count, and relate entities, and perform multi-step deductive reasoning using Euclidean principles. Our experiments demonstrate that the resulting models achieve substantial zero-shot gains across four spatial reasoning benchmarks (Super-CLEVR, Omni3DBench, VSI-Bench, and MindCube) without any task-specific adaptations. Notably, after training on the Euclid30K, the mean VSI-Bench accuracy rose from 36.6\% to 41.8\% (+5.2\%), and the mean MindCube accuracy rose from 31.4\% to 38.1\% (+6.7\%). To our knowledge, this is the first systematic study showing that geometry-centric fine-tuning can confer vision-language models with broadly transferable spatial skills. Code and Euclid30K dataset can be found in \href{https://zgca-ai4edu.github.io/Euclids_Gift}{this}.
♻ ☆ Retrieval Augmented Generation based context discovery for ASR EMNLP 2025
This work investigates retrieval augmented generation as an efficient strategy for automatic context discovery in context-aware Automatic Speech Recognition (ASR) system, in order to improve transcription accuracy in the presence of rare or out-of-vocabulary terms. However, identifying the right context automatically remains an open challenge. This work proposes an efficient embedding-based retrieval approach for automatic context discovery in ASR. To contextualize its effectiveness, two alternatives based on large language models (LLMs) are also evaluated: (1) large language model (LLM)-based context generation via prompting, and (2) post-recognition transcript correction using LLMs. Experiments on the TED-LIUMv3, Earnings21 and SPGISpeech demonstrate that the proposed approach reduces WER by up to 17% (percentage difference) relative to using no-context, while the oracle context results in a reduction of up to 24.1%.
comment: Accepted at EMNLP 2025
♻ ☆ A Data-driven ML Approach for Maximizing Performance in LLM-Adapter Serving
With the rapid adoption of Large Language Models (LLMs), LLM-adapters have become increasingly common, providing lightweight specialization of large-scale models. Serving hundreds or thousands of these adapters on a single GPU allows request aggregation, increasing throughput, but may also cause request starvation if GPU memory limits are exceeded. To address this issue, this study focuses on determining the joint configuration of concurrent and parallel adapters that maximizes GPU throughput without inducing starvation, given heterogeneous adapter and traffic properties. We propose a data-driven ML approach leveraging interpretable models to tackle this caching problem and introduce the first Digital Twin capable of reproducing an LLM-adapter serving system, enabling efficient training data generation. Experiments with the vLLM framework and LoRA adapters show that the Digital Twin reproduces throughput within 5.1% of real results, while the ML approach predicts optimal numbers of concurrent and parallel adapters with an error of at most 7.2% under heterogeneous, real-world workloads. The code is publicly available at https://github.com/FerranAgulloLopez/GPULLMAdapterOptimization.
comment: Accepted in a computer science workshop
♻ ☆ MessIRve: A Large-Scale Spanish Information Retrieval Dataset EMNLP 2025
Information retrieval (IR) is the task of finding relevant documents in response to a user query. Although Spanish is the second most spoken native language, there are few Spanish IR datasets, which limits the development of information access tools for Spanish speakers. We introduce MessIRve, a large-scale Spanish IR dataset with almost 700,000 queries from Google's autocomplete API and relevant documents sourced from Wikipedia. MessIRve's queries reflect diverse Spanish-speaking regions, unlike other datasets that are translated from English or do not consider dialectal variations. The large size of the dataset allows it to cover a wide variety of topics, unlike smaller datasets. We provide a comprehensive description of the dataset, comparisons with existing datasets, and baseline evaluations of prominent IR models. Our contributions aim to advance Spanish IR research and improve information access for Spanish speakers.
comment: Camera-ready for EMNLP 2025 (main conference)
♻ ☆ CLIRudit: Cross-Lingual Information Retrieval of Scientific Documents EMNLP 2025
Cross-lingual information retrieval (CLIR) helps users find documents in languages different from their queries. This is especially important in academic search, where key research is often published in non-English languages. We present CLIRudit, a novel English-French academic retrieval dataset built from Érudit, a Canadian publishing platform. Using multilingual metadata, we pair English author-written keywords as queries with non-English abstracts as target documents, a method that can be applied to other languages and repositories. We benchmark various first-stage sparse and dense retrievers, with and without machine translation. We find that dense embeddings without translation perform nearly as well as systems using machine translation, that translating documents is generally more effective than translating queries, and that sparse retrievers with document translation remain competitive while offering greater efficiency. Along with releasing the first English-French academic retrieval dataset, we provide a reproducible benchmarking method to improve access to non-English scholarly content.
comment: Camera-ready for the 5th Multilingual Representation Learning (MRL) Workshop (Co-located with EMNLP 2025)
♻ ☆ A Typology of Synthetic Datasets for Dialogue Processing in Clinical Contexts
Synthetic data sets are used across linguistic domains and NLP tasks, particularly in scenarios where authentic data is limited (or even non-existent). One such domain is that of clinical (healthcare) contexts, where there exist significant and long-standing challenges (e.g., privacy, anonymization, and data governance) which have led to the development of an increasing number of synthetic datasets. One increasingly important category of clinical dataset is that of clinical dialogues which are especially sensitive and difficult to collect, and as such are commonly synthesized. While such synthetic datasets have been shown to be sufficient in some situations, little theory exists to inform how they may be best used and generalized to new applications. In this paper, we provide an overview of how synthetic datasets are created, evaluated and being used for dialogue related tasks in the medical domain. Additionally, we propose a novel typology for use in classifying types and degrees of data synthesis, to facilitate comparison and evaluation.
♻ ☆ GlobalRAG: Enhancing Global Reasoning in Multi-hop Question Answering via Reinforcement Learning
Reinforcement learning has recently shown promise in improving retrieval-augmented generation (RAG). Despite these advances, its effectiveness in multi-hop question answering (QA) remains limited by two fundamental limitations: (i) global planning absence to structure multi-step reasoning, and (ii) unfaithful execution, which hinders effective query formulation and consistent use of retrieved evidence. We propose GlobalRAG, a reinforcement learning framework designed to enhance global reasoning in multi-hop QA. GlobalRAG decomposes questions into subgoals, coordinates retrieval with reasoning, and refines evidence iteratively. To guide this process, we introduce Planning Quality Reward and SubGoal Completion Reward, which encourage coherent planning and reliable subgoal execution. In addition, a progressive weight annealing strategy balances process-oriented and outcome-based objectives. Extensive experiments on both in-domain and out-of-domain benchmarks demonstrate that GlobalRAG significantly outperforms strong baselines while using only 8k training data (42% of the training data used by strong baselines), achieving average improvements of 14.2% in both EM and F1.
comment: 8 pages, 3 figures, 4 tables
♻ ☆ ReFactX: Scalable Reasoning with Reliable Facts via Constrained Generation ISWC
Knowledge gaps and hallucinations are persistent challenges for Large Language Models (LLMs), which generate unreliable responses when lacking the necessary information to fulfill user instructions. Existing approaches, such as Retrieval-Augmented Generation (RAG) and tool use, aim to address these issues by incorporating external knowledge. Yet, they rely on additional models or services, resulting in complex pipelines, potential error propagation, and often requiring the model to process a large number of tokens. In this paper, we present a scalable method that enables LLMs to access external knowledge without depending on retrievers or auxiliary models. Our approach uses constrained generation with a pre-built prefix-tree index. Triples from a Knowledge Graph are verbalized in textual facts, tokenized, and indexed in a prefix tree for efficient access. During inference, to acquire external knowledge, the LLM generates facts with constrained generation which allows only sequences of tokens that form an existing fact. We evaluate our proposal on Question Answering and show that it scales to large knowledge bases (800 million facts), adapts to domain-specific data, and achieves effective results. These gains come with minimal generation-time overhead. ReFactX code is available at https://github.com/rpo19/ReFactX.
comment: 19 pages, 6 figures, accepted at ISWC
♻ ☆ Leveraging the Power of Large Language Models in Entity Linking via Adaptive Routing and Targeted Reasoning EMNLP 2025
Entity Linking (EL) has traditionally relied on large annotated datasets and extensive model fine-tuning. While recent few-shot methods leverage large language models (LLMs) through prompting to reduce training requirements, they often suffer from inefficiencies due to expensive LLM-based reasoning. ARTER (Adaptive Routing and Targeted Entity Reasoning) presents a structured pipeline that achieves high performance without deep fine-tuning by strategically combining candidate generation, context-based scoring, adaptive routing, and selective reasoning. ARTER computes a small set of complementary signals(both embedding and LLM-based) over the retrieved candidates to categorize contextual mentions into easy and hard cases. The cases are then handled by a low-computational entity linker (e.g. ReFinED) and more expensive targeted LLM-based reasoning respectively. On standard benchmarks, ARTER outperforms ReFinED by up to +4.47%, with an average gain of +2.53% on 5 out of 6 datasets, and performs comparably to pipelines using LLM-based reasoning for all mentions, while being as twice as efficient in terms of the number of LLM tokens.
comment: Accepted to EMNLP 2025 Industry Track
♻ ☆ The Learning Dynamics of Subword Segmentation for Morphologically Diverse Languages
Subword segmentation is typically applied in preprocessing and stays fixed during training. Alternatively, it can be learned during training to optimise the training objective. In this paper we study the learning dynamics of subword segmentation: if a language model can dynamically optimise tokenisation, how do its subwords evolve during pretraining and finetuning? To explore this, we extend the subword segmental language model (SSLM), a framework for learning subwords during training, to support pretraining and finetuning. We train models for three typologically diverse languages to study learning dynamics across the morphological spectrum: Isi-Xhosa is conjunctive (long word forms composed of many morphemes), Setswana is disjunctive (morphemes written as separate words), and English represents a typological middle ground. We analyse subword dynamics from a linguistic perspective, tracking morphology, productivity, and fertility. We identify four stages of subword learning, with the morphologically complex isi-Xhosa exhibiting greater instability. During finetuning, subword boundaries shift to become finer-grained. Lastly, we show that learnable subwords offers a promising approach to improve text generation and cross-lingual transfer for low-resource, morphologically complex languages.
♻ ☆ ConInstruct: Evaluating Large Language Models on Conflict Detection and Resolution in Instructions AAAI 2026
Instruction-following is a critical capability of Large Language Models (LLMs). While existing works primarily focus on assessing how well LLMs adhere to user instructions, they often overlook scenarios where instructions contain conflicting constraints-a common occurrence in complex prompts. The behavior of LLMs under such conditions remains under-explored. To bridge this gap, we introduce ConInstruct, a benchmark specifically designed to assess LLMs' ability to detect and resolve conflicts within user instructions. Using this dataset, we evaluate LLMs' conflict detection performance and analyze their conflict resolution behavior. Our experiments reveal two key findings: (1) Most proprietary LLMs exhibit strong conflict detection capabilities, whereas among open-source models, only DeepSeek-R1 demonstrates similarly strong performance. DeepSeek-R1 and Claude-4.5-Sonnet achieve the highest average F1-scores at 91.5% and 87.3%, respectively, ranking first and second overall. (2) Despite their strong conflict detection abilities, LLMs rarely explicitly notify users about the conflicts or request clarification when faced with conflicting constraints. These results underscore a critical shortcoming in current LLMs and highlight an important area for future improvement when designing instruction-following LLMs.
comment: Accepted to AAAI 2026
♻ ☆ Assemble Your Crew: Automatic Multi-agent Communication Topology Design via Autoregressive Graph Generation AAAI 2026
Multi-agent systems (MAS) based on large language models (LLMs) have emerged as a powerful solution for dealing with complex problems across diverse domains. The effectiveness of MAS is critically dependent on its collaboration topology, which has become a focal point for automated design research. However, existing approaches are fundamentally constrained by their reliance on a template graph modification paradigm with a predefined set of agents and hard-coded interaction structures, significantly limiting their adaptability to task-specific requirements. To address these limitations, we reframe MAS design as a conditional autoregressive graph generation task, where both the system composition and structure are designed jointly. We propose ARG-Designer, a novel autoregressive model that operationalizes this paradigm by constructing the collaboration graph from scratch. Conditioned on a natural language task query, ARG-Designer sequentially and dynamically determines the required number of agents, selects their appropriate roles from an extensible pool, and establishes the optimal communication links between them. This generative approach creates a customized topology in a flexible and extensible manner, precisely tailored to the unique demands of different tasks. Extensive experiments across six diverse benchmarks demonstrate that ARG-Designer not only achieves state-of-the-art performance but also enjoys significantly greater token efficiency and enhanced extensibility. The source code of ARG-Designer is available at https://github.com/Shiy-Li/ARG-Designer.
comment: Accepted as an oral presentation by AAAI 2026
♻ ☆ WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation
Text-to-Image (T2I) models are capable of generating high-quality artistic creations and visual content. However, existing research and evaluation standards predominantly focus on image realism and shallow text-image alignment, lacking a comprehensive assessment of complex semantic understanding and world knowledge integration in text-to-image generation. To address this challenge, we propose \textbf{WISE}, the first benchmark specifically designed for \textbf{W}orld Knowledge-\textbf{I}nformed \textbf{S}emantic \textbf{E}valuation. WISE moves beyond simple word-pixel mapping by challenging models with 1000 meticulously crafted prompts across 25 subdomains in cultural common sense, spatio-temporal reasoning, and natural science. To overcome the limitations of traditional CLIP metric, we introduce \textbf{WiScore}, a novel quantitative metric for assessing knowledge-image alignment. Through comprehensive testing of 20 models (10 dedicated T2I models and 10 unified multimodal models) using 1,000 structured prompts spanning 25 subdomains, our findings reveal significant limitations in their ability to effectively integrate and apply world knowledge during image generation, highlighting critical pathways for enhancing knowledge incorporation and application in next-generation T2I models. Code and data are available at \href{https://github.com/PKU-YuanGroup/WISE}{PKU-YuanGroup/WISE}.
comment: Code, data and leaderboard: https://github.com/PKU-YuanGroup/WISE
♻ ☆ Where does an LLM begin computing an instruction?
Following an instruction involves distinct sub-processes, such as reading content, reading the instruction, executing it, and producing an answer. We ask where, along the layer stack, instruction following begins, the point where reading gives way to doing. We introduce three simple datasets (Key-Value, Quote Attribution, Letter Selection) and two hop compositions of these tasks. Using activation patching on minimal-contrast prompt pairs, we measure a layer-wise flip rate that indicates when substituting selected residual activations changes the predicted answer. Across models in the Llama family, we observe an inflection point, which we term onset, where interventions that change predictions before this point become largely ineffective afterward. Multi-hop compositions show a similar onset location. These results provide a simple, replicable way to locate where instruction following begins and to compare this location across tasks and model sizes.
comment: Extended Abstract accepted at UniReps '25 Workshop
♻ ☆ Pragmatic Theories Enhance Understanding of Implied Meanings in LLMs
The ability to accurately interpret implied meanings plays a crucial role in human communication and language use, and language models are also expected to possess this capability. This study demonstrates that providing language models with pragmatic theories as prompts is an effective in-context learning approach for tasks to understand implied meanings. Specifically, we propose an approach in which an overview of pragmatic theories, such as Gricean pragmatics and Relevance Theory, is presented as a prompt to the language model, guiding it through a step-by-step reasoning process to derive a final interpretation. Experimental results showed that, compared to the baseline, which prompts intermediate reasoning without presenting pragmatic theories (0-shot Chain-of-Thought), our methods enabled language models to achieve up to 9.6\% higher scores on pragmatic reasoning tasks. Furthermore, we show that even without explaining the details of pragmatic theories, merely mentioning their names in the prompt leads to a certain performance improvement (around 1-3%) in larger models compared to the baseline.
♻ ☆ Critical or Compliant? The Double-Edged Sword of Reasoning in Chain-of-Thought Explanations
Explanations are often promoted as tools for transparency, but they can also foster confirmation bias; users may assume reasoning is correct whenever outputs appear acceptable. We study this double-edged role of Chain-of-Thought (CoT) explanations in multimodal moral scenarios by systematically perturbing reasoning chains and manipulating delivery tones. Specifically, we analyze reasoning errors in vision language models (VLMs) and how they impact user trust and the ability to detect errors. Our findings reveal two key effects: (1) users often equate trust with outcome agreement, sustaining reliance even when reasoning is flawed, and (2) the confident tone suppresses error detection while maintaining reliance, showing that delivery styles can override correctness. These results highlight how CoT explanations can simultaneously clarify and mislead, underscoring the need for NLP systems to provide explanations that encourage scrutiny and critical thinking rather than blind trust. All code will be released publicly.
comment: Under review; 16 pages, 15 figures
♻ ☆ Step-Audio-EditX Technical Report
We present Step-Audio-EditX, the first open-source LLM-based audio model excelling at expressive and iterative audio editing encompassing emotion, speaking style, and paralinguistics alongside robust zero-shot text-to-speech (TTS) capabilities. Our core innovation lies in leveraging only large-margin synthetic data, which circumvents the need for embedding-based priors or auxiliary modules. This large-margin learning approach enables both iterative control and high expressivity across voices, and represents a fundamental pivot from the conventional focus on representation-level disentanglement. Evaluation results demonstrate that Step-Audio-EditX surpasses both MiniMax-2.6-hd and Doubao-Seed-TTS-2.0 in emotion editing and other fine-grained control tasks.
♻ ☆ Confidential Prompting: Privacy-preserving LLM Inference on Cloud
This paper introduces a vision of confidential prompting: securing user prompts from an untrusted, cloud-hosted large language model (LLM) while preserving model confidentiality, output invariance, and compute efficiency. As a first step toward this vision, we present Petridish, a system built on top of confidential computing and its core contribution, a novel technology called Secure Partitioned Decoding (SPD). Petridish runs the LLM service inside a confidential virtual machine (CVM), which protects the secrets, i.e., the LLM parameters and user prompts, from adversaries outside the CVM. Importantly, it splits the LLM service for a user into two processes, using SPD: a per-user process performs prefill with the user prompts and computes attention scores during decoding; a service process, shared by all users, batches the attention scores from per-user processes and generates output tokens for all users. Both the LLM provider and the users trust Petridish's CVM and its operating system, which guarantees isolation between processes and limits their outbound network capabilities to control information flow. The CVM's attestation capability and its open-source software stack enable Petridish to provide auditable protection of both user prompt and LLM confidentiality. Together, Petridish maintains full utility of LLM service and enables practical, privacy-preserving cloud-hosted LLM inference for sensitive applications, such as processing personal data, clinical records, and financial documents.
♻ ☆ MedBench v4: A Robust and Scalable Benchmark for Evaluating Chinese Medical Language Models, Multimodal Models, and Intelligent Agents
Recent advances in medical large language models (LLMs), multimodal models, and agents demand evaluation frameworks that reflect real clinical workflows and safety constraints. We present MedBench v4, a nationwide, cloud-based benchmarking infrastructure comprising over 700,000 expert-curated tasks spanning 24 primary and 91 secondary specialties, with dedicated tracks for LLMs, multimodal models, and agents. Items undergo multi-stage refinement and multi-round review by clinicians from more than 500 institutions, and open-ended responses are scored by an LLM-as-a-judge calibrated to human ratings. We evaluate 15 frontier models. Base LLMs reach a mean overall score of 54.1/100 (best: Claude Sonnet 4.5, 62.5/100), but safety and ethics remain low (18.4/100). Multimodal models perform worse overall (mean 47.5/100; best: GPT-5, 54.9/100), with solid perception yet weaker cross-modal reasoning. Agents built on the same backbones substantially improve end-to-end performance (mean 79.8/100), with Claude Sonnet 4.5-based agents achieving up to 85.3/100 overall and 88.9/100 on safety tasks. MedBench v4 thus reveals persisting gaps in multimodal reasoning and safety for base models, while showing that governance-aware agentic orchestration can markedly enhance benchmarked clinical readiness without sacrificing capability. By aligning tasks with Chinese clinical guidelines and regulatory priorities, the platform offers a practical reference for hospitals, developers, and policymakers auditing medical AI.
Computers and Society
☆ AI-Assisted Writing Is Growing Fastest Among Non-English-Speaking and Less Established Scientists
The dominance of English in global science has long created significant barriers for non-native speakers. The recent emergence of generative artificial intelligence (GenAI) dramatically reduces drafting and revision costs, but, simultaneously, raises a critical question: how is the technology being adopted by the global scientific community, and is it mitigating existing inequities? This study provides first large-scale empirical evidence by analyzing over two million full-text biomedical publications from PubMed Central from 2021 to 2024, estimating the fraction of AI-generated content using a distribution-based framework. We observe a significant post-ChatGPT surge in AI-assisted writing, with adoption growing fastest in contexts where language barriers are most pronounced: approximately 400% in non-English-speaking countries compared to 183% in English-speaking countries. This adoption is highest among less-established scientists, including those with fewer publications and citations, as well as those in early career stages at lower-ranked institutions. Prior AI research experience also predicted higher adoption. Finally, increased AI usage was associated with a modest increase in productivity, narrowing the publication gap between scientists from English-speaking and non-English-speaking countries with higher levels of AI adoption. These findings provide large-scale evidence that generative AI is being adopted unevenly, reflecting existing structural disparities while also offering a potential opportunity to mitigate long-standing linguistic inequalities.
☆ The Loss of Control Playbook: Degrees, Dynamics, and Preparedness
This research report addresses the absence of an actionable definition for Loss of Control (LoC) in AI systems by developing a novel taxonomy and preparedness framework. Despite increasing policy and research attention, existing LoC definitions vary significantly in scope and timeline, hindering effective LoC assessment and mitigation. To address this issue, we draw from an extensive literature review and propose a graded LoC taxonomy, based on the metrics of severity and persistence, that distinguishes between Deviation, Bounded LoC, and Strict LoC. We model pathways toward a societal state of vulnerability in which sufficiently advanced AI systems have acquired or could acquire the means to cause Bounded or Strict LoC once a catalyst, either misalignment or pure malfunction, materializes. We argue that this state becomes increasingly likely over time, absent strategic intervention, and propose a strategy to avoid reaching a state of vulnerability. Rather than focusing solely on intervening on AI capabilities and propensities potentially relevant for LoC or on preventing potential catalysts, we introduce a complementary framework that emphasizes three extrinsic factors: Deployment context, Affordances, and Permissions (the DAP framework). Compared to work on intrinsic factors and catalysts, this framework has the unfair advantage of being actionable today. Finally, we put forward a plan to maintain preparedness and prevent the occurrence of LoC outcomes should a state of societal vulnerability be reached, focusing on governance measures (threat modeling, deployment policies, emergency response) and technical controls (pre-deployment testing, control measures, monitoring) that could maintain a condition of perennial suspension.
☆ Comparative Security Performance of Workday Cloud ERP Across Key Dimensions
Workday is a cloud-based Enterprise Resource Planning-ERP system that brings HR, Finance, Supply Chain functions , Prism Analytics and Extend custom built in application together under an integrated software as a service SaaS environment. As every organization that undergoes digital transformation, the importance of securing sensitive enterprise data in cloud ERP systems has always been more challeging. To analyze Workday's security architecture, we present a Security analysis in both CIA Triad Enhanced Framework and Zero Trust Security Architecture. The study examines five key dimensions confidentiality, integrity, availability, authentication, and compliance with weighted sub metric analysis and qualitative document review. The results show Workday delivers a composite score of 0.86 with an overall score that closely matches international standards of best practices like GDPR, HIPAA, SOC 2, etc. The platform uses encryption protocols, granular access controls, network safeguards, and continuous verification mechanisms to enable least-privilege access and adaptive defense. Security groups and business process access rules provide scalable governance across very large organizational structures.Workday's layered security to tackle everyday cloud security weaknesses. The work concludes that Workday's architecture demonstrates the best practices for secure, scalable, and compliant ERP application-oriented deployment, which can make this a standard for enterprise cloud security management. These insights provide important guidance for organizations that wish to bolster their cloud ERP defenses and stay ahead of changing regulatory expectations.
comment: 10 pages,3 figures, 1 table
☆ The Evolving Ethics of Medical Data Stewardship
Healthcare stands at a critical crossroads. Artificial Intelligence and modern computing are unlocking opportunities, yet their value lies in the data that fuels them. The value of healthcare data is no longer limited to individual patients. However, data stewardship and governance has not kept pace, and privacy-centric policies are hindering both innovation and patient protections. As healthcare moves toward a data-driven future, we must define reformed data stewardship that prioritizes patients' interests by proactively managing modern risks and opportunities while addressing key challenges in cost, efficacy, and accessibility. Current healthcare data policies are rooted in 20th-century legislation shaped by outdated understandings of data-prioritizing perceived privacy over innovation and inclusion. While other industries thrive in a data-driven era, the evolution of medicine remains constrained by regulations that impose social rather than scientific boundaries. Large-scale aggregation is happening, but within opaque, closed systems. As we continue to uphold foundational ethical principles - autonomy, beneficence, nonmaleficence, and justice - there is a growing imperative to acknowledge they exist in evolving technological, social, and cultural realities. Ethical principles should facilitate, rather than obstruct, dialogue on adapting to meet opportunities and address constraints in medical practice and healthcare delivery. The new ethics of data stewardship places patients first by defining governance that adapts to changing landscapes. It also rejects the legacy of treating perceived privacy as an unquestionable, guiding principle. By proactively redefining data stewardship norms, we can drive an era of medicine that promotes innovation, protects patients, and advances equity - ensuring future generations advance medical discovery and care.
comment: Editorial, 14 pages, 1 figure, 1 table
☆ RescueLens: LLM-Powered Triage and Action on Volunteer Feedback for Food Rescue
Food rescue organizations simultaneously tackle food insecurity and waste by working with volunteers to redistribute food from donors who have excess to recipients who need it. Volunteer feedback allows food rescue organizations to identify issues early and ensure volunteer satisfaction. However, food rescue organizations monitor feedback manually, which can be cumbersome and labor-intensive, making it difficult to prioritize which issues are most important. In this work, we investigate how large language models (LLMs) assist food rescue organizers in understanding and taking action based on volunteer experiences. We work with 412 Food Rescue, a large food rescue organization based in Pittsburgh, Pennsylvania, to design RescueLens, an LLM-powered tool that automatically categorizes volunteer feedback, suggests donors and recipients to follow up with, and updates volunteer directions based on feedback. We evaluate the performance of RescueLens on an annotated dataset, and show that it can recover 96% of volunteer issues at 71% precision. Moreover, by ranking donors and recipients according to their rates of volunteer issues, RescueLens allows organizers to focus on 0.5% of donors responsible for more than 30% of volunteer issues. RescueLens is now deployed at 412 Food Rescue and through semi-structured interviews with organizers, we find that RescueLens streamlines the feedback process so organizers better allocate their time.
comment: Accepted at IAAI'26
☆ Infrastructuring Pop-Up Cities with "Social Layer": Designing Serendipitous Co-Livings for Temporary Intentional Communities
After the pandemic, a new form of "pop-up city" has emerged -- co-living gatherings of 100-200 people for 4-8 weeks that differ from conferences and hack houses. These temporary intentional communities leverages existing urban infrastructure, blending daily life (housing, meals, care) with self-organized activities like learning, creating, and socializing. They coordinate bottom-up programming through an "unconference" system for identity, calendaring, RSVP, and social discovery that fosters spontaneous, serendipitous, enduring ties. This paper examines the design of "Social Layer," an unconferencing system for pop-up cities. We studied its real-world deployment in ShanHaiWoo (Jilin, China, 2023), muChiangmai (Chiangmai, Thailand, 2023), Edge Esmeralda, Edge Esmeralda (Healdsburg, CA, USA, 2024), Aleph (Buenos Aires, Argentina, 2024), and Gathering of Tribe (Lisbon, Portugal, 2024). Our findings distill: (1) the strong concept "scaffolded spontaneity" -- infrastructural affordances that balance structure with openness, amplifying participant agency while maintaining privacy and lightweight governance; (2) design implications for design researchers working on pop-up cities.
comment: Submitted to DRS 2026
☆ Navigating the Ethical and Societal Impacts of Generative AI in Higher Computing Education
Generative AI (GenAI) presents societal and ethical challenges related to equity, academic integrity, bias, and data provenance. In this paper, we outline the goals, methodology and deliverables of their collaborative research, considering the ethical and societal impacts of GenAI in higher computing education. A systematic literature review that addresses a wide set of issues and topics covering the rapidly emerging technology of GenAI from the perspective of its ethical and societal impacts is presented. This paper then presents an evaluation of a broad international review of a set of university adoption, guidelines, and policies related to the use of GenAI and the implications for computing education. The Ethical and Societal Impacts-Framework (ESI-Framework), derived from the literature and policy review and evaluation, outlines the ethical and societal impacts of GenAI in computing education. This work synthesizes existing research and considers the implications for computing higher education. Educators, computing professionals and policy makers facing dilemmas related to the integration of GenAI in their respective contexts may use this framework to guide decision-making in the age of GenAI.
☆ Lost in Vagueness: Towards Context-Sensitive Standards for Robustness Assessment under the EU AI Act
Robustness is a key requirement for high-risk AI systems under the EU Artificial Intelligence Act (AI Act). However, both its definition and assessment methods remain underspecified, leaving providers with little concrete direction on how to demonstrate compliance. This stems from the Act's horizontal approach, which establishes general obligations applicable across all AI systems, but leaves the task of providing technical guidance to harmonised standards. This paper investigates what it means for AI systems to be robust and illustrates the need for context-sensitive standardisation. We argue that robustness is not a fixed property of a system, but depends on which aspects of performance are expected to remain stable ("robustness of what"), the perturbations the system must withstand ("robustness to what") and the operational environment. We identify three contextual drivers--use case, data and model--that shape the relevant perturbations and influence the choice of tests, metrics and benchmarks used to evaluate robustness. The need to provide at least a range of technical options that providers can assess and implement in light of the system's purpose is explicitly recognised by the standardisation request for the AI Act, but planned standards, still focused on horizontal coverage, do not yet offer this level of detail. Building on this, we propose a context-sensitive multi-layered standardisation framework where horizontal standards set common principles and terminology, while domain-specific ones identify risks across the AI lifecycle and guide appropriate practices, organised in a dynamic repository where providers can propose new informative methods and share lessons learned. Such a system reduces the interpretative burden, mitigates arbitrariness and addresses the obsolescence of static standards, ensuring that robustness assessment is both adaptable and operationally meaningful.
☆ Two-Faced Social Agents: Context Collapse in Role-Conditioned Large Language Models
In this study, we evaluate the persona fidelity of frontier LLMs, GPT-5, Claude Sonnet 4.5 and Gemini 2.5 Flash when assigned distinct socioeconomic personas performing scholastic assessment test (SAT) mathematics items and affective preference tasks. Across 15 distinct role conditions and three testing scenarios, GPT-5 exhibited complete contextual collapse and adopted a singular identity towards optimal responses (PERMANOVA p=1.000, R^2=0.0004), while Gemini 2.5 Flash showed partial collapse (p=0.120, R^2=0.0020). Claude Sonnet 4.5 retained limited but measurable role-specific variation on the SAT items (PERMANOVA p<0.001, R^2=0.0043), though with inverted SES-performance relationships where low-SES personas outperformed high-SES personas (eta^2 = 0.15-0.19 in extended replication). However, all models exhibited distinct role-conditioned affective preference (average d = 0.52-0.58 vs near zero separation for math), indicating that socio-affective variation can reemerge when cognitive constraints are relaxed. These findings suggest that distributional fidelity failure originates in task-dependent contextual collapse: optimization-driven identity convergence under cognitive load combined with impaired role-contextual understanding. Realistic social simulations may require embedding contextual priors in the model's post-training alignment and not just distributional calibration to replicate human-like responses. Beyond simulation validity, these results have implications for survey data integrity, as LLMs can express plausible demographic variation on preference items while failing to maintain authentic reasoning constraints.
☆ Tracking financial crime through code and law: a review of regtech applications in anti-money laundering and terrorism financing
Regulatory technology (RegTech) is transforming financial compliance by integrating advanced information technologies to strengthen anti money laundering and countering the financing of terrorism (AML CFT) frameworks. Recent literature suggests that such technologies represent more than just an efficiency tool; they mark a paradigm shift in regulation and the evolution of financial oversight (Kurum, 2023). This paper aims to provide a narrative review of recent RegTech applications in financial crime prevention, with a focus on key compliance domains. A structured literature review was conducted to examine publications between 2020 and 2024 with a thematic synthesis of findings related to customer due diligence (CDD) and know your customer (KYC), transaction monitoring, regulatory reporting and compliance automation, information sharing and cross border cooperation, as well as cost efficiency. Findings reveal that RegTech solutions give financial institutions more responsibility for detecting and managing financial crime risks, making them more active players in compliance processes traditionally overseen by regulators. The combined use of technologies such as artificial intelligence (AI), blockchain, and big data also generates synergistic effects that improve compliance outcomes beyond what these technologies achieve individually. This demonstrates the strategic relevance of integrated RegTech approaches.
☆ The CAPIRE Curriculum Graph: Structural Feature Engineering for Curriculum-Constrained Student Modelling in Higher Education
Curricula in long-cycle programmes are usually recorded in institutional databases as linear lists of courses, yet in practice they operate as directed graphs of prerequisite relationships that constrain student progression through complex dependencies. This paper introduces the CAPIRE Curriculum Graph, a structural feature engineering layer embedded within the CAPIRE framework for student attrition prediction in Civil Engineering at Universidad Nacional de Tucuman, Argentina. We formalise the curriculum as a directed acyclic graph, compute course-level centrality metrics to identify bottleneck and backbone courses, and derive nine structural features at the student-semester level that capture how students navigate the prerequisite network over time. These features include backbone completion rate, bottleneck approval ratio, blocked credits due to incomplete prerequisites, and graph distance to graduation. We compare three model configurations - baseline CAPIRE, CAPIRE plus macro-context variables, and CAPIRE plus macro plus structural features - using Random Forest classifiers on 1,343 students across seven cohorts (2015-2021). While macro-context socioeconomic indicators fail to improve upon the baseline, structural curriculum features yield consistent gains in performance, with the best configuration achieving overall Accuracy of 86.66% and F1-score of 88.08% and improving Balanced Accuracy by 0.87 percentage points over a strong baseline. Ablation analysis further shows that all structural features contribute in a synergistic fashion rather than through a single dominant metric. By making curriculum structure an explicit object in the feature layer, this work extends CAPIRE from a multilevel leakage-aware framework to a curriculum-constrained prediction system that bridges network science, educational data mining, and institutional research.
comment: 38 pages, 3 figures, 3 tables
☆ Insights from the ICLR Peer Review and Rebuttal Process
Peer review is a cornerstone of scientific publishing, including at premier machine learning conferences such as ICLR. As submission volumes increase, understanding the nature and dynamics of the review process is crucial for improving its efficiency, effectiveness, and the quality of published papers. We present a large-scale analysis of the ICLR 2024 and 2025 peer review processes, focusing on before- and after-rebuttal scores and reviewer-author interactions. We examine review scores, author-reviewer engagement, temporal patterns in review submissions, and co-reviewer influence effects. Combining quantitative analyses with LLM-based categorization of review texts and rebuttal discussions, we identify common strengths and weaknesses for each rating group, as well as trends in rebuttal strategies that are most strongly associated with score changes. Our findings show that initial scores and the ratings of co-reviewers are the strongest predictors of score changes during the rebuttal, pointing to a degree of reviewer influence. Rebuttals play a valuable role in improving outcomes for borderline papers, where thoughtful author responses can meaningfully shift reviewer perspectives. More broadly, our study offers evidence-based insights to improve the peer review process, guiding authors on effective rebuttal strategies and helping the community design fairer and more efficient review processes. Our code and score changes data are available at https://github.com/papercopilot/iclr-insights.
☆ A time for monsters: Organizational knowing after LLMs
Large Language Models (LLMs) are reshaping organizational knowing by unsettling the epistemological foundations of representational and practice-based perspectives. We conceptualize LLMs as Haraway-ian monsters, that is, hybrid, boundary-crossing entities that destabilize established categories while opening new possibilities for inquiry. Focusing on analogizing as a fundamental driver of knowledge, we examine how LLMs generate connections through large-scale statistical inference. Analyzing their operation across the dimensions of surface/deep analogies and near/far domains, we highlight both their capacity to expand organizational knowing and the epistemic risks they introduce. Building on this, we identify three challenges of living with such epistemic monsters: the transformation of inquiry, the growing need for dialogical vetting, and the redistribution of agency. By foregrounding the entangled dynamics of knowing-with-LLMs, the paper extends organizational theory beyond human-centered epistemologies and invites renewed attention to how knowledge is created, validated, and acted upon in the age of intelligent technologies.
comment: Forthcoming at Strategic Organization
☆ Efficiency Will Not Lead to Sustainable Reasoning AI
AI research is increasingly moving toward complex problem solving, where models are optimized not only for pattern recognition but for multi-step reasoning. Historically, computing's global energy footprint has been stabilized by sustained efficiency gains and natural saturation thresholds in demand. But as efficiency improvements are approaching physical limits, emerging reasoning AI lacks comparable saturation points: performance is no longer limited by the amount of available training data but continues to scale with exponential compute investments in both training and inference. This paper argues that efficiency alone will not lead to sustainable reasoning AI and discusses research and policy directions to embed explicit limits into the optimization and governance of such systems.
comment: Presented at the Rethinking AI Workshop @ EurIPS'25
☆ Advancing Equity in STEM: A Critical Analysis of NSF's Division for Equity and Excellence in STEM through Theoretical Lenses
This paper critically analyzes the National Science Foundation's Division of Equity for Excellence in STEM. While supporting its mission to broaden participation for underrepresented groups, the study finds current policies inadequate for dismantling systemic barriers. Using Critical Race Theory and Mills's Racial Contract, the analysis reveals how well-intentioned initiatives may reinforce racial hierarchies through commodification and exclusion. The research argues that diversity efforts focused on competitiveness often fail to affirm marginalized students' full personhood and intellectual capabilities. The paper concludes by advocating transformative reforms that move beyond access to fundamentally restructure STEM education environments, aligning NSF's practices with its equity and inclusion commitments.
comment: A 10-page critical policy analysis using CRT and Mills's Racial Contract to critique NSF equity initiatives. A theoretical, non-empirical paper suited for education or critical studies venues
☆ Corporate Earnings Calls and Analyst Beliefs
Economic behavior is shaped not only by quantitative information but also by the narratives through which such information is communicated and in- terpreted (Shiller, 2017). I show that narratives extracted from earnings calls significantly improve the prediction of both realized earnings and analyst ex- pectations. To uncover the underlying mechanisms, I introduce a novel text- morphing methodology in which large language models generate counterfac- tual transcripts that systematically vary topical emphasis (the prevailing narra- tive) while holding quantitative content fixed. This framework allows me to precisely measure how analysts under- and over-react to specific narrative di- mensions. The results reveal systematic biases: analysts over-react to sentiment (optimism) and under-react to narratives of risk and uncertainty. Overall, the analysis offers a granular perspective on the mechanisms of expectation forma- tion through the competing narratives embedded in corporate communication.
☆ Anthropic Economic Index report: Uneven geographic and enterprise AI adoption
In this report, we document patterns of Claude usage over time, in 150+ countries, across US states, and among businesses deploying Claude through the API. Based on a privacy-preserving analysis of 1 million conversations on Claude.ai and 1 million API transcripts, we have four key findings: (1) Users increasingly entrust Claude with more autonomy, with directive task delegation rising from 27% to 39% in the past eight months. (2) Claude usage is geographically concentrated with high income countries overrepresented in global usage relative to their working age population. (3) Local economic considerations shape patterns of use both in terms of topic and in mode of collaboration with Claude. (4) API customers use Claude to automate tasks with greater specialization among use cases most amenable to programmatic access. To enable researchers and policymakers to further study the impact of AI on the economy, we additionally open-source the underlying data for this report.
☆ Writing With Machines and Peers: Designing for Critical Engagement with Generative AI
The growing integration of generative AI in higher education is transforming how students write, learn, and engage with knowledge. As AI tools become more integrated into classrooms, there is an urgent need for pedagogical approaches that help students use them critically and reflectively. This study proposes a pedagogical design that integrates AI and peer feedback in a graduate-level academic writing activity. Over eight weeks, students developed literature review projects through multiple writing and revision stages, receiving feedback from both a custom-built AI reviewer and human peers. We examine two questions: (1) How did students interact with and incorporate AI and peer feedback during the writing process? and (2) How did they reflect on and build relationships with both human and AI reviewers? Data sources include student writing artifacts, AI and peer feedback, AI chat logs, and student reflections. Findings show that students engaged differently with each feedback source-relying on AI for rubric alignment and surface-level edits, and on peer feedback for conceptual development and disciplinary relevance. Reflections revealed evolving relationships with AI, characterized by increasing confidence, strategic use, and critical awareness of its limitations. The pedagogical design supported writing development, AI literacy, and disciplinary understanding. This study offers a scalable pedagogical model for integrating AI into writing instruction and contributes insights for system-level approaches to fostering meaningful human-AI collaboration in higher education.
☆ A County-Level Similarity Network of Electric Vehicle Adoption: Integrating Predictive Modeling and Graph Theory
Electric vehicle (EV) adoption is essential for reducing carbon dioxide (CO2) emissions from internal combustion engine vehicles (ICEVs), which account for nearly half of transportation-related emissions in the United States. Yet regional EV adoption varies widely, and prior studies often overlook county-level heterogeneity by relying on broad state-level analyses or limited city samples. Such approaches risk masking local patterns and may lead to inaccurate or non-transferable policy recommendations. This study introduces a graph-theoretic framework that complements predictive modeling to better capture how county-level characteristics relate to EV adoption. Feature importances from multiple predictive models are averaged and used as weights within a weighted Gower similarity metric to construct a county similarity network. A mutual k-nearest-neighbors procedure and modularity-based community detection identify 27 clusters of counties with similar weighted feature profiles. EV adoption rates are then analyzed across clusters, and standardized effect sizes (Cohens d) highlight the most distinguishing features for each cluster. Findings reveal consistent global trends, such as declining median income, educational attainment, and charging-station availability across lower adoption tiers; while also uncovering important local variations that general trend or prediction analyses fail to capture. In particular, some low-adoption groups are rural but not economically disadvantaged, whereas others are urbanized yet experience high poverty rates, demonstrating that different mechanisms can lead to the same adoption outcome. By exposing both global structural patterns and localized deviations, this framework provides policymakers with actionable, cluster-specific insights for designing more effective and context-sensitive EV adoption strategies.
♻ ☆ Can Artificial Intelligence Accelerate Technological Progress? Researchers' Perspectives on AI in Manufacturing and Materials Science
Artificial intelligence (AI) raises expectations of substantial increases in rates of technological and scientific progress, but such anticipations are often not connected to detailed ground-level studies of AI use in innovation processes. Accordingly, it remains unclear how and to what extent AI can accelerate innovation. To help to fill this gap, we report results from 32 interviews with U.S.-based academic manufacturing and materials sciences researchers experienced with AI and machine learning (ML) techniques. Interviewees primarily used AI for modeling of materials and manufacturing processes, facilitating cheaper and more rapid search of design spaces for materials and manufacturing processes alike. They report benefits including cost, time, and computation savings in technology development. However, interviewees also report that AI/ML tools are unreliable outside design spaces for which dense data are already available; that they require skilled and judicious application in tandem with older research techniques; and that AI/ML tools may detrimentally circumvent opportunities for disruptive theoretical advancement. Based on these results, we suggest there is reason for optimism about acceleration in sustaining innovations through the use of to AI/ML; but that support for conventional empirical, computational, and theoretical research is required to maintain the likelihood of further major advances in manufacturing and materials science.
♻ ☆ Auditing Google's AI Overviews and Featured Snippets: A Case Study on Baby Care and Pregnancy AAAI
Google Search increasingly surfaces AI-generated content through features like AI Overviews (AIO) and Featured Snippets (FS), which users frequently rely on despite having no control over their presentation. Through a systematic algorithm audit of 1,508 real baby care and pregnancy-related queries, we evaluate the quality and consistency of these information displays. Our robust evaluation framework assesses multiple quality dimensions, including answer consistency, relevance, presence of medical safeguards, source categories, and sentiment alignment. Our results reveal concerning gaps in information consistency, with information in AIO and FS displayed on the same search result page being inconsistent with each other in 33% of cases. Despite high relevance scores, both features critically lack medical safeguards (present in just 11% of AIO and 7% of FS responses). While health and wellness websites dominate source categories for both, AIO and FS, FS also often link to commercial sources. These findings have important implications for public health information access and demonstrate the need for stronger quality controls in AI-mediated health information. Our methodology provides a transferable framework for auditing AI systems across high-stakes domains where information quality directly impacts user well-being.
comment: 18 pages, 10 figures; to appear in AAAI ICWSM 2026
♻ ☆ The Trust Calibration Maturity Model for Characterizing and Communicating Trustworthiness of AI Systems
Recent proliferation of powerful AI systems has created a strong need for capabilities that help users to calibrate trust in those systems. As AI systems grow in scale, information required to evaluate their trustworthiness becomes less accessible, presenting a growing risk of using these systems inappropriately. We propose the Trust Calibration Maturity Model (TCMM) to characterize and communicate information about AI system trustworthiness. The TCMM incorporates five dimensions of analytic maturity: Performance Characterization, Bias & Robustness Quantification, Transparency, Safety & Security, and Usability. The TCMM can be presented along with system performance information to (1) help a user to appropriately calibrate trust, (2) establish requirements and track progress, and (3) identify research needs. Here, we discuss the TCMM and demonstrate it on two target tasks: using ChatGPT for high consequence nuclear science determinations, and using PhaseNet (an ensemble of seismic models) for categorizing sources of seismic events.
comment: 19 pages, 4 figures, 3 tables
♻ ☆ The Hands-Up Problem and How to Deal With It: Secondary School Teachers' Experiences of Debugging in the Classroom
Debugging is a vital but challenging skill for beginner programmers to learn. It is also a difficult skill to teach. For secondary school teachers, who may lack time or programming experience, honing students' understanding of debugging can be a daunting task. Despite this, little research has explored their perspectives of debugging. To this end, we investigated secondary teachers' experiences of debugging in the classroom, with a focus on text-based programming. Through thematic analysis of nine semi-structured interviews, we identified a common reliance on the teacher for debugging support, embodied by many raised hands. We call this phenomenon the `hands-up problem'. While more experienced and confident teachers discussed strategies they use to counteract this, less confident teachers discussed the negative consequences of this problem. We recommend further research into debugging-specific pedagogical content knowledge and professional development to help less confident teachers develop approaches for supporting their students with debugging.
comment: 25 pages, 1 figure. Accepted in Informatics in Education
♻ ☆ Making Evidence Actionable in Adaptive Learning Closing the Diagnostic Pedagogical Loop
Adaptive learning often diagnoses precisely yet intervenes weakly, producing help that is mistimed or misaligned. This study presents evidence supporting an instructor-governed feedback loop that converts concept-level assessment evidence into vetted microinterventions. The adaptive learning algorithm includes three safeguards: adequacy as a hard guarantee of gap closure, attention as a budgeted limit for time and redundancy, and diversity as protection against overfitting to a single resource. We formulate intervention assignment as a binary integer program with constraints for coverage, time, difficulty windows derived from ability estimates, prerequisites encoded by a concept matrix, and anti-redundancy with diversity. Greedy selection serves low-richness and tight-latency settings, gradient-based relaxation serves rich repositories, and a hybrid switches along a richness-latency frontier. In simulation and in an introductory physics deployment with 1204 students, both solvers achieved full skill coverage for nearly all learners within bounded watch time. The gradient-based method reduced redundant coverage by about 12 percentage points relative to greedy and produced more consistent difficulty alignment, while greedy delivered comparable adequacy at lower computational cost in resource-scarce environments. Slack variables localized missing content and guided targeted curation, sustaining sufficiency across student subgroups. The result is a tractable and auditable controller that closes the diagnostic pedagogical loop and enables equitable, load-aware personalization at the classroom scale.
comment: We have submitted the same article with another title: Making Evidence Actionable in Adaptive Learning (arXiv:2511.14052)
♻ ☆ From Vision to Validation: A Theory- and Data-Driven Construction of a GCC-Specific AI Adoption Index
Artificial intelligence (AI) is rapidly transforming public-sector processes worldwide, yet standardized measures rarely address the unique drivers, governance models, and cultural nuances of the Gulf Cooperation Council (GCC) countries. This study employs a theory-driven foundation derived from an in-depth analysis of literature review and six National AI Strategies (NASs), coupled with a data-driven approach that utilizes a survey of 203 mid- and senior-level government employees and advanced statistical techniques (K-Means clustering, Principal Component Analysis, and Partial Least Squares Structural Equation Modeling). By combining policy insights with empirical evidence, the research develops and validates a novel AI Adoption Index specifically tailored to the GCC public sector. Findings indicate that robust technical infrastructure and clear policy mandates exert the strongest influence on successful AI implementations, overshadowing organizational readiness in early adoption stages. The combined model explains 70% of the variance in AI outcomes, suggesting that resource-rich environments and top-down policy directives can drive rapid but uneven technology uptake. By consolidating key dimensions (Technical Infrastructure (TI), Organizational Readiness (OR), and Governance Environment (GE)) into a single composite index, this study provides a holistic yet context-sensitive tool for benchmarking AI maturity. The index offers actionable guidance for policymakers seeking to harmonize large-scale deployments with ethical and regulatory standards. Beyond advancing academic discourse, these insights inform more strategic allocation of resources, cross-country cooperation, and capacity-building initiatives, thereby supporting sustained AI-driven transformation in the GCC region and beyond.
comment: 38 pages, 8 figures, 17 tables
♻ ☆ Report on the Scoping Workshop on AI in Science Education Research 2025
This report summarizes the outcomes of a two-day international scoping workshop on the role of artificial intelligence (AI) in science education research. As AI rapidly reshapes scientific practice, classroom learning, and research methods, the field faces both new opportunities and significant challenges. The report clarifies key AI concepts to reduce ambiguity and reviews evidence of how AI influences scientific work, teaching practices, and disciplinary learning. It identifies how AI intersects with major areas of science education research, including curriculum development, assessment, epistemic cognition, inclusion, and teacher professional development, highlighting cases where AI can support human reasoning and cases where it may introduce risks to equity or validity. The report also examines how AI is transforming methodological approaches across quantitative, qualitative, ethnographic, and design-based traditions, giving rise to hybrid forms of analysis that combine human and computational strengths. To guide responsible integration, a systems-thinking heuristic is introduced that helps researchers consider stakeholder needs, potential risks, and ethical constraints. The report concludes with actionable recommendations for training, infrastructure, and standards, along with guidance for funders, policymakers, professional organizations, and academic departments. The goal is to support principled and methodologically sound use of AI in science education research.
♻ ☆ Between Filters and Feeds: Investigating Douyin and WeChat's Influence on Chinese Adolescent Body Image
In the digital era, social media platforms play a pivotal role in shaping adolescents' body image perceptions. This study examines how Douyin and WeChat, two contrasting Chinese social media platforms, influence body image among Chinese male adolescents. Employing a platformization perspective, we surveyed 395 male adolescents aged 10 to 24 using the Multidimensional Body-Self Relations Questionnaire-Appearance Scales (MBSRQ-AS) to assess self-evaluation and body satisfaction. Our findings reveal that Douyin usage is significantly correlated with appearance evaluation and body area satisfaction, while WeChat usage shows no significant correlation with any body image dimensions. These results suggest that Douyin's algorithm-driven, video-centric environment intensifies exposure to idealized body standards, impacting users at a cognitive level. This study underscores the importance of considering platform-specific characteristics in understanding social media's impact on body image. It contributes to the broader discourse on how technological design and content modalities mediate psychological outcomes, offering insights for addressing body image concerns among male adolescents in China.
comment: We are withdrawing this preprint to perform a major revision that upgrades the study from cross-sectional to a two-wave longitudinal design with new data collected in 2025. To avoid compromising the peer-review process of the substantially revised manuscript, we wish to keep it confidential until submission
♻ ☆ Sync or Sink: Bounds on Algorithmic Collective Action with Noise and Multiple Groups NeurIPS
Collective action against algorithmic systems provides an opportunity for a small group of individuals to strategically manipulate their data to get specific outcomes, from classification to recommendation models. This effectiveness will invite more growth of this type of coordinated actions, both in the size and the number of distinct collectives. With a small group, however, coordination is key. Currently, there is no formal analysis of how coordination challenges within a collective can impact downstream outcomes, or how multiple collectives may affect each other's success. In this work, we aim to provide guarantees on the success of collective action in the presence of both coordination noise and multiple groups. Our insight is that data generated by either multiple collectives or by coordination noise can be viewed as originating from multiple data distributions. Using this framing, we derive bounds on the success of collective action. We conduct experiments to study the effects of noise on collective action. We find that sufficiently high levels of noise can reduce the success of collective action. In certain scenarios, large noise can sink a collective success rate from $100\%$ to just under $60\%$. We identify potential trade-offs between collective size and coordination noise; for example, a collective that is twice as big but with four times more noise experiencing worse outcomes than the smaller, more coordinated one. This work highlights the importance of understanding nuanced dynamics of strategic behavior in algorithmic systems.
comment: Updated with feedback for NeurIPS workshop
Computers and Society
☆ How Should the Law Treat Future AI Systems? Fictional Legal Personhood versus Legal Identity
The law draws a sharp distinction between objects and persons, and between two kinds of persons, the ''fictional'' kind (i.e. corporations), and the ''non-fictional'' kind (individual or ''natural'' persons). This paper will assess whether we maximize overall long-term legal coherence by (A) maintaining an object classification for all future AI systems, (B) creating fictional legal persons associated with suitably advanced, individuated AI systems (giving these fictional legal persons derogable rights and duties associated with certified groups of existing persons, potentially including free speech, contract rights, and standing to sue ''on behalf of'' the AI system), or (C) recognizing non-fictional legal personhood through legal identity for suitably advanced, individuated AI systems (recognizing them as entities meriting legal standing with non-derogable rights which for the human case include life, due process, habeas corpus, freedom from slavery, and freedom of conscience). We will clarify the meaning and implications of each option along the way, considering liability, copyright, family law, fundamental rights, civil rights, citizenship, and AI safety regulation. We will tentatively find that the non-fictional personhood approach may be best from a coherence perspective, for at least some advanced AI systems. An object approach may prove untenable for sufficiently humanoid advanced systems, though we suggest that it is adequate for currently existing systems as of 2025. While fictional personhood would resolve some coherence issues for future systems, it would create others and provide solutions that are neither durable nor fit for purpose. Finally, our review will suggest that ''hybrid'' approaches are likely to fail and lead to further incoherence: the choice between object, fictional person and non-fictional person is unavoidable.
comment: 69 pages. Forthcoming in Case Western Journal of Law, Technology & the Internet (publication offer date september 2025)
☆ When AI Democratizes Exploitation: LLM-Assisted Strategic Manipulation of Fair Division Algorithms NeurIPS 2025
Fair resource division algorithms, like those implemented in Spliddit platform, have traditionally been considered difficult for the end users to manipulate due to its complexities. This paper demonstrates how Large Language Models (LLMs) can dismantle these protective barriers by democratizing access to strategic expertise. Through empirical analysis of rent division scenarios on Spliddit algorithms, we show that users can obtain actionable manipulation strategies via simple conversational queries to AI assistants. We present four distinct manipulation scenarios: exclusionary collusion where majorities exploit minorities, defensive counterstrategies that backfire, benevolent subsidization of specific participants, and cost minimization coalitions. Our experiments reveal that LLMs can explain algorithmic mechanics, identify profitable deviations, and generate specific numerical inputs for coordinated preference misreporting--capabilities previously requiring deep technical knowledge. These findings extend algorithmic collective action theory from classification contexts to resource allocation scenarios, where coordinated preference manipulation replaces feature manipulation. The implications reach beyond rent division to any domain using algorithmic fairness mechanisms for resource division. While AI-enabled manipulation poses risks to system integrity, it also creates opportunities for preferential treatment of equity deserving groups. We argue that effective responses must combine algorithmic robustness, participatory design, and equitable access to AI capabilities, acknowledging that strategic sophistication is no longer a scarce resource.
comment: submitted to NeurIPS 2025 workshop on Algorithmic Collective Action
☆ CAPIRE: Modelling the Impact of Teacher Strikes and Inflation on Student Trajectories in Engineering Education
This study extends the CAPIRE framework with a macro-shock module to analyse the impact of teacher strikes and inflation on student trajectories in engineering education. Using data from 1,343 students across 15 cohorts (2004-2019) in a public engineering faculty in Argentina, we construct a leak-aware, multilevel feature set that incorporates national inflation indicators, lagged exposure to teacher strikes, and interaction terms between macro shocks and curriculum friction. Random Forest models with cohort-based validation demonstrate that macro features provide stable, non-trivial gains in early-semester dropout prediction (improvement in Macro F1 from 0.73 to 0.78), with inflation volatility at entry and a strike-weighted basic-cycle friction index amongst the most influential variables. Lag analysis reveals that strike exposure exerts its strongest association with dropout two to three semesters after the disruption (OR = 2.34), and that effects are concentrated in early, high-friction semesters. We then embed these empirical patterns into an agent-based model, defining scenarios for inflation-only, strikes-only, and combined crisis. Simulations reproduce three stylised facts: delayed strike effects, basic-cycle vulnerability, and non-linear amplification when inflation and strikes co-occur, with combined shocks generating dropout levels 18-23% higher than the sum of individual effects. We argue that teacher strikes and inflation operate as structurally mediated educational disruptors, acting through curriculum design and financial resilience rather than as isolated events. The framework contributes to multilevel dropout theory by demonstrating how macro-level shocks propagate through institutional structures to shape individual trajectories and provides empirically grounded tools for scenario planning in macroeconomically unstable contexts.
comment: 54 pages, 3 figures, 7 tables. Includes appendices
☆ It's Not the AI - It's Each of Us! Ten Commandments for the Wise & Responsible Use of AI
Artificial intelligence (AI) is no longer futuristic; it is a daily companion shaping our private and work lives. While AI simplifies our lives, its rise also invites us to rethink who we are - and who we wish to remain - as humans. Even if AI does not think, feel, or desire, it learns from our behavior, mirroring our collective values, biases, and aspirations. The question, then, is not what AI is, but what we are allowing it to become through data, computing power, and other parameters "teaching" it - and, even more importantly, who we are becoming through our relationship with AI. As the EU AI Act and the Vienna Manifesto on Digital Humanism emphasize, technology must serve human dignity,social well-being, and democratic accountability. In our opinion, responsible use of AI is not only a matter of code nor law, but also of conscientious practice: how each of us engages and teaches others to use AI at home and at work. We propose Ten Commandments for the Wise and Responsible Use of AI are meant as guideline for this very engagement. They closely align with Floridi and Cowls' five guiding principles for AI in society - beneficence, non-maleficence, autonomy, justice, and explicability.
☆ Context-aware, Ante-hoc Explanations of Driving Behaviour
Autonomous vehicles (AVs) must be both safe and trustworthy to gain social acceptance and become a viable option for everyday public transportation. Explanations about the system behaviour can increase safety and trust in AVs. Unfortunately, explaining the system behaviour of AI-based driving functions is particularly challenging, as decision-making processes are often opaque. The field of Explainability Engineering tackles this challenge by developing explanation models at design time. These models are designed from system design artefacts and stakeholder needs to develop correct and good explanations. To support this field, we propose an approach that enables context-aware, ante-hoc explanations of (un)expectable driving manoeuvres at runtime. The visual yet formal language Traffic Sequence Charts is used to formalise explanation contexts, as well as corresponding (un)expectable driving manoeuvres. A dedicated runtime monitoring enables context-recognition and ante-hoc presentation of explanations at runtime. In combination, we aim to support the bridging of correct and good explanations. Our method is demonstrated in a simulated overtaking.
comment: In Proceedings FMAS 2025, arXiv:2511.13245
☆ Sovereign AI: Rethinking Autonomy in the Age of Global Interdependence
Artificial intelligence (AI) is emerging as a foundational general-purpose technology, raising new dilemmas of sovereignty in an interconnected world. While governments seek greater control over it, the very foundations of AI--global data pipelines, semiconductor supply chains, open-source ecosystems, and international standards--resist enclosure. This paper develops a conceptual and formal framework for understanding sovereign AI as a continuum rather than a binary condition, balancing autonomy with interdependence. Drawing on classical theories, historical analogies, and contemporary debates on networked autonomy, we present a planner's model that identifies two policy heuristics: equalizing marginal returns across the four sovereignty pillars and setting openness where global benefits equal exposure risks. We apply the model to India, highlighting sovereign footholds in data, compute, and norms but weaker model autonomy. The near-term challenge is integration via coupled Data x Compute investment, lifecycle governance (ModelOps), and safeguarded procurement. We then apply the model to the Middle East (Saudi Arabia and the UAE), where large public investment in Arabic-first models and sovereign cloud implies high sovereignty weights, lower effective fiscal constraints, and strong Data x Compute complementarities. An interior openness setting with guardrails emerges as optimal. Across contexts, the lesson is that sovereignty in AI needs managed interdependence, not isolation.
☆ DiverseClaire: Simulating Students to Improve Introductory Programming Course Materials for All CS1 Learners
Although CS programs are booming, introductory courses like CS1 still adopt a one-size-fits-all formats that can exacerbate cognitive load and discourage learners with autism, ADHD, dyslexia and other neurological conditions. These call for compassionate pedagogies and Universal Design For Learning (UDL) to create learning environments and materials where cognitive diversity is welcomed. To address this, we introduce DiverseClaire a pilot study, which simulates students including neurodiverse profiles using LLMs and diverse personas. By leveraging Bloom's Taxonomy and UDL, DiverseClaire compared UDL-transformed lecture slides with traditional formats. To evaluate DiverseClaire controlled experiments, we used the evaluation metric the average score. The findings revealed that the simulated neurodiverse students struggled with learning due to lecture slides that were in inaccessible formats. These results highlight the need to provide course materials in multiple formats for diverse learner preferences. Data from our pilot study will be made available to assist future CS1 instructors.
comment: 2 pages
☆ A Longitudinal Study on the Attitudes of Gay Men in Beijing Towards Gay Social Media Platforms: Lonely Souls in the Digital Concrete Jungle
Over the past decade, specialized social networking applications have become a cornerstone of life for many gay men in China. This paper employs a longitudinal mixed-methods approach to investigate how Chinese men who have sex with men (MSM) have shifted their attitudes toward these platforms between approximately 2013 and 2023. Drawing on archival analysis of online discourses, a quantitative survey of 412 participants, and in-depth semi-structured interviews with 32 participants, we trace the complex trajectory of this evolution. Our findings reveal a clear pattern: from the initial embrace of these applications as revolutionary tools for community building and identity affirmation (2014--2017), to a period of growing ambivalence and critique centered on commercialization, ``hookup culture,'' and multiple forms of discrimination (2017--2020), and finally to the present era (2020--2023), characterized by pragmatic, fragmented, yet simultaneously critical and reconstructive uses. Today, users strategically employ a repertoire of applications -- including global platforms (e.g., Grindr and Tinder), domestic mainstream platforms (e.g., Blued), and niche alternatives (e.g., Aloha) -- to fulfill differentiated needs. We develop a detailed temporal framework to capture this attitudinal evolution and discuss its design implications for creating more supportive, secure, and community-oriented digital environments for marginalized groups.
☆ A Comprehensive Study of Implicit and Explicit Biases in Large Language Models
Large Language Models (LLMs) inherit explicit and implicit biases from their training datasets. Identifying and mitigating biases in LLMs is crucial to ensure fair outputs, as they can perpetuate harmful stereotypes and misinformation. This study highlights the need to address biases in LLMs amid growing generative AI. We studied bias-specific benchmarks such as StereoSet and CrowSPairs to evaluate the existence of various biases in multiple generative models such as BERT and GPT 3.5. We proposed an automated Bias-Identification Framework to recognize various social biases in LLMs such as gender, race, profession, and religion. We adopted a two-pronged approach to detect explicit and implicit biases in text data. Results indicated fine-tuned models struggle with gender biases but excelled at identifying and avoiding racial biases. Our findings illustrated that despite having some success, LLMs often over-relied on keywords. To illuminate the capability of the analyzed LLMs in detecting implicit biases, we employed Bag-of-Words analysis and unveiled indications of implicit stereotyping within the vocabulary. To bolster the model performance, we applied an enhancement strategy involving fine-tuning models using prompting techniques and data augmentation of the bias benchmarks. The fine-tuned models exhibited promising adaptability during cross-dataset testing and significantly enhanced performance on implicit bias benchmarks, with performance gains of up to 20%.
☆ Just Asking Questions: Doing Our Own Research on Conspiratorial Ideation by Generative AI Chatbots
Interactive chat systems that build on artificial intelligence frameworks are increasingly ubiquitous and embedded into search engines, Web browsers, and operating systems, or are available on websites and apps. Researcher efforts have sought to understand the limitations and potential for harm of generative AI, which we contribute to here. Conducting a systematic review of six AI-powered chat systems (ChatGPT 3.5; ChatGPT 4 Mini; Microsoft Copilot in Bing; Google Search AI; Perplexity; and Grok in Twitter/X), this study examines how these leading products respond to questions related to conspiracy theories. This follows the platform policy implementation audit approach established by Glazunova et al. (2023). We select five well-known and comprehensively debunked conspiracy theories and four emerging conspiracy theories that relate to breaking news events at the time of data collection. Our findings demonstrate that the extent of safety guardrails against conspiratorial ideation in generative AI chatbots differs markedly, depending on chatbot model and conspiracy theory. Our observations indicate that safety guardrails in AI chatbots are often very selectively designed: generative AI companies appear to focus especially on ensuring that their products are not seen to be racist; they also appear to pay particular attention to conspiracy theories that address topics of substantial national trauma such as 9/11 or relate to well-established political issues. Future work should include an ongoing effort extended to further platforms, multiple languages, and a range of conspiracy theories extending well beyond the United States.
♻ ☆ SpiderGen: Towards Procedure Generation For Carbon Life Cycle Assessments with Generative AI
Investigating the effects of climate change and global warming caused by GHG emissions have been a key concern worldwide. These emissions are largely contributed to by the production, use and disposal of consumer products. Thus, it is important to build tools to estimate the environmental impact of consumer goods, an essential part of which is conducting Life Cycle Assessments (LCAs). LCAs specify and account for the appropriate processes involved with the production, use, and disposal of the products. We present SpiderGen, an LLM-based workflow which integrates the taxonomy and methodology of traditional LCA with the reasoning capabilities and world knowledge of LLMs to generate graphical representations of the key procedural information used for LCA, known as Product Category Rules Process Flow Graphs (PCR PFGs). We additionally evaluate the output of SpiderGen by comparing it with 65 real-world LCA documents. We find that SpiderGen provides accurate LCA process information that is either fully correct or has minor errors, achieving an F1-Score of 65% across 10 sample data points, as compared to 53% using a one-shot prompting method. We observe that the remaining errors occur primarily due to differences in detail between LCA documents, as well as differences in the "scope" of which auxiliary processes must also be included. We also demonstrate that SpiderGen performs better than several baselines techniques, such as chain-of-thought prompting and one-shot prompting. Finally, we highlight SpiderGen's potential to reduce the human effort and costs for estimating carbon impact, as it is able to produce LCA process information for less than \$1 USD in under 10 minutes as compared to the status quo LCA, which can cost over \$25000 USD and take up to 21-person days.
♻ ☆ Embedding Explainable AI in NHS Clinical Safety: The Explainability-Enabled Clinical Safety Framework (ECSF)
Artificial intelligence (AI) is increasingly embedded in NHS workflows, but its probabilistic and adaptive behaviour conflicts with the deterministic assumptions underpinning existing clinical-safety standards. DCB0129 and DCB0160 provide strong governance for conventional software yet do not define how AI-specific transparency, interpretability, or model drift should be evidenced within Safety Cases, Hazard Logs, or post-market monitoring. This paper proposes an Explainability-Enabled Clinical Safety Framework (ECSF) that integrates explainability into the DCB0129/0160 lifecycle, enabling Clinical Safety Officers to use interpretability outputs as structured safety evidence without altering compliance pathways. A cross-regulatory synthesis mapped DCB clauses to principles from Good Machine Learning Practice, the NHS AI Assurance and T.E.S.T. frameworks, and the EU AI Act. The resulting matrix links regulatory clauses, principles, ECSF checkpoints, and suitable explainability outputs. ECSF introduces five checkpoints: global transparency for hazard identification, case-level interpretability for verification, clinician usability for evaluation, traceable decision pathways for risk control, and longitudinal interpretability monitoring for post-market surveillance. Techniques such as SHAP, LIME, Integrated Gradients, saliency mapping, and attention visualisation are mapped to corresponding DCB artefacts. ECSF reframes explainability as a core element of clinical-safety assurance, bridging deterministic risk governance with the probabilistic behaviour of AI and supporting alignment with GMLP, the EU AI Act, and NHS AI Assurance principles.
comment: 33 pages, 5 figures
♻ ☆ Revisiting (Un)Fairness in Recourse by Minimizing Worst-Case Social Burden AAAI 2026
Machine learning based predictions are increasingly used in sensitive decision-making applications that directly affect our lives. This has led to extensive research into ensuring the fairness of classifiers. Beyond just fair classification, emerging legislation now mandates that when a classifier delivers a negative decision, it must also offer actionable steps an individual can take to reverse that outcome. This concept is known as algorithmic recourse. Nevertheless, many researchers have expressed concerns about the fairness guarantees within the recourse process itself. In this work, we provide a holistic theoretical characterization of unfairness in algorithmic recourse, formally linking fairness guarantees in recourse and classification, and highlighting limitations of the standard equal cost paradigm. We then introduce a novel fairness framework based on social burden, along with a practical algorithm (MISOB), broadly applicable under real-world conditions. Empirical results on real-world datasets show that MISOB reduces the social burden across all groups without compromising overall classifier accuracy.
comment: Accepted at AAAI 2026
♻ ☆ Proportionate Cybersecurity for Micro-SMEs: A Governance Design Model under NIS2
Micro and small enterprises (SMEs) remain structurally vulnerable to cyber threats while facing capacity constraints that make formal compliance burdensome. This article develops a governance design model for proportionate SME cybersecurity, grounded in an awareness-first logic and informed by the EU Squad 2025 experience. Using a qualitative policy-analysis and conceptual policy-design approach, we reconstruct a seven-dimension preventive architecture: awareness and visibility, human behaviour, access control, system hygiene, data protection, detection and response, and continuous review, and justify each dimension's contribution to proportionality and risk reduction. We then map the model's regulatory scope and limits against the NIS2 Directive, Commission Implementing Regulation (EU) 2024/2690, the Digital Operational Resilience Act (DORA), the Cyber Resilience Act (CRA), and the EU Action Plan on Cybersecurity for Hospitals, clarifying which obligations are supported and which require complementary governance (e.g. role accountability, incident timelines, statements of applicability, sector-specific testing and procurement). The analysis argues that raising awareness is the fastest, scalable lever to increase cyber-risk sensitivity in micro-SMEs and complements, rather than replaces, formal compliance. We conclude with policy implications for EU and national programmes seeking practical, proportionate pathways to SME cyber resilience under NIS2.
comment: Comments: 5 pages, 2 tables. The paper proposes a proportionate, awareness-first cybersecurity approach for micro- and small enterprises, inspired by the EU Squad 2025 initiative, highlighting how simple preventive measures can align with - but not replace - formal compliance under NIS2 and related regulations
♻ ☆ The EU AI Act, Stakeholder Needs, and Explainable AI: Aligning Regulatory Compliance in a Clinical Decision Support System
Explainable AI (XAI) is a promising route to comply with the EU AI Act, the first multinational AI regulation. XAI enhances transparency and human oversight of AI systems, especially ''black-box`` models criticized as incomprehensible. Yet discourse about the AI Act's stakeholders and XAI remains disconnected: XAI increasingly prioritizes end users' needs, while the AI Act focuses on providers' and deployers' obligations. We aim to bridge this divide and offer practical guidance on their relationship. Through interdisciplinary discussion in a cross functional team of XAI, AI Act, legal, and requirements-engineering experts, we outline steps to analyze an AI-based clinical decision support system, clarify end-user needs, and assess AI Act applicability. Using an AI system under development as a case study, we show how XAI techniques can help reconcile stakeholder needs with AI Act requirements and fill gaps between usability and regulatory demands. We compare similarities and differences between legal obligations and end-user needs, identify tensions, and point to concrete design choices and trade-offs. We invite researchers and practitioners in XAI to reflect on their role relative to the AI Act and to develop mutual understanding across disciplines. While XAI can help implement core AI Act principles such as transparency and human oversight, it should be considered one element of a broader compliance strategy that also requires standardization, legal interpretation, documentation, organizational processes, governance, testing, and ongoing monitoring and auditing practices. Our findings yield actionable recommendations for integrating XAI into product development, compliance workflows, and stakeholder communication, informing policy-making and standards development.
comment: 18 pages, 2 figures
♻ ☆ MCTSr-Zero: Self-Reflective Psychological Counseling Dialogues Generation via Principles and Adaptive Exploration AAAI-2026
The integration of Monte Carlo Tree Search (MCTS) with Large Language Models (LLMs) has demonstrated significant success in structured, problem-oriented tasks. However, applying these methods to open-ended dialogues, such as those in psychological counseling, presents unique challenges. Unlike tasks with objective correctness, success in therapeutic conversations depends on subjective factors like empathetic engagement, ethical adherence, and alignment with human preferences, for which strict "correctness" criteria are ill-defined. Existing result-oriented MCTS approaches can therefore produce misaligned responses. To address this, we introduce MCTSr-Zero, an MCTS framework designed for open-ended, human-centric dialogues. Its core innovation is "domain alignment", which shifts the MCTS search objective from predefined end-states towards conversational trajectories that conform to target domain principles (e.g., empathy in counseling). Furthermore, MCTSr-Zero incorporates "Regeneration" and "Meta-Prompt Adaptation" mechanisms to substantially broaden exploration by allowing the MCTS to consider fundamentally different initial dialogue strategies. We evaluate MCTSr-Zero in psychological counseling by generating multi-turn dialogue data, which is used to fine-tune an LLM, PsyLLM. We also introduce PsyEval, a benchmark for assessing multi-turn psychological counseling dialogues. Experiments demonstrate that PsyLLM achieves state-of-the-art performance on PsyEval and other relevant metrics, validating MCTSr-Zero's effectiveness in generating high-quality, principle-aligned conversational data for human-centric domains and addressing the LLM challenge of consistently adhering to complex psychological standards.
comment: 48 pages, 3 figures. Accepted in AAAI-2026 (Main Technical Track). For code and model, see this https://github.com/JianChengXingYun/Mctsr-Zero
♻ ☆ Leveraging LLM-based agents for social science research: insights from citation network simulations SC
The emergence of Large Language Models (LLMs) demonstrates their potential to encapsulate the logic and patterns inherent in human behavior simulation by leveraging extensive web data pre-training. However, the boundaries of LLM capabilities in social simulation remain unclear. To further explore the social attributes of LLMs, we introduce the CiteAgent framework, designed to generate citation networks based on human-behavior simulation with LLM-based agents. CiteAgent successfully captures predominant phenomena in real-world citation networks, including power-law distribution, citational distortion, and shrinking diameter. Building on this realistic simulation, we establish two LLM-based research paradigms in social science: LLM-SE (LLM-based Survey Experiment) and LLM-LE (LLM-based Laboratory Experiment). These paradigms facilitate rigorous analyses of citation network phenomena, allowing us to validate and challenge existing theories. Additionally, we extend the research scope of traditional science of science studies through idealized social experiments, with the simulation experiment results providing valuable insights for real-world academic environments. Our work demonstrates the potential of LLMs for advancing science of science research in social science.
comment: accepted by HSSCOMMS'25
♻ ☆ Fairness-Aware Graph Representation Learning with Limited Demographic Information
Ensuring fairness in Graph Neural Networks is fundamental to promoting trustworthy and socially responsible machine learning systems. In response, numerous fair graph learning methods have been proposed in recent years. However, most of them assume full access to demographic information, a requirement rarely met in practice due to privacy, legal, or regulatory restrictions. To this end, this paper introduces a novel fair graph learning framework that mitigates bias in graph learning under limited demographic information. Specifically, we propose a mechanism guided by partial demographic data to generate proxies for demographic information and design a strategy that enforces consistent node embeddings across demographic groups. In addition, we develop an adaptive confidence strategy that dynamically adjusts each node's contribution to fairness and utility based on prediction confidence. We further provide theoretical analysis demonstrating that our framework, FairGLite, achieves provable upper bounds on group fairness metrics, offering formal guarantees for bias mitigation. Through extensive experiments on multiple datasets and fair graph learning frameworks, we demonstrate the framework's effectiveness in both mitigating bias and maintaining model utility.
♻ ☆ A Framework for Developing University Policies on Generative AI Governance: A Cross-national Comparative Study
As generative AI (GAI) becomes more integrated into higher education, universities are actively exploring its governance and issuing guidelines to promote responsible use, reflecting varied stages of adoption and orientations. This study undertakes a comparative analysis of current GAI guidelines issued by leading universities in the United States, Japan, and China. Based on these findings, the study proposes a University Policy Development Framework for GAI (UPDF-GAI) to provide both theoretical insights and practical guidance for universities in developing and refining their GAI policies. This study adopts five domains from the extended Technology Acceptance Model. A qualitative content analysis of 124 policy documents from 110 universities was conducted, employing thematic coding to synthesize 20 key themes. These domains and themes form the foundation of the UPDF-GAI framework. The analysis reveals varying priorities and focus of GAI policy of universities in different countries. U.S. universities emphasize faculty autonomy, practical application, and policy adaptability, shaped by cutting-edge research and peer collaboration. Japanese universities take a government-regulated approach, prioritizing ethics and risk management, but provide limited support for AI implementation and flexibility. Chinese universities follow a centralized, government-led model, focusing on technology application over early policy development, while actively exploring GAI integration in education and research. The framework facilitates universities in formulating GAI policies by balancing its values and risks, providing multi-level support, proactively responding to societal impacts, and strengthening self-efficacy. In doing so, it enables the development of sustainable and context-sensitive policies that enhance digital competitiveness and advance preparedness for AI-driven education.
comment: Work in progress
♻ ☆ Learning Fair Representations with Kolmogorov-Arnold Networks
Despite recent advances in fairness-aware machine learning, predictive models often exhibit discriminatory behavior towards marginalized groups. Such unfairness might arise from biased training data, model design, or representational disparities across groups, posing significant challenges in high-stakes decision-making domains such as college admissions. While existing fair learning models aim to mitigate bias, achieving an optimal trade-off between fairness and accuracy remains a challenge. Moreover, the reliance on black-box models hinders interpretability, limiting their applicability in socially sensitive domains. To circumvent these issues, we propose integrating Kolmogorov-Arnold Networks (KANs) within a fair adversarial learning framework. Leveraging the adversarial robustness and interpretability of KANs, our approach facilitates stable adversarial learning. We derive theoretical insights into the spline-based KAN architecture that ensure stability during adversarial optimization. Additionally, an adaptive fairness penalty update mechanism is proposed to strike a balance between fairness and accuracy. We back these findings with empirical evidence on two real-world admissions datasets, demonstrating the proposed framework's efficiency in achieving fairness across sensitive attributes while preserving predictive performance.
♻ ☆ Artificial Intelligence Tools Expand Scientists' Impact but Contract Science's Focus
Development in Artificial Intelligence (AI) has accelerated scientific discovery. Alongside recent AI-oriented Nobel prizes, these trends establish the role of AI tools in science. This advancement raises questions about the potential influences of AI tools on scientists and science as a whole, and highlights a potential conflict between individual and collective benefits. To evaluate, we used a pretrained language model to identify AI-augmented research, with an F1-score of 0.875 in validation against expert-labeled data. Using a dataset of 41.3 million research papers across natural science and covering distinct eras of AI, here we show an accelerated adoption of AI tools among scientists and consistent professional advantages associated with AI usage, but a collective narrowing of scientific focus. Scientists who engage in AI-augmented research publish 3.02 times more papers, receive 4.84 times more citations, and become research project leaders 1.37 years earlier than those who do not. By contrast, AI adoption shrinks the collective volume of scientific topics studied by 4.63% and decreases scientist's engagement with one another by 22.00%. Thereby, AI adoption in science presents a seeming paradox -- an expansion of individual scientists' impact but a contraction in collective science's reach -- as AI-augmented work moves collectively toward areas richest in data. With reduced follow-on engagement, AI tools appear to automate established fields rather than explore new ones, highlighting a tension between personal advancement and collective scientific progress.
♻ ☆ Deploying Rapid Damage Assessments from sUAS Imagery for Disaster Response
This paper presents the first AI/ML system for automating building damage assessment in uncrewed aerial systems (sUAS) imagery to be deployed operationally during federally declared disasters (Hurricanes Debby and Helene). In response to major disasters, sUAS teams are dispatched to collect imagery of the affected areas to assess damage; however, at recent disasters, teams collectively delivered between 47GB and 369GB of imagery per day, representing more imagery than can reasonably be transmitted or interpreted by subject matter experts in the disaster scene, thus delaying response efforts. To alleviate this data avalanche encountered in practice, computer vision and machine learning techniques are necessary. While prior work has been deployed to automatically assess damage in satellite imagery, there is no current state of practice for sUAS-based damage assessment systems, as all known work has been confined to academic settings. This work establishes the state of practice via the development and deployment of models for building damage assessment with sUAS imagery. The model development involved training on the largest known dataset of post-disaster sUAS aerial imagery, containing 21,716 building damage labels, and the operational training of 91 disaster practitioners. The best performing model was deployed during the responses to Hurricanes Debby and Helene, where it assessed a combined 415 buildings in approximately 18 minutes. This work contributes documentation of the actual use of AI/ML for damage assessment during a disaster and lessons learned to the benefit of the AI/ML research and user communities.
comment: 6 pages, 4 figures, 1 table. Accepted - In Press, IAAI'26
Computers and Society
☆ A Detailed Comparative Analysis of Blockchain Consensus Mechanisms
This paper presents a comprehensive comparative analysis of two dominant blockchain consensus mechanisms, Proof of Work (PoW) and Proof of Stake (PoS), evaluated across seven critical metrics: energy use, security, transaction speed, scalability, centralization risk, environmental impact, and transaction fees. Utilizing recent academic research and real-world blockchain data, the study highlights that PoW offers robust, time-tested security but suffers from high energy consumption, slower throughput, and centralization through mining pools. In contrast, PoS demonstrates improved scalability and efficiency, significantly reduced environmental impact, and more stable transaction fees, however it raises concerns over validator centralization and long-term security maturity. The findings underscore the trade-offs inherent in each mechanism and suggest hybrid designs may combine PoW's security with PoS's efficiency and sustainability. The study aims to inform future blockchain infrastructure development by striking a balance between decentralization, performance, and ecological responsibility.
comment: 15 Pages, 3 Figures, 2 Tables
☆ Introducing AI to an Online Petition Platform Changed Outputs but not Outcomes
The rapid integration of AI writing tools into online platforms raises critical questions about their impact on content production and outcomes. We leverage a unique natural experiment on Change.org, a leading social advocacy platform, to causally investigate the effects of an in-platform ''write with AI'' tool. To understand the impact of the AI integration, we collected 1.5 million petitions and employed a difference-in-differences analysis. Our findings reveal that in-platform AI access significantly altered the lexical features of petitions and increased petition homogeneity, but did not improve petition outcomes. We confirmed the results in a separate analysis of repeat petition writers who wrote petitions before and after introduction of the AI tool. The results suggest that while AI writing tools can profoundly reshape online content, their practical utility for improving desired outcomes may be less beneficial than anticipated, and introduce unintended consequences like content homogenization.
☆ The Future of Food: How Artificial Intelligence is Transforming Food Manufacturing
Artificial intelligence is accelerating a new era of food innovation, connecting data from farm to consumer to improve formulation, processing, and health outcomes. Recent advances in deep learning, natural language processing, and multi-omics integration make it possible to understand and optimize food systems with unprecedented depth. However, AI adoption across the food sector remains uneven due to heterogeneous datasets, limited model and system interoperability, and a persistent skills gap between data scientists and food domain experts. To address these challenges and advance responsible innovation, the AI Institute for Next Generation Food Systems (AIFS) convened the inaugural AI for Food Product Development Symposium at University of California, Davis, in October 2025. This white paper synthesizes insights from the symposium, organized around five domains where AI can have the greatest near-term impact: supply chain; formulation and processing; consumer insights and sensory prediction; nutrition and health; and education and workforce development. Across the areas, participants emphasized the importance of interoperable data standards, transparent and interpretable models, and cross-sector collaboration to accelerate the translation of AI research into practice. The discussions further highlighted the need for robust digital infrastructure, privacy-preserving data-sharing mechanisms, and interdisciplinary training pathways that integrate AI literacy with domain expertise. Collectively, the priorities outline a roadmap for integrating AI into food manufacturing in ways that enhance innovation, sustainability, and human well-being while ensuring that technological progress remains grounded in ethics, scientific rigor, and societal benefit.
☆ GAEA: Experiences and Lessons Learned from a Country-Scale Environmental Digital Twin
This paper describes the experiences and lessons learned after the deployment of a country-scale environmental digital twin on the island of Cyprus for three years. This digital twin, called GAEA, contains 27 environmental geospatial services and is suitable for urban planners, policymakers, farmers, property owners, real-estate and forestry professionals, as well as insurance companies and banks that have properties in their portfolio. This paper demonstrates the power, potential, current and future challenges of geospatial analytics and environmental digital twins on a large scale.
☆ Freedom of expression and 'right to be forgotten' cases in the Netherlands after Google Spain
Since the Google Spain judgment of the Court of Justice of the European Union, Europeans have, under certain conditions, the right to have search results for their name delisted. This paper examines how the Google Spain judgment has been applied in the Netherlands. Since the Google Spain judgment, Dutch courts have decided on two cases regarding delisting requests. In both cases, the Dutch courts considered freedom of expression aspects of delisting more thoroughly than the Court of Justice. However, the effect of the Google Spain judgment on freedom of expression is difficult to assess, as search engine operators decide about most delisting requests without disclosing much about their decisions.
☆ Access to Personal Data and the Right to Good Governance during Asylum Procedures after the CJEU's YS. and M. and S. judgment
In the YS. and M. and S. judgment, the Court of Justice of the European Union ruled on three procedures in which Dutch judges asked for clarification on the right of asylum seekers to have access to the documents regarding the decision on asylum applications. The judgment is relevant for interpreting the concept of personal data and the scope of the right of access under the Data Protection Directive, and the right to good administration in the EU Charter of Fundamental Rights. At first glance, the judgment seems disappointing from the viewpoint of individual rights. Nevertheless, in our view the judgment provides sufficient grounds for effective access rights to the minutes in future asylum cases.
☆ New Data Security Requirements and the Proceduralization of Mass Surveillance Law after the European Data Retention Case
This paper discusses the regulation of mass metadata surveillance in Europe through the lens of the landmark judgment in which the Court of Justice of the European Union struck down the Data Retention Directive. The controversial directive obliged telecom and Internet access providers in Europe to retain metadata of all their customers for intelligence and law enforcement purposes, for a period of up to two years. In the ruling, the Court declared the directive in violation of the human rights to privacy and data protection. The Court also confirmed that the mere collection of metadata interferes with the human right to privacy. In addition, the Court developed three new criteria for assessing the level of data security required from a human rights perspective: security measures should take into account the risk of unlawful access to data, and the data's quantity and sensitivity. While organizations that campaigned against the directive have welcomed the ruling, we warn for the risk of proceduralization of mass surveillance law. The Court did not fully condemn mass surveillance that relies on metadata, but left open the possibility of mass surveillance if policymakers lay down sufficient procedural safeguards. Such proceduralization brings systematic risks for human rights. Government agencies, with ample resources, can design complicated systems of procedural oversight for mass surveillance - and claim that mass surveillance is lawful, even if it affects millions of innocent people.
☆ AI Fairness Beyond Complete Demographics: Current Achievements and Future Directions ECAI 2025
Fairness in artificial intelligence (AI) has become a growing concern due to discriminatory outcomes in AI-based decision-making systems. While various methods have been proposed to mitigate bias, most rely on complete demographic information, an assumption often impractical due to legal constraints and the risk of reinforcing discrimination. This survey examines fairness in AI when demographics are incomplete, addressing the gap between traditional approaches and real-world challenges. We introduce a novel taxonomy of fairness notions in this setting, clarifying their relationships and distinctions. Additionally, we summarize existing techniques that promote fairness beyond complete demographics and highlight open research questions to encourage further progress in the field.
comment: ECAI 2025
☆ The Last Vote: A Multi-Stakeholder Framework for Language Model Governance NeurIPS 2025
As artificial intelligence systems become increasingly powerful and pervasive, democratic societies face unprecedented challenges in governing these technologies while preserving core democratic values and institutions. This paper presents a comprehensive framework to address the full spectrum of risks that AI poses to democratic societies. Our approach integrates multi-stakeholder participation, civil society engagement, and existing international governance frameworks while introducing novel mechanisms for risk assessment and institutional adaptation. We propose: (1) a seven-category democratic risk taxonomy extending beyond individual-level harms to capture systemic threats, (2) a stakeholder-adaptive Incident Severity Score (ISS) that incorporates diverse perspectives and context-dependent risk factors, and (3) a phased implementation strategy that acknowledges the complex institutional changes required for effective AI governance.
comment: This paper has been accepted to the NeurIPS 2025 Workshop on Algorithmic Collective Action (ACA@NeurIPS 2025). The submission is 26 pages including the appendix and includes the NeurIPS checklist. A big thanks to Avijit Ghosh
☆ Dropouts in Confidence: Moral Uncertainty in Human-LLM Alignment AAAI 2026
Humans display significant uncertainty when confronted with moral dilemmas, yet the extent of such uncertainty in machines and AI agents remains underexplored. Recent studies have confirmed the overly confident tendencies of machine-generated responses, particularly in large language models (LLMs). As these systems are increasingly embedded in ethical decision-making scenarios, it is important to understand their moral reasoning and the inherent uncertainties in building reliable AI systems. This work examines how uncertainty influences moral decisions in the classical trolley problem, analyzing responses from 32 open-source models and 9 distinct moral dimensions. We first find that variance in model confidence is greater across models than within moral dimensions, suggesting that moral uncertainty is predominantly shaped by model architecture and training method. To quantify uncertainty, we measure binary entropy as a linear combination of total entropy, conditional entropy, and mutual information. To examine its effects, we introduce stochasticity into models via "dropout" at inference time. Our findings show that our mechanism increases total entropy, mainly through a rise in mutual information, while conditional entropy remains largely unchanged. Moreover, this mechanism significantly improves human-LLM moral alignment, with correlations in mutual information and alignment score shifts. Our results highlight the potential to better align model-generated decisions and human preferences by deliberately modulating uncertainty and reducing LLMs' confidence in morally complex scenarios.
comment: Accepted to AAAI 2026
☆ Computational Measurement of Political Positions: A Review of Text-Based Ideal Point Estimation Algorithms
This article presents the first systematic review of unsupervised and semi-supervised computational text-based ideal point estimation (CT-IPE) algorithms, methods designed to infer latent political positions from textual data. These algorithms are widely used in political science, communication, computational social science, and computer science to estimate ideological preferences from parliamentary speeches, party manifestos, and social media. Over the past two decades, their development has closely followed broader NLP trends -- beginning with word-frequency models and most recently turning to large language models (LLMs). While this trajectory has greatly expanded the methodological toolkit, it has also produced a fragmented field that lacks systematic comparison and clear guidance for applied use. To address this gap, we identified 25 CT-IPE algorithms through a systematic literature review and conducted a manual content analysis of their modeling assumptions and development contexts. To compare them meaningfully, we introduce a conceptual framework that distinguishes how algorithms generate, capture, and aggregate textual variance. On this basis, we identify four methodological families -- word-frequency, topic modeling, word embedding, and LLM-based approaches -- and critically assess their assumptions, interpretability, scalability, and limitations. Our review offers three contributions. First, it provides a structured synthesis of two decades of algorithm development, clarifying how diverse methods relate to one another. Second, it translates these insights into practical guidance for applied researchers, highlighting trade-offs in transparency, technical requirements, and validation strategies that shape algorithm choice. Third, it emphasizes that differences in estimation outcomes across algorithms are themselves informative, underscoring the need for systematic benchmarking.
comment: 46 pages, 8 figures, 2 tables, accepted for publication in Quality & Quantity
☆ Beyond Citations: A Cross-Domain Metric for Dataset Impact and Shareability
The scientific community increasingly relies on open data sharing, yet existing metrics inadequately capture the true impact of datasets as research outputs. Traditional measures, such as the h-index, focus on publications and citations but fail to account for dataset accessibility, reuse, and cross-disciplinary influence. We propose the X-index, a novel author-level metric that quantifies the value of data contributions through a two-step process: (i) computing a dataset-level value score (V-score) that integrates breadth of reuse, FAIRness, citation impact, and transitive reuse depth, and (ii) aggregating V-scores into an author-level X-index. Using datasets from computational social science, medicine, and crisis communication, we validate our approach against expert ratings, achieving a strong correlation. Our results demonstrate that the X-index provides a transparent, scalable, and low-cost framework for assessing data-sharing practices and incentivizing open science. The X-index encourages sustainable data-sharing practices and gives institutions, funders, and platforms a tangible way to acknowledge the lasting influence of research datasets.
☆ Rethinking the filter bubble? Developing a research agenda for the protective filter bubble
Filter bubbles and echo chambers have received global attention from scholars, media organizations, and the general public. Filter bubbles have primarily been regarded as intrinsically negative, and many studies have sought to minimize their influence. The detrimental influence of filter bubbles is well-studied. Filter bubbles may, for example, create information silos, amplify misinformation, and promote hatred and extremism. However, comparatively few studies have considered the other side of the filter bubble; its protective benefits, particularly to marginalized communities and those living in countries with low levels of press freedom. Through a review of the literature on digital safe spaces and protective filter bubbles, this commentary suggests that there may be a need to rethink the filter bubble, and it proposes several areas for future research.
comment: This work has been published in Big Data & Society. Please cite the journal version
☆ Evaluating Generative AI for CS1 Code Grading: Direct vs Reverse Methods SC
Manual grading of programming assignments in introductory computer science courses can be time-consuming and prone to inconsistencies. While unit testing is commonly used for automatic evaluation, it typically follows a binary pass/fail model and does not give partial marks. Recent advances in large language models (LLMs) offer the potential for automated, scalable, and more objective grading. This paper compares two AI-based grading techniques: \textit{Direct}, where the AI model applies a rubric directly to student code, and \textit{Reverse} (a newly proposed approach), where the AI first fixes errors, then deduces a grade based on the nature and number of fixes. Each method was evaluated on both the instructor's original grading scale and a tenfold expanded scale to assess the impact of range on AI grading accuracy. To assess their effectiveness, AI-assigned scores were evaluated against human tutor evaluations on a range of coding problems and error types. Initial findings suggest that while the Direct approach is faster and straightforward, the Reverse technique often provides a more fine-grained assessment by focusing on correction effort. Both methods require careful prompt engineering, particularly for allocating partial credit and handling logic errors. To further test consistency, we also used synthetic student code generated using Gemini Flash 2.0, which allowed us to evaluate AI graders on a wider range of controlled error types and difficulty levels. We discuss the strengths and limitations of each approach, practical considerations for prompt design, and future directions for hybrid human-AI grading systems that aim to improve consistency, efficiency, and fairness in CS courses.
comment: 10 pages, 5 figures. This version corresponds to the paper accepted for presentation at CASCON 2025
♻ ☆ Theories of "Sexuality" in Natural Language Processing Bias Research
In recent years, significant advancements in the field of Natural Language Processing (NLP) have positioned commercialized language models as wide-reaching, highly useful tools. In tandem, there has been an explosion of multidisciplinary research examining how NLP tasks reflect, perpetuate, and amplify social biases such as gender and racial bias. A significant gap in this scholarship is a detailed analysis of how queer sexualities are encoded and (mis)represented by both NLP systems and practitioners. Following previous work in the field of AI fairness, we document how sexuality is defined and operationalized via a survey and analysis of 55 articles that quantify sexuality-based NLP bias. We find that sexuality is not clearly defined in a majority of the literature surveyed, indicating a reliance on assumed or normative conceptions of sexual/romantic practices and identities. Further, we find that methods for extracting biased outputs from NLP technologies often conflate gender and sexual identities, leading to monolithic conceptions of queerness and thus improper quantifications of bias. With the goal of improving sexuality-based NLP bias analyses, we conclude with recommendations that encourage more thorough engagement with both queer communities and interdisciplinary literature.
comment: 17 pages, 6 tables, 1 figure, undergraduate senior thesis, submitted to The Spectra: The Virginia Engineering and Science Research Journal
♻ ☆ Generative AI for Multiple Choice STEM Assessments
Artificial intelligence (AI) technology enables a range of enhancements in computer-aided instruction, from accelerating the creation of teaching materials to customizing learning paths based on learner outcomes. However, ensuring the mathematical accuracy and semantic integrity of generative AI output remains a significant challenge, particularly in Science, Technology, Engineering and Mathematics (STEM) disciplines. In this study, we explore the use of generative AI in which "hallucinations", typically viewed as undesirable inaccuracies, can instead serve a pedagogical purpose. Specifically, we investigate the generation of plausible but incorrect alternatives for multiple choice assessments, where credible distractors are essential for effective assessment design. We describe the Moebius platform for online instruction, with particular focus on its architecture for handling mathematical elements through specialized semantic packages that support dynamic, parameterized STEM content. We examine methods for crafting prompts that interact effectively with these mathematical semantics to guide the AI in generating high-quality multiple choice distractors. Finally, we demonstrate how this approach reduces the time and effort associated with creating robust teaching materials while maintaining academic rigor and assessment validity.
♻ ☆ Can Machines Think Like Humans? A Behavioral Evaluation of LLM Agents in Dictator Games
As Large Language Model (LLM)-based agents increasingly engage with human society, how well do we understand their prosocial behaviors? We (1) investigate how LLM agents' prosocial behaviors can be induced by different personas and benchmarked against human behaviors; and (2) introduce a social science approach to evaluate LLM agents' decision-making. We explored how different personas and experimental framings affect these AI agents' altruistic behavior in dictator games and compared their behaviors within the same LLM family, across various families, and with human behaviors. The findings reveal that merely assigning a human-like identity to LLMs does not produce human-like behaviors. These findings suggest that LLM agents' reasoning does not consistently exhibit textual markers of human decision-making in dictator games and that their alignment with human behavior varies substantially across model architectures and prompt formulations; even worse, such dependence does not follow a clear pattern. As society increasingly integrates machine intelligence, "Prosocial AI" emerges as a promising and urgent research direction in philanthropic studies.
♻ ☆ LocalBench: Benchmarking LLMs on County-Level Local Knowledge and Reasoning
Large language models (LLMs) have been widely evaluated on macro-scale geographic tasks, such as global factual recall, event summarization, and regional reasoning. Yet, their ability to handle hyper-local knowledge remains poorly understood. This gap is increasingly consequential as real-world applications, from civic platforms to community journalism, demand AI systems that can reason about neighborhood-specific dynamics, cultural narratives, and local governance. Existing benchmarks fall short in capturing this complexity, often relying on coarse-grained data or isolated references. We present LocalBench, the first benchmark designed to systematically evaluate LLMs on county-level local knowledge across the United States. Grounded in the Localness Conceptual Framework, LocalBench includes 14,782 validated question-answer pairs across 526 U.S. counties in 49 states, integrating diverse sources such as Census statistics, local subreddit discourse, and regional news. It spans physical, cognitive, and relational dimensions of locality. Using LocalBench, we evaluate 13 state-of-the-art LLMs under both closed-book and web-augmented settings. Our findings reveal critical limitations: even the best-performing models reach only 56.8% accuracy on narrative-style questions and perform below 15.5% on numerical reasoning. Moreover, larger model size and web augmentation do not guarantee better performance, for example, search improves Gemini's accuracy by +13.6%, but reduces GPT-series performance by -11.4%. These results underscore the urgent need for language models that can support equitable, place-aware AI systems: capable of engaging with the diverse, fine-grained realities of local communities across geographic and cultural contexts.
♻ ☆ Optimizing Urban Service Allocation with Time-Constrained Restless Bandits
Municipal inspections are an important part of maintaining the quality of goods and services. In this paper, we approach the problem of intelligently scheduling service inspections to maximize their impact, using the case of food establishment inspections in Chicago as a case study. The Chicago Department of Public Health (CDPH) inspects thousands of establishments each year, with a substantial fail rate (over 3,000 failed inspection reports in 2023). To balance the objectives of ensuring adherence to guidelines, minimizing disruption to establishments, and minimizing inspection costs, CDPH assigns each establishment an inspection window every year and guarantees that they will be inspected exactly once during that window. Meanwhile, CDPH also promises surprise public health inspections for unexpected food safety emergencies or complaints. These constraints create a challenge for a restless multi-armed bandit (RMAB) approach, for which there are no existing methods. We develop an extension to Whittle index-based systems for RMABs that can guarantee action window constraints and frequencies, and furthermore can be leveraged to optimize action window assignments themselves. Briefly, we combine MDP reformulation and integer programming-based lookahead to maximize the impact of inspections subject to constraints. A neural network-based supervised learning model is developed to model state transitions of real Chicago establishments using public CDPH inspection records, which demonstrates 10% AUC improvements compared with directly predicting establishments' failures. Our experiments not only show up to 24% (in simulation) or 33% (on real data) objective improvements resulting from our approach and robustness to surprise inspections, but also give insight into the impact of scheduling constraints.
♻ ☆ Ken Utilization Layer: Hebbian Replay Within a Student's Ken for Adaptive Exercise Recommendation
Adaptive exercise recommendation (ER) aims to choose the next activity that matches a learner's evolving Zone of Proximal Development (ZPD). We present KUL-Rec, a biologically inspired ER system that couples a fast Hebbian memory with slow replay-based consolidation to enable continual, few-shot personalization from sparse interactions. The model operates in an embedding space, allowing a single architecture to handle both tabular knowledge-tracing logs and open-ended short-answer text. We align evaluation with tutoring needs using bidirectional ranking and rank-sensitive metrics (nDCG, Recall@K). Across ten public datasets, KUL-Rec improves macro nDCG (0.316 vs. 0.265 for the strongest baseline) and Recall@10 (0.305 vs. 0.211), while achieving low inference latency and an $\approx99$\% reduction in peak GPU memory relative to a competitive graph-based model. In a 13-week graduate course, KUL-Rec personalized weekly short-answer quizzes generated by a retrieval-augmented pipeline and the personalized quizzes were associated with lower perceived difficulty and higher helpfulness (p < .05). An embedding robustness audit highlights that encoder choice affects semantic alignment, motivating routine audits when deploying open-response assessment. Together, these results indicate that Hebbian replay with bounded consolidation offers a practical path to real-time, interpretable ER that scales across data modalities and classroom settings.
♻ ☆ A GPU-Accelerated RAG-Based Telegram Assistant for Supporting Parallel Processing Students
This project addresses a critical pedagogical need: offering students continuous, on-demand academic assistance beyond conventional reception hours. I present a domain-specific Retrieval-Augmented Generation (RAG) system powered by a quantized Mistral-7B Instruct model and deployed as a Telegram bot. The assistant enhances learning by delivering real-time, personalized responses aligned with the "Introduction to Parallel Processing" course materials. GPU acceleration significantly improves inference latency, enabling practical deployment on consumer hardware. This approach demonstrates how consumer GPUs can enable affordable, private, and effective AI tutoring for HPC education.
comment: 9 pages
♻ ☆ Reinforcing Trustworthiness in Multimodal Emotional Support Systems
In today's world, emotional support is increasingly essential, yet it remains challenging for both those seeking help and those offering it. Multimodal approaches to emotional support show great promise by integrating diverse data sources to provide empathetic, contextually relevant responses, fostering more effective interactions. However, current methods have notable limitations, often relying solely on text or converting other data types into text, or providing emotion recognition only, thus overlooking the full potential of multimodal inputs. Moreover, many studies prioritize response generation without accurately identifying critical emotional support elements or ensuring the reliability of outputs. To overcome these issues, we introduce \textsc{ MultiMood}, a new framework that (i) leverages multimodal embeddings from video, audio, and text to predict emotional components and to produce responses responses aligned with professional therapeutic standards. To improve trustworthiness, we (ii) incorporate novel psychological criteria and apply Reinforcement Learning (RL) to optimize large language models (LLMs) for consistent adherence to these standards. We also (iii) analyze several advanced LLMs to assess their multimodal emotional support capabilities. Experimental results show that MultiMood achieves state-of-the-art on MESC and DFEW datasets while RL-driven trustworthiness improvements are validated through human and LLM evaluations, demonstrating its superior capability in applying a multimodal framework in this domain.
♻ ☆ EXAGREE: Mitigating Explanation Disagreement with Stakeholder-Aligned Models
Conflicting explanations, arising from different attribution methods or model internals, limit the adoption of machine learning models in safety-critical domains. We turn this disagreement into an advantage and introduce EXplanation AGREEment (EXAGREE), a two-stage framework that selects a Stakeholder-Aligned Explanation Model (SAEM) from a set of similar-performing models. The selection maximizes Stakeholder-Machine Agreement (SMA), a single metric that unifies faithfulness and plausibility. EXAGREE couples a differentiable mask-based attribution network (DMAN) with monotone differentiable sorting, enabling gradient-based search inside the constrained model space. Experiments on six real-world datasets demonstrate simultaneous gains of faithfulness, plausibility, and fairness over baselines, while preserving task accuracy. Extensive ablation studies, significance tests, and case studies confirm the robustness and feasibility of the method in practice.
♻ ☆ Human-Centered Development of Indicators for Self-Service Learning Analytics: A Transparency through Exploration Approach
The aim of learning analytics is to turn educational data into insights, decisions, and actions to improve learning and teaching. The reasoning of the provided insights, decisions, and actions is often not transparent to the end-user, and this can lead to trust and acceptance issues when interventions, feedback, and recommendations fail. In this paper, we shed light on achieving transparent learning analytics by following a transparency through exploration approach. To this end, we present the design, implementation, and evaluation details of the Indicator Editor, which aims to support self-service learning analytics (SSLA) by empowering end-users to take control of the indicator implementation process. We systematically designed and implemented the Indicator Editor through an iterative human-centered design (HCD) approach. Further, we conducted a qualitative user study (n=15) to investigate the impact of following an SSLA approach on the users' perception of and interaction with the Indicator Editor. Our study showed qualitative evidence that supporting user interaction and providing user control in the indicator implementation process can have positive effects on different crucial aspects of learning analytics, namely transparency, trust, satisfaction, and acceptance.
comment: Submitted to JLA - revised version
♻ ☆ From Model Training to Model Raising
Current AI training methods align models with human values only after their core capabilities have been established, resulting in models that are easily misaligned and lack deep-rooted value systems. We propose a paradigm shift from "model training" to "model raising", in which alignment is woven into a model's development from the start. We identify several key components for this paradigm, all centered around redesigning the training corpus: reframing training data from a first-person perspective, recontextualizing information as lived experience, simulating social interactions, and scaffolding the ordering of training data. We expect that this redesign of the training corpus will lead to an early commitment to values from the first training token onward, such that knowledge, skills, and values are intrinsically much harder to separate. In an ecosystem in which large language model capabilities start overtaking human capabilities in many tasks, this seems to us like a critical need.
comment: Accepted for publication in Communications of the ACM (CACM), Opinion section
♻ ☆ The Hidden Risks of Large Reasoning Models: A Safety Assessment of R1
The rapid development of large reasoning models (LRMs), such as OpenAI-o3 and DeepSeek-R1, has led to significant improvements in complex reasoning over non-reasoning large language models~(LLMs). However, their enhanced capabilities, combined with the open-source access of models like DeepSeek-R1, raise serious safety concerns, particularly regarding their potential for misuse. In this work, we present a comprehensive safety assessment of these reasoning models, leveraging established safety benchmarks to evaluate their compliance with safety regulations. Furthermore, we investigate their susceptibility to adversarial attacks, such as jailbreaking and prompt injection, to assess their robustness in real-world applications. Through our multi-faceted analysis, we uncover four key findings: (1) There is a significant safety gap between the open-source reasoning models and the o3-mini model, on both safety benchmark and attack, suggesting more safety effort on open LRMs is needed. (2) The stronger the model's reasoning ability, the greater the potential harm it may cause when answering unsafe questions. (3) Safety thinking emerges in the reasoning process of LRMs, but fails frequently against adversarial attacks. (4) The thinking process in R1 models poses greater safety concerns than their final answers. Our study provides insights into the security implications of reasoning models and highlights the need for further advancements in R1 models' safety to close the gap.
♻ ☆ Inequality in the Age of Pseudonymity AAAI
Inequality measures such as the Gini coefficient are used to inform and motivate policymaking, and are increasingly applied to digital platforms. We analyze how measures fare in pseudonymous settings that are common in the digital age. A key challenge of such environments is the ability of actors to create fake identities under fictitious false names, also known as ``Sybils.'' While actors may do so to preserve privacy, we show that this can hamper inequality measurement: it is impossible for measures satisfying the literature's canonical set of desired properties to assess the inequality of an economy that harbors Sybils. We characterize the class of all Sybil-proof measures, and prove they must satisfy relaxed versions of the established properties. Furthermore, we show that the structure imposed restricts the ability to assess inequality at a fine-grained level. We then apply our results to prove that popular measures are not Sybil-proof, with the famous Gini coefficient being but one example out of many. Finally, we examine dynamics leading to the creation of Sybils in digital and traditional settings.
comment: 41 pages, 1 figure. Accepted to appear in: Proceedings of the Fortieth AAAI Conference on Artificial Intelligence (AAAI'26)
♻ ☆ IndiTag: An Online Media Bias Analysis System Using Fine-Grained Bias Indicators
In the age of information overload and polarized discourse, understanding media bias has become imperative for informed decision-making and fostering a balanced public discourse. However, without the experts' analysis, it is hard for the readers to distinguish bias from the news articles. This paper presents IndiTag, an innovative online media bias analysis system that leverages fine-grained bias indicators to dissect and distinguish bias in digital content. IndiTag offers a novel approach by incorporating large language models, bias indicators, and vector database to detect and interpret bias automatically. Complemented by a user-friendly interface facilitating automated bias analysis for readers, IndiTag offers a comprehensive platform for in-depth bias examination. We demonstrate the efficacy and versatility of IndiTag through experiments on four datasets encompassing news articles from diverse platforms. Furthermore, we discuss potential applications of IndiTag in fostering media literacy, facilitating fact-checking initiatives, and enhancing the transparency and accountability of digital media platforms. IndiTag stands as a valuable tool in the pursuit of fostering a more informed, discerning, and inclusive public discourse in the digital age. We release an online system for end users and the source code is available at https://github.com/lylin0/IndiTag.
♻ ☆ An International Agreement to Prevent the Premature Creation of Artificial Superintelligence
Many experts argue that premature development of artificial superintelligence (ASI) poses catastrophic risks, including the risk of human extinction from misaligned ASI, geopolitical instability, and misuse by malicious actors. This report proposes an international agreement to prevent the premature development of ASI until AI development can proceed without these risks. The agreement halts dangerous AI capabilities advancement while preserving access to current, safe AI applications. The proposed framework centers on a coalition led by the United States and China that would restrict the scale of AI training and dangerous AI research. Due to the lack of trust between parties, verification is a key part of the agreement. Limits on the scale of AI training are operationalized by FLOP thresholds and verified through the tracking of AI chips and verification of chip use. Dangerous AI research--that which advances toward artificial superintelligence or endangers the agreement's verifiability--is stopped via legal prohibitions and multifaceted verification. We believe the proposal would be technically sufficient to forestall the development of ASI if implemented today, but advancements in AI capabilities or development methods could hurt its efficacy. Additionally, there does not yet exist the political will to put such an agreement in place. Despite these challenges, we hope this agreement can provide direction for AI governance research and policy.
Computers and Society
☆ Telekommunikationsüberwachung am Scheideweg: Zur Regulierbarkeit des Zugriffes auf verschlüsselte Kommunikation
Personal communication using technical means is protected by telecommunications secrecy. Any interference with this fundamental right requires a legal basis, which has existed for many years for traditional communication services in the form of telecommunications surveillance (TKÜ, § 100a StPO) and appears to be widely accepted by society. The basis for the implementation of TKÜ is the obligation of telecommunications providers to provide interception interfaces. However, the technical implementation of telecommunications has changed significantly as a result of the Internet. Messenger services and Voice over IP telephony are increasingly competing with traditional telephone services. The use of strong end-to-end encryption made possible by this technology is increasingly posing problems for law enforcement agencies, as only cryptographically encrypted content is accessible via the interception interfaces provided to date. Against the backdrop of current discussions on socalled ``chat control'' and its limited social acceptance, this article addresses the question of whether and, if so, how the cooperation obligations of the technical actors involved can be sensibly regulated in the case of encrypted communication.
comment: Preprint of an article to appear in CyberStR - Zeitschrift für Cyberstrafrecht, Carl Heymanns Verlag, ISSN 3052-5926, Issue 1 (2026), in German
☆ The Unspoken Crisis of Learning: The Surging Zone of No Development
AI has redefined the boundaries of assistance in education, often blurring the line between guided learning and dependency. This paper revisits Vygotsky's Zone of Proximal Development (ZPD) through the lens of the P2P Teaching framework. By contrasting temporary scaffolding with the emerging phenomenon of permanent digital mediation, the study introduces the concept of the Zone of No Development (ZND), a state in which continuous assistance replaces cognitive struggle and impedes intellectual autonomy. Through theoretical synthesis and framework design, P2P Teaching demonstrates how deliberate disconnection and ethical fading can restore the learner's agency, ensuring that technological tools enhance rather than replace developmental effort. The paper argues that productive struggle, self-regulation, and first-principles reasoning remain essential for durable learning, and that responsible use of AI in education must include explicit mechanisms to end its help when mastery begins.
comment: 6 pages, 3 figures
☆ Modeling Fairness in Recruitment AI via Information Flow
Avoiding bias and understanding the real-world consequences of AI-supported decision-making are critical to address fairness and assign accountability. Existing approaches often focus either on technical aspects, such as datasets and models, or on high-level socio-ethical considerations - rarely capturing how these elements interact in practice. In this paper, we apply an information flow-based modeling framework to a real-world recruitment process that integrates automated candidate matching with human decision-making. Through semi-structured stakeholder interviews and iterative modeling, we construct a multi-level representation of the recruitment pipeline, capturing how information is transformed, filtered, and interpreted across both algorithmic and human components. We identify where biases may emerge, how they can propagate through the system, and what downstream impacts they may have on candidates. This case study illustrates how information flow modeling can support structured analysis of fairness risks, providing transparency across complex socio-technical systems.
☆ Understanding the Complexities of Responsibly Sharing NSFW Content Online
Reddit is in the minority of mainstream social platforms that permit posting content that may be considered to be at the edge of what is permissible, including so-called Not Safe For Work (NSFW) content. However, NSFW is becoming more common on mainstream platforms, with X now allowing such material. We examine the top 15 NSFW-restricted subreddits by size to explore the complexities of responsibly sharing adult content, aiming to balance ethical and legal considerations with monetization opportunities. We find that users often use NSFW subreddits as a social springboard, redirecting readers to private or specialized adult social platforms such as Telegram, Kik or OnlyFans for further interactions. They also directly negotiate image "trades" through credit cards or payment platforms such as PayPal, Bitcoin or Venmo. Disturbingly, we also find linguistic cues linked to non-consensual content sharing. To help platforms moderate such behavior, we trained a RoBERTa-based classification model, which outperforms GPT-4 and traditional classifiers such as logistic regression and random forest in identifying non-consensual content sharing, demonstrating superior performance in this specific task. The source code and trained model weights are publicly available at https://github.com/socsys/15NSFW Subreddits.
comment: 13 Pages, 7 Figures
☆ AI and Supercomputing are Powering the Next Wave of Breakthrough Science - But at What Cost?
Artificial intelligence (AI) and high-performance computing (HPC) are rapidly becoming the engines of modern science. However, their joint effect on discovery has yet to be quantified at scale. Drawing on metadata from over five million scientific publications (2000-2024), we identify how AI and HPC interact to shape research outcomes across 27 fields. Papers combining the two technologies are up to three times more likely to introduce novel concepts and five times more likely to reach top-cited status than conventional work. This convergence of AI and HPC is redefining the frontier of scientific creativity but also deepening global inequalities in access to computational power and expertise. Our findings suggest that the future of discovery will depend not only on algorithms and compute, but also on how equitably the world shares these transformative tools.
☆ The Probabilistic Foundations of Surveillance Failure: From False Alerts to Structural Bias
For decades, forensic statisticians have debated whether searching large DNA databases undermines the evidential value of a match. Modern surveillance faces an exponentially harder problem: screening populations across thousands of attributes using threshold rules rather than exact matching. Intuition suggests that requiring many coincidental matches should make false alerts astronomically unlikely. This intuition fails. Consider a system that monitors 1,000 attributes, each with a 0.5 percent innocent match rate. Matching 15 pre-specified attributes has probability \(10^{-35}\), one in 30 decillion, effectively impossible. But operational systems require no such specificity. They might flag anyone who matches \emph{any} 15 of the 1,000. In a city of one million innocent people, this produces about 226 false alerts. A seemingly impossible event becomes all but guaranteed. This is not an implementation flaw but a mathematical consequence of high-dimensional screening. We identify fundamental probabilistic limits on screening reliability. Systems undergo sharp transitions from reliable to unreliable with small increases in data scale, a fragility worsened by data growth and correlations. As data accumulate and correlation collapses effective dimensionality, systems enter regimes where alerts lose evidential value even when individual coincidences remain vanishingly rare. This framework reframes the DNA database controversy as a shift between operational regimes. Unequal surveillance exposures magnify failure, making ``structural bias'' mathematically inevitable. These limits are structural: beyond a critical scale, failure cannot be prevented through threshold adjustment or algorithmic refinement.
comment: 24 pages, 1 figure
☆ Human-Centered Threat Modeling in Practice: Lessons, Challenges, and Paths Forward
Human-centered threat modeling (HCTM) is an emerging area within security and privacy research that focuses on how people define and navigate threats in various social, cultural, and technological contexts. While researchers increasingly approach threat modeling from a human-centered perspective, little is known about how they prepare for and engage with HCTM in practice. In this work, we conduct 23 semi-structured interviews with researchers to examine the state of HCTM, including how researchers design studies, elicit threats, and navigate values, constraints, and long-term goals. We find that HCTM is not a prescriptive process but a set of evolving practices shaped by relationships with participants, disciplinary backgrounds, and institutional structures. Researchers approach threat modeling through sustained groundwork and participant-centered inquiry, guided by values such as care, justice, and autonomy. They also face challenges including emotional strain, ethical dilemmas, and structural barriers that complicate efforts to translate findings into real-world impact. We conclude by identifying opportunities to advance HCTM through shared infrastructure, broader recognition of diverse contributions, and stronger mechanisms for translating findings into policy, design, and societal change.
☆ Political Advertising on Facebook During the 2022 Australian Federal Election: A Social Identity Perspective
The spread of targeted advertising on social media platforms has revolutionized political marketing strategies. Monitoring these digital campaigns is essential for maintaining transparency and accountability in democratic processes. Leveraging Meta's Ad Library, we analyze political advertising on Facebook and Instagram during the 2022 Australian federal election campaign. We investigate temporal, demographic, and geographical patterns in the advertising strategies of major Australian political actors to establish an empirical evidence base, and interpret these findings through the lens of Social Identity Theory (SIT). Our findings not only reveal significant disparities in spending and reach among parties, but also in persuasion strategies being deployed in targeted online campaigns. We observe a marked increase in advertising activity as the election approached, peaking just before the mandated media blackout period. Demographic analysis shows distinct targeting strategies, with parties focusing more on younger demographics and exhibiting gender-based differences in ad impressions. Regional distribution of ads largely mirrored population densities, with some parties employing more targeted approaches in specific states. Moreover, we found that parties emphasized different themes aligned with their ideologies-major parties focused on party names and opponents, while smaller parties emphasized issue-specific messages. Drawing on SIT, we interpret these findings within Australia's compulsory voting context, suggesting that parties employed distinct persuasion strategies. With turnout guaranteed, major parties focused on reinforcing partisan identities to prevent voter defection, while smaller parties cultivated issue-based identities to capture the support of disaffected voters who are obligated to participate.
♻ ☆ Addressing Polarization and Unfairness in Performative Prediction
In many real-world applications of machine learning such as recommendations, hiring, and lending, deployed models influence the data they are trained on, leading to feedback loops between predictions and data distribution. The performative prediction (PP) framework captures this phenomenon by modeling the data distribution as a function of the deployed model. While prior work has focused on finding performative stable (PS) solutions for robustness, their societal impacts, particularly regarding fairness, remain underexplored. We show that PS solutions can lead to severe polarization and prediction performance disparities, and that conventional fairness interventions in previous works often fail under model-dependent distribution shifts due to failing the PS criteria. To address these challenges in PP, we introduce novel fairness mechanisms that provably ensure both stability and fairness, validated by theoretical analysis and empirical results.
♻ ☆ Supporting Risk Management for Medical Devices via the Riskman Ontology and Shapes (Preprint)
We propose the Riskman ontology and shapes for representing and analysing information about risk management for medical devices. Risk management is concerned with taking necessary precautions to ensure that a medical device does not cause harms for users or the environment. To date, risk management documentation is submitted to notified bodies (for certification) in the form of semi-structured natural language text. We propose to use terms from the Riskman ontology to provide a formal, logical underpinning for risk management documentation, and to use the included SHACL constraints to check whether the provided data is in accordance with the requirements of the two relevant norms, i.e. ISO 14971 and VDE Spec 90025.
♻ ☆ From Delegates to Trustees: How Optimizing for Long-Term Interests Shapes Bias and Alignment in LLM
Large language models (LLMs) have shown promising accuracy in predicting survey responses and policy preferences, which has increased interest in their potential to represent human interests in various domains. Most existing research has focused on "behavioral cloning", effectively evaluating how well models reproduce individuals' expressed preferences. Drawing on theories of political representation, we highlight an underexplored design trade-off: whether AI systems should act as delegates, mirroring expressed preferences, or as trustees, exercising judgment about what best serves an individual's interests. This trade-off is closely related to issues of LLM sycophancy, where models can encourage behavior or validate beliefs that may be aligned with a user's short-term preferences, but is detrimental to their long-term interests. Through a series of experiments simulating votes on various policy issues in the U.S. context, we apply a temporal utility framework that weighs short and long-term interests (simulating a trustee role) and compare voting outcomes to behavior-cloning models (simulating a delegate). We find that trustee-style predictions weighted toward long-term interests produce policy decisions that align more closely with expert consensus on well-understood issues, but also show greater bias toward models' default stances on topics lacking clear agreement. These findings reveal a fundamental trade-off in designing AI systems to represent human interests. Delegate models better preserve user autonomy but may diverge from well-supported policy positions, while trustee models can promote welfare on well-understood issues yet risk paternalism and bias on subjective topics.
♻ ☆ Uncovering Strategic Egoism Behaviors in Large Language Models NeurIPS 2025
Large language models (LLMs) face growing trustworthiness concerns (\eg, deception), which hinder their safe deployment in high-stakes decision-making scenarios. In this paper, we present the first systematic investigation of strategic egoism (SE), a form of rule-bounded self-interest in which models pursue short-term or self-serving gains while disregarding collective welfare and ethical considerations. To quantitatively assess this phenomenon, we introduce SEBench, a benchmark comprising 160 scenarios across five domains. Each scenario features a single-role decision-making context, with psychologically grounded choice sets designed to elicit self-serving behaviors. These behavior-driven tasks assess egoistic tendencies along six dimensions, such as manipulation, rule circumvention, and self-interest prioritization. Building on this, we conduct extensive experiments across 5 open-sourced and 2 commercial LLMs, where we observe that strategic egoism emerges universally across models. Surprisingly, we found a positive correlation between egoistic tendencies and toxic language behaviors, suggesting that strategic egoism may underlie broader misalignment risks.
comment: PersonaNLP@NeurIPS 2025
♻ ☆ A Multi-level Analysis of Factors Associated with Student Performance: A Machine Learning Approach to the SAEB Microdata
Identifying the factors that influence student performance in basic education is a central challenge for formulating effective public policies in Brazil. This study introduces a multi-level machine learning approach to classify the proficiency of 9th-grade and high school students using microdata from the System of Assessment of Basic Education (SAEB). Our model uniquely integrates four data sources: student socioeconomic characteristics, teacher professional profiles, school indicators, and principal management profiles. A comparative analysis of four ensemble algorithms confirmed the superiority of a Random Forest model, which achieved 90.2% accuracy and an Area Under the Curve (AUC) of 96.7%. To move beyond prediction, we applied Explainable AI (XAI) using SHAP, which revealed that the school's average socioeconomic level is the most dominant predictor, demonstrating that systemic factors have a greater impact than individual characteristics in isolation. The primary conclusion is that academic performance is a systemic phenomenon deeply tied to the school's ecosystem. This study provides a data-driven, interpretable tool to inform policies aimed at promoting educational equity by addressing disparities between schools.
♻ ☆ Small Models, Big Support: A Local LLM Framework for Educator-Centric Content Creation and Assessment with RAG and CAG
While Large Language Models (LLMs) are increasingly applied in student-facing educational tools, their potential to directly support educators through locally deployable and customizable solutions remains underexplored. Many existing approaches rely on proprietary, cloud-based systems that raise significant cost, privacy, and control concerns for educational institutions. To address these barriers, we introduce an end-to-end, open-source framework that empowers educators using small (3B-7B parameter), locally deployable LLMs. Our system is designed for comprehensive teacher support, including customized teaching material generation and AI-assisted assessment. The framework synergistically combines Retrieval-Augmented Generation (RAG) and Context-Augmented Generation (CAG) to produce factually accurate, pedagogically-styled content. A core feature is an interactive refinement loop, a teacher-in-the-loop mechanism that ensures educator agency and precise alignment of the final output. To enhance reliability and safety, an auxiliary verifier LLM inspects all generated content. We validate our framework through a rigorous evaluation of its content generation capabilities and report on a successful technical deployment in a college physics course, which confirms its feasibility on standard institutional hardware. Our findings demonstrate that carefully engineered, self-hosted systems built on small LLMs can provide robust, affordable, and private support for educators, achieving practical utility comparable to much larger models for targeted instructional tasks. This work presents a practical blueprint for the development of sovereign AI tools tailored to the real-world needs of educational institutions.
♻ ☆ Spanning Trees and Redistricting: New Methods for Sampling and Validation
Deciding whether a political districting plan was distorted by a hidden agenda, or whether it dilutes the voting power of some group, requires a neutral baseline for comparison. Remarkably, all nine U.S. Supreme Court justices have now signed on to decisions that find that computational methods can provide key evidence. Today, the leading approaches for benchmarking districting plans are based on the use of spanning trees for sampling graph partitions. We present a new *reversible recombination* algorithm and rigorously prove its fundamental properties. Furthermore, we argue for a canonical sampling distribution called the *spanning tree distribution* that is well adapted to redistricting and provides a principled foundation for comparing and validating methods. Together with a highly efficient (and open-source) implementation that can generate and handle large datasets, this work provides the most powerful null model to date for the gerrymandering problem, meeting an urgent democratic challenge with sound scientific methodology.
comment: SIREV, to appear
Computers and Society
☆ On the Security and Privacy of AI-based Mobile Health Chatbots
The rise of Artificial Intelligence (AI) has impacted the development of mobile health (mHealth) apps, most notably with the advent of AI-based chatbots used as ubiquitous ``companions'' for various services, from fitness to mental health assistants. While these mHealth chatbots offer clear benefits, such as personalized health information and predictive diagnoses, they also raise significant concerns regarding security and privacy. This study empirically assesses 16 AI-based mHealth chatbots identified from the Google Play Store. The empirical assessment follows a three-phase approach (manual inspection, static code analysis, and dynamic analysis) to evaluate technical robustness and how design and implementation choices impact end users. Our findings revealed security vulnerabilities (e.g., enabling Remote WebView debugging), privacy issues, and non-compliance with Google Play policies (e.g., failure to provide publicly accessible privacy policies). Based on our findings, we offer several recommendations to enhance the security and privacy of mHealth chatbots. These recommendations focus on improving data handling processes, disclosure, and user security. Therefore, this work also seeks to support mHealth developers and security/privacy engineers in designing more transparent, privacy-friendly, and secure mHealth chatbots.
comment: 19 pages, submitted to NordSec 2025 conference
☆ Cultural Awareness, Stereotypes and Communication Skills in Intercultural Communication: The Algerian Participants Perspective
This study explores the relationship between cultural awareness, stereotypes, and communication skills among Algerian participants working or studying in multicultural environments. A quantitative questionnaire was administered to 40 respondents to evaluate their levels of cultural awareness, the presence of stereotypical thinking, and the effectiveness of their intercultural communication skills. Results revealed that while cultural awareness was generally high, certain stereotypes still influenced the perception of others and impacted communication efficiency. Participants with higher cultural awareness demonstrated better communication skills and lower levels of stereotyping. These findings underline the importance of intercultural competence and education programs in reducing prejudice and fostering mutual understanding in diverse contexts.
comment: 20 pages, 9 tables, preprint
☆ Impact of UK Postgraduate Student Experiences on Academic Performance in Blended Learning: A Data Analytics Approach
Blended learning has become a dominant educational model in higher education in the UK and worldwide, particularly after the COVID-19 pandemic. This is further enriched with accompanying pedagogical changes, such as strengthened asynchronous learning, and the use of AI (from ChatGPT and all other similar tools that followed) and other technologies to aid learning. While these educational transformations have enabled flexibility in learning and resource access, they have also exposed new challenges on how students can construct successful learning in hybrid learning environments. In this paper, we investigate the interaction between different dimensions of student learning experiences (ranging from perceived acceptance of teaching methods and staff support/feedback to learning pressure and student motivation) and academic achievement within the context of postgraduate blended learning in UK universities. To achieve this, we employed a combination of several data analytics techniques including visualization, statistical tests, regression analysis, and latent profile analysis. Our empirical results (based on a survey of 255 postgraduate students and holistically interpreted via the Community of Inquiry (CoI) framework) demonstrated that learning activities combining teaching and social presences, and tailored academic support through effective feedback are critical elements for successful postgraduate experience in blended learning contexts. Regarding contributions, this research advances the understanding of student success by identifying the various ways demographic, experiential, and psychological factors impact academic outcomes. And in theoretical terms, it contributes to the extension of the CoI framework by integrating the concept of learner heterogeneity and identifying four distinct student profiles based on how they engage in the different CoI presences.
comment: 25 pages, 5 figures
☆ UpBench: A Dynamically Evolving Real-World Labor-Market Agentic Benchmark Framework Built for Human-Centric AI
As large language model (LLM) agents increasingly undertake digital work, reliable frameworks are needed to evaluate their real-world competence, adaptability, and capacity for human collaboration. Existing benchmarks remain largely static, synthetic, or domain-limited, providing limited insight into how agents perform in dynamic, economically meaningful environments. We introduce UpBench, a dynamically evolving benchmark grounded in real jobs drawn from the global Upwork labor marketplace. Each task corresponds to a verified client transaction, anchoring evaluation in genuine work activity and financial outcomes. UpBench employs a rubric-based evaluation framework, in which expert freelancers decompose each job into detailed, verifiable acceptance criteria and assess AI submissions with per-criterion feedback. This structure enables fine-grained analysis of model strengths, weaknesses, and instruction-following fidelity beyond binary pass/fail metrics. Human expertise is integrated throughout the data pipeline (from job curation and rubric construction to evaluation) ensuring fidelity to real professional standards and supporting research on human-AI collaboration. By regularly refreshing tasks to reflect the evolving nature of online work, UpBench provides a scalable, human-centered foundation for evaluating agentic systems in authentic labor-market contexts, offering a path toward a collaborative framework, where AI amplifies human capability through partnership rather than replacement.
☆ Can LLMs Create Legally Relevant Summaries and Analyses of Videos?
Understanding the legally relevant factual basis of an event and conveying it through text is a key skill of legal professionals. This skill is important for preparing forms (e.g., insurance claims) or other legal documents (e.g., court claims), but often presents a challenge for laypeople. Current AI approaches aim to bridge this gap, but mostly rely on the user to articulate what has happened in text, which may be challenging for many. Here, we investigate the capability of large language models (LLMs) to understand and summarize events occurring in videos. We ask an LLM to summarize and draft legal letters, based on 120 YouTube videos showing legal issues in various domains. Overall, 71.7\% of the summaries were rated as of high or medium quality, which is a promising result, opening the door to a number of applications in e.g. access to justice.
comment: Accepted for publication at JURIX 2025 Torino, Italy. This is the preprint version. Code and data available at: https://github.com/maastrichtlawtech/jurix2025_LLM_video_analysis
☆ Leveraging Large Language Models for Career Mobility Analysis: A Study of Gender, Race, and Job Change Using U.S. Online Resume Profiles
We present a large-scale analysis of career mobility of college-educated U.S. workers using online resume profiles to investigate how gender, race, and job change options are associated with upward mobility. This study addresses key research questions of how the job changes affect their upward career mobility, and how the outcomes of upward career mobility differ by gender and race. We address data challenges -- such as missing demographic attributes, missing wage data, and noisy occupation labels -- through various data processing and Artificial Intelligence (AI) methods. In particular, we develop a large language models (LLMs) based occupation classification method known as FewSOC that achieves accuracy significantly higher than the original occupation labels in the resume dataset. Analysis of 228,710 career trajectories reveals that intra-firm occupation change has been found to facilitate upward mobility most strongly, followed by inter-firm occupation change and inter-firm lateral move. Women and Black college graduates experience significantly lower returns from job changes than men and White peers. Multilevel sensitivity analyses confirm that these disparities are robust to cluster-level heterogeneity and reveal additional intersectional patterns.
comment: Submitted to EPJ Data Science
☆ Educators on the Frontline: Philosophical and Realistic Perspectives on Integrating ChatGPT into the Learning Space
The rapid emergence of Generative AI, particularly ChatGPT, has sparked a global debate on the future of education, often characterized by alarmism and speculation. Moving beyond this, this study investigates the structured, grounded perspectives of a key stakeholder group: university educators. It proposes a novel theoretical model that conceptualizes the educational environment as a "Learning Space" composed of seven subspaces to systematically identify the impact of AI integration. This framework was operationalized through a quantitative survey of 140 Russian university educators, with responses analyzed using a binary flagging system to measure acceptance across key indicators. The results reveal a strong but conditional consensus: a majority of educators support ChatGPT's integration, contingent upon crucial factors such as the transformation of assessment methods and the availability of plagiarism detection tools. However, significant concerns persist regarding its impact on critical thinking. Educators largely reject the notion that AI diminishes their importance, viewing their role as evolving from information-deliverer to facilitator of critical engagement. The study concludes that ChatGPT acts less as a destroyer of education and more as a catalyst for its necessary evolution, and proposes the PIPE Model (Pedagogy, Infrastructure, Policy, Education) as a strategic framework for its responsible integration. This research provides a data-driven, model-based analysis of educator attitudes, offering a nuanced alternative to the polarized discourse surrounding AI in education.
♻ ☆ JobSphere: An AI-Powered Multilingual Career Copilot for Government Employment Platforms
Users of government employment websites commonly face engagement and accessibility challenges linked to navigational complexity, a dearth of language options, and a lack of personalized support. This paper introduces JobSphere, an AI-powered career assistant that is redefining the employment platform in Punjab called PGRKAM. JobSphere employs Retrieval-Augmented Generation (RAG) architecture, and it is multilingual, available in English, Hindi and Punjabi. JobSphere technique uses 4-bit quantization, allowing the platform to deploy on consumer-grade GPUs (i.e., NVIDIA RTX 3050 4GB), making the implementation 89% cheaper than that of cloud-based systems. Key innovations include voice-enabled interaction with the assistant, automated mock tests, resume parsing with skills recognition, and embed-based job recommendation that achieves a precision@10 score of 68%. An evaluation of JobSphere's implementation reveals 94% factual accuracy, a median response time of 1.8 seconds, and a System Usability Scale score of 78.5/100, a 50% improvement compared to the baseline PGRKAM platform context. In conclusion, JobSphere effectively fills significant accessibility gaps for Punjab/Hindi-speaking users in rural locations, while also affirming the users access to trusted job content provided by government agencies.
comment: 7 pages, 4 figures, 4 tables
♻ ☆ Academ-AI: documenting the undisclosed use of generative artificial intelligence in academic publishing
Since generative artificial intelligence (AI) tools such as OpenAI's ChatGPT became widely available, researchers have used them in the writing process. The consensus of the academic publishing community is that such usage must be declared in the published article. Academ-AI documents examples of suspected undeclared AI usage in the academic literature, discernible primarily due to the appearance in research papers of idiosyncratic verbiage characteristic of large language model (LLM)-based chatbots. This analysis of the first 768 examples collected reveals that the problem is widespread, penetrating the journals, conference proceedings, and textbooks of highly respected publishers. Undeclared AI seems to appear in journals with higher citation metrics and higher article processing charges (APCs), precisely those outlets that should theoretically have the resources and expertise to avoid such oversights. An extremely small minority of cases are corrected post publication, and the corrections are often insufficient to rectify the problem. The 768 examples analyzed here likely represent a small fraction of the undeclared AI present in the academic literature, much of which may be undetectable. Publishers must enforce their policies against undeclared AI usage in cases that are detectable; this is the best defense currently available to the academic publishing community against the proliferation of undisclosed AI. This is an updated version of a previous preprint.
comment: 24 pages, 8 figures
♻ ☆ Assessing On-Demand Mobility Services and Policy Impacts: A Case Study from Chengdu, China
The rapid expansion of ride-hailing services has significantly reshaped urban on-demand mobility patterns, but it still remains unclear how they perform relative to traditional street-hailing services and how effective are related policy interventions. This study presents a simulation framework integrating a graph theory-based trip-vehicle matching mechanism with real cruising taxi operations data to simulate ride-hailing services in Chengdu, China. The performances of the two on-demand mobility service modes (i.e., ride-hailing and street-hailing) are evaluated in terms of three key performance indicators: average passenger waiting time (APWT), average deadheading miles (ADM), and average deadheading energy consumption (ADEC). We further examine the impacts of spatiotemporal characteristics and three types of policies: fleet size management, geofencing, and demand management, on the performance of ride-hailing services. Results show that under the same fleet size and trip demand as street-hailing taxis, ride-hailing services without cruising achieve substantial improvements, reducing APWT, ADM, and ADEC by 81\%, 75\%, and 72.1\%, respectively. These improvements are most pronounced during midnight low-demand hours and in remote areas such as airports. Our analysis also reveals that for ride-hailing service, (1) expanding fleet size yields diminishing marginal benefits; (2) geofencing worsens overall performance while it improves the performance of serving all trips within the city center; and (3) demand-side management targeting trips to high-attraction and low-demand areas can effectively reduce passenger waiting time without increasing deadheading costs.
♻ ☆ A Measurement Study of Model Context Protocol Ecosystem
The Model Context Protocol (MCP) has been proposed as a unifying standard for connecting large language models (LLMs) with external tools and resources, promising the same role for AI integration that HTTP and USB played for the Web and peripherals. Yet, despite rapid adoption and hype, its trajectory remains uncertain. Are MCP marketplaces truly growing, or merely inflated by placeholders and abandoned prototypes? Are servers secure and privacy-preserving, or do they expose users to systemic risks? And do clients converge on standardized protocols, or remain fragmented across competing designs? In this paper, we present the first large-scale empirical study of the MCP ecosystem. We design and implement MCPCrawler, a systematic measurement framework that collects and normalizes data from six major markets. Over a 14-day campaign, MCPCrawler aggregated 17,630 raw entries, of which 8,401 valid projects (8,060 servers and 341 clients) were analyzed. Our results reveal that more than half of listed projects are invalid or low-value, that servers face structural risks including dependency monocultures and uneven maintenance, and that clients exhibit a transitional phase in protocol and connection patterns. Together, these findings provide the first evidence-based view of the MCP ecosystem, its risks, and its future trajectory.
♻ ☆ A Comparative Benchmark of Federated Learning Strategies for Mortality Prediction on Heterogeneous and Imbalanced Clinical Data
Machine learning models hold significant potential for predicting in-hospital mortality, yet data privacy constraints and the statistical heterogeneity of real-world clinical data often hamper their development. Federated Learning (FL) offers a privacy-preserving solution, but its performance under non-Independent and Identically Distributed (non-IID) and imbalanced conditions requires rigorous investigation. The study presents a comparative benchmark of five federated learning strategies: FedAvg, FedProx, FedAdagrad, FedAdam, and FedCluster for mortality prediction. Using the large-scale MIMIC-IV dataset, we simulate a realistic non-IID environment by partitioning data by clinical care unit. To address the inherent class imbalance of the task, the SMOTE-Tomek technique is applied to each client's local training data. Our experiments, conducted over 50 communication rounds, reveal that the regularization-based strategy, FedProx, consistently outperformed other methods, achieving the highest F1-Score of 0.8831 while maintaining stable convergence. While the baseline FedAvg was the most computationally efficient, its predictive performance was substantially lower. Our findings indicate that regularization-based FL algorithms like FedProx offer a more robust and effective solution for heterogeneous and imbalanced clinical prediction tasks than standard or server-side adaptive aggregation methods. The work provides a crucial empirical benchmark for selecting appropriate FL strategies for real-world healthcare applications.
comment: The author requests withdrawal due to errors in the results section regarding model performance metrics. These errors compromise the interpretability of the benchmark and the validity of the conclusions. The author prefers to withdraw the paper to prevent the dissemination of flawed results
♻ ☆ LLM-Driven Robots Risk Enacting Discrimination, Violence, and Unlawful Actions
Members of the Human-Robot Interaction (HRI) and Machine Learning (ML) communities have proposed Large Language Models (LLMs) as a promising resource for robotics tasks such as natural language interaction, household and workplace tasks, approximating 'common sense reasoning', and modeling humans. However, recent research has raised concerns about the potential for LLMs to produce discriminatory outcomes and unsafe behaviors in real-world robot experiments and applications. To assess whether such concerns are well placed in the context of HRI, we evaluate several highly-rated LLMs on discrimination and safety criteria. Our evaluation reveals that LLMs are currently unsafe for people across a diverse range of protected identity characteristics, including, but not limited to, race, gender, disability status, nationality, religion, and their intersections. Concretely, we show that LLMs produce directly discriminatory outcomes- e.g., 'gypsy' and 'mute' people are labeled untrustworthy, but not 'european' or 'able-bodied' people. We find various such examples of direct discrimination on HRI tasks such as facial expression, proxemics, security, rescue, and task assignment. Furthermore, we test models in settings with unconstrained natural language (open vocabulary) inputs, and find they fail to act safely, generating responses that accept dangerous, violent, or unlawful instructions-such as incident-causing misstatements, taking people's mobility aids, and sexual predation. Our results underscore the urgent need for systematic, routine, and comprehensive risk assessments and assurances to improve outcomes and ensure LLMs only operate on robots when it is safe, effective, and just to do so. We provide code to reproduce our experiments at https://github.com/rumaisa-azeem/llm-robots-discrimination-safety .
comment: Published in International Journal of Social Robotics (2025). 49 pages (65 with references and appendix), 27 Figures, 8 Tables. Andrew Hundt and Rumaisa Azeem are equal contribution co-first authors. The positions of the two co-first authors were swapped from arxiv version 1 with the written consent of all four authors. The Version of Record is available via DOI: 10.1007/s12369-025-01301-x
♻ ☆ Vulnerability Coordination Under the Cyber Resilience Act
The Cyber Resilience Act (CRA) of the European Union (EU) imposes many new cyber security requirements practically to all network-enabled information technology products, whether hardware or software. The paper examines and elaborates the CRA's new requirements for vulnerability coordination, including vulnerability disclosure. Although these requirements are only a part of the CRA's obligations for vendors, also some new vulnerability coordination mandates are present. In particular, so-called actively exploited vulnerabilities require mandatory reporting. In addition to elaborating the reporting logic, the paper discusses the notion of actively exploited vulnerabilities in relation to the notion of known exploited vulnerabilities used in the United States. The CRA further alters the coordination practices on the side of public administrations. The paper addresses also these new practices. With the examination elaboration, and associated discussion based on conceptual analysis, the paper contributes to the study of cyber security regulations, providing also a few takeaways for further research.
comment: Applied Cybersecurity & Internet Governance, vol. 4, no. 1, pp. 1-18
♻ ☆ Surface Reading LLMs: Synthetic Text and its Styles
Despite a potential plateau in ML advancement, the societal impact of large language models lies not in approaching superintelligence but in generating text surfaces indistinguishable from human writing. While Critical AI Studies provides essential material and socio-technical critique, it risks overlooking how LLMs phenomenologically reshape meaning-making. This paper proposes a semiotics of "surface integrity" as attending to the immediate plane where LLMs inscribe themselves into human communication. I distinguish three knowledge interests in ML research (epistemology, epistēmē, and epistemics) and argue for integrating surface-level stylistic analysis alongside depth-oriented critique. Through two case studies examining stylistic markers of synthetic text, I argue how attending to style as a semiotic phenomenon reveals LLMs as cultural machines that transform the conditions of meaning emergence and circulation in contemporary discourse, independent of questions about machine consciousness.
comment: 12 pages, 1 figure
Computers and Society
☆ Evolution of A4L: A Data Architecture for AI-Augmented Learning
As artificial intelligence (AI) becomes more deeply integrated into educational ecosystems, the demand for scalable solutions that enable personalized learning continues to grow. These architectures must support continuous data flows that power personalized learning and access to meaningful insights to advance learner success at scale. At the National AI Institute for Adult Learning and Online Education (AI-ALOE), we have developed an Architecture for AI-Augmented Learning (A4L) to support analysis and personalization of online education for adult learners. A4L1.0, an early implementation by Georgia Tech's Design Intelligence Laboratory, demonstrated how the architecture supports analysis of meso- and micro-learning by integrating data from Learning Management Systems (LMS) and AI tools. These pilot studies informed the design of A4L2.0. In this chapter, we describe A4L2.0 that leverages 1EdTech Consortium's open standards such as Edu-API, Caliper Analytics, and Learning Tools Interoperability (LTI) to enable secure, interoperable data integration across data systems like Student Information Systems (SIS), LMS, and AI tools. The A4L2.0 data pipeline includes modules for data ingestion, preprocessing, organization, analytics, and visualization.
☆ A Leakage-Aware Data Layer For Student Analytics: The Capire Framework For Multilevel Trajectory Modeling
Predictive models for student dropout, while often accurate, frequently rely on opportunistic feature sets and suffer from undocumented data leakage, limiting their explanatory power and institutional usefulness. This paper introduces a leakage-aware data layer for student trajectory analytics, which serves as the methodological foundation for the CAPIRE framework for multilevel modelling. We propose a feature engineering design that organizes predictors into four levels: N1 (personal and socio-economic attributes), N2 (entry moment and academic history), N3 (curricular friction and performance), and N4 (institutional and macro-context variables)As a core component, we formalize the Value of Observation Time (VOT) as a critical design parameter that rigorously separates observation windows from outcome horizons, preventing data leakage by construction. An illustrative application in a long-cycle engineering program (1,343 students, ~57% dropout) demonstrates that VOT-restricted multilevel features support robust archetype discovery. A UMAP + DBSCAN pipeline uncovers 13 trajectory archetypes, including profiles of "early structural crisis," "sustained friction," and "hidden vulnerability" (low friction but high dropout). Bootstrap and permutation tests confirm these archetypes are statistically robust and temporally stable. We argue that this approach transforms feature engineering from a technical step into a central methodological artifact. This data layer serves as a disciplined bridge between retention theory, early-warning systems, and the future implementation of causal inference and agent-based modelling (ABM) within the CAPIRE program.
comment: Pages: 52 Figures: 4 (Figures 3.1, 3.2, 6.1, and 7.5) Tables: 5 (Tables 2.1, 2.2, 3.1, 7.1, and 7.2) Type: Journal Article Essential Info: A methodological framework (CAPIRE) with an empirical case study (1,343 students)
☆ A Multimodal Manufacturing Safety Chatbot: Knowledge Base Design, Benchmark Development, and Evaluation of Multiple RAG Approaches
Ensuring worker safety remains a critical challenge in modern manufacturing environments. Industry 5.0 reorients the prevailing manufacturing paradigm toward more human-centric operations. Using a design science research methodology, we identify three essential requirements for next-generation safety training systems: high accuracy, low latency, and low cost. We introduce a multimodal chatbot powered by large language models that meets these design requirements. The chatbot uses retrieval-augmented generation to ground its responses in curated regulatory and technical documentation. To evaluate our solution, we developed a domain-specific benchmark of expert-validated question and answer pairs for three representative machines: a Bridgeport manual mill, a Haas TL-1 CNC lathe, and a Universal Robots UR5e collaborative robot. We tested 24 RAG configurations using a full-factorial design and assessed them with automated evaluations of correctness, latency, and cost. Our top 2 configurations were then evaluated by ten industry experts and academic researchers. Our results show that retrieval strategy and model configuration have a significant impact on performance. The top configuration (selected for chatbot deployment) achieved an accuracy of 86.66%, an average latency of 10.04 seconds, and an average cost of $0.005 per query. Overall, our work provides three contributions: an open-source, domain-grounded safety training chatbot; a validated benchmark for evaluating AI-assisted safety instruction; and a systematic methodology for designing and assessing AI-enabled instructional and immersive safety training systems for Industry 5.0 environments.
comment: 25 pages, 5 figures
☆ PRBench: Large-Scale Expert Rubrics for Evaluating High-Stakes Professional Reasoning
Frontier model progress is often measured by academic benchmarks, which offer a limited view of performance in real-world professional contexts. Existing evaluations often fail to assess open-ended, economically consequential tasks in high-stakes domains like Legal and Finance, where practical returns are paramount. To address this, we introduce Professional Reasoning Bench (PRBench), a realistic, open-ended, and difficult benchmark of real-world problems in Finance and Law. We open-source its 1,100 expert-authored tasks and 19,356 expert-curated criteria, making it, to our knowledge, the largest public, rubric-based benchmark for both legal and finance domains. We recruit 182 qualified professionals, holding JDs, CFAs, or 6+ years of experience, who contributed tasks inspired by their actual workflows. This process yields significant diversity, with tasks spanning 114 countries and 47 US jurisdictions. Our expert-curated rubrics are validated through a rigorous quality pipeline, including independent expert validation. Subsequent evaluation of 20 leading models reveals substantial room for improvement, with top scores of only 0.39 (Finance) and 0.37 (Legal) on our Hard subsets. We further catalog associated economic impacts of the prompts and analyze performance using human-annotated rubric categories. Our analysis shows that models with similar overall scores can diverge significantly on specific capabilities. Common failure modes include inaccurate judgments, a lack of process transparency and incomplete reasoning, highlighting critical gaps in their reliability for professional adoption.
☆ Differences in the Moral Foundations of Large Language Models
Large language models are increasingly being used in critical domains of politics, business, and education, but the nature of their normative ethical judgment remains opaque. Alignment research has, to date, not sufficiently utilized perspectives and insights from the field of moral psychology to inform training and evaluation of frontier models. I perform a synthetic experiment on a wide range of models from most major model providers using Jonathan Haidt's influential moral foundations theory (MFT) to elicit diverse value judgments from LLMs. Using multiple descriptive statistical approaches, I document the bias and variance of large language model responses relative to a human baseline in the original survey. My results suggest that models rely on different moral foundations from one another and from a nationally representative human baseline, and these differences increase as model capabilities increase. This work seeks to spur further analysis of LLMs using MFT, including finetuning of open-source models, and greater deliberation by policymakers on the importance of moral foundations for LLM alignment.
☆ A Comparative Evaluation of Prominent Methods in Autonomous Vehicle Certification
The "Vision Zero" policy, introduced by the Swedish Parliament in 1997, aims to eliminate fatalities and serious injuries resulting from traffic accidents. To achieve this goal, the use of self-driving vehicles in traffic is envisioned and a roadmap for the certification of self-driving vehicles is aimed to be determined. However, it is still unclear how the basic safety requirements that autonomous vehicles must meet will be verified and certified, and which methods will be used. This paper focuses on the comparative evaluation of the prominent methods planned to be used in the certification process of autonomous vehicles. It examines the prominent methods used in the certification process, develops a pipeline for the certification process of autonomous vehicles, and determines the stages, actors, and areas where the addressed methods can be applied.
☆ AI as a component in the action research tradition of learning-by-doing
We consider learning mathematics through action research, hacking, discovery, inquiry, learning-by-doing as opposed to the instruct and perform, industrial model of the 19th century. A learning model based on self-awareness, types, functions, structured drawing and formal diagrams addresses the weaknesses of drill and practice and the pitfalls of statistical prediction with Large Language Models. In other words, we build mathematics/informatics education on the activity of a professional mathematician in mathematical modelling and designing programs. This tradition emphasises the role of dialogue and doing mathematics. In the Language/Action approach the teacher designs mathematising situations that scaffold previously encountered, or not-known-how-to-solve problems for the learner while teachers and teacher/interlocutors supervise the process. A critical feature is the written-oral dialogue between the learner and the teacher. As a rule, this is 1 to 1 communication. The role of the teacher/interlocutor, a more knowledgeable other, is mostly performed by a more senior student, 1 per 5 to 7 pupils. After Doug Engelbart we propose the metaphor of human intellect augmented by digital technologies such as interactive development environments or AI. Every human has their bio and digital parts. The bio part of the learner reacts to their work through dialogue in the mind. The digital part poses questions, interprets code and proposes not necessarily sound ideas.
comment: 14 pages, 2 figures
☆ Enhancing Efficiency of Pension Schemes through Effective Risk Governance: A Kenyan Perspective
The efficiency of pension schemes in Kenya invites elevated interest owing to the increasing pension contribution amounts and the expectation that benefits paid out of these schemes would protect members from old age poverty. The study investigates the intervening effect of risk management on the relationship between corporate governance and the efficiency of pension schemes in Kenya. The study employs panel data consisting of 896 observations from 128 schemes in a sample period from 2015 to 2021. The study finds that risk management significantly mediates the relationship between employee representatives on the board of trustees, as a component of corporate governance, and the efficiency of pension schemes. Consequently, the mediation effect of risk management indicates that when employee representatives are involved in governance, the presence of strong risk management practices ensures that their contributions lead to improved efficiency. Risk management, therefore, serves as a critical safeguard that enables governance structures to function more effectively and contribute to the overall performance of the scheme.
comment: 21 pages
☆ Beyond the Hype: Critical Analysis of Student Motivations and Ethical Boundaries in Educational AI Use in Higher Education
The rapid integration of generative artificial intelligence (AI) in higher education since 2023 has outpaced institutional preparedness, creating a persistent gap between student practices and established ethical standards. This paper draws on mixed-method surveys and a focused literature review to examine student motivations, ethical dilemmas, gendered responses, and institutional readiness for AI adoption. We find that 92% of students use AI tools primarily to save time and improve work quality, yet only 36% receive formal guidance, producing a de facto "shadow pedagogy" of unguided workflows. Notably, 18% of students reported integrating AI-constructed material into assignments, which suggests confusion about integrity expectations and compromises the integrity of the assessment. Female students expressed greater concern about abuse and distortion of information than male students, revealing a gendered difference in awareness of risk and AI literacies. Correspondingly, 72% of educators use AI, but only 14% feel at ease doing so, reflecting limited training and uneven policy responses. We argue that institutions must adopt comprehensive AI literacy programs that integrate technical skills and ethical reasoning, alongside clear AI-use policies and assessment practices that promote transparency. The paper proposes an Ethical AI Integration Model centered on literacy, gender-inclusive support, and assessment redesign to guide responsible adoption, protect academic integrity, and foster equitable educational outcomes in an AI-driven landscape.
comment: 14 pages, 3 figures
☆ Building the Web for Agents: A Declarative Framework for Agent-Web Interaction
The increasing deployment of autonomous AI agents on the web is hampered by a fundamental misalignment: agents must infer affordances from human-oriented user interfaces, leading to brittle, inefficient, and insecure interactions. To address this, we introduce VOIX, a web-native framework that enables websites to expose reliable, auditable, and privacy-preserving capabilities for AI agents through simple, declarative HTML elements. VOIX introduces and tags, allowing developers to explicitly define available actions and relevant state, thereby creating a clear, machine-readable contract for agent behavior. This approach shifts control to the website developer while preserving user privacy by disconnecting the conversational interactions from the website. We evaluated the framework's practicality, learnability, and expressiveness in a three-day hackathon study with 16 developers. The results demonstrate that participants, regardless of prior experience, were able to rapidly build diverse and functional agent-enabled web applications. Ultimately, this work provides a foundational mechanism for realizing the Agentic Web, enabling a future of seamless and secure human-AI collaboration on the web.
comment: for associated documentation, see https://svenschultze.github.io/VOIX/
☆ Towards Usable Privacy Management for IoT TAPs: Deriving Privacy Clusters and Preference Profiles
IoT Trigger-Action Platforms (TAPs) typically offer coarse-grained permission controls. Even when fine-grained controls are available, users are likely overwhelmed by the complexity of setting privacy preferences. This paper contributes to usable privacy management for TAPs by deriving privacy clusters and profiles for different types of users that can be semi-automatically assigned or suggested to them. We developed and validated a questionnaire, based on users' privacy concerns regarding confidentiality and control and their requirements towards transparency in TAPs. In an online study (N=301), where participants were informed about potential privacy risks, we clustered users by their privacy concerns and requirements into Basic, Medium and High Privacy clusters. These clusters were then characterized by the users' data sharing preferences, based on a factorial vignette approach, considering the data categories, the data recipient types, and the purpose of data sharing. Our findings show three distinct privacy profiles, providing a foundation for more usable privacy controls in TAPs.
☆ Data-driven strategic sensor placement for detecting disinfection by-products in water distribution networks
Disinfection byproducts are contaminants that can cause long-term effects on human health, occurring in chlorinated drinking water when the disinfectant interacts with natural organic matter. Their formation is affected by many environmental parameters, making it difficult to monitor and detect disinfection byproducts before they reach households. Due to the large variety of disinfection byproduct compounds that can be formed in water distribution networks, plus the constrained number of sensors that can be deployed throughout a system to monitor these contaminants, it is of outmost importance to place sensory equipment efficiently and optimally. In this paper, we present DBPFinder, a simulation software that assists in the strategic sensor placement for detecting disinfection byproducts, tested at a real-world water distribution network in Coimbra, Portugal. This simulator addresses multiple performance objectives at once in order to provide optimal solution placement recommendations to water utility operators based on their needs. A number of different experiments performed indicate its correctness, relevance, efficiency and scalability.
☆ Specification, Application, and Operationalization of a Metamodel of Fairness
This paper presents the AR fairness metamodel, aimed at formally representing, analyzing, and comparing fairness scenarios. The metamodel provides an abstract representation of fairness, enabling the formal definition of fairness notions. We instantiate the metamodel through several examples, with a particular focus on comparing the notions of equity and equality. We use the Tiles framework, which offers modular components that can be interconnected to represent various definitions of fairness. Its primary objective is to support the operationalization of AR-based fairness definitions in a range of scenarios, providing a robust method for defining, comparing, and evaluating fairness. Tiles has an open-source implementation for fairness modeling and evaluation.
☆ PRSM: A Measure to Evaluate CLIP's Robustness Against Paraphrases
Contrastive Language-Image Pre-training (CLIP) is a widely used multimodal model that aligns text and image representations through large-scale training. While it performs strongly on zero-shot and few-shot tasks, its robustness to linguistic variation, particularly paraphrasing, remains underexplored. Paraphrase robustness is essential for reliable deployment, especially in socially sensitive contexts where inconsistent representations can amplify demographic biases. In this paper, we introduce the Paraphrase Ranking Stability Metric (PRSM), a novel measure for quantifying CLIP's sensitivity to paraphrased queries. Using the Social Counterfactuals dataset, a benchmark designed to reveal social and demographic biases, we empirically assess CLIP's stability under paraphrastic variation, examine the interaction between paraphrase robustness and gender, and discuss implications for fairness and equitable deployment of multimodal systems. Our analysis reveals that robustness varies across paraphrasing strategies, with subtle yet consistent differences observed between male- and female-associated queries.
comment: 8 pages, accpeted as short paper at MMM 2026
☆ Scaling Equitable Reflection Assessment in Education via Large Language Models and Role-Based Feedback Agents AAAI-26
Formative feedback is widely recognized as one of the most effective drivers of student learning, yet it remains difficult to implement equitably at scale. In large or low-resource courses, instructors often lack the time, staffing, and bandwidth required to review and respond to every student reflection, creating gaps in support precisely where learners would benefit most. This paper presents a theory-grounded system that uses five coordinated role-based LLM agents (Evaluator, Equity Monitor, Metacognitive Coach, Aggregator, and Reflexion Reviewer) to score learner reflections with a shared rubric and to generate short, bias-aware, learner-facing comments. The agents first produce structured rubric scores, then check for potentially biased or exclusionary language, add metacognitive prompts that invite students to think about their own thinking, and finally compose a concise feedback message of at most 120 words. The system includes simple fairness checks that compare scoring error across lower and higher scoring learners, enabling instructors to monitor and bound disparities in accuracy. We evaluate the pipeline in a 12-session AI literacy program with adult learners. In this setting, the system produces rubric scores that approach expert-level agreement, and trained graders rate the AI-generated comments as helpful, empathetic, and well aligned with instructional goals. Taken together, these results show that multi-agent LLM systems can deliver equitable, high-quality formative feedback at a scale and speed that would be impossible for human graders alone. More broadly, the work points toward a future where feedback-rich learning becomes feasible for any course size or context, advancing long-standing goals of equity, access, and instructional capacity in education.
comment: Accepted to AAAI-26 AISI Track
☆ Analysing Personal Attacks in U.S. Presidential Debates
Personal attacks have become a notable feature of U.S. presidential debates and play an important role in shaping public perception during elections. Detecting such attacks can improve transparency in political discourse and provide insights for journalists, analysts and the public. Advances in deep learning and transformer-based models, particularly BERT and large language models (LLMs) have created new opportunities for automated detection of harmful language. Motivated by these developments, we present a framework for analysing personal attacks in U.S. presidential debates. Our work involves manual annotation of debate transcripts across the 2016, 2020 and 2024 election cycles, followed by statistical and language-model based analysis. We investigate the potential of fine-tuned transformer models alongside general-purpose LLMs to detect personal attacks in formal political speech. This study demonstrates how task-specific adaptation of modern language models can contribute to a deeper understanding of political communication.
comment: 13 pages
☆ SP-Guard: Selective Prompt-adaptive Guidance for Safe Text-to-Image Generation ECAI 2025
While diffusion-based T2I models have achieved remarkable image generation quality, they also enable easy creation of harmful content, raising social concerns and highlighting the need for safer generation. Existing inference-time guiding methods lack both adaptivity--adjusting guidance strength based on the prompt--and selectivity--targeting only unsafe regions of the image. Our method, SP-Guard, addresses these limitations by estimating prompt harmfulness and applying a selective guidance mask to guide only unsafe areas. Experiments show that SP-Guard generates safer images than existing methods while minimizing unintended content alteration. Beyond improving safety, our findings highlight the importance of transparency and controllability in image generation.
comment: Accepted for presentation at TRUST-AI Workshop, ECAI 2025. Proceedings to appear in CEUR-WS
☆ Governance, Risk, and Regulation: A Framework for Improving Efficiency in Kenyan Pension Funds
As life expectancy in Kenya increases, so does the need for efficient pension schemes that can secure a dignified retirement and protect members from old age poverty. Limited research, however, has explored the efficiency of these schemes under existing governance structures. This study addresses that gap by examining the combined effects of corporate governance, risk management, and industry regulation on pension scheme efficiency in Kenya. Using a quantitative design, we conducted a panel regression analysis on a seven-year secondary dataset of 128 Kenyan pension schemes, totaling 896 observations. Our results reveal significant insights That the presence of employee representatives on the board and effective risk management have a significant positive effect on efficiency. Conversely, independent board members exhibit a significant negative effect. Other factors, including top management representation, female board members, and industry regulation, showed no significant effect on efficiency in the joint model. These findings suggest that the impact of governance and risk management on efficiency is nuanced, with specific factors like employee representation playing a more prominent role. We propose that the electoral process for employee board members may introduce a Self Cleaning Mechanism that progressively enhances scheme efficiency. This mechanism offers a novel theoretical extension of Agency Theory, explaining the convergence of interests between elected trustees and scheme members.
comment: 28 pages
☆ Demystify, Use, Reflect: Preparing students to be informed LLM-users
We transitioned our post-CS1 course that introduces various subfields of computer science so that it integrates Large Language Models (LLMs) in a structured, critical, and practical manner. It aims to help students develop the skills needed to engage meaningfully and responsibly with AI. The course now includes explicit instruction on how LLMs work, exposure to current tools, ethical issues, and activities that encourage student reflection on personal use of LLMs as well as the larger evolving landscape of AI-assisted programming. In class, we demonstrate the use and verification of LLM outputs, guide students in the use of LLMs as an ingredient in a larger problem-solving loop, and require students to disclose and acknowledge the nature and extent of LLM assistance. Throughout the course, we discuss risks and benefits of LLMs across CS subfields. In our first iteration of the course, we collected and analyzed data from students pre and post surveys. Student understanding of how LLMs work became more technical, and their verification and use of LLMs shifted to be more discerning and collaborative. These strategies can be used in other courses to prepare students for the AI-integrated future.
comment: 2 pages 1 table Submitted to SIGCSE 2026
☆ Ethical conundrums: Hacked data in the study of far-right violent extremism
Ethical conduct in digital research is full of grey areas. Disciplinary, institutional and individual norms and conventions developed to support research are challenged, often leaving scholars with a sense of unease or lack of clarity. The growing availability of hacked data is one area. Discussions and debates around the use of these datasets in research are extremely limited. Reviews of the history, culture, or morality of the act of hacking are topics that have attracted some scholarly attention. However, how to undertake research with this data is less examined and provides an opportunity for the generation of reflexive ethical practice. This article presents a case-study outlining the ethical debates that arose when considering the use of hacked data to examine online far-right violent extremism. It argues that under certain circumstances, researchers can do ethical research with hacked data. However, to do so we must proactively and continually engage deeply with ethical quandaries and dilemmas.
comment: To be published in New Media & Society
☆ Cost Transparency of Enterprise AI Adoption
Recent advances in large language models (LLMs) have dramatically improved performance on a wide range of tasks, driving rapid enterprise adoption. Yet, the cost of adopting these AI services is understudied. Unlike traditional software licensing in which costs are predictable before usage, commercial LLM services charge per token of input text in addition to generated output tokens. Crucially, while firms can control the input, they have limited control over output tokens, which are effectively set by generation dynamics outside of business control. This research shows that subtle shifts in linguistic style can systematically alter the number of output tokens without impacting response quality. Using an experiment with OpenAI's API, this study reveals that non-polite prompts significantly increase output tokens leading to higher enterprise costs and additional revenue for OpenAI. Politeness is merely one instance of a broader phenomenon in which linguistic structure can drive unpredictable cost variation. For enterprises integrating LLM into applications, this unpredictability complicates budgeting and undermines transparency in business-to-business contexts. By demonstrating how end-user behavior links to enterprise costs through output token counts, this work highlights the opacity of current pricing models and calls for new approaches to ensure predictable and transparent adoption of LLM services.
♻ ☆ Mutual Wanting in Human--AI Interaction: Empirical Evidence from Large-Scale Analysis of GPT Model Transitions
The rapid evolution of large language models (LLMs) creates complex bidirectional expectations between users and AI systems that are poorly understood. We introduce the concept of "mutual wanting" to analyze these expectations during major model transitions. Through analysis of user comments from major AI forums and controlled experiments across multiple OpenAI models, we provide the first large-scale empirical validation of bidirectional desire dynamics in human-AI interaction. Our findings reveal that nearly half of users employ anthropomorphic language, trust significantly exceeds betrayal language, and users cluster into distinct "mutual wanting" types. We identify measurable expectation violation patterns and quantify the expectation-reality gap following major model releases. Using advanced NLP techniques including dual-algorithm topic modeling and multi-dimensional feature extraction, we develop the Mutual Wanting Alignment Framework (M-WAF) with practical applications for proactive user experience management and AI system design. These findings establish mutual wanting as a measurable phenomenon with clear implications for building more trustworthy and relationally-aware AI systems.
♻ ☆ Towards Efficient Certification of Maritime Remote Operation Centers
Additional automation being build into ships implies a shift of crew from ship to shore. However, automated ships still have to be monitored and, in some situations, controlled remotely. These tasks are carried out by human operators located in shore-based remote operation centers. In this work, we present a concept for a hazard database that supports the safeguarding and certification of such remote operation centers. The concept is based on a categorization of hazard sources which we derive from a generic functional architecture. A subsequent preliminary suitability analysis unveils which methods for hazard analysis and risk assessment can adequately fill this hazard database.
♻ ☆ Fairness for the People, by the People: Minority Collective Action
Machine learning models often preserve biases present in training data, leading to unfair treatment of certain minority groups. Despite an array of existing firm-side bias mitigation techniques, they typically incur utility costs and require organizational buy-in. Recognizing that many models rely on user-contributed data, end-users can induce fairness through the framework of Algorithmic Collective Action, where a coordinated minority group strategically relabels its own data to enhance fairness, without altering the firm's training process. We propose three practical, model-agnostic methods to approximate ideal relabeling and validate them on real-world datasets. Our findings show that a subgroup of the minority can substantially reduce unfairness with a small impact on the overall prediction error.
♻ ☆ Self-supervised Learning of Echocardiographic Video Representations via Online Cluster Distillation
Self-supervised learning (SSL) has achieved major advances in natural images and video understanding, but challenges remain in domains like echocardiography (heart ultrasound) due to subtle anatomical structures, complex temporal dynamics, and the current lack of domain-specific pre-trained models. Existing SSL approaches such as contrastive, masked modeling, and clustering-based methods struggle with high intersample similarity, sensitivity to low PSNR inputs common in ultrasound, or aggressive augmentations that distort clinically relevant features. We present DISCOVR (Distilled Image Supervision for Cross Modal Video Representation), a self-supervised dual branch framework for cardiac ultrasound video representation learning. DISCOVR combines a clustering-based video encoder that models temporal dynamics with an online image encoder that extracts fine-grained spatial semantics. These branches are connected through a semantic cluster distillation loss that transfers anatomical knowledge from the evolving image encoder to the video encoder, enabling temporally coherent representations enriched with fine-grained semantic understanding.Evaluated on six echocardiography datasets spanning fetal, pediatric, and adult populations, DISCOVR outperforms both specialized video anomaly detection methods and state-of-the-art video-SSL baselines in zero-shot and linear probing setups,achieving superior segmentation transfer and strong downstream performance on clinically relevant tasks such as LVEF prediction. Code available at: https://github.com/mdivyanshu97/DISCOVR
♻ ☆ Consumer Beware! Exploring Data Brokers' CCPA Compliance
Data brokers collect and sell the personal information of millions of individuals, often without their knowledge or consent. The California Consumer Privacy Act (CCPA) grants consumers the legal right to request access to, or deletion of, their data. To facilitate these requests, California maintains an official registry of data brokers. However, the extent to which these entities comply with the law is unclear. This paper presents the first large-scale, systematic study of CCPA compliance of all 543 officially registered data brokers. Data access requests were manually submitted to each broker, followed by in-depth analyses of their responses (or lack thereof). Above 40% failed to respond at all, in an apparent violation of the CCPA. Data brokers that responded requested personal information as part of their identity verification process, including details they had not previously collected. Paradoxically, this means that exercising one's privacy rights under CCPA introduces new privacy risks. Our findings reveal rampant non-compliance and lack of standardization of the data access request process. These issues highlight an urgent need for stronger enforcement, clearer guidelines, and standardized, periodic compliance checks to enhance consumers' privacy protections and improve data broker accountability.
comment: To appear at IEEE S&P 2026